
engine: require iterators to specify an upper bound #26872

Merged
craig bot merged 2 commits into cockroachdb:master from iter-upper-bound on Jul 6, 2018

Conversation

benesch
Contributor

@benesch benesch commented Jun 20, 2018

As promised. The issue the first commit fixes took longer than I'd like to admit to track down.

I've done my best to keep the patch as small as possible in case we want to backport to 2.0. The changes to the C++ DBIterator are unfortunately large, though. Open to less disruptive ways of plumbing the upper bound there, but I'm not seeing anything obvious.


While investigating range deletion performance issues, we realized that
large swaths of contiguous tombstones can cause massive performance
issues when seeking. Seeks of more than 15 minutes (!) were observed on
one cluster [0]. The problem is that every key overlapping the range
deletion tombstone must be loaded from disk, decompressed, and compared
with the sequence number of the tombstone until a key is found that is
not covered by the tombstone. If contiguous keys representing hundreds
of gigabytes are covered by tombstones, RocksDB will need to scan
hundreds of gigabytes of data. Needless to say, performance suffers.

We have plans to improve seek performance in the face of range deletion
tombstones upstream, but we can mitigate the issue locally, too.
Iteration outside of tests is nearly always in the context of a range or
some bounded span of local keys. By plumbing knowledge of the upper
bound we care about for each scan to RocksDB, we can invalidate the
iterator early, once the upper bound has been exceeded, rather than
scanning over potentially hundreds of gigabytes of deleted keys just to
find a key that we don't care about.

To ensure we don't forget to specify an upper bound in a critical path,
this commit requires that all iterators are either prefix iterators or
declare their upper bound. This makes iterating in tests rather irksome,
but I think the tradeoff is worthwhile.

[0]: #24029 (comment)
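
For a sense of what this looks like at a call site, here is a minimal sketch — not code from this PR; the helper name and loop shape are assumptions, and it presumes the `pkg/storage/engine` interfaces:

```go
// boundedScan sketches iteration under the new rule: every non-prefix
// iterator declares an UpperBound, which is plumbed down to RocksDB so the
// iterator is invalidated at `end` instead of scanning across deleted keys.
func boundedScan(eng engine.Reader, start, end roachpb.Key) error {
	iter := eng.NewIterator(engine.IterOptions{UpperBound: end})
	defer iter.Close()

	for iter.Seek(engine.MakeMVCCMetadataKey(start)); ; iter.Next() {
		if ok, err := iter.Valid(); err != nil {
			return err
		} else if !ok {
			break // exhausted, or the upper bound was reached
		}
		// ... process iter.Key() / iter.Value() ...
	}
	return nil
}
```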

@benesch benesch requested review from bdarnell, tbg, petermattis and a team June 20, 2018 21:32
@cockroach-teamcity
Member

This change is Reviewable

@benesch benesch force-pushed the iter-upper-bound branch 3 times, most recently from f9ba809 to a0a2e7a on June 21, 2018 05:45
@petermattis
Collaborator

I'll have to take a closer look at the change to BaseDeltaIterator. I don't have a suggestion for how to avoid changing all of the call sites like you did. I suppose you could add an engine.UnlimitedIterOptions global, but I'm not sure if that is any clearer or more concise than using IterOptions{UpperBound: roachpb.KeyMax} as you've done.
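
For reference, the two spellings being weighed — the global is hypothetical (it was not added); the PR uses the explicit literal:

```go
// Hypothetical convenience global, as floated above; not added by this PR.
var UnlimitedIterOptions = IterOptions{UpperBound: roachpb.KeyMax}

func exampleUnbounded(e Reader) Iterator {
	// Equivalent to the PR's explicit IterOptions{UpperBound: roachpb.KeyMax}
	// at call sites (mostly tests) that genuinely want an unbounded scan.
	return e.NewIterator(UnlimitedIterOptions)
}
```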


Reviewed 2 of 2 files at r1.


c-deps/libroach/batch.cc, line 116 at r2 (raw file):

// The prefix_same_as_start and upper_bound settings must also be applied to
// base_iterator for correct operation.
class BaseDeltaIterator : public rocksdb::Iterator {

Unrelated to this change, but I've recently realized that a WriteBatchWithIndex is really just another layer of memtable and all of the BaseDeltaIterator crap could be folded into MergingIterator and DBIter. Too large a refactor at this point.


pkg/storage/engine/rocksdb.go, line 1915 at r2 (raw file):

// MaxTimestamp fields in IterOptions. Separating normal and time-bound
// iterators is silly: any iterator, even prefix iterators, can become
// time-bound by plumbing the right options through to RocksDB.

Looks like you didn't plumb iter-upper-bound into NewTimeBoundIterator.


pkg/storage/engine/rocksdb.go, line 1931 at r2 (raw file):

func (r *rocksDBIterator) setUpperBound(key roachpb.Key) {
	C.DBIterSetUpperBound(r.iter, goToCKey(MakeMVCCMetadataKey(key)))

Should you be checking here that len(key) != 0 similar to the check in rocksDBIterator.init?



@petermattis
Collaborator

Btw, should we be plumbing LowerBound for reverse scans?



@bdarnell
Contributor



pkg/storage/engine/engine.go, line 138 at r2 (raw file):

	// prefix will not work correctly (keys may be skipped).
	Prefix bool
	// If UpperBound is non-nil, the iterator will become invalid after seeking

The PR description says this field is required (and it looks like you've added it everywhere), but this comment says that nil makes it unbounded. One of those needs to be updated (I'd prefer to just disallow nil)



@benesch
Contributor Author

benesch commented Jun 26, 2018

Btw, should we be plumbing LowerBound for reverse scans?

Probably, but there's only one place, stateloader.LoadLastIndex, where we use reverse scans. (Is that really true? Huh.) It's in the local keyspace, so it's unlikely to have any problems with range deletion tombstones. Mind if I leave that to another PR? This one is getting really large.




c-deps/libroach/batch.cc, line 116 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Unrelated to this change, but I've recently realized that a WriteBatchWithIndex is really just another layer of memtable and all of the BaseDeltaIterator crap could be folded into MergingIterator and DBIter. Too large a refactor at this point.

Ack.


pkg/storage/engine/engine.go, line 138 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The PR description says this field is required (and it looks like you've added it everywhere), but this comment says that nil makes it unbounded. One of those needs to be updated (I'd prefer to just disallow nil)

Well, it's not required if Prefix is specified, in which case a nil UpperBound does mean unbounded. Suggestions for better wording?
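
To pin down the rule under discussion, a hedged sketch of the invariant as a validation check (the helper is hypothetical, not code from this PR):

```go
// checkIterOptions encodes the requirement: an iterator is either a prefix
// iterator, or it must declare an upper bound.
func checkIterOptions(opts IterOptions) error {
	if !opts.Prefix && len(opts.UpperBound) == 0 {
		return errors.New("non-prefix iterators must specify an upper bound")
	}
	return nil
}
```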


pkg/storage/engine/rocksdb.go, line 1915 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Looks like you didn't plumb iter-upper-bound into NewTimeBoundIterator.

Ok, done. Time bound iterators are now just normal iterators with the MinTimestampHint and MaxTimestampHint options set.
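
A sketch of the unified construction, using the option names mentioned here; the hlc.Timestamp field type and helper are assumptions:

```go
// newTimeBoundIter shows a time-bound iterator as just an ordinary iterator
// whose options carry timestamp hints (sketch, not code from this PR).
func newTimeBoundIter(eng engine.Reader, end roachpb.Key, minTS, maxTS hlc.Timestamp) engine.Iterator {
	return eng.NewIterator(engine.IterOptions{
		UpperBound:       end,
		MinTimestampHint: minTS,
		MaxTimestampHint: maxTS,
	})
}
```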


pkg/storage/engine/rocksdb.go, line 1931 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Should you be checking here that len(key) != 0 similar to the check in rocksDBIterator.init?

Yes, thanks. Done.
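
For concreteness, a sketch of the guarded setter, mirroring the len(key) != 0 check in rocksDBIterator.init; the early-return shape is an assumption:

```go
func (r *rocksDBIterator) setUpperBound(key roachpb.Key) {
	if len(key) == 0 {
		// Match rocksDBIterator.init: don't install a zero-length bound.
		return
	}
	C.DBIterSetUpperBound(r.iter, goToCKey(MakeMVCCMetadataKey(key)))
}
```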



@bdarnell
Contributor

Probably, but there's only one place, stateloader.LoadLastIndex, where we use reverse scans. (Is that really true? Huh.)

And batcheval/cmd_reverse_scan.go




pkg/storage/engine/engine.go, line 138 at r2 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Well, it's not required if Prefix is specified, in which case a nil UpperBound does mean unbounded. Suggestions for better wording?

I'd change this comment then to "UpperBound gives this iterator an upper bound. Attempts to seek past this point will invalidate the iterator. UpperBound must be provided unless Prefix is true, in which case the end of the prefix will be used as the upper bound."



@petermattis
Collaborator

Per Ben's comment, MVCCScan(..., reverse=true,...) (libroach/mvcc.cc) uses iter->Prev().




c-deps/libroach/iterator.h, line 38 at r3 (raw file):

  rocksdb::Slice upper_bound;
  std::string min_timestamp_hint;
  std::string max_timestamp_hint;

Why did you need to add {min,max}_timestamp_hint here? I'm not seeing where they are used.



@benesch
Contributor Author

benesch commented Jun 27, 2018

Ah! Gotcha. OK to leave to another PR?




c-deps/libroach/iterator.h, line 38 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Why did you need to add {min,max}_timestamp_hint here? I'm not seeing where they are used.

Oops. They were a vestige of my first try at this, which allowed updating the timestamp bounds on an already-created iterator. Turns out this fails badly: once an SST is filtered out by a table_filter, it is gone for good. You really need a fresh iterator when you change the timestamp bounds.


pkg/storage/engine/engine.go, line 138 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd change this comment then to "UpperBound gives this iterator an upper bound. Attempts to seek past this point will invalidate the iterator. UpperBound must be provided unless Prefix is true, in which case the end of the prefix will be used as the upper bound.

Done.



@petermattis
Collaborator

:lgtm:

Ah! Gotcha. OK to leave to another PR?

Yes





craig bot pushed a commit that referenced this pull request Jul 5, 2018
27061: storage: improve safety of merge lock r=tschottdorf,bdarnell a=benesch

Fix a known correctness bug in the merge locking scheme. See the comment
updates within for details.

As an added benefit, this fix frees store.RemoveReplica of conditionals
to deal with uninitialized replicas. Acquiring the merge lock now always
results in an initialized right-hand replica, so store.MergeRange is
always removing an initialized right-hand replica.

There are no tests for this because testing this particular edge case is
quite difficult and I'm primarily interested in avoiding the obviously
wrong call to batch.ClearRange(nil, nil) that occurs when calling
store.RemoveReplica on an uninitialized replica. This nonsensical
ClearRange is, somehow, not currently an error, but it is preventing
#26872 ("engine: require iterators to specify an upper bound") from
landing.

Release note: None

---

P.S. The `storage.RemoveOptions` struct now bundles precisely one parameter. Let me know if you'd prefer I change back the signature of `store.removeReplicaInternal` to take the bool directly. Bool parameters are frowned upon, though, so I wasn't sure what was preferred.

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>

benesch added 2 commits

In-memory unit tests typically run with a RocksDB cache size of between
1 and 10MB. Our cache portioning logic took this to mean that the entire
cache should be used for the write buffer (i.e., one memtable), and so
allocated zero bytes for the block cache. This was fine for most tests,
since an in-memory cache for an in-memory RocksDB instance is
unnecessary.

TestStoreMetrics, however, was living on the edge. It asserts that the
block cache has some minimum number of cache hits over the duration of
the test. A future commit that optimizes iteration caused the hit rate
to drop to zero, failing the test. (Aside: it is surprising that the hit
rate of a cache of capacity zero could ever be non-zero.)

Guarantee at least 1MB is allocated to the block cache so tests can
safely make assertions about the number of block cache hits. Also bump
the minimum number of cache hits in TestStoreMetrics back up to its
original value of ten; 8b890e6 decreased it to four in an effort to make
the test less flaky, but I'm now reliably seeing more than forty cache
hits.

Release note: None
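
A hedged sketch of the floor described above; the helper and the surrounding portioning logic are assumptions, not the actual code:

```go
const minBlockCacheSize = 1 << 20 // 1MB

// blockCacheSize guarantees the block cache never drops below 1MB, however
// the rest of the cache is portioned, so tests can meaningfully assert on
// block cache hit counts.
func blockCacheSize(totalCacheSize, writeBufferSize int64) int64 {
	if size := totalCacheSize - writeBufferSize; size > minBlockCacheSize {
		return size
	}
	return minBlockCacheSize
}
```
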
While investigating range deletion performance issues, we realized that
large swaths of contiguous tombstones can cause massive performance
issues when seeking. Seeks of more than 15 minutes (!) were observed on
one cluster [0]. The problem is that every key overlapping the range
deletion tombstone must be loaded from disk, decompressed, and compared
with the sequence number of the tombstone until a key is found that is
not covered by the tombstone. If contiguous keys representing hundreds
of gigabytes are covered by tombstones, RocksDB will need to scan
hundreds of gigabytes of data. Needless to say, performance suffers.

We have plans to improve seek performance in the face of range deletion
tombstones upstream, but we can mitigate the issue locally, too.
Iteration outside of tests is nearly always in the context of a range or
some bounded span of local keys. By plumbing knowledge of the upper
bound we care about for each scan to RocksDB, we can invalidate the
iterator early, once the upper bound has been exceeded, rather than
scanning over potentially hundreds of gigabytes of deleted keys just to
find a key that we don't care about.

To ensure we don't forget to specify an upper bound in a critical path,
this commit requires that all iterators are either prefix iterators or
declare their upper bound. This makes iterating in tests rather irksome,
but I think the tradeoff is worthwhile.

[0]: cockroachdb#24029 (comment)

Release note: None
@benesch
Contributor Author

benesch commented Jul 6, 2018

bors r=petermattis,bdarnell

craig bot pushed a commit that referenced this pull request Jul 6, 2018
26789: Makefile: don't globally install binaries during lint r=BramGruneir a=benesch

The build system takes pains to scope all binaries it installs to
REPO/bin to avoid polluting the user's GOPATH/bin with binaries that are
internal to the Cockroach build system.

The lint target was violating this rule by running 'go install
./pkg/...', which installed every package in the repository into
GOPATH/bin. This commit adjusts the rule to run 'go build ./pkg/...'
instead, which installs the necessary .a files into GOPATH/pkg without
installing any binaries into GOPATH/bin.

Fix #26633.

Release note: None

26872: engine: require iterators to specify an upper bound r=petermattis,bdarnell a=benesch


Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
@craig
Contributor

craig bot commented Jul 6, 2018

Build succeeded

@craig craig bot merged commit 0622016 into cockroachdb:master Jul 6, 2018
@benesch benesch deleted the iter-upper-bound branch July 10, 2018 14:30
benesch added a commit to benesch/cockroach that referenced this pull request Oct 31, 2018
In cockroachdb#26872 it was mandated that all iterators either be prefix iterators
or explicitly set an upper bound. This vastly improved performance with
range tombstones.

For completeness, we need to specify lower bounds on iterators that scan
in reverse. (This was left out of cockroachdb#26872 so that it could merge ASAP.)
This commit adds the necessary plumbing, and applies appropriate lower
bounds in the two code paths that perform reverse iteration.

Release note: None
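
A sketch of reverse iteration under the new plumbing, assuming a LowerBound option symmetric to UpperBound and a SeekReverse entry point; the helper is hypothetical:

```go
// reverseScan sketches the symmetric case: LowerBound lets RocksDB
// invalidate the iterator once Prev() crosses `start`, just as UpperBound
// bounds forward scans.
func reverseScan(eng engine.Reader, start, end roachpb.Key) error {
	iter := eng.NewIterator(engine.IterOptions{LowerBound: start, UpperBound: end})
	defer iter.Close()

	for iter.SeekReverse(engine.MakeMVCCMetadataKey(end)); ; iter.Prev() {
		if ok, err := iter.Valid(); err != nil {
			return err
		} else if !ok {
			break
		}
		// ... process keys in descending order ...
	}
	return nil
}
```
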
craig bot pushed a commit that referenced this pull request Nov 1, 2018
32072: storage/engine: allow specifying lower bounds r=petermattis a=benesch

**storage/engine: allow specifying lower bounds**

In #26872 it was mandated that all iterators either be prefix iterators
or explicitly set an upper bound. This vastly improved performance with
range tombstones.

For completeness, we need to specify lower bounds on iterators that scan
in reverse. (This was left out of #26872 so that it could merge ASAP.)
This commit adds the necessary plumbing, and applies appropriate lower
bounds in the two code paths that perform reverse iteration.

---

**storage: improve layering of mvcc scan/iterate functions**

mvccScanInternal previously took both a Reader and an Iterator. If the
Iterator was nil, it would use the Reader to create itself an Iterator;
otherwise, it would use Iterator and entirely ignore Reader.

This was impressively confusing. Just teach all of mvccScanInternal's
callers (there are only five) to create an iterator if they don't have
one already.

We could further reduce duplication by collapsing all scan variants
(MVCCScanToBytes, MVCCReverseScan) into one function, MVCCScan, that
takes a ScanOptions struct, but that cleanup is left to a future commit.
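
The caller-side pattern, sketched (the helper name is hypothetical):

```go
// scanViaReader shows the layering fix: a caller that lacks an iterator
// creates and closes its own, instead of passing (Reader, nil) and letting
// the internals conjure one.
func scanViaReader(reader engine.Reader, start, end roachpb.Key) {
	iter := reader.NewIterator(engine.IterOptions{UpperBound: end})
	defer iter.Close()
	// ... hand iter to the internal scan routine; the Reader is no longer
	// needed once the iterator exists ...
}
```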

---

@jordanlewis this should improve your life when you get back to #31909.

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>