engine: require iterators to specify an upper bound #26872
Conversation
Force-pushed from f9ba809 to a0a2e7a.
I'll have to take a closer look at the change to … Reviewed 2 of 2 files at r1.

c-deps/libroach/batch.cc, line 116 at r2 (raw file): Unrelated to this change, but I've recently realized that a …

pkg/storage/engine/rocksdb.go, line 1915 at r2 (raw file): Looks like you didn't plumb iter-upper-bound into …

pkg/storage/engine/rocksdb.go, line 1931 at r2 (raw file): Should you be checking here that …
Btw, should we be plumbing …
pkg/storage/engine/engine.go, line 138 at r2 (raw file): The PR description says this field is required (and it looks like you've added it everywhere), but this comment says that nil makes it unbounded. One of those needs to be updated (I'd prefer to just disallow nil).
Force-pushed from a0a2e7a to b811855.
Probably, but there's only one place, stateloader.LoadLastIndex, that we use reverse scans. (Is that really true? Huh.) It's in the local keyspace, so it's unlikely to have any problems with range deletion tombstones. Mind if I leave that to another PR? This one is getting real large.

c-deps/libroach/batch.cc, line 116 at r2 (raw file): Previously, petermattis (Peter Mattis) wrote… Ack.

pkg/storage/engine/engine.go, line 138 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote… Well, it's not required if Prefix is specified, in which case a nil UpperBound does mean unbounded. Suggestions for better wording?

pkg/storage/engine/rocksdb.go, line 1915 at r2 (raw file): Previously, petermattis (Peter Mattis) wrote… OK, done. Time-bound iterators are now just normal iterators with the MinTimestampHint and MaxTimestampHint options set.

pkg/storage/engine/rocksdb.go, line 1931 at r2 (raw file): Previously, petermattis (Peter Mattis) wrote… Yes, thanks. Done.
And …

pkg/storage/engine/engine.go, line 138 at r2 (raw file): Previously, benesch (Nikhil Benesch) wrote… I'd change this comment then to: "UpperBound gives this iterator an upper bound. Attempts to seek past this point will invalidate the iterator. UpperBound must be provided unless Prefix is true, in which case the end of the prefix will be used as the upper bound."
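For illustration, a minimal Go sketch of the IterOptions shape this thread is converging on, folding in Ben's suggested comment and the timestamp hints from the rocksdb.go thread above; the field types and exact layout are assumptions, not the PR's actual code:

```go
package engine

// IterOptions as discussed at pkg/storage/engine/engine.go line 138.
// Field types are assumptions for illustration.
type IterOptions struct {
	// Prefix indicates that the iterator will only be used to iterate over
	// keys sharing a prefix with the seek key, which bounds it implicitly.
	Prefix bool
	// UpperBound gives this iterator an upper bound. Attempts to seek past
	// this point will invalidate the iterator. UpperBound must be provided
	// unless Prefix is true, in which case the end of the prefix will be
	// used as the upper bound.
	UpperBound []byte
	// MinTimestampHint and MaxTimestampHint turn a normal iterator into a
	// time-bound iterator, per the rocksdb.go line 1915 thread above.
	MinTimestampHint, MaxTimestampHint int64
}
```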
Per Ben's comment, …

c-deps/libroach/iterator.h, line 38 at r3 (raw file): Why did you need to add …
Force-pushed from b811855 to ab3e5fb.
Ah! Gotcha. OK to leave to another PR?

c-deps/libroach/iterator.h, line 38 at r3 (raw file): Previously, petermattis (Peter Mattis) wrote… Oops. They were a vestige of my first try at this, which allowed updating the timestamp bounds on an already-created iterator. Turns out this fails badly: once an SST is filtered out by a table_filter, it is gone for good. You really need a fresh iterator when you change the timestamp bounds.

pkg/storage/engine/engine.go, line 138 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote… Done.
Yes.
27061: storage: improve safety of merge lock r=tschottdorf,bdarnell a=benesch

Fix a known correctness bug in the merge locking scheme. See the comment updates within for details.

As an added benefit, this fix frees store.RemoveReplica of conditionals to deal with uninitialized replicas. Acquiring the merge lock now always results in an initialized right-hand replica, so store.MergeRange is always removing an initialized right-hand replica.

There are no tests for this because testing this particular edge case is quite difficult, and I'm primarily interested in avoiding the obviously wrong call to batch.ClearRange(nil, nil) that occurs when calling store.RemoveReplica on an uninitialized replica. This nonsensical ClearRange is, somehow, not currently an error, but it is preventing #26872 ("engine: require iterators to specify an upper bound") from landing.

Release note: None

---

P.S. The `storage.RemoveOptions` struct now bundles precisely one parameter. Let me know if you'd prefer I change back the signature of `store.removeReplicaInternal` to take the bool directly. Bool parameters are frowned upon, though, so I wasn't sure what was preferred.

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
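As a side note, here is a minimal sketch of the options-struct pattern the P.S. refers to; the DestroyData field name and the stand-in types are assumptions for illustration, not the PR's exact code:

```go
package storage

// Store and Replica are stand-ins so the sketch compiles; the real types
// live elsewhere in the package.
type Store struct{}
type Replica struct{}

// RemoveOptions bundles the single flag that used to be a bare bool
// parameter. The field name is an assumption for illustration.
type RemoveOptions struct {
	DestroyData bool
}

// removeReplicaInternal sketches why the struct is preferred: the call site
//
//	s.removeReplicaInternal(rep, RemoveOptions{DestroyData: true})
//
// names the flag it sets, whereas a bare `true` argument says nothing.
func (s *Store) removeReplicaInternal(rep *Replica, opts RemoveOptions) {
	if opts.DestroyData {
		// ... clear the replica's on-disk data ...
	}
}
```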
In-memory unit tests typically run with a RocksDB cache size of between 1 and 10MB. Our cache portioning logic took this to mean that the entire cache should be used for the write buffer (i.e., one memtable), and so allocated zero bytes for the block cache. This was fine for most tests, since an in-memory cache for an in-memory RocksDB instance is unnecessary.

TestStoreMetrics, however, was living on the edge. It asserts that the block cache has some minimum number of cache hits over the duration of the test. A future commit that optimizes iteration caused the hit rate to drop to zero, failing the test. (Aside: it is surprising that the hit rate of a cache of capacity zero could ever be non-zero.)

Guarantee at least 1MB is allocated to the block cache so tests can safely make assertions about the number of block cache hits. Also bump the minimum number of cache hits in TestStoreMetrics back up to its original value of ten; 8b890e6 decreased it to four in an effort to make the test less flaky, but I'm now reliably seeing more than forty cache hits.

Release note: None
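A minimal sketch of the floor this commit describes, assuming one total budget split between the write buffer and the block cache; the function and the split policy are illustrative, not CockroachDB's actual portioning code:

```go
package storage

const minBlockCacheSize = 1 << 20 // the 1MB floor described above

// portionCache splits a total cache budget between the write buffer
// (memtable) and the block cache. The split policy here is illustrative;
// the point is only the floor: the block cache never drops below
// minBlockCacheSize, even for the 1-10MB budgets typical of in-memory
// unit tests.
func portionCache(totalBytes int64) (writeBufferBytes, blockCacheBytes int64) {
	writeBufferBytes = totalBytes // old behavior: everything to the memtable
	blockCacheBytes = totalBytes - writeBufferBytes
	if blockCacheBytes < minBlockCacheSize {
		blockCacheBytes = minBlockCacheSize
		writeBufferBytes = totalBytes - blockCacheBytes
	}
	return writeBufferBytes, blockCacheBytes
}
```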
While investigating range deletion performance issues, we realized that large swaths of contiguous tombstones can cause massive performance issues when seeking. Seeks of more than 15 minutes (!) were observed on one cluster [0]. The problem is that every key overlapping the range deletion tombstone must be loaded from disk, decompressed, and compared with the sequence number of the tombstone until a key is found that is not covered by the tombstone. If contiguous keys representing hundreds of gigabytes are covered by tombstones, RocksDB will need to scan hundreds of gigabytes of data. Needless to say, performance suffers.

We have plans to improve seek performance in the face of range deletion tombstones upstream, but we can mitigate the issue locally, too. Iteration outside of tests is nearly always in the context of a range or some bounded span of local keys. By plumbing knowledge of the upper bound we care about for each scan to RocksDB, we can invalidate the iterator early, once the upper bound has been exceeded, rather than scanning over potentially hundreds of gigabytes of deleted keys just to find a key that we don't care about.

To ensure we don't forget to specify an upper bound in a critical path, this commit requires that all iterators are either prefix iterators or declare their upper bound. This makes iterating in tests rather irksome, but I think the tradeoff is worthwhile.

[0]: cockroachdb#24029 (comment)

Release note: None
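To make the enforced convention concrete, here is a minimal sketch of a bounded scan; the Reader/Iterator interfaces and method names are assumptions for illustration, not the exact engine API:

```go
package engine

// Minimal stand-ins for the engine API; interface and method names are
// assumptions, not the exact CockroachDB signatures.
type IterOptions struct {
	Prefix     bool
	UpperBound []byte // required whenever Prefix is false
}

type Iterator interface {
	Seek(key []byte)
	Valid() bool
	Next()
	Key() []byte
	Close()
}

type Reader interface {
	NewIterator(opts IterOptions) Iterator
}

// scanSpan shows the convention: a non-prefix iterator always declares the
// end of the span it cares about, so the engine can invalidate it at the
// bound instead of grinding through tombstone-covered keys beyond it.
func scanSpan(r Reader, start, end []byte, fn func(key []byte)) {
	it := r.NewIterator(IterOptions{UpperBound: end})
	defer it.Close()
	for it.Seek(start); it.Valid(); it.Next() {
		fn(it.Key())
	}
}
```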
bors r=petermattis,bdarnell
26789: Makefile: don't globally install binaries during lint r=BramGruneir a=benesch

The build system takes pains to scope all binaries it installs to REPO/bin to avoid polluting the user's GOPATH/bin with binaries that are internal to the Cockroach build system. The lint target was violating this rule by running 'go install ./pkg/...', which installed every package in the repository into GOPATH/bin. This commit adjusts the rule to run 'go build ./pkg/...' instead, which installs the necessary .a files into GOPATH/pkg without installing any binaries into GOPATH/bin.

Fix #26633.

Release note: None

26872: engine: require iterators to specify an upper bound r=petermattis,bdarnell a=benesch

As promised. The issue the first commit fixes took longer than I'd like to admit to track down.

I've done my best to keep the patch as small as possible in case we want to backport to 2.0. The changes to the C++ DBIterator are unfortunately large, though. Open to less disruptive ways of plumbing the upper bound there, but I'm not seeing anything obvious.

---

While investigating range deletion performance issues, we realized that large swaths of contiguous tombstones can cause massive performance issues when seeking. Seeks of more than 15 minutes (!) were observed on one cluster [0]. The problem is that every key overlapping the range deletion tombstone must be loaded from disk, decompressed, and compared with the sequence number of the tombstone until a key is found that is not covered by the tombstone. If contiguous keys representing hundreds of gigabytes are covered by tombstones, RocksDB will need to scan hundreds of gigabytes of data. Needless to say, performance suffers.

We have plans to improve seek performance in the face of range deletion tombstones upstream, but we can mitigate the issue locally, too. Iteration outside of tests is nearly always in the context of a range or some bounded span of local keys. By plumbing knowledge of the upper bound we care about for each scan to RocksDB, we can invalidate the iterator early, once the upper bound has been exceeded, rather than scanning over potentially hundreds of gigabytes of deleted keys just to find a key that we don't care about.

To ensure we don't forget to specify an upper bound in a critical path, this commit requires that all iterators are either prefix iterators or declare their upper bound. This makes iterating in tests rather irksome, but I think the tradeoff is worthwhile.

[0]: #24029 (comment)

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Build succeeded.
32072: storage/engine: allow specifying lower bounds r=petermattis a=benesch

**storage/engine: allow specifying lower bounds**

In #26872 it was mandated that all iterators be either prefix iterators or explicitly set an upper bound. This vastly improved performance with range tombstones. For completeness, we need to specify lower bounds on iterators that scan in reverse. (This was left out of #26872 so that it could merge ASAP.) This commit adds the necessary plumbing, and applies appropriate lower bounds in the two code paths that perform reverse iteration.

---

**storage: improve layering of mvcc scan/iterate functions**

mvccScanInternal previously took both a Reader and an Iterator. If the Iterator was nil, it would use the Reader to create itself an Iterator; otherwise, it would use Iterator and entirely ignore Reader. This was impressively confusing. Just teach all of mvccScanInternal's callers (there are only five) to create an iterator if they don't have one already.

We could further reduce duplication by collapsing all scan variants (MVCCScanToBytes, MVCCReverseScan) into one function, MVCCScan, that takes a ScanOptions struct, but that cleanup is left to a future commit.

---

@jordanlewis this should improve your life when you get back to #31909.

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
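A companion sketch to the bounded forward scan above, again with illustrative stand-ins; LowerBound, SeekReverse, and Prev are assumptions about the API shape, not its exact names:

```go
package engine

// Illustrative shape only: IterOptions gains a LowerBound mirroring
// UpperBound, and the iterator gains reverse positioning. All names are
// assumptions, not the exact engine API.
type IterOptions struct {
	LowerBound []byte // bounds reverse scans, per this PR
	UpperBound []byte // bounds forward scans, per #26872
}

type Iterator interface {
	SeekReverse(key []byte) // position at the last key preceding key
	Valid() bool
	Prev()
	Key() []byte
	Close()
}

type Reader interface {
	NewIterator(opts IterOptions) Iterator
}

// reverseScanSpan declares its lower bound so backward iteration is
// invalidated once it walks past the start of the span, rather than
// scanning deleted keys below it.
func reverseScanSpan(r Reader, start, end []byte, fn func(key []byte)) {
	it := r.NewIterator(IterOptions{LowerBound: start, UpperBound: end})
	defer it.Close()
	for it.SeekReverse(end); it.Valid(); it.Prev() {
		fn(it.Key())
	}
}
```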