sstable: optimize seeks to use next #860
Conversation
The first commit is from #858
Reviewable status: 0 of 8 files reviewed, all discussions resolved (waiting on @itsbilal, @jbowens, and @petermattis)
This now passes the metamorphic test running for 2 hours. I am going to clean this up to add the optimization for SeekLT too.
I looked into existing workloads/microbenchmarks that may show improvement:
- pebble ycsb does not set bounds so won't benefit
- BenchmarkMVCCScan uses an iterator for a single scan (it is not reused), and consecutive scans are randomly positioned with respect to each other, so it won't benefit either.
Suggestions on what to try are welcome.
For my geospatial query, there is ~12% improvement with this optimization.
One issue: to make use of this optimization we will need to change pebble_iterator.go to stop using lowerBoundBuf and upperBoundBuf:
https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble_iterator.go#L145-L159
The reuse of these buffers defeats this optimization: it changes the bounds from under all the iterators (levelIters/singleLevelIterators), since they share the same slice. So when singleLevelIterator.SetBounds does the bounds comparison, it no longer has the actual previous bounds, since the slice has been changed. Changing these bounds slices from under these iterators is a dangerous thing anyway -- it just happens to work now. But stopping the reuse will incur two memory allocations each time we reuse the iterator. Thoughts?
Reviewable status: 0 of 9 files reviewed, all discussions resolved (waiting on @itsbilal, @jbowens, and @petermattis)
For my geospatial query, there is ~12% improvement with this optimization.
I compared Pebble master with RocksDB and then with Pebble with both these commits. These are just eyeball numbers: Pebble and RocksDB are both similar ~83ms. With these changes Pebble is ~66ms, so a significant improvement.
Reviewable status: 0 of 9 files reviewed, all discussions resolved (waiting on @itsbilal, @jbowens, and @petermattis)
One issue is that to make use of this optimization we will need to change pebble_iterator.go to stop using lowerBoundBuf, upperBoundBuf
This is because SetBounds() doesn't make a copy of its inputs, right? I wonder if that is problematic.
I also want to understand our SetBounds() calls in CRDB better. The geospatial query is a single BatchRequest with thousands of Scans. That BatchRequest will be subdivided along range boundaries and the same iterator will be reused for all of the Scan requests. We should only have to call SetBounds once for the BatchRequest, not for each Scan. I suppose the abstractions don't make this easy. This is from memory, so it is worth verifying that I'm not misremembering.
I compared Pebble master with RocksDB and then with Pebble with both these commits. These are just eyeball numbers: Pebble and RocksDB are both similar ~83ms. With these changes Pebble is ~66ms, so a significant improvement.
Very nice.
Reviewable status: 0 of 9 files reviewed, all discussions resolved (waiting on @itsbilal, @jbowens, and @petermattis)
This is because SetBounds() doesn't make a copy of its inputs, right? I wonder if that is problematic.
Yes, it doesn't make a copy, and that would also fix it. But the memory allocation will remain, since SetBounds can't reuse a "buffer" slice for the copy: that would result in the same issue -- singleLevelIter.{lower,upper} would change before singleLevelIter.SetBounds was called.
I also want to understand our SetBounds() calls in CRDB better. The geospatial query is a single BatchRequest with thousands of Scans. That BatchRequest will be subdivided along range boundaries and the same iterator will be reused for all of the Scan requests. We should only have to call SetBounds once for the BatchRequest, not for each Scan.
Each ScanRequest in the batch has a Key and EndKey, and those bounds are used for the iterator -- calling SetBounds once with the spanning union of the spans would make the MVCC code more complicated, since it would need to keep track of stepping out of bounds.
Reviewable status: 0 of 9 files reviewed, all discussions resolved (waiting on @itsbilal and @jbowens)
Each ScanRequest in the batch has a Key, EndKey, and those bounds are used for the iterator -- calling SetBounds once with the spanning union of the spans would make the MVCC code more complicated since it would need to keep track of stepping out of bounds.
I was thinking we could call SetBounds once with the bounds of the Range. For historical context: initially we didn't set an upper or lower bound for scans, but then we ran into problems in RocksDB where we'd iterate outside of a range's boundaries and into a region of tombstones. Of course, if we set the bounds to the range boundaries, then we'd need to check the scan against EndKey, so perhaps we're not gaining anything by avoiding SetBounds calls on each Scan request.
Reviewable status: 0 of 9 files reviewed, all discussions resolved (waiting on @itsbilal and @jbowens)
Force-pushed from 94c5540 to b476cb5.
This PR is cleaned up and ready for a proper review. The microbenchmark results are included.
I was thinking we could call SetBounds once with the bounds of the Range. For historical context, initially we didn't set an upper or lower bound for scans, but then we ran into problems in RocksDB where we'd iterate outside of a range's boundaries and into region of tombstones. Of course, if we set the bounds to the range boundaries, then we'd need to check the scan against EndKey, so perhaps we're not gaining anything with avoiding SetBounds calls on each Scan request.
Interesting point about checking in the caller against the scan bounds. It pushes more work into the caller, but may be more efficient since we would compare once, versus each singleLevelIterator doing the comparison. Though the heap may also have more iterators in this setting so we may give back those potential gains. I'd like to delay exploring this option since there are more possible tradeoffs. Also, we'd still want to get the benefit of optimizing consecutive seeks that stay at the same block, and the current bounds approach gives us an easier way to assert an invariant.
Reviewable status: 0 of 10 files reviewed, all discussions resolved (waiting on @itsbilal and @jbowens)
Reviewed 4 of 6 files at r1.
Reviewable status: 4 of 10 files reviewed, 1 unresolved discussion (waiting on @itsbilal, @jbowens, and @sumeerbhola)
level_iter.go, line 76 at r1 (raw file):
newIters tableNewIters
// When rangeDelIter != nil, the caller requires that a range del iterator
// corresponding to the current file be placed in *rangeDelIter. When this
These first two sentences took me a few tries to understand.
Is the first sentence saying that if rangeDelIter != nil, it must point to a range del iterator corresponding to the current file?
This optimization is used when the seek would position at the current block of the iterator. It relies on having bounds set on the iterator, since that is how we can claim certain invariants about the current iterator position. Ascending or descending bounds are how CockroachDB executes batched scans, and the SeqSeekGEWithBounds microbenchmarks, which show significant improvement, imitate this pattern.

name                                                old time/op  new time/op  delta
MergingIterSeekGE/restart=16/count=1-16             841ns ± 5%   806ns ± 3%   -4.12%   (p=0.014 n=8+10)
MergingIterSeekGE/restart=16/count=2-16             1.78µs ± 9%  1.67µs ± 4%  -6.21%   (p=0.001 n=9+10)
MergingIterSeekGE/restart=16/count=3-16             2.60µs ±19%  2.46µs ± 2%  ~        (p=0.529 n=10+10)
MergingIterSeekGE/restart=16/count=4-16             3.44µs ±10%  3.34µs ± 4%  ~        (p=0.123 n=10+10)
MergingIterSeekGE/restart=16/count=5-16             4.35µs ± 5%  4.35µs ± 4%  ~        (p=0.825 n=10+9)
MergingIterNext/restart=16/count=1-16               37.7ns ± 4%  37.2ns ± 1%  -1.29%   (p=0.008 n=9+10)
MergingIterNext/restart=16/count=2-16               60.7ns ± 6%  59.6ns ± 1%  ~        (p=0.209 n=10+10)
MergingIterNext/restart=16/count=3-16               77.4ns ± 3%  75.8ns ± 1%  -2.10%   (p=0.036 n=10+9)
MergingIterNext/restart=16/count=4-16               86.9ns ± 2%  87.7ns ± 1%  +0.91%   (p=0.007 n=9+10)
MergingIterNext/restart=16/count=5-16               104ns ± 2%   103ns ± 2%   ~        (p=0.907 n=10+9)
MergingIterPrev/restart=16/count=1-16               52.2ns ± 3%   51.6ns ± 2%  ~        (p=0.118 n=10+10)
MergingIterPrev/restart=16/count=2-16               76.5ns ± 2%   75.8ns ± 2%  ~        (p=0.130 n=9+9)
MergingIterPrev/restart=16/count=3-16               94.3ns ± 2%   92.2ns ± 2%  -2.22%   (p=0.001 n=10+10)
MergingIterPrev/restart=16/count=4-16               103ns ± 1%    102ns ± 2%   -0.87%   (p=0.038 n=10+10)
MergingIterPrev/restart=16/count=5-16               118ns ± 2%    117ns ± 1%   ~        (p=0.085 n=10+8)
MergingIterSeqSeekGEWithBounds/levelCount=5-16      3.64µs ± 7%  0.73µs ± 2%  -79.86%  (p=0.000 n=10+10)

name                                                old time/op  new time/op  delta
LevelIterSeekGE/restart=16/count=5-16               1.23µs ± 4%  1.23µs ± 3%  ~        (p=0.765 n=10+9)
LevelIterSeqSeekGEWithBounds/restart=16/count=5-16  862ns ± 8%   147ns ± 3%   -82.93%  (p=0.000 n=10+10)
LevelIterNext/restart=16/count=5-16                 18.7ns ± 4%  18.5ns ± 4%  ~        (p=0.386 n=10+9)
LevelIterPrev/restart=16/count=5-16                 31.4ns ± 8%  32.5ns ± 7%  ~        (p=0.055 n=10+10)
TFTR!
Reviewable status: 0 of 5 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @jbowens)
level_iter.go, line 76 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
These first two sentences took me a few tries to understand.
Is the first sentence saying that if rangeDelIter != nil, it must point to a range del iterator corresponding to the current file?
Correct. The phrasing was indeed awkward -- I've adopted yours.
These small optimizations are for index joins with large numbers of rows, which are common with geospatial queries. - No longer uses the keyToInputRowIndices map which was consuming about 1.5% in cpu profiles - Sorts the spans to use optimizations at the storage layer, like cockroachdb/pebble#860 - Avoids constructing the partial key string since it is not used Release note: None
The previous behavior was risky and also defeats the optimization introduced in cockroachdb/pebble#860 Release note: None
This is a WIP and has the changes for singleLevelIterator.SeekGE. It requires strengthening the requirement that the user call SeekGE with a key that respects the lower bound.

The SeqSeek benchmarks are better. Overall, between this and the preceding PR, the MergingIterSeqSeek benchmark shows a 5x improvement.

I'd like some feedback before adding SeekLT, SeekPrefixGE, and the twoLevelIterator changes.

The more complete microbenchmark results comparing with master show some slowdown, possibly because of wasted work when there is poor locality. I need to investigate some more.