sstable: optimize seeks to use next #860
Conversation
The first commit is from #858
Reviewable status: 0 of 8 files reviewed, all discussions resolved (waiting on @itsbilal, @jbowens, and @petermattis)
This now passes the metamorphic test running for 2 hours. I am going to clean this up to add the optimization for SeekLT too.
I looked into existing workloads/microbenchmarks that may show improvement:
- pebble ycsb does not set bounds so won't benefit
- BenchmarkMVCCScan uses an iterator for a single scan (it is not reused), and consecutive scans are randomly positioned with respect to each other, so it won't benefit either.
Suggestions on what to try are welcome.
For my geospatial query, there is ~12% improvement with this optimization.
One issue: to make use of this optimization we will need to change pebble_iterator.go to stop using lowerBoundBuf and upperBoundBuf:
https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble_iterator.go#L145-L159
The reuse of these buffers defeats this optimization: it changes the bounds from under all the iterators (levelIters/singleLevelIterators), since they share the same slice. So when singleLevelIterator.SetBounds does the bounds comparison, it no longer has the actual previous bounds, since the slice has been changed. Changing these bounds slices from under these iterators is a dangerous thing anyway -- it just happens to work now. But stopping the reuse will incur two memory allocations each time we reuse the iterator. Thoughts?
Reviewable status: 0 of 9 files reviewed, all discussions resolved (waiting on @itsbilal, @jbowens, and @petermattis)
For my geospatial query, there is ~12% improvement with this optimization.
I compared Pebble master with RocksDB and then with Pebble with both these commits. These are just eyeball numbers: Pebble and RocksDB are both similar ~83ms. With these changes Pebble is ~66ms, so a significant improvement.
Reviewable status: 0 of 9 files reviewed, all discussions resolved (waiting on @itsbilal, @jbowens, and @petermattis)
One issue is that to make use of this optimization we will need to change pebble_iterator.go to stop using lowerBoundBuf, upperBoundBuf
This is because SetBounds() doesn't make a copy of its inputs, right? I wonder if that is problematic.
I also want to understand our SetBounds() calls in CRDB better. The geospatial query is a single BatchRequest with thousands of Scans. That BatchRequest will be subdivided along range boundaries and the same iterator will be reused for all of the Scan requests. We should only have to call SetBounds once for the BatchRequest, not for each Scan. I suppose the abstractions don't make this easy. This is from memory, so it is worth verifying that I'm not misremembering.
I compared Pebble master with RocksDB and then with Pebble with both these commits. These are just eyeball numbers: Pebble and RocksDB are both similar ~83ms. With these changes Pebble is ~66ms, so a significant improvement.
Very nice.
Reviewable status: 0 of 9 files reviewed, all discussions resolved (waiting on @itsbilal, @jbowens, and @petermattis)
This is because SetBounds() doesn't make a copy of its inputs, right? I wonder if that is problematic.
Yes, it doesn't make a copy, and that would also fix it. But the memory allocation will remain, since SetBounds can't reuse a "buffer" slice for the copy: that would result in the same issue -- singleLevelIter.{lower,upper} would change before singleLevelIter.SetBounds was called.
I also want to understand our SetBounds() calls in CRDB better. The geospatial query is a single BatchRequest with thousands of Scans. That BatchRequest will be subdivided along range boundaries and the same iterator will be reused for all of the Scan requests. We should only have to call SetBounds once for the BatchRequest, not for each Scan.
Each ScanRequest in the batch has a Key and EndKey, and those bounds are used for the iterator -- calling SetBounds once with the spanning union of the spans would make the MVCC code more complicated, since it would need to keep track of stepping out of bounds.
Reviewable status: 0 of 9 files reviewed, all discussions resolved (waiting on @itsbilal and @jbowens)
Each ScanRequest in the batch has a Key, EndKey, and those bounds are used for the iterator -- calling SetBounds once with the spanning union of the spans would make the MVCC code more complicated since it would need to keep track of stepping out of bounds.
I was thinking we could call SetBounds once with the bounds of the Range. For historical context: initially we didn't set an upper or lower bound for scans, but then we ran into problems in RocksDB where we'd iterate outside of a range's boundaries and into a region of tombstones. Of course, if we set the bounds to the range boundaries, then we'd need to check the scan against EndKey, so perhaps we're not gaining anything by avoiding SetBounds calls on each Scan request.
Reviewable status: 0 of 9 files reviewed, all discussions resolved (waiting on @itsbilal and @jbowens)
Force-pushed from 94c5540 to b476cb5.
This PR is cleaned up and ready for a proper review. The microbenchmark results are included.
I was thinking we could call SetBounds once with the bounds of the Range. For historical context, initially we didn't set an upper or lower bound for scans, but then we ran into problems in RocksDB where we'd iterate outside of a range's boundaries and into region of tombstones. Of course, if we set the bounds to the range boundaries, then we'd need to check the scan against EndKey, so perhaps we're not gaining anything with avoiding SetBounds calls on each Scan request.
Interesting point about checking in the caller against the scan bounds. It pushes more work into the caller, but may be more efficient since we would compare once, versus each singleLevelIterator doing the comparison. Though the heap may also have more iterators in this setting so we may give back those potential gains. I'd like to delay exploring this option since there are more possible tradeoffs. Also, we'd still want to get the benefit of optimizing consecutive seeks that stay at the same block, and the current bounds approach gives us an easier way to assert an invariant.
Reviewable status: 0 of 10 files reviewed, all discussions resolved (waiting on @itsbilal and @jbowens)
Reviewed 4 of 6 files at r1.
Reviewable status: 4 of 10 files reviewed, 1 unresolved discussion (waiting on @itsbilal, @jbowens, and @sumeerbhola)
level_iter.go, line 76 at r1 (raw file):
newIters tableNewIters
// When rangeDelIter != nil, the caller requires that a range del iterator
// corresponding to the current file be placed in *rangeDelIter. When this
These first two sentences took me a few tries to understand.
Is the first sentence saying that if rangeDelIter != nil, it must point to a range del iterator corresponding to the current file?
This optimization is used when the seek would position at the current block of the iterator. It relies on having bounds set on the iterator, since that is how we can claim certain invariants about the current iterator position. Ascending or descending bounds are how CockroachDB executes batched scans, and the SeqSeekGEWithBounds microbenchmarks, which show significant improvement, imitate this pattern.

name                                                old time/op  new time/op  delta
MergingIterSeekGE/restart=16/count=1-16             841ns ± 5%   806ns ± 3%   -4.12%   (p=0.014 n=8+10)
MergingIterSeekGE/restart=16/count=2-16             1.78µs ± 9%  1.67µs ± 4%  -6.21%   (p=0.001 n=9+10)
MergingIterSeekGE/restart=16/count=3-16             2.60µs ±19%  2.46µs ± 2%  ~        (p=0.529 n=10+10)
MergingIterSeekGE/restart=16/count=4-16             3.44µs ±10%  3.34µs ± 4%  ~        (p=0.123 n=10+10)
MergingIterSeekGE/restart=16/count=5-16             4.35µs ± 5%  4.35µs ± 4%  ~        (p=0.825 n=10+9)
MergingIterNext/restart=16/count=1-16               37.7ns ± 4%  37.2ns ± 1%  -1.29%   (p=0.008 n=9+10)
MergingIterNext/restart=16/count=2-16               60.7ns ± 6%  59.6ns ± 1%  ~        (p=0.209 n=10+10)
MergingIterNext/restart=16/count=3-16               77.4ns ± 3%  75.8ns ± 1%  -2.10%   (p=0.036 n=10+9)
MergingIterNext/restart=16/count=4-16               86.9ns ± 2%  87.7ns ± 1%  +0.91%   (p=0.007 n=9+10)
MergingIterNext/restart=16/count=5-16               104ns ± 2%   103ns ± 2%   ~        (p=0.907 n=10+9)
MergingIterPrev/restart=16/count=1-16               52.2ns ± 3%   51.6ns ± 2%  ~        (p=0.118 n=10+10)
MergingIterPrev/restart=16/count=2-16               76.5ns ± 2%   75.8ns ± 2%  ~        (p=0.130 n=9+9)
MergingIterPrev/restart=16/count=3-16               94.3ns ± 2%   92.2ns ± 2%  -2.22%   (p=0.001 n=10+10)
MergingIterPrev/restart=16/count=4-16               103ns ± 1%    102ns ± 2%   -0.87%   (p=0.038 n=10+10)
MergingIterPrev/restart=16/count=5-16               118ns ± 2%    117ns ± 1%   ~        (p=0.085 n=10+8)
MergingIterSeqSeekGEWithBounds/levelCount=5-16      3.64µs ± 7%  0.73µs ± 2%  -79.86%  (p=0.000 n=10+10)

name                                                old time/op  new time/op  delta
LevelIterSeekGE/restart=16/count=5-16               1.23µs ± 4%  1.23µs ± 3%  ~        (p=0.765 n=10+9)
LevelIterSeqSeekGEWithBounds/restart=16/count=5-16  862ns ± 8%   147ns ± 3%   -82.93%  (p=0.000 n=10+10)
LevelIterNext/restart=16/count=5-16                 18.7ns ± 4%  18.5ns ± 4%  ~        (p=0.386 n=10+9)
LevelIterPrev/restart=16/count=5-16                 31.4ns ± 8%  32.5ns ± 7%  ~        (p=0.055 n=10+10)
TFTR!
Reviewable status: 0 of 5 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @jbowens)
level_iter.go, line 76 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
These first two sentences took me a few tries to understand.
Is the first sentence saying that if rangeDelIter != nil, it must point to a range del iterator corresponding to the current file?
Correct. The phrasing was indeed awkward -- I've adopted yours.
These small optimizations are for index joins with large numbers of rows, which are common with geospatial queries. - No longer uses the keyToInputRowIndices map which was consuming about 1.5% in cpu profiles - Sorts the spans to use optimizations at the storage layer, like cockroachdb/pebble#860 - Avoids constructing the partial key string since it is not used Release note: None
The previous behavior was risky and also defeats the optimization introduced in cockroachdb/pebble#860 Release note: None
This is a WIP and has the changes for singleLevelIterator.SeekGE. It requires strengthening the requirement that the user call SeekGE with a key that respects the lower bound.

The SeqSeek benchmarks are better. Overall, between this and the preceding PR, the MergingIterSeqSeek benchmark shows a 5x improvement.

I'd like some feedback before adding SeekLT, SeekPrefixGE, and the twoLevelIterator changes.

The more complete microbenchmark results comparing with master show some slowdown, possibly because of wasted work when there is poor locality. I need to investigate some more.