sstable: optimize seeks to use next #860

Merged: 1 commit into cockroachdb:master on Aug 17, 2020

Conversation

sumeerbhola (Collaborator)

This is a WIP and has the changes for singleLevelIterator.SeekGE.
It requires strengthening the requirement that the user call SeekGE
with a key that respects the lower bound.

The SeqSeek benchmarks are better

name                                                 old time/op    new time/op    delta
LevelIterSeqSeekGEWithBounds/restart=16/count=5-16     661ns ± 1%     152ns ± 1%  -76.96%  (p=0.000 n=8+9)
MergingIterSeqSeekGEWithBounds/levelCount=5-16        1.97µs ± 2%    0.72µs ± 2%  -63.37%  (p=0.000 n=9+10)

Overall, between this and the preceding PR, the MergingIterSeqSeek
benchmark shows a 5x improvement.

I'd like some feedback before adding SeekLT, SeekPrefixGE, and the
twoLevelIterator changes.

The more complete microbenchmark results comparing with master show
some slowdown, possibly because of wasted work when there is poor
locality. I need to investigate some more.

name                                            old time/op    new time/op    delta
MergingIterSeekGE/restart=16/count=1-16            783ns ± 6%     771ns ± 2%     ~     (p=0.986 n=10+9)
MergingIterSeekGE/restart=16/count=2-16           1.58µs ± 3%    1.59µs ± 2%   +1.07%  (p=0.045 n=10+9)
MergingIterSeekGE/restart=16/count=3-16           2.36µs ± 1%    2.41µs ± 2%   +2.07%  (p=0.000 n=8+10)
MergingIterSeekGE/restart=16/count=4-16           3.21µs ± 3%    3.26µs ± 2%   +1.64%  (p=0.005 n=10+9)
MergingIterSeekGE/restart=16/count=5-16           4.07µs ± 3%    4.16µs ± 1%   +2.10%  (p=0.008 n=10+9)
MergingIterNext/restart=16/count=1-16             37.4ns ± 0%    36.8ns ± 0%   -1.59%  (p=0.000 n=9+8)
MergingIterNext/restart=16/count=2-16             60.1ns ± 2%    60.3ns ± 1%     ~     (p=0.195 n=10+10)
MergingIterNext/restart=16/count=3-16             76.5ns ± 1%    76.4ns ± 1%     ~     (p=0.984 n=10+9)
MergingIterNext/restart=16/count=4-16             88.6ns ± 1%    88.7ns ± 2%     ~     (p=0.826 n=10+9)
MergingIterNext/restart=16/count=5-16              104ns ± 1%     105ns ± 1%   +1.42%  (p=0.003 n=9+8)
MergingIterPrev/restart=16/count=1-16             50.5ns ± 2%    55.6ns ± 4%  +10.07%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=2-16             74.2ns ± 1%    78.5ns ± 1%   +5.88%  (p=0.000 n=10+8)
MergingIterPrev/restart=16/count=3-16             90.9ns ± 2%    99.4ns ± 5%   +9.32%  (p=0.000 n=9+10)
MergingIterPrev/restart=16/count=4-16              101ns ± 0%     108ns ± 1%   +7.30%  (p=0.000 n=8+8)
MergingIterPrev/restart=16/count=5-16              117ns ± 3%     135ns ± 7%  +15.04%  (p=0.000 n=10+10)
MergingIterSeqSeekGEWithBounds/levelCount=5-16    3.68µs ±11%    0.75µs ± 3%  -79.58%  (p=0.000 n=10+9)

name                                                old time/op    new time/op    delta
LevelIterSeekGE/restart=16/count=5-16                 1.29µs ± 3%    1.48µs ±13%  +14.78%  (p=0.000 n=10+9)
LevelIterSeqSeekGEWithBounds/restart=16/count=5-16     892ns ± 2%     155ns ± 1%  -82.59%  (p=0.000 n=10+8)
LevelIterNext/restart=16/count=5-16                   19.0ns ± 2%    20.6ns ±11%   +8.64%  (p=0.000 n=10+10)

@sumeerbhola (Collaborator, Author) left a comment

The first commit is from #858

@sumeerbhola (Collaborator, Author) left a comment

This now passes the metamorphic test running for 2 hours. I am going to clean this up to add the optimization for SeekLT too.

I looked into existing workloads/microbenchmarks that may show improvement:

  • pebble ycsb does not set bounds, so it won't benefit
  • BenchmarkMVCCScan uses an iterator for a single scan (it is not reused), and consecutive scans are randomly positioned with respect to each other, so it won't benefit either

Suggestions on what to try are welcome.

For my geospatial query, there is ~12% improvement with this optimization.

One issue is that to make use of this optimization we will need to change pebble_iterator.go to stop using lowerBoundBuf, upperBoundBuf https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble_iterator.go#L145-L159
Reusing those buffers defeats the optimization: all the levelIters/singleLevelIterators share the same slices, so rewriting the buffers changes the bounds out from under them, and by the time singleLevelIterator.SetBounds does its bounds comparison, the old slices no longer hold the previous bounds. Changing these bounds slices from under the iterators is dangerous anyway -- it just happens to work now. But dropping the buffers will incur two memory allocations each time we reuse the iterator. Thoughts?
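
To make the aliasing concrete, here is a minimal, hypothetical sketch -- the type, method body, and reuse pattern are invented for illustration and are not the actual Pebble or pebble_iterator.go code -- of why an iterator that merely retains the caller's bounds slices cannot detect "bounds unchanged" once the caller rewrites those slices in place:

```go
package boundsketch

import "bytes"

// iter stands in for a singleLevelIterator-like iterator that retains the
// caller's bounds slices instead of copying them.
type iter struct {
	lower, upper []byte
}

// SetBounds reports whether the new bounds equal the previously set bounds
// before storing them. That comparison is only meaningful if i.lower and
// i.upper still contain the old bytes when SetBounds is called.
func (i *iter) SetBounds(lower, upper []byte) (unchanged bool) {
	unchanged = bytes.Equal(i.lower, lower) && bytes.Equal(i.upper, upper)
	i.lower, i.upper = lower, upper
	return unchanged
}

// reuseBufferBug shows how reusing one backing buffer across calls defeats the
// check: the previous bounds are overwritten before SetBounds can compare them.
func reuseBufferBug(it *iter, lowerBuf, newLower, newUpper []byte) {
	// Assume lowerBuf is the same slice that was passed to the previous
	// SetBounds call, so it.lower aliases its backing array. Rewriting it in
	// place (while the new bytes fit in its capacity) destroys the old bounds.
	lowerBuf = append(lowerBuf[:0], newLower...)
	it.SetBounds(lowerBuf, append([]byte(nil), newUpper...))
}
```

If SetBounds copied its inputs into slices the caller never touches, or the caller allocated fresh slices per call, the old bounds would survive until the comparison -- which is exactly the allocation tradeoff raised above.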

@sumeerbhola (Collaborator, Author) left a comment

> For my geospatial query, there is ~12% improvement with this optimization.

I compared Pebble master with RocksDB, and then with Pebble including both these commits. These are just eyeball numbers: Pebble master and RocksDB are both similar, at ~83ms. With these changes Pebble is ~66ms, so a significant improvement.

@petermattis (Collaborator) left a comment

> One issue is that to make use of this optimization we will need to change pebble_iterator.go to stop using lowerBoundBuf, upperBoundBuf

This is because SetBounds() doesn't make a copy of its inputs, right? I wonder if that is problematic.

I also want to understand our SetBounds() calls in CRDB better. The geospatial query is a single BatchRequest with thousands of Scans. That BatchRequest will be subdivided along range boundaries and the same iterator will be reused for all of the Scan requests. We should only have to call SetBounds once for the BatchRequest, not for each Scan. I suppose the abstractions don't make this easy. This is from memory, so it is worth verifying that I'm not misremembering.

> I compared Pebble master with RocksDB, and then with Pebble including both these commits. These are just eyeball numbers: Pebble master and RocksDB are both similar, at ~83ms. With these changes Pebble is ~66ms, so a significant improvement.

Very nice.

@sumeerbhola (Collaborator, Author) left a comment

> This is because SetBounds() doesn't make a copy of its inputs, right? I wonder if that is problematic.

Yes, it doesn't make a copy, and copying would also fix this. But the memory allocation would remain: the copy can't reuse a "buffer" slice either, since that would reintroduce the same issue -- singleLevelIter.{lower,upper} would change before singleLevelIter.SetBounds was called.

> I also want to understand our SetBounds() calls in CRDB better. The geospatial query is a single BatchRequest with thousands of Scans. That BatchRequest will be subdivided along range boundaries and the same iterator will be reused for all of the Scan requests. We should only have to call SetBounds once for the BatchRequest, not for each Scan.

Each ScanRequest in the batch has a Key and EndKey, and those bounds are used for the iterator. Calling SetBounds once with the spanning union of the spans would make the MVCC code more complicated, since it would need to keep track of stepping out of each scan's bounds.
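
For concreteness, a rough sketch of the per-scan pattern described above, assuming the pebble.Iterator methods SetBounds, SeekGE, Next, and Key; the span type and the surrounding function are hypothetical stand-ins for a ScanRequest's bounds:

```go
package scansketch

import "github.com/cockroachdb/pebble"

// span is a hypothetical stand-in for a ScanRequest's [Key, EndKey) bounds.
type span struct {
	start, end []byte
}

// scanBatch reuses one iterator across a batch, calling SetBounds once per
// scan. Each SeekGE key equals the scan's lower bound, so it respects the
// bound as required, and with the spans in ascending order the bounds only
// ever move forward.
func scanBatch(it *pebble.Iterator, spans []span) [][]byte {
	var keys [][]byte
	for _, sp := range spans {
		it.SetBounds(sp.start, sp.end)
		for valid := it.SeekGE(sp.start); valid; valid = it.Next() {
			// The upper bound makes iteration stop at sp.end automatically.
			keys = append(keys, append([]byte(nil), it.Key()...))
		}
	}
	return keys
}
```

Sorting the spans (as the later joinReader change does) keeps the bounds monotonic, which is what lets the sstable iterators reuse their current position.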

@petermattis (Collaborator) left a comment

> Each ScanRequest in the batch has a Key and EndKey, and those bounds are used for the iterator. Calling SetBounds once with the spanning union of the spans would make the MVCC code more complicated, since it would need to keep track of stepping out of each scan's bounds.

I was thinking we could call SetBounds once with the bounds of the Range. For historical context, initially we didn't set an upper or lower bound for scans, but then we ran into problems in RocksDB where we'd iterate outside of a range's boundaries and into a region of tombstones. Of course, if we set the bounds to the range boundaries, then we'd need to check the scan against EndKey, so perhaps we're not gaining anything by avoiding SetBounds calls on each Scan request.
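
A hypothetical sketch of that alternative, for contrast with the per-scan SetBounds pattern above; the span type and function are invented for illustration, and the range-boundary arguments stand in for the Range's start and end keys:

```go
package rangescansketch

import (
	"bytes"

	"github.com/cockroachdb/pebble"
)

// span is a hypothetical stand-in for a ScanRequest's [Key, EndKey) bounds.
type span struct {
	start, end []byte
}

// scanBatchRangeBounds sets the iterator bounds once, to the Range's
// boundaries, and stops each scan at its own EndKey with an explicit
// comparison in the caller instead of a per-scan SetBounds call.
func scanBatchRangeBounds(it *pebble.Iterator, rangeStart, rangeEnd []byte, spans []span) [][]byte {
	it.SetBounds(rangeStart, rangeEnd)
	var keys [][]byte
	for _, sp := range spans {
		for valid := it.SeekGE(sp.start); valid && bytes.Compare(it.Key(), sp.end) < 0; valid = it.Next() {
			keys = append(keys, append([]byte(nil), it.Key()...))
		}
	}
	return keys
}
```

The extra bytes.Compare per step is the "check the scan against EndKey" cost mentioned above.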

@sumeerbhola changed the title from "[WIP] sstable: optimize seeks to use next when possible" to "sstable: optimize seeks to use next" on Aug 16, 2020
@sumeerbhola (Collaborator, Author) left a comment

This PR is cleaned up and ready for a proper review. The microbenchmark results are included.

> I was thinking we could call SetBounds once with the bounds of the Range. For historical context, initially we didn't set an upper or lower bound for scans, but then we ran into problems in RocksDB where we'd iterate outside of a range's boundaries and into a region of tombstones. Of course, if we set the bounds to the range boundaries, then we'd need to check the scan against EndKey, so perhaps we're not gaining anything by avoiding SetBounds calls on each Scan request.

Interesting point about checking against the scan bounds in the caller. It pushes more work into the caller, but may be more efficient since we would compare once, versus each singleLevelIterator doing the comparison. Though the heap may also have more iterators in this setting, so we may give back those potential gains. I'd like to delay exploring this option since there are more possible tradeoffs. Also, we'd still want the benefit of optimizing consecutive seeks that stay at the same block, and the current bounds approach gives us an easier way to assert an invariant.

@jbowens (Collaborator) left a comment

:LGTM:

Reviewed 4 of 6 files at r1.


level_iter.go, line 76 at r1 (raw file):

	newIters tableNewIters
	// When rangeDelIter != nil, the caller requires that a range del iterator
	// corresponding to the current file be placed in *rangeDelIter. When this

These first two sentences took me a few tries to understand.

Is the first sentence saying that if rangeDelIter != nil, it must point to a range del iterator corresponding to the current file?

This optimization is used when the seek would position at the
current block of the iterator. It relies on having bounds set
on the iterator, since that is how we can claim certain invariants
about the current iterator position. Ascending or descending bounds
are how CockroachDB executes batched scans, and the
SeqSeekGEWithBounds microbenchmarks, which show significant
improvement, imitate this pattern.

name                                            old time/op    new time/op    delta
MergingIterSeekGE/restart=16/count=1-16            841ns ± 5%     806ns ± 3%   -4.12%  (p=0.014 n=8+10)
MergingIterSeekGE/restart=16/count=2-16           1.78µs ± 9%    1.67µs ± 4%   -6.21%  (p=0.001 n=9+10)
MergingIterSeekGE/restart=16/count=3-16           2.60µs ±19%    2.46µs ± 2%     ~     (p=0.529 n=10+10)
MergingIterSeekGE/restart=16/count=4-16           3.44µs ±10%    3.34µs ± 4%     ~     (p=0.123 n=10+10)
MergingIterSeekGE/restart=16/count=5-16           4.35µs ± 5%    4.35µs ± 4%     ~     (p=0.825 n=10+9)
MergingIterNext/restart=16/count=1-16             37.7ns ± 4%    37.2ns ± 1%   -1.29%  (p=0.008 n=9+10)
MergingIterNext/restart=16/count=2-16             60.7ns ± 6%    59.6ns ± 1%     ~     (p=0.209 n=10+10)
MergingIterNext/restart=16/count=3-16             77.4ns ± 3%    75.8ns ± 1%   -2.10%  (p=0.036 n=10+9)
MergingIterNext/restart=16/count=4-16             86.9ns ± 2%    87.7ns ± 1%   +0.91%  (p=0.007 n=9+10)
MergingIterNext/restart=16/count=5-16              104ns ± 2%     103ns ± 2%     ~     (p=0.907 n=10+9)
MergingIterPrev/restart=16/count=1-16             52.2ns ± 3%    51.6ns ± 2%     ~     (p=0.118 n=10+10)
MergingIterPrev/restart=16/count=2-16             76.5ns ± 2%    75.8ns ± 2%     ~     (p=0.130 n=9+9)
MergingIterPrev/restart=16/count=3-16             94.3ns ± 2%    92.2ns ± 2%   -2.22%  (p=0.001 n=10+10)
MergingIterPrev/restart=16/count=4-16              103ns ± 1%     102ns ± 2%   -0.87%  (p=0.038 n=10+10)
MergingIterPrev/restart=16/count=5-16              118ns ± 2%     117ns ± 1%     ~     (p=0.085 n=10+8)
MergingIterSeqSeekGEWithBounds/levelCount=5-16    3.64µs ± 7%    0.73µs ± 2%  -79.86%  (p=0.000 n=10+10)

name                                                old time/op    new time/op    delta
LevelIterSeekGE/restart=16/count=5-16                 1.23µs ± 4%    1.23µs ± 3%     ~     (p=0.765 n=10+9)
LevelIterSeqSeekGEWithBounds/restart=16/count=5-16     862ns ± 8%     147ns ± 3%  -82.93%  (p=0.000 n=10+10)
LevelIterNext/restart=16/count=5-16                   18.7ns ± 4%    18.5ns ± 4%     ~     (p=0.386 n=10+9)
LevelIterPrev/restart=16/count=5-16                   31.4ns ± 8%    32.5ns ± 7%     ~     (p=0.055 n=10+10)
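
To illustrate the mechanism described in the text above, here is a self-contained toy sketch -- not Pebble's actual block iterator; the data layout, threshold, and names are invented: when the seek target is at or ahead of the current position within the current block, a bounded number of Next steps can replace a full binary search.

```go
package seeksketch

import "bytes"

// blockIter is a toy stand-in for an sstable block iterator: a sorted slice
// of keys and a current position.
type blockIter struct {
	keys [][]byte
	pos  int
}

// seekGE positions the iterator at the first key >= target. When the target
// is at or ahead of the current key, it first tries a bounded number of
// forward steps (the "use next" path); otherwise, or if that gives up, it
// falls back to a full binary search over the block.
func (i *blockIter) seekGE(target []byte) ([]byte, bool) {
	const maxStepsBeforeSeek = 16 // bound the wasted work when locality is poor
	if i.pos >= 0 && i.pos < len(i.keys) && bytes.Compare(i.keys[i.pos], target) <= 0 {
		for steps := 0; steps < maxStepsBeforeSeek && i.pos < len(i.keys); steps++ {
			if bytes.Compare(i.keys[i.pos], target) >= 0 {
				return i.keys[i.pos], true // reached the target by stepping forward
			}
			i.pos++
		}
	}
	// Fallback: binary search for the first key >= target.
	lo, hi := 0, len(i.keys)
	for lo < hi {
		mid := (lo + hi) / 2
		if bytes.Compare(i.keys[mid], target) < 0 {
			lo = mid + 1
		} else {
			hi = mid
		}
	}
	i.pos = lo
	if lo < len(i.keys) {
		return i.keys[lo], true
	}
	return nil, false
}
```

In this sketch the bounded step count caps the wasted work when locality is poor, and the monotonic-bounds contract (the strengthened SeekGE requirement from the PR description) is what guarantees the current position is a valid starting point.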
@sumeerbhola (Collaborator, Author) left a comment

TFTR!


level_iter.go, line 76 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

These first two sentences took me a few tries to understand.

Is the first sentence saying that if rangeDelIter != nil, it must point to a range del iterator corresponding to the current file?

Correct. The phrasing was indeed awkward -- I've adopted yours.

@sumeerbhola sumeerbhola merged commit a24bc6c into cockroachdb:master Aug 17, 2020
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this pull request Aug 18, 2020
These small optimizations are for index joins with large
numbers of rows, which are common with geospatial queries.
- No longer uses the keyToInputRowIndices map which
  was consuming about 1.5% in cpu profiles
- Sorts the spans to use optimizations at the storage layer,
  like cockroachdb/pebble#860
- Avoids constructing the partial key string since it is
  not used

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this pull request Aug 18, 2020
The previous behavior was risky and also defeats the optimization
introduced in cockroachdb/pebble#860

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Aug 18, 2020
52941: storage: no longer mutate bounds slices from under the pebble.Iterator r=sumeerbhola a=sumeerbhola

The previous behavior was risky and also defeats the optimization
introduced in cockroachdb/pebble#860

Release note: None

52943: rowexec: tiny optimizations to invertedJoiner r=sumeerbhola a=sumeerbhola

The unnecessary sort was consuming 1% cpu in profiles on a sample
query, since geospatial joins generate a large number of spans.
The larger batching also helps a small amount by slightly reducing
the total number of spans.

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this pull request Aug 21, 2020
These small optimizations are for index joins with large
numbers of rows, which are common with geospatial queries.
- No longer uses the keyToInputRowIndices map which
  was consuming about 1.5% in cpu profiles
- Sorts the spans to use optimizations at the storage layer,
  like cockroachdb/pebble#860
- Avoids constructing the partial key string since it is
  not used

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this pull request Aug 25, 2020
These small optimizations are for index joins with large
numbers of rows, which are common with geospatial queries.
- No longer uses the keyToInputRowIndices map which
  was consuming about 1.5% in cpu profiles
- Sorts the spans to use optimizations at the storage layer,
  like cockroachdb/pebble#860
- Avoids constructing the partial key string since it is
  not used

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Aug 25, 2020
52952: rowexec: joinReader optimizations for index joins r=sumeerbhola a=sumeerbhola

These small optimizations are for index joins with large
numbers of rows, which are common with geospatial queries.
- No longer uses the keyToInputRowIndices map which
  was consuming about 1.5% in cpu profiles
- Sorts the spans to use optimizations at the storage layer,
  like cockroachdb/pebble#860
- Avoids constructing the partial key string since it is
  not used

Release note: None

53296: builtins: Implement ST_FlipCoordinates on arguments {geometry} r=otan a=themistoklik

Should adopt PostGIS behavior. 

Release note (sql change): Implemented geometry builtin `ST_FlipCoordinates`
Closes #48932 

53307: implement ST_SharedPaths r=otan a=devenbhooshan

Implemented the geometry based builtin `ST_SharedPaths`

53343: sql: remove auto-parallelism of UNION ALL r=yuzefovich a=yuzefovich

Previously, the `UNION ALL` construct was handled by simply merging the
streams, which could be done by unordered synchronizers, the source of
concurrency in the engine. As a result, in some extreme cases (like when
we have many UNION ALL constructs in the same query) we could have
essentially unbounded parallelism that would stress the resources too
much and could lead to OOM crashes.

This issue is now fixed by using ordered synchronizers for the merging
of the streams (which runs all inputs in the same goroutine). This
approach is taken instead of implementing a separate UnionAll processor
or operator because it is least invasive and should be easily
backportable.

Fixes: #51548.

Release note (sql change): The concurrency of evaluation of UNION ALL
queries has been reduced. Previously, such queries could end up crashing
the server due to OOM in some extreme cases, and now that should be
fixed at the expense of possible minor reduction in performance.

53358: sql: support the `USAGE` privilege on schemas r=rohany a=rohany

Fixes #53342.

This commit adds the `USAGE` privilege to schemas. The `USAGE` privilege
allows a user to resolve objects under a schema.

Release note (sql change): Support the `USAGE` privilege on schemas.

53421: importccl: ignore ANALYZE in PGDUMP r=mjibson a=mjibson

Release note: None

Fixes #53419

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Themis Papavasileiou <tpapavasileiou@bol.com>
Co-authored-by: Deven Bhooshan <devenbhooshan@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
@sumeerbhola sumeerbhola deleted the sli branch October 26, 2021 13:19