
engine: require iterators to specify an upper bound #26872

Merged
craig bot merged 2 commits into cockroachdb:master from iter-upper-bound on Jul 6, 2018

Conversation

benesch
Contributor

@benesch benesch commented Jun 20, 2018

As promised. The issue the first commit fixes took longer than I'd like to admit to track down.

I've done my best to keep the patch as small as possible in case we want to backport to 2.0. The changes to the C++ DBIterator are unfortunately large, though. Open to less disruptive ways of plumbing the upper bound there, but I'm not seeing anything obvious.


While investigating range deletion performance issues, we realized that
large swaths of contiguous tombstones can cause massive performance
issues when seeking. Seeks of more than 15 minutes (!) were observed on
one cluster [0]. The problem is that every key overlapping the range
deletion tombstone must be loaded from disk, decompressed, and compared
with the sequence number of the tombstone until a key is found that is
not covered by the tombstone. If contiguous keys representing hundreds
of gigabytes are covered by tombstones, RocksDB will need to scan
hundreds of gigabytes of data. Needless to say, performance suffers.

We have plans to improve seek performance in the face of range deletion
tombstones upstream, but we can mitigate the issue locally, too.
Iteration outside of tests is nearly always in the context of a range or
some bounded span of local keys. By plumbing knowledge of the upper
bound we care about for each scan to RocksDB, we can invalidate the
iterator early, once the upper bound has been exceeded, rather than
scanning over potentially hundreds of gigabytes of deleted keys just to
find a key that we don't care about.

To ensure we don't forget to specify an upper bound in a critical path,
this commit requires that all iterators are either prefix iterators or
declare their upper bound. This makes iterating in tests rather irksome,
but I think the tradeoff is worthwhile.

[0]: #24029 (comment)
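
For a sense of what this looks like at a call site, here is a minimal sketch — not code from this PR; the helper name and loop shape are assumptions, and it presumes the `pkg/storage/engine` interfaces:

```go
// boundedScan sketches iteration under the new rule: every non-prefix
// iterator declares an UpperBound, which is plumbed down to RocksDB so the
// iterator is invalidated at `end` instead of scanning across deleted keys.
func boundedScan(eng engine.Reader, start, end roachpb.Key) error {
	iter := eng.NewIterator(engine.IterOptions{UpperBound: end})
	defer iter.Close()

	for iter.Seek(engine.MakeMVCCMetadataKey(start)); ; iter.Next() {
		if ok, err := iter.Valid(); err != nil {
			return err
		} else if !ok {
			break // exhausted, or the upper bound was reached
		}
		// ... process iter.Key() / iter.Value() ...
	}
	return nil
}
```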

@benesch benesch requested review from bdarnell, tbg, petermattis and a team June 20, 2018 21:32
@cockroach-teamcity
Member

This change is Reviewable

@benesch benesch force-pushed the iter-upper-bound branch 3 times, most recently from f9ba809 to a0a2e7a on June 21, 2018 05:45
@petermattis
Collaborator

I'll have to take a closer look at the change to BaseDeltaIterator. I don't have a suggestion for how to avoid changing all of the call sites like you did. I suppose you could add an engine.UnlimitedIterOptions global, but I'm not sure if that is any clearer or more concise than using IterOptions{UpperBound: roachpb.KeyMax} as you've done.
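
For reference, the two spellings being weighed — the global is hypothetical (it was not added); the PR uses the explicit literal:

```go
// Hypothetical convenience global, as floated above; not added by this PR.
var UnlimitedIterOptions = IterOptions{UpperBound: roachpb.KeyMax}

func exampleUnbounded(e Reader) Iterator {
	// Equivalent to the PR's explicit IterOptions{UpperBound: roachpb.KeyMax}
	// at call sites (mostly tests) that genuinely want an unbounded scan.
	return e.NewIterator(UnlimitedIterOptions)
}
```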


Reviewed 2 of 2 files at r1.


c-deps/libroach/batch.cc, line 116 at r2 (raw file):

// The prefix_same_as_start and upper_bound settings must also be applied to
// base_iterator for correct operation.
class BaseDeltaIterator : public rocksdb::Iterator {

Unrelated to this change, but I've recently realized that a WriteBatchWithIndex is really just another layer of memtable and all of the BaseDeltaIterator crap could be folded into MergingIterator and DBIter. Too large a refactor at this point.


pkg/storage/engine/rocksdb.go, line 1915 at r2 (raw file):

// MaxTimestamp fields in IterOptions. Separating normal and time-bound
// iterators is silly: any iterator, even prefix iterators, can become
// time-bound by plumbing the right options through to RocksDB.

Looks like you didn't plumb iter-upper-bound into NewTimeBoundIterator.


pkg/storage/engine/rocksdb.go, line 1931 at r2 (raw file):

func (r *rocksDBIterator) setUpperBound(key roachpb.Key) {
	C.DBIterSetUpperBound(r.iter, goToCKey(MakeMVCCMetadataKey(key)))

Should you be checking here that len(key) != 0 similar to the check in rocksDBIterator.init?



@petermattis
Collaborator

Btw, should we be plumbing LowerBound for reverse scans?



@bdarnell
Contributor



pkg/storage/engine/engine.go, line 138 at r2 (raw file):

	// prefix will not work correctly (keys may be skipped).
	Prefix bool
	// If UpperBound is non-nil, the iterator will become invalid after seeking

The PR description says this field is required (and it looks like you've added it everywhere), but this comment says that nil makes it unbounded. One of those needs to be updated (I'd prefer to just disallow nil)



@benesch
Contributor Author

benesch commented Jun 26, 2018

Btw, should we be plumbing LowerBound for reverse scans?

Probably, but there's only one place, stateloader.LoadLastIndex, where we use reverse scans. (Is that really true? Huh.) It's in the local keyspace, so it's unlikely to have any problems with range deletion tombstones. Mind if I leave that to another PR? This one is getting really large.




c-deps/libroach/batch.cc, line 116 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Unrelated to this change, but I've recently realized that a WriteBatchWithIndex is really just another layer of memtable and all of the BaseDeltaIterator crap could be folded into MergingIterator and DBIter. Too large a refactor at this point.

Ack.


pkg/storage/engine/engine.go, line 138 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The PR description says this field is required (and it looks like you've added it everywhere), but this comment says that nil makes it unbounded. One of those needs to be updated (I'd prefer to just disallow nil)

Well, it's not required if Prefix is specified, in which case a nil UpperBound does mean unbounded. Suggestions for better wording?
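
To pin down the rule under discussion, a hedged sketch of the invariant as a validation check (the helper is hypothetical, not code from this PR):

```go
// checkIterOptions encodes the requirement: an iterator is either a prefix
// iterator, or it must declare an upper bound.
func checkIterOptions(opts IterOptions) error {
	if !opts.Prefix && len(opts.UpperBound) == 0 {
		return errors.New("non-prefix iterators must specify an upper bound")
	}
	return nil
}
```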


pkg/storage/engine/rocksdb.go, line 1915 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Looks like you didn't plumb iter-upper-bound into NewTimeBoundIterator.

Ok, done. Time bound iterators are now just normal iterators with the MinTimestampHint and MaxTimestampHint options set.
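
A sketch of the unified construction, using the option names mentioned here; the hlc.Timestamp field type and helper are assumptions:

```go
// newTimeBoundIter shows a time-bound iterator as just an ordinary iterator
// whose options carry timestamp hints (sketch, not code from this PR).
func newTimeBoundIter(eng engine.Reader, end roachpb.Key, minTS, maxTS hlc.Timestamp) engine.Iterator {
	return eng.NewIterator(engine.IterOptions{
		UpperBound:       end,
		MinTimestampHint: minTS,
		MaxTimestampHint: maxTS,
	})
}
```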


pkg/storage/engine/rocksdb.go, line 1931 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Should you be checking here that len(key) != 0 similar to the check in rocksDBIterator.init?

Yes, thanks. Done.
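
For concreteness, a sketch of the guarded setter, mirroring the len(key) != 0 check in rocksDBIterator.init; the early-return shape is an assumption:

```go
func (r *rocksDBIterator) setUpperBound(key roachpb.Key) {
	if len(key) == 0 {
		// Match rocksDBIterator.init: don't install a zero-length bound.
		return
	}
	C.DBIterSetUpperBound(r.iter, goToCKey(MakeMVCCMetadataKey(key)))
}
```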



@bdarnell
Contributor

Probably, but there's only one place, stateloader.LoadLastIndex, where we use reverse scans. (Is that really true? Huh.)

And batcheval/cmd_reverse_scan.go




pkg/storage/engine/engine.go, line 138 at r2 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Well, it's not required if Prefix is specified, in which case a nil UpperBound does mean unbounded. Suggestions for better wording?

I'd change this comment then to "UpperBound gives this iterator an upper bound. Attempts to seek past this point will invalidate the iterator. UpperBound must be provided unless Prefix is true, in which case the end of the prefix will be used as the upper bound."



@petermattis
Collaborator

Per Ben's comment, MVCCScan(..., reverse=true,...) (libroach/mvcc.cc) uses iter->Prev().




c-deps/libroach/iterator.h, line 38 at r3 (raw file):

  rocksdb::Slice upper_bound;
  std::string min_timestamp_hint;
  std::string max_timestamp_hint;

Why did you need to add {min,max}_timestamp_hint here? I'm not seeing where they are used.



@benesch
Contributor Author

benesch commented Jun 27, 2018

Ah! Gotcha. OK to leave to another PR?




c-deps/libroach/iterator.h, line 38 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Why did you need to add {min,max}_timestamp_hint here? I'm not seeing where they are used.

Oops. They were a vestige of my first try at this, which allowed updating the timestamp bounds on an already-created iterator. Turns out this fails badly: once an SST is filtered out by a table_filter, it is gone for good. You really need a fresh iterator when you change the timestamp bounds.


pkg/storage/engine/engine.go, line 138 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd change this comment then to "UpperBound gives this iterator an upper bound. Attempts to seek past this point will invalidate the iterator. UpperBound must be provided unless Prefix is true, in which case the end of the prefix will be used as the upper bound.

Done.



@petermattis
Collaborator

:lgtm:

Ah! Gotcha. OK to leave to another PR?

Yes





craig bot pushed a commit that referenced this pull request Jul 5, 2018
27061: storage: improve safety of merge lock r=tschottdorf,bdarnell a=benesch

Fix a known correctness bug in the merge locking scheme. See the comment
updates within for details.

As an added benefit, this fix frees store.RemoveReplica of conditionals
to deal with uninitialized replicas. Acquiring the merge lock now always
results in an initialized right-hand replica, so store.MergeRange is
always removing an initialized right-hand replica.

There are no tests for this because testing this particular edge case is
quite difficult and I'm primarily interested in avoiding the obviously
wrong call to batch.ClearRange(nil, nil) that occurs when calling
store.RemoveReplica on an uninitialized replica. This nonsensical
ClearRange is, somehow, not currently an error, but it is preventing
#26872 ("engine: require iterators to specify an upper bound") from
landing.

Release note: None

---

P.S. The `storage.RemoveOptions` struct now bundles precisely one parameter. Let me know if you'd prefer I change back the signature of `store.removeReplicaInternal` to take the bool directly. Bool parameters are frowned upon, though, so I wasn't sure what was preferred.

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>

benesch added 2 commits

In-memory unit tests typically run with a RocksDB cache size of between
1 and 10MB. Our cache portioning logic took this to mean that the entire
cache should be used for the write buffer (i.e., one memtable), and so
allocated zero bytes for the block cache. This was fine for most tests,
since an in-memory cache for an in-memory RocksDB instance is
unnecessary.

TestStoreMetrics, however, was living on the edge. It asserts that the
block cache has some minimum number of cache hits over the duration of
the test. A future commit that optimizes iteration caused the hit rate
to drop to zero, failing the test. (Aside: it is surprising that the hit
rate of a cache of capacity zero could ever be non-zero.)

Guarantee at least 1MB is allocated to the block cache so tests can
safely make assertions about the number of block cache hits. Also bump
the minimum number of cache hits in TestStoreMetrics back up to its
original value of ten; 8b890e6 decreased it to four in an effort to make
the test less flaky, but I'm now reliably seeing more than forty cache
hits.

Release note: None
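
A hedged sketch of the floor described above; the helper and the surrounding portioning logic are assumptions, not the actual code:

```go
const minBlockCacheSize = 1 << 20 // 1MB

// blockCacheSize guarantees the block cache never drops below 1MB, however
// the rest of the cache is portioned, so tests can meaningfully assert on
// block cache hit counts.
func blockCacheSize(totalCacheSize, writeBufferSize int64) int64 {
	if size := totalCacheSize - writeBufferSize; size > minBlockCacheSize {
		return size
	}
	return minBlockCacheSize
}
```
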
While investigating range deletion performance issues, we realized that
large swaths of contiguous tombstones can cause massive performance
issues when seeking. Seeks of more than 15 minutes (!) were observed on
one cluster [0]. The problem is that every key overlapping the range
deletion tombstone must be loaded from disk, decompressed, and compared
with the sequence number of the tombstone until a key is found that is
not covered by the tombstone. If contiguous keys representing hundreds
of gigabytes are covered by tombstones, RocksDB will need to scan
hundreds of gigabytes of data. Needless to say, performance suffers.

We have plans to improve seek performance in the face of range deletion
tombstones upstream, but we can mitigate the issue locally, too.
Iteration outside of tests is nearly always in the context of a range or
some bounded span of local keys. By plumbing knowledge of the upper
bound we care about for each scan to RocksDB, we can invalidate the
iterator early, once the upper bound has been exceeded, rather than
scanning over potentially hundreds of gigabytes of deleted keys just to
find a key that we don't care about.

To ensure we don't forget to specify an upper bound in a critical path,
this commit requires that all iterators are either prefix iterators or
declare their upper bound. This makes iterating in tests rather irksome,
but I think the tradeoff is worthwhile.

[0]: cockroachdb#24029 (comment)

Release note: None
@benesch
Contributor Author

benesch commented Jul 6, 2018

bors r=petermattis,bdarnell

craig bot pushed a commit that referenced this pull request Jul 6, 2018
26789: Makefile: don't globally install binaries during lint r=BramGruneir a=benesch

The build system takes pains to scope all binaries it installs to
REPO/bin to avoid polluting the user's GOPATH/bin with binaries that are
internal to the Cockroach build system.

The lint target was violating this rule by running 'go install
./pkg/...', which installed every package in the repository into
GOPATH/bin. This commit adjusts the rule to run 'go build ./pkg/...'
instead, which installs the necessary .a files into GOPATH/pkg without
installing any binaries into GOPATH/bin.

Fix #26633.

Release note: None

26872: engine: require iterators to specify an upper bound r=petermattis,bdarnell a=benesch


Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
@craig
Contributor

craig bot commented Jul 6, 2018

Build succeeded

@craig craig bot merged commit 0622016 into cockroachdb:master Jul 6, 2018
@benesch benesch deleted the iter-upper-bound branch July 10, 2018 14:30
benesch added a commit to benesch/cockroach that referenced this pull request Oct 31, 2018
In cockroachdb#26872 it was mandated that all iterators either be prefix iterators
or explicitly set an upper bound. This vastly improved performance with
range tombstones.

For completeness, we need to specify lower bounds on iterators that scan
in reverse. (This was left out of cockroachdb#26872 so that it could merge ASAP.)
This commit adds the necessary plumbing, and applies appropriate lower
bounds in the two code paths that perform reverse iteration.

Release note: None
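
A sketch of reverse iteration under the new plumbing, assuming a LowerBound option symmetric to UpperBound and a SeekReverse entry point; the helper is hypothetical:

```go
// reverseScan sketches the symmetric case: LowerBound lets RocksDB
// invalidate the iterator once Prev() crosses `start`, just as UpperBound
// bounds forward scans.
func reverseScan(eng engine.Reader, start, end roachpb.Key) error {
	iter := eng.NewIterator(engine.IterOptions{LowerBound: start, UpperBound: end})
	defer iter.Close()

	for iter.SeekReverse(engine.MakeMVCCMetadataKey(end)); ; iter.Prev() {
		if ok, err := iter.Valid(); err != nil {
			return err
		} else if !ok {
			break
		}
		// ... process keys in descending order ...
	}
	return nil
}
```
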
craig bot pushed a commit that referenced this pull request Nov 1, 2018
32072: storage/engine: allow specifying lower bounds r=petermattis a=benesch

**storage/engine: allow specifying lower bounds**

In #26872 it was mandated that all iterators either be prefix iterators
or explicitly set an upper bound. This vastly improved performance with
range tombstones.

For completeness, we need to specify lower bounds on iterators that scan
in reverse. (This was left out of #26872 so that it could merge ASAP.)
This commit adds the necessary plumbing, and applies appropriate lower
bounds in the two code paths that perform reverse iteration.

---

**storage: improve layering of mvcc scan/iterate functions**

mvccScanInternal previously took both a Reader and an Iterator. If the
Iterator was nil, it would use the Reader to create itself an Iterator;
otherwise, it would use Iterator and entirely ignore Reader.

This was impressively confusing. Just teach all of mvccScanInternal's
callers (there are only five) to create an iterator if they don't have
one already.

We could further reduce duplication by collapsing all scan variants
(MVCCScanToBytes, MVCCReverseScan) into one function, MVCCScan, that
takes a ScanOptions struct, but that cleanup is left to a future commit.
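
The caller-side pattern, sketched (the helper name is hypothetical):

```go
// scanViaReader shows the layering fix: a caller that lacks an iterator
// creates and closes its own, instead of passing (Reader, nil) and letting
// the internals conjure one.
func scanViaReader(reader engine.Reader, start, end roachpb.Key) {
	iter := reader.NewIterator(engine.IterOptions{UpperBound: end})
	defer iter.Close()
	// ... hand iter to the internal scan routine; the Reader is no longer
	// needed once the iterator exists ...
}
```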

---

@jordanlewis this should improve your life when you get back to #31909.

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>