Range deletion performance improvements + cleanup #4014

Closed · wants to merge 5 commits

Conversation

benesch (Contributor) commented Jun 18, 2018

This fixes the same performance issue that #3992 fixes but with much more invasive cleanup.

I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.
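Roughly, the scenario described above:

```
L4:  [a-----c)                 [x-----z)    <- only range tombstones
L5:  [a===========================z)        <- single wide output file
L6:  full of data; compacting L5 -> L6 must rewrite everything in [a, z)
```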

With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.

@ajkr please take a look when you have a minute!

benesch (Contributor Author) commented Jun 19, 2018

The test failures here should be fixed by #4021.

yiwu-arbug requested a review from ajkr June 21, 2018 02:10
benesch added a commit to benesch/cockroach that referenced this pull request Jun 21, 2018
The current implementation of range deletion tombstones in RocksDB
suffers from a performance bug that causes excessive CPU usage on every
read operation in a database with many range tombstones. Dropping a
large table can easily result in several thousand range deletion
tombstones in one store, resulting in an unusable cluster as documented
in cockroachdb#24029.

Backport a refactoring of range deletion tombstones that fixes the
performance problem. This refactoring has also been proposed upstream as
facebook/rocksdb#4014.

A more minimal change was also proposed in facebook/rocksdb#3992; for
those looking to understand the problem, that patch better highlights
the exact nature of the bug than the one backported here. But this
refactoring, though more invasive, gets us one step closer to solving a
related problem where range deletions can cause excessively large
compactions (cockroachdb#26693). These large compactions do not appear to brick the
cluster but undoubtedly have some impact on performance.

Fix cockroachdb#24029.

Release note: None
petermattis (Contributor) left a comment


Left comments on the first 4 commits. I still need to look at the last commit.

// specified to advance one-by-one through deletions until one is found with its
// interval containing the key. This will typically be faster than doing a full
// binary search (kBinarySearch).
enum class RangeDelPositioningMode {
Contributor

Huh? My C++ is outdated. I had to Google what enum class did. Neat.
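For anyone else whose C++ is rusty, a minimal standalone sketch of what `enum class` buys over a plain `enum`:

```cpp
// Scoped enums (enum class) keep their enumerators out of the enclosing
// scope and do not convert implicitly to int, unlike plain enums.
enum Plain { kOld };                           // kOld leaks into this scope
enum class Mode { kForwardTraversal, kBinarySearch };

int main() {
  Mode m = Mode::kBinarySearch;                // must qualify with Mode::
  int i = kOld;                                // plain enums convert silently
  // int j = m;                                // error: no implicit conversion
  return i + static_cast<int>(m);              // explicit cast required
}
```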

// before start of deletion intervals
return false;
// An UncollapsedRangeDelMap is quick to create but slow to answer ShouldDelete
// queries.
Contributor

Is it quicker to create than a CollapsedRangeDelMap? I don't think it is. My understanding is that an UncollapsedRangeDelMap is needed during sstable ingestion, but that understanding is fuzzy. Do you have more details to add here?

benesch (Contributor Author) commented Jun 27, 2018

Performance is the whole point of having an uncollapsed mode. If it's not actually faster we should just get rid of it. It's faster to build if you've got lots of overlapping ranges, but perhaps that doesn't matter when there are no lurking n^2 bugs in the overlapped mode.

Adding an SST calls IsRangeOverlapped, which was only implemented for the uncollapsed mode because it just wasn't needed for the collapsed mode. There's no reason it couldn't be implemented for collapsed mode.
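For reference, a hypothetical sketch of what a collapsed-mode IsRangeOverlapped could look like, assuming the transition-point representation described later in this thread (rep_ as a map from start key to seqno with 0 marking gaps, ucmp_ the user comparator; none of this is from the PR itself):

```cpp
// Hypothetical sketch only: [start, end] overlaps some tombstone iff the
// transition point at or before `start` carries a nonzero seqno, or any
// transition point within (start, end] does.
bool IsRangeOverlapped(const Slice& start, const Slice& end) {
  auto it = rep_.upper_bound(start);
  if (it != rep_.begin() && std::prev(it)->second > 0) {
    return true;  // `start` itself is covered
  }
  for (; it != rep_.end() && ucmp_->Compare(it->first, end) <= 0; ++it) {
    if (it->second > 0) {
      return true;  // a tombstone begins inside the queried range
    }
  }
  return false;
}
```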

for (const auto& tombstone : rep_) {
if (ucmp_->Compare(start, tombstone.end_key_) < 0 &&
ucmp_->Compare(tombstone.start_key_, end) <= 0 &&
ucmp_->Compare(tombstone.start_key_, tombstone.end_key_) < 0) {
Contributor

When is tombstone.start_key_ >= tombstone.end_key_? I can see how a user might accidentally create such tombstones, but why are they ever persisted?

Contributor Author

Great question. I copied this code verbatim, so I'm not entirely sure. Perhaps they were accidentally persisted in some early version of range deletions and therefore this case needs to be checked at read time for backwards compatibility?


typedef std::multiset<RangeTombstone, TombstoneStartKeyComparator> Rep;

class Iterator : public RangeDelIterator {
Contributor

Can Iterator and Rep be private?

Contributor Author

Iterator is used by other classes, so doesn't it need to be public? Rep is public so that Iterator can access it, but we could also make Iterator a friend class.

Contributor

Iterator is only used as RangeDelIterator by other classes. I think it should be fine to make Iterator private and make it a friend of the parent class (via friend class Iterator) so that it can access Rep. My old-school C++ style is to put these private declarations at the top of the parent class before the public (i.e. before line 26).

Contributor Author

You're right. This works.
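A standalone toy of the arrangement that was settled on, with all names invented for illustration:

```cpp
#include <set>

// Toy model of the pattern above: the representation and the concrete
// iterator are private, and callers only ever see the iterator through
// its public base class.
class IterBase {
 public:
  virtual ~IterBase() = default;
  virtual bool Valid() const = 0;
};

class TombstoneMap {
 private:
  using Rep = std::multiset<int>;  // stand-in for the tombstone multiset

  // Nested classes may access the enclosing class's private members in
  // C++11, so Iterator can name Rep; an explicit friend declaration, as
  // suggested above, also documents the coupling.
  class Iterator : public IterBase {
   public:
    explicit Iterator(const Rep& rep) : iter_(rep.begin()), end_(rep.end()) {}
    bool Valid() const override { return iter_ != end_; }
   private:
    Rep::const_iterator iter_;
    Rep::const_iterator end_;
  };

  Rep rep_;

 public:
  IterBase* NewIterator() const { return new Iterator(rep_); }
};
```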

bool IsRangeOverlapped(const Slice&, const Slice&) {
// so far only implemented for non-collapsed mode since file ingestion (only
// client) doesn't use collapsing
assert(false);
Contributor

This assert only fires in debug builds. Should this instead be something like fprintf(stderr, ...); abort()?

Contributor Author

Done.
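The shape of that change, sketched with a simplified signature:

```cpp
#include <cstdio>
#include <cstdlib>

// Release-safe failure, per the suggestion above: unlike assert(false),
// this still fires when NDEBUG compiles asserts away.
bool IsRangeOverlapped() {
  fprintf(stderr, "IsRangeOverlapped unimplemented for collapsed maps\n");
  abort();
}
```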

std::unique_ptr<RangeDelMap> RangeDelAggregator::NewRangeDelMap() {
RangeDelMap* tombstone_map;
if (collapse_deletions_) {
tombstone_map = new CollapsedRangeDelMap(icmp_.user_comparator());
Contributor

Can't this be return new CollapsedRangeDelMap(...)? Or does the std::unique_ptr<> not allow for that?

Contributor Author

Yeah, doesn't compile. Perhaps there's some C++ magic I don't know about, though.

Contributor

Yeah, I see that unique_ptr doesn't allow for that. Carry on.
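For posterity, the reason in miniature: std::unique_ptr's raw-pointer constructor is explicit, so a raw new expression won't convert implicitly; an explicit wrap (or C++14's std::make_unique) compiles fine. The class stubs below are stand-ins:

```cpp
#include <memory>

struct RangeDelMap { virtual ~RangeDelMap() = default; };
struct CollapsedRangeDelMap : RangeDelMap {};

std::unique_ptr<RangeDelMap> NewMap() {
  // return new CollapsedRangeDelMap();  // error: unique_ptr's
  //                                     // raw-pointer ctor is explicit
  return std::unique_ptr<RangeDelMap>(new CollapsedRangeDelMap());
  // C++14 alternative: return std::make_unique<CollapsedRangeDelMap>();
}
```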

return;
void AddTombstone(RangeTombstone t) {
auto it = rep_.upper_bound(t.start_key_);
auto prev_seq = [&]() {
Contributor

I can't remember the rules for whether this heap allocates or not. It might be better to pass it as a parameter to this functor as I'm pretty sure that won't allocate.

Contributor Author

Lambdas don't heap allocate unless you new them, according to http://www.elbeno.com/blog/?p=1068 and https://stackoverflow.com/questions/12202656/c11-lambda-implementation-and-memory-model. Clang seems to have no trouble inlining this either, based on my read of the generated assembly.

Contributor

Thanks for the pointer.
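A small standalone demonstration of the rule from those links:

```cpp
#include <functional>

int main() {
  int prev = 7;
  // The closure is an ordinary stack object; capturing by reference
  // stores only a reference inside it. No heap allocation happens here.
  auto prev_seq = [&]() { return prev; };
  int a = prev_seq();

  // Heap allocation can enter the picture only via type erasure, e.g.
  // storing a large closure in a std::function, or new'ing the lambda.
  std::function<int()> erased = prev_seq;
  return a + erased();
}
```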

if (rep_iter->second.seq_ < new_range_dels_iter->seq_) {
untermed_seq = rep_iter->second.seq_;
rep_iter->second.seq_ = new_range_dels_iter->seq_;
// Look all the existing transition points that overlap the new tombstone.
Contributor

s/Look all/Look at all/g

Contributor Author

Done.

// (It wouldn't be incorrect to leave it in, but it unnecessarily
// increases the size of the map.)
it = rep_.erase(it);
it--;
Contributor

You're decrementing it here, only to increment it on line 199. Perhaps just continue instead.

Contributor Author

Good idea. Will do.

Contributor Author

Done.
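The resulting loop shape, as a standalone toy using a plain std::map:

```cpp
#include <map>

// Toy version of the pattern adopted above: std::map::erase returns the
// iterator following the erased element, so `continue` replaces the
// decrement-then-reincrement dance.
void DropRedundantEntries(std::map<int, int>& rep) {
  auto it = rep.begin();
  while (it != rep.end()) {
    if (it->second == 0) {   // stand-in for "redundant transition point"
      it = rep.erase(it);    // already positioned on the next element
      continue;
    }
    ++it;
  }
}
```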

// new transition point for the new tombstone's end point.
SequenceNumber end_seq = 0;

if (t.seq_ > prev_seq()) {
Contributor

I feel you should add some diagrammatic comments here. For example:

// We're adding a new tombstone which is newer than an existing tombstone.
//
// 2: a-------m
// 1: a----------------z
//
// or
//
// 2: a----------------z
// 1: a-------m
//
// or
//
// 2:       m-----t
// 1: a----------------z
//
// or
//
// 2:         m--------z
// 1: a-----------t
//
// In all four cases, we add an entry with the tombstone's start key. Note that in the latter two cases the existing
// tombstone's start key will cover the range to the new start key. We need to remember `end_seq` in
// case the new tombstone does not completely cover the existing tombstones (the first and third cases).

Contributor Author

Done. Let me know if you find them confusing.

@@ -66,6 +67,7 @@ class UncollapsedRangeDelMap : public RangeDelMap {
Iterator(const Rep& rep) : rep_(rep), iter_(rep.begin()) {}
bool Valid() const override { return iter_ != rep_.end(); }
void Next() override { iter_++; }
void Seek(const Slice&) override { assert(false); }
Contributor

Similar to other comment: perhaps abort() to ensure this is not accidentally called in production code.

Contributor Author

Done.

benesch force-pushed the clean-range-tombstones branch 3 times, most recently from 81610ea to eccaf2c on June 28, 2018 15:11
// searched for the last key less than K. K is covered iff the map entry has a
// larger seqno than K. As an example, consider the key h @ 4. It would be
// compared against the map entry g → 3 and determined to be uncovered. By
// contrast, the key h @ 2 would be determined to be covered.
Contributor

Lovely comment.
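For readers skimming the thread, the comment's lookup as a standalone toy (map layout taken from the description above: start key to seqno, with 0 meaning uncovered from that key on):

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Toy version of the lookup described in the comment: find the last
// transition point at or before K; K @ seq is covered iff that entry's
// seqno exceeds seq.
bool Covered(const std::map<std::string, uint64_t>& rep,
             const std::string& key, uint64_t seq) {
  auto it = rep.upper_bound(key);
  if (it == rep.begin()) return false;  // key precedes all tombstones
  return std::prev(it)->second > seq;
}

// With rep = {{"c", 1}, {"g", 3}, {"j", 0}}:
//   Covered(rep, "h", 4) -> false  (entry g -> 3; 3 > 4 is false)
//   Covered(rep, "h", 2) -> true   (entry g -> 3; 3 > 2)
```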


petermattis (Contributor) left a comment


LGTM

// ^ ^
// Do nothing.
}
it++;
Contributor

++it (not sure if this makes a difference in C++ anymore)

// 3: I---M OR 3: A-----------M
// 2: ----k 2: c-------k
// ^ ^
// Do nothing.
Contributor

All the new comments are great.

facebook-github-bot left a comment


@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot

@benesch has updated the pull request.

facebook-github-bot left a comment


@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ajkr (Contributor) left a comment


Wow so many great things in this PR: moving AddToBuilder logic to compaction/flush, eliminating ShouldAddTombstones, redesigning the complicated collapsing logic and documenting it, and making perf much better (which I also verified with db_bench). Thanks so much for these improvements!


class Iterator : public RangeDelIterator {
void SeekPastSentinels() {
while (Valid() && iter_->second == 0) {
Contributor

is it possible to have consecutive sentinels? My understanding was there should only be one sentinel when there's a gap between tombstones, and one sentinel at the end.

Contributor Author

Ah, so it was possible to get an extra sentinel when inserting a tombstone whose start and end keys were equal, like [a, a). I've added an extra check at the beginning of the method to filter these out from the get-go. So now your proposal works—thanks! I've replaced the loop with a single if and renamed the method MaybeSeekPastSentinel accordingly.
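Presumably the reworked helper looks something like this (a sketch; Valid() and iter_ are as in the surrounding class shown above):

```cpp
// Sketch of the renamed helper: once degenerate [x, x) tombstones are
// filtered out at insertion time, sentinels (seqno == 0) cannot be
// consecutive, so a single check suffices.
void MaybeSeekPastSentinel() {
  if (Valid() && iter_->second == 0) {
    iter_++;
  }
}
```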

Iterator(const Rep& rep) : rep_(rep), iter_(rep.begin()) {}

bool Valid() const override {
return iter_ != rep_.end() && std::next(iter_) != rep_.end();
Contributor

Do you need the right side of the && expression? If we remove it, any operation's SeekPastSentinels should position the iterator at a valid tombstone or at rep_.end(). Then the left side of the && expression should be sufficient.

Contributor Author

You're absolutely right. Done.

}

bool RangeDelAggregator::IsRangeOverlapped(const Slice& start,
const Slice& end) {
// so far only implemented for non-collapsed mode since file ingestion (only
// client) doesn't use collapsing
assert(!collapse_deletions_);
Contributor

I think this assertion might be worth keeping. The abort() you added can only be hit if this is called on an initialized aggregator. But most often this function will return early due to uninitialized rep_, and this assertion can find logic errors even in that case.

Contributor Author

Makes sense. Done.

}

if (bottommost_level_ && tombstone.seq_ <= earliest_snapshot) {
range_del_out_stats->num_range_del_drop_obsolete++;
Contributor

I think there was a TODO before about double counting since these are split across files. Is it still a problem after this PR? If so can we bring back the TODO?

Contributor Author

Ah, yeah, that's not fixed yet. @petermattis and I are hoping to address it soon though! Added back the TODO.

ajkr (Contributor) commented Jul 11, 2018

The unit tests should pass if you rebase on #4115, I think.

This makes it easier to refer to a range positioning mode outside of the
scope of a RangeDelAggregator without polluting the global scope with
the individual enum constants.
The uncollapsed representation was previously only tested by the
integration tests in db_range_del_test.
facebook-github-bot

@benesch has updated the pull request.

The range deletion aggregator supports "uncollapsed" and "collapsed"
tombstone maps. Uncollapsed tombstone maps are naturally represented as
a std::multiset, while collapsed tombstone maps are naturally
represented as a std::map.

The previous implementation used a std::multimap that was general enough
to store either the uncollapsed or the collapsed representation.
Confusingly, its keys and values had different meanings depending on
which representation was being used.

Extract a TombstoneMap interface which has two concrete implementations,
uncollapsed and collapsed. This allows the implementations to use their
natural representation, and keeps the code for manipulating that
representation within one class.

In the process, simplify and comment the CollapsedTombstoneMap::
AddTombstone method. This refactor brings with it a large performance
improvement. The details of the performance bug in the old
implementation, and a more targeted fix, can be found in an earlier,
abandoned PR, facebook#3992.

The refactor also exposed a bug in the ObsoleteTombstoneCleanup test, which was
installing tombstones like [dr1, dr1) which cover no keys due to the
exclusivity of the upper bound.
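Pieced together from the code excerpts in this review, the extracted interface has roughly this shape; the method signatures here are assumed, not copied from the PR:

```cpp
// Assumed sketch, reconstructed from the excerpts above: one interface,
// two representations. Signatures are illustrative only.
class RangeDelMap {
 public:
  virtual ~RangeDelMap() = default;
  virtual bool ShouldDelete(const ParsedInternalKey& parsed,
                            RangeDelPositioningMode mode) = 0;
  virtual bool IsRangeOverlapped(const Slice& start, const Slice& end) = 0;
  virtual void AddTombstone(RangeTombstone tombstone) = 0;
};
// UncollapsedRangeDelMap backs this with a std::multiset of tombstones;
// CollapsedRangeDelMap with a std::map of transition points.
```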
Implement a merging iterator over stripes in a RangeDelAggregator. This
resolves a TODO about keeping table-modifying code out of the
RangeDelAggregator. It also paves the way to splitting output files
containing tombstones during compactions when they span too many keys in
the level below the compaction output level.
facebook-github-bot

@benesch has updated the pull request.

benesch (Contributor Author) left a comment

Thanks for taking the time to review, @ajkr! Glad you guys are interested in the changes.

I think I've addressed all your feedback, and I've rebased on the latest master. We'll know shortly if that's enough to get the tests to pass.

I've also run make format on every commit, as I let some style violations slip in. Apologies for any extra diff noise that might cause.



tiny change to try fixing travis
facebook-github-bot

@benesch has updated the pull request.

facebook-github-bot left a comment


@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

benesch (Contributor Author) commented Jul 12, 2018

Thanks for the fix @ajkr! The build failures look unrelated now.

facebook-github-bot left a comment


@ajkr is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

petermattis pushed a commit to petermattis/rocksdb that referenced this pull request Jul 13, 2018
Summary:
This fixes the same performance issue that facebook#3992 fixes but with much more invasive cleanup.

I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.

With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.

ajkr please take a look when you have a minute!
Pull Request resolved: facebook#4014

Differential Revision: D8773253

Pulled By: ajkr

fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
rcane pushed a commit to rcane/rocksdb that referenced this pull request Sep 13, 2018
facebook-github-bot pushed a commit that referenced this pull request Sep 17, 2018
Summary:
To measure the results of upcoming DeleteRange v2 work, this commit adds
simple benchmarks for RangeDelAggregator. It measures the average time
for AddTombstones and ShouldDelete calls.

Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results:

Before #4014:
```
=======================
Results:
=======================
AddTombstones:          1356.28 us
ShouldDelete:           0.401732 us
```

Latest master:
```
=======================
Results:
=======================
AddTombstones:          740.82 us
ShouldDelete:           0.383271 us
```
Pull Request resolved: #4363

Differential Revision: D9881676

Pulled By: abhimadan

fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442