Range deletion performance improvements + cleanup #4014
Conversation
Force-pushed 048949e to a4b556e.
The test failures here should be fixed by #4021.
The current implementation of range deletion tombstones in RocksDB suffers from a performance bug that causes excessive CPU usage on every read operation in a database with many range tombstones. Dropping a large table can easily result in several thousand range deletion tombstones in one store, rendering the cluster unusable, as documented in cockroachdb#24029. Backport a refactoring of range deletion tombstones that fixes the performance problem. This refactoring has also been proposed upstream as facebook/rocksdb#4014. A more minimal change was also proposed in facebook/rocksdb#3992, and that patch better highlights the exact nature of the bug than the patch backported here, for those looking to understand the problem. But this refactoring, though more invasive, gets us one step closer to solving a related problem where range deletions can cause excessively large compactions (cockroachdb#26693). These large compactions do not appear to brick the cluster but undoubtedly have some impact on performance. Fix cockroachdb#24029. Release note: None
Left comments on the first 4 commits. I still need to look at the last commit.
// specified to advance one-by-one through deletions until one is found with its
// interval containing the key. This will typically be faster than doing a full
// binary search (kBinarySearch).
enum class RangeDelPositioningMode {
Huh? My C++ is outdated. I had to Google what `enum class` did. Neat.
db/range_del_aggregator.cc (outdated):
// before start of deletion intervals
return false;
// An UncollapsedRangeDelMap is quick to create but slow to answer ShouldDelete
// queries.
Is it quicker to create than a `CollapsedRangeDelMap`? I don't think it is. My understanding is that an `UncollapsedRangeDelMap` is needed during sstable ingestion, but that understanding is fuzzy. Do you have more details to add here?
Performance is the whole point of having an uncollapsed mode. If it's not actually faster we should just get rid of it. It's faster to build if you've got lots of overlapping ranges, but perhaps that doesn't matter when there are no lurking n^2 bugs in the overlapped mode.
Adding an SST calls IsRangeOverlapped, which was only implemented for the uncollapsed mode because it just wasn't needed for the collapsed mode. There's no reason it couldn't be implemented for collapsed mode.
for (const auto& tombstone : rep_) {
  if (ucmp_->Compare(start, tombstone.end_key_) < 0 &&
      ucmp_->Compare(tombstone.start_key_, end) <= 0 &&
      ucmp_->Compare(tombstone.start_key_, tombstone.end_key_) < 0) {
When is `tombstone.start_key_ >= tombstone.end_key_`? I can see how a user might accidentally create such tombstones, but why are they ever persisted?
Great question. I copied this code verbatim, so I'm not entirely sure. Perhaps they were accidentally persisted in some early version of range deletions and therefore this case needs to be checked at read time for backwards compatibility?
db/range_del_aggregator.cc (outdated):
typedef std::multiset<RangeTombstone, TombstoneStartKeyComparator> Rep;

class Iterator : public RangeDelIterator {
Can `Iterator` and `Rep` be private?
`Iterator` is used by other classes, so doesn't it need to be public? `Rep` is public so that `Iterator` can access it, but we could also make `Iterator` a friend class.
`Iterator` is only used as `RangeDelIterator` by other classes. I think it should be fine to make `Iterator` private and make it a friend of the parent class (via `friend class Iterator`) so that it can access `Rep`. My old-school C++ style is to put these private declarations at the top of the parent class before the `public:` section (i.e. before line 26).
You're right. This works.
db/range_del_aggregator.cc (outdated):
bool IsRangeOverlapped(const Slice&, const Slice&) {
  // so far only implemented for non-collapsed mode since file ingestion (only
  // client) doesn't use collapsing
  assert(false);
This assert only fires in debug builds. Should this instead be something like `fprintf(stderr, ...); abort()`?
Done.
db/range_del_aggregator.cc (outdated):
std::unique_ptr<RangeDelMap> RangeDelAggregator::NewRangeDelMap() {
  RangeDelMap* tombstone_map;
  if (collapse_deletions_) {
    tombstone_map = new CollapsedRangeDelMap(icmp_.user_comparator());
Can't this be `return new CollapsedRangeDelMap(...)`? Or does the `std::unique_ptr<>` not allow for that?
Yeah, doesn't compile. Perhaps there's some C++ magic I don't know about, though.
Yeah, I see that `unique_ptr` doesn't allow for that. Carry on.
db/range_del_aggregator.cc (outdated):
return;
void AddTombstone(RangeTombstone t) {
  auto it = rep_.upper_bound(t.start_key_);
  auto prev_seq = [&]() {
I can't remember the rules for whether this heap allocates or not. It might be better to pass `it` as a parameter to this functor, as I'm pretty sure that won't allocate.
Lambdas don't heap allocate unless you `new` them, according to http://www.elbeno.com/blog/?p=1068 and https://stackoverflow.com/questions/12202656/c11-lambda-implementation-and-memory-model. Clang seems to have no trouble inlining this either, based on my read of the generated assembly.
Thanks for the pointer.
db/range_del_aggregator.cc (outdated):
if (rep_iter->second.seq_ < new_range_dels_iter->seq_) {
  untermed_seq = rep_iter->second.seq_;
  rep_iter->second.seq_ = new_range_dels_iter->seq_;
// Look all the existing transition points that overlap the new tombstone.
s/Look all/Look at all/g
Done.
db/range_del_aggregator.cc (outdated):
// (It wouldn't be incorrect to leave it in, but it unnecessarily
// increases the size of the map.)
it = rep_.erase(it);
it--;
You're decrementing `it` here, only to increment it on line 199. Perhaps just `continue` instead.
Good idea. Will do.
Done.
db/range_del_aggregator.cc (outdated):
// new transition point for the new tombstone's end point.
SequenceNumber end_seq = 0;

if (t.seq_ > prev_seq()) {
I feel you should add some diagrammatic comments here. For example:
// We're adding a new tombstone which is newer than an existing tombstone.
//
// 2: a-------m
// 1: a----------------z
//
// or
//
// 2: a----------------z
// 1: a-------m
//
// or
//
// 2: m-----t
// 1: a----------------z
//
// or
//
// 2: m--------z
// 1: a-----------t
//
// In all four cases, we add an entry with the tombstone's start key. Note that in the latter two cases the existing
// tombstone's start key will cover the range to the new start key. We need to remember `end_seq` in
// case the new tombstone does not completely cover the existing tombstones (the first and third cases).
Done. Let me know if you find them confusing.
db/range_del_aggregator.cc (outdated):
@@ -66,6 +67,7 @@ class UncollapsedRangeDelMap : public RangeDelMap {
Iterator(const Rep& rep) : rep_(rep), iter_(rep.begin()) {}
bool Valid() const override { return iter_ != rep_.end(); }
void Next() override { iter_++; }
void Seek(const Slice&) override { assert(false); }
Similar to other comment: perhaps `abort()` to ensure this isn't accidentally called in production code.
Done.
Force-pushed 81610ea to eccaf2c.
// searched for the last key less than K. K is covered iff the map entry has a
// larger seqno than K. As an example, consider the key h @ 4. It would be
// compared against the map entry g → 3 and determined to be uncovered. By
// contrast, the key h @ 2 would be determined to be covered.
Lovely comment.
LGTM
db/range_del_aggregator.cc (outdated):
//      ^       ^
// Do nothing.
}
it++;
`++it` (not sure if this makes a difference in C++ anymore)
// 3: I---M          OR    3: A-----------M
// 2: ----k                2:   c-------k
//      ^                          ^
// Do nothing.
All the new comments are great.
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Force-pushed eccaf2c to 68f3d32.
@benesch has updated the pull request. Re-import the pull request
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Wow, so many great things in this PR: moving `AddToBuilder` logic to compaction/flush, eliminating `ShouldAddTombstones`, redesigning the complicated collapsing logic and documenting it, and making perf much better (which I also verified with db_bench). Thanks so much for these improvements!
db/range_del_aggregator.cc (outdated):
class Iterator : public RangeDelIterator {
void SeekPastSentinels() {
  while (Valid() && iter_->second == 0) {
is it possible to have consecutive sentinels? My understanding was there should only be one sentinel when there's a gap between tombstones, and one sentinel at the end.
Ah, so it was possible to get an extra sentinel when inserting a tombstone whose start and end keys were equal, like [a, a). I've added an extra check at the beginning of the method to filter these out from the get-go. So now your proposal works; thanks! I've replaced the loop with a single `if` and renamed the method `MaybeSeekPastSentinel` accordingly.
db/range_del_aggregator.cc (outdated):
Iterator(const Rep& rep) : rep_(rep), iter_(rep.begin()) {}

bool Valid() const override {
  return iter_ != rep_.end() && std::next(iter_) != rep_.end();
Do you need the right side of the `&&` expression? If we remove it, any operation's `SeekPastSentinels` should position the iterator at a valid tombstone or at `rep_.end()`. Then the left side of the `&&` expression should be sufficient.
You're absolutely right. Done.
db/range_del_aggregator.cc (outdated):
}

bool RangeDelAggregator::IsRangeOverlapped(const Slice& start,
                                           const Slice& end) {
  // so far only implemented for non-collapsed mode since file ingestion (only
  // client) doesn't use collapsing
  assert(!collapse_deletions_);
I think this assertion might be worth keeping. The `abort()` you added can only be hit if this is called on an initialized aggregator. But most often this function will return early due to an uninitialized `rep_`, and this assertion can find logic errors even in that case.
Makes sense. Done.
db/compaction_job.cc (outdated):
}

if (bottommost_level_ && tombstone.seq_ <= earliest_snapshot) {
  range_del_out_stats->num_range_del_drop_obsolete++;
I think there was a TODO before about double counting since these are split across files. Is it still a problem after this PR? If so can we bring back the TODO?
Ah, yeah, that's not fixed yet. @petermattis and I are hoping to address it soon though! Added back the TODO.
The unit tests should pass if you rebase on #4115, I think.
This makes it easier to refer to a range positioning mode outside of the scope of a RangeDelAggregator without polluting the global scope with the individual enum constants.
The uncollapsed representation was previously only tested by the integration tests in db_range_del_test.
Force-pushed 68f3d32 to dccdd45.
@benesch has updated the pull request. Re-import the pull request
The range deletion aggregator supports "uncollapsed" and "collapsed" tombstone maps. Uncollapsed tombstone maps are naturally represented as a std::multiset, while collapsed tombstone maps are naturally represented as a std::map. The previous implementation used a std::multimap that was general enough to store either the uncollapsed or the collapsed representation. Confusingly, its keys and values had different meanings depending on which representation was being used.

Extract a TombstoneMap interface which has two concrete implementations, uncollapsed and collapsed. This allows the implementations to use their natural representation, and keeps the code for manipulating that representation within one class. In the process, simplify and comment the CollapsedTombstoneMap::AddTombstone method.

This refactor brings with it a large performance improvement. The details of the performance bug in the old implementation, and a more targeted fix, can be found in an earlier, abandoned PR, facebook#3992.

The refactor also exposed a bug in the ObsoleteTombstoneCleanup test, which was installing tombstones like [dr1, dr1) which cover no keys due to the exclusivity of the upper bound.
Implement a merging iterator over stripes in a RangeDelAggregator. This resolves a TODO about keeping table-modifying code out of the RangeDelAggregator. It also paves the way to splitting output files containing tombstones during compactions when they span too many keys in the level below the compaction output level.
Force-pushed dccdd45 to b678450.
@benesch has updated the pull request. Re-import the pull request
Thanks for taking the time to review, @ajkr! Glad you guys are interested in the changes.
I think I've addressed all your feedback, and I've rebased on the latest master. We'll know shortly if that's enough to get the tests to pass.
I've also run `make format` on every commit, as I let some style violations slip in. Apologies for any extra diff noise that might cause.
tiny change to try fixing travis
@benesch has updated the pull request. Re-import the pull request
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thanks for the fix @ajkr! The build failures look unrelated now.
@ajkr is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This fixes the same performance issue that facebook#3992 fixes but with much more invasive cleanup.

I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.

With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.

ajkr please take a look when you have a minute!

Pull Request resolved: facebook#4014
Differential Revision: D8773253
Pulled By: ajkr
fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
Summary: To measure the results of upcoming DeleteRange v2 work, this commit adds simple benchmarks for RangeDelAggregator. It measures the average time for AddTombstones and ShouldDelete calls.

Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results:

Before #4014:
```
=======================
Results:
=======================
AddTombstones: 1356.28 us
ShouldDelete: 0.401732 us
```

Latest master:
```
=======================
Results:
=======================
AddTombstones: 740.82 us
ShouldDelete: 0.383271 us
```

Pull Request resolved: #4363
Differential Revision: D9881676
Pulled By: abhimadan
fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442