Introduce RangeDelAggregatorV2 #4649
Conversation
Force-pushed from 8d22ab5 to ac51bbe
Here are the results from running the microbenchmarks with the v1 and v2 RangeDelAggregator:
Force-pushed from fa2b2c3 to 0d60f46
Force-pushed from 0d60f46 to 03b9cdc
Force-pushed from 03b9cdc to 4124d17
lgtm! sorry for the long delay.
stats.time_add_tombstones += stop_watch_add_tombstones.ElapsedNanos();
fragmented_range_tombstone_lists.emplace_back(
    new rocksdb::FragmentedRangeTombstoneList(
        rocksdb::MakeRangeDelIterator(persistent_range_tombstones), icmp,
I wonder if we should also have a mode for persistent_range_tombstones created ordered by begin key. I tried out the benchmark just now and the profile shows most of the time is spent in sorting. Hopefully this won't be necessary in the common case, when either (1) the meta-blocks were written with a recent rocksdb version, or (2) an old rocksdb version was used but there weren't old snapshots. (Not suggesting you do it in this PR, btw.)
I was thinking of only having a mode where tombstones are created in order, since having a sorted list should be the common case (i.e., the sorting should only happen once). But yeah, this PR is already large enough, so I'll do this in a follow-up PR. Thanks for noticing this!
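As a rough illustration of what such a follow-up mode could look like (the already_sorted flag, RangeTombstone struct, and PrepareForFragmenting helper below are hypothetical stand-ins, not the actual FragmentedRangeTombstoneList API):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a single range tombstone.
struct RangeTombstone {
  std::string start_key;
  std::string end_key;
  uint64_t seqnum;
};

// Skip the sort when the caller promises tombstones already arrive ordered by
// start key (e.g., the meta-block was written by a recent RocksDB version).
std::vector<RangeTombstone> PrepareForFragmenting(
    std::vector<RangeTombstone> tombstones, bool already_sorted) {
  if (!already_sorted) {
    // Old meta-blocks, or files written while old snapshots existed, may hand
    // us tombstones in arbitrary order, so fall back to sorting by start key.
    std::sort(tombstones.begin(), tombstones.end(),
              [](const RangeTombstone& a, const RangeTombstone& b) {
                return a.start_key < b.start_key;
              });
  }
  // Fragmentation into non-overlapping pieces would follow here; with a sorted
  // mode, the sort only ever happens once per tombstone list.
  return tombstones;
}
```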
{"n", UncutEndpoint(""), UncutEndpoint(""), 0, true /* invalid */}, | ||
{"", InternalValue("d", 7), UncutEndpoint("e"), 10}}); | ||
|
||
VerifySeekForPrev( |
these tests are nicely structured - thanks
db/range_del_aggregator_v2.cc
Outdated
// We do not need to adjust largest to properly truncate range
// tombstones that extend past the boundary.
} else if (parsed_largest.sequence == 0) {
  // No range tombstone from this sstable can cover largest (or a range
Does this rely on the invariant that the same user key cannot appear with seqnum zero both as the largest key in the current file and as the smallest key in the next file?
Yes. I'll make this more clear in the comment.
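To make the invariant concrete, here is a simplified sketch of the branch being discussed; the types, constants, and helper below are illustrative stand-ins rather than the exact code in db/range_del_aggregator_v2.cc:

```cpp
#include <cstdint>

// Simplified stand-ins for RocksDB's internal key machinery (illustration only).
enum ValueType : uint8_t { kTypeValue = 0x1, kTypeRangeDeletion = 0xF };
constexpr uint64_t kMaxSequenceNumber = (1ull << 56) - 1;

struct ParsedInternalKey {
  const char* user_key;
  uint64_t sequence;
  ValueType type;
};

// Returns true if the file's largest key can be used unmodified as the
// truncation bound for this file's range tombstones.
bool LargestIsSafeTruncationBound(const ParsedInternalKey& parsed_largest) {
  if (parsed_largest.type == kTypeRangeDeletion &&
      parsed_largest.sequence == kMaxSequenceNumber) {
    // The boundary was artificially extended by a range tombstone, so the
    // tombstone's end key already lines up with the file boundary.
    return true;
  }
  if (parsed_largest.sequence == 0) {
    // No range tombstone from this sstable can cover largest. This leans on
    // the invariant discussed above: the same user key cannot appear with
    // seqnum zero both as this file's largest key and as the next file's
    // smallest key.
    return true;
  }
  // Otherwise largest may need to be adjusted so truncated tombstones still
  // cover keys in this file without leaking into the next one (details
  // omitted in this sketch).
  return false;
}
```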
Hm, I tried the correctness test as a sanity check and it failed. Not sure why yet.

Actually it happens before this PR too when

According to git bisect, the culprit is 7528130.
Force-pushed from 4124d17 to 9bf7695
I addressed the comments. After rebasing, crash tests look good. I'll land this now.
@abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@abhimadan has updated the pull request.
  return;
}
iters_.emplace_back(new TruncatedRangeDelIterator(std::move(input_iter),
                                                  icmp_, smallest, largest));
It looks like iters_ will contain the range tombstones from every sstable ever seen during iteration. That could become increasingly expensive in a large scan. I think you only need to keep 1 iterator per level greater than L0, 1 for each sstable in L0, and 1 for each memtable. I was imagining this would be done via some sort of LevelTruncatedRangeDelIterator.
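A rough sketch of what such an interface might look like; no class like this exists in the PR, so the names and methods below are purely hypothetical:

```cpp
#include <cstdint>
#include <string>

// Purely hypothetical interface for the per-level tombstone iterator suggested
// above: one instance per level above L0 (plus one per L0 file and per
// memtable), holding only the tombstones of the file it is positioned on.
class LevelTruncatedRangeDelIterator {
 public:
  virtual ~LevelTruncatedRangeDelIterator() = default;

  // Reposition onto the file in this level whose range contains user_key,
  // dropping the previous file's tombstones instead of accumulating them.
  virtual void SeekToFileContaining(const std::string& user_key) = 0;

  // True if the current file has a tombstone covering (user_key, seqnum).
  virtual bool ShouldDelete(const std::string& user_key, uint64_t seqnum) = 0;
};
```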
Yes, we had some discussion around this. Right now there is some difficulty because a file's range tombstones are added to the aggregator simply when a regular (point-key) LevelIterator opens that file. However, we cannot analogously drop tombstones from the aggregator when the LevelIterator advances to the next file, because of cases like the following:
- L1 has file1 containing key "a" and range tombstone ["b", "c")
- L1 has file2 containing key "d"
- L2 has file3 containing key "b"
- User calls Seek("b").
- The L1 LevelIterator opens file1 and adds its range tombstones to the aggregator.
- The L1 LevelIterator doesn't find any key greater than or equal to the lookup key, so it proceeds to close file1 and open file2. When file2 is opened, its range tombstones are also added to the aggregator.
- The L2 LevelIterator opens file3 and finds key "b". We call ShouldDelete on it, and the range tombstone from file1 must still be in the aggregator.

I believe @abhimadan has a proposal involving active and inactive lists, and tracking file boundaries, to overcome this obstacle without having to restructure how we add tombstones to the aggregator. Will let him write it :).
Thanks for the intro @ajkr :)
In this PR, we don't keep track of iterator positions, so we can't make any optimizations. However, in #4677, we keep track of the following data structures (in the forward case; the reverse case is similar so I won't describe it here):
- an active iterator min-heap ordered by end key, which contains iterators which currently point to tombstones covering the ShouldDelete key
- an inactive iterator min-heap ordered by start key, which contains iterators which currently point to tombstones starting after the ShouldDelete key
- a consumed list, which contains iterators whose tombstones all end before the ShouldDelete key
This way, we only need to seek in the inactive and active iterators, while still allowing data iterators to add range tombstones to the aggregator as necessary. This improves performance for longer range scans, but in the current implementation, we don't free consumed iterators. I think that will require some more thought, though I have a few ideas on how to approach it (e.g., using file boundaries to determine whether an iterator can be cleaned up or not).
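To make that concrete, here is a rough sketch of the bookkeeping described above (forward case only); the struct and field names are assumptions for illustration and may not match the actual #4677 code:

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Stand-in for a per-file tombstone iterator positioned on one fragment.
struct TombstoneIter {
  std::string current_start_key;
  std::string current_end_key;
  uint64_t current_seqnum;
};

// Min-heap by end key: the top of the active heap is the covering tombstone
// that stops covering keys first.
struct ByEndKey {
  bool operator()(const TombstoneIter* a, const TombstoneIter* b) const {
    return a->current_end_key > b->current_end_key;
  }
};

// Min-heap by start key: the top of the inactive heap is the tombstone that
// will become active soonest as the ShouldDelete key advances.
struct ByStartKey {
  bool operator()(const TombstoneIter* a, const TombstoneIter* b) const {
    return a->current_start_key > b->current_start_key;
  }
};

// Forward-iteration state. Iterators are owned elsewhere (e.g., by the
// aggregator); these containers only track which phase each one is in.
struct ForwardTombstoneState {
  // Iterators whose current tombstone covers the ShouldDelete key.
  std::priority_queue<TombstoneIter*, std::vector<TombstoneIter*>, ByEndKey>
      active;
  // Iterators whose current tombstone starts after the ShouldDelete key.
  std::priority_queue<TombstoneIter*, std::vector<TombstoneIter*>, ByStartKey>
      inactive;
  // Iterators whose tombstones all end before the ShouldDelete key; not yet
  // freed in the current implementation.
  std::vector<TombstoneIter*> consumed;
};
```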
Right now there is some difficulty because a file's range tombstones are added to the aggregator simply when a regular (point-key) LevelIterator opens that file.
It's interesting that handling of the range tombstones would in some ways be easier if the range tombstones were interleaved with point operations. If that were true, we wouldn't move past an sstable until the range tombstone was iterated past by other levels. Perhaps there is a way to fake that here. I'm imagining having LevelIterator return a "sentinel" key when an sstable boundary is encountered. These sentinel keys would be skipped by MergingIterator, but would prevent a LevelIterator from advancing to the next table until the sentinel key is the smallest key in the heap. For the upper boundary, the sentinel key is only necessary if a range tombstone is the largest key in the table, and we know that is true if the upper boundary is a range tombstone sentinel key (kMaxSequenceNumber + kTypeRangeDeletion). Similarly, for the lower bound we only need a sentinel key if a range tombstone is the smallest key in the table. This is true if the smallest boundary has a key with the type kTypeRangeDeletion. Rather than returning that key directly, we could translate it into a normal point deletion, which the MergingIterator and DBIter would merrily process. That's fine because the lower bound of a range deletion is inclusive, so we know that key is already deleted at that sequence number.

There might be problems with this direction, but it seems a lot simpler than tracking active and inactive iterators in RangeDelAggregator.
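A small sketch of the two boundary checks described above, using simplified stand-ins for the internal key types and constants (treat the concrete values as assumptions, not RocksDB's actual definitions):

```cpp
#include <cstdint>

// Simplified stand-ins for illustration only.
enum ValueType : uint8_t { kTypeValue = 0x1, kTypeRangeDeletion = 0xF };
constexpr uint64_t kMaxSequenceNumber = (1ull << 56) - 1;

struct ParsedInternalKey {
  const char* user_key;
  uint64_t sequence;
  ValueType type;
};

// The upper boundary only needs a sentinel when it was extended by a range
// tombstone, which shows up as (kMaxSequenceNumber, kTypeRangeDeletion).
bool UpperBoundaryNeedsSentinel(const ParsedInternalKey& largest) {
  return largest.type == kTypeRangeDeletion &&
         largest.sequence == kMaxSequenceNumber;
}

// The lower boundary only needs a sentinel when a range tombstone is the
// smallest key in the table; since a tombstone's start key is inclusive, that
// key could instead be surfaced as an ordinary point deletion at that seqnum.
bool LowerBoundaryNeedsSentinel(const ParsedInternalKey& smallest) {
  return smallest.type == kTypeRangeDeletion;
}
```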
This is an interesting idea. If I understand correctly, in your idea, we only keep track of range tombstones that are currently pointed at by an iterator in MergingIterator, and use a sentinel key to prevent us from preemptively forgetting about truncated tombstones? Although that would solve the tombstone lifetime problem, I think it would introduce some other problems. For one thing, we would no longer be able to use bloom filters effectively, since we can't skip files that potentially contain range tombstones that delete keys at a lower level (this could be avoided by keeping the meta-block around and using it in these cases, but it would increase read-amp, which reduces the effectiveness of inlining tombstones). Also, the number of range tombstone fragments would depend on the number of keys above the tombstone in the same file (since the ones below would have been dropped during compaction). This makes write-amp somewhat unintuitive, and can significantly increase compaction output size in pathological cases (though in practice this might not be an issue; I'm not sure).

Also, I'm not sure how much this simplifies things (though I've thought a lot about the other design, so this is probably a bit biased). Although we won't need to distinguish between active and inactive tombstone iterators anymore, we would still need to use a RangeDelAggregator-like approach, since MergingIterator would eagerly move past an inline range tombstone as soon as Next() or Prev() is called and it's on top of the heap, even if the tombstone covers keys in lower levels. This means that keeping track of range tombstone lifetimes is still somewhat disconnected from MergingIterator advancement, so we don't gain much from inlining them (though please correct me if I've misunderstood your proposal). Memory management is also not too difficult in the active/inactive tombstone iterator design, since we can delete iterators once we move past their file boundaries (I haven't tried this yet, but since we truncate tombstones at those boundaries, this shouldn't cause correctness issues), so there isn't a significant advantage there either.
I think there are a couple of things we're conflating:
(1) having BlockBasedTableIterator return a sentinel key for the beginning/end of a file when it's extended by range tombstones, so it'll be closed at the same time the tombstones become irrelevant; and
(2) having all range tombstones inlined in the data blocks.
It looks like we're mostly talking about the downsides of (2), but what about (1)?
having BlockBasedTableIterator return a sentinel key for the beginning/end of a file when it's extended by range tombstones so it'll be closed at the same time the tombstones become irrelevant
I was thinking this would be done by LevelIter since it has a handle on the FileMetadata, though I suppose we can plumb that into BlockBasedTableIterator.

Yes. You can think of these sentinel keys as preventing iteration past an sstable when a range tombstone is the cause of the sstable boundary.

Did I confuse you by mentioning interleaving range tombstones with point operations? I think we should definitely stick with the segregated tombstone approach. My suggestion is purely about the sentinel keys being able to simplify the tombstone lifetime problem for RangeDelAggregatorV2. I'm not seeing how that suggestion affects bloom filters or anything else.
Oh, sorry about that, I did get confused by the interleaving comment, so most of my earlier comments are moot. OK, I'm going to try responding again, and hopefully I've understood everything this time. So to summarize, the goal here is that we want to simplify tombstone lifetime logic in

Sorry again about the back-and-forth on figuring that out. I agree that it simplifies lifetime logic in
Summary: The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
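As a rough illustration of the shape this summary describes (not the actual RangeDelAggregatorV2 code; the class and method names below are simplified stand-ins):

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-in for a parsed internal key.
struct ParsedKey {
  std::string user_key;
  uint64_t seqnum;
};

// Stand-in for one fragmented tombstone iterator handed to AddTombstones.
class TombstoneFragmentIter {
 public:
  virtual ~TombstoneFragmentIter() = default;
  // Position on the first fragment whose end key is after user_key.
  virtual void Seek(const std::string& user_key) = 0;
  virtual bool Valid() const = 0;
  // True if the current fragment spans key.user_key with a newer seqnum.
  virtual bool Covers(const ParsedKey& key) const = 0;
};

class RangeDelAggregatorSketch {
 public:
  // Minimal work up front: just remember the iterator.
  void AddTombstones(std::unique_ptr<TombstoneFragmentIter> iter) {
    iters_.push_back(std::move(iter));
  }

  // Seek each remembered iterator to look for a covering fragment.
  bool ShouldDelete(const ParsedKey& key) {
    for (auto& iter : iters_) {
      iter->Seek(key.user_key);
      if (iter->Valid() && iter->Covers(key)) {
        return true;
      }
    }
    return false;
  }

 private:
  std::vector<std::unique_ptr<TombstoneFragmentIter>> iters_;
};
```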