
Introduce RangeDelAggregatorV2 #4649

Closed

Conversation

abhimadan
Contributor

@abhimadan abhimadan commented Nov 7, 2018

Summary: The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
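For illustration, here is a minimal, self-contained C++ sketch of that design. The names TombstoneFragment, FragmentedTombstoneList, and RangeDelAggregatorV2Sketch are hypothetical stand-ins rather than the actual RocksDB classes; the point is only that AddTombstones does O(1) work per source while ShouldDelete searches each source's already-fragmented tombstones on demand:

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one source's fragmented range tombstones
// (e.g., one sstable or memtable); not the actual RocksDB types.
struct TombstoneFragment {
  std::string start_key;  // inclusive
  std::string end_key;    // exclusive
  uint64_t seq;
};

struct FragmentedTombstoneList {
  std::vector<TombstoneFragment> fragments;  // sorted by start key, non-overlapping
};

class RangeDelAggregatorV2Sketch {
 public:
  // Minimal work per source: just remember the list; no cross-source collapsing.
  void AddTombstones(const FragmentedTombstoneList* list) { lists_.push_back(list); }

  // Search every source for a fragment that covers user_key with a newer seqno.
  bool ShouldDelete(const std::string& user_key, uint64_t seq) const {
    for (const FragmentedTombstoneList* list : lists_) {
      const auto& frags = list->fragments;
      // Find the first fragment starting after user_key; the covering
      // candidate, if any, is the fragment immediately before it.
      auto it = std::upper_bound(
          frags.begin(), frags.end(), user_key,
          [](const std::string& k, const TombstoneFragment& f) { return k < f.start_key; });
      if (it != frags.begin()) {
        const TombstoneFragment& f = *(it - 1);
        if (user_key < f.end_key && f.seq > seq) {
          return true;
        }
      }
    }
    return false;
  }

 private:
  std::vector<const FragmentedTombstoneList*> lists_;
};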

@abhimadan abhimadan added the WIP Work in progress label Nov 7, 2018
@abhimadan abhimadan requested a review from ajkr November 7, 2018 19:45
@abhimadan abhimadan force-pushed the range-del-agg-merging-iter branch 3 times, most recently from 8d22ab5 to ac51bbe on November 7, 2018 22:08
@abhimadan
Contributor Author

Here are the results from running the microbenchmarks with the v1 and v2 RangeDelAggregator:

[abhishekmadan@devvm907.atn1 ~/rocksdb] ./range_del_aggregator_bench --should_deletes_per_run=100 --add_tombstones_per_run=5
=========================
Results:
=========================
AddTombstones:           307.871 us
ShouldDelete (first):    0.607064 us
ShouldDelete (rest):     0.0545043 us
[abhishekmadan@devvm907.atn1 ~/rocksdb] ./range_del_aggregator_bench --should_deletes_per_run=100 --add_tombstones_per_run=5 --use_v2_aggregator=true
=========================
Results:
=========================
AddTombstones:           0.322408 us
ShouldDelete (first):    2.16054 us
ShouldDelete (rest):     0.268877 us

So AddTombstones is now much faster, though ShouldDelete is slower, which implies that RangeDelAggregatorV2 will be less efficient than V1 for sufficiently long range scans. Subsequent ShouldDelete calls during a scan can be sped up here by avoiding a re-seek every call, though this work will be done in a future PR.
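One possible shape of that optimization, again using the hypothetical stand-ins from the sketch above and assuming keys are visited in non-decreasing order during a forward scan, is a cached cursor per source that advances linearly instead of re-seeking on every call:

// Hypothetical per-source cursor: remembers where the previous lookup ended so
// a forward scan can advance instead of binary-searching each time.
struct ForwardScanCursor {
  const FragmentedTombstoneList* list = nullptr;
  size_t pos = 0;

  bool ShouldDelete(const std::string& user_key, uint64_t seq) {
    const auto& frags = list->fragments;
    // Fragments ending at or before the key cannot cover it, and (for a
    // forward scan) cannot cover any later key either, so skip them for good.
    while (pos < frags.size() && frags[pos].end_key <= user_key) {
      ++pos;
    }
    return pos < frags.size() && frags[pos].start_key <= user_key && frags[pos].seq > seq;
  }
};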

Contributor

@ajkr ajkr left a comment

lgtm! sorry for the long delay.

stats.time_add_tombstones += stop_watch_add_tombstones.ElapsedNanos();
fragmented_range_tombstone_lists.emplace_back(
    new rocksdb::FragmentedRangeTombstoneList(
        rocksdb::MakeRangeDelIterator(persistent_range_tombstones), icmp,
Contributor

I wonder if we should also have a mode where persistent_range_tombstones is created already ordered by begin key. I tried out the benchmark just now and the profile shows most of the time is spent in sorting. Hopefully this won't be necessary in the common case when either (1) meta-blocks were written with a recent rocksdb version, or (2) an old rocksdb version was used but there weren't old snapshots. (not suggesting you do it in this PR, btw.)

Contributor Author

I was thinking of only having a mode where tombstones are created in order, since having a sorted list should be the common case (i.e., the sorting should only happen once). But yeah, this PR is already large enough, so I'll do this in a follow-up PR. Thanks for noticing this!

{"n", UncutEndpoint(""), UncutEndpoint(""), 0, true /* invalid */},
{"", InternalValue("d", 7), UncutEndpoint("e"), 10}});

VerifySeekForPrev(
Contributor

these tests are nicely structured - thanks

// We do not need to adjust largest to properly truncate range
// tombstones that extend past the boundary.
} else if (parsed_largest.sequence == 0) {
// No range tombstone from this sstable can cover largest (or a range
Contributor

Does this rely on the invariant that the same user key cannot appear with seqnum zero both as the largest key in the current file and as the smallest key in the next file?

Contributor Author

Yes. I'll make this more clear in the comment.

@ajkr
Contributor

ajkr commented Nov 19, 2018

Hm, I tried the correctness test as a sanity check and it failed. Not sure why yet.

$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple --delrangepercent=1 --delpercent=4 --write_buffer_size=1048576 --max_bytes_for_level_base=4194304 --target_file_size_base=1048576 --value_size_mult=33 --max_background_compactions=12 --max_key=10000000 --interval=30
...
Verification failed for column family 0 key 465: Value not found: NotFound:
Crash-recovery verification failed :(
2018/11/19-15:35:28  Starting database operations
2018/11/19-15:35:28  Starting verification
Verification failed :(

@ajkr
Contributor

ajkr commented Nov 19, 2018

Actually it happens before this PR too when db_stress runs with merge operator and DeleteRange, hm..

@ajkr
Contributor

ajkr commented Nov 20, 2018

According to git bisect the culprit is 7528130.

@abhimadan
Contributor Author

The bug introduced by #4493 is fixed by #4698. I'll re-run crash tests here once that PR lands.

@abhimadan
Contributor Author

I addressed the comments. After rebasing, crash tests look good. I'll land this now.

Contributor

@facebook-github-bot facebook-github-bot left a comment
@abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@abhimadan has updated the pull request.

return;
}
iters_.emplace_back(new TruncatedRangeDelIterator(std::move(input_iter),
                                                  icmp_, smallest, largest));
Contributor

It looks like iters_ will contain the range tombstones from every sstable ever seen during iteration. That could become increasingly expensive in a large scan. I think you only need to keep 1 iterator per level greater than L0, 1 for each sstable in L0, and 1 for each memtable. I was imagining this would be done via some sort of LevelTruncatedRangeDelIterator.

Contributor

@ajkr ajkr Nov 21, 2018

Yes, we had some discussion around this. Right now there is some difficulty because a file's range tombstones are added to the aggregator simply when a regular (point-key) LevelIterator opens that file. However, we cannot analogously drop tombstones from the aggregator when the LevelIterator advances to the next file because of cases like the following:

  • L1 has file1 containing key "a" and range tombstone ["b", "c")
  • L1 has file2 containing key "d"
  • L2 has file3 containing key "b"
  • User calls Seek("b").
    • The L1 LevelIterator opens file1 and adds its range tombstones to the aggregator.
    • The L1 LevelIterator doesn't find any key greater than or equal to the lookup key, so it proceeds to close file1 and open file2. When file2 is opened, its range tombstones are also added to the aggregator.
    • The L2 LevelIterator opens file3 and finds key "b". We call ShouldDelete on it and the range tombstone from file1 must still be in the aggregator.

I believe @abhimadan has a proposal involving active and inactive lists, and tracking file boundaries, to overcome this obstacle without having to restructure how we add tombstones to the aggregator. Will let him write it :).

Contributor Author

Thanks for the intro @ajkr :)

In this PR, we don't keep track of iterator positions, so we can't make any optimizations. However, in #4677, we keep track of the following data structures (in the forward case; the reverse case is similar so I won't describe it here):

  • an active iterator min-heap ordered by end key, which contains iterators that currently point to tombstones covering the ShouldDelete key
  • an inactive iterator min-heap ordered by start key, which contains iterators that currently point to tombstones starting after the ShouldDelete key
  • a consumed list, which contains iterators whose tombstones all end before the ShouldDelete key

This way, we only need to seek in the inactive and active iterators, while still allowing data iterators to add range tombstones to the aggregator as necessary. This improves performance for longer range scans, but in the current implementation, we don't free consumed iterators. I think that will require some more thought, though I have a few ideas on how to approach it (e.g., using file boundaries to determine whether an iterator can be cleaned up or not).
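A rough, forward-direction-only illustration of that bookkeeping follows, reusing the hypothetical TombstoneFragment stand-in from the earlier sketch. The real #4677 code is structured differently; in particular, the active set here is a plain vector rather than the end-key min-heap described above, to keep the sketch short:

#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical cursor over one source's sorted, non-overlapping fragments.
struct TombstoneCursor {
  std::vector<TombstoneFragment> frags;
  size_t pos = 0;
  bool Valid() const { return pos < frags.size(); }
  const TombstoneFragment& Cur() const { return frags[pos]; }
};

class ForwardRangeDelSketch {
 public:
  void Add(TombstoneCursor* c) {
    if (c->Valid()) {
      inactive_.push(c);
    } else {
      consumed_.push_back(c);
    }
  }

  // Assumes lookup keys arrive in non-decreasing order (a forward scan).
  bool ShouldDelete(const std::string& key, uint64_t seq) {
    // Activate cursors whose current fragment starts at or before the key.
    while (!inactive_.empty() && inactive_.top()->Cur().start_key <= key) {
      active_.push_back(inactive_.top());
      inactive_.pop();
    }
    bool covered = false;
    for (size_t i = 0; i < active_.size();) {
      TombstoneCursor* c = active_[i];
      // Advance past fragments that already ended before this key.
      while (c->Valid() && c->Cur().end_key <= key) {
        ++c->pos;
      }
      if (!c->Valid() || c->Cur().start_key > key) {
        // No longer covers the key: re-file as consumed or inactive.
        active_[i] = active_.back();
        active_.pop_back();
        Add(c);
        continue;
      }
      covered = covered || c->Cur().seq > seq;
      ++i;
    }
    return covered;
  }

 private:
  struct ByStartKey {
    bool operator()(const TombstoneCursor* a, const TombstoneCursor* b) const {
      return a->Cur().start_key > b->Cur().start_key;  // min-heap by start key
    }
  };
  std::vector<TombstoneCursor*> active_;
  std::priority_queue<TombstoneCursor*, std::vector<TombstoneCursor*>, ByStartKey> inactive_;
  std::vector<TombstoneCursor*> consumed_;  // never freed in this sketch, as noted above
};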

Contributor

Right now there is some difficulty because a file's range tombstones are added to the aggregator simply when a regular (point-key) LevelIterator opens that file.

It's interesting that handling of the range tombstones would in some ways be easier if the range tombstones were interleaved with point operations. If that were true, we wouldn't move past an sstable until the range tombstone was iterated past by other levels. Perhaps there is a way to fake that here. I'm imagining having LevelIterator return a "sentinel" key when an sstable boundary is encountered. These sentinel keys would be skipped by MergingIterator, but would prevent a LevelIterator from advancing to the next table until the sentinel key is the smallest key in the heap.

For the upper boundary, the sentinel key is only necessary if a range tombstone is the largest key in the table, and we know that is true if the upper boundary is a range tombstone sentinel key (kMaxSequenceNumber+kTypeRangeDeletion). Similarly, for the lower bound we only need a sentinel key if a range tombstone is the smallest key in the table. This is true if the smallest boundary key has the type kTypeRangeDeletion. Rather than returning that key directly, we could translate it into a normal point deletion, which the MergingIterator and DBIter would merrily process. That's fine because the lower bound of a range deletion is inclusive, so we know that key is already deleted at that sequence number.

There might be problems with this direction, but it seems a lot simpler than tracking active and inactive iterators in RangeDelAggregator.
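For concreteness, the boundary conditions above might be checked along these lines. ValueType and kMaxSequenceNumber mirror RocksDB's internal-key encoding, but BoundaryKey and the two helper functions are made-up illustrations, not an actual API:

#include <cstdint>

// Simplified mirror of RocksDB's internal-key value types; only the members
// relevant here are listed.
enum class ValueType : uint8_t { kTypeDeletion, kTypeValue, kTypeRangeDeletion };
constexpr uint64_t kMaxSequenceNumber = (1ull << 56) - 1;

// Hypothetical view of a file boundary key's metadata.
struct BoundaryKey {
  uint64_t sequence;
  ValueType type;
};

// Upper boundary: a sentinel is only needed when the largest key is a range
// tombstone sentinel (kMaxSequenceNumber + kTypeRangeDeletion), i.e. a range
// tombstone's end extends the file's key range.
bool NeedsUpperSentinel(const BoundaryKey& largest) {
  return largest.type == ValueType::kTypeRangeDeletion &&
         largest.sequence == kMaxSequenceNumber;
}

// Lower boundary: a smallest key of type kTypeRangeDeletion means a range
// tombstone starts at the file's smallest key; per the comment above, it can
// be surfaced as an ordinary point deletion at that sequence number instead.
bool NeedsLowerSentinel(const BoundaryKey& smallest) {
  return smallest.type == ValueType::kTypeRangeDeletion;
}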

Contributor Author

This is an interesting idea. If I understand correctly, in your idea, we only keep track of range tombstones that are currently pointed at by an iterator in MergingIterator, and use a sentinel key to prevent us from preemptively forgetting about truncated tombstones? Although that would solve the tombstone lifetime problem, I think it would introduce some other problems. For one thing, we would no longer be able to use bloom filters effectively, since we can't skip files that potentially contain range tombstones that delete keys at a lower level (this could be avoided by keeping the meta-block around and using it in these cases, but it would increase read-amp, which reduces the effectiveness of inlining tombstones). Also, the number of range tombstone fragments would depend on the number of keys above the tombstone in the same file (since the ones below would have been dropped during compaction). This makes write-amp somewhat unintuitive, and can significantly increase compaction output size in pathological cases (though in practice this might not be an issue; I'm not sure).

Also, I'm not sure how much this simplifies things (though I've thought a lot about the other design so this is probably a bit biased). Although we won't need to distinguish between active and inactive tombstone iterators anymore, we would still need to use a RangeDelAggregator-like approach, since MergingIterator would eagerly move past an inline range tombstone as soon as Next() or Prev() is called and it's on top of the heap, even if the tombstone covers keys in lower levels. This means that keeping track of range tombstone lifetimes is still somewhat disconnected from MergingIterator advancement, so we don't gain much from inlining them (though please correct me if I've misunderstood your proposal). Memory management is also not too difficult in the active/inactive tombstone iterator design, since we can delete iterators once we move past their file boundaries (I haven't tried this yet, but since we truncate tombstones at those boundaries, this shouldn't cause correctness issues), so there isn't a significant advantage there either.

Contributor

I think there are a couple things we're conflating:

(1) having BlockBasedTableIterator return a sentinel key for the beginning/end of a file when it's extended by range tombstones so it'll be closed at the same time the tombstones become irrelevant; and
(2) having all range tombstones inlined in the data blocks.

It looks like we're mostly talking about the downsides of (2), but what about (1)?

Contributor

having BlockBasedTableIterator return a sentinel key for the beginning/end of a file when it's extended by range tombstones so it'll be closed at the same time the tombstones become irrelevant

I was thinking this would be done by LevelIter since it has a handle on the FileMetadata, though I suppose we can plumb that into BlockBasedTableIterator.

@abhimadan abhimadan deleted the range-del-agg-merging-iter branch November 21, 2018 18:58
@petermattis
Contributor

If I understand correctly, in your idea, we only keep track of range tombstones that are currently pointed at by an iterator in MergingIterator, and use a sentinel key to prevent us from preemptively forgetting about truncated tombstones?

Yes. You can think of these sentinel keys as preventing iteration past an sstable when a range tombstone is the cause of the sstable boundary.

Although that would solve the tombstone lifetime problem, I think it would introduce some other problems. For one thing, we would no longer be able to use bloom filters effectively, since we can't skip files that potentially contain range tombstones that delete keys at a lower level (this could be avoided by keeping the meta-block around and using it in these cases, but it would increase read-amp, which reduces the effectiveness of inlining tombstones). Also, the number of range tombstone fragments would depend on the number of keys above the tombstone in the same file (since the ones below would have been dropped during compaction). This makes write-amp somewhat unintuitive, and can significantly increase compaction output size in pathological cases (though in practice this might not be an issue; I'm not sure).

Did I confuse you by mentioning interleaving range tombstones with point operations? I think we should definitely stick with the segregated tombstone approach. My suggestion is purely about the sentinel keys being able to simplify the tombstone lifetime problem for RangeDelAggregatorV2. I'm not seeing how that suggestion affects bloom filters or anything else.

@abhimadan
Contributor Author

abhimadan commented Nov 26, 2018

Oh, sorry about that, I did get confused by the interleaving comment, so most of my earlier comments are moot. OK, I'm going to try responding again, and hopefully I've understood everything this time.

So to summarize, the goal here is that we want to simplify tombstone lifetime logic in RangeDelAggregatorV2 by tying a table's tombstone iterator lifetime to its corresponding BlockBasedTableIterator's lifetime (maybe by providing a DeleteTombstones method). The problem with that is what's mentioned in Andrew's comment, and your proposal solves it by delaying when LevelIterator opens the next file.

Sorry again about the back-and-forth on figuring that out. I agree that it simplifies lifetime logic in RangeDelAggregatorV2, since we only need to figure out when DBIter has moved past file boundaries in LevelIterator, and not also in RangeDelAggregatorV2.
