
Introduce RangeDelAggregatorV2 #4649

Closed

Conversation

abhimadan
Contributor

@abhimadan abhimadan commented Nov 7, 2018

Summary: The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
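For illustration, here is a minimal, self-contained C++ sketch of that design. The names TombstoneFragment, FragmentedTombstoneList, and RangeDelAggregatorV2Sketch are hypothetical stand-ins rather than the actual RocksDB classes; the point is only that AddTombstones does O(1) work per source while ShouldDelete searches each source's already-fragmented tombstones on demand:

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one source's fragmented range tombstones
// (e.g., one sstable or memtable); not the actual RocksDB types.
struct TombstoneFragment {
  std::string start_key;  // inclusive
  std::string end_key;    // exclusive
  uint64_t seq;
};

struct FragmentedTombstoneList {
  std::vector<TombstoneFragment> fragments;  // sorted by start key, non-overlapping
};

class RangeDelAggregatorV2Sketch {
 public:
  // Minimal work per source: just remember the list; no cross-source collapsing.
  void AddTombstones(const FragmentedTombstoneList* list) { lists_.push_back(list); }

  // Search every source for a fragment that covers user_key with a newer seqno.
  bool ShouldDelete(const std::string& user_key, uint64_t seq) const {
    for (const FragmentedTombstoneList* list : lists_) {
      const auto& frags = list->fragments;
      // Find the first fragment starting after user_key; the covering
      // candidate, if any, is the fragment immediately before it.
      auto it = std::upper_bound(
          frags.begin(), frags.end(), user_key,
          [](const std::string& k, const TombstoneFragment& f) { return k < f.start_key; });
      if (it != frags.begin()) {
        const TombstoneFragment& f = *(it - 1);
        if (user_key < f.end_key && f.seq > seq) {
          return true;
        }
      }
    }
    return false;
  }

 private:
  std::vector<const FragmentedTombstoneList*> lists_;
};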

@abhimadan abhimadan added the WIP Work in progress label Nov 7, 2018
@abhimadan abhimadan requested a review from ajkr November 7, 2018 19:45
@abhimadan abhimadan force-pushed the range-del-agg-merging-iter branch 3 times, most recently from 8d22ab5 to ac51bbe on November 7, 2018 22:08
@abhimadan
Contributor Author

Here are the results from running the microbenchmarks with the v1 and v2 RangeDelAggregator:

[abhishekmadan@devvm907.atn1 ~/rocksdb] ./range_del_aggregator_bench --should_deletes_per_run=100 --add_tombstones_per_run=5
=========================
Results:
=========================
AddTombstones:           307.871 us
ShouldDelete (first):    0.607064 us
ShouldDelete (rest):     0.0545043 us
[abhishekmadan@devvm907.atn1 ~/rocksdb] ./range_del_aggregator_bench --should_deletes_per_run=100 --add_tombstones_per_run=5 --use_v2_aggregator=true
=========================
Results:
=========================
AddTombstones:           0.322408 us
ShouldDelete (first):    2.16054 us
ShouldDelete (rest):     0.268877 us

So AddTombstones is now much faster, though ShouldDelete is slower, which implies that RangeDelAggregatorV2 will be less efficient than V1 for sufficiently long range scans. Subsequent ShouldDelete calls during a scan can be sped up here by avoiding a re-seek every call, though this work will be done in a future PR.
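One possible shape of that optimization, again using the hypothetical stand-ins from the sketch above and assuming keys are visited in non-decreasing order during a forward scan, is a cached cursor per source that advances linearly instead of re-seeking on every call:

// Hypothetical per-source cursor: remembers where the previous lookup ended so
// a forward scan can advance instead of binary-searching each time.
struct ForwardScanCursor {
  const FragmentedTombstoneList* list = nullptr;
  size_t pos = 0;

  bool ShouldDelete(const std::string& user_key, uint64_t seq) {
    const auto& frags = list->fragments;
    // Fragments ending at or before the key cannot cover it, and (for a
    // forward scan) cannot cover any later key either, so skip them for good.
    while (pos < frags.size() && frags[pos].end_key <= user_key) {
      ++pos;
    }
    return pos < frags.size() && frags[pos].start_key <= user_key && frags[pos].seq > seq;
  }
};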

Contributor

@ajkr ajkr left a comment

lgtm! sorry for the long delay.

stats.time_add_tombstones += stop_watch_add_tombstones.ElapsedNanos();
fragmented_range_tombstone_lists.emplace_back(
    new rocksdb::FragmentedRangeTombstoneList(
        rocksdb::MakeRangeDelIterator(persistent_range_tombstones), icmp,
Contributor

I wonder if we should also have a mode where persistent_range_tombstones is created already ordered by begin key. I tried out the benchmark just now and the profile shows most of the time is spent in sorting. Hopefully this won't be necessary in the common case when either (1) meta-blocks were written with a recent rocksdb version, or (2) an old rocksdb version was used but there weren't old snapshots. (not suggesting you do it in this PR, btw.)

Contributor Author

I was thinking of only having a mode where tombstones are created in order, since having a sorted list should be the common case (i.e., the sorting should only happen once). But yeah, this PR is already large enough, so I'll do this in a follow-up PR. Thanks for noticing this!

{"n", UncutEndpoint(""), UncutEndpoint(""), 0, true /* invalid */},
{"", InternalValue("d", 7), UncutEndpoint("e"), 10}});

VerifySeekForPrev(
Contributor

these tests are nicely structured - thanks

// We do not need to adjust largest to properly truncate range
// tombstones that extend past the boundary.
} else if (parsed_largest.sequence == 0) {
// No range tombstone from this sstable can cover largest (or a range
Contributor

Does this rely on the invariant that the same user key cannot appear with seqnum zero both as the largest key in the current file and as the smallest key in the next file?

Contributor Author

Yes. I'll make this more clear in the comment.

@ajkr
Contributor

ajkr commented Nov 19, 2018

Hm, I tried the correctness test as a sanity check and it failed. Not sure why yet.

$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple --delrangepercent=1 --delpercent=4 --write_buffer_size=1048576 --max_bytes_for_level_base=4194304 --target_file_size_base=1048576 --value_size_mult=33 --max_background_compactions=12 --max_key=10000000 --interval=30
...
Verification failed for column family 0 key 465: Value not found: NotFound:
Crash-recovery verification failed :(
2018/11/19-15:35:28  Starting database operations
2018/11/19-15:35:28  Starting verification
Verification failed :(

@ajkr
Contributor

ajkr commented Nov 19, 2018

Actually it happens before this PR too when db_stress runs with merge operator and DeleteRange, hm..

@ajkr
Contributor

ajkr commented Nov 20, 2018

According to git bisect the culprit is 7528130.

@abhimadan
Contributor Author

The bug introduced by #4493 is fixed by #4698. I'll re-run crash tests here once that PR lands.

@abhimadan
Contributor Author

I addressed the comments. After rebasing, crash tests look good. I'll land this now.

Contributor

@facebook-github-bot facebook-github-bot left a comment
@abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@abhimadan has updated the pull request.

return;
}
iters_.emplace_back(new TruncatedRangeDelIterator(std::move(input_iter),
                                                  icmp_, smallest, largest));
Contributor

It looks like iters_ will contain the range tombstones from every sstable ever seen during iteration. That could become increasingly expensive in a large scan. I think you only need to keep 1 iterator per level greater than L0, 1 for each sstable in L0, and 1 for each memtable. I was imagining this would be done via some sort of LevelTruncatedRangeDelIterator.

Contributor

@ajkr ajkr Nov 21, 2018

Yes, we had some discussion around this. Right now there is some difficulty because a file's range tombstones are added to the aggregator simply when a regular (point-key) LevelIterator opens that file. However, we cannot analogously drop tombstones from the aggregator when the LevelIterator advances to the next file because of cases like the following:

  • L1 has file1 containing key "a" and range tombstone ["b", "c")
  • L1 has file2 containing key "d"
  • L2 has file3 containing key "b"
  • User calls Seek("b").
    • The L1 LevelIterator opens file1 and adds its range tombstones to the aggregator.
    • The L1 LevelIterator doesn't find any key greater than or equal to the lookup key, so it proceeds to close file1 and open file2. When file2 is opened, its range tombstones are also added to the aggregator.
    • The L2 LevelIterator opens file3 and finds key "b". We call ShouldDelete on it and the range tombstone from file1 must still be in the aggregator.

I believe @abhimadan has a proposal involving active and inactive lists, and tracking file boundaries, to overcome this obstacle without having to restructure how we add tombstones to the aggregator. Will let him write it :).

Contributor Author

Thanks for the intro @ajkr :)

In this PR, we don't keep track of iterator positions, so we can't make any optimizations. However, in #4677, we keep track of the following data structures (in the forward case; the reverse case is similar so I won't describe it here):

  • an active iterator min-heap ordered by end key, which contains iterators that currently point to tombstones covering the ShouldDelete key
  • an inactive iterator min-heap ordered by start key, which contains iterators that currently point to tombstones starting after the ShouldDelete key
  • a consumed list, which contains iterators whose tombstones all end before the ShouldDelete key

This way, we only need to seek in the inactive and active iterators, while still allowing data iterators to add range tombstones to the aggregator as necessary. This improves performance for longer range scans, but in the current implementation, we don't free consumed iterators. I think that will require some more thought, though I have a few ideas on how to approach it (e.g., using file boundaries to determine whether an iterator can be cleaned up or not).
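A rough, forward-direction-only illustration of that bookkeeping follows, reusing the hypothetical TombstoneFragment stand-in from the earlier sketch. The real #4677 code is structured differently; in particular, the active set here is a plain vector rather than the end-key min-heap described above, to keep the sketch short:

#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical cursor over one source's sorted, non-overlapping fragments.
struct TombstoneCursor {
  std::vector<TombstoneFragment> frags;
  size_t pos = 0;
  bool Valid() const { return pos < frags.size(); }
  const TombstoneFragment& Cur() const { return frags[pos]; }
};

class ForwardRangeDelSketch {
 public:
  void Add(TombstoneCursor* c) {
    if (c->Valid()) {
      inactive_.push(c);
    } else {
      consumed_.push_back(c);
    }
  }

  // Assumes lookup keys arrive in non-decreasing order (a forward scan).
  bool ShouldDelete(const std::string& key, uint64_t seq) {
    // Activate cursors whose current fragment starts at or before the key.
    while (!inactive_.empty() && inactive_.top()->Cur().start_key <= key) {
      active_.push_back(inactive_.top());
      inactive_.pop();
    }
    bool covered = false;
    for (size_t i = 0; i < active_.size();) {
      TombstoneCursor* c = active_[i];
      // Advance past fragments that already ended before this key.
      while (c->Valid() && c->Cur().end_key <= key) {
        ++c->pos;
      }
      if (!c->Valid() || c->Cur().start_key > key) {
        // No longer covers the key: re-file as consumed or inactive.
        active_[i] = active_.back();
        active_.pop_back();
        Add(c);
        continue;
      }
      covered = covered || c->Cur().seq > seq;
      ++i;
    }
    return covered;
  }

 private:
  struct ByStartKey {
    bool operator()(const TombstoneCursor* a, const TombstoneCursor* b) const {
      return a->Cur().start_key > b->Cur().start_key;  // min-heap by start key
    }
  };
  std::vector<TombstoneCursor*> active_;
  std::priority_queue<TombstoneCursor*, std::vector<TombstoneCursor*>, ByStartKey> inactive_;
  std::vector<TombstoneCursor*> consumed_;  // never freed in this sketch, as noted above
};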

Contributor

Right now there is some difficulty because a file's range tombstones are added to the aggregator simply when a regular (point-key) LevelIterator opens that file.

It's interesting that handling of the range tombstones would in some ways be easier if the range tombstones were interleaved with point operations. If that were true, we wouldn't move past an sstable until the range tombstone was iterated past by other levels. Perhaps there is a way to fake that here. I'm imagining having LevelIterator return a "sentinel" key when an sstable boundary is encountered. These sentinel keys would be skipped by MergingIterator, but would prevent a LevelIterator from advancing to the next table until the sentinel key is the smallest key in the heap.

For the upper boundary, the sentinel key is only necessary if a range tombstone is the largest key in the table, and we know that is true if the upper boundary is a range tombstone sentinel key (kMaxSequenceNumber+kTypeRangeDeletion). Similarly, for the lower bound we only need a sentinel key if a range tombstone is the smallest key in the table. This is true if the smallest boundary key has the type kTypeRangeDeletion. Rather than returning that key directly, we could translate it into a normal point deletion, which the MergingIterator and DBIter would merrily process. That's fine because the lower bound of a range deletion is inclusive, so we know that key is already deleted at that sequence number.

There might be problems with this direction, but it seems a lot simpler than tracking active and inactive iterators in RangeDelAggregator.
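For concreteness, the boundary conditions above might be checked along these lines. ValueType and kMaxSequenceNumber mirror RocksDB's internal-key encoding, but BoundaryKey and the two helper functions are made-up illustrations, not an actual API:

#include <cstdint>

// Simplified mirror of RocksDB's internal-key value types; only the members
// relevant here are listed.
enum class ValueType : uint8_t { kTypeDeletion, kTypeValue, kTypeRangeDeletion };
constexpr uint64_t kMaxSequenceNumber = (1ull << 56) - 1;

// Hypothetical view of a file boundary key's metadata.
struct BoundaryKey {
  uint64_t sequence;
  ValueType type;
};

// Upper boundary: a sentinel is only needed when the largest key is a range
// tombstone sentinel (kMaxSequenceNumber + kTypeRangeDeletion), i.e. a range
// tombstone's end extends the file's key range.
bool NeedsUpperSentinel(const BoundaryKey& largest) {
  return largest.type == ValueType::kTypeRangeDeletion &&
         largest.sequence == kMaxSequenceNumber;
}

// Lower boundary: a smallest key of type kTypeRangeDeletion means a range
// tombstone starts at the file's smallest key; per the comment above, it can
// be surfaced as an ordinary point deletion at that sequence number instead.
bool NeedsLowerSentinel(const BoundaryKey& smallest) {
  return smallest.type == ValueType::kTypeRangeDeletion;
}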

Contributor Author

This is an interesting idea. If I understand correctly, in your idea, we only keep track of range tombstones that are currently pointed at by an iterator in MergingIterator, and use a sentinel key to prevent us from preemptively forgetting about truncated tombstones? Although that would solve the tombstone lifetime problem, I think it would introduce some other problems. For one thing, we would no longer be able to use bloom filters effectively, since we can't skip files that potentially contain range tombstones that delete keys at a lower level (this could be avoided by keeping the meta-block around and using it in these cases, but it would increase read-amp, which reduces the effectiveness of inlining tombstones). Also, the number of range tombstone fragments would depend on the number of keys above the tombstone in the same file (since the ones below would have been dropped during compaction). This makes write-amp somewhat unintuitive, and can significantly increase compaction output size in pathological cases (though in practice this might not be an issue; I'm not sure).

Also, I'm not sure how much this simplifies things (though I've thought a lot about the other design so this is probably a bit biased). Although we won't need to distinguish between active and inactive tombstone iterators anymore, we would still need to use a RangeDelAggregator-like approach, since MergingIterator would eagerly move past an inline range tombstone as soon as Next() or Prev() is called and it's on top of the heap, even if the tombstone covers keys in lower levels. This means that keeping track of range tombstone lifetimes is still somewhat disconnected from MergingIterator advancement, so we don't gain much from inlining them (though please correct me if I've misunderstood your proposal). Memory management is also not too difficult in the active/inactive tombstone iterator design, since we can delete iterators once we move past their file boundaries (I haven't tried this yet, but since we truncate tombstones at those boundaries, this shouldn't cause correctness issues), so there isn't a significant advantage there either.

Contributor

I think there are a couple things we're conflating:

(1) having BlockBasedTableIterator return a sentinel key for the beginning/end of a file when it's extended by range tombstones so it'll be closed at the same time the tombstones become irrelevant; and
(2) having all range tombstones inlined in the data blocks.

It looks like we're mostly talking about the downsides of (2), but what about (1)?

Contributor

having BlockBasedTableIterator return a sentinel key for the beginning/end of a file when it's extended by range tombstones so it'll be closed at the same time the tombstones become irrelevant

I was thinking this would be done by LevelIter since it has a handle on the FileMetadata, though I suppose we can plumb that into BlockBasedTableIterator.

@abhimadan abhimadan deleted the range-del-agg-merging-iter branch November 21, 2018 18:58
@petermattis
Contributor

If I understand correctly, in your idea, we only keep track of range tombstones that are currently pointed at by an iterator in MergingIterator, and use a sentinel key to prevent us from preemptively forgetting about truncated tombstones?

Yes. You can think of these sentinel keys as preventing iteration past an sstable when a range tombstone is the cause of the sstable boundary.

Although that would solve the tombstone lifetime problem, I think it would introduce some other problems. For one thing, we would no longer be able to use bloom filters effectively, since we can't skip files that potentially contain range tombstones that delete keys at a lower level (this could be avoided by keeping the meta-block around and using it in these cases, but it would increase read-amp, which reduces the effectiveness of inlining tombstones). Also, the number of range tombstone fragments would depend on the number of keys above the tombstone in the same file (since the ones below would have been dropped during compaction). This makes write-amp somewhat unintuitive, and can significantly increase compaction output size in pathological cases (though in practice this might not be an issue; I'm not sure).

Did I confuse you by mentioning interleaving range tombstones with point operations? I think we should definitely stick with the segregated tombstone approach. My suggestion is purely about the sentinel keys being able to simplify the tombstone lifetime problem for RangeDelAggregatorV2. I'm not seeing how that suggestion affects bloom filters or anything else.

@abhimadan
Contributor Author

abhimadan commented Nov 26, 2018

Oh, sorry about that, I did get confused by the interleaving comment, so most of my earlier comments are moot. OK, I'm going to try responding again, and hopefully I've understood everything this time.

So to summarize, the goal here is that we want to simplify tombstone lifetime logic in RangeDelAggregatorV2 by tying a table's tombstone iterator lifetime to its corresponding BlockBasedTableIterator's lifetime (maybe by providing a DeleteTombstones method). The problem with that is what's mentioned in Andrew's comment, and your proposal solves it by delaying when LevelIterator opens the next file.

Sorry again about the back-and-forth on figuring that out. I agree that it simplifies lifetime logic in RangeDelAggregatorV2, since we only need to figure out when DBIter has moved past file boundaries in LevelIterator, and not also in RangeDelAggregatorV2.
