Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support DeleteRange in transactions #4812

Open
abhimadan opened this issue Dec 21, 2018 · 10 comments
Open

Support DeleteRange in transactions #4812

abhimadan opened this issue Dec 21, 2018 · 10 comments
Labels
up-for-grabs Up for grabs

Comments

@abhimadan
Copy link
Contributor

abhimadan commented Dec 21, 2018

Now that DeleteRange is performant, it would be nice to support it in transactions as well.
Summary of the essential changes:

  1. Finding the last visible RangeDelete tombstone needs to be updated to make use of ReadCallback::IsVisible same as the regular reads.
    upper_bound_, std::greater<SequenceNumber>());
  2. WritePrepared tests assumed the key cannot be RangeDelete. The tests must be updated to add RangeDelete type anywhere Delete and SingleDelete is used: https://github.com/facebook/rocksdb/blob/master/utilities/transactions/write_prepared_transaction_test.cc#L513
  3. Finally the existing tests for range delete should be extended to include transactions (extended with ::DeleteRange) API as well.
@abhimadan abhimadan added the up-for-grabs Up for grabs label Dec 21, 2018
@maysamyabandeh maysamyabandeh removed the up-for-grabs Up for grabs label Jan 28, 2019
@maysamyabandeh
Copy link
Contributor

Since DeleteRange and WritePrepared Transactions are developed in parallel we need to make sure that they are consistent with each other. One difference to watch out for is that in WritePrepared the prepare and commit sequence numbers are separate: the prepare sequence number is stored in DB with the key/value and commit is stored outside.
I went through the code to figure the potential places that needs careful consideration:

  1. There is a TODO here for duplicate key detection:
    // TODO(myabandeh): when transactional DeleteRange support is added,

    to include the end key but I think it is fine since the end key is encoded in the value, thus not causing duplicate key issue.
  2. In Memtable::Get there is a comparison of seq to max_covering_tombstone_seq which is obtained from DeleteRange tombstone.
    max_covering_tombstone_seq > seq) {

    I read the code a bit but could not figure what is the significance of this sequence number and whether it is consistent with separation of prepare and commit sequence number introduced in WritePrepared transactions.
    Same goes for GetContext:
    *max_covering_tombstone_seq_ > parsed_key.sequence) {
  3. WritePrepared tests assumed the key cannot be RangeDelete. The tests must be updated to add RangeDelete type anywhere Delete and SingleDelete is used: https://github.com/facebook/rocksdb/blob/master/utilities/transactions/write_prepared_transaction_test.cc#L513
  4. ::ShouldDelete that checks against the deleted ranges, should only care about user key and ignore the sequence number. However it uses internal key comparator instead of user comparator:
    icmp_->Compare(parsed, (*active_iters_.top())->start_key()) < 0) {

    We need to clarify why so that the sequence number, that is interpreted differently in write prepared transactions, is not used meaningfully here.
  5. Finally the existing tests for range delete should be extended to include transactions (extended with ::DeleteRange) API as well.
    We can discuss these items here, see which one could be potential issue, and specify what exactly needs to be done to extend transactions with DeleteRange.

@ajkr What do you think?

@ajkr
Copy link
Contributor

ajkr commented Mar 26, 2019

I read the code a bit but could not figure what is the significance of this sequence number and whether it is consistent with separation of prepare and commit sequence number introduced in WritePrepared transactions.

Unfortunately I don't know much about write-prepared to answer your question, but here's some more detail about that seqnum from the DeleteRange side.

max_covering_tombstone_seq is the seqnum of the newest range tombstone we've found so far that covers the user key Get() is looking up. (Side note: I suppose it could've also been called first_covering_tombstone_seq because Get() looks at memtables and SSTs from newest-to-oldest, so after it finds one covering tombstone it will not find another one with a higher seqnum.)

It is used in the places you mentioned to check whether an internal key found in a memtable or SST is deleted by a range tombstone. It is also used to short-circuit the lookup process, e.g.,

rocksdb/db/version_set.cc

Lines 1224 to 1228 in 3c5d1b1

if (*max_covering_tombstone_seq > 0) {
// The remaining files we look at will only contain covered keys, so we
// stop here.
break;
}

The above logic exploits how we examine memtables/SSTs in newest-to-oldest order. If any covering range tombstone was found in an earlier memtable or SST, then it's definitely newer than any internal key we might find now or later in the lookup process, so we can give up early.

@ajkr
Copy link
Contributor

ajkr commented Mar 27, 2019

::ShouldDelete that checks against the deleted ranges, should only care about user key and ignore the sequence number. However it uses internal key comparator instead of user comparator:

I always have trouble remembering this one. Luckily @abhimadan documented it here: https://github.com/facebook/rocksdb/wiki/DeleteRange-Implementation#internal-key-range-tombstone-boundaries.

@maysamyabandeh
Copy link
Contributor

Thanks @ajkr for the clear explanation. It was very helpful. I update the issue description will the list of required changes that we have figured so far.

@maysamyabandeh
Copy link
Contributor

@ajkr One more question, I do not see MaxCoveringTombstoneSeqnum taking into account the snapshot number with which we are doing the read. How does it know if the DeleteRange Tombstone is in the read snapshot (and not newer than that)?

@ajkr
Copy link
Contributor

ajkr commented Mar 28, 2019

@maysamyabandeh The FragmentedRangeTombstoneIterator has some internal logic preventing it from considering range tombstones newer than this read_seq:

fragmented_tombstone_list, comparator_.comparator, read_seq);
. That read_seq appears to come from here in the case of Get():

rocksdb/db/db_impl.cc

Lines 1363 to 1392 in 89ab138

SequenceNumber snapshot;
if (read_options.snapshot != nullptr) {
// Note: In WritePrepared txns this is not necessary but not harmful
// either. Because prep_seq > snapshot => commit_seq > snapshot so if
// a snapshot is specified we should be fine with skipping seq numbers
// that are greater than that.
//
// In WriteUnprepared, we cannot set snapshot in the lookup key because we
// may skip uncommitted data that should be visible to the transaction for
// reading own writes.
snapshot =
reinterpret_cast<const SnapshotImpl*>(read_options.snapshot)->number_;
if (callback) {
snapshot = std::max(snapshot, callback->MaxUnpreparedSequenceNumber());
}
} else {
// Since we get and reference the super version before getting
// the snapshot number, without a mutex protection, it is possible
// that a memtable switch happened in the middle and not all the
// data for this snapshot is available. But it will contain all
// the data available in the super version we have, which is also
// a valid snapshot to read from.
// We shouldn't get snapshot before finding and referencing the super
// version because a flush happening in between may compact away data for
// the snapshot, but the snapshot is earlier than the data overwriting it,
// so users may see wrong results.
snapshot = last_seq_same_as_publish_seq_
? versions_->LastSequence()
: versions_->LastPublishedSequence();
}

@maysamyabandeh maysamyabandeh added the up-for-grabs Up for grabs label Mar 28, 2019
@ghost ghost self-assigned this Feb 22, 2020
@feliwir
Copy link

feliwir commented Apr 21, 2020

Any updates on this?

@feliwir
Copy link

feliwir commented Oct 14, 2021

We still need this in 2021 :-/

@chenyukang
Copy link

Any update on this?
We still need this in 2023 :-/

@koudelka
Copy link

koudelka commented Jan 6, 2024

Would love this for 2024 😢

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
up-for-grabs Up for grabs
Projects
None yet
Development

No branches or pull requests

6 participants