Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: investigate SingleDelete for raft log truncation #8979

Open
dt opened this issue Aug 31, 2016 · 9 comments
Open

kvserver: investigate SingleDelete for raft log truncation #8979

dt opened this issue Aug 31, 2016 · 9 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-kv KV Team
Projects

Comments

@dt
Copy link
Member

dt commented Aug 31, 2016

@tbg on 2023-04-10:

Re-using this issue to focus exclusively on SingleDelete for the raft log, see #8979 (comment).


On #4321, @Simkha mentioned SingleDelete and its handling in compactions:

The following link points to a comment in CompactionIterator::NextFromInput(), the comment explains when presence of SingleDelete would compact out (i.e. clear) the stored key and the SingleDelete record. This iterator is used during meltable flush.

https://github.com/facebook/rocksdb/blob/2fc2fd92a94dba8fbad5679fce4522c4769aab27/db/compaction_iterator.cc#L227-L260

If you can adapt your code to meet the restrictions of SingleDelete, then I think, it could be useful.

https://github.com/facebook/rocksdb/wiki/Single-Delete

Jira issue: CRDB-6156

Epic CRDB-39898

@dt dt added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Aug 31, 2016
@tbg
Copy link
Member

tbg commented Aug 31, 2016

See
#4196 (comment)

On Wed, Aug 31, 2016 at 10:45 AM David Taylor notifications@github.com
wrote:

On #4321 #4321, @Simkha
https://github.com/simkha mentioned SingleDelete:

This info could be helpful for you. I was studying rocksdb sources and
thought I could share this little info.

The following link points to a comment in
CompactionIterator::NextFromInput(), the comment explains when presence of
SingleDelete would compact out (i.e. clear) the stored key and the
SingleDelete record. This iterator is used during meltable flush.

https://github.com/facebook/rocksdb/blob/2fc2fd92a94dba8fbad5679fce4522c4769aab27/db/compaction_iterator.cc#L227-L260

If you can adapt your code to meet the restrictions of SingleDelete, then
I think, it could be useful.

https://github.com/facebook/rocksdb/wiki/Single-Delete


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#8979, or mute the thread
https://github.com/notifications/unsubscribe-auth/AE135HPcrF3fzqoBMh68n2XHzck8DzB4ks5qlZOTgaJpZM4JxsgK
.

-- Tobias

@petermattis
Copy link
Collaborator

I think we might be able to use SingleDelete for non-versioned keys such as the transaction keys. Currently, those keys are written multiple times (via Put) and then deleted a single time. But we can change that so that we call SingleDelete before overwriting. This would result in a series of Put, SingleDelete, Put, SingleDelete, Put, SingleDelete operations for the key which I believe is kosher.

@ghost
Copy link

ghost commented Sep 1, 2016

Interestingly, according to CompactionIterator, currently, the keys deleted with DB::Delete() will only be deleted during compaction, they will not be deleted during memtable flush (unlike SingleDelete). You can tell this by compaction_ != nullptr check.

For completeness, here's the call stack leading to CompactionIterator during memtable flush:

- DBImpl::BackgroundFlush in db/db_impl.cc
- DBImpl::FlushMemTableToOutputFile
- FlushJob::Run -- db/flush_job.cc
- FlushJob::WriteLevel0Table
- BuildTable -- db/builder.cc
- CompactionIterator::Next
- CompactionIterator::NextFromInput -- db/compaction_iterator.cc

@ghost
Copy link

ghost commented Sep 2, 2016

Note that DB::Delete()d keys are probably deleted somewhere else in the code because body of the kTypeDeletion case in CompactionIterator::NextFromInput only drops the deletion entry (by skipping over it), not the key-value itself.

@ghost
Copy link

ghost commented Oct 19, 2016

The following text from here seems to be pointing to what is the real difference between SingleDelete and Delete:

if you never overwrite existing keys, you can try to use DB::SingleDelete() instead of Delete() to kill tombstones sooner. Tombstones will be dropped after it meets the original keys, rather than compacted to the last level.

@petermattis petermattis added this to the Later milestone Feb 22, 2017
@nvanbenschoten nvanbenschoten added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Aug 17, 2018
@nvanbenschoten
Copy link
Member

Expanding on @petermattis's comment #8979 (comment), we should explore whether SingleDelete can be used for Raft log entries during log truncations. The impetus for this is that Raft log entries should almost never get very deep into the LSM because log truncations will often follow in close succession. Following this logic, it would be ideal if the deletion tombstones laid down by log truncations also avoided getting very deep into the LSM.

At the moment, this is not the case because log truncations use Delete operations to clean up log entry keys. The effect of this is that even if log entries are removed from the LSM quickly, log entry tombstones stay in the LSM all the way until they hit the bottom. There are a few reasons why this is undersirable. One is that it contributes to write amplification and causes RocksDB to perform extra work. Another is that we've seen that if we're not careful, these tombstones can get in the way of other operations and cause significant slowdown. SingleDelete operations would allow us to avoid both of these problems to some degree.

In order to use SingleDelete operations for Raft log truncation, we'd have to ensure that no log entry is ever written to RocksDB twice without an intermediate SingleDelete. This seems reasonable, as we have all the necessary information (e.g. prevLastIndex) to enforce this invariant in Replica.append. In fact, we already have an optimization that depends on us making the distinction between whether a log entry is new or whether it's overwriting an existing entry:

if ent.Index > prevLastIndex {
err = engine.MVCCBlindPut(ctx, batch, &diff, key, hlc.Timestamp{}, value, nil /* txn */)
} else {
err = engine.MVCCPut(ctx, batch, &diff, key, hlc.Timestamp{}, value, nil /* txn */)
}

We'd also have to be careful about how SingleDelete interacts with existing Delete and RangeDelete tombstones. For instance, at the moment it looks like SingleDelete will actually remove Delete tombstones, which is absolutely not what we want and something we'd have to be careful about if migrating to the use of SingleDelete.

It's also possible that recent changes to DeleteRange have made it the ideal choice for raft log truncation instead. We should experiment with both options.

@nvanbenschoten nvanbenschoten added this to Backlog in Core Performance Aug 21, 2018
@petermattis petermattis removed this from the Later milestone Oct 5, 2018
@nvanbenschoten
Copy link
Member

Another interesting use of this is in cleaning up MVCCMetadata keys that are removed when resolving intents. If we were careful about updating these keys then we could use SingleDelete operations to remove them.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 2, 2019
Informs cockroachdb#8979.

This commit doesn't actually use the operation yet, but any experiments
will require this plumbing, so it made sense to pull it out and properly
test it.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 2, 2019
Informs cockroachdb#8979.

This commit doesn't actually use the operation yet, but any experiments
will require this plumbing, so it made sense to pull it out and properly
test it.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 2, 2019
Informs cockroachdb#8979.

This commit doesn't actually use the operation yet, but any experiments
will require this plumbing, so it made sense to pull it out and properly
test it.

Release note: None
craig bot pushed a commit that referenced this issue Jan 2, 2019
33436: cli/interactive_tests: unskip test_url_db_override r=knz a=knz

Fixes #30796.

... and use a more clearly guaranteed-bad DNS name for the expected
failure. Suggested by @petermattis.

Release note: None

33440: libroach/engine: plumb support for SingleDelete operation r=nvanbenschoten a=nvanbenschoten

Informs #8979.

This commit doesn't actually use the operation yet, but any experiments
will require this plumbing, so it made sense to pull it out and properly
test it.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@nvanbenschoten nvanbenschoten added this to Incoming in KV via automation Apr 16, 2019
@nvanbenschoten nvanbenschoten removed this from Backlog in Core Performance Apr 16, 2019
@lunevalex lunevalex moved this from Incoming to Replication in KV Jul 15, 2020
@lunevalex lunevalex added this to Incoming in (Deprecated) Storage via automation Jul 27, 2020
@lunevalex lunevalex removed this from Replication in KV Jul 27, 2020
@sumeerbhola
Copy link
Collaborator

The separated lock table uses SingleDelete for locks (MVCCMetadata) when possible. The remaining place where this could be used is for the raft log.

@jlinder jlinder added the T-storage Storage Team label Jun 16, 2021
@jbowens jbowens moved this from Incoming to Backlog in (Deprecated) Storage Aug 15, 2022
@tbg tbg changed the title storage: investigate SingleDelete kvserver: investigate SingleDelete for raft log truncation Apr 10, 2023
@tbg tbg removed this from Backlog in (Deprecated) Storage Apr 10, 2023
@tbg tbg added T-kv-replication and removed T-storage Storage Team labels Apr 10, 2023
@blathers-crl
Copy link

blathers-crl bot commented Apr 10, 2023

cc @cockroachdb/replication

@tbg tbg added A-kv-replication Relating to Raft, consensus, and coordination. and removed A-storage Relating to our storage engine (Pebble) on-disk storage. labels Apr 10, 2023
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@blathers-crl blathers-crl bot added this to Incoming in KV Jun 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-kv KV Team
Projects
KV
Incoming
Development

No branches or pull requests

6 participants