
single-file bottom-level compaction when snapshot released #3009

Closed
wants to merge 5 commits

Conversation

Contributor

@ajkr ajkr commented Oct 15, 2017

When snapshots are held for a long time, files containing overwritten/deleted keys may reach the bottom level. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted by second-to-last-level files moving down into them. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.

  • Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
  • Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that the oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
  • Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
  • Extracted the core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()`, since logic to check whether a file is bottommost is now necessary outside of compaction.
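The first bullet's picking rule can be sketched as follows. This is a minimal toy model, not the actual RocksDB code: `FileMetaData`, `SingleFileCompaction`, and `PickBottommostCompactions` are illustrative names and simplified shapes, not the real API.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the internal types involved; fields are
// trimmed to what the picking rule needs.
struct FileMetaData {
  uint64_t file_number;
  uint64_t largest_seqno;  // newest sequence number in the file
};

struct SingleFileCompaction {
  int input_level;
  int output_level;  // equals input_level: the file is rewritten in place
  uint64_t file_number;
};

// Sketch of the rule described above: each marked bottommost file becomes
// its own compaction whose output level equals its input level, so the
// data is rewritten minus the droppable deleted/overwritten keys.
std::vector<SingleFileCompaction> PickBottommostCompactions(
    const std::vector<std::pair<int, FileMetaData>>& marked_files) {
  std::vector<SingleFileCompaction> picked;
  for (const auto& level_and_file : marked_files) {
    picked.push_back({level_and_file.first, level_and_file.first,
                      level_and_file.second.file_number});
  }
  return picked;
}
```

Because input and output level are the same, such a compaction reclaims space without moving data between levels, which is why it can safely be the last-choice compaction type.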

Test Plan:

  • unit test
  • db_bench in-memory randomtransaction where each transaction writes two keys in a large DB with ~400 bottommost files.

Populate DB command:

TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=randomtransaction -num=50000000 -transaction_db=1 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12

Benchmark command:

TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=randomtransaction -use_existing_db=1 -num=1000000 -transaction_db=1 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -transaction_set_snapshot=1

Before:

randomtransaction : 54.730 micros/op 18271 ops/sec; 0.4 MB/s ( transactions:1000000 aborts:0)

After:

randomtransaction : 54.267 micros/op 18427 ops/sec; 0.4 MB/s ( transactions:1000000 aborts:0)

  • perf (same setup as above):

Before:

0.03% 0.00% db_bench db_bench [.] rocksdb::DBImpl::ReleaseSnapshot

After:

0.12% 0.07% db_bench db_bench [.] rocksdb::DBImpl::ReleaseSnapshot

@facebook-github-bot facebook-github-bot left a comment

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr ajkr force-pushed the del-triggered-compact-bottom-level branch from d2c735f to e39f2cc Compare October 15, 2017 22:02
@facebook-github-bot

@ajkr has updated the pull request.

@facebook-github-bot facebook-github-bot left a comment

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@ajkr has updated the pull request. View: changes, changes since last import

@ajkr ajkr force-pushed the del-triggered-compact-bottom-level branch from 1b0389f to e39f2cc Compare October 17, 2017 18:52
@facebook-github-bot

@ajkr has updated the pull request.

@facebook-github-bot

@ajkr has updated the pull request. View: changes, changes since last import

@facebook-github-bot facebook-github-bot left a comment

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr ajkr changed the title single-file bottom-level compactions for obsolete snapshots single-file bottom-level compaction when snapshot released Oct 18, 2017
@sagar0 sagar0 left a comment

Looks pretty good.

I'd prefer not to touch the unrelated files, iterator.cc and slice_test.cc, in this PR (which I think were updated by `make format`).

@ajkr

ajkr commented Oct 18, 2017

Sure, I'll revert them. I need to be careful not to run `make format` after `git commit`, as it seems that's when it formats unrelated files.

@ajkr ajkr force-pushed the del-triggered-compact-bottom-level branch from 0c36298 to 85a13c0 Compare October 19, 2017 05:56
@facebook-github-bot

@ajkr has updated the pull request.

@facebook-github-bot facebook-github-bot left a comment

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@igorcanadi igorcanadi (Collaborator) left a comment

This looks great, @ajkr. I have a couple of small comments inline, but overall this is a high-quality PR!

start_level_inputs_.level = output_level_ = start_level_ =
level_and_file.first;
start_level_inputs_.files = {level_and_file.second};
if (compaction_picker_->ExpandInputsToCleanCut(cf_name_, vstorage_,
igorcanadi (Collaborator):

Can you explain why you need this? If you are compacting only a bottom-most file, they don't have any other files to compact with, right? In all cases you'll only ever compact a single file, as I understand it?

Contributor:

This diff is expected to compact Lmax-1 as well, if nothing is overlapping in the bottommost level.

@ajkr ajkr (Contributor Author) commented Oct 25, 2017

@igorcanadi - the files in bottom level can have the same user key at their endpoints. Example:

  • file 1's last key is deletion of user key A with seqnum 20
  • file 2's first key is put of user key A with seqnum 10

If they are compacted alone, compaction of file 1 will drop the deletion, and compaction of file 2 will keep the user key. We need to make sure file 1 and file 2 are picked together for compaction; then, both of these internal keys will be dropped.
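This boundary case can be made concrete with a toy model of bottommost compaction (not RocksDB code; `InternalKey`, `Type`, and `CompactBottommost` are illustrative). With no snapshots, only the newest entry per user key survives, and even that entry is dropped if it is a deletion, since nothing below the bottom level could resurrect the key.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy internal key: (user_key, seqno, type), newest seqno wins.
enum class Type { kPut, kDelete };
struct InternalKey {
  std::string user_key;
  uint64_t seqno;
  Type type;
};

// Simulates compacting a set of keys at the bottommost level with no
// snapshots held: keep only the newest entry per user key, and drop it
// too if it is a deletion.
std::vector<InternalKey> CompactBottommost(std::vector<InternalKey> keys) {
  std::map<std::string, InternalKey> newest;
  for (const auto& k : keys) {
    auto it = newest.find(k.user_key);
    if (it == newest.end() || k.seqno > it->second.seqno) {
      newest[k.user_key] = k;
    }
  }
  std::vector<InternalKey> out;
  for (const auto& entry : newest) {
    if (entry.second.type != Type::kDelete) {
      out.push_back(entry.second);
    }
  }
  return out;
}
```

Compacting file 1 (`A@20` delete) alone drops the deletion, and compacting file 2 (`A@10` put) alone keeps the put, incorrectly resurrecting key A; compacting both files together drops both entries. Hence the need for `ExpandInputsToCleanCut()`.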

igorcanadi (Collaborator):

Got it, good point :)

if (i == vstorage_->BottommostFilesMarkedForCompaction().size()) {
start_level_inputs_.clear();
}
}
if (!start_level_inputs_.empty()) {
compaction_reason_ = CompactionReason::kFilesMarkedForCompaction;
igorcanadi (Collaborator):

Don't you want to add another reason which is kBottommostFiles?

ajkr (Contributor Author):

Sure, I was a bit lazy here.

@@ -1697,6 +1701,58 @@ void VersionStorageInfo::GenerateLevel0NonOverlapping() {
}
}

void VersionStorageInfo::UpdateBottommostFiles() {
igorcanadi (Collaborator):

Nit: This is not updating a bottommost file list, it's actually generating it, right?

ajkr (Contributor Author):

Yes that sounds right.

db/version_set.h Outdated
// Among bottommost files (assumes they've already been computed), marks the
// ones that have keys that would be eliminated if recompacted, according to
// the seqnum of the oldest existing snapshot.
// REQUIRES: DB mutex held
igorcanadi (Collaborator):

Nit: I would add a comment saying that since its behavior depends on oldest_sequence_number_ it has to be called every time oldest_sequence_number_ changes.

ajkr (Contributor Author):

Sure.

facebook-github-bot pushed a commit that referenced this pull request Oct 20, 2017
Summary:
Add options to `db_stress` (correctness testing tool) to randomly acquire snapshot and release it after some period of time. It's useful for correctness testing of #3009, as well as other parts of compaction that behave differently depending on which snapshots are held.
Closes #3038

Differential Revision: D6086501

Pulled By: ajkr

fbshipit-source-id: 3ec0d8666c78ac507f1f808887c4ff759ba9b865
@ajkr

ajkr commented Oct 25, 2017

Thanks a lot for the detailed review, @igorcanadi!

@igorcanadi

Thank you for great work :)

@ajkr ajkr force-pushed the del-triggered-compact-bottom-level branch from 85a13c0 to c10a7cd Compare October 25, 2017 23:06
@facebook-github-bot

@ajkr has updated the pull request.

@facebook-github-bot

@ajkr has updated the pull request. View: changes, changes since last import

@facebook-github-bot facebook-github-bot left a comment

@ajkr is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

sagar0 pushed a commit that referenced this pull request Oct 27, 2017
Summary:
When snapshots are held for a long time, files containing overwritten/deleted keys may reach the bottom level. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted by second-to-last-level files moving down into them. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.

- Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
- Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
- Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
- Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
Closes #3009

Differential Revision: D6062044

Pulled By: ajkr

fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
siying added a commit to siying/rocksdb that referenced this pull request Mar 22, 2019
Summary: With facebook#3009 we go through every CF
to check whether a bottommost compaction needs to be triggered. This is done
within the DB mutex. What we do within the DB mutex may heavily influence the write throughput
we can achieve, so we always want to minimize work there.

Here we try to avoid this for-loop by first checking a global threshold. Most of
the time, the CF loop can be avoided.

Test Plan: Run all existing tests.
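The fast path this commit describes can be sketched as below. This is an illustrative model, not the actual patch: `ColumnFamily`, `MaybeScheduleBottommostCompactions`, and the threshold field are hypothetical stand-ins for the real structures.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each CF caches the smallest snapshot seqno that would newly mark one
// of its bottommost files (cf. bottommost_files_mark_threshold_).
struct ColumnFamily {
  uint64_t bottommost_files_mark_threshold;
};

// The DB additionally keeps the minimum threshold across all CFs, so a
// snapshot release can usually bail out without touching any CF while
// still holding the DB mutex. cfs_scanned counts the per-CF work done.
bool MaybeScheduleBottommostCompactions(
    const std::vector<ColumnFamily>& cfs, uint64_t global_threshold,
    uint64_t new_oldest_snapshot, int* cfs_scanned) {
  *cfs_scanned = 0;
  if (new_oldest_snapshot <= global_threshold) {
    return false;  // common case: no CF can have a newly markable file
  }
  bool scheduled = false;
  for (const auto& cf : cfs) {  // slow path: recheck each CF
    ++*cfs_scanned;
    if (new_oldest_snapshot > cf.bottommost_files_mark_threshold) {
      scheduled = true;  // recompute marked files, schedule compaction
    }
  }
  return scheduled;
}
```

Since `global_threshold` is the minimum over all per-CF thresholds, skipping the loop when the check fails is safe: no individual CF check could have succeeded.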
facebook-github-bot pushed a commit that referenced this pull request Mar 26, 2019
Summary:
With #3009 we go through every CF
to check whether a bottommost compaction needs to be triggered. This is done
within the DB mutex. What we do within the DB mutex may heavily influence the write throughput
we can achieve, so we always want to minimize work there.

Here we try to avoid this for-loop by first checking a global threshold. Most of
the time, the CF loop can be avoided.
Pull Request resolved: #5090

Differential Revision: D14582684

Pulled By: siying

fbshipit-source-id: 968f6d9bb6affe1a5ebc4910b418300b076f166f
vagogte pushed a commit to vagogte/rocksdb that referenced this pull request Jun 18, 2019
Summary:
With facebook#3009 we go through every CF
to check whether a bottommost compaction needs to be triggered. This is done
within the DB mutex. What we do within the DB mutex may heavily influence the write throughput
we can achieve, so we always want to minimize work there.

Here we try to avoid this for-loop by first checking a global threshold. Most of
the time, the CF loop can be avoided.
Pull Request resolved: facebook#5090

Differential Revision: D14582684

Pulled By: siying

fbshipit-source-id: 968f6d9bb6affe1a5ebc4910b418300b076f166f
facebook-github-bot pushed a commit that referenced this pull request Aug 17, 2023
Summary:
For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommostFiles" that compacts bottommost-level files to clear data held by snapshots (more detail in #3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small amount of keys that can be cleared, so compacting such a file has high write amplification. In addition, these bottommost files may be compacted in compactions with reasons other than "kBottommostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost-level single-file compactions.

* The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()`, where we only add a file to `bottommost_files_marked_for_compaction_` if the oldest snapshot is larger than the file's non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), when a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`), or when a compaction is picked and the compaction score has to be re-calculated.
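The eligibility rule described above can be sketched as a single predicate. This is a simplified illustration, not the patch itself: `ShouldMarkBottommostFile` and its parameters are hypothetical names for the quantities involved.

```cpp
#include <cassert>
#include <cstdint>

// A bottommost file is marked for this kind of compaction only when the
// oldest snapshot has moved past the file's (non-zero) largest seqno AND
// the file has existed for at least the configured delay, mirroring
// bottommost_file_compaction_delay.
bool ShouldMarkBottommostFile(uint64_t oldest_snapshot,
                              uint64_t file_largest_seqno,
                              uint64_t now_seconds,
                              uint64_t file_creation_time_seconds,
                              uint64_t delay_seconds) {
  if (file_largest_seqno == 0 || oldest_snapshot <= file_largest_seqno) {
    return false;  // some snapshot may still see the stale entries
  }
  // Old enough: the delay has elapsed since the file was created.
  return now_seconds - file_creation_time_seconds >= delay_seconds;
}
```

A file that fails only the age check is simply reconsidered on the next `ComputeBottommostFilesMarkedForCompaction()` call, trading a small delay in marking for a simpler implementation.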

Pull Request resolved: #11701

Test Plan:
* Add two unit tests to test when bottommost_file_compaction_delay > 0.
* Ran crash test with the new option.

Reviewed By: jaykorean, ajkr

Differential Revision: D48331564

Pulled By: cbi42

fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8