single-file bottom-level compaction when snapshot released #3009
Conversation
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Force-pushed from d2c735f to e39f2cc
@ajkr has updated the pull request.
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ajkr has updated the pull request.
Force-pushed from 1b0389f to e39f2cc
@ajkr has updated the pull request.
@ajkr has updated the pull request.
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Looks pretty good.
I'd prefer not to touch the unrelated files, iterator.cc and slice_test.cc, in this PR (which were updated by `make format`, I think).
Sure, I'll revert them. I need to be careful not to run `make format`.
Force-pushed from 0c36298 to 85a13c0
@ajkr has updated the pull request.
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This looks great, @ajkr. I have a couple of small comments inline, but overall this is a high-quality PR!
```
start_level_inputs_.level = output_level_ = start_level_ =
    level_and_file.first;
start_level_inputs_.files = {level_and_file.second};
if (compaction_picker_->ExpandInputsToCleanCut(cf_name_, vstorage_,
```
Can you explain why you need this? If you are compacting only a bottom-most file, it doesn't have any other files to compact with, right? In all cases you'll only ever compact a single file, as I understand it?
This diff is expected to compact Lmax-1 as well, if nothing is overlapping in the bottommost level.
@igorcanadi - files in the bottom level can have the same user key at their endpoints. Example:
- file 1's last key is deletion of user key A with seqnum 20
- file 2's first key is put of user key A with seqnum 10
If they are compacted alone, compaction of file 1 will drop the deletion, and compaction of file 2 will keep the user key. We need to make sure file 1 and file 2 are picked together for compaction; then, both of these internal keys will be dropped.
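For illustration, here is a minimal sketch of that boundary condition; the `FileMeta` struct and the simple equality check are simplified stand-ins for what `ExpandInputsToCleanCut()` handles with internal keys, not the actual RocksDB implementation:

```cpp
#include <string>
#include <vector>

// Simplified stand-in for a file's key range. Real RocksDB boundaries are
// internal keys (user key + seqnum + type); only the user key matters here.
struct FileMeta {
  std::string smallest_user_key;
  std::string largest_user_key;
};

// Files in a bottommost level, sorted by key. Returns true if file i and
// file i + 1 share a user key at their boundary, in which case they must be
// compacted together: otherwise dropping the deletion in one file would
// resurrect the overwritten value in the other.
bool MustCompactTogether(const std::vector<FileMeta>& level_files, size_t i) {
  return level_files[i].largest_user_key ==
         level_files[i + 1].smallest_user_key;
}
```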
Got it, good point :)
db/compaction_picker.cc
Outdated
```
    if (i == vstorage_->BottommostFilesMarkedForCompaction().size()) {
      start_level_inputs_.clear();
    }
  }
  if (!start_level_inputs_.empty()) {
    compaction_reason_ = CompactionReason::kFilesMarkedForCompaction;
```
Don't you want to add another reason, which is `kBottommostFiles`?
Sure, I was a bit lazy here.
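A simplified sketch of what the suggestion amounts to: a dedicated reason lets stats and logs distinguish bottommost-file rewrites from other marked-file compactions. The member names below are illustrative; only `kFilesMarkedForCompaction` appears in the quoted diff above.

```cpp
// Hypothetical, trimmed-down enum; the real RocksDB CompactionReason enum
// has many more members.
enum class CompactionReason {
  kFilesMarkedForCompaction,  // files explicitly marked for compaction
  kBottommostFiles,           // bottommost files rewritten on snapshot release
};
```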
db/version_set.cc
Outdated
```
@@ -1697,6 +1701,58 @@ void VersionStorageInfo::GenerateLevel0NonOverlapping() {
  }
}

void VersionStorageInfo::UpdateBottommostFiles() {
```
Nit: This is not updating a bottommost file list, it's actually generating it, right?
Yes that sounds right.
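As a concrete illustration of "generating" that list, a rough sketch of what the computation could look like; the structures and the brute-force overlap check are simplified assumptions for clarity, not the actual `VersionStorageInfo` code:

```cpp
#include <string>
#include <utility>
#include <vector>

// Simplified file key range; real RocksDB compares internal keys with a
// user-defined comparator.
struct FileRange {
  std::string smallest_user_key;
  std::string largest_user_key;
};

// levels[l] holds the files of level l. Returns (level, file index) pairs for
// files with no overlapping file in any lower (higher-numbered) level, i.e.
// the "bottommost" files for their key ranges.
std::vector<std::pair<size_t, size_t>> GenerateBottommostFiles(
    const std::vector<std::vector<FileRange>>& levels) {
  std::vector<std::pair<size_t, size_t>> bottommost;
  for (size_t level = 0; level < levels.size(); ++level) {
    for (size_t i = 0; i < levels[level].size(); ++i) {
      const FileRange& f = levels[level][i];
      bool overlapped = false;
      for (size_t lower = level + 1; lower < levels.size() && !overlapped;
           ++lower) {
        for (const FileRange& g : levels[lower]) {
          // Ranges overlap unless one ends strictly before the other begins.
          if (!(g.largest_user_key < f.smallest_user_key ||
                f.largest_user_key < g.smallest_user_key)) {
            overlapped = true;
            break;
          }
        }
      }
      if (!overlapped) {
        bottommost.emplace_back(level, i);
      }
    }
  }
  return bottommost;
}
```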
db/version_set.h
Outdated
```
// Among bottommost files (assumes they've already been computed), marks the
// ones that have keys that would be eliminated if recompacted, according to
// the seqnum of the oldest existing snapshot.
// REQUIRES: DB mutex held
```
Nit: I would add a comment saying that, since its behavior depends on `oldest_sequence_number_`, it has to be called every time `oldest_sequence_number_` changes.
Sure.
Summary: Add options to `db_stress` (correctness testing tool) to randomly acquire a snapshot and release it after some period of time. It's useful for correctness testing of #3009, as well as other parts of compaction that behave differently depending on which snapshots are held. Closes #3038 Differential Revision: D6086501 Pulled By: ajkr fbshipit-source-id: 3ec0d8666c78ac507f1f808887c4ff759ba9b865
Thanks a lot for the detailed review, @igorcanadi!
Thank you for the great work :)
Force-pushed from 85a13c0 to c10a7cd
@ajkr has updated the pull request.
@ajkr has updated the pull request.
@ajkr is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys. - Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys. - Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases. - Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called. - Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction. Closes #3009 Differential Revision: D6062044 Pulled By: ajkr fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
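To make the marking and caching idea in the summary concrete, here is a minimal sketch under simplified assumptions; the struct, field names, and the free function are illustrative, not the actual `ComputeBottommostFilesMarkedForCompaction()` code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified view of a bottommost file; largest_seqno is 0 once the file has
// already been rewritten and has no snapshot-protected entries left to drop.
struct BottommostFile {
  uint64_t largest_seqno;
};

// Marks files whose newest entry is now older than every live snapshot, so
// recompacting them can drop overwritten/deleted keys. Returns the smallest
// largest_seqno among files that are not yet markable; caching this value
// (cf. bottommost_files_mark_threshold_) lets ReleaseSnapshot() skip the
// recomputation unless the new oldest snapshot actually crosses it.
uint64_t MarkBottommostFiles(const std::vector<BottommostFile>& files,
                             uint64_t oldest_snapshot_seqnum,
                             std::vector<size_t>* marked) {
  uint64_t mark_threshold = UINT64_MAX;
  for (size_t i = 0; i < files.size(); ++i) {
    if (files[i].largest_seqno == 0) {
      continue;  // already rewritten; nothing to drop
    }
    if (files[i].largest_seqno < oldest_snapshot_seqnum) {
      marked->push_back(i);
    } else {
      mark_threshold = std::min(mark_threshold, files[i].largest_seqno);
    }
  }
  return mark_threshold;
}
```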
Summary: With #3009 we go through every CF to check whether a bottommost compaction needs to be triggered. This is done within the DB mutex. What we do within the DB mutex may heavily influence the write throughput we can achieve, so we always want to minimize work there. Here we try to avoid this for-loop by first checking a global threshold. Most of the time, the CF loop can be avoided. Pull Request resolved: #5090 Differential Revision: D14582684 Pulled By: siying fbshipit-source-id: 968f6d9bb6affe1a5ebc4910b418300b076f166f
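A simplified sketch of the optimization described above; the atomic variable and function names are assumptions used for illustration only, not the actual members added in #5090:

```cpp
#include <atomic>
#include <cstdint>

// Keep one global threshold equal to the minimum of all column families'
// per-CF mark thresholds. On snapshot release, only loop over column families
// (under the DB mutex) when the new oldest snapshot actually crosses it.
std::atomic<uint64_t> bottommost_files_mark_threshold_global{UINT64_MAX};

bool NeedPerCfRecomputation(uint64_t new_oldest_snapshot_seqnum) {
  return new_oldest_snapshot_seqnum >
         bottommost_files_mark_threshold_global.load(std::memory_order_relaxed);
}
```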
Summary: For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in #3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small number of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions. * The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if the oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and the compaction score has to be re-calculated. Pull Request resolved: #11701 Test Plan: * Add two unit tests to test when bottommost_file_compaction_delay > 0. * Ran crash test with the new option. Reviewed By: jaykorean, ajkr Differential Revision: D48331564 Pulled By: cbi42 fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8
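A minimal sketch of the eligibility check described above, under the assumption that file age is measured from its creation time; the parameter names are illustrative, not the exact RocksDB members:

```cpp
#include <cstdint>

// A bottommost file becomes a compaction candidate only when (1) the oldest
// live snapshot has passed its newest entry, so its deleted/overwritten keys
// can actually be dropped, and (2) it has aged at least delay_s seconds
// (bottommost_file_compaction_delay), avoiding high-write-amp rewrites of
// files that may soon be picked up by a regular compaction anyway.
bool EligibleForBottommostCompaction(uint64_t largest_seqno,
                                     uint64_t oldest_snapshot_seqnum,
                                     uint64_t file_creation_time_s,
                                     uint64_t now_s,
                                     uint64_t delay_s) {
  if (largest_seqno == 0 || largest_seqno >= oldest_snapshot_seqnum) {
    return false;  // still protected by a snapshot, or nothing left to drop
  }
  return now_s >= file_creation_time_s + delay_s;
}
```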
When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.
- Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
- Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that the oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
- Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
- Extracted the core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()`, since logic to check whether a file is bottommost is now necessary outside of compaction.

Test Plan:
Populate DB command:
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=randomtransaction -num=50000000 -transaction_db=1 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12
```
Benchmark command:
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=randomtransaction -use_existing_db=1 -num=1000000 -transaction_db=1 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -transaction_set_snapshot=1
```
Before:
```
randomtransaction : 54.730 micros/op 18271 ops/sec; 0.4 MB/s ( transactions:1000000 aborts:0)
```
After:
```
randomtransaction : 54.267 micros/op 18427 ops/sec; 0.4 MB/s ( transactions:1000000 aborts:0)
```
Before:
```
0.03% 0.00% db_bench db_bench [.] rocksdb::DBImpl::ReleaseSnapshot
```
After:
```
0.12% 0.07% db_bench db_bench [.] rocksdb::DBImpl::ReleaseSnapshot
```