single-file bottom-level compaction when snapshot released #3009
Conversation
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Force-pushed from d2c735f to e39f2cc
@ajkr has updated the pull request.
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ajkr has updated the pull request.
Force-pushed from 1b0389f to e39f2cc
@ajkr has updated the pull request.
@ajkr has updated the pull request.
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Looks pretty good.
I'd prefer not to touch the unrelated files, iterator.cc and slice_test.cc, in this PR (which were updated by `make format`, I think).
Sure, I'll revert them. I need to be careful not to run `make format`.
Force-pushed from 0c36298 to 85a13c0
@ajkr has updated the pull request.
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This looks great, @ajkr. I have a couple of small comments inline, but overall this is a high-quality PR!
```
start_level_inputs_.level = output_level_ = start_level_ =
    level_and_file.first;
start_level_inputs_.files = {level_and_file.second};
if (compaction_picker_->ExpandInputsToCleanCut(cf_name_, vstorage_,
```
Can you explain why you need this? If you are compacting only a bottom-most file, it doesn't have any other files to compact with, right? In all cases you'll only ever compact a single file, as I understand it?
This diff is expected to compact Lmax-1 as well, if nothing is overlapping in the bottommost level.
@igorcanadi - files in the bottom level can have the same user key at their endpoints. Example:
- file 1's last key is deletion of user key A with seqnum 20
- file 2's first key is put of user key A with seqnum 10
If they are compacted alone, compaction of file 1 will drop the deletion, and compaction of file 2 will keep the user key. We need to make sure file 1 and file 2 are picked together for compaction; then, both of these internal keys will be dropped.
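For illustration, here is a minimal sketch of that boundary condition; the `FileMeta` struct and the simple equality check are simplified stand-ins for what `ExpandInputsToCleanCut()` handles with internal keys, not the actual RocksDB implementation:

```cpp
#include <string>
#include <vector>

// Simplified stand-in for a file's key range. Real RocksDB boundaries are
// internal keys (user key + seqnum + type); only the user key matters here.
struct FileMeta {
  std::string smallest_user_key;
  std::string largest_user_key;
};

// Files in a bottommost level, sorted by key. Returns true if file i and
// file i + 1 share a user key at their boundary, in which case they must be
// compacted together: otherwise dropping the deletion in one file would
// resurrect the overwritten value in the other.
bool MustCompactTogether(const std::vector<FileMeta>& level_files, size_t i) {
  return level_files[i].largest_user_key ==
         level_files[i + 1].smallest_user_key;
}
```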
Got it, good point :)
db/compaction_picker.cc
Outdated
```
    if (i == vstorage_->BottommostFilesMarkedForCompaction().size()) {
      start_level_inputs_.clear();
    }
  }
  if (!start_level_inputs_.empty()) {
    compaction_reason_ = CompactionReason::kFilesMarkedForCompaction;
```
Don't you want to add another reason, which is `kBottommostFiles`?
Sure, I was a bit lazy here.
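A simplified sketch of what the suggestion amounts to: a dedicated reason lets stats and logs distinguish bottommost-file rewrites from other marked-file compactions. The member names below are illustrative; only `kFilesMarkedForCompaction` appears in the quoted diff above.

```cpp
// Hypothetical, trimmed-down enum; the real RocksDB CompactionReason enum
// has many more members.
enum class CompactionReason {
  kFilesMarkedForCompaction,  // files explicitly marked for compaction
  kBottommostFiles,           // bottommost files rewritten on snapshot release
};
```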
db/version_set.cc
Outdated
```
@@ -1697,6 +1701,58 @@ void VersionStorageInfo::GenerateLevel0NonOverlapping() {
  }
}

void VersionStorageInfo::UpdateBottommostFiles() {
```
Nit: This is not updating a bottommost file list, it's actually generating it, right?
Yes that sounds right.
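As a concrete illustration of "generating" that list, a rough sketch of what the computation could look like; the structures and the brute-force overlap check are simplified assumptions for clarity, not the actual `VersionStorageInfo` code:

```cpp
#include <string>
#include <utility>
#include <vector>

// Simplified file key range; real RocksDB compares internal keys with a
// user-defined comparator.
struct FileRange {
  std::string smallest_user_key;
  std::string largest_user_key;
};

// levels[l] holds the files of level l. Returns (level, file index) pairs for
// files with no overlapping file in any lower (higher-numbered) level, i.e.
// the "bottommost" files for their key ranges.
std::vector<std::pair<size_t, size_t>> GenerateBottommostFiles(
    const std::vector<std::vector<FileRange>>& levels) {
  std::vector<std::pair<size_t, size_t>> bottommost;
  for (size_t level = 0; level < levels.size(); ++level) {
    for (size_t i = 0; i < levels[level].size(); ++i) {
      const FileRange& f = levels[level][i];
      bool overlapped = false;
      for (size_t lower = level + 1; lower < levels.size() && !overlapped;
           ++lower) {
        for (const FileRange& g : levels[lower]) {
          // Ranges overlap unless one ends strictly before the other begins.
          if (!(g.largest_user_key < f.smallest_user_key ||
                f.largest_user_key < g.smallest_user_key)) {
            overlapped = true;
            break;
          }
        }
      }
      if (!overlapped) {
        bottommost.emplace_back(level, i);
      }
    }
  }
  return bottommost;
}
```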
db/version_set.h
Outdated
```
// Among bottommost files (assumes they've already been computed), marks the
// ones that have keys that would be eliminated if recompacted, according to
// the seqnum of the oldest existing snapshot.
// REQUIRES: DB mutex held
```
Nit: I would add a comment saying that, since its behavior depends on `oldest_sequence_number_`, it has to be called every time `oldest_sequence_number_` changes.
Sure.
Summary: Add options to `db_stress` (correctness testing tool) to randomly acquire a snapshot and release it after some period of time. It's useful for correctness testing of #3009, as well as other parts of compaction that behave differently depending on which snapshots are held. Closes #3038 Differential Revision: D6086501 Pulled By: ajkr fbshipit-source-id: 3ec0d8666c78ac507f1f808887c4ff759ba9b865
Thanks a lot for the detailed review, @igorcanadi!
Thank you for the great work :)
Force-pushed from 85a13c0 to c10a7cd
@ajkr has updated the pull request.
@ajkr has updated the pull request.
@ajkr is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys. - Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys. - Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases. - Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called. - Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction. Closes #3009 Differential Revision: D6062044 Pulled By: ajkr fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
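To make the marking and caching idea in the summary concrete, here is a minimal sketch under simplified assumptions; the struct, field names, and the free function are illustrative, not the actual `ComputeBottommostFilesMarkedForCompaction()` code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified view of a bottommost file; largest_seqno is 0 once the file has
// already been rewritten and has no snapshot-protected entries left to drop.
struct BottommostFile {
  uint64_t largest_seqno;
};

// Marks files whose newest entry is now older than every live snapshot, so
// recompacting them can drop overwritten/deleted keys. Returns the smallest
// largest_seqno among files that are not yet markable; caching this value
// (cf. bottommost_files_mark_threshold_) lets ReleaseSnapshot() skip the
// recomputation unless the new oldest snapshot actually crosses it.
uint64_t MarkBottommostFiles(const std::vector<BottommostFile>& files,
                             uint64_t oldest_snapshot_seqnum,
                             std::vector<size_t>* marked) {
  uint64_t mark_threshold = UINT64_MAX;
  for (size_t i = 0; i < files.size(); ++i) {
    if (files[i].largest_seqno == 0) {
      continue;  // already rewritten; nothing to drop
    }
    if (files[i].largest_seqno < oldest_snapshot_seqnum) {
      marked->push_back(i);
    } else {
      mark_threshold = std::min(mark_threshold, files[i].largest_seqno);
    }
  }
  return mark_threshold;
}
```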
Summary: With #3009 we go through every CF to check whether a bottommost compaction needs to be triggered. This is done within the DB mutex. What we do within the DB mutex may heavily influence the write throughput we can achieve, so we always want to minimize work there. Here we try to avoid this for-loop by first checking a global threshold. Most of the time, the CF loop can be avoided. Pull Request resolved: #5090 Differential Revision: D14582684 Pulled By: siying fbshipit-source-id: 968f6d9bb6affe1a5ebc4910b418300b076f166f
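A simplified sketch of the optimization described above; the atomic variable and function names are assumptions used for illustration only, not the actual members added in #5090:

```cpp
#include <atomic>
#include <cstdint>

// Keep one global threshold equal to the minimum of all column families'
// per-CF mark thresholds. On snapshot release, only loop over column families
// (under the DB mutex) when the new oldest snapshot actually crosses it.
std::atomic<uint64_t> bottommost_files_mark_threshold_global{UINT64_MAX};

bool NeedPerCfRecomputation(uint64_t new_oldest_snapshot_seqnum) {
  return new_oldest_snapshot_seqnum >
         bottommost_files_mark_threshold_global.load(std::memory_order_relaxed);
}
```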
Summary: For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in #3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small number of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions. * The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if the oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and the compaction score has to be re-calculated. Pull Request resolved: #11701 Test Plan: * Add two unit tests to test when bottommost_file_compaction_delay > 0. * Ran crash test with the new option. Reviewed By: jaykorean, ajkr Differential Revision: D48331564 Pulled By: cbi42 fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8
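A minimal sketch of the eligibility check described above, under the assumption that file age is measured from its creation time; the parameter names are illustrative, not the exact RocksDB members:

```cpp
#include <cstdint>

// A bottommost file becomes a compaction candidate only when (1) the oldest
// live snapshot has passed its newest entry, so its deleted/overwritten keys
// can actually be dropped, and (2) it has aged at least delay_s seconds
// (bottommost_file_compaction_delay), avoiding high-write-amp rewrites of
// files that may soon be picked up by a regular compaction anyway.
bool EligibleForBottommostCompaction(uint64_t largest_seqno,
                                     uint64_t oldest_snapshot_seqnum,
                                     uint64_t file_creation_time_s,
                                     uint64_t now_s,
                                     uint64_t delay_s) {
  if (largest_seqno == 0 || largest_seqno >= oldest_snapshot_seqnum) {
    return false;  // still protected by a snapshot, or nothing left to drop
  }
  return now_s >= file_creation_time_s + delay_s;
}
```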
When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.
- Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
- Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that the oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
- Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
- Extracted the core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()`, since logic to check whether a file is bottommost is now necessary outside of compaction.

Test Plan:
Populate DB command:
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=randomtransaction -num=50000000 -transaction_db=1 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12
```
Benchmark command:
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=randomtransaction -use_existing_db=1 -num=1000000 -transaction_db=1 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -transaction_set_snapshot=1
```
Before:
```
randomtransaction : 54.730 micros/op 18271 ops/sec; 0.4 MB/s ( transactions:1000000 aborts:0)
```
After:
```
randomtransaction : 54.267 micros/op 18427 ops/sec; 0.4 MB/s ( transactions:1000000 aborts:0)
```
Before:
```
0.03% 0.00% db_bench db_bench [.] rocksdb::DBImpl::ReleaseSnapshot
```
After:
```
0.12% 0.07% db_bench db_bench [.] rocksdb::DBImpl::ReleaseSnapshot
```