Integrate compaction resumption with DB::OpenAndCompact() #13984
hx235 wants to merge 4 commits into facebook:main
Conversation
    // Track total processed input records for progress reporting by combining:
    // - Resumed count: records already processed before compaction was
    //   interrupted
    // - Current count: records scanned in the current compaction session
    // Only update when both tracking mechanisms provide accurate counts to
    // ensure reliability.
    subcompaction_progress.num_processed_input_records =
        c_iter->HasNumInputEntryScanned() &&
                sub_compact->compaction_job_stats.has_accurate_num_input_records
            ? c_iter->NumInputEntryScanned() +
                  sub_compact->compaction_job_stats.num_input_records
            : 0;
Bug fix to #13983 for the case where a cancelled-and-resumed compaction is cancelled again. Will be tested in the stress test support PR that's coming up.
jaykorean
left a comment
Still need more time to review, but leaving the first set of comments
db/db_impl/db_impl_secondary.cc (Outdated)

      return s;
    } else if (latest_compaction_progress_file.empty()) {
      // CASE 1: Resume requested but no progress file found - start fresh
      ROCKS_LOG_INFO(immutable_db_options_.info_log,
nit: this may spam the logger
Thanks! I think it can be replaced by existing logging already.
Actually, I took a look and concluded this is the essential log for figuring out why resuming is not happening. It's only printed once per OpenAndCompact() call. Is there any concern about that frequency?
At the end of the day, the value for this log is that it tells us there was no prior compaction progress which should be the majority of the cases.
IMO, as long as we can tell if the attempted compaction (whether it succeeded or failed) had had prior compaction or not at the end would be sufficient.
I'm okay with logging this for now in the initial rollout, to see if we get too many of these, and removing it later.
I see. For "had had prior compaction or not at the end would be sufficient", what would be your proposed way to tell, by the way?
what would be your proposed way to tell?
wondering if CompactionOutputs::Output can have a simple boolean field (e.g. restored). We can set it to true in CompactionJob::RestoreCompactionOutputs(), and then log it during CompactionJob::VerifyOutputFiles().
This could give us an immediate answer, at the verification step, as to whether an output file was restored from a prior compaction or created by the current compaction.
Hi - I will keep this in mind. Thanks!
    // LIMITATION: Don't save progress if the current key has already been
    // scanned (looked ahead) in the input but not yet output. This can happen
    // with merge operations, single deletes, and deletes at the bottommost
    // level where CompactionIterator needs to look ahead to process multiple
    // entries for the same user key before outputting a result. If we saved
    // progress and resumed at this boundary, the resumed session would see
    // and process the same input key again through Seek(), leading to
    // incorrect double-counting in the number of processed input entries and
    // input count verification failure.
    //
    // TODO(hx235): Offset num_processed_input_records to avoid double
    // counting instead of disabling progress persistence.
    if (c_iter->IsCurrentKeyAlreadyScanned()) {
      return false;
    }
@jaykorean another bug fix motivated by the stress test, in the case where next_key_to_compact can already be counted in num_processed_input_records. See the new UT in compaction_job_test.cc. I have some idea of the fix so we can relax this constraint, but it's not going to block this PR.
    // TODO(hx235): Resuming compaction is currently incompatible with
    // paranoid_file_checks=true because OutputValidator hash verification
    // would fail during compaction resumption. Before interruption, resuming
    // compaction needs to persist the hash of each output file to enable
    // validation after resumption. Alternatively and preferably, we could
    // move the output verification to happen immediately after each output
    // file is created. This workaround currently disables resuming compaction
    // when paranoid_file_checks is enabled. Note that paranoid_file_checks is
    // disabled by default.
    bool allow_resumption =
        options.allow_resumption &&
        !cfd->GetLatestMutableCFOptions().paranoid_file_checks;
jaykorean
left a comment
Mostly nitpick comments. The only thing that I'd suggest addressing is DeleteFileIfExists()
Almost there!
    {{KeyStr("a", 1U, kTypeValue), "val1"},
     {KeyStr("b", 2U, kTypeValue), "val2"}} /* input_file_1 */,
    {{KeyStr("bb", 3U, kTypeValue), "val3"},
     {KeyStr(kCancelBeforeThisKey, 4U, kTypeMerge),
    // - Phase 3: Resume with resumption support (loads Phase 1's progress)
    //   and complete
    // Validates: Resumption state can be toggled;
    kMultipleCancelToggleResumption
      s = FinalizeCompactionProgressWriter(compaction_progress_writer);
    }
    if (!s.ok()) {
      return s;
Doesn't have to be in this PR, but just one thing to think about.
Do we want to fail the whole compaction when FinalizeCompactionProgressWriter() fails due to some transient filesystem issue? Wondering if we can simply proceed with the compaction without compaction_progress_writer.
At this point, we know that we've successfully finished PrepareCompactionProgressState(), meaning that the tmp dir either contains the latest compaction progress or is completely clean. If the current compaction (without the progress writer) fails again, then the next compaction with allow_resumption=true will attempt to resume from prior progress again.
This may complicate things or potentially create bug-prone race scenarios, so it might not be worth handling, but just wanted to bring it up as a potential future optimization.
    // - OpenAndCompact() starts a fresh compaction from scratch.
    // - No progress will be saved during execution, so interruptions require
    //   starting over completely.
    // - CRITICAL REQUIREMENT: The `output_directory` associated MUST be empty
jaykorean
left a comment
Awesome work! Thanks for the prior refactor PRs to make the compaction flow code more readable and modular. I also appreciate the well-thought-out test coverage and detailed comments on the APIs.
…paction (#14041)

Summary:
Context/Summary:
- Add resumable compaction to stress test with adaptive progress cancellation
- Add fault injection to remote compaction
- Fix a real minor bug and a couple of testing framework bugs with remote compaction

Pull Request resolved: #14041

Test Plan: Rehearsal stress test, which found bugs for #13984 effectively and did not create new failures.

Reviewed By: jaykorean

Differential Revision: D84524194

Pulled By: hx235

fbshipit-source-id: 42b4264e428c6739631ed9aa5eb02723367510bc
Context/Summary:
This is stacked on top of #13983 and integrates compaction resumption with OpenAndCompact().
Flow of resuming: DB::OpenAndCompact() -> Compaction progress file -> SubcompactionProgress -> CompactionJob
Flow of persistence: CompactionJob -> SubcompactionProgress -> Compaction progress file -> DB that is called with OpenAndCompact()
This PR focuses on the DB::OpenAndCompact() -> Compaction progress file -> SubcompactionProgress part of the resume flow and the Compaction progress file -> DB called with OpenAndCompact() part of the persistence flow.
Resume Flow
Progress File Creation
Test plan:
Performance testing:
1. Latency
Using allow_resumption = false
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 26766256.000 micros/op 0 ops/sec 26.766 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 27837249.000 micros/op 0 ops/sec 27.837 seconds 1 operations;
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 26546234.000 micros/op 0 ops/sec 26.546 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 27918621.000 micros/op 0 ops/sec 27.919 seconds 1 operations;
OpenAndCompact [AVG 2 runs] : 0 (± 0) ops/sec
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 42243571.000 micros/op 0 ops/sec 42.244 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 43497581.000 micros/op 0 ops/sec 43.498 seconds 1 operations;
OpenAndCompact [AVG 3 runs] : 0 (± 0) ops/sec
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 34241357.000 micros/op 0 ops/sec 34.241 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 35655346.000 micros/op 0 ops/sec 35.655 seconds 1 operations;
OpenAndCompact [AVG 4 runs] : 0 (± 0) ops/sec
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 27083361.000 micros/op 0 ops/sec 27.083 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 28487999.000 micros/op 0 ops/sec 28.488 seconds 1 operations;
OpenAndCompact [AVG 5 runs] : 0 (± 0) ops/sec
OpenAndCompact [AVG 5 runs] : 0 (± 0) ops/sec; 31669.681 ms/op
OpenAndCompact [MEDIAN 5 runs] : 0 ops/sec
allow_resumption = true
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 25446470.000 micros/op 0 ops/sec 25.446 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 26833415.000 micros/op 0 ops/sec 26.833 seconds 1 operations;
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 240745.000 micros/op 0 ops/sec 0.241 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 244934.000 micros/op 4 ops/sec 0.245 seconds 1 operations;
OpenAndCompact [AVG 2 runs] : 2 (± 3) ops/sec
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 24843383.000 micros/op 0 ops/sec 24.843 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 26192235.000 micros/op 0 ops/sec 26.192 seconds 1 operations;
OpenAndCompact [AVG 3 runs] : 1 (± 2) ops/sec
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 270819.000 micros/op 0 ops/sec 0.271 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 275140.000 micros/op 3 ops/sec 0.275 seconds 1 operations;
OpenAndCompact [AVG 4 runs] : 1 (± 2) ops/sec
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 23038311.000 micros/op 0 ops/sec 23.038 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 24439097.000 micros/op 0 ops/sec 24.439 seconds 1 operations;
OpenAndCompact [AVG 5 runs] : 1 (± 1) ops/sec
OpenAndCompact [AVG 5 runs] : 1 (± 1) ops/sec; 638.417 ms/op
OpenAndCompact [MEDIAN 5 runs] : 0 ops/sec
Persistence cost: Compare only the odd-numbered OpenAndCompact() runs across the two configurations -> it's actually faster.
Resumption saving: Compare only the even-numbered OpenAndCompact() runs across the two configurations -> (26.766 - 0.2) / 26.766 * 100 ≈ 99.25% improvement, reflecting the fact that without the allow_resumption feature, all the compaction progress is redone.
2. Memory usage (in case SubcompactionProgress storing its own in-memory copies of output file metadata in https://github.com/facebook/rocksdb/pull/13983/files is a problem)
allow_resumption = false
Peak memory: 275828 KB
allow_resumption = true
Peak memory: 277204 KB (a 0.49% memory usage regression, most likely due to storing its own copies of output files' file metadata in the subcompaction progress)
Near-term follow-up: