
Integrate compaction resumption with DB::OpenAndCompact()#13984

Closed
hx235 wants to merge 4 commits into facebook:main from hx235:resumable_compaction_secondary_db

Conversation

hx235 (Contributor) commented on Sep 20, 2025:

Context/Summary:

This is stacked on top of #13983 and integrates compaction resumption with OpenAndCompact().

Flow of resuming: DB::OpenAndCompact() -> Compaction progress file -> SubcompactionProgress -> CompactionJob
Flow of persistence: CompactionJob -> SubcompactionProgress -> Compaction progress file -> DB opened with OpenAndCompact()

This PR focuses on the DB::OpenAndCompact() -> Compaction progress file -> SubcompactionProgress leg and the Compaction progress file -> DB opened with OpenAndCompact() leg.
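The progress-file legs of these flows boil down to a serialize/parse roundtrip. The struct and the line-based text format below are illustrative stand-ins for this description only, not the PR's actual SubcompactionProgress layout or on-disk format:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical, simplified stand-in for SubcompactionProgress: only the
// fields needed to illustrate the persist/parse roundtrip.
struct SubcompactionProgress {
  uint64_t num_processed_input_records = 0;
  std::string next_key_to_compact;
};

// Persistence leg: CompactionJob -> SubcompactionProgress -> file contents.
std::string PersistProgress(const SubcompactionProgress& p) {
  std::ostringstream out;
  out << p.num_processed_input_records << '\n' << p.next_key_to_compact << '\n';
  return out.str();
}

// Resume leg: file contents -> SubcompactionProgress -> CompactionJob.
// Returns false on a malformed file, in which case the caller would fall
// back to a fresh compaction.
bool ParseProgress(const std::string& contents, SubcompactionProgress* p) {
  std::istringstream in(contents);
  if (!(in >> p->num_processed_input_records)) return false;
  in.ignore();  // skip the newline after the record count
  return static_cast<bool>(std::getline(in, p->next_key_to_compact));
}
```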

Resume Flow

  1. Check the configuration. Right now paranoid_file_checks=true (false by default) is not yet compatible with allow_resumption=true. Also, only a single subcompaction is supported, as OpenAndCompact() does not partition the compaction anyway.
  2. Scan the compaction output directory for the latest, old, and temporary compaction progress files as well as output files. If the latest compaction progress file exists, we should resume.
  3. Clean up older or temporary progress files, if any. They can exist if the last OpenAndCompact() crashed during the resume flow.
  4. If present, parse the latest progress file into CompactionProgress and clean up extra compaction output files that are not yet tracked. Such files can exist because tracking every output file is only best-effort, and an output file interrupted mid-write is not yet tracked as progress.
  5. If allow_resumption=false or no valid compaction progress is found or parsed, clean up the latest progress file and existing compaction output files to start a fresh compaction. If the cleanup itself fails, fail the OpenAndCompact() call to prevent resuming with an inconsistency between the output files and the progress file.
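Steps 2-3 above amount to classifying the files found in the output directory. A minimal sketch, assuming a hypothetical `PROGRESS-<n>` / `PROGRESS-<n>.tmp` naming scheme (the PR's actual file naming may differ):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Result of scanning the output directory for progress files.
struct ResumeScan {
  std::string latest;                  // empty => start a fresh compaction
  std::vector<std::string> to_delete;  // old or temporary progress files
};

ResumeScan ScanProgressFiles(const std::vector<std::string>& files) {
  ResumeScan r;
  uint64_t best = 0;
  for (const auto& f : files) {
    if (f.rfind("PROGRESS-", 0) != 0) continue;  // not a progress file
    if (f.size() > 4 && f.compare(f.size() - 4, 4, ".tmp") == 0) {
      r.to_delete.push_back(f);  // interrupted mid-write; always discard
      continue;
    }
    uint64_t n = std::stoull(f.substr(9));  // number after "PROGRESS-"
    if (r.latest.empty() || n > best) {
      if (!r.latest.empty()) r.to_delete.push_back(r.latest);
      r.latest = f;  // keep only the newest progress file
      best = n;
    } else {
      r.to_delete.push_back(f);  // superseded by a newer progress file
    }
  }
  return r;
}
```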

Progress File Creation

  1. Create a temporary progress file.
  2. Persist the progress from the latest compaction progress file into the temporary progress file. This simplifies resuming from an interrupted compaction that was itself resumed, similar to how MANIFEST recovery works.
  3. Rename the temporary progress file to the newer compaction progress file name so that it atomically becomes the "new" latest progress file.
  4. Delete the "old" latest progress file since it is no longer needed.
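The four steps follow the classic write-temp-then-rename pattern. A self-contained sketch using std::filesystem, with illustrative file names; note a real implementation would also fsync the file and directory for durability, which std::ofstream does not provide:

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Rotate the latest progress file: write the new progress into a temporary,
// atomically rename it to the new "latest" name, then drop the old latest.
// A crash between any two steps leaves either the old or the new latest
// progress file intact, never a partially written one under the final name.
bool RotateProgressFile(const fs::path& dir, const std::string& old_latest,
                        const std::string& new_latest,
                        const std::string& contents) {
  fs::path tmp = dir / (new_latest + ".tmp");
  {
    std::ofstream out(tmp, std::ios::trunc);  // step 1: create temporary
    out << contents;                          // step 2: persist progress
    if (!out.good()) return false;
  }
  std::error_code ec;
  fs::rename(tmp, dir / new_latest, ec);      // step 3: atomic rename
  if (ec) return false;
  if (!old_latest.empty()) {
    fs::remove(dir / old_latest, ec);         // step 4: delete old latest
  }
  return true;
}
```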

Test plan:

Performance testing:

1. Latency
Using

./db_bench --benchmarks=OpenAndCompact[X5] --openandcompact_test_cancel_on_odd=false --openandcompact_cancel_after_seconds=0 --openandcompact_allow_resumption=$openandcompact_allow_resumption  --use_existing_db=true --db=$db --disable_auto_compactions=true --compression_type=none --secondary_path=$secondary_path  --target_file_size_base=268435456

allow_resumption = false
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 26766256.000 micros/op 0 ops/sec 26.766 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 27837249.000 micros/op 0 ops/sec 27.837 seconds 1 operations;

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 26546234.000 micros/op 0 ops/sec 26.546 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 27918621.000 micros/op 0 ops/sec 27.919 seconds 1 operations;
OpenAndCompact [AVG 2 runs] : 0 (± 0) ops/sec

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 42243571.000 micros/op 0 ops/sec 42.244 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 43497581.000 micros/op 0 ops/sec 43.498 seconds 1 operations;
OpenAndCompact [AVG 3 runs] : 0 (± 0) ops/sec

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 34241357.000 micros/op 0 ops/sec 34.241 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 35655346.000 micros/op 0 ops/sec 35.655 seconds 1 operations;
OpenAndCompact [AVG 4 runs] : 0 (± 0) ops/sec

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 27083361.000 micros/op 0 ops/sec 27.083 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 28487999.000 micros/op 0 ops/sec 28.488 seconds 1 operations;
OpenAndCompact [AVG 5 runs] : 0 (± 0) ops/sec

OpenAndCompact [AVG 5 runs] : 0 (± 0) ops/sec; 31669.681 ms/op
OpenAndCompact [MEDIAN 5 runs] : 0 ops/sec

allow_resumption = true
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 25446470.000 micros/op 0 ops/sec 25.446 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 26833415.000 micros/op 0 ops/sec 26.833 seconds 1 operations;

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 240745.000 micros/op 0 ops/sec 0.241 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 244934.000 micros/op 4 ops/sec 0.245 seconds 1 operations;
OpenAndCompact [AVG 2 runs] : 2 (± 3) ops/sec

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 24843383.000 micros/op 0 ops/sec 24.843 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 26192235.000 micros/op 0 ops/sec 26.192 seconds 1 operations;
OpenAndCompact [AVG 3 runs] : 1 (± 2) ops/sec

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 270819.000 micros/op 0 ops/sec 0.271 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 275140.000 micros/op 3 ops/sec 0.275 seconds 1 operations;
OpenAndCompact [AVG 4 runs] : 1 (± 2) ops/sec

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 23038311.000 micros/op 0 ops/sec 23.038 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 24439097.000 micros/op 0 ops/sec 24.439 seconds 1 operations;
OpenAndCompact [AVG 5 runs] : 1 (± 1) ops/sec

OpenAndCompact [AVG 5 runs] : 1 (± 1) ops/sec; 638.417 ms/op
OpenAndCompact [MEDIAN 5 runs] : 0 ops/sec

Persistence cost: compare only the odd-numbered OpenAndCompact() runs across the two configurations -> the runs with persistence are actually faster.
Resumption saving: compare only the even-numbered OpenAndCompact() runs across the two configurations -> (0.2 - 26.766) / 26.766 * 100 = -99.25%, i.e. a 99.25% improvement, reflecting the fact that without the allow_resumption feature, all compaction progress is redone.

2. Memory usage (in case SubcompactionProgress storing its own in-memory copies of output file metadata in https://github.com/facebook/rocksdb/pull/13983/files is a concern)

// ~= 90 output files
/usr/bin/time -f "
Resource Summary:
 Wall time: %e seconds
 CPU time: %U user + %S system (%P total)
 Peak memory: %M KB
 Page faults: %F major + %R minor
" ./db_bench --benchmarks=OpenAndCompact[X1] --openandcompact_test_cancel_on_odd=false --openandcompact_cancel_after_seconds=0 --openandcompact_allow_resumption=$openandcompact_allow_resumption  --use_existing_db=true --db=$db --disable_auto_compactions=true --compression_type=none --secondary_path=$secondary_path  --target_file_size_base=268435456

allow_resumption = false
Peak memory: 275828 KB
allow_resumption = true
Peak memory: 277204 KB (a 0.49% memory regression, most likely due to storing copies of the output files' file metadata in the subcompaction progress)

Near-term follow up:

@meta-cla meta-cla bot added the CLA Signed label Sep 20, 2025
@hx235 hx235 force-pushed the resumable_compaction_secondary_db branch from c5b63fa to c6d04f5 on September 20, 2025 11:28
@hx235 hx235 changed the title Integrate compaction resumption with OpenAndCompact() Integrate compaction resumption with DB::OpenAndCompact() Sep 20, 2025
@hx235 hx235 requested a review from jaykorean September 20, 2025 22:52
@hx235 hx235 force-pushed the resumable_compaction_secondary_db branch 2 times, most recently from 61a7d1d to b20b70b on September 24, 2025 22:39
@hx235 hx235 force-pushed the resumable_compaction_secondary_db branch 2 times, most recently from aac314e to 6e02ba1 on October 2, 2025 16:14
Comment on lines +2868 to +2879
// Track total processed input records for progress reporting by combining:
// - Resumed count: records already processed before compaction was
//   interrupted
// - Current count: records scanned in the current compaction session
// Only update when both tracking mechanisms provide accurate counts to ensure
// reliability.
subcompaction_progress.num_processed_input_records =
    c_iter->HasNumInputEntryScanned() &&
            sub_compact->compaction_job_stats.has_accurate_num_input_records
        ? c_iter->NumInputEntryScanned() +
              sub_compact->compaction_job_stats.num_input_records
        : 0;
hx235 (author) commented on Oct 2, 2025:

Bug fix to #13983 for the case where a cancelled compaction is cancelled again. Will be tested in stress test support PR that's coming up.

jaykorean (Contributor) left a comment:

Still need more time to review, but leaving the first set of comments

return s;
} else if (latest_compaction_progress_file.empty()) {
// CASE 1: Resume requested but no progress file found - start fresh
ROCKS_LOG_INFO(immutable_db_options_.info_log,
jaykorean (Contributor) commented:

nit: this may spam the logger

hx235 (author) commented on Oct 3, 2025:

Thanks! I think it can be replaced by existing logging already.

hx235 (author) commented:

Actually I took a look and concluded these are the essential log to figure out why resuming is not happening. It's only printed once per OpenAndCompact call. Is there any concern on that frequency?

jaykorean (Contributor) commented on Oct 10, 2025:

At the end of the day, the value for this log is that it tells us there was no prior compaction progress which should be the majority of the cases.

IMO, as long as we can tell if the attempted compaction (whether it succeeded or failed) had had prior compaction or not at the end would be sufficient.

I'm okay we can log this for now in the initial rollout to see if we see too many of these and remove this later.

hx235 (author) commented on Oct 10, 2025:

I see. For "had had prior compaction or not at the end would be sufficient", what would be your proposed way to tell, by the way?

jaykorean (Contributor) commented:

what would be your proposal way to tell?

wondering if CompactionOutputs::Output can have a simple boolean field (e.g. restored). We can set it to true in CompactionJob::RestoreCompactionOutputs(), and then log during CompactionJob::VerifyOutputFiles()

This could give us an immediate answer whether output file was restored from prior compaction or created from current compaction at the verification step.

hx235 (author) commented:

Hi - I will keep this in mind. Thanks!

@hx235 hx235 force-pushed the resumable_compaction_secondary_db branch from 6e02ba1 to f3b9917 on October 9, 2025 22:06
Comment on lines +2090 to +2103
// LIMITATION: Don't save progress if the current key has already been scanned
// (looked ahead) in the input but not yet output. This can happen with merge
// operations, single deletes, and deletes at the bottommost level where
// CompactionIterator needs to look ahead to process multiple entries for the
// same user key before outputting a result. If we saved progress and resumed
// at this boundary, the resumed session would see and process the same input
// key again through Seek(), leading to incorrect double-counting in
// number of processed input entries and input count verification failure
//
// TODO(hx235): Offset num_processed_input_records to avoid double counting
// instead of disabling progress persistence.
if (c_iter->IsCurrentKeyAlreadyScanned()) {
return false;
}
hx235 (author) commented:

@jaykorean another bug fix motivated by the stress test, for the case where next_key_to_compact can already be counted in num_processed_input_records. See the new UT in compaction_job_test.cc. I have some idea of the fix so we can relax this constraint, but it is not going to block this PR.


meta-codesync bot commented Oct 9, 2025

@hx235 has imported this pull request. If you are a Meta employee, you can view this in D84299662.

@hx235 hx235 force-pushed the resumable_compaction_secondary_db branch from f3b9917 to 5b9aae8 on October 10, 2025 18:21
Comment on lines +1303 to +1314
// TODO(hx235): Resuming compaction is currently incompatible with
// paranoid_file_checks=true because OutputValidator hash verification would
// fail during compaction resumption. Before interruption, resuming
// compaction needs to persist the hash of each output file to enable
// validation after resumption. Alternatively and preferably, we could move
// the output verification to happen immediately after each output file is
// created. This workaround currently disables resuming compaction when
// paranoid_file_checks is enabled. Note that paranoid_file_checks is
// disabled by default.
bool allow_resumption =
options.allow_resumption &&
!cfd->GetLatestMutableCFOptions().paranoid_file_checks;
hx235 (author) commented:

Will be resolved by #14042

@hx235 hx235 requested a review from jaykorean October 10, 2025 21:39
@hx235 hx235 force-pushed the resumable_compaction_secondary_db branch 4 times, most recently from f5064fa to e0266e6 on October 14, 2025 21:58
jaykorean (Contributor) left a comment:

Mostly nitpick comments. The only thing that I'd suggest addressing is DeleteFileIfExists()

Almost there!

{{KeyStr("a", 1U, kTypeValue), "val1"},
{KeyStr("b", 2U, kTypeValue), "val2"}} /* input_file_1 */,
{{KeyStr("bb", 3U, kTypeValue), "val3"},
{KeyStr(kCancelBeforeThisKey, 4U, kTypeMerge),
jaykorean (Contributor) commented:

👍

// - Phase 3: Resume with resumption support (loads Phase 1's progress) and
// complete
// Validates: Resumption state can be toggled;
kMultipleCancelToggleResumption
jaykorean (Contributor) commented:

👍

s = FinalizeCompactionProgressWriter(compaction_progress_writer);
}
if (!s.ok()) {
return s;
jaykorean (Contributor) commented:

Doesn't have to be in this PR, but just one thing to think about.

Do we want to fail the whole compaction when FinalizeCompactionProgressWriter() fails due to some transient filesystem issue? Wondering if we can simply proceed with the compaction without compaction_progress_writer.

At this point, we know that we've successfully finished PrepareCompactionProgressState(), meaning that the tmp dir either contains the latest compaction progress or is completely clean. If current compaction (without progress writer) fails again, then the next compaction with allow_resumption=true will attempt to resume from prior progress again.

This may complicate things or potentially create bug-prone race scenarios, so it might not be worth handling, but I just wanted to bring it up as a potential future optimization.

// - OpenAndCompact() starts a fresh compaction from scratch.
// - No progress will be saved during execution, so interruptions require
// starting over completely.
// - CRITICAL REQUIREMENT: The `output_directory` associated MUST be empty
jaykorean (Contributor) commented:

👍

@hx235 hx235 force-pushed the resumable_compaction_secondary_db branch from e0266e6 to de651d7 on October 15, 2025 18:55
@hx235 hx235 requested a review from jaykorean October 15, 2025 18:56
jaykorean (Contributor) left a comment:

Awesome work! Thanks for the prior refactor PRs to make the compaction flow code more readable and modular. I also appreciate the well-thought-out test coverage and detailed comments on the APIs.


meta-codesync bot commented Oct 15, 2025

@hx235 merged this pull request in f7e4009.

meta-codesync bot pushed a commit that referenced this pull request Oct 21, 2025
…paction (#14041)

Summary:
**Context/Summary:**
- Add resumable compaction to stress test with adaptive progress cancellation
- Add fault injection to remote compaction
- Fix a real minor bug and a couple of testing framework bugs with remote compaction

Pull Request resolved: #14041

Test Plan: Rehearsal stress test; it found bugs for #13984 effectively and did not create new failures.

Reviewed By: jaykorean

Differential Revision: D84524194

Pulled By: hx235

fbshipit-source-id: 42b4264e428c6739631ed9aa5eb02723367510bc