Skip to content

Skip Wal Recovery on SecondaryDB Open if for Remote Compaction#14462

Closed
jaykorean wants to merge 9 commits into
facebook:mainfrom
jaykorean:skip_wal_recovery_remote_compaction
Closed

Skip Wal Recovery on SecondaryDB Open if for Remote Compaction#14462
jaykorean wants to merge 9 commits into
facebook:mainfrom
jaykorean:skip_wal_recovery_remote_compaction

Conversation

@jaykorean
Copy link
Copy Markdown
Contributor

@jaykorean jaykorean commented Mar 13, 2026

Summary:

Skip WAL recovery when opening a secondary DB instance in OpenAndCompact() for remote compaction. WAL replay is unnecessary in this flow since only LSM state from MANIFEST is needed.

Test Plan:

  • make -j db_secondary_test && ./db_secondary_test — 35/35 passed
  • make -j compaction_service_test && ./compaction_service_test — 43/43 passed (includes new SkipWALRecoveryInOpenAndCompact test)
  • make -j options_settable_test && ./options_settable_test --gtest_filter="DBOptionsAllFieldsSettable" — 1/1 passed
  • Removed temporary hack in stress test that disables WAL

@meta-cla meta-cla Bot added the CLA Signed label Mar 13, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 13, 2026

✅ clang-tidy: No findings on changed lines

Completed in 243.2s.

@jaykorean jaykorean force-pushed the skip_wal_recovery_remote_compaction branch 3 times, most recently from 44fa6bd to 630b2a0 Compare March 16, 2026 16:50
@jaykorean jaykorean requested a review from hx235 March 16, 2026 19:43
@jaykorean jaykorean marked this pull request as ready for review March 16, 2026 19:43
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 16, 2026

@jaykorean has imported this pull request. If you are a Meta employee, you can view this in D96788211.

// Track whether FindAndRecoverLogFiles is called during compaction
std::atomic_bool wal_recovery_called{false};
SyncPoint::GetInstance()->SetCallBack(
"DBImplSecondary::FindAndRecoverLogFiles:Begin",
Copy link
Copy Markdown
Contributor

@hx235 hx235 Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A better test is to not add sync point when we could invoke real DB behavior relatively easily. Even arbitrarily corrupting the WAL and seeing if the remote compaction is intact, compared before and after this PR seems better than adding a sync point. Alternatively, some statistics about WAL read IO would be ideal though I am not sure if it exists. Maybe that's a gap. FS wrapper counting reads can be used for testing here too.

Whenever we can use and observe the actual DB behavior, we should avoid adding sync point to make future refactoring easier

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's no other good option. Claude is suggesting to create a new FileSystemWrapper implementation for just WalrReadCounting purpose - WalReadCountingFS 😆 which I feel is overkill.

Maybe stress test is good enough.

Comment thread tools/db_crashtest.py Outdated
Copy link
Copy Markdown
Contributor

@hx235 hx235 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments - correctness wise looks fine to me and our stress test should be able to catch any bug there. Thanks!

Comment thread db/db_impl/db_impl_secondary.h Outdated
Comment on lines +387 to +399
// When true, WAL replay is skipped during Recover(). Used internally by
// OpenAndCompact() which only needs LSM state from MANIFEST.
bool skip_wal_recovery_ = false;

// Internal helper for opening a secondary instance, with an option to skip
// WAL recovery. Used by DB::OpenAsSecondary() and DB::OpenAndCompact().
static Status OpenAsSecondaryImpl(
const DBOptions& db_options, const std::string& dbname,
const std::string& secondary_path,
const std::vector<ColumnFamilyDescriptor>& column_families,
std::vector<ColumnFamilyHandle*>* handles, std::unique_ptr<DB>* dbptr,
bool skip_wal_recovery);

Copy link
Copy Markdown
Contributor

@hx235 hx235 Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the friend class and member variable can be avoided if we carve out the WAL recovery from secondary db's Recover() function to be its own since it's optional for secondary db's recovery.

  DB::OpenAsSecondary()                        [public, unchanged signature]
    └── DB::OpenAsSecondaryCore()              [private static on DB]
          ├── creates DBImplSecondary
          ├── calls Recover()                  ← MANIFEST only (WAL removed)
          └── sets up handles, superversions
    └── impl->FindAndRecoverLogFiles()         ← added after Core returns

  DB::OpenAndCompact()
    └── DB::OpenAsSecondaryCore()              ← no WAL recovery, done

Copy link
Copy Markdown
Contributor Author

@jaykorean jaykorean Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely doable. And cleaner, too I think. It just requires a bit more changes. Let me put up a change for this. This won't require bool skip_wal_recovery_ at all.

@jaykorean jaykorean force-pushed the skip_wal_recovery_remote_compaction branch 2 times, most recently from a5875aa to 3e2365e Compare March 19, 2026 18:37
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 19, 2026

@jaykorean has imported this pull request. If you are a Meta employee, you can view this in D96788211.

@jaykorean jaykorean force-pushed the skip_wal_recovery_remote_compaction branch from ac50fae to ce8600a Compare March 19, 2026 21:26
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 19, 2026

@jaykorean has imported this pull request. If you are a Meta employee, you can view this in D96788211.

@meta-codesync meta-codesync Bot closed this in 89322fd Mar 19, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 19, 2026

@jaykorean merged this pull request in 89322fd.

@jaykorean jaykorean deleted the skip_wal_recovery_remote_compaction branch March 20, 2026 02:10
doxtop pushed a commit to flyingw/rocksdb that referenced this pull request Apr 7, 2026
…ook#14462)

Summary:
Skip WAL recovery when opening a secondary DB instance in OpenAndCompact() for remote compaction. WAL replay is unnecessary in this flow since only LSM state from MANIFEST is needed.

Pull Request resolved: facebook#14462

Test Plan:
- make -j db_secondary_test && ./db_secondary_test — 35/35 passed
- make -j compaction_service_test && ./compaction_service_test — 43/43 passed (includes new SkipWALRecoveryInOpenAndCompact test)
- make -j options_settable_test && ./options_settable_test --gtest_filter="*DBOptionsAllFieldsSettable*" — 1/1 passed
- Removed temporary hack in stress test that disables WAL

Reviewed By: hx235

Differential Revision: D96788211

Pulled By: jaykorean

fbshipit-source-id: f91a2f861f2450ebc83423ed4c6f5b70da7d9e8b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants