Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest #10892

Closed
wants to merge 4 commits into from

@hx235 hx235 commented Oct 27, 2022

Context
With Options::track_and_verify_wals_in_manifest = true, RocksDB verifies that every WAL tracked in the manifest is actually present in the WAL folder. If one is not, the corruption "Missing WAL with log number" is thrown.
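
To make the check concrete, here is a minimal, self-contained sketch of what the verification amounts to. It is a toy model and not RocksDB code: the helper `VerifyTrackedWals` and the exact message format are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>

// Hypothetical helper (not part of RocksDB): compares the WAL numbers a
// manifest tracks against the WAL numbers found in the WAL folder.
// Returns an empty string if every tracked WAL exists, otherwise a
// message naming the first (lowest-numbered) missing WAL.
std::string VerifyTrackedWals(const std::set<uint64_t>& wals_in_manifest,
                              const std::set<uint64_t>& wals_on_disk) {
  for (uint64_t log_number : wals_in_manifest) {
    if (wals_on_disk.count(log_number) == 0) {
      // Assumed message format, modelled on the error quoted above.
      return "Corruption: Missing WAL with log number: " +
             std::to_string(log_number);
    }
  }
  return "";
}
```

In this model, a manifest tracking WALs 4 and 5 with only WAL 5 on disk reports WAL 4 as missing, while matching sets yield an empty string.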

DB::SyncWAL() called at a specific time (i.e., at TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest.
That WAL deletion record is not rolled over to the new manifest, so the new manifest creates the illusion that this WAL was never deleted and should be present when the DB is opened or reopened.

  • Such a WAL deletion record can be produced by flushing the memtable associated with that WAL, and the actual file deletion can then happen in PurgeObsoleteFiles().

As a consequence, upon DB::Reopen(), this WAL file may already be deleted while the manifest still has its WAL addition record, which causes the false-alarm corruption "Missing WAL with log number" to be thrown.
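
The mismatch between the two manifests can be sketched with a toy replay model. Everything here is hypothetical and simplified: `WalRecord` and `ReplayTrackedWals` are illustrative names rather than RocksDB types, and WAL number 4 with a deletion threshold of 5 is just an example.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Simplified stand-in for a WAL-related manifest record.
struct WalRecord {
  enum Kind { kAdd, kDeleteBefore } kind;
  uint64_t number;
};

// Replays a manifest in order and returns the WALs it expects on disk.
std::set<uint64_t> ReplayTrackedWals(const std::vector<WalRecord>& manifest) {
  std::set<uint64_t> tracked;
  for (const WalRecord& r : manifest) {
    if (r.kind == WalRecord::kAdd) {
      tracked.insert(r.number);
    } else {
      // Drop every tracked WAL numbered below r.number.
      tracked.erase(tracked.begin(), tracked.lower_bound(r.number));
    }
  }
  return tracked;
}

// Previous manifest: WAL 4 was added, then obsoleted after a flush.
std::set<uint64_t> TrackedInPreviousManifest() {
  std::vector<WalRecord> manifest = {{WalRecord::kAdd, 4},
                                     {WalRecord::kDeleteBefore, 5}};
  return ReplayTrackedWals(manifest);
}

// New manifest before the fix: the deletion was not carried over, and the
// late SyncWAL() appended an addition for WAL 4.
std::set<uint64_t> TrackedInNewManifest() {
  std::vector<WalRecord> manifest = {{WalRecord::kAdd, 4}};
  return ReplayTrackedWals(manifest);
}
```

In this model the previous manifest tracks no WAL, while the new manifest still expects WAL 4, whose file may already have been purged.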

Summary
This PR fixes the false alarm by rolling the WAL deletion record over from the previous manifest, i.e. by also writing that WAL deletion record into the new manifest.
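
A rough sketch of the idea, under stated assumptions: this is a toy model rather than the actual change, all names (`WalState`, `SnapshotForNewManifest`, `Wal4TrackedAfterReopen`) are hypothetical, and it assumes that during replay an addition for a WAL below the recorded deletion threshold is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Simplified stand-ins, not RocksDB's real types.
struct WalRecord {
  enum Kind { kAdd, kDeleteBefore } kind;
  uint64_t number;
};

struct WalState {
  std::set<uint64_t> tracked;
  uint64_t min_to_keep = 0;

  void Apply(const WalRecord& r) {
    if (r.kind == WalRecord::kAdd) {
      // Assumption: additions below the deletion threshold are ignored.
      if (r.number >= min_to_keep) tracked.insert(r.number);
    } else {
      min_to_keep = std::max(min_to_keep, r.number);
      tracked.erase(tracked.begin(), tracked.lower_bound(r.number));
    }
  }
};

// Writes the current state as the first records of a new manifest. With
// roll_over_deletion set, the deletion threshold is carried over as well.
std::vector<WalRecord> SnapshotForNewManifest(const WalState& state,
                                              bool roll_over_deletion) {
  std::vector<WalRecord> records;
  for (uint64_t n : state.tracked) records.push_back({WalRecord::kAdd, n});
  if (roll_over_deletion && state.min_to_keep != 0) {
    records.push_back({WalRecord::kDeleteBefore, state.min_to_keep});
  }
  return records;
}

// Replays the old manifest [Add 4, DeleteBefore 5], rolls over to a new
// manifest, appends the late Add 4, and reports whether WAL 4 is still
// tracked after replaying the new manifest.
bool Wal4TrackedAfterReopen(bool roll_over_deletion) {
  WalState old_state;
  old_state.Apply({WalRecord::kAdd, 4});
  old_state.Apply({WalRecord::kDeleteBefore, 5});

  std::vector<WalRecord> new_manifest =
      SnapshotForNewManifest(old_state, roll_over_deletion);
  new_manifest.push_back({WalRecord::kAdd, 4});  // late SyncWAL() record

  WalState reopened;
  for (const WalRecord& r : new_manifest) reopened.Apply(r);
  return reopened.tracked.count(4) > 0;
}
```

Without the rollover, the reopened state still tracks WAL 4 and the missing-WAL check would fire; with the rollover, the late addition falls below the threshold and is dropped.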

Test

  • make check
  • Added a new unit test, TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL), that failed before the fix and passes after it
  • [Ongoing] CI stress test with aggressive values, as in "[CI only] Aggressive crash test value" #10761, which is how this false alarm first surfaced, to confirm that the false alarm disappears
  • [Ongoing] Regular CI stress test to confirm that the fix does not break anything

@hx235 hx235 added the WIP Work in progress label Oct 27, 2022
@hx235 hx235 changed the title from "Delete WAL of number before version_set's min_log_number_to_keep in CheckIterationResult to fix missing WAL" to "Fix missing WAL by deleting WAL before VersionSet::min_log_number_to_keep_" Oct 27, 2022
@hx235 hx235 changed the title from "Fix missing WAL by deleting WAL before VersionSet::min_log_number_to_keep_" to "Fix missing WAL by deleting WAL before VersionSet::min_log_number_to_keep_ in VersionEditHandler::CheckIterationResult" Oct 27, 2022
@hx235 hx235 changed the title from "Fix missing WAL by deleting WAL before VersionSet::min_log_number_to_keep_ in VersionEditHandler::CheckIterationResult" to "Fix missing WAL by deleting WAL before VersionSet::min_log_number_to_keep_ in VersionEditHandler::CheckIterationResult()" Oct 27, 2022
@hx235 hx235 changed the title from "Fix missing WAL by deleting WAL before VersionSet::min_log_number_to_keep_ in VersionEditHandler::CheckIterationResult()" to "Fix missing WAL by removing WAL before min log number to keep from manifest in VersionEditHandler::CheckIterationResult()" Oct 27, 2022
@facebook-github-bot
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 requested a review from ajkr October 31, 2022 18:24
@ajkr ajkr left a comment

Not sure yet. Isn't it odd that we write a WalAddition for 4.log when it's already obsolete, and then ignore it later?

HISTORY.md Outdated
@@ -3,6 +3,9 @@
### Performance Improvements
* Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).

### Bug fix
* Fixed a bug caused by `DB::SyncWAL()` on an obsoleted WAL and recording such WAL addition of an obsoleted WAL in a new manifest, affecting `track_and_verify_wals_in_manifest`. Without the fix, applications may see "open error: Corruption: Missing WAL with log number" while trying to open the DB. The corruption is a false alarm but prevents DB open (#10892).
@ajkr ajkr Oct 31, 2022

Too much implementation detail IMO. How about: "For users of track_and_verify_wals_in_manifest, fixed a race condition in MANIFEST rollover (see max_manifest_file_size) leading to false positive DB::Open() failures"?

edit: Or just delete the "obsoleted WAL and recording such WAL addition of an obsoleted WAL" part; I think that's the only part in yours that is too low level

Contributor Author

Fixed

@ajkr ajkr left a comment

Probably a straightforward thing we can do is change WriteCurrentStateToManifest() to call DeleteWalsBefore().

Ideally we would not have WalAdditions following WalDeletions affecting the same WAL, but that is a bigger problem unrelated to rollover or not - it just works fine in case no rollover happened.

@hx235 hx235 changed the title from "Fix missing WAL by removing WAL before min log number to keep from manifest in VersionEditHandler::CheckIterationResult()" to "Fix missing WAL by DeleteWalsBefore() in ProcessManifestWrites to respect min log number to keep" Nov 8, 2022
@hx235 hx235 changed the title from "Fix missing WAL by DeleteWalsBefore() in ProcessManifestWrites to respect min log number to keep" to "Fix missing WAL by DeleteWalsBefore() in ProcessManifestWrites() to respect min log number to keep" Nov 8, 2022
@hx235
hx235 commented Nov 8, 2022

Update: addressed the feedback; the stress test is running now.


@@ -5506,7 +5509,8 @@ Status VersionSet::GetCurrentManifestPath(const std::string& dbname,
Status VersionSet::Recover(
const std::vector<ColumnFamilyDescriptor>& column_families, bool read_only,
std::string* db_id, bool no_error_if_files_missing) {
- // Read "CURRENT" file, which contains a pointer to the current manifest file
+ // Read "CURRENT" file, which contains a pointer to the current manifest
Contributor Author

Auto format change.

@hx235 hx235 changed the title from "Fix missing WAL by DeleteWalsBefore() in ProcessManifestWrites() to respect min log number to keep" to "Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest" Nov 28, 2022
@ajkr ajkr left a comment

LGTM, thanks!

facebook-github-bot pushed a commit that referenced this pull request Dec 7, 2022
Summary:
#10892 and #10955 mistakenly added two entries under the sealed 7.9 history section. This PR fixes these two.

No need to update the 7.9 branch (https://github.com/facebook/rocksdb/blob/7.9.fb/HISTORY.md) because it was cut before these two PRs landed.

Pull Request resolved: #11013

Reviewed By: cbi42

Differential Revision: D41666514

Pulled By: hx235

fbshipit-source-id: c4bc7a29ff663664bf0be1ba1c7eab4d00a61528
facebook-github-bot pushed a commit that referenced this pull request Feb 7, 2023
…singMissingWAL) (#11186)

Summary:
**Context/Summary**:
Simplify `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` based on #11016 (review) and delete unused sync points.

Pull Request resolved: #11186

Test Plan:
- UT failed before fix in #10892 and passes after
- Check UT not flaky when running with https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878

Reviewed By: ajkr

Differential Revision: D43034723

Pulled By: hx235

fbshipit-source-id: f503774987b8f3718505f99e95080a7fad28ac66