-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest #10892
Conversation
b1f8684
to
c6d5247
Compare
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
c6d5247
to
8977d68
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
8977d68
to
3526676
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure yet. Isn't it odd that we write a WalAddition for 4.log when it's already obsolete, and then ignore it later?
HISTORY.md
Outdated
@@ -3,6 +3,9 @@ | |||
### Performance Improvements | |||
* Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877). | |||
|
|||
### Bug fix | |||
* Fixed a bug cased by `DB::SyncWAL()` on a obsoleted WAL and recording such WAL addition of an obsoleted WAL in a new manifest, affecting `track_and_verify_wals_in_manifest`. Without the fix, application may see "open error: Corruption: Missing WAL with log number" while trying to open the db. The corruption is a false alarm but prevents DB open (#10892). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too much implementation detail IMO. How about: "For users of track_and_verify_wals_in_manifest
, fixed a race condition in MANIFEST rollover (see max_manifest_file_size
) leading to false positive DB::Open()
failures"?
edit: Or just delete the "obsoleted WAL and recording such WAL addition of an obsoleted WAL" part; I think that's the only part in yours that is too low level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a straightforward thing we can add is change WriteCurrentStateToManifest() to call DeleteWalsBefore()
Ideally we would not have WalAdditions following WalDeletions affecting the same WAL, but that is a bigger problem unrelated to rollover or not - it just works fine in case no rollover happened.
3526676
to
220ba5f
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
220ba5f
to
d1ca4fb
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
Update: addressed feedback and going thru stress test now. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
d1ca4fb
to
be3a9b0
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
1 similar comment
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@hx235 has updated the pull request. You must reimport the pull request before landing. |
9d92fdc
to
b3c5745
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
b3c5745
to
66bd569
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@@ -5506,7 +5509,8 @@ Status VersionSet::GetCurrentManifestPath(const std::string& dbname, | |||
Status VersionSet::Recover( | |||
const std::vector<ColumnFamilyDescriptor>& column_families, bool read_only, | |||
std::string* db_id, bool no_error_if_files_missing) { | |||
// Read "CURRENT" file, which contains a pointer to the current manifest file | |||
// Read "CURRENT" file, which contains a pointer to the current manifest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Auto format change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
66bd569
to
fc44539
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
fc44539
to
745be9f
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
…fest in VersionEditHandler::CheckIterationResult()
745be9f
to
6da530d
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: #10892 and #10955 mistakenly added two entries under sealed 7.9.history section. This PR fixes these two. No need to update 7.9 branch (https://github.com/facebook/rocksdb/blob/7.9.fb/HISTORY.md) cuz it's cut before these two PRs landed Pull Request resolved: #11013 Reviewed By: cbi42 Differential Revision: D41666514 Pulled By: hx235 fbshipit-source-id: c4bc7a29ff663664bf0be1ba1c7eab4d00a61528
…singMissingWAL) (#11186) Summary: **Context/Summary**: Simplify `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` based on #11016 (review) and delete unused sync points. Pull Request resolved: #11186 Test Plan: - UT failed before fix in #10892 and passes after - Check UT not flaky when running with https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 Reviewed By: ajkr Differential Revision: D43034723 Pulled By: hx235 fbshipit-source-id: f503774987b8f3718505f99e95080a7fad28ac66
Context
Options::track_and_verify_wals_in_manifest = true
verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown.DB::SyncWAL()
called at a specific timing (i.e, at theTEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")
) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest.And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open.
PurgeObsoleteFiles()
.As a consequence, upon
DB::Reopen()
, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown.Summary
This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest.
Test
TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)
that failed before the fix and passed after