Reduce risk of backup or checkpoint missing a WAL file #10083

pdillinger · 2022-05-31T18:59:18Z

Summary: We recently saw a case in the crash test in which a WAL file in
the middle of the list of live WALs was not included in the backup, so
the DB could not be opened due to the missing WAL. We are not sure why,
but this change should at least turn that into a backup-time failure by
ensuring that all the WAL files expected by the manifest (according to
VersionSet) are included in GetSortedWalFiles() (used by
GetLiveFilesStorageInfo(), BackupEngine, and Checkpoint).

Related: to maximize the effectiveness of
track_and_verify_wals_in_manifest with GetSortedWalFiles() during
checkpoint/backup, we will now sync WAL in GetLiveFilesStorageInfo()
when track_and_verify_wals_in_manifest=true.

Test Plan: added new unit test for the check in GetSortedWalFiles()
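
For readers following along, here is a minimal sketch of the cross-check described in the summary: walk the sorted WAL numbers the manifest expects alongside the sorted numbers found by the directory scan, and fail if an expected one is absent. This is not the code from the PR. `CheckResult`, `CheckRequiredWalsIncluded`, and `included_by_scan` are hypothetical names, and plain `uint64_t` WAL numbers stand in for the real log file objects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a Status-like result; not the RocksDB type.
struct CheckResult {
  bool ok;
  std::string message;
};

// Verifies that every WAL number expected by the manifest also appears in
// the list produced by the directory scan. Both inputs must be sorted in
// ascending order.
CheckResult CheckRequiredWalsIncluded(
    const std::vector<uint64_t>& required_by_manifest,
    const std::vector<uint64_t>& included_by_scan) {
  assert(std::is_sorted(required_by_manifest.begin(),
                        required_by_manifest.end()));
  assert(std::is_sorted(included_by_scan.begin(), included_by_scan.end()));

  auto required = required_by_manifest.begin();
  auto included = included_by_scan.begin();
  while (required != required_by_manifest.end()) {
    if (included == included_by_scan.end() || *required < *included) {
      // The scan ran out, or moved past, a WAL the manifest requires.
      return {false, "Missing WAL expected by manifest: " +
                         std::to_string(*required)};
    }
    if (*required == *included) {
      ++required;
      ++included;
    } else {
      // *required > *included: the scan found an extra WAL; skip it.
      ++included;
    }
  }
  return {true, ""};
}
```

Extra WALs in the scan are tolerated in this sketch; only a WAL that the manifest expects and the scan lacks produces a failure, which matches the "(minimum) cross-check" wording in the diff comment.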


facebook-github-bot · 2022-05-31T21:47:22Z

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ltamasi

Thanks for the fix!

ltamasi · 2022-05-31T22:51:14Z

db/db_filesnapshot.cc

+    // Record tracked WALs as a (minimum) cross-check for directory scan
+    const auto& manifest_wals = versions_->GetWalSet().GetWals();
+    required_by_manifest.reserve(manifest_wals.size());
+    for (const auto& wal : versions_->GetWalSet().GetWals()) {


Could be: const auto& wal : manifest_wals

ltamasi · 2022-05-31T22:54:01Z

db/db_filesnapshot.cc

+        ++required;
+        ++included;
+      } else {
+        // *required > incl_num


We could actually turn this into an assertion

ltamasi · 2022-05-31T23:03:15Z

db/db_wal_test.cc

+  options.wal_recovery_mode = WALRecoveryMode::kAbsoluteConsistency;
+
+  // Build a way to make wal files selectively go missing
+  bool enable_missing_wal = false;


How about calling this something like make_wals_go_missing ?

riversand963 · 2022-06-01T00:02:20Z

db/db_filesnapshot.cc

@@ -124,6 +124,9 @@ Status DBImpl::GetLiveFiles(std::vector<std::string>& ret,
 }

 Status DBImpl::GetSortedWalFiles(VectorLogPtr& files) {
+  // Record tracked WALs as a (minimum) cross-check for directory scan
+  std::vector<uint64_t> required_by_manifest;


Nit: autovector can be used to avoid allocating dynamic memory when WAL count is small.

This is not at all a performance critical path. We're currently taking a directory listing of the DB dir to get the answer.

facebook-github-bot · 2022-06-01T15:57:00Z

@pdillinger has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-06-01T16:08:18Z

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ajkr · 2022-06-16T00:26:39Z

db/db_filesnapshot.cc

+    s = FlushWAL(
+        immutable_db_options_.track_and_verify_wals_in_manifest /* sync */);


Surprisingly, this can corrupt the DB. The SyncWAL() it calls when track_and_verify_wals_in_manifest == true looks at GetFileSize() after syncing finishes to determine the synced size. But new writes could have reached that WAL in the meantime, causing GetFileSize() to be higher than the actual synced size. A recovery following a crash with unsynced data loss can then fail to open the DB like this:

Error opening DB: Corruption: Size mismatch: WAL (log number: 440) in MANIFEST is 63351 bytes , but actually is 62772 bytes on disk

I'll try to see if moving the GetFileSize() earlier can solve it...
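
To make the ordering problem concrete, here is a small single-threaded model of the race described above. It is only an illustration under stated assumptions: `FakeWal` and both `RecordSyncedSize...` helpers are hypothetical and not RocksDB code, the racing writer is simulated by an explicit `Append` call, and the byte counts are borrowed from the error message purely as example values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical in-memory model of a WAL file; not a RocksDB class.
struct FakeWal {
  uint64_t written_bytes = 0;  // what a GetFileSize()-style query reports
  uint64_t durable_bytes = 0;  // what would survive a crash

  void Append(uint64_t n) { written_bytes += n; }
  void Sync() { durable_bytes = written_bytes; }
  uint64_t FileSize() const { return written_bytes; }
};

// Racy order: sync first, query the size afterwards. A write that lands in
// between is wrongly counted as synced.
uint64_t RecordSyncedSizeAfterSync(FakeWal& wal, uint64_t concurrent_write) {
  wal.Sync();
  wal.Append(concurrent_write);  // simulates a writer racing with the sync
  return wal.FileSize();
}

// Conservative order: query the size first, then sync. The recorded size
// can never exceed what the sync made durable.
uint64_t RecordSyncedSizeBeforeSync(FakeWal& wal, uint64_t concurrent_write) {
  const uint64_t size = wal.FileSize();
  wal.Sync();
  assert(size <= wal.durable_bytes);
  wal.Append(concurrent_write);  // a later write no longer affects `size`
  return size;
}
```

In the racy order, the recorded size can exceed the durable size, which is the shape of the "Size mismatch" error above. Whether reading the size earlier is sufficient in the real code is exactly what the comment says still needs to be checked.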

Summary: We recently saw a case in crash test in which a WAL file in the middle of the list of live WALs was not included in the backup, so the DB was not openable due to missing WAL. We are not sure why, but this change should at least turn that into a backup-time failure by ensuring all the WAL files expected by the manifest (according to VersionSet) are included in `GetSortedWalFiles()` (used by `GetLiveFilesStorageInfo()`, `BackupEngine`, and `Checkpoint`).

Related: to maximize the effectiveness of track_and_verify_wals_in_manifest with GetSortedWalFiles() during checkpoint/backup, we will now sync WAL in GetLiveFilesStorageInfo() when track_and_verify_wals_in_manifest=true.

Pull Request resolved: facebook#10083

Test Plan: added new unit test for the check in GetSortedWalFiles()

Reviewed By: ajkr

Differential Revision: D36791608

Pulled By: pdillinger

fbshipit-source-id: a27bcf0213fc7ab177760fede50d4375d579afa6



Summary: Background: there is one active WAL file, but there can be several more WAL files in various states. Those other WALs are always in a "flushed" state but could be on the `logs_` list, not yet fully synced. We currently allow any WAL that is not the active WAL to be hard-linked when creating a Checkpoint because, although it might still be open for write, we are not appending any more data to it. The problem is that a created Checkpoint is supposed to be fully synced on return of that function, and a hard-linked WAL in the state described above might not be fully synced. (Through some prudence in #10083, it would be synced if using track_and_verify_wals_in_manifest=true.)

The fix is a step toward a long term goal of removing the need to query the filesystem to determine WAL files and their state. (I consider it dubious any time we independently read from or query metadata from a file we have open for writing, as this makes us more susceptible to FileSystem deficiencies or races.) More specifically:

* Detect which WALs might not be fully synced, according to our DBImpl metadata, and prevent hard linking those (with `trim_to_size=true` from `GetLiveFilesStorageInfo()`). And while we're at it, use our known flushed sizes for those WALs.
* To avoid a race between that and GetSortedWalFiles(), track a maximum needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
* Because of the level of consistency provided by those two, we no longer need to consider syncing as part of the FlushWAL in GetLiveFilesStorageInfo. (We determine the max WAL number consistent with the manifest file size, while holding DB mutex. Should make track_and_verify_wals_in_manifest happy.)

This makes the premise of test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback no longer hit), so the test is removed, with crash test as backstop for related issues. See #10185

Stacked on #12729

Pull Request resolved: #12731

Test Plan: Expanded an existing test, which now fails before fix. Also long runs of blackbox_crash_test with amplified checkpoint frequency.

Reviewed By: cbi42

Differential Revision: D58199629

Pulled By: pdillinger

fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382
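
The per-WAL decision described in the #12731 message can be sketched roughly as follows. This is a simplified illustration, not the actual change: `WalState`, `WalPlan`, and `PlanWalsForCheckpoint` are hypothetical names, and the sketch assumes the active WAL is simply reported as not fully synced so it takes the same copy path.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-WAL bookkeeping; field names are illustrative only.
struct WalState {
  uint64_t number;
  uint64_t flushed_size;  // bytes known to be flushed for this WAL
  bool fully_synced;      // according to in-memory metadata, not the FS
};

// Hypothetical output entry for one WAL in the checkpoint.
struct WalPlan {
  uint64_t number;
  bool trim_to_size;  // true: copy up to `size` instead of hard-linking
  uint64_t size;      // meaningful only when trim_to_size is true
};

// Decides, for each WAL (sorted by number), whether it may be hard-linked
// or must be copied up to its known flushed size. WALs newer than
// max_needed_wal_number are left out of the checkpoint.
std::vector<WalPlan> PlanWalsForCheckpoint(const std::vector<WalState>& wals,
                                           uint64_t max_needed_wal_number) {
  std::vector<WalPlan> plan;
  uint64_t prev_number = 0;
  for (const WalState& wal : wals) {
    assert(wal.number > prev_number);  // input must be strictly ascending
    prev_number = wal.number;
    if (wal.number > max_needed_wal_number) {
      continue;  // created after the point the checkpoint captures
    }
    if (wal.fully_synced) {
      plan.push_back({wal.number, /*trim_to_size=*/false, 0});
    } else {
      // Possibly unsynced: avoid hard linking, use the known flushed size.
      plan.push_back({wal.number, /*trim_to_size=*/true, wal.flushed_size});
    }
  }
  return plan;
}
```

The point of the design is that both the sync state and the size come from in-memory metadata rather than from querying a file that is still open for writing.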

facebook-github-bot added the CLA Signed label May 31, 2022

pdillinger requested a review from ltamasi May 31, 2022 21:50

ltamasi approved these changes May 31, 2022


riversand963 reviewed Jun 1, 2022


pdillinger added 2 commits June 1, 2022 08:43

Merge remote-tracking branch 'origin/main' into sanity_check_sorted_wals

312f159

Some improvements from review

8ebb322

facebook-github-bot closed this in a00cffa Jun 1, 2022

pdillinger mentioned this pull request Jun 1, 2022

Fix a bug in WAL tracking #10087

Closed

ajkr reviewed Jun 16, 2022


ajkr mentioned this pull request Jun 16, 2022

Fix race condition with WAL tracking and FlushWAL(true /* sync */) #10185

Closed

riversand963 mentioned this pull request Jul 13, 2022

Release RocksDB 7.3.2 #10359

Merged

hx235 added a commit to hx235/rocksdb that referenced this pull request Aug 22, 2022

Revert "Reduce risk of backup or checkpoint missing a WAL file (facebook#10083)"

7ea90c6

This reverts commit a00cffa.

pdillinger mentioned this pull request Jun 3, 2024

Fix Checkpoint hard link of inactive but unsynced WAL #12731

Closed
