Fix Checkpoint hard link of inactive but unsynced WAL by pdillinger · Pull Request #12731 · facebook/rocksdb

pdillinger · 2024-06-03T22:08:43Z

Summary: Background: there is one active WAL file but there can be
several more WAL files in various states. Those other WALs are always
in a "flushed" state but could be on the logs_ list not yet fully
synced. We currently allow any WAL that is not the active WAL to be
hard-linked when creating a Checkpoint, as although it might still be
open for write, we are not appending any more data to it.

The problem is that a created Checkpoint is supposed to be fully synced
on return of that function, and a hard-linked WAL in the state described
above might not be fully synced. (Through some prudence in #10083,
it would synced if using track_and_verify_wals_in_manifest=true.)

The fix is a step toward a long term goal of removing the need to query
the filesystem to determine WAL files and their state. (I consider it
dubious any time we independently read from or query metadata from a
file we have open for writing, as this makes us more susceptible to
FileSystem deficiencies or races.) More specifically:

Detect which WALs might not be fully synced, according to our DBImpl
metadata, and prevent hard linking those (with trim_to_size=true
from GetLiveFilesStorageInfo(). And while we're at it, use our known
flushed sizes for those WALs.
To avoid a race between that and GetSortedWalFiles(), track a maximum
needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
Because of the level of consistency provided by those two, we no
longer need to consider syncing as part of the FlushWAL in
GetLiveFilesStorageInfo. (We determine the max WAL number consistent
with the manifest file size, while holding DB mutex. Should make
track_and_verify_wals_in_manifest happy.) This makes the premise of
test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback
no longer hit) so the test is removed, with crash test as backstop for
related issues. See Fix race condition with WAL tracking and FlushWAL(true /* sync */) #10185

Stacked on #12729

Test Plan: Expanded an existing test, which now fails before fix.
Also long runs of blackbox_crash_test with amplified checkpoint frequency.

Summary: POSIX semantics for LinkFile (hard links) allow linking a file that is still being written two, with both the source and destination showing any subsequent writes to the source. This may not be practical semantics for some FileSystem implementations such as remote storage. They might only link the flushed or sync-ed file contents at time of LinkFile, or might even have undefined behavior if LinkFile is called on a file still open for write. This change builds on facebook#12731 to bring more hygiene to our handling of WAL files in Checkpoint. Specifically, we now Close WAL files as soon as they are fully synced, rather than letting Close() happen in a background thread. This should not be a performance issue as Close() should be trivial cost in the same thread in which we just did a Sync(), but just in case of unanticipated consequences, the old behavior is available with a new `background_close_inactive_wals` option. Test Plan: Extended existing unit test, especially adding a hygiene check to FaultInjectionTestFS to detect LinkFile() on a file still open for writes. FaultInjectionTestFS already has relevant tracking data, and tests can opt out of the new check, as in a smoke test I have left for the old, deprecated functionality `background_close_inactive_wals=true`. TODO: run lengthy blackbox_crash_test to ensure the hygiene check is OK with the crash test. (The only place I can find we use LinkFile in production is Checkpoint.)

Summary: Background: there is one active WAL file but there can be several more WAL files in various states. Those other WALs are always in a "flushed" state but could be on the `logs_` list not yet fully synced. We currently allow any WAL that is not the active WAL to be hard-linked when creating a Checkpoint, as although it might still be open for write, we are not appending any more data to it. The problem is that a created Checkpoint is supposed to be fully synced on return of that function, and a hard-linked WAL in the state described above might not be fully synced. (Through some prudence in facebook#10083, it would synced if using track_and_verify_wals_in_manifest=true.) The fix is a step toward a long term goal of removing the need to query the filesystem to determine WAL files and their state. (I consider it dubious any time we independently read from or query metadata from a file we have open for writing, as this makes us more susceptible to FileSystem deficiencies or races.) More specifically: * Detect which WALs might not be fully synced, according to our DBImpl metadata, and prevent hard linking those (with `trim_to_size=true` from `GetLiveFilesStorageInfo()`. And while we're at it, use our known flushed sizes for those WALs. * To avoid a race between that and GetSortedWalFiles(), track a maximum needed WAL number for the Checkpoint/GetLiveFilesStorageInfo. * Because of the level of consistency provided by those two, we no longer need to consider syncing as part of the FlushWAL in GetLiveFilesStorageInfo. (We determine the max WAL number consistent with the manifest file size, while holding DB mutex. Should make track_and_verify_wals_in_manifest happy.) This makes the premise of test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback no longer hit) so the test is removed, with crash test as backstop for related issues. See facebook#10185 Stacked on facebook#12729 Test Plan: Expanded an existing test, which now fails before fix. Also long runs of blackbox_crash_test with amplified checkpoint frequency.

pdillinger · 2024-06-04T23:35:51Z

Grr, unresolved issues showing up in crash test.

pdillinger · 2024-06-05T18:15:39Z

+  // fully synced. Although the output DB of a Checkpoint or Backup needs
+  // to be fully synced on return, we don't strictly need to sync this
+  // DB (the input DB). If we allow Checkpoint to hard link an inactive
+  // WAL that isn't fully synced, that could result in an unsufficiently


-> insufficiently

facebook-github-bot · 2024-06-05T18:22:47Z

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pdillinger · 2024-06-05T21:24:30Z

Passed 3 hours of blackbox_crash_test with amplified checkpoint and backup, so should be good for review

cbi42

LGTM.

Why don't we sync these WAL files and then hard link them?

cbi42 · 2024-06-05T22:15:40Z

  return res;
 }

+Status DBImpl::GetOpenWalSizes(std::map<uint64_t, uint64_t>& number_to_size) {


Could clear number_to_size to be safe

pdillinger · 2024-06-05T23:29:55Z

Why don't we sync these WAL files and then hard link them?

Hmm, you're probably right that I could have done SyncClosedWals before the final GetSortedWalFiles, but I believe the approach in this PR is a step closer to the goal of not relying on filesystem queries for WAL info in GetLiveFilesStorageInfo.

Also, GetLiveFilesStorageInfo() can be used aside from Checkpoint and Backup, e.g. for statistical purposes. Under that broader set of purposes, it should be minimally blocking when asked not to flush memtable.

facebook-github-bot · 2024-06-06T00:43:38Z

@pdillinger merged this pull request in 98393f0.

Summary: POSIX semantics for LinkFile (hard links) allow linking a file that is still being written two, with both the source and destination showing any subsequent writes to the source. This may not be practical semantics for some FileSystem implementations such as remote storage. They might only link the flushed or sync-ed file contents at time of LinkFile, or might even have undefined behavior if LinkFile is called on a file still open for write. This change builds on facebook#12731 to bring more hygiene to our handling of WAL files in Checkpoint. Specifically, we now Close WAL files as soon as they are fully synced, rather than letting Close() happen in a background thread. This should not be a performance issue as Close() should be trivial cost in the same thread in which we just did a Sync(), but just in case of unanticipated consequences, the old behavior is available with a new `background_close_inactive_wals` option. Test Plan: Extended existing unit test, especially adding a hygiene check to FaultInjectionTestFS to detect LinkFile() on a file still open for writes. FaultInjectionTestFS already has relevant tracking data, and tests can opt out of the new check, as in a smoke test I have left for the old, deprecated functionality `background_close_inactive_wals=true`. TODO: run lengthy blackbox_crash_test to ensure the hygiene check is OK with the crash test. (The only place I can find we use LinkFile in production is Checkpoint.)

Summary: POSIX semantics for LinkFile (hard links) allow linking a file that is still being written two, with both the source and destination showing any subsequent writes to the source. This may not be practical semantics for some FileSystem implementations such as remote storage. They might only link the flushed or sync-ed file contents at time of LinkFile, or might even have undefined behavior if LinkFile is called on a file still open for write (not yet "sealed"). This change builds on #12731 to bring more hygiene to our handling of WAL files in Checkpoint. Specifically, we now Close WAL files as soon as they are either (a) inactive and fully synced, or (b) inactive and obsolete (so maybe never fully synced), rather than letting Close() happen in handling obsolete files (maybe a background thread). This should not be a performance issue as Close() should be trivial cost relative to other IO ops, but just in case: * We don't Close() while holding a mutex, to avoid blocking, and * The old behavior is available with a new kill switch option `background_close_inactive_wals`. Stacked on #12731 Pull Request resolved: #12734 Test Plan: Extended existing unit test, especially adding a hygiene check to FaultInjectionTestFS to detect LinkFile() on a file still open for writes. FaultInjectionTestFS already has relevant tracking data, and tests can opt out of the new check, as in a smoke test I have left for the old, deprecated functionality `background_close_inactive_wals=true`. Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK with the crash test. (The only place I can find we use LinkFile in production is Checkpoint.) Reviewed By: cbi42 Differential Revision: D58295284 Pulled By: pdillinger fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6

Summary: Background: there is one active WAL file but there can be several more WAL files in various states. Those other WALs are always in a "flushed" state but could be on the `logs_` list not yet fully synced. We currently allow any WAL that is not the active WAL to be hard-linked when creating a Checkpoint, as although it might still be open for write, we are not appending any more data to it. The problem is that a created Checkpoint is supposed to be fully synced on return of that function, and a hard-linked WAL in the state described above might not be fully synced. (Through some prudence in facebook#10083, it would synced if using track_and_verify_wals_in_manifest=true.) The fix is a step toward a long term goal of removing the need to query the filesystem to determine WAL files and their state. (I consider it dubious any time we independently read from or query metadata from a file we have open for writing, as this makes us more susceptible to FileSystem deficiencies or races.) More specifically: * Detect which WALs might not be fully synced, according to our DBImpl metadata, and prevent hard linking those (with `trim_to_size=true` from `GetLiveFilesStorageInfo()`. And while we're at it, use our known flushed sizes for those WALs. * To avoid a race between that and GetSortedWalFiles(), track a maximum needed WAL number for the Checkpoint/GetLiveFilesStorageInfo. * Because of the level of consistency provided by those two, we no longer need to consider syncing as part of the FlushWAL in GetLiveFilesStorageInfo. (We determine the max WAL number consistent with the manifest file size, while holding DB mutex. Should make track_and_verify_wals_in_manifest happy.) This makes the premise of test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback no longer hit) so the test is removed, with crash test as backstop for related issues. See facebook#10185 Stacked on facebook#12729 Pull Request resolved: facebook#12731 Test Plan: Expanded an existing test, which now fails before fix. Also long runs of blackbox_crash_test with amplified checkpoint frequency. Reviewed By: cbi42 Differential Revision: D58199629 Pulled By: pdillinger fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382

Summary: POSIX semantics for LinkFile (hard links) allow linking a file that is still being written two, with both the source and destination showing any subsequent writes to the source. This may not be practical semantics for some FileSystem implementations such as remote storage. They might only link the flushed or sync-ed file contents at time of LinkFile, or might even have undefined behavior if LinkFile is called on a file still open for write (not yet "sealed"). This change builds on facebook#12731 to bring more hygiene to our handling of WAL files in Checkpoint. Specifically, we now Close WAL files as soon as they are either (a) inactive and fully synced, or (b) inactive and obsolete (so maybe never fully synced), rather than letting Close() happen in handling obsolete files (maybe a background thread). This should not be a performance issue as Close() should be trivial cost relative to other IO ops, but just in case: * We don't Close() while holding a mutex, to avoid blocking, and * The old behavior is available with a new kill switch option `background_close_inactive_wals`. Stacked on facebook#12731 Pull Request resolved: facebook#12734 Test Plan: Extended existing unit test, especially adding a hygiene check to FaultInjectionTestFS to detect LinkFile() on a file still open for writes. FaultInjectionTestFS already has relevant tracking data, and tests can opt out of the new check, as in a smoke test I have left for the old, deprecated functionality `background_close_inactive_wals=true`. Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK with the crash test. (The only place I can find we use LinkFile in production is Checkpoint.) Reviewed By: cbi42 Differential Revision: D58295284 Pulled By: pdillinger fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6

pdillinger requested a review from cbi42 June 3, 2024 22:08

facebook-github-bot added the CLA Signed label Jun 3, 2024

pdillinger mentioned this pull request Jun 4, 2024

Ensure Close() before LinkFile() for WALs in Checkpoint #12734

Closed

Fixes for crash test

50e30c5

pdillinger force-pushed the checkpoint_unsynced_wal_fix branch from 8de8535 to 50e30c5 Compare June 5, 2024 18:13

pdillinger commented Jun 5, 2024

View reviewed changes

cbi42 approved these changes Jun 5, 2024

View reviewed changes

facebook-github-bot closed this in 98393f0 Jun 6, 2024

facebook-github-bot added the Merged label Jun 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Checkpoint hard link of inactive but unsynced WAL#12731

Fix Checkpoint hard link of inactive but unsynced WAL#12731
pdillinger wants to merge 2 commits into
facebook:mainfrom
pdillinger:checkpoint_unsynced_wal_fix

pdillinger commented Jun 3, 2024

Uh oh!

pdillinger commented Jun 4, 2024

Uh oh!

pdillinger Jun 5, 2024

Uh oh!

facebook-github-bot commented Jun 5, 2024

Uh oh!

pdillinger commented Jun 5, 2024 •

edited

Loading

Uh oh!

cbi42 left a comment

Uh oh!

cbi42 Jun 5, 2024

Uh oh!

Uh oh!

pdillinger commented Jun 5, 2024

Uh oh!

facebook-github-bot commented Jun 6, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

pdillinger commented Jun 3, 2024

Uh oh!

pdillinger commented Jun 4, 2024

Uh oh!

pdillinger Jun 5, 2024

Choose a reason for hiding this comment

Uh oh!

facebook-github-bot commented Jun 5, 2024

Uh oh!

pdillinger commented Jun 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cbi42 left a comment

Choose a reason for hiding this comment

Uh oh!

cbi42 Jun 5, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pdillinger commented Jun 5, 2024

Uh oh!

facebook-github-bot commented Jun 6, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pdillinger commented Jun 5, 2024 •

edited

Loading