Skip to content

Fix Checkpoint hard link of inactive but unsynced WAL#12731

Closed
pdillinger wants to merge 2 commits into
facebook:mainfrom
pdillinger:checkpoint_unsynced_wal_fix
Closed

Fix Checkpoint hard link of inactive but unsynced WAL#12731
pdillinger wants to merge 2 commits into
facebook:mainfrom
pdillinger:checkpoint_unsynced_wal_fix

Conversation

@pdillinger
Copy link
Copy Markdown
Contributor

Summary: Background: there is one active WAL file but there can be
several more WAL files in various states. Those other WALs are always
in a "flushed" state but could be on the logs_ list not yet fully
synced. We currently allow any WAL that is not the active WAL to be
hard-linked when creating a Checkpoint, as although it might still be
open for write, we are not appending any more data to it.

The problem is that a created Checkpoint is supposed to be fully synced
on return of that function, and a hard-linked WAL in the state described
above might not be fully synced. (Through some prudence in #10083,
it would synced if using track_and_verify_wals_in_manifest=true.)

The fix is a step toward a long term goal of removing the need to query
the filesystem to determine WAL files and their state. (I consider it
dubious any time we independently read from or query metadata from a
file we have open for writing, as this makes us more susceptible to
FileSystem deficiencies or races.) More specifically:

  • Detect which WALs might not be fully synced, according to our DBImpl
    metadata, and prevent hard linking those (with trim_to_size=true
    from GetLiveFilesStorageInfo(). And while we're at it, use our known
    flushed sizes for those WALs.
  • To avoid a race between that and GetSortedWalFiles(), track a maximum
    needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
  • Because of the level of consistency provided by those two, we no
    longer need to consider syncing as part of the FlushWAL in
    GetLiveFilesStorageInfo. (We determine the max WAL number consistent
    with the manifest file size, while holding DB mutex. Should make
    track_and_verify_wals_in_manifest happy.) This makes the premise of
    test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback
    no longer hit) so the test is removed, with crash test as backstop for
    related issues. See Fix race condition with WAL tracking and FlushWAL(true /* sync */) #10185

Stacked on #12729

Test Plan: Expanded an existing test, which now fails before fix.
Also long runs of blackbox_crash_test with amplified checkpoint frequency.

@pdillinger pdillinger requested a review from cbi42 June 3, 2024 22:08
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jun 4, 2024
Summary: POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written two, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write. This change builds on facebook#12731 to bring more
hygiene to our handling of WAL files in Checkpoint.

Specifically, we now Close WAL files as soon as they are fully synced,
rather than letting Close() happen in a background thread. This should
not be a performance issue as Close() should be trivial cost in the same
thread in which we just did a Sync(), but just in case of unanticipated
consequences, the old behavior is available with a new
`background_close_inactive_wals` option.

Test Plan: Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.

TODO: run lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)
Summary: Background: there is one active WAL file but there can be
several more WAL files in various states. Those other WALs are always
in a "flushed" state but could be on the `logs_` list not yet fully
synced. We currently allow any WAL that is not the active WAL to be
hard-linked when creating a Checkpoint, as although it might still be
open for write, we are not appending any more data to it.

The problem is that a created Checkpoint is supposed to be fully synced
on return of that function, and a hard-linked WAL in the state described
above might not be fully synced. (Through some prudence in facebook#10083,
it would synced if using track_and_verify_wals_in_manifest=true.)

The fix is a step toward a long term goal of removing the need to query
the filesystem to determine WAL files and their state. (I consider it
dubious any time we independently read from or query metadata from a
file we have open for writing, as this makes us more susceptible to
FileSystem deficiencies or races.) More specifically:
* Detect which WALs might not be fully synced, according to our DBImpl
  metadata, and prevent hard linking those (with `trim_to_size=true`
  from `GetLiveFilesStorageInfo()`. And while we're at it, use our known
  flushed sizes for those WALs.
* To avoid a race between that and GetSortedWalFiles(), track a maximum
  needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
* Because of the level of consistency provided by those two, we no
  longer need to consider syncing as part of the FlushWAL in
  GetLiveFilesStorageInfo. (We determine the max WAL number consistent
  with the manifest file size, while holding DB mutex. Should make
  track_and_verify_wals_in_manifest happy.) This makes the premise of
  test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback
  no longer hit) so the test is removed, with crash test as backstop for
  related issues. See facebook#10185

Stacked on facebook#12729

Test Plan: Expanded an existing test, which now fails before fix.
Also long runs of blackbox_crash_test with amplified checkpoint frequency.
@pdillinger
Copy link
Copy Markdown
Contributor Author

Grr, unresolved issues showing up in crash test.

@pdillinger pdillinger force-pushed the checkpoint_unsynced_wal_fix branch from 8de8535 to 50e30c5 Compare June 5, 2024 18:13
Comment thread db/db_filesnapshot.cc
// fully synced. Although the output DB of a Checkpoint or Backup needs
// to be fully synced on return, we don't strictly need to sync this
// DB (the input DB). If we allow Checkpoint to hard link an inactive
// WAL that isn't fully synced, that could result in an unsufficiently
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> insufficiently

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pdillinger
Copy link
Copy Markdown
Contributor Author

pdillinger commented Jun 5, 2024

Passed 3 hours of blackbox_crash_test with amplified checkpoint and backup, so should be good for review

Copy link
Copy Markdown
Contributor

@cbi42 cbi42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Why don't we sync these WAL files and then hard link them?

Comment thread db/db_impl/db_impl.cc
return res;
}

Status DBImpl::GetOpenWalSizes(std::map<uint64_t, uint64_t>& number_to_size) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could clear number_to_size to be safe

Comment thread utilities/checkpoint/checkpoint_test.cc
@pdillinger
Copy link
Copy Markdown
Contributor Author

Why don't we sync these WAL files and then hard link them?

Hmm, you're probably right that I could have done SyncClosedWals before the final GetSortedWalFiles, but I believe the approach in this PR is a step closer to the goal of not relying on filesystem queries for WAL info in GetLiveFilesStorageInfo.

Also, GetLiveFilesStorageInfo() can be used aside from Checkpoint and Backup, e.g. for statistical purposes. Under that broader set of purposes, it should be minimally blocking when asked not to flush memtable.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@pdillinger merged this pull request in 98393f0.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jun 7, 2024
Summary: POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written two, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write. This change builds on facebook#12731 to bring more
hygiene to our handling of WAL files in Checkpoint.

Specifically, we now Close WAL files as soon as they are fully synced,
rather than letting Close() happen in a background thread. This should
not be a performance issue as Close() should be trivial cost in the same
thread in which we just did a Sync(), but just in case of unanticipated
consequences, the old behavior is available with a new
`background_close_inactive_wals` option.

Test Plan: Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.

TODO: run lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)
facebook-github-bot pushed a commit that referenced this pull request Jun 12, 2024
Summary:
POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written two, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write (not yet "sealed"). This change builds on #12731
to bring more hygiene to our handling of WAL files in Checkpoint.

Specifically, we now Close WAL files as soon as they are either
(a) inactive and fully synced, or (b) inactive and obsolete (so maybe
never fully synced), rather than letting Close() happen in handling
obsolete files (maybe a background thread). This should not be a
performance issue as Close() should be trivial cost relative to other
IO ops, but just in case:
* We don't Close() while holding a mutex, to avoid blocking, and
* The old behavior is available with a new kill switch option
  `background_close_inactive_wals`.

Stacked on #12731

Pull Request resolved: #12734

Test Plan:
Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.

Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)

Reviewed By: cbi42

Differential Revision: D58295284

Pulled By: pdillinger

fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6
ybtsdst pushed a commit to ybtsdst/rocksdb that referenced this pull request Apr 27, 2025
Summary:
Background: there is one active WAL file but there can be
several more WAL files in various states. Those other WALs are always
in a "flushed" state but could be on the `logs_` list not yet fully
synced. We currently allow any WAL that is not the active WAL to be
hard-linked when creating a Checkpoint, as although it might still be
open for write, we are not appending any more data to it.

The problem is that a created Checkpoint is supposed to be fully synced
on return of that function, and a hard-linked WAL in the state described
above might not be fully synced. (Through some prudence in facebook#10083,
it would synced if using track_and_verify_wals_in_manifest=true.)

The fix is a step toward a long term goal of removing the need to query
the filesystem to determine WAL files and their state. (I consider it
dubious any time we independently read from or query metadata from a
file we have open for writing, as this makes us more susceptible to
FileSystem deficiencies or races.) More specifically:
* Detect which WALs might not be fully synced, according to our DBImpl
  metadata, and prevent hard linking those (with `trim_to_size=true`
  from `GetLiveFilesStorageInfo()`. And while we're at it, use our known
  flushed sizes for those WALs.
* To avoid a race between that and GetSortedWalFiles(), track a maximum
  needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
* Because of the level of consistency provided by those two, we no
  longer need to consider syncing as part of the FlushWAL in
  GetLiveFilesStorageInfo. (We determine the max WAL number consistent
  with the manifest file size, while holding DB mutex. Should make
  track_and_verify_wals_in_manifest happy.) This makes the premise of
  test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback
  no longer hit) so the test is removed, with crash test as backstop for
  related issues. See facebook#10185

Stacked on facebook#12729

Pull Request resolved: facebook#12731

Test Plan:
Expanded an existing test, which now fails before fix.
Also long runs of blackbox_crash_test with amplified checkpoint frequency.

Reviewed By: cbi42

Differential Revision: D58199629

Pulled By: pdillinger

fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382
ybtsdst pushed a commit to ybtsdst/rocksdb that referenced this pull request Apr 27, 2025
Summary:
POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written two, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write (not yet "sealed"). This change builds on facebook#12731
to bring more hygiene to our handling of WAL files in Checkpoint.

Specifically, we now Close WAL files as soon as they are either
(a) inactive and fully synced, or (b) inactive and obsolete (so maybe
never fully synced), rather than letting Close() happen in handling
obsolete files (maybe a background thread). This should not be a
performance issue as Close() should be trivial cost relative to other
IO ops, but just in case:
* We don't Close() while holding a mutex, to avoid blocking, and
* The old behavior is available with a new kill switch option
  `background_close_inactive_wals`.

Stacked on facebook#12731

Pull Request resolved: facebook#12734

Test Plan:
Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.

Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)

Reviewed By: cbi42

Differential Revision: D58295284

Pulled By: pdillinger

fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants