
release-23.2: logstore: sync sideloaded storage directories #115709

Merged
merged 6 commits into release-23.2 from blathers/backport-release-23.2-114616 on Dec 7, 2023

Conversation

@blathers-crl blathers-crl bot commented Dec 6, 2023

Backport 6/6 commits from #114616 on behalf of @pavelkalinnikov.

/cc @cockroachdb/release


This PR ensures that the hierarchy of directories/files created by the sideloaded log storage is properly synced.

Previously, only the "leaf" files in this hierarchy were fsync-ed. Even though this guarantees that the files' content and metadata are synced, it does not guarantee that the references to these files are durable. For example, the Linux man page for fsync [^1] says:

```
Calling fsync() does not necessarily ensure that the entry in the
directory containing the file has also reached disk.  For that an
explicit fsync() on a file descriptor for the directory is also
needed.
```

This means that these files can be lost after a system crash or power off. This leads to issues because:

  1. Pebble WAL syncs are not atomic with the sideloaded file syncs. It is thus possible that raft log metadata "references" a sideloaded file and is synced while the file itself is not yet durable. A power-off/on at this point leads to an internal inconsistency and can result in a crash loop when raft tries to load these entries to apply and/or send to other replicas.

  2. The durability of entry files is used as a pre-condition to sending raft messages that trigger committing these entries. A coordinated power-off/on on a majority of replicas can thus lead to losing committed entries and an unrecoverable loss of quorum.

This PR fixes the above issues by syncing the parents of all directories and files that the sideloaded storage creates.
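For readers unfamiliar with the pattern, here is a minimal Go sketch of the technique (illustrative only, not the actual CockroachDB code): after syncing a newly created file, the parent directory must itself be fsync-ed so that the directory entry referencing the file becomes durable.

```go
package main

import (
	"os"
	"path/filepath"
)

// writeFileDurably writes data to path and syncs both the file and its
// parent directory, so that the file's content *and* the directory entry
// referencing it survive a crash or power-off.
func writeFileDurably(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	// Sync the file first: the directory entry must not become durable
	// before the content it points to.
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// Sync the parent directory to persist the directory entry itself.
	dir, err := os.Open(filepath.Dir(path))
	if err != nil {
		return err
	}
	defer dir.Close()
	return dir.Sync()
}
```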

Part of #114411

Epic: none

Release note (bug fix): this commit fixes a durability bug in raft log storage, caused by incorrect syncing of filesystem metadata. It was possible to lose writes of a particular kind (AddSSTable), used for example by RESTORE. This loss was possible only under power-off or OS crash conditions. As a result, CRDB could enter a crash loop on restart. In the worst case of a correlated power-off/crash across multiple nodes, this could lead to loss of quorum or data loss.


Release justification: critical bug fix

Footnotes

[^1]: https://man7.org/linux/man-pages/man2/fsync.2.html

This commit removes the dirCreated field of DiskSideloadStorage because
it is only used in tests and is redundant (a directory existence check
already does the job).

Epic: none
Release note: none

A couple of things to address in the future: sideloaded file removal
should happen strictly after a state machine sync; sideloaded files and
directories should be cleaned up on startup because their removal is
not always durable.

Epic: none
Release note: none

The sideloaded storage fsyncs the files that it creates. Even though
this guarantees durability of the files' content and metadata, it still
does not guarantee that the references to these files are durable. For
example, the Linux man page for fsync [^1] says:

```
Calling fsync() does not necessarily ensure that the entry in the
directory containing the file has also reached disk.  For that an
explicit fsync() on a file descriptor for the directory is also
needed.
```

This means that these files can be lost after a system crash or power
off. This leads to issues:

1. The storage syncs are not atomic with the sideloaded file syncs. It
   is thus possible that raft log metadata "references" a sideloaded
   file and is synced while the file itself is not yet durable. A
   power-off/on at this point leads to an internal inconsistency, and
   can result in a crash loop when raft tries to load these entries to
   apply and/or send to other replicas.

2. The durability of entry files is used as a pre-condition to sending
   raft messages that trigger committing these entries. A coordinated
   power-off/on on a majority of replicas can thus lead to losing
   committed entries and an unrecoverable loss of quorum.

This commit fixes the above issues by syncing the parent directory
after writing sideloaded entry files. The natural point for this is
MaybeSideloadEntries on the handleRaftReady path.
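Because several entry files can land in the same range directory during one Ready iteration, a single parent-directory sync at the end can cover all of them. A sketch of that batching under the same assumptions as above (the helper name and signature are hypothetical, not the actual logstore code):

```go
package main

import (
	"os"
	"path/filepath"
)

// sideloadEntries writes each entry file and syncs it, then syncs the
// shared parent directory once, making all of the new directory entries
// durable together. Illustrative sketch only.
func sideloadEntries(dir string, entries map[string][]byte) error {
	for name, data := range entries {
		f, err := os.Create(filepath.Join(dir, name))
		if err != nil {
			return err
		}
		if _, err := f.Write(data); err != nil {
			f.Close()
			return err
		}
		if err := f.Sync(); err != nil { // sync file content first
			f.Close()
			return err
		}
		if err := f.Close(); err != nil {
			return err
		}
	}
	// A single sync of the parent directory persists all the new entries.
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}
```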

[^1]: https://man7.org/linux/man-pages/man2/fsync.2.html

Epic: none

Release note (bug fix): this commit fixes a durability bug in raft log
storage, caused by incorrect syncing of filesystem metadata. It was
possible to lose writes of a particular kind (AddSSTable), used for
example by RESTORE. This loss was possible only under power-off or OS
crash conditions. As a result, CRDB could enter a crash loop on
restart. In the worst case of a coordinated power-off/crash across
multiple nodes this could lead to an unrecoverable loss of quorum.

This commit adds `TestSideloadStorageSync`, which demonstrates that the
sideloaded log storage can lose files and directories upon a system
crash. This is because the whole directory hierarchy is not properly
synced when the directories and files are created.

A typical sideloaded storage file (entry 123 at term 44 for range r1234)
looks like: `<data-dir>/auxiliary/sideloading/r1XXX/r1234/i123.t44`.

Only the existence of the `auxiliary` directory is persisted upon
creation, by syncing the `<data-dir>` when Pebble initializes the
store. All other directories (`sideloading`, `r1XXX`, `r1234`) are not
persisted upon creation.
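A hedged sketch of how such a crash can be simulated in a test, assuming Pebble's strict in-memory filesystem (`vfs.NewStrictMem` with `SetIgnoreSyncs`/`ResetToSyncedState`, which discards unsynced state; API as of the 23.2-era Pebble). The paths and the assertion are illustrative, not the actual `TestSideloadStorageSync`, and the exact crash semantics for directory entries depend on the vfs implementation:

```go
package logstore_test

import (
	"testing"

	"github.com/cockroachdb/pebble/vfs"
)

// TestDirEntryLostWithoutDirSync sketches the failure mode: a synced file
// inside an unsynced directory hierarchy can disappear after a simulated
// crash, because the directory entries referencing it were never durable.
func TestDirEntryLostWithoutDirSync(t *testing.T) {
	fs := vfs.NewStrictMem() // only explicitly synced state survives a "crash"

	const dir = "auxiliary/sideloading/r1XXX/r1234"
	if err := fs.MkdirAll(dir, 0755); err != nil {
		t.Fatal(err)
	}
	f, err := fs.Create(dir + "/i123.t44")
	if err != nil {
		t.Fatal(err)
	}
	if err := f.Sync(); err != nil { // the file itself is fsync-ed...
		t.Fatal(err)
	}
	if err := f.Close(); err != nil {
		t.Fatal(err)
	}

	// ...but no directory in the hierarchy was synced. Simulate a crash
	// by dropping everything that was never synced.
	fs.SetIgnoreSyncs(true)
	fs.ResetToSyncedState()
	fs.SetIgnoreSyncs(false)

	// With a vfs that models unsynced directory entries, the file is
	// expected to be gone despite having been fsync-ed.
	if _, err := fs.Stat(dir + "/i123.t44"); err == nil {
		t.Fatal("expected the unsynced directory entry to be lost")
	}
}
```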

Epic: none
Release note: none

The sideloaded log storage does not sync the hierarchy of directories
it creates. This can potentially lead to full or partial loss of its
sub-directories in case of a system crash or power off.

After this commit, every time sideloaded storage creates a directory, it
syncs its parent so that the reference is durable.
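The same pattern applies to directory creation: a new directory is only durably linked into the filesystem once its parent is synced, so a nested hierarchy needs this at every level. A minimal sketch using plain os-level calls (illustrative, not the actual storage code):

```go
package main

import (
	"os"
	"path/filepath"
)

// mkdirDurably creates a directory and then syncs its parent, so that
// the new directory entry survives a crash. Creating a nested hierarchy
// requires repeating this at every level.
func mkdirDurably(path string) error {
	if err := os.Mkdir(path, 0755); err != nil {
		return err
	}
	parent, err := os.Open(filepath.Dir(path))
	if err != nil {
		return err
	}
	defer parent.Close()
	return parent.Sync()
}
```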

Epic: none
Release note: none

@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.2-114616 branch from 1ad14d0 to f8af9ff on December 6, 2023 17:42
@blathers-crl blathers-crl bot requested review from a team as code owners December 6, 2023 17:42
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Dec 6, 2023

blathers-crl bot commented Dec 6, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious issues or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL and one additional TL. For more information as to how that review should be conducted, please consult the backport policy.

If your backport adds new functionality, please ensure that the following additional criteria are satisfied:

  • There is a high-priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed-version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this backport.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Dec 6, 2023
@cockroach-teamcity

This change is Reviewable

@pav-kv pav-kv merged commit 6fc66ab into release-23.2 Dec 7, 2023
5 of 6 checks passed
@pav-kv pav-kv deleted the blathers/backport-release-23.2-114616 branch December 7, 2023 09:31
@yuzefovich

Note that this was merged after the 23.2.0-rc branch was cut.

@pav-kv pav-kv commented Dec 8, 2023

Created #115841 for the RC too.
