
Creating a snapshot does not verify that all nodes are writing to the same blobstore #81907

Open · DaveCTurner opened this issue on Dec 19, 2021 · 1 comment

Labels: >bug, :Distributed Coordination/Snapshot/Restore, Supportability, Team:Distributed

Comments

DaveCTurner (Contributor) commented on Dec 19, 2021

Snapshots work by writing to a blobstore in which the same blob can be accessed at the same path from every node. By default we verify, at the time the repository is registered, that the blobstore really is shared correctly across all nodes. This check helps catch configuration and permission errors, including cases where the underlying blobstore is not actually shared at all. Users can bypass the check if they need to register a repository whose blobstore is unavailable at registration time but will become available later on.
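For context, the registration-time check boils down to a write-then-read-back round trip through the blobstore. A minimal sketch of the idea in Java (the interface, class, and probe path below are illustrative, not Elasticsearch's actual internals):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical per-node handle onto the repository's blobstore.
interface NodeBlobClient {
    void writeBlob(String path, byte[] contents) throws IOException;
    byte[] readBlob(String path) throws IOException;
}

final class RegistrationVerifier {
    /** Master writes a probe blob; every data node must read the same bytes back. */
    static void verify(NodeBlobClient master, Iterable<NodeBlobClient> dataNodes) throws IOException {
        byte[] seed = Long.toHexString(System.nanoTime()).getBytes(StandardCharsets.UTF_8);
        String probePath = "tests/master.dat"; // illustrative probe location
        master.writeBlob(probePath, seed);
        for (NodeBlobClient node : dataNodes) {
            byte[] seen = node.readBlob(probePath); // fails if this node cannot see the master's blob
            if (!Arrays.equals(seed, seen)) {
                throw new IOException("blobstore is not shared: probe mismatch at " + probePath);
            }
        }
    }
}
```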

Today, if the blobstore is accessible but not shared (and the user has bypassed the registration-time check that would catch this), then snapshot creation reports success, because we create snapshots without ever reading a blob that another node has written. Listing, restoring, and deleting snapshots may also sometimes appear to succeed. However, it is definitely not safe to rely on such a setup to protect your data.

We should not report success when creating a snapshot in such a setup. We can detect this sort of problem by having the master read at least one blob written by every data node during snapshot creation. We mustn't verify too many blobs (e.g. one per shard) since this would be slow and expensive without adding much extra protection.

I propose that the master reads the first BlobStoreIndexShardSnapshot that each data node writes, and fails the snapshot if that read fails. I think we don't need to re-check this on every snapshot creation; it should be enough to remember past successes from nodes that have remained in the cluster since then. Possibly we should re-check every 24h or so, just in case the repository gets unmounted out from under us.
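Roughly, the master-side bookkeeping could look like the sketch below. All names here are hypothetical, and the shard snapshot blob is reduced to an opaque path; this is not the actual implementation:

```java
import java.io.IOException;
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class NodeWriteVerifier {
    private static final Duration RECHECK_INTERVAL = Duration.ofHours(24); // periodic re-check
    private final Map<String, Instant> lastVerified = new ConcurrentHashMap<>(); // nodeId -> last success

    // Hypothetical handle the master uses to read blobs from the repository.
    interface BlobReader {
        byte[] readBlob(String path) throws IOException;
    }

    /** Called on the master when {@code nodeId} reports its first shard snapshot blob. */
    void verifyNodeWrite(String nodeId, String blobPath, BlobReader masterReader) throws IOException {
        Instant last = lastVerified.get(nodeId);
        if (last != null && last.plus(RECHECK_INTERVAL).isAfter(Instant.now())) {
            return; // verified recently and node has stayed in the cluster: skip the read
        }
        masterReader.readBlob(blobPath); // throws -> fail the snapshot: node wrote somewhere else
        lastVerified.put(nodeId, Instant.now());
    }

    /** Forget a node when it leaves the cluster, so it is re-verified if it rejoins. */
    void onNodeRemoved(String nodeId) {
        lastVerified.remove(nodeId);
    }
}
```

Evicting entries when a node leaves the cluster keeps the "has remained in the cluster since" condition honest, and the 24h interval bounds the extra reads to roughly one per node per day.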

DaveCTurner added the >bug and :Distributed Coordination/Snapshot/Restore labels on Dec 19, 2021
elasticmachine added the Team:Distributed label on Dec 19, 2021
elasticmachine (Collaborator) commented on Dec 19, 2021

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added the Supportability label on Sep 20, 2022