Creating a snapshot does not verify that all nodes are writing to the same blobstore #81907
Labels: >bug, :Distributed Coordination/Snapshot/Restore, Supportability, Team:Distributed
Snapshots work by writing to a blobstore in which the same blob can be accessed at the same path from every node. By default we check at registration time that the repository really is shared across nodes in this way; this catches configuration and permission errors, including cases where the underlying blobstore is not properly shared. Users who need to register a blobstore which is unavailable at registration time but will become available later on can bypass this check.
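For illustration, the registration-time bypass is the `verify=false` request parameter, and the same check can be run on demand later with the verify repository API. A minimal example against a shared-filesystem repository (repository name and path are placeholders):

```
# register without the read/write verification step
PUT _snapshot/my_fs_repo?verify=false
{
  "type": "fs",
  "settings": {
    "location": "/mnt/shared-backups"
  }
}

# run the verification later, once the blobstore is mounted on every node
POST _snapshot/my_fs_repo/_verify
```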
Today, if the blobstore is accessible but not shared (and the user bypasses the registration-time checks that would prevent this), snapshot creation reports success because we create snapshots without ever reading a blob that another node has written. Listing, restoring, and deleting snapshots may also sometimes appear to succeed. However, it is definitely not safe to rely on such a setup to protect your data.
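For example (placeholder names again), creating a snapshot against such a repository reports success today, even though no blob written by one node is ever read back by another:

```
PUT _snapshot/my_fs_repo/snap-1?wait_for_completion=true
```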
We should not report success when creating a snapshot in such a setup. We can detect this sort of problem by having the master read at least one blob written by every data node during snapshot creation. We mustn't verify too many blobs (e.g. one per shard) since this would be slow and expensive without adding much extra protection.
I propose that the master reads the first `BlobStoreIndexShardSnapshot` that each data node writes, and fails the snapshot if that read fails. I think we don't need to re-check this on every snapshot creation; it should be enough to remember past successes of nodes that have remained in the cluster since. Possibly we should re-check every 24h or so just in case the repository gets unmounted out from under us.
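To make the intent concrete, here is a rough sketch of the bookkeeping described above (plain Java, not the real snapshot-service code; `BlobReader` and `readFirstShardSnapshotBlob` are made-up placeholders): the master reads back one blob per not-yet-verified data node, fails the snapshot if any read fails, remembers successes for nodes that stay in the cluster, and re-checks after roughly 24 hours.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class NodeWriteVerifier {

    // Node IDs whose writes the master has successfully read back, with the time of the check.
    private final Map<String, Long> verifiedAtMillis = new ConcurrentHashMap<>();

    private static final long RECHECK_INTERVAL_MILLIS = 24L * 60 * 60 * 1000; // re-check roughly daily

    interface BlobReader {
        // Hypothetical hook: reads the first BlobStoreIndexShardSnapshot blob the given
        // node wrote during the current snapshot, throwing if the read fails.
        void readFirstShardSnapshotBlob(String nodeId) throws Exception;
    }

    /**
     * Called on the master after data nodes have written their shard-level blobs.
     * Throws if any node's write cannot be read back, so the snapshot is failed
     * rather than reported as successful.
     */
    void verifyDataNodeWrites(Set<String> dataNodeIds, BlobReader reader) throws Exception {
        long now = System.currentTimeMillis();
        for (String nodeId : dataNodeIds) {
            Long lastVerified = verifiedAtMillis.get(nodeId);
            if (lastVerified != null && now - lastVerified < RECHECK_INTERVAL_MILLIS) {
                continue; // recently verified and continuously in the cluster since: skip
            }
            reader.readFirstShardSnapshotBlob(nodeId); // throws -> snapshot fails
            verifiedAtMillis.put(nodeId, now);
        }
    }

    // Forget nodes that leave the cluster so they are re-verified if they rejoin.
    void onNodeRemoved(String nodeId) {
        verifiedAtMillis.remove(nodeId);
    }
}
```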