Enable Fully Concurrent Snapshot Operations (#56911)
Enables fully concurrent snapshot operations:
* Snapshot create and delete operations can be started in any order
* Delete operations wait for snapshot finalization to finish and are batched as much as possible to improve efficiency; once enqueued in the cluster state, they prevent new snapshots from starting on data nodes until they have executed
   * We could be even more concurrent here in a follow-up by interleaving deletes and snapshots on a per-shard level. I decided not to do this for now since it did not seem worth the added complexity yet. Due to the batching and deduplication of deletes, the pain of having a delete stuck behind a long-running snapshot seemed manageable: dropped client connections and the resulting retries don't cause issues because delete jobs are deduplicated, and batching allows enqueuing more and more deletes even while a snapshot blocks for a long time. They will all be executed in essentially constant time, since bulk snapshot deletion makes deleting multiple snapshots about as fast as deleting a single one.
* Snapshot creation is completely concurrent across shards, but per-shard snapshots are linearized for each repository, as are snapshot finalizations
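To illustrate the batching and deduplication of deletes described above, here is a minimal sketch with hypothetical names (this is not the actual `SnapshotsService` code): while one delete is pending behind a running snapshot, further delete requests for the same repository fold into the queued entry instead of creating new jobs, so the whole batch later executes as a single bulk deletion.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of delete batching + deduplication for one repository.
class QueuedDeletesSketch {
    // Snapshot names queued for deletion, insertion-ordered and deduplicated.
    private final Set<String> queued = new LinkedHashSet<>();

    /** Returns true if the request was folded into an already-queued delete. */
    synchronized boolean enqueue(String snapshotName) {
        boolean alreadyQueued = queued.isEmpty() == false;
        queued.add(snapshotName); // duplicate requests collapse here
        return alreadyQueued;
    }

    /** Drains the whole batch for one bulk deletion once the repository is free. */
    synchronized Set<String> drainBatch() {
        Set<String> batch = new LinkedHashSet<>(queued);
        queued.clear();
        return batch;
    }
}
```

Under this scheme, retries after dropped client connections simply re-enter the same queued batch, which is why they are harmless.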

See the updated JavaDoc and the added test cases for more details and an illustration of the functionality.

Some notes:

The queuing of snapshot finalizations and deletes and the related locking/synchronization is a little awkward in this version but can be much simplified with some refactoring. The problem is that snapshot finalizations resolve their listeners on the `SNAPSHOT` pool while deletes resolve the listener on the master update thread. With some refactoring, both of these could be moved to the master update thread, effectively removing the need for any synchronization around the `SnapshotService` state. I didn't do this refactoring here because it's a fairly large change and not necessary for the functionality, but I plan to do so in a follow-up.
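The need for synchronization arises because finalizations can complete on either pool. A minimal sketch, with hypothetical names (not the actual implementation), of how per-repository finalizations are linearized behind a lock: submissions run immediately if nothing is in flight, otherwise they queue, and each completion kicks off the next one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: linearize finalizations for one repository.
// synchronized is needed because completions may arrive on different threads.
class FinalizationQueueSketch {
    private final Deque<Runnable> pending = new ArrayDeque<>();
    private boolean running;

    synchronized void submit(Runnable finalization) {
        if (running) {
            pending.addLast(finalization); // wait for the in-flight finalization
        } else {
            running = true;
            finalization.run();
        }
    }

    /** Invoked when the in-flight finalization completes, on whichever pool resolved it. */
    synchronized void onCompleted() {
        Runnable next = pending.pollFirst();
        if (next == null) {
            running = false;
        } else {
            next.run(); // start the next queued finalization in submission order
        }
    }
}
```

If all completions were resolved on the master update thread, as the follow-up refactoring proposes, the `synchronized` blocks here would become unnecessary.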

This change makes it possible to completely remove the trickery around synchronizing deletes and snapshots from SLM, and fully does away with SLM errors caused by collisions between deletes and snapshots.

Snapshotting a single index in parallel to a long-running full backup will execute without having to wait for the long-running backup, as required by the ILM/SLM use case of moving indices to the "snapshot tier". Finalizations are linearized but ordered according to which snapshot saw all of its shards complete first.
original-brownbear committed Jul 10, 2020
1 parent aacf624 commit d333dac
Showing 23 changed files with 2,867 additions and 471 deletions.
@@ -249,7 +249,7 @@ public void finalizeSnapshot(ShardGenerations shardGenerations, long repositoryS

     @Override
     public void deleteSnapshots(Collection<SnapshotId> snapshotIds, long repositoryStateId, Version repositoryMetaVersion,
-                                ActionListener<Void> listener) {
+                                ActionListener<RepositoryData> listener) {
         if (SnapshotsService.useShardGenerations(repositoryMetaVersion) == false) {
             listener = delayedListener(listener);
         }
@@ -196,6 +196,7 @@ private void buildResponse(SnapshotsInProgress snapshotsInProgress, SnapshotsSta
                 break;
             case INIT:
             case WAITING:
+            case QUEUED:
                 stage = SnapshotIndexShardStage.STARTED;
                 break;
             case SUCCESS:
