Making snapshot deletes distributed across data nodes. #39656

Closed
wants to merge 1 commit

Conversation

backslasht
Contributor

Unlike the snapshot creation operation, snapshot deletion is executed on the master node. This is acceptable for small to medium sized clusters, where the delete operation completes in less than a minute. But for larger clusters with a few thousand primary shards, deletion takes a considerable amount of time because only one thread works on deleting the snapshot data, index by index and shard by shard.

To reduce the time spent on deletion, this change parallelizes (distributes) the snapshot deletes across all data nodes, similar to snapshot creation, but without a tight coupling between a data node and the snapshot shards being deleted.

With this change, on a cluster with 3 dedicated masters, 20 data nodes, and 20,000 primary shards, deletion took around 1 minute, whereas without this change it took around 70 minutes.
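As an illustration of the idea only, here is a minimal sketch of the fan-out: shard snapshots are partitioned round-robin across data nodes, which could then delete their assigned blobs in parallel. The class and method names are hypothetical and do not correspond to Elasticsearch APIs or to the actual implementation in this PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual only: spreads shard-snapshot deletion work evenly across data nodes.
// "DeleteFanOutSketch" and the shard/node identifiers are placeholders.
public class DeleteFanOutSketch {
    static Map<String, List<Integer>> assignShardsToNodes(List<String> dataNodes, int shardCount) {
        Map<String, List<Integer>> assignments = new HashMap<>();
        dataNodes.forEach(node -> assignments.put(node, new ArrayList<>()));
        for (int shard = 0; shard < shardCount; shard++) {
            // Round-robin assignment: no tight coupling between a node and a shard's data.
            String node = dataNodes.get(shard % dataNodes.size());
            assignments.get(node).add(shard);
        }
        return assignments;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> plan = assignShardsToNodes(List.of("node-1", "node-2", "node-3"), 10);
        // Each node would then delete its assigned shard snapshot blobs in parallel.
        plan.forEach((node, shards) -> System.out.println(node + " deletes shards " + shards));
    }
}
```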

Contributor

@ywelsch ywelsch left a comment


The problem is that deletes are processed sequentially on the master. I don't think we should add a whole new distributed layer for this, but just look into doing some of these calls concurrently on the master.

@backslasht before working on such a large item, it's best to open an issue first to discuss the solution.

@ywelsch added the :Distributed/Snapshot/Restore label (anything directly related to the `_snapshot/*` APIs) on Mar 4, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@backslasht
Contributor Author

backslasht commented Mar 5, 2019

@ywelsch - Like you mentioned, I initially started with making calls concurrent on the master. It does help a little, but the time taken is still quite high.

Let us take the following setup as an example.
Dedicated master available: Yes
Number of data nodes: 100
Number of primary shards: 10,000
Time taken to compute the list of shards = y seconds (say 10, for example)
Average time to delete a snapshot of a shard = x seconds (say 5, for example)

  1. Deletion time with a single thread (as it exists today)
    -> y + 10000 * x

  2. Deletion time with parallelization on the master (5 threads) - **5x improvement over 1**
    -> y + (10000 / 5) * x
    -> y + 2000 * x

  3. Deletion time with multiple threads across data nodes (proposed change, 5 threads per data node) - **500x improvement over 1 and 100x improvement over 2**
    -> y + ((10000 / 100) / 5) * x
    -> y + (100 / 5) * x
    -> y + 20 * x

So, the proposed change makes the snapshot delete time a function of both the number of data nodes and the number of primary shards, similar to snapshot creation. The deletion time stays roughly constant as data nodes increase (with a corresponding increase in shards).
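Plugging the example values into the formulas above (y = 10 s, x = 5 s): case 1 works out to about 50,010 s (~14 hours), case 2 to about 10,010 s (~2.8 hours), and case 3 to about 110 s (~2 minutes).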

@ywelsch
Contributor

ywelsch commented Mar 7, 2019

I initially started with making calls concurrent on the master. It does help a little, but the time taken is still quite high ...

That's a bit of an odd comparison. It all depends on how many concurrent calls you make globally. Whether the calls are done from a single node or multiple nodes does not matter, given that the requests are very small and latency seems to be the determining factor.

We had a team discussion on this implementation and think it's too complex, which is why I'm closing this PR. We still think this is a problem worth tackling, but we are looking at other ways of achieving it without building a full distributed task queue. The two options we're looking at are 1) parallelizing deletes on the master, possibly using async APIs, and 2) using repository-specific options (e.g. bulk deletions on S3).
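To make option 2 concrete, here is a minimal, hypothetical sketch of a bulk deletion against S3 using the AWS SDK for Java v1 `DeleteObjectsRequest`. The bucket name and keys are placeholders, and this is not the code path of the `repository-s3` plugin; it only shows how many per-blob DELETE calls can collapse into a single request.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.DeleteObjectsResult;

import java.util.List;

public class BulkDeleteSketch {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Hypothetical blob keys belonging to the snapshot being deleted.
        List<String> keys = List.of(
            "indices/0/__snapshot-blob-1",
            "indices/0/__snapshot-blob-2");

        // S3 accepts up to 1000 keys per DeleteObjects call, so one round trip
        // can replace up to 1000 individual per-blob DELETE requests.
        DeleteObjectsRequest request = new DeleteObjectsRequest("my-snapshot-bucket")
            .withKeys(keys.toArray(new String[0]));
        DeleteObjectsResult result = s3.deleteObjects(request);
        System.out.println("Deleted " + result.getDeletedObjects().size() + " blobs");
    }
}
```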

@ywelsch ywelsch closed this Mar 7, 2019
original-brownbear added a commit that referenced this pull request Apr 5, 2019
Motivated by slow snapshot deletes reported in e.g. #39656 and the fact that these likely are a contributing factor to repositories accumulating stale files over time when deletes fail to finish in time and are interrupted before they can complete. 

* Makes snapshot deletion async and parallelizes some steps of the delete process that can be safely run concurrently via the snapshot thread pool
   * I did not take the biggest potential speedup step here and parallelize the shard file deletion because that's probably better handled by moving to bulk deletes where possible (and can still be parallelized via the snapshot pool where it isn't). Also, I wanted to keep the size of the PR manageable. 
* See #39656 (comment)
* Also, as a side effect this gives the `SnapshotResiliencyTests` a little more coverage for master failover scenarios (since parallel access to a blob store repository during deletes is now possible since a delete isn't a single task anymore).
* By adding a `ThreadPool` reference to the repository this also lays the groundwork to parallelizing shard snapshot uploads to improve the situation reported in #39657
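As a rough illustration of what parallelizing delete steps via the snapshot thread pool could look like, here is a minimal sketch using a plain `ExecutorService`; the pool size and the per-index work are placeholders and do not reflect the actual Elasticsearch `ThreadPool` wiring in the commit above.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: runs independent per-index cleanup steps concurrently on a
// bounded pool instead of one after another on a single thread.
public class ParallelDeleteStepsSketch {
    public static void main(String[] args) throws InterruptedException {
        List<String> indices = List.of("index-1", "index-2", "index-3");
        ExecutorService snapshotPool = Executors.newFixedThreadPool(5); // stand-in for the SNAPSHOT pool
        CountDownLatch done = new CountDownLatch(indices.size());
        for (String index : indices) {
            snapshotPool.execute(() -> {
                try {
                    System.out.println("deleting snapshot metadata and blobs for " + index);
                } finally {
                    done.countDown();
                }
            });
        }
        done.await(); // only finalize the delete once every concurrent step has completed
        snapshotPool.shutdown();
    }
}
```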
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 26, 2019
original-brownbear added a commit that referenced this pull request Apr 26, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
Labels
:Distributed/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), team-discuss