
Limit the number of concurrent shard snapshots #89826

Open

tlrx opened this issue Sep 6, 2022 · 8 comments
Labels

- :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)
- >enhancement
- Team:Distributed (Meta label for distributed team)

Comments

@tlrx
Member

tlrx commented Sep 6, 2022

Description

Since #56911 we can create (or delete) snapshots concurrently, and we are limited to 1000 operations at a time. But we don't have any limit on the number of shards these snapshots can contain, and in a cluster with many shards this can end up with hundreds of thousands of shards waiting to be snapshotted.

I think we could introduce a limit on the maximum number of shards a cluster can snapshot at a time and reject any new snapshot creation that would cause this limit to be exceeded (without adding it to the cluster state as a new snapshot-in-progress entry).

This would also serve as a cheap back-pressure mechanism in case aggressive SLM policies create new snapshots faster than the cluster can snapshot the shards.
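
As a rough illustration of the proposed admission check (a standalone sketch, not actual Elasticsearch code; the class name, limit value, and shard-count bookkeeping below are all hypothetical), the idea is simply to count the shards of snapshots already queued or running and refuse a new snapshot that would push the total over the limit:

```java
import java.util.List;

// Illustrative sketch only, not Elasticsearch code: ShardSnapshotLimiter,
// PendingSnapshot and the limit value are hypothetical names.
public class ShardSnapshotLimiter {

    private final int maxConcurrentShardSnapshots;

    public ShardSnapshotLimiter(int maxConcurrentShardSnapshots) {
        this.maxConcurrentShardSnapshots = maxConcurrentShardSnapshots;
    }

    // A snapshot that is already queued or running, with its shard count.
    public record PendingSnapshot(String name, int shardCount) {}

    // Returns true if a new snapshot of newShardCount shards may be accepted.
    // The proposal is to reject it up front, before it is added to the cluster
    // state as a snapshot-in-progress entry, rather than to fail snapshots
    // that are already queued or running.
    public boolean canAccept(List<PendingSnapshot> inProgress, int newShardCount) {
        int pendingShards = inProgress.stream().mapToInt(PendingSnapshot::shardCount).sum();
        return pendingShards + newShardCount <= maxConcurrentShardSnapshots;
    }

    public static void main(String[] args) {
        ShardSnapshotLimiter limiter = new ShardSnapshotLimiter(10_000);
        List<PendingSnapshot> inProgress = List.of(new PendingSnapshot("snap-1", 8_000));
        System.out.println(limiter.canAccept(inProgress, 3_000)); // false: would exceed 10,000
        System.out.println(limiter.canAccept(inProgress, 1_500)); // true
    }
}
```

In practice this would presumably hook into the snapshot creation path on the master node, with the limit exposed as a dynamic cluster setting, but the admission rule itself stays this simple.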

@tlrx tlrx added >enhancement :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Sep 6, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Sep 6, 2022
@pxsalehi
Member

pxsalehi commented Sep 7, 2022

I am not entirely sure I understand all the details here, but after also checking the linked SDH, I have the impression that the following two changes would potentially make this much less severe:

@tlrx what do you think? Or do you mean that regardless of the size of a SnapshotsInProgress or the number of concurrently running shard snapshots (which seems to be the source of the contention for the Azure repo), we'd still want to have a limited queue in #88209?

cc: @original-brownbear, @DaveCTurner

@tlrx
Member Author

tlrx commented Sep 7, 2022

Pooya and I discussed this together and I updated the description to (hopefully) make this clearer.

And thanks @pxsalehi for pointing me to the memory improvement from #88726; it can help that user, but I don't think it will solve the root cause of the issue.

@original-brownbear
Member

I think we have to be careful here with a limit because it could easily break large clusters at a random point when reached. The issue really isn't so much shards overall anyway, is it? The issue is shards per node, right?
Also, note that shard count is only one dimension making snapshots slow; at least in my experience (no science behind this, really), most of the time a very slow snapshot is rather the result of slowly uploading lots of data because of the low default limit?

Also, note that #89619 will massively reduce the amount of memory consumed by the internal snapshot data structure as well as actually help speed up snapshotting many shards to some degree (the amount of CPU+time it takes to fully serialize the data structure for a large number of shards is out of control at the moment but disappears in the many shard benchmarks with the diffing implementation).

Also, I wonder whether we even still need this once we have #65318 (Joe's aware of it and on it now, as far as I understand)?

@tlrx
Member Author

tlrx commented Sep 7, 2022

I think we have to be careful here with a limit because it could easily break large clusters at a random point when reached. The issue really isn't so much shards overall anyway, is it? The issue is shards per node, right?

I agree we should be careful when defining this limit, which could also be defined as a maximum number of shard snapshots per data node. Note that I'm proposing to reject new snapshots that would bring the total number of shard snapshots above the limit (I'm not proposing to fail already queued/started shard snapshots).
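
For illustration, a per-data-node variant of the same check might look roughly like this (again a hypothetical sketch, not Elasticsearch internals; node ids and per-node shard counts are assumed inputs):

```java
import java.util.Map;

// Hypothetical sketch of the per-data-node variant: reject a new snapshot if it
// would push any node's queued shard snapshots over the cap. Not Elasticsearch code.
public final class PerNodeShardSnapshotLimit {

    private final int maxShardSnapshotsPerNode;

    public PerNodeShardSnapshotLimit(int maxShardSnapshotsPerNode) {
        this.maxShardSnapshotsPerNode = maxShardSnapshotsPerNode;
    }

    // queuedPerNode: shard snapshots already queued or running, keyed by node id.
    // newShardsPerNode: shards the new snapshot would add on each node.
    public boolean canAccept(Map<String, Integer> queuedPerNode, Map<String, Integer> newShardsPerNode) {
        return newShardsPerNode.entrySet().stream()
            .allMatch(e -> queuedPerNode.getOrDefault(e.getKey(), 0) + e.getValue() <= maxShardSnapshotsPerNode);
    }
}
```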

Also, note that shard count is only one dimension making snapshots slow; at least in my experience (no science behind this, really), most of the time a very slow snapshot is rather the result of slowly uploading lots of data because of the low default limit?

I agree. My most recent experience was with a master+data node with a snapshot thread pool of 1 and 600K shard snapshots. I think #89608 would help this special case too.

Also, note that #89619 will massively reduce the amount of memory consumed by the internal snapshot data structure as well as actually help speed up snapshotting many shards to some degree (the amount of CPU+time it takes to fully serialize the data structure for a large number of shards is out of control at the moment but disappears in the many shard benchmarks with the diffing implementation).

I do agree this is a good improvement but my fear is that without a limit on shard snapshots a cluster would still be at risk of having shard snapshots queued faster than it can process them, only at a different scale? And #65318 solves the SLM side of things but people using other tools to create snapshots won't be protected unless Elasticsearch rejects the snapshot creation itself.

@original-brownbear
Member

without a limit on shard snapshots a cluster would still be at risk of having shard snapshots queued faster than it can process them, only at a different scale?

Maybe, but then it's really on the user to stop queuing up snapshots, IMO. It seems to me from benchmarking the diffing that, at least memory-wise, we survive a lot with just the diffing fix, and that might be good enough?

@DaveCTurner
Contributor

I'm inclined to agree with Armin here. If they are pushing too hard then eventually they'll hit the concurrent ops limit, so as long as the cluster can cope with this many concurrent ops then I'm not sure we need another limit.

Also in these cases many of the shard snapshots will be effectively no-ops (except for some metadata work, I guess). Perhaps we can work harder to deduplicate these things to save on the metadata work too?

AIUI in this case it didn't work to cancel the ongoing snapshots because of all the enqueued work. If so, can we do something to address that?

@original-brownbear
Member

Also in these cases many of the shard snapshots will be effectively no-ops (except for some metadata work, I guess). Perhaps we can work harder to deduplicate these things to save on the metadata work too?

Yeah, we could add more batching here at the data node level for sure. It's part of the plan I had around #89019. Once that refactoring has landed, we could make the NOOP snapshots a lot more efficient.
