
Limit the number of concurrent shard snapshots #89826

Open

tlrx opened this issue Sep 6, 2022 · 8 comments
Labels

- :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)
- >enhancement
- Team:Distributed (Meta label for distributed team)

Comments

@tlrx
Member

tlrx commented Sep 6, 2022

Description

Since #56911 we can create (or delete) snapshots concurrently, and we are limited to 1000 operations at a time. But we don't have any limit on the number of shards these snapshots can contain, and in a cluster with many shards this can end up with hundreds of thousands of shards waiting to be snapshotted.

I think we could introduce a limit on the maximum number of shards a cluster can snapshot at a time and reject any new snapshot creation that would cause this limit to be exceeded (without adding it to the cluster state as a new snapshot-in-progress entry).

This would also serve as a cheap back-pressure mechanism in case aggressive SLM policies create new snapshots faster than the cluster can snapshot the shards.
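
As a rough illustration of the proposed admission check (a standalone sketch, not actual Elasticsearch code; the class name, limit value, and shard-count bookkeeping below are all hypothetical), the idea is simply to count the shards of snapshots already queued or running and refuse a new snapshot that would push the total over the limit:

```java
import java.util.List;

// Illustrative sketch only, not Elasticsearch code: ShardSnapshotLimiter,
// PendingSnapshot and the limit value are hypothetical names.
public class ShardSnapshotLimiter {

    private final int maxConcurrentShardSnapshots;

    public ShardSnapshotLimiter(int maxConcurrentShardSnapshots) {
        this.maxConcurrentShardSnapshots = maxConcurrentShardSnapshots;
    }

    // A snapshot that is already queued or running, with its shard count.
    public record PendingSnapshot(String name, int shardCount) {}

    // Returns true if a new snapshot of newShardCount shards may be accepted.
    // The proposal is to reject it up front, before it is added to the cluster
    // state as a snapshot-in-progress entry, rather than to fail snapshots
    // that are already queued or running.
    public boolean canAccept(List<PendingSnapshot> inProgress, int newShardCount) {
        int pendingShards = inProgress.stream().mapToInt(PendingSnapshot::shardCount).sum();
        return pendingShards + newShardCount <= maxConcurrentShardSnapshots;
    }

    public static void main(String[] args) {
        ShardSnapshotLimiter limiter = new ShardSnapshotLimiter(10_000);
        List<PendingSnapshot> inProgress = List.of(new PendingSnapshot("snap-1", 8_000));
        System.out.println(limiter.canAccept(inProgress, 3_000)); // false: would exceed 10,000
        System.out.println(limiter.canAccept(inProgress, 1_500)); // true
    }
}
```

In practice this would presumably hook into the snapshot creation path on the master node, with the limit exposed as a dynamic cluster setting, but the admission rule itself stays this simple.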

@tlrx tlrx added >enhancement :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Sep 6, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Sep 6, 2022
@pxsalehi
Member

pxsalehi commented Sep 7, 2022

I am not entirely sure I understand all the details here, but after also checking the linked SDH, I have the impression that the following two changes would potentially make this much less severe:

@tlrx what do you think? Or do you mean that regardless of the size of a SnapshotsInProgress or the number of concurrently running shard snapshots (which seems to be the source of the contention for the Azure repo), we'd still want to have a limited queue in #88209?

cc: @original-brownbear, @DaveCTurner

@tlrx
Member Author

tlrx commented Sep 7, 2022

Pooya and I discussed this together and I updated the description to (hopefully) make this clearer.

And thanks @pxsalehi for pointing me to the memory improvement from #88726; it can help that user, but I don't think it will solve the root cause of the issue.

@original-brownbear
Member

I think we have to be careful here with a limit because it could easily break large clusters at a random point when reached. The issue really isn't so much shards overall anyway, is it? The issue is shards per node, right?
Also, note that shard count is only one dimension making snapshots slow; at least in my experience (no science behind this, really), most of the time a very slow snapshot is rather the result of slowly uploading lots of data because of the low default limit?

Also, note that #89619 will massively reduce the amount of memory consumed by the internal snapshot data structure as well as actually help speed up snapshotting many shards to some degree (the amount of CPU+time it takes to fully serialize the data structure for a large number of shards is out of control at the moment but disappears in the many shard benchmarks with the diffing implementation).

Also, I wonder whether we even still need this once we have #65318 (Joe's aware of it and on it now, as far as I understand)?

@tlrx
Member Author

tlrx commented Sep 7, 2022

I think we have to be careful here with a limit because it could easily break large clusters at a random point when reached. The issue really isn't so much shards overall anyway, is it? The issue is shards per node, right?

I agree we should be careful when defining this limit, which could also be defined as a maximum number of shard snapshots per data node. Note that I'm proposing to reject new snapshots that would bring the total number of shard snapshots above the limit (I'm not proposing to fail already queued/started shard snapshots).
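
For illustration, a per-data-node variant of the same check might look roughly like this (again a hypothetical sketch, not Elasticsearch internals; node ids and per-node shard counts are assumed inputs):

```java
import java.util.Map;

// Hypothetical sketch of the per-data-node variant: reject a new snapshot if it
// would push any node's queued shard snapshots over the cap. Not Elasticsearch code.
public final class PerNodeShardSnapshotLimit {

    private final int maxShardSnapshotsPerNode;

    public PerNodeShardSnapshotLimit(int maxShardSnapshotsPerNode) {
        this.maxShardSnapshotsPerNode = maxShardSnapshotsPerNode;
    }

    // queuedPerNode: shard snapshots already queued or running, keyed by node id.
    // newShardsPerNode: shards the new snapshot would add on each node.
    public boolean canAccept(Map<String, Integer> queuedPerNode, Map<String, Integer> newShardsPerNode) {
        return newShardsPerNode.entrySet().stream()
            .allMatch(e -> queuedPerNode.getOrDefault(e.getKey(), 0) + e.getValue() <= maxShardSnapshotsPerNode);
    }
}
```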

Also, note that shard count is only one dimension making snapshots slow; at least in my experience (no science behind this, really), most of the time a very slow snapshot is rather the result of slowly uploading lots of data because of the low default limit?

I agree. My most recent experience was with a master+data node with a snapshot thread pool of 1 and 600K shard snapshots. I think #89608 would help this special case too.

Also, note that #89619 will massively reduce the amount of memory consumed by the internal snapshot data structure as well as actually help speed up snapshotting many shards to some degree (the amount of CPU+time it takes to fully serialize the data structure for a large number of shards is out of control at the moment but disappears in the many shard benchmarks with the diffing implementation).

I do agree this is a good improvement but my fear is that without a limit on shard snapshots a cluster would still be at risk of having shard snapshots queued faster than it can process them, only at a different scale? And #65318 solves the SLM side of things but people using other tools to create snapshots won't be protected unless Elasticsearch rejects the snapshot creation itself.

@original-brownbear
Member

without a limit on shard snapshots a cluster would still be at risk of having shard snapshots queued faster than it can process them, only at a different scale?

Maybe, but then it's really on the user to stop queuing up snapshots, IMO. It seems to me from benchmarking the diffing that, at least memory-wise, we survive a lot with just the diffing fix, and that might be good enough?

@DaveCTurner
Contributor

I'm inclined to agree with Armin here. If they are pushing too hard then eventually they'll hit the concurrent ops limit, so as long as the cluster can cope with this many concurrent ops then I'm not sure we need another limit.

Also in these cases many of the shard snapshots will be effectively no-ops (except for some metadata work, I guess). Perhaps we can work harder to deduplicate these things to save on the metadata work too?

AIUI in this case it didn't work to cancel the ongoing snapshots because of all the enqueued work. If so, can we do something to address that?

@original-brownbear
Member

Also in these cases many of the shard snapshots will be effectively no-ops (except for some metadata work, I guess). Perhaps we can work harder to deduplicate these things to save on the metadata work too?

Yeah, we could add more batching here at the data node level for sure. It's part of the plan I had around #89019. Once that refactoring has landed, we could make the NOOP snapshots a lot more efficient.
