
Snapshot and restore queue #27353

Closed
redlus opened this issue Nov 11, 2017 · 12 comments
Labels
discuss :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@redlus

redlus commented Nov 11, 2017

Hey

We've been trying to automate index snapshot creation through the elasticsearch snapshot API. However, we get an exception when trying to run more than one snapshot in parallel:
elasticsearch.exceptions.TransportError: TransportError(503, 'concurrent_snapshot_execution_exception', 'a snapshot is already running')

This is odd behavior, as elasticsearch already has queues for many of its operations - and I'd expect the same for snapshots and restores. I understand there is a technical limitation to actually running two snapshots concurrently, but I think it would be a good idea for elasticsearch to add snapshot requests to a queue, automatically executing each snapshot once the previous one finishes.
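For illustration, the proposed queue can be approximated client-side today: poll for in-progress snapshots and only submit once the repository is free. A minimal sketch assuming the Python `elasticsearch` client (the helper name and polling interval are ours, not part of any API):

```python
import time

def create_snapshot_when_free(es, repository, snapshot, poll_seconds=30):
    """Client-side stand-in for a server-side snapshot queue: wait until no
    snapshot is running, then submit ours. `es` is assumed to be an
    elasticsearch-py client; only the documented snapshot API is used."""
    while True:
        # GET _snapshot/<repository>/_current lists only in-progress snapshots
        current = es.snapshot.get(repository=repository, snapshot="_current")
        if not current.get("snapshots"):
            break
        time.sleep(poll_seconds)
    # Safe to submit now (modulo a small race with other clients)
    return es.snapshot.create(repository=repository, snapshot=snapshot,
                              wait_for_completion=True)
```

Note the race window between the poll and the create: two clients can still collide and one will get the 503, so callers should still be prepared to retry.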

@dnhatn dnhatn added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs discuss labels Nov 12, 2017
@imotov
Contributor

imotov commented Nov 17, 2017

This is an interesting idea. We should definitely consider it for the snapshot redesign, unless we go with the continuous backup idea, in which case this is not going to be applicable.

On the other hand, a restore queue doesn't make much sense to me, so we should probably rename the issue to Snapshot Queue.

@s1monw
Contributor

s1monw commented Nov 17, 2017

we spoke about this and decided that we don't want to add a queue, since it's nice to give the user immediate feedback that we can't run more than one snapshot. There is also the fact that we don't queue other long-running processes, and this one would be an exception.

@DaveCTurner
Contributor

There is still interest in this issue from time to time, particularly from users wanting to enqueue restore jobs. Snapshots are normally a periodic cluster-wide activity, so it doesn't make much sense to queue them up, but restores are more ad hoc and typically focus on only a few indices, so a bit of asynchrony might be useful.

As of 6.6.0 Elasticsearch now supports concurrent restores. A user can trigger multiple restores at once and Elasticsearch will throttle the corresponding recoveries, effectively enqueueing the restoration of each shard until it has the capacity to handle it.

However, Elasticsearch still forbids restores that run concurrently with either taking or deleting a snapshot. We recommend taking frequent incremental snapshots, which can make it challenging to find a good time to perform a restore.
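Concretely, the concurrent-restore behaviour means a client can simply fire off one restore request per index and let the cluster throttle the recoveries. A minimal sketch assuming the Python `elasticsearch` client (the function name is illustrative):

```python
def restore_indices(es, repository, snapshot, indices):
    """Trigger one restore per index without waiting; since 6.6.0 the
    cluster throttles the resulting shard recoveries itself, effectively
    queueing them. `es` is assumed to be an elasticsearch-py client."""
    responses = []
    for index in indices:
        responses.append(es.snapshot.restore(
            repository=repository,
            snapshot=snapshot,
            body={"indices": index},
            wait_for_completion=False,  # return immediately; recovery is queued
        ))
    return responses
```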

@redlus
Author

redlus commented Apr 11, 2019

@DaveCTurner this is good news. I'd have to ask, though: I'd expect only the relevant shards to be locked during concurrent snapshot actions. Is there a technical limitation to performing snapshots/restores which do not touch the same shards, or is it just plain code?

@DaveCTurner
Contributor

is it just plain code?

"Just" is just the worst word 😁 There are some quite significant obstacles.

@johanmha

johanmha commented Nov 26, 2020

@DaveCTurner has there been any new discussion on the theme of a snapshot queue? I can see several people on the internet argue that it is better to simply take one large snapshot. However, one might as well argue that one large snapshot of the whole cluster leaves you vulnerable to user errors (for example someone manually deleting the container with snapshots) or to faults such as the snapshot or container being corrupt. If you implement different snapshots for different groups of indices, this helps mitigate those risks even further. One could of course avoid this issue by simply setting up the policies with different timeslots, but this grows hard to manage as the number of desired snapshot policies grows.
Any input would be highly appreciated.

@DaveCTurner
Contributor

@johanmha I think the implementation is complete now so there's nothing left to discuss.

It's generally best to put all your snapshots in one repository rather than spreading them around, since it's going to be painful to reconstruct your cluster from multiple snapshots in different repositories after a disaster. There's no technical reason not to do this, it's just a bad idea. We can't really offer much advice on protecting a repository from damage (whether malicious or accidental) - it's up to you to apply appropriate access controls and monitor your disks for errors and so on.

@packetrevolt

The lack of some sort of snapshot queue is an ongoing problem for us. It is to the point that if there was a third party plugin available we would buy it.

On average, a single big snapshot job takes about 7 hours and runs at about 2TB per hour. Different sets of data have different regulatory or legal requirements, which call for different snapshot jobs, so in practice we are running snapshot jobs nearly 24x7. Keeping several sets of jobs scheduled so they don’t stomp on each other is not fun: we dedicate 2 hours of staff time per day and a big Excel sheet to keeping it all sorted, plus automated scripting to retry failed jobs. A time-consuming and poorly working hack, but there aren't many other options.

@DaveCTurner
Contributor

DaveCTurner commented Nov 26, 2020

Keeping several sets of jobs scheduled so they don’t stomp on each other is not fun.

... or necessary? You can run snapshots in parallel these days.

edit: Also, if you want to keep different indices for different lengths of time to satisfy regulatory requirements, it's probably simplest to take frequent whole-cluster snapshots and then use the clone snapshot API to make snapshots containing just the indices you want to retain for longer. Cloning is a zero-copy operation so it's pretty cheap.
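For example, the retention scheme suggested above might look roughly like this via the clone snapshot API. A sketch assuming a recent Python `elasticsearch` client that exposes the clone API; all names here are illustrative:

```python
def retain_indices_longer(es, repository, nightly_snapshot, clone_name, indices):
    """Clone a whole-cluster snapshot down to just the indices that need
    longer retention. Cloning is zero-copy within the repository, so no
    snapshot data is re-uploaded. `es` is assumed to be an elasticsearch-py
    client with the clone snapshot API available."""
    return es.snapshot.clone(
        repository=repository,
        snapshot=nightly_snapshot,          # the frequent whole-cluster snapshot
        target_snapshot=clone_name,         # the long-retention clone
        body={"indices": ",".join(indices)},
    )
```

The nightly snapshots can then be deleted on a short schedule while the clones persist under their own retention policy.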

@johanmha

johanmha commented Nov 27, 2020

@DaveCTurner I wasn't aware you could run snapshots in parallel! How is this possible?

And thanks for the prompt reply!

@DaveCTurner
Contributor

This conversation isn't really on-topic for a GitHub issue, so I suggest we continue it over on the discussion forum. I won't be replying here any more, but feel free to link to your forum thread below.

@johanmha

johanmha commented Nov 30, 2020

Link to discussion forum as suggested:
https://discuss.elastic.co/t/taking-several-snapshots-in-parallell/257054
