
Snapshot and restore queue #27353

Closed
redlus opened this issue Nov 11, 2017 · 12 comments
Labels
discuss :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@redlus

redlus commented Nov 11, 2017

Hey

We've been trying to automate index snapshot creation through the elasticsearch snapshot API. However, we get an exception when trying to run more than one snapshot in parallel:
elasticsearch.exceptions.TransportError: TransportError(503, 'concurrent_snapshot_execution_exception', 'a snapshot is already running')

This is odd behavior, as elasticsearch already has queues for many of its operations - and I'd expect the same for snapshots and restores. I understand there is a technical limitation to actually running two snapshots concurrently, but I think it would be a good idea for elasticsearch to add snapshot requests to a queue, automatically executing each snapshot once the previous one finishes.
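For illustration, the proposed queue can be approximated client-side today: poll for in-progress snapshots and only submit once the repository is free. A minimal sketch assuming the Python `elasticsearch` client (the helper name and polling interval are ours, not part of any API):

```python
import time

def create_snapshot_when_free(es, repository, snapshot, poll_seconds=30):
    """Client-side stand-in for a server-side snapshot queue: wait until no
    snapshot is running, then submit ours. `es` is assumed to be an
    elasticsearch-py client; only the documented snapshot API is used."""
    while True:
        # GET _snapshot/<repository>/_current lists only in-progress snapshots
        current = es.snapshot.get(repository=repository, snapshot="_current")
        if not current.get("snapshots"):
            break
        time.sleep(poll_seconds)
    # Safe to submit now (modulo a small race with other clients)
    return es.snapshot.create(repository=repository, snapshot=snapshot,
                              wait_for_completion=True)
```

Note the race window between the poll and the create: two clients can still collide and one will get the 503, so callers should still be prepared to retry.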

@dnhatn dnhatn added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs discuss labels Nov 12, 2017
@imotov
Contributor

imotov commented Nov 17, 2017

This is an interesting idea. We should definitely consider it for the snapshot redesign, unless we go with the continuous backup idea, in which case this is not going to be applicable.

On the other hand, a restore queue doesn't make much sense to me, so we should probably rename the issue to Snapshot Queue.

@s1monw
Contributor

s1monw commented Nov 17, 2017

we spoke about this and decided that we don't want to add a queue, since it's nice to give the user immediate feedback that we can't run more than one snapshot. There is also the fact that we don't queue other long-running processes, and this one would be an exception.

@DaveCTurner
Contributor

There is still interest in this issue from time to time, particularly from users wanting to enqueue restore jobs. Snapshots are normally a periodic cluster-wide activity, so it doesn't make much sense to queue them up, but restores are more ad hoc and typically focus on only a few indices, so a bit of asynchrony might be useful.

As of 6.6.0 Elasticsearch now supports concurrent restores. A user can trigger multiple restores at once and Elasticsearch will throttle the corresponding recoveries, effectively enqueueing the restoration of each shard until it has the capacity to handle it.

However, Elasticsearch still forbids restores that run concurrently with either taking or deleting a snapshot. We recommend taking frequent incremental snapshots, which can make it challenging to find a good time to perform a restore.
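Concretely, the concurrent-restore behaviour means a client can simply fire off one restore request per index and let the cluster throttle the recoveries. A minimal sketch assuming the Python `elasticsearch` client (the function name is illustrative):

```python
def restore_indices(es, repository, snapshot, indices):
    """Trigger one restore per index without waiting; since 6.6.0 the
    cluster throttles the resulting shard recoveries itself, effectively
    queueing them. `es` is assumed to be an elasticsearch-py client."""
    responses = []
    for index in indices:
        responses.append(es.snapshot.restore(
            repository=repository,
            snapshot=snapshot,
            body={"indices": index},
            wait_for_completion=False,  # return immediately; recovery is queued
        ))
    return responses
```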

@redlus
Author

redlus commented Apr 11, 2019

@DaveCTurner this is good news. I'd have to ask, though: I'd expect only the relevant shards to be locked during concurrent snapshot actions. Is there a technical limitation to performing snapshots/restores which do not touch the same shards, or is it just plain code?

@DaveCTurner
Contributor

is it just plain code?

"Just" is just the worst word 😁 There are some quite significant obstacles.

@johanmha

johanmha commented Nov 26, 2020

@DaveCTurner has there been any new discussion on the theme of a snapshot queue? I can see several people on the internet argue that it is better to simply take one large snapshot. However, one might as well argue that one large snapshot of the whole cluster leaves you vulnerable to user errors (for example someone manually deleting the container with snapshots) or to faults such as the snapshot or container being corrupt. If you implement different snapshots for different groups of indices, this helps mitigate those risks even further. One could of course avoid this issue by simply setting up the policies with different timeslots, but this grows hard to manage as the number of desired snapshot policies grows.
Any input would be highly appreciated.

@DaveCTurner
Contributor

@johanmha I think the implementation is complete now so there's nothing left to discuss.

It's generally best to put all your snapshots in one repository rather than spreading them around, since it's going to be painful to reconstruct your cluster from multiple snapshots in different repositories after a disaster. There's no technical reason not to do this, it's just a bad idea. We can't really offer much advice on protecting a repository from damage (whether malicious or accidental) - it's up to you to apply appropriate access controls and monitor your disks for errors and so on.

@packetrevolt

The lack of some sort of snapshot queue is an ongoing problem for us. It is to the point that if there was a third party plugin available we would buy it.

On average, a single big snapshot job takes about 7 hours and runs at about 2TB per hour. Different sets of data have different regulatory or legal requirements, which call for different snapshot jobs, so in practice we are running snapshot jobs nearly 24x7. Keeping several sets of jobs scheduled so they don’t stomp on each other is not fun: we dedicate 2 hours of staff time per day and a big Excel sheet to keeping it all sorted, plus automated scripting to retry failed jobs. A time-consuming and poorly working hack, but there aren't many other options.

@DaveCTurner
Contributor

DaveCTurner commented Nov 26, 2020

Keeping several sets of jobs scheduled so they don’t stomp on each other is not fun.

... or necessary? You can run snapshots in parallel these days.

edit: Also, if you want to keep different indices for different lengths of time to satisfy regulatory requirements, it's probably simplest to take frequent whole-cluster snapshots and then use the clone snapshot API to make snapshots containing just the indices you want to retain for longer. Cloning is a zero-copy operation so it's pretty cheap.
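For example, the retention scheme suggested above might look roughly like this via the clone snapshot API. A sketch assuming a recent Python `elasticsearch` client that exposes the clone API; all names here are illustrative:

```python
def retain_indices_longer(es, repository, nightly_snapshot, clone_name, indices):
    """Clone a whole-cluster snapshot down to just the indices that need
    longer retention. Cloning is zero-copy within the repository, so no
    snapshot data is re-uploaded. `es` is assumed to be an elasticsearch-py
    client with the clone snapshot API available."""
    return es.snapshot.clone(
        repository=repository,
        snapshot=nightly_snapshot,          # the frequent whole-cluster snapshot
        target_snapshot=clone_name,         # the long-retention clone
        body={"indices": ",".join(indices)},
    )
```

The nightly snapshots can then be deleted on a short schedule while the clones persist under their own retention policy.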

@johanmha

johanmha commented Nov 27, 2020

@DaveCTurner I wasn't aware you could run snapshots in parallel! How is this possible?

And thanks for the prompt reply!

@DaveCTurner
Contributor

This conversation isn't really on-topic for a GitHub issue, so I suggest we continue it over on the discussion forum. I won't be replying here any more, but feel free to link to your forum thread below.

@johanmha

johanmha commented Nov 30, 2020

Link to discussion forum as suggested:
https://discuss.elastic.co/t/taking-several-snapshots-in-parallell/257054
