Master failover during ILM snapshot creation causes ILM to enter ERROR #83694

dakrone · 2022-02-08T22:43:02Z

Elasticsearch Version

All

Problem Description

It's possible for ILM to attempt a double-invocation of the create-snapshot step due to a master fail-over during the snapshot. This causes ILM to get stuck in the create-snapshot ERROR step with the error that the snapshot already exists (or else is currently ongoing).

For remediation, you can either move the index back to the cleanup-snapshot step (IN THE SAME PHASE), which will delete the snapshot and start taking a new one:

POST _ilm/move/<your-index-here>
{
  "current_step": { 
    "phase": "cold",
    "action": "searchable_snapshot",
    "name": "ERROR"
  },
  "next_step": { 
    "phase": "cold",
    "action": "searchable_snapshot",
    "name": "cleanup-snapshot"
  }
}
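If several indices are stuck, the request above can be generated rather than hand-edited. The following is a minimal Python sketch: the helper name `build_move_request` is hypothetical, and it only assembles the path and JSON body without contacting a cluster.

```python
import json


def build_move_request(index, phase, action="searchable_snapshot"):
    """Return (path, body) for moving `index` from ERROR back to cleanup-snapshot.

    Hypothetical helper: it only assembles the request and sends nothing.
    Both steps deliberately use the same phase.
    """
    def step(name):
        return {"phase": phase, "action": action, "name": name}

    path = f"_ilm/move/{index}"
    body = {"current_step": step("ERROR"), "next_step": step("cleanup-snapshot")}
    return path, body


path, body = build_move_request("my-index", "cold")
print("POST", path)
print(json.dumps(body, indent=2))
```

Building both steps from the same `phase` argument makes it hard to accidentally move the index into a different phase.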

Or, use the delete snapshot API and simply delete the existing snapshots, since ILM keeps retrying the create-snapshot step and will recreate the snapshot with that name.

Steps to Reproduce

It manifests in the following way:

1. Enter create-snapshot step
2. ILM issues the create snapshot request, which is ongoing
3. The master node fails or is shut down
4. A new master is elected
5. The new master invokes the create-snapshot step a second time
6. The step fails due to the snapshot either already existing, or a snapshot request already ongoing
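The sequence can be illustrated with a toy model. This is a sketch only: `ToySnapshotService` and `run_create_snapshot_step` are invented stand-ins, not Elasticsearch classes, and the failover itself is represented simply by invoking the step a second time.

```python
class ToySnapshotService:
    """Stand-in for the snapshot service; NOT Elasticsearch code."""

    def __init__(self):
        self.in_progress = set()
        self.completed = set()

    def create(self, name):
        # Reject a name that is already running or already present.
        if name in self.in_progress:
            raise ValueError(f"snapshot with the same name [{name}] is already in-progress")
        if name in self.completed:
            raise ValueError(f"snapshot with the same name [{name}] already exists")
        self.in_progress.add(name)

    def finish(self, name):
        self.in_progress.discard(name)
        self.completed.add(name)


def run_create_snapshot_step(service, name):
    """Invoke the step once and return the step the index ends up on."""
    try:
        service.create(name)
        return "create-snapshot"  # request accepted, snapshot is now running
    except ValueError:
        return "ERROR"


service = ToySnapshotService()
# The original master issues the request.
first = run_create_snapshot_step(service, "mysnapshotname")
# The master fails over and the new master invokes the step again.
second = run_create_snapshot_step(service, "mysnapshotname")
print(first, second)  # create-snapshot ERROR
```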

Logs (if relevant)

You can usually diagnose this by checking whether the master node changed between the first create-snapshot invocation and the subsequent one (ILM only executes on the master node, so the logs for the two invocations will come from different nodes).
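One way to automate that check is to pull the node name out of each relevant log line and see whether more than one node shows up. The sketch below assumes the bracketed line layout seen in the samples in this issue (`[timestamp][LEVEL][logger] [node] message`); other log configurations may differ. The function name and the sample lines are made up for illustration.

```python
import re

# Assumed layout: [timestamp][LEVEL][logger] [node] message
LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<level>[^\]]+)\]\[(?P<logger>[^\]]+)\] "
    r"\[(?P<node>[^\]]+)\] (?P<msg>.*)$"
)


def nodes_mentioning(lines, needle):
    """Return, in first-seen order, the distinct nodes whose messages contain `needle`."""
    nodes = []
    for line in lines:
        m = LINE.match(line)
        if m and needle in m.group("msg") and m.group("node") not in nodes:
            nodes.append(m.group("node"))
    return nodes


# Synthetic lines in the assumed layout; not real Elasticsearch output.
sample = [
    "[t1][INFO][example.Logger] [node-a] synthetic message about create-snapshot",
    "[t2][INFO][example.Logger] [node-a] unrelated message",
    "[t3][WARN][example.Logger] [node-b] synthetic message about create-snapshot",
]
nodes = nodes_mentioning(sample, "create-snapshot")
if len(nodes) > 1:
    print("create-snapshot was logged by more than one node:", nodes)
```

More than one node in the result suggests the master changed between invocations, which is worth confirming against the cluster's master election logs.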

You may see logs such as:

[2021-09-27T18:20:22.182Z][WARN][org.elasticsearch.snapshots.SnapshotsService] [instance-0000000152] [myindex][mysnapshotname] failed to create snapshot
org.elasticsearch.snapshots.InvalidSnapshotNameException: [myrepo:mysnapshotname] Invalid snapshot name [mysnapshotname], snapshot with the same name is already in-progress
	at org.elasticsearch.snapshots.SnapshotsService.ensureSnapshotNameNotRunning(SnapshotsService.java:614) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.snapshots.SnapshotsService.access$5300(SnapshotsService.java:123) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.snapshots.SnapshotsService$2.execute(SnapshotsService.java:462) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$1.execute(BlobStoreRepository.java:480) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:48) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:691) ~[elasticsearch-7.14.2.jar:7.14.2]

And also:

org.elasticsearch.snapshots.SnapshotException: [myrepo-mysnapshotname/uuid] no longer master

And potentially a message about a snapshot with the same name already existing:

[2021-09-27T23:50:22.612Z][ERROR][org.elasticsearch.xpack.ilm.IndexLifecycleRunner] [instance-0000000152] policy [my-ilm-policy] for index [myindex] failed on step [{"phase":"cold","action":"searchable_snapshot","name":"create-snapshot"}]. Moving to ERROR step
org.elasticsearch.snapshots.InvalidSnapshotNameException: [myrepo:mysnapshotname] Invalid snapshot name [mysnapshotname], snapshot with the same name already exists
	at org.elasticsearch.snapshots.SnapshotsService.ensureSnapshotNameAvailableInRepo(SnapshotsService.java:754) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.snapshots.SnapshotsService.access$5200(SnapshotsService.java:123) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.snapshots.SnapshotsService$2.execute(SnapshotsService.java:459) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$1.execute(BlobStoreRepository.java:480) ~[elasticsearch-7.14.2.jar:7.14.2]


elasticmachine · 2022-02-08T22:43:05Z

Pinging @elastic/es-data-management (Team:Data Management)

dakrone · 2022-02-08T22:48:17Z

Here's a synopsis for why this is failing:

ILM has special execution semantics for a certain type of step called "AsyncActionStep". These steps have "exactly-once" execution, which means that the step is only run when the cluster state has been processed by the preceding step. We use this mechanism to ensure the action is only run once for actions where running it twice would be Bad News ™️; this includes things like shrink as well as creating the snapshot.

This mechanism works fine and normally prevents the kind of double invocation that this index is seeing. There is one problem, though: what happens if a new node becomes master while an index is on that step? It wouldn't normally invoke the step (because these special steps are run neither when the cluster state changes nor on ILM's periodic timer interval), so we have to go through and run all the AsyncActionStep steps whenever a new node becomes master.

In this case, the master switched after the old master had already invoked the create-snapshot action, so the new master kicked off the create-snapshot action again, and this second invocation failed (because the snapshot was already in progress). Once the index is in the ERROR step, we retry it every 10 minutes; however, by that point it's stuck in an error because it doesn't know it can already proceed!

This is why exactly-once invocation is hard, and this is an edge case I think we were originally okay with when we designed it this way (and indeed, it so rarely has impacted users that this is the first time I can recall seeing it), so we as the Data Management team will have to brainstorm what potential solution we want (if any) for this situation.

Thinking about how to solve this, we could potentially have the create-snapshot step check the cluster state for an ongoing snapshot, and if there is one with exactly the same name, then go to the next step. This is an expedient fix for this particular case, but doesn't actually solve the potential-double-invocation-on-master-fail-over scenario I outlined above. We still may want to do that as a stop-gap until we can figure out a full solution.
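As a rough illustration of that stop-gap idea, here is a Python sketch of the decision only. The function name and return values are hypothetical, and `in_progress_names` stands in for whatever in-progress snapshot information would be read from the cluster state; how it is read is not modelled.

```python
def decide_create_snapshot(snapshot_name, in_progress_names):
    """Decide what create-snapshot should do under the stop-gap proposal.

    Hypothetical sketch: `in_progress_names` represents the names of
    snapshots currently running, as seen in the cluster state.
    """
    if snapshot_name in in_progress_names:
        # An earlier invocation (e.g. from the previous master) already
        # started this snapshot, so do not issue the request again.
        return "proceed-to-next-step"
    return "issue-create-snapshot-request"


print(decide_create_snapshot("mysnapshotname", {"mysnapshotname"}))
print(decide_create_snapshot("mysnapshotname", set()))
```

Note that this only covers the case where the first snapshot is still running; it does not address double invocation on master failover in general.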

During a master restart or failover the step CreateSnapshotStep of the action SearchableSnapshotAction can be invoked twice. When this occurs we mark the creation as incomplete and we follow the existing mechanism of returning back to CleanupSnapshotStep. Closes #83694
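The behaviour described in that change can be pictured as a small loop between two steps. This is a toy Python model, not the actual implementation: the step names `cleanup-snapshot` and `create-snapshot` come from this issue, while `next-step` is a placeholder for whatever follows a successful snapshot.

```python
CLEANUP, CREATE, DONE = "cleanup-snapshot", "create-snapshot", "next-step"


def walk(outcomes):
    """Return the steps visited for a list of create-snapshot outcomes.

    Each outcome is "complete" or "incomplete"; a duplicate invocation is
    modelled as "incomplete", following the description above.
    """
    visited = [CREATE]
    for outcome in outcomes:
        if outcome == "complete":
            visited.append(DONE)
            break
        # Incomplete: delete the snapshot, then try to create it again.
        visited += [CLEANUP, CREATE]
    return visited


print(walk(["incomplete", "complete"]))
# ['create-snapshot', 'cleanup-snapshot', 'create-snapshot', 'next-step']
```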

…lastic#84829) During a master restart or failover the step CreateSnapshotStep of the action SearchableSnapshotAction can be invoked twice. When this occurs we mark the creation as incomplete and we follow the existing mechanism of returning back to CleanupSnapshotStep. Closes elastic#83694 (cherry picked from commit e03c2b0) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>

…93724) During a master restart or failover the step CreateSnapshotStep of the action SearchableSnapshotAction can be invoked twice. When this occurs we mark the creation as incomplete and we follow the existing mechanism of returning back to CleanupSnapshotStep. Closes #83694 (cherry picked from commit e03c2b0) Signed-off-by: Andrei Dan <andrei.dan@elastic.co> Co-authored-by: Andrei Dan <andrei.dan@elastic.co>

dakrone added the >bug and :Data Management/ILM+SLM labels Feb 8, 2022

elasticmachine added the Team:Data Management label Feb 8, 2022

gmarouli self-assigned this Feb 10, 2022

gmarouli mentioned this issue Mar 9, 2022

Retry clean and create snapshot if it already exists #83694 #84829

Merged

gmarouli closed this as completed in #84829 Mar 23, 2022

andreidan mentioned this issue Feb 12, 2023

[7.17] Retry clean and create snapshot if it already exists (#84829) #93724

Merged
