Master failover during ILM snapshot creation causes ILM to enter ERROR #83694

Closed
dakrone opened this issue Feb 8, 2022 · 2 comments · Fixed by #84829
Labels: >bug, :Data Management/ILM+SLM, Team:Data Management

dakrone commented Feb 8, 2022

Elasticsearch Version

All

Problem Description

It's possible for ILM to attempt a double-invocation of the create-snapshot step due to a master fail-over during the snapshot. This causes ILM to get stuck in the create-snapshot ERROR step with the error that the snapshot already exists (or else is currently ongoing).

For remediation, you can either move the index back to the cleanup-snapshot step (IN THE SAME PHASE), which will delete the snapshot and start taking a new one:

POST _ilm/move/<your-index-here>
{
  "current_step": { 
    "phase": "cold",
    "action": "searchable_snapshot",
    "name": "ERROR"
  },
  "next_step": { 
    "phase": "cold",
    "action": "searchable_snapshot",
    "name": "cleanup-snapshot"
  }
}
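
The request above can also be assembled programmatically. The following is a minimal sketch, not an official client: `build_move_request` is a hypothetical helper, and it only constructs the method, path, and body, leaving the actual HTTP call to whatever client you use. It reuses a single phase and action for both steps, so the move cannot leave the current phase by accident.

```python
import json


def build_move_request(index, phase, action, next_step, current_step="ERROR"):
    """Build the method, path, and body for an ILM move-to-step call.

    Both steps share one phase and action, so the move stays in the same phase.
    (Hypothetical helper; it does not send anything.)
    """
    body = {
        "current_step": {"phase": phase, "action": action, "name": current_step},
        "next_step": {"phase": phase, "action": action, "name": next_step},
    }
    return "POST", f"_ilm/move/{index}", body


method, path, body = build_move_request(
    index="myindex",
    phase="cold",
    action="searchable_snapshot",
    next_step="cleanup-snapshot",
)
print(method, path)
print(json.dumps(body, indent=2))
```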

Or, use the delete snapshot API and simply delete the existing snapshots, since ILM keeps retrying the create-snapshot step and will recreate the snapshot with that name.
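
If you take this second route, the repository and snapshot name can be read from the `InvalidSnapshotNameException` message that gets logged. A rough sketch, assuming the message keeps the `[repository:snapshot]` prefix seen in this issue's logs; `delete_request_from_log` is a hypothetical helper that only builds the delete snapshot request line and does not send it.

```python
import re

# Matches the "[repository:snapshot]" prefix of the exception message.
_NAME_ERROR = re.compile(r"InvalidSnapshotNameException: \[([^:\]]+):([^\]]+)\]")


def delete_request_from_log(line):
    """Return (method, path) for the delete snapshot API, or None if no match."""
    match = _NAME_ERROR.search(line)
    if match is None:
        return None
    repository, snapshot = match.groups()
    return "DELETE", f"_snapshot/{repository}/{snapshot}"


sample = (
    "org.elasticsearch.snapshots.InvalidSnapshotNameException: "
    "[myrepo:mysnapshotname] Invalid snapshot name [mysnapshotname], "
    "snapshot with the same name already exists"
)
print(delete_request_from_log(sample))
```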

Steps to Reproduce

It manifests in the following way:

  1. Enter create-snapshot step
  2. ILM issues the create snapshot request, which is ongoing
  3. The master node fails or is shut down
  4. A new master is elected
  5. The new master invokes the create-snapshot step a second time
  6. The step fails due to the snapshot either already existing, or a snapshot request already being ongoing
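
The sequence above can be modelled with a toy simulation. This is a sketch only: `ToySnapshotService` and `invoke_create_snapshot` are hypothetical stand-ins and not Elasticsearch classes. The one property they capture is that the snapshot's state outlives the master that started it, so a second invocation by a new master collides with the first.

```python
class InvalidSnapshotName(Exception):
    """Toy stand-in for the InvalidSnapshotNameException seen in the logs."""


class ToySnapshotService:
    """Tracks snapshot names; this state survives the simulated failover."""

    def __init__(self):
        self.in_progress = set()
        self.completed = set()

    def create(self, name):
        if name in self.in_progress:
            raise InvalidSnapshotName(f"[{name}] is already in-progress")
        if name in self.completed:
            raise InvalidSnapshotName(f"[{name}] already exists")
        self.in_progress.add(name)

    def finish(self, name):
        self.in_progress.remove(name)
        self.completed.add(name)


def invoke_create_snapshot(snapshots, name):
    """Run the toy create-snapshot step; return the step the index ends up on."""
    try:
        snapshots.create(name)
        return "create-snapshot"
    except InvalidSnapshotName:
        return "ERROR"


def reproduce(snapshot_finishes_before_failover):
    snapshots = ToySnapshotService()
    # Steps 1-2: the old master issues the create snapshot request.
    first = invoke_create_snapshot(snapshots, "mysnapshotname")
    if snapshot_finishes_before_failover:
        snapshots.finish("mysnapshotname")
    # Steps 3-5: after the failover, the new master invokes the step again.
    second = invoke_create_snapshot(snapshots, "mysnapshotname")
    return first, second


print(reproduce(snapshot_finishes_before_failover=False))
print(reproduce(snapshot_finishes_before_failover=True))
```

Both variants end with the second invocation in ERROR, which corresponds to the "already in-progress" and "already exists" messages respectively.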

Logs (if relevant)

You can usually diagnose this by looking to see if the master node changes between the first create-snapshot and subsequent create-snapshot invocation (ILM only executes on the master node, so the logs will come from a different node for the two invocations).
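
This check can be partly automated. A minimal sketch, assuming log lines begin with `[timestamp][LEVEL][logger] [node-name]` as in the samples in this issue; the helper names and the node names `node-a` and `node-b` are hypothetical. It collects the nodes that logged a create-snapshot failure and reports whether more than one node is involved.

```python
import re

# Header layout assumed from this issue's samples:
# [timestamp][LEVEL][logger] [node-name] message
_HEADER = re.compile(r"^\[[^\]]+\]\[[A-Z]+\s*\]\[[^\]]+\]\s*\[(?P<node>[^\]]+)\]")
_MARKERS = ("create-snapshot", "failed to create snapshot")


def nodes_reporting_create_snapshot(lines):
    """Return the distinct node names, in order of first appearance."""
    nodes = []
    for line in lines:
        if not any(marker in line for marker in _MARKERS):
            continue
        match = _HEADER.match(line)
        if match and match.group("node") not in nodes:
            nodes.append(match.group("node"))
    return nodes


def master_changed(lines):
    """True when the relevant log lines come from more than one node."""
    return len(nodes_reporting_create_snapshot(lines)) > 1


logs = [
    "[2021-09-27T18:20:22.182Z][WARN][org.elasticsearch.snapshots.SnapshotsService] "
    "[node-a] [myindex][mysnapshotname] failed to create snapshot",
    "\tat org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:691)",
    "[2021-09-27T23:50:22.612Z][ERROR][org.elasticsearch.xpack.ilm.IndexLifecycleRunner] "
    "[node-b] policy [my-ilm-policy] for index [myindex] failed on step "
    '[{"phase":"cold","action":"searchable_snapshot","name":"create-snapshot"}]. '
    "Moving to ERROR step",
]
print(nodes_reporting_create_snapshot(logs))
print(master_changed(logs))
```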

You may see logs such as:

[2021-09-27T18:20:22.182Z][WARN][org.elasticsearch.snapshots.SnapshotsService] [instance-0000000152] [myindex][mysnapshotname] failed to create snapshot
org.elasticsearch.snapshots.InvalidSnapshotNameException: [myrepo:mysnapshotname] Invalid snapshot name [mysnapshotname], snapshot with the same name is already in-progress
	at org.elasticsearch.snapshots.SnapshotsService.ensureSnapshotNameNotRunning(SnapshotsService.java:614) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.snapshots.SnapshotsService.access$5300(SnapshotsService.java:123) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.snapshots.SnapshotsService$2.execute(SnapshotsService.java:462) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$1.execute(BlobStoreRepository.java:480) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:48) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:691) ~[elasticsearch-7.14.2.jar:7.14.2]

And also:

org.elasticsearch.snapshots.SnapshotException: [myrepo-mysnapshotname/uuid] no longer master

And potentially a message about a snapshot with the same name already existing:

[2021-09-27T23:50:22.612Z][ERROR][org.elasticsearch.xpack.ilm.IndexLifecycleRunner] [instance-0000000152] policy [my-ilm-policy] for index [myindex] failed on step [{"phase":"cold","action":"searchable_snapshot","name":"create-snapshot"}]. Moving to ERROR step
org.elasticsearch.snapshots.InvalidSnapshotNameException: [myrepo:mysnapshotname] Invalid snapshot name [mysnapshotname], snapshot with the same name already exists
	at org.elasticsearch.snapshots.SnapshotsService.ensureSnapshotNameAvailableInRepo(SnapshotsService.java:754) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.snapshots.SnapshotsService.access$5200(SnapshotsService.java:123) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.snapshots.SnapshotsService$2.execute(SnapshotsService.java:459) ~[elasticsearch-7.14.2.jar:7.14.2]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$1.execute(BlobStoreRepository.java:480) ~[elasticsearch-7.14.2.jar:7.14.2]
@elasticmachine

Pinging @elastic/es-data-management (Team:Data Management)

dakrone commented Feb 8, 2022

Here's a synopsis for why this is failing:

ILM has special execution semantics for certain types of steps called "AsyncActionStep". These have "exactly-once" execution, which means that the step is only run when the cluster state has been processed by the preceding step. We use this mechanism to ensure the action is only run once for actions where running it twice would be Bad News ™️; this includes things like shrink as well as creating the snapshot.

This mechanism normally works fine, and prevents the kind of double invocation that this index is seeing. There is one problem, though: what happens if a new node becomes master while an index is on that step? Normally the new master wouldn't invoke the step (because these special steps are run neither when the cluster state changes nor on ILM's periodic timer interval), so we have to go through and run all the AsyncActionStep steps whenever a new node becomes master.

In this case, the master switched after the old master had already invoked the create-snapshot action, so the new master kicked off the create-snapshot action a second time, and that second invocation failed (because the snapshot was already in progress). Once the index is in the ERROR step, we retry it every 10 minutes; however, by that point it's stuck in an error because it doesn't know it can already proceed!

This is why exactly-once invocation is hard, and this is an edge case I think we were originally okay with when we designed it this way (and indeed, it so rarely has impacted users that this is the first time I can recall seeing it), so we as the Data Management team will have to brainstorm what potential solution we want (if any) for this situation.

Thinking about how to solve this, we could potentially have the create-snapshot step check the cluster state for an ongoing snapshot, and if there is one with exactly the same name, then go to the next step. This is an expedient fix for this particular case, but doesn't actually solve the potential-double-invocation-on-master-fail-over scenario I outlined above. We still may want to do that as a stop-gap until we can figure out a full solution.
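
As an illustration of that stop-gap idea, here is a toy sketch. It is not ILM code: `create_snapshot_step`, the `in_progress` set standing in for the cluster state, and the `"proceed"` and `"started"` return values are all hypothetical. The step looks for an ongoing snapshot with exactly the same name before issuing a new request.

```python
def create_snapshot_step(in_progress_snapshots, snapshot_name, start_snapshot):
    """Toy version of the proposed check.

    If a snapshot with exactly this name is already ongoing, skip the request
    and move on; otherwise start the snapshot.
    """
    if snapshot_name in in_progress_snapshots:
        return "proceed"
    start_snapshot(snapshot_name)
    return "started"


in_progress = set()   # stand-in for the ongoing snapshots in the cluster state
requests_sent = []    # records every snapshot request actually issued


def start_snapshot(name):
    requests_sent.append(name)
    in_progress.add(name)


# The old master invokes the step, then the new master invokes it again.
first = create_snapshot_step(in_progress, "mysnapshotname", start_snapshot)
second = create_snapshot_step(in_progress, "mysnapshotname", start_snapshot)
print(first, second, requests_sent)
```

Only one request is sent in this model, but as noted above, the check narrows the window for this one step and does not remove double invocation in general.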

@gmarouli gmarouli self-assigned this Feb 10, 2022
gmarouli added a commit that referenced this issue Mar 23, 2022
During a master restart or failover the step CreateSnapshotStep of the action SearchableSnapshotAction can be invoked twice. When this occurs we mark the creation as incomplete and we follow the existing mechanism of returning back to CleanupSnapshotStep.

Closes #83694
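
The flow this commit message describes can be pictured with a toy state machine. This is a sketch under stated assumptions and not the code from the fix: `run_step`, the `repo` set, and the `"done"` placeholder are hypothetical. A duplicate create-snapshot invocation is treated as an incomplete creation, so the index goes back to cleanup-snapshot and then creates the snapshot again.

```python
def run_step(step, repo, name):
    """Advance the toy state machine by one step and return the next step.

    `repo` is a set of snapshot names that exist or are in progress.
    """
    if step == "cleanup-snapshot":
        repo.discard(name)  # delete whatever snapshot is there
        return "create-snapshot"
    if step == "create-snapshot":
        if name in repo:
            # Duplicate invocation: treat the creation as incomplete.
            return "cleanup-snapshot"
        repo.add(name)
        return "done"  # placeholder for whatever step follows
    raise ValueError(f"unknown step: {step}")


# The first invocation already registered the snapshot before the failover.
repo = {"mysnapshotname"}
step, trace = "create-snapshot", []
while step != "done":
    trace.append(step)
    step = run_step(step, repo, "mysnapshotname")
trace.append(step)
print(trace)
```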
andreidan pushed a commit to andreidan/elasticsearch that referenced this issue Feb 12, 2023
…lastic#84829)

(cherry picked from commit e03c2b0)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
andreidan added a commit that referenced this issue Feb 14, 2023
…93724)

(cherry picked from commit e03c2b0)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>

Co-authored-by: Andrei Dan <andrei.dan@elastic.co>