Master failover during ILM snapshot creation causes ILM to enter ERROR #83694
Pinging @elastic/es-data-management (Team:Data Management)
Here's a synopsis for why this is failing:
Thinking about how to solve this, we could potentially have the |
During a master restart or failover the step CreateSnapshotStep of the action SearchableSnapshotAction can be invoked twice. When this occurs we mark the creation as incomplete and we follow the existing mechanism of returning back to CleanupSnapshotStep. Closes #83694
…lastic#84829) During a master restart or failover the step CreateSnapshotStep of the action SearchableSnapshotAction can be invoked twice. When this occurs we mark the creation as incomplete and we follow the existing mechanism of returning back to CleanupSnapshotStep. Closes elastic#83694 (cherry picked from commit e03c2b0) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
…93724) During a master restart or failover the step CreateSnapshotStep of the action SearchableSnapshotAction can be invoked twice. When this occurs we mark the creation as incomplete and we follow the existing mechanism of returning back to CleanupSnapshotStep. Closes #83694 (cherry picked from commit e03c2b0) Signed-off-by: Andrei Dan <andrei.dan@elastic.co> Co-authored-by: Andrei Dan <andrei.dan@elastic.co>
Elasticsearch Version
All
Problem Description
It's possible for ILM to attempt a double-invocation of the create-snapshot step due to a master fail-over during the snapshot. This causes ILM to get stuck in the create-snapshot ERROR step with the error that the snapshot already exists (or else is currently ongoing).
For remediation, you can move the index back to the cleanup-snapshot step (IN THE SAME PHASE), which will delete the snapshot and start taking a new one.
Alternatively, use the delete snapshot API to delete the existing snapshot; ILM keeps retrying the create-snapshot step and will recreate the snapshot with that name.
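
Both remediation options correspond to documented Elasticsearch REST endpoints: the ILM move-to-step API (`POST _ilm/move/<index>`) and the delete snapshot API (`DELETE _snapshot/<repository>/<snapshot>`). The sketch below only builds the requests and does not send them. The index, phase, repository, and snapshot names are placeholders, and the default current step name of `ERROR` is an assumption about the index's state; check the ILM explain output for the actual current step before using it.

```python
import json


def build_move_to_cleanup_request(index, phase, current_step_name="ERROR"):
    """Build (method, path, body) for the ILM move-to-step API.

    Assumption: the index is currently in `current_step_name` of the
    searchable_snapshot action. Both steps use the SAME phase.
    """
    action = "searchable_snapshot"
    body = {
        "current_step": {"phase": phase, "action": action, "name": current_step_name},
        "next_step": {"phase": phase, "action": action, "name": "cleanup-snapshot"},
    }
    return "POST", f"/_ilm/move/{index}", body


def build_delete_snapshot_request(repository, snapshot):
    """Build (method, path) for the delete snapshot API."""
    return "DELETE", f"/_snapshot/{repository}/{snapshot}"


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    method, path, body = build_move_to_cleanup_request("my-index-000001", "cold")
    print(method, path)
    print(json.dumps(body, indent=2))
    print(*build_delete_snapshot_request("my-repository", "my-snapshot"))
```

Either request can then be sent with any HTTP client or pasted into the Kibana Dev Tools console.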
Steps to Reproduce
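It manifests in the following way: ILM invokes the create-snapshot step, the master fails over while the snapshot is still in progress, and ILM on the newly elected master invokes the create-snapshot step a second time.
Logs (if relevant)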
You can usually diagnose this by looking to see if the master node changes between the first create-snapshot and subsequent create-snapshot invocation (ILM only executes on the master node, so the logs will come from a different node for the two invocations).
You may see logs such as:
And also:
And potentially a message about a snapshot with the same name already existing:
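
The diagnostic idea described in this section (two create-snapshot invocations logged by different nodes imply a master change in between) can be sketched as a small check. The record shape used here, a time-ordered list of `(node_name, step_name)` pairs, is a hypothetical simplification and not the actual Elasticsearch log format; extracting those pairs from real logs is left out.

```python
def master_changed_between_invocations(records, step="create-snapshot"):
    """Return True if consecutive invocations of `step` were logged by different nodes.

    `records` is a time-ordered iterable of (node_name, step_name) pairs;
    this shape is an assumption made for illustration.
    """
    nodes = [node for node, logged_step in records if logged_step == step]
    # Compare each invocation's node with the next invocation's node.
    return any(first != second for first, second in zip(nodes, nodes[1:]))


if __name__ == "__main__":
    # Hypothetical node names.
    records = [
        ("node-a", "create-snapshot"),
        ("node-b", "create-snapshot"),
    ]
    print(master_changed_between_invocations(records))
```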