Update IndexShardSnapshotStatus when an exception is encountered #32265
We have identified an issue in the latest Elasticsearch snapshot code where a snapshot gets stuck (making no progress and unable to be deleted) when one or more shards whose snapshot state is INIT or STARTED are never worked on by the node they are assigned to. This can happen when the current primary node differs from the node (the old primary) on which the shard was marked to be snapshotted, i.e. the primary changed after the snapshot started.
When does it happen
When one of the data nodes holding primary shards is restarted (process restart) while the snapshot is running and rejoins the cluster within 30 seconds. Upon restart, the node fails to process the cluster state update (due to a race condition between the snapshot service and the indices service), and all shards for which the node was primary before the restart and whose snapshot state is INIT or STARTED will be stuck in that state forever.
The shards get stuck because the indices service throws an IndexNotFoundException, as it has not yet processed the cluster state update. If one of the shards (say shard x out of y shards that need to be snapshotted) hits the IndexNotFoundException, SnapshotShardsService fails to queue the snapshot thread not only for that shard but also for all shards that follow it in the loop (y - x + 1 shards in total). The master keeps waiting for each shard to reach a terminal state (DONE or FAILED) and report back, but since no snapshot thread ever started for these shards, they never report a state, and the snapshot is stuck indefinitely.
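To make the failure mode concrete, here is a minimal, self-contained Java sketch of the pattern described above. All names (`Stage`, `resolveIndexOrThrow`, the shard ids) are hypothetical stand-ins, not the actual Elasticsearch classes; the real loop lives in SnapshotShardsService and the real per-shard status object is IndexShardSnapshotStatus:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShardSnapshotLoopSketch {

    enum Stage { INIT, STARTED, DONE, FAILED }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Stage> status = new ConcurrentHashMap<>();
        ExecutorService snapshotPool = Executors.newFixedThreadPool(2);
        // "missing-1" stands in for a shard whose index lookup throws
        // IndexNotFoundException because the node has not yet applied the
        // cluster state that re-registered the index after restart.
        List<String> shards = List.of("shard-0", "missing-1", "shard-2");

        for (String shardId : shards) {
            status.put(shardId, Stage.INIT);
            try {
                // The buggy shape had no per-shard try/catch: the exception
                // escaped the loop and the later shards were never queued.
                resolveIndexOrThrow(shardId);
                snapshotPool.execute(() -> status.put(shardId, Stage.DONE));
            } catch (RuntimeException e) {
                // Fix direction: record a terminal state so the master's
                // wait for DONE/FAILED completes instead of hanging forever.
                status.put(shardId, Stage.FAILED);
            }
        }
        snapshotPool.shutdown();
        snapshotPool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(status); // every shard ends in DONE or FAILED
    }

    static void resolveIndexOrThrow(String shardId) {
        if (shardId.startsWith("missing")) {
            throw new IllegalStateException("index not found for " + shardId);
        }
    }
}
```

The key point is that each shard is isolated in its own try/catch, so a lookup failure is recorded as a terminal FAILED state the master can observe, rather than aborting the loop and leaving the remaining shards in INIT forever.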
When a delete call is invoked on the stuck snapshot, all shards in the INIT or STARTED state are marked ABORTED, in the expectation that the BlobStoreRepository will throw an exception and move each shard to the terminal state FAILED. But since no thread is working on these shards, they remain ABORTED, and each subsequent delete call is queued behind the first, steadily increasing the number of pending tasks.
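The following small sketch (hypothetical names, not Elasticsearch source) shows why ABORTED alone cannot terminate a shard: it is a cooperative signal that only an actively running snapshot thread converts into FAILED:

```java
import java.util.concurrent.atomic.AtomicReference;

public class AbortSignalSketch {

    enum Stage { INIT, STARTED, ABORTED, FAILED, DONE }

    public static void main(String[] args) {
        AtomicReference<Stage> stage = new AtomicReference<>(Stage.STARTED);

        // Delete path: mark the in-progress shard ABORTED, then wait for a
        // terminal state (FAILED or DONE).
        stage.compareAndSet(Stage.STARTED, Stage.ABORTED);

        // A live snapshot thread would check the flag between blob uploads
        // and convert ABORTED into the terminal FAILED the master waits for.
        Runnable snapshotThread = () -> stage.compareAndSet(Stage.ABORTED, Stage.FAILED);

        // Comment out this call and the shard remains ABORTED forever, which
        // is why each new delete request just queues another pending task.
        snapshotThread.run();
        System.out.println(stage.get()); // FAILED
    }
}
```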
Steps to reproduce (100% success)
Thanks for reporting this. I've looked more closely into the series of events leading to this situation:
When a node leaves the cluster, the master will fail all shards allocated to that node. SnapshotsService on the master (which is a ClusterStateApplier) will get the updated cluster state with the removed node and call processSnapshotsOnRemovedNodes, which in turn submits a cluster state update task to move the affected shards' snapshot state from STARTED/ABORTED to FAILED.
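Conceptually, that update task does something like the following. This is a simplified, self-contained sketch with made-up shard and node names; the real logic lives in SnapshotsService:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class RemovedNodeSketch {

    enum State { INIT, STARTED, ABORTED, FAILED, DONE }

    public static void main(String[] args) {
        Map<String, String> shardToNode =
                Map.of("shard-0", "nodeA", "shard-1", "nodeB");
        Map<String, State> shardState =
                new HashMap<>(Map.of("shard-0", State.STARTED, "shard-1", State.STARTED));
        Set<String> removedNodes = Set.of("nodeB");

        // Fail every in-progress shard snapshot assigned to a departed node,
        // so the overall snapshot can still reach completion.
        shardState.replaceAll((shard, state) ->
                removedNodes.contains(shardToNode.get(shard))
                        && (state == State.STARTED || state == State.ABORTED)
                        ? State.FAILED
                        : state);

        System.out.println(shardState); // shard-1 -> FAILED, shard-0 unchanged
    }
}
```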
The first thing we'll need to do here is write an integration test that reproduces the issue. Regarding a fix, I would prefer to have a SnapshotsInProgress object that is fully in sync with the routing table, similar to what I have done here for the RestoreInProgress information, and then build a solution on top of that. I'll explore this further in the next few days; just wanted to give you an update here.
Thanks for looking into the change.
In addition to the IndexNotFoundException, I have also found that we are not updating IndexShardSnapshotStatus in cases like:
I have fixed it as part of this PR. I have also added the missing ABORTED status to SnapshotIndexShardStage, SnapshotIndexShardStatus, and SnapshotShardsStats.
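As a rough illustration of that second part (simplified enums and a hypothetical `toStage` mapper, not the actual class bodies), the reporting side needs an ABORTED value so an aborted shard can be represented instead of being unmappable:

```java
public class StageMappingSketch {

    enum InternalState { INIT, STARTED, FINALIZE, DONE, FAILURE, ABORTED }

    // ABORTED added so status APIs can represent an aborted shard.
    enum ReportedStage { INIT, STARTED, FINALIZE, DONE, FAILURE, ABORTED }

    static ReportedStage toStage(InternalState state) {
        switch (state) {
            case INIT:     return ReportedStage.INIT;
            case STARTED:  return ReportedStage.STARTED;
            case FINALIZE: return ReportedStage.FINALIZE;
            case DONE:     return ReportedStage.DONE;
            case FAILURE:  return ReportedStage.FAILURE;
            case ABORTED:  return ReportedStage.ABORTED; // previously unmapped
            default: throw new IllegalArgumentException("unknown state: " + state);
        }
    }

    public static void main(String[] args) {
        System.out.println(toStage(InternalState.ABORTED)); // ABORTED
    }
}
```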
Should I raise a separate PR for them?
@BobBlank12 that will unfortunately not help. I've worked on a proper fix (which requires rewriting some core parts of the snapshotting code) but now have to break it up into smaller reviewable pieces. There is unfortunately no workaround for now. If you hit this issue, you can manually solve the problem by following the procedure outlined here: #31624 (comment)