
Update IndexShardSnapshotStatus when an exception is encountered #32265

Closed
wants to merge 1 commit

Conversation

@backslasht
Contributor

commented Jul 21, 2018

Background

We have identified an issue in the latest Elasticsearch snapshot code where a snapshot gets stuck (making no progress and unable to be deleted) when one or more shards whose snapshot state is INIT or STARTED are not being worked on by the node they are assigned to. This can happen when the current primary node differs (because it changed after the snapshot started) from the node (the old primary) on which the shard was marked to be snapshotted.

When does it happen

It happens when one of the data nodes holding primary shards is restarted (process restart) while the snapshot is running and rejoins the cluster within 30 seconds. Upon restart the node fails to process the cluster state update (due to a race condition between the snapshot service and the indices service), and all shards for which the node was primary before the restart and whose snapshot state is INIT or STARTED will be stuck in that state forever.

The shards get stuck because the indices service throws an IndexNotFoundException, as it hasn't processed the cluster state update yet. If one of the shards (say shard x out of y shards that need to be snapshotted) triggers the IndexNotFoundException, SnapshotShardsService fails to queue the snapshot thread for that shard as well as for all the following shards (y - x + 1 in total). The master keeps waiting for each shard to reach a terminal state (DONE or FAILED) and report back, but since the snapshot thread never started for these shards, the state is never reported and the snapshot is stuck indefinitely.
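For context, here is a minimal sketch of the lookup involved, assuming the usual IndicesService/IndexService resolution path; the class and method names (`ShardLookupSketch`, `resolveShard`) are illustrative and this is not the literal 6.x source:

```java
import org.elasticsearch.index.IndexService;
import org.elasticsearch.index.shard.IndexShard;
import org.elasticsearch.index.shard.ShardId;
import org.elasticsearch.indices.IndicesService;

// Hedged sketch: roughly how a shard is resolved on the data node before its
// snapshot task is queued. If the node has not yet applied the cluster state
// that re-creates the index locally, indexServiceSafe(...) throws
// IndexNotFoundException on the cluster applier thread, and the loop over the
// remaining shards never runs.
class ShardLookupSketch {
    static IndexShard resolveShard(IndicesService indicesService, ShardId shardId) {
        IndexService indexService = indicesService.indexServiceSafe(shardId.getIndex()); // may throw IndexNotFoundException
        return indexService.getShardOrNull(shardId.id());
    }
}
```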

When a delete call is invoked on the stuck snapshot, all shards that are in INIT or STARTED state are marked as ABORTED, in the expectation that the BlobStoreRepository will throw an exception and move each shard to the terminal state FAILED. But since no thread is working on these shards, they remain ABORTED, and each subsequent delete call is queued, increasing the number of pending tasks.

Proposed Fix

  1. When the SnapshotShardsService receives an IndexNotFoundException, catch the exception and immediately mark the snapshot shard state as FAILED. After marking it FAILED, the service can still go ahead and process the remaining shards, which eventually moves the snapshot into the PARTIAL state instead of leaving it stuck forever (a rough sketch of this, together with the ownership check from fix 2, follows this list).

  2. When a snapshot delete call is made for a snapshot that is in progress (or stuck), SnapshotShardsService iterates through the shards of the snapshot and marks each as DONE or FAILED if it has completed or errored, respectively. Shards that are in INIT or STARTED state are marked as ABORTED, in the expectation that the thread uploading the data will detect the ABORTED state and throw an error, moving the shard to the terminal FAILED state. But since no threads are working on these shards (they were lost during the restart), the state remains the same forever. To fix that, while marking the shard status as ABORTED we can additionally check whether the shard's current primary is on a different node than the one on which the shard was marked to be snapshotted. If they differ, we can fail the shard immediately, so that all shards reach a terminal state (either DONE or FAILED), which in turn allows the problematic snapshot to be deleted successfully.

  3. Add the ABORTED state to SnapshotIndexShardStage, SnapshotIndexShardStatus and SnapshotShardsStats, where it is currently missing.
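A minimal sketch of what fixes 1 and 2 could look like. `queueShardSnapshot` and `markShardAsFailed` are hypothetical placeholders for "hand the shard to the snapshot thread pool" and "report this shard's snapshot as FAILED to the master"; the structure is illustrative, not the actual diff in this PR:

```java
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.index.IndexNotFoundException;
import org.elasticsearch.index.shard.ShardId;

// Hedged sketch of the proposed behaviour; helper methods are hypothetical.
class ProposedFixSketch {

    // Fix 1: don't let IndexNotFoundException escape on the cluster applier
    // thread; fail this shard and keep going, so the remaining shards are still
    // snapshotted and the snapshot ends up PARTIAL instead of stuck.
    void startShardSnapshot(ShardId shardId) {
        try {
            queueShardSnapshot(shardId);
        } catch (IndexNotFoundException e) {
            markShardAsFailed(shardId, "index not found on this node: " + e.getMessage());
        }
    }

    // Fix 2: when a delete marks in-progress shards as ABORTED, fail a shard
    // immediately if its current primary is on a different node than the one
    // recorded in the snapshot entry, because no thread on that node will ever
    // observe the ABORTED flag.
    void abortShard(ClusterState state, ShardId shardId, String snapshotNodeId) {
        ShardRouting primary = state.routingTable().shardRoutingTable(shardId).primaryShard();
        if (primary == null || primary.currentNodeId() == null
                || primary.currentNodeId().equals(snapshotNodeId) == false) {
            markShardAsFailed(shardId, "shard is no longer located on the node it was being snapshotted on");
        }
        // otherwise leave it ABORTED and let the running snapshot thread fail it
    }

    // hypothetical helpers, not part of the Elasticsearch API
    void queueShardSnapshot(ShardId shardId) { /* ... */ }
    void markShardAsFailed(ShardId shardId, String reason) { /* ... */ }
}
```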

Steps to reproduce (100% reproducible)

  • Create a multi-node cluster (say, with 3 data nodes).
  • Using a load generator, create multiple indices (say, 500 indices), each with 3 shards and 1 replica.
  • After the data is loaded, register a repo (my-repo) and initiate a snapshot.
  • Restart the Elasticsearch process on one of the data nodes.
  • [Verify] The restarted process fails to update the snapshot state from the cluster state.
  • The snapshot gets stuck and can't be deleted (see the test sketch below).
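For reference, the same steps expressed as an ESIntegTestCase-style sketch; the test class name, index count and repository name simply mirror the list above and are indicative, not part of this PR:

```java
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.test.ESIntegTestCase;

import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;

// Hedged sketch of the manual reproduction above as an integration test.
public class StuckSnapshotReproductionIT extends ESIntegTestCase {

    public void testRestartDataNodeDuringSnapshot() throws Exception {
        internalCluster().startMasterOnlyNode();
        internalCluster().startDataOnlyNodes(3);

        // Enough indices that the snapshot is still running when the node restarts.
        for (int i = 0; i < 50; i++) {
            assertAcked(prepareCreate("index-" + i).setSettings(Settings.builder()
                .put("index.number_of_shards", 3)
                .put("index.number_of_replicas", 1)));
        }

        assertAcked(client().admin().cluster().preparePutRepository("my-repo")
            .setType("fs")
            .setSettings(Settings.builder().put("location", randomRepoPath().toString())));

        client().admin().cluster().prepareCreateSnapshot("my-repo", "snap-1")
            .setWaitForCompletion(false)
            .get();

        // Restart one data node while the snapshot is running; before the fix, shards
        // that were primary on this node can stay in INIT/STARTED forever.
        internalCluster().restartRandomDataNode();
    }
}
```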

Related Issues

Prabhakar Sithanandam
Update IndexShardSnapshotStatus when an exception is encountered before snapshot call is made

When a snapshot is about to start for a shard that is in one of the following states, we currently
don't move the snapshot status to FAILED; this change adds code to move the status to FAILED (roughly sketched below):
i) The shard is no longer primary on the current node
ii) The shard is relocating
iii) The shard is recovering
iv) The index hasn't been loaded yet in the indices service

Also adding ABORTED status in the IndexShard Status/Stats/Stage.
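A minimal sketch of the pre-snapshot checks described above; the class and helper names (`PreSnapshotChecks`, `failureReason`) and the failure messages are illustrative, and the real change would move the corresponding IndexShardSnapshotStatus to FAILED rather than just returning a string:

```java
import org.elasticsearch.index.shard.IndexShard;
import org.elasticsearch.index.shard.IndexShardState;

// Hedged sketch of the checks from the commit message.
class PreSnapshotChecks {

    /** Returns a failure reason if the shard cannot be snapshotted right now, or null if it can. */
    static String failureReason(IndexShard indexShard) {
        if (indexShard.routingEntry().primary() == false) {
            return "shard is no longer primary on this node";
        }
        if (indexShard.routingEntry().relocating()) {
            return "shard is currently relocating";
        }
        if (indexShard.state() == IndexShardState.RECOVERING) {
            return "shard is still recovering";
        }
        return null;
    }
}
```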

@ywelsch ywelsch self-assigned this Jul 24, 2018

@ywelsch ywelsch added the >bug label Jul 24, 2018

@ywelsch

Contributor

commented Jul 24, 2018

Thanks for reporting this. I've looked more closely into the series of events leading to this situation:

When a node leaves the cluster, the master will fail all shards allocated to that node. SnapshotsService on the master (which is a ClusterStateApplier) will get the updated cluster state with the removed node and call processSnapshotsOnRemovedNodes, which in turn will submit a cluster state update task to move the snapshot from STARTED/ABORTED to FAILED.
This means that there is a cluster state where there are entries in SnapshotsInProgress which mark certain shards as requiring a snapshot, but where the actual shard routing has been removed from the cluster state. If the node rejoins before the cluster state update task submitted by processSnapshotsOnRemovedNodes has had a chance to execute on the master and bring the entries in SnapshotsInProgress back in sync with the routing table, the node will have a snapshot assigned to it with no corresponding IndexShard locally available. It will therefore fail to get the IndexShard from IndicesService in SnapshotShardsService, which leads to an exception on the cluster applier thread and in turn prevents other (possibly valid) snapshots from starting on the node. On subsequent cluster state updates, these are not resubmitted to the executor, which means that they'll be stuck indefinitely.

The first thing we'll need to do here is to write an integration test that reproduces the issue. Regarding a fix, I would prefer to have a SnapshotsInProgress object that's fully in sync with the routing table, similar to what I have done here for the RestoreInProgress information, and then build a solution on top of that. I'll explore this further over the next few days; I just wanted to give you an update here.

@backslasht

Contributor Author

commented Jul 25, 2018

Thanks for looking into the change.

In addition to the IndexNotFoundException, I have also found that we are not updating IndexShardSnapshotStatus in cases like:

  1. The node is not primary for that shard anymore.
  2. Shard is currently relocating.

I have fixed it as part of this PR. I have also added missing ABORTED status in SnapshotIndexShardStage, SnapshotIndexShardStatus and SnapshotShardsStats.

Should I raise a separate PR for them?

@BobBlank12

Contributor

commented Oct 3, 2018

For people struggling with this issue, should we consider temporarily disabling shard allocation before snapshots and enabling it afterwards?

@ywelsch

Contributor

commented Oct 4, 2018

@BobBlank12 that will unfortunately not help. I've worked on a proper fix (which requires rewriting some core parts of the snapshotting code), but now have to break it up into smaller reviewable pieces. There is no workaround for now. If you hit this issue, you can manually solve the problem by following the procedure outlined here: #31624 (comment)

@original-brownbear original-brownbear self-assigned this Nov 14, 2018

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 23, 2018

SNAPSHOT: Introduce ClusterState Tests
* Bringing in cluster state test infrastructure
* Relates elastic#32265

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 14, 2018

SNAPSHOT: Deterministic ClusterState Tests
* Use `DeterministicTaskQueue` infrastructure to reproduce elastic#32265
@ywelsch

Contributor

commented Jan 15, 2019

We have a quick fix in #36113 and a more comprehensive fix will follow.

@ywelsch ywelsch closed this Jan 15, 2019

@ywelsch

Contributor

commented Jan 15, 2019

@backslasht can you verify that #36113 fixes the issue?

original-brownbear added a commit that referenced this pull request Feb 1, 2019

Fix Two Races that Lead to Stuck Snapshots (#37686)
* Fixes two broken spots:
    1. Master failover while deleting a snapshot that has no shards will get stuck if the new master finds the 0-shard snapshot in `INIT` when deleting
    2. Aborted shards that were never seen in `INIT` state by the `SnapshotsShardService` will not be notified as failed, leading to the snapshot staying in `ABORTED` state and never getting deleted with one or more shards stuck in `ABORTED` state
* Tried to make fixes as short as possible so we can backport to `6.x` with the least amount of risk
* Significantly extended test infrastructure to reproduce the above two issues
  * Two new test runs:
      1. Reproducing the effects of node disconnects/restarts in isolation
      2. Reproducing the effects of disconnects/restarts in parallel with shard relocations and deletes
* Relates #32265 
* Closes #32348