Recovery: Quick cluster state processing can cause relocation finalization to fail and delete both copies #9503

bleskes · 2015-01-30T13:31:08Z

#8570 added some extra protection for the case where a source shard is being closed during recovery. However, this introduces a race condition when the target shard has moved to POST_RECOVERY and the master processes the shard started action and activates the shard before the source node completes the recovery. In that case, the source node closes the source shard, which causes the recovery to be cancelled. The target node then receives the cancellation notification and deletes its local copy (still in POST_RECOVERY), so both copies are lost.
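The ordering problem described above can be sketched as a small deterministic simulation. This is only a toy model, not Elasticsearch code: the `Relocation` class, its method names, and the `DELETED` state are hypothetical, chosen to mirror the sequence of events in the description.

```python
from enum import Enum


class ShardState(Enum):
    RECOVERING = "RECOVERING"
    POST_RECOVERY = "POST_RECOVERY"
    STARTED = "STARTED"
    CLOSED = "CLOSED"
    DELETED = "DELETED"  # hypothetical marker for "local copy removed"


class Relocation:
    """Toy model of one relocation; every name here is an assumption."""

    def __init__(self):
        self.source = ShardState.STARTED
        self.target = ShardState.RECOVERING
        self.recovery_done_on_source = False

    def target_enters_post_recovery(self):
        # Target finishes its side and reports "shard started" to the master.
        self.target = ShardState.POST_RECOVERY

    def source_completes_recovery(self):
        # Source can only finish the recovery while its shard is still open.
        if self.source is ShardState.STARTED:
            self.recovery_done_on_source = True

    def master_activates_target(self):
        # Master processes "shard started"; the source node closes its copy.
        self.source = ShardState.CLOSED
        if not self.recovery_done_on_source:
            # The close listener cancels the recovery that is still in flight.
            self._cancel_recovery()

    def _cancel_recovery(self):
        # Target has not applied the new cluster state yet, so it still
        # treats the shard as recovering and deletes the local copy.
        if self.target is ShardState.POST_RECOVERY:
            self.target = ShardState.DELETED

    def target_applies_cluster_state(self):
        if self.target is ShardState.POST_RECOVERY:
            self.target = ShardState.STARTED


def run(order):
    """Apply the named steps in order and return (source, target) states."""
    relocation = Relocation()
    for step in order:
        getattr(relocation, step)()
    return relocation.source, relocation.target


EXPECTED_ORDER = [
    "target_enters_post_recovery",
    "source_completes_recovery",
    "master_activates_target",
    "target_applies_cluster_state",
]

RACY_ORDER = [
    "target_enters_post_recovery",
    "master_activates_target",  # master wins the race
    "source_completes_recovery",
    "target_applies_cluster_state",
]

if __name__ == "__main__":
    print(run(EXPECTED_ORDER))
    print(run(RACY_ORDER))
```

In this model, the expected ordering leaves the target started and the source closed, while the racy ordering leaves the source closed and the target deleted, which is the "delete both copies" outcome from the title.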

The extra close listener is not yet released but is part of the 1.5 push.

See: http://build-us-00.elasticsearch.org/job/es_core_1x_debian/3474/

…emantics

We keep track of the current stage of recovery using an instance of RecoveryState, which is stored on the relevant IndexShard. At the moment, changes to this object are made in many places in the code, each of which is responsible for making them in the right order, keeping track of timers, and more. Also, changes to shard state are decoupled from the recovery stages, which caused elastic#9503.

This PR refactors this and brings all of the changes into IndexShard. It also makes all recoveries follow the exact same stages rather than shortcutting some of them. This keeps things simple and always the same (those shortcuts didn't add anything; we ended up doing it all anyway). Also, all timer management is now folded into RecoveryState, and unit tests are added.

This closes elastic#9503 by moving the shard to post recovery only once the recovery is done (before, the two were decoupled), meaning that master promotion of the target shard to started cannot cancel the recovery.

Closes elastic#9902
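One way to picture the idea behind the fix is a shard object that owns its recovery stages and only changes shard state when the last stage is reached. The sketch below is a toy model under stated assumptions: `ToyShard`, its methods, and the stage names are illustrative and are not taken from the actual IndexShard or RecoveryState code.

```python
class ToyShard:
    """Hypothetical stand-in for a shard that owns its recovery stages."""

    # Illustrative stage names; the real stage list may differ.
    STAGES = ["INIT", "INDEX", "TRANSLOG", "FINALIZE", "DONE"]

    def __init__(self):
        self.stage = "INIT"
        self.state = "RECOVERING"

    def advance(self, next_stage):
        """Move to the next stage, rejecting skipped or repeated stages."""
        if self.stage == "DONE":
            raise ValueError("recovery already finished")
        expected = self.STAGES[self.STAGES.index(self.stage) + 1]
        if next_stage != expected:
            raise ValueError(f"expected {expected}, got {next_stage}")
        self.stage = next_stage
        if next_stage == "DONE":
            # Shard state changes only here, coupled to the end of recovery.
            self.state = "POST_RECOVERY"

    def can_be_started_by_master(self):
        # The master can only promote a shard that reached POST_RECOVERY.
        return self.state == "POST_RECOVERY"


if __name__ == "__main__":
    shard = ToyShard()
    for stage in ToyShard.STAGES[1:]:
        shard.advance(stage)
        print(stage, shard.state, shard.can_be_started_by_master())
```

Because the state change lives in the same place as the stage transition, there is no window in this model where the shard is in post recovery while the recovery is still running.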

bleskes added the v2.0.0-beta1, resiliency, v1.5.0, and :Distributed/Recovery (anything around constructing a new shard, either from a local or a remote source) labels Jan 30, 2015

s1monw added the blocker label Feb 9, 2015

bleskes mentioned this issue Feb 26, 2015

Unify RecoveryState management to IndexShard and clean up semantics #9902

Closed

bleskes closed this as completed in 0cec37f Feb 27, 2015
