
Improve handling of failed primary replica handling #6825

Closed

Conversation

@kimchy (Member) commented Jul 11, 2014

Out of #6808, we improved the handling of a failing primary to make sure that initializing replicas are properly failed as well. After double-checking it, that change has two problems: first, if the same shard routing is failed again, there is no protection against applying the failure twice (a protection we do have in the failed-shard cases); second, we had already tried to handle this (wrongly) in the elect-primary method.
This change fixes the handling to work correctly in the elect-primary method, and adds unit tests to verify the behavior.
The change also exposes a problem in our handling of replica shards that are still initializing when a primary fails and another replica is elected as primary: we need to cancel the ongoing recovery of such replicas so that they restart recovering from the newly elected primary.
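To make the two fixes concrete, here is a minimal standalone sketch in Java. It is not Elasticsearch's actual routing code; the class, field, and method names (ShardRoutingTable, failPrimary, electNewPrimary, cancelInitializingRecoveries) are hypothetical, and the model is deliberately simplified to show only the duplicate-failure guard and the recovery cancellation on primary election.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the behavior described above; it is not
// Elasticsearch's actual routing code.
class ShardRoutingTable {
    enum State { INITIALIZING, STARTED }

    static class ShardRouting {
        boolean primary;
        State state;
        ShardRouting recoverySource; // the shard this replica recovers from

        ShardRouting(boolean primary, State state) {
            this.primary = primary;
            this.state = state;
        }
    }

    private final List<ShardRouting> shards = new ArrayList<>();

    void add(ShardRouting shard) {
        shards.add(shard);
    }

    /** Fail a primary, guarding against the same failure being applied twice. */
    void failPrimary(ShardRouting failed) {
        // Fix 1: if this exact routing instance was already removed by an
        // earlier failure, ignore the duplicate failure (List.remove uses the
        // default identity-based equals here).
        if (!shards.remove(failed)) {
            return;
        }
        electNewPrimary();
    }

    private void electNewPrimary() {
        for (ShardRouting candidate : shards) {
            if (candidate.state == State.STARTED) { // prefer an active replica
                candidate.primary = true;
                cancelInitializingRecoveries(candidate);
                return;
            }
        }
    }

    private void cancelInitializingRecoveries(ShardRouting newPrimary) {
        // Fix 2: replicas still recovering from the failed primary must have
        // their ongoing recovery cancelled and restarted from the newly
        // elected primary.
        for (ShardRouting shard : shards) {
            if (!shard.primary && shard.state == State.INITIALIZING) {
                shard.recoverySource = newPrimary;
            }
        }
    }
}
```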

@bleskes (Contributor) commented Jul 11, 2014

LGTM

@kimchy added the review label and removed the v1.2.0 label Jul 11, 2014
@martijnvg (Member) commented
LGTM

@kimchy closed this in 01ca81e Jul 11, 2014
@kimchy deleted the better_primary_failure_handling branch July 11, 2014 08:52
kimchy added a commit that referenced this pull request Jul 11, 2014
closes #6825
kimchy added a commit that referenced this pull request Jul 11, 2014
Make sure we use the shard routing instance itself to look it up, and not the shard id, as we might get back another instance.
Leftover from #6825
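The bug this follow-up commit fixes is easy to illustrate: several routing instances can exist over time for the same shard id (for example an already-replaced routing and its successor), so a lookup by shard id can hand back a different instance than the one the caller holds. A minimal sketch under that assumption (the RoutingLookup class and its method names are hypothetical, not the actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the lookup bug fixed by the commit above.
class RoutingLookup {
    static class ShardRouting {
        final int shardId;

        ShardRouting(int shardId) {
            this.shardId = shardId;
        }
    }

    private final List<ShardRouting> shards = new ArrayList<>();

    // Unsafe: multiple routing instances can exist for the same shard id
    // (e.g. an already-replaced routing and its successor), so this can
    // return a different instance than the one the caller holds.
    ShardRouting findByShardId(int shardId) {
        for (ShardRouting s : shards) {
            if (s.shardId == shardId) {
                return s;
            }
        }
        return null;
    }

    // Safe: check for the exact instance we were handed, so we only act if
    // that specific routing is still part of the table.
    boolean containsInstance(ShardRouting instance) {
        for (ShardRouting s : shards) {
            if (s == instance) {
                return true;
            }
        }
        return false;
    }
}
```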
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jul 15, 2014
Due to the change introduced in elastic#6825, we now start a local gateway recovery for replicas if the source node cannot be found. The recovery then fails because we never recover replicas from disk.
bleskes added a commit that referenced this pull request Jul 16, 2014
Due to the change introduced in #6825, we now start a local gateway recovery for replicas if the source node cannot be found. The recovery then fails because we never recover replicas from disk.

Closes #6879
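For context, a minimal sketch of the rule this fix restores (the RecoveryDecider class and its names are hypothetical; the real allocator is considerably more involved): only primaries may recover from their own on-disk state, so a replica whose recovery source node has left must be recovered from a peer (the current primary) rather than fall through to a local gateway recovery.

```java
// Hypothetical, simplified decision logic for the fix referenced above;
// names and structure do not match Elasticsearch's actual allocator.
class RecoveryDecider {
    enum RecoverySource { LOCAL_GATEWAY, PEER }

    static class Shard {
        final boolean primary;
        final String sourceNodeId; // null if the source node has left

        Shard(boolean primary, String sourceNodeId) {
            this.primary = primary;
            this.sourceNodeId = sourceNodeId;
        }
    }

    static RecoverySource chooseRecoverySource(Shard shard) {
        if (shard.primary) {
            // Only primaries may recover from their own on-disk state.
            return RecoverySource.LOCAL_GATEWAY;
        }
        // Replicas are never recovered from disk: even when the node that
        // served as the recovery source is gone, the replica must be
        // recovered from a peer (the current primary), not the gateway.
        return RecoverySource.PEER;
    }
}
```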
@jpountz removed the review label Jul 16, 2014
@clintongormley changed the title from "Improve handling of failed primary replica handling" to "Resiliency: Improve handling of failed primary replica handling" Jul 16, 2014
@clintongormley changed the title from "Resiliency: Improve handling of failed primary replica handling" to "Improve handling of failed primary replica handling" Jun 7, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
@clintongormley added the :Distributed/Distributed label and removed the :Cluster label Feb 13, 2018
Labels
:Distributed/Distributed, >enhancement, resiliency, v1.3.0, v2.0.0-beta1

5 participants