
Improve handling of failed primary replica handling #6825

Closed

Conversation

@kimchy (Member) commented Jul 11, 2014

Out of #6808, we improved the handling of a failing primary to make sure that initializing replicas are properly failed as well. After double-checking it, that change has two problems: first, if the same shard routing is failed again, there is no protection against applying the failure twice (a protection we do have in the failed-shard cases); second, we had already tried to handle this (wrongly) in the elect-primary method.
This change fixes the handling to work correctly in the elect-primary method, and adds unit tests to verify the behavior.
The change also exposes a problem in our handling of replica shards that are still initializing when a primary fails and another replica is elected as primary: we need to cancel the ongoing recovery of such replicas so that they restart recovering from the newly elected primary.
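To make the two fixes concrete, here is a minimal standalone sketch in Java. It is not Elasticsearch's actual routing code; the class, field, and method names (ShardRoutingTable, failPrimary, electNewPrimary, cancelInitializingRecoveries) are hypothetical, and the model is deliberately simplified to show only the duplicate-failure guard and the recovery cancellation on primary election.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the behavior described above; it is not
// Elasticsearch's actual routing code.
class ShardRoutingTable {
    enum State { INITIALIZING, STARTED }

    static class ShardRouting {
        boolean primary;
        State state;
        ShardRouting recoverySource; // the shard this replica recovers from

        ShardRouting(boolean primary, State state) {
            this.primary = primary;
            this.state = state;
        }
    }

    private final List<ShardRouting> shards = new ArrayList<>();

    void add(ShardRouting shard) {
        shards.add(shard);
    }

    /** Fail a primary, guarding against the same failure being applied twice. */
    void failPrimary(ShardRouting failed) {
        // Fix 1: if this exact routing instance was already removed by an
        // earlier failure, ignore the duplicate failure (List.remove uses the
        // default identity-based equals here).
        if (!shards.remove(failed)) {
            return;
        }
        electNewPrimary();
    }

    private void electNewPrimary() {
        for (ShardRouting candidate : shards) {
            if (candidate.state == State.STARTED) { // prefer an active replica
                candidate.primary = true;
                cancelInitializingRecoveries(candidate);
                return;
            }
        }
    }

    private void cancelInitializingRecoveries(ShardRouting newPrimary) {
        // Fix 2: replicas still recovering from the failed primary must have
        // their ongoing recovery cancelled and restarted from the newly
        // elected primary.
        for (ShardRouting shard : shards) {
            if (!shard.primary && shard.state == State.INITIALIZING) {
                shard.recoverySource = newPrimary;
            }
        }
    }
}
```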

@bleskes (Contributor) commented Jul 11, 2014

LGTM

@kimchy added the review label and removed the v1.2.0 label Jul 11, 2014
@martijnvg (Member) commented
LGTM

@kimchy closed this in 01ca81e Jul 11, 2014
@kimchy deleted the better_primary_failure_handling branch July 11, 2014 08:52
kimchy added a commit that referenced this pull request Jul 11, 2014
closes #6825
kimchy added a commit that referenced this pull request Jul 11, 2014
Make sure we use the shard routing instance itself to look it up, and not the shard id, as we might get back another instance.
Leftover from #6825
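The bug this follow-up commit fixes is easy to illustrate: several routing instances can exist over time for the same shard id (for example an already-replaced routing and its successor), so a lookup by shard id can hand back a different instance than the one the caller holds. A minimal sketch under that assumption (the RoutingLookup class and its method names are hypothetical, not the actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the lookup bug fixed by the commit above.
class RoutingLookup {
    static class ShardRouting {
        final int shardId;

        ShardRouting(int shardId) {
            this.shardId = shardId;
        }
    }

    private final List<ShardRouting> shards = new ArrayList<>();

    // Unsafe: multiple routing instances can exist for the same shard id
    // (e.g. an already-replaced routing and its successor), so this can
    // return a different instance than the one the caller holds.
    ShardRouting findByShardId(int shardId) {
        for (ShardRouting s : shards) {
            if (s.shardId == shardId) {
                return s;
            }
        }
        return null;
    }

    // Safe: check for the exact instance we were handed, so we only act if
    // that specific routing is still part of the table.
    boolean containsInstance(ShardRouting instance) {
        for (ShardRouting s : shards) {
            if (s == instance) {
                return true;
            }
        }
        return false;
    }
}
```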
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jul 15, 2014
Due to the change introduced in elastic#6825, we now start a local gateway recovery for replicas if the source node cannot be found. The recovery then fails because we never recover replicas from disk.
bleskes added a commit that referenced this pull request Jul 16, 2014
Due to the change introduced in #6825, we now start a local gateway recovery for replicas if the source node cannot be found. The recovery then fails because we never recover replicas from disk.

Closes #6879
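For context, a minimal sketch of the rule this fix restores (the RecoveryDecider class and its names are hypothetical; the real allocator is considerably more involved): only primaries may recover from their own on-disk state, so a replica whose recovery source node has left must be recovered from a peer (the current primary) rather than fall through to a local gateway recovery.

```java
// Hypothetical, simplified decision logic for the fix referenced above;
// names and structure do not match Elasticsearch's actual allocator.
class RecoveryDecider {
    enum RecoverySource { LOCAL_GATEWAY, PEER }

    static class Shard {
        final boolean primary;
        final String sourceNodeId; // null if the source node has left

        Shard(boolean primary, String sourceNodeId) {
            this.primary = primary;
            this.sourceNodeId = sourceNodeId;
        }
    }

    static RecoverySource chooseRecoverySource(Shard shard) {
        if (shard.primary) {
            // Only primaries may recover from their own on-disk state.
            return RecoverySource.LOCAL_GATEWAY;
        }
        // Replicas are never recovered from disk: even when the node that
        // served as the recovery source is gone, the replica must be
        // recovered from a peer (the current primary), not the gateway.
        return RecoverySource.PEER;
    }
}
```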
@jpountz removed the review label Jul 16, 2014
@clintongormley changed the title from "Improve handling of failed primary replica handling" to "Resiliency: Improve handling of failed primary replica handling" Jul 16, 2014
@clintongormley changed the title from "Resiliency: Improve handling of failed primary replica handling" to "Improve handling of failed primary replica handling" Jun 7, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
@clintongormley added the :Distributed/Distributed label and removed the :Cluster label Feb 13, 2018
Labels
:Distributed/Distributed, >enhancement, resiliency, v1.3.0, v2.0.0-beta1

5 participants