Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IndexShard.routingEntry should only be updated once all internal state is ready #26776

Merged
merged 2 commits into from Sep 25, 2017

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Sep 25, 2017

The routing entry is used by external components to check whether the shard is ready to perform as primary. Most notably, the peer recovery source handler delays recoveries until the shard routing entry says the shard is ready.

When a shard is promoted to primary, we currently update the shard's routing entry before we finish all the work relating to the promotion. This can cause recoveries to fail later on because the GlobalCheckpointTracker isn't set (yet) to primary mode.

This commit fixes this issue by updating the routing entry last.

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@s1monw s1monw added the blocker label Sep 25, 2017
Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one question.

@@ -524,6 +523,8 @@ public void onFailure(Exception e) {
latch.countDown();
}
}
// set this last, once we finished updating all internal state.
this.shardRouting = newRouting;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in the case of a primary this should be published before we countdown the latch that lets the operations that are run under the permit block proceed. At that point we will have published the new primary term but not the new routing entry, and this shard will be used for example in the resync operations. Am I missing something here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. Thanks.

Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming green CI, LGTM.

@bleskes bleskes merged commit ec0c621 into elastic:master Sep 25, 2017
@bleskes bleskes deleted the recovery_corrupt_it_primary_promotion branch September 25, 2017 17:57
@bleskes
Copy link
Contributor Author

bleskes commented Sep 25, 2017

Thx @s1monw , @jasontedor

bleskes added a commit that referenced this pull request Sep 25, 2017
…ate is ready (#26776)

The routing entry is used by external components to check whether the shard is ready to perform as primary. Most notably, the peer recovery source handler delays recoveries until the shard routing entry says the shard is ready. 

When a shard is promoted to primary, we currently update the shard's routing entry before we finish all the work relating to the promotion. This can cause recoveries to fail later on because the `GlobalCheckpointTracker` isn't set (yet) to primary mode. 

This commit fixes this issue by updating the routing entry last.
bleskes added a commit that referenced this pull request Sep 25, 2017
…ate is ready (#26776)

The routing entry is used by external components to check whether the shard is ready to perform as primary. Most notably, the peer recovery source handler delays recoveries until the shard routing entry says the shard is ready. 

When a shard is promoted to primary, we currently update the shard's routing entry before we finish all the work relating to the promotion. This can cause recoveries to fail later on because the `GlobalCheckpointTracker` isn't set (yet) to primary mode. 

This commit fixes this issue by updating the routing entry last.
@lcawl lcawl removed the v6.1.0 label Dec 12, 2017
@jpountz jpountz added the :Core/Infra/Core Core issues without another label label Jan 29, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants