Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shard failure requests for non-existent shards #16089

Closed
wants to merge 7 commits into from
Closed

Shard failure requests for non-existent shards #16089

wants to merge 7 commits into from

Conversation

jasontedor
Copy link
Member

This commit adds handling on the master side for shard failure requests
for shards that do not exist at the time that they are processed on the
master node (whether it be from errant requests, duplicate requests, or
both the primary and replica notifying the master of a shard
failure). This change is made because such shard failure requests should
always be considered successful (the failed shard is not there anymore),
but could be marked as failed if batched with a shard failure request
that does in fact fail.

Relates #14252

This commit adds handling on the master side for shard failure requests
for shards that do not exist at the time that they are processed on the
master node (whether it be from errant requests, duplicate requests, or
both the primary and replica notifying the master of a shard
failure). This change is made because such shard failure requests should
always be considered successful (the failed shard is not there anymore),
but could be marked as failed if batched with a shard failure request
that does in fact fail.
@jasontedor jasontedor mentioned this pull request Jan 20, 2016
9 tasks
@jasontedor
Copy link
Member Author

@bleskes I've updated this pull request.

@@ -223,9 +227,24 @@ public ShardFailedClusterStateTaskExecutor(AllocationService allocationService,
@Override
public BatchResult<ShardRoutingEntry> execute(ClusterState currentState, List<ShardRoutingEntry> tasks) throws Exception {
BatchResult.Builder<ShardRoutingEntry> batchResultBuilder = BatchResult.builder();
Set<ShardRoutingEntry> nonTrivialTasks = Collections.newSetFromMap(new IdentityHashMap<>());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call these - missingShardTasks or alreadyRemovedTasks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry - got confused - maybe resolvedShards? (nonTrivial is so vague)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bleskes I pushed 418a63f.

@bleskes
Copy link
Contributor

bleskes commented Jan 21, 2016

Thanks @jasontedor . Left some comments/questions...

This commit renames a variable in
ShardFailedClusterStateTaskExecutor#execute in that it contains the
tasks that need the allocation service to be processed.
This commit simplifies the task-splitting logic when processing a batch
of shard failure tasks.
@@ -120,7 +120,7 @@ private TaskResult(Throwable failure) {
}

public boolean isSuccess() {
return failure != null;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bug, but fortunately the method was unused until now so not impactful.

@jasontedor
Copy link
Member Author

@bleskes I pushed two commits.

This commit adds tests of the execution logic when failing a batch of
shards. Three tests are added:
 - test that an empty task list produces no change in the cluster state
 - test that duplicate shard failure requests are processed successfully
 - test that shard failure requests for non-existent shards are
   processed successfully
 - test that failing tasks do not prevent trivially successful tasks
   from succeeding

// tasks that correspond to non-existent shards are marked
// as successful
batchResultBuilder.successes(partition.get(false));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this up next to the stream partitioning so it will be clearer why we do this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 8d8bd8b.

This commit rearranges code to clarify the intent of the task
partitioning in the shard failure cluster state task executor.
}

private List<ShardStateAction.ShardRoutingEntry> createNonExistentShards(String reason) {
MetaData nonExistentMetaData =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also add shards with the same index and id as the existing ones but with a different allocation ids?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also add shards with the same index and id as the existing ones but with a different allocation ids?

That's a good suggestion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bleskes I pushed aa6a5c9.

@bleskes
Copy link
Contributor

bleskes commented Jan 26, 2016

LGTM. Left some suggestions for the tests, non of it is blocking pushing this.

This commit adds some shards with the same shard ID as existing shards,
but with a non-existent allocation ID. This tests that these shards are
correctly handled by the shard failure cluster state task executor.
This commit adds assertions that when a shard failure request is
successful, the shard that was requested to be failed is in fact no
longer in the cluster state.
@jasontedor jasontedor deleted the validate-shard-failure-requests branch January 26, 2016 21:38
@jasontedor jasontedor removed the review label Jan 26, 2016
@jasontedor
Copy link
Member Author

Thanks for another great review @bleskes.

@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Cluster labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >enhancement v5.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants