
Fail demoted primary shards and retry request #16415

Closed
wants to merge 11 commits

Conversation

jasontedor (Member) commented:

This commit handles the scenario where a replication action fails on a
replica shard, the primary shard attempts to fail the replica shard,
but the primary shard is notified by the master that it has been
demoted. In this scenario, the demoted primary shard must be failed,
and then the request rerouted again to the new primary shard.

Relates #14252
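To make the flow concrete, here is a self-contained sketch of the scenario. Every type and name below is a simplified stand-in invented for illustration, not an actual Elasticsearch class:

// Self-contained model of the scenario above; all types are illustrative
// stand-ins, not Elasticsearch classes.
final class DemotedPrimarySketch {

    /** Thrown by the "master" when the caller is no longer the primary. */
    static class NoLongerPrimaryException extends RuntimeException {}

    /** Signals the reroute phase to re-resolve the primary and resend. */
    static class RetryOnPrimaryException extends RuntimeException {
        RetryOnPrimaryException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    interface Master {
        void markReplicaAsFailed(String replicaId);
    }

    interface PrimaryShard {
        void failShard(String reason, Throwable cause);
    }

    /**
     * Invoked when a replication action failed on a replica. The primary asks
     * the master to fail the replica; if the master reports that this shard
     * was demoted in the meantime, the demoted primary fails itself and then
     * signals a retry so the request is rerouted to the new primary.
     */
    static void onReplicaFailure(Master master, PrimaryShard primary,
                                 String replicaId, Throwable replicaFailure) {
        try {
            master.markReplicaAsFailed(replicaId); // normal path: master removes the failed replica
        } catch (NoLongerPrimaryException demotion) {
            // we are no longer the primary: fail ourselves, then start over
            primary.failShard("primary was demoted while failing replica [" + replicaId + "]", demotion);
            throw new RetryOnPrimaryException("retrying on the newly promoted primary", demotion);
        }
    }
}

The essential ordering is the one the description states: the demoted primary is failed locally first, and only then is the request sent back around for rerouting to the new primary.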

// TODO: handle catastrophic non-channel failures
onReplicaFailure(nodeId, exp);

public void onFailure(Throwable shardFailedError) {
    logger.error("[{}] catastrophic error while failing replica shard [{}] for [{}]", shardFailedError, shardId, shard, exp);
Contributor:

this is already logged in the shard state action

Member Author:

Addressed in 0aff805.

bleskes (Contributor) commented Feb 4, 2016:

Thanks Jason. Left some comments with suggestions.

This commit removes a logging statement from TransportReplicationAction
that occurs when handling a catastrophic non-channel failure while
failing a shard. The same information is already logged from
ShardStateAction when the master responds to the shard failure request.
This commit significantly simplifies the retry logic upon a demoted
primary shard. In particular, rather than failing the demoted primary via
the shard state action and starting a new reroute phase, this commit
fails the demoted primary via an index shard reference and sends a
retry-on-primary exception back to the original reroute phase.
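The reroute-phase side of that contract could look roughly like the following sketch; the method and type names here are assumptions for illustration, not the actual TransportReplicationAction code:

// Simplified stand-in for the reroute phase's failure handling; names are
// illustrative assumptions, not the actual Elasticsearch API.
final class ReroutePhaseSketch {

    static class RetryOnPrimaryException extends RuntimeException {
        RetryOnPrimaryException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    void onFailure(Throwable failure) {
        if (failure instanceof RetryOnPrimaryException) {
            retry(failure);          // wait for a fresh cluster state, then reroute to the new primary
        } else {
            finishAsFailed(failure); // non-retryable: complete the request exceptionally
        }
    }

    void retry(Throwable reason) {
        // observe the cluster state for a change (for example, a new primary
        // being assigned) and restart the reroute phase from the top
    }

    void finishAsFailed(Throwable reason) {
        // deliver the failure to the original listener exactly once
    }
}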
This commit adds a dedicated test for handling the situation of a
demoted primary shard that was trying to fail a replica shard during a
replication action.
This commit removes a dead request parameter from the constructor of
TransportReplicationAction.ReplicationPhase.
* master: (85 commits)
  Rename variables.
  BootstrapSettings final with private constructor
  simplify method signature
  Register bootstrap settings
  Fix asciidoc typo
  Log warning if max file descriptors too low
  Require that relocation source is marked as relocating before starting recovery to relocation target
  apply feedback from @colings86
  Speedup MessageDigestTests#testToHexString
  Fix serialization of `search_analyzer`.
  Cleanup JavaVersion
  Detach QueryShardContext from IndexShard and remove obsolete threadlocals
  Move sorting tests w/o scripting back to core
  Fix recovery translog stats totals when recovering from store
  [TEST] Don't assert on null value. It's fine to not always see an exception in this part of the test.
  Objects#requireNonNull guard in MessageDigests
  Fix trace logging statement in ZenDiscovery
  Avoid cloning MessageDigest instances
  Minor clean up.
  Make IndicesWarmer a private class of IndexService
  ...
jasontedor (Member Author) commented:

@bleskes I've pushed more commits.

ShardRouting primaryShard = indexShardReference.routingEntry();
String message = String.format(Locale.ROOT, "primary shard [%s] was demoted while failing replica shard [%s] for [%s]", primaryShard, shard, exp);
// we are no longer the primary, fail ourselves and start over
indexShardReference.failShard(message, shardFailedError);
Contributor:

we can use forceFinishAsFailed here, no?

Member Author:

Done in 29643b7.

@@ -882,14 +900,85 @@ public void testCounterDecrementedIfShardOperationThrowsException() throws InterruptedException {
assertPhase(task, "failed");
}

public void testReroutePhaseRetriedAfterDemotedPrimary() {
Contributor:

since we know that we return with retry on primary (and we test for it), do we need a dedicated test here? can't we rely on the generic retry test (which I hope checks the RetryOnPrimaryException semantics)?

Member Author:

@bleskes I think that runReplicateTest is getting very complicated, and I'd prefer to keep this test with much simpler semantics for what is a complicated situation. Do you feel strongly that it should be removed?

Contributor:

I meant removing this in favor of what I thought would be a centralized reroute test, but I now see there is no such thing and we do it on a case-by-case basis. So ignore me, all OK.
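For reference, here is a minimal sketch of the shape of such a dedicated test, written against the simplified DemotedPrimarySketch model above rather than the real TransportReplicationActionTests harness (JUnit-style fail/assertTrue assumed):

// Hypothetical test against the DemotedPrimarySketch model, not the actual
// TransportReplicationActionTests harness.
public void testReroutePhaseRetriedAfterDemotedPrimary() {
    java.util.concurrent.atomic.AtomicBoolean primaryFailed =
            new java.util.concurrent.atomic.AtomicBoolean();
    DemotedPrimarySketch.Master demotingMaster = replicaId -> {
        throw new DemotedPrimarySketch.NoLongerPrimaryException(); // master demoted us mid-request
    };
    DemotedPrimarySketch.PrimaryShard primary =
            (reason, cause) -> primaryFailed.set(true); // record the local shard failure
    try {
        DemotedPrimarySketch.onReplicaFailure(demotingMaster, primary, "replica-0",
                new RuntimeException("replica shard failed"));
        fail("expected a retry-on-primary signal");
    } catch (DemotedPrimarySketch.RetryOnPrimaryException expected) {
        assertTrue("demoted primary must fail itself before retrying", primaryFailed.get());
    }
}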

bleskes (Contributor) commented Feb 10, 2016:

I think this turned out well. I left some comments. I'm thinking we should get this in and then open a follow-up issue to deal with the annoying request-reset issue.

bleskes (Contributor) commented Feb 10, 2016:

LGTM.

@jasontedor deleted the waiting-is-the-hardest-part branch February 24, 2016 10:24
@clintongormley added the :Distributed/Distributed label and removed the :Cluster label Feb 13, 2018
Labels: :Distributed/Distributed, >enhancement, v5.0.0-alpha1