-
Notifications
You must be signed in to change notification settings - Fork 24.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fail demoted primary shards and retry request #16415
Fail demoted primary shards and retry request #16415
Conversation
This commit handles the scenario where a replication action fails on a replica shard, the primary shard attempts to fail the replica shard but the primary shard is notified of demotion by the master. In this scenario, the demoted primary shard must be failed, and then the request rerouted again to the new primary shard.
// TODO: handle catastrophic non-channel failures | ||
onReplicaFailure(nodeId, exp); | ||
public void onFailure(Throwable shardFailedError) { | ||
logger.error("[{}] catastrophic error while failing replica shard [{}] for [{}]", shardFailedError, shardId, shard, exp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is already logged in the shard state action
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 0aff805.
Thanks Jason. left some comments with suggestions |
This commit removes a logging statement from TransportReplicationAction that occurs when handling a catastrophic non-channel failure while failing a shard. The same information is also logged from ShardStateAction when the master responds to the shard failure request there.
This commit significantly simplifies the retry logic upon a demoted primary shard. In particular, rather failing the demoted primary via shard state action and starting a new reroute phase, this commit simplifies this logic by failing the demoted primary via an index shard reference, and sending a retry on primary exception back to the original reroute phase.
This commit adds a dedicated test for handling the situation of a demoted primary shard that was trying to fail a replica shard during a replication action.
This commit removes a dead request parameter from the constructor of TransportReplicationAction.ReplicationPhase.
* master: (85 commits) Rename variables. BootstrapSettings final with private constructor simplify method signature Register bootstrap settings Fix asciidoc typo Log warning if max file descriptors too low Require that relocation source is marked as relocating before starting recovery to relocation target apply feedback from @colings86 Speedup MessageDigestTests#testToHexString Fix serialization of `search_analyzer`. Cleanup JavaVersion Detach QueryShardContext from IndexShard and remove obsolete threadlocals Move sorting tests w/o scripting back to core Fix recovery translog stats totals when recovering from store [TEST] Don't assert on null value. It's fine to not always see an exception in this part of the test. Objects#requireNonNull guard in MessageDigests Fix trace logging statement in ZenDiscovery Avoid cloning MessageDigest instances Minor clean up. Make IndicesWarmer a private class of IndexService ...
@bleskes I've pushed more commits. |
ShardRouting primaryShard = indexShardReference.routingEntry(); | ||
String message = String.format(Locale.ROOT, "primary shard [%s] was demoted while failing replica shard [%s] for [%s]", primaryShard, shard, exp); | ||
// we are no longer the primary, fail ourselves and start over | ||
indexShardReference.failShard(message, shardFailedError); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can use forceFinishAsFailed here no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 29643b7.
@@ -882,14 +900,85 @@ public void testCounterDecrementedIfShardOperationThrowsException() throws Inter | |||
assertPhase(task, "failed"); | |||
} | |||
|
|||
public void testReroutePhaseRetriedAfterDemotedPrimary() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we know that we return with retry on primary (and we test for it), do we need a dedicated test here? can't we rely on the generic retry test (which I do home check the RetryOnPrimaryException semantics)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bleskes I think that runReplicateTest
is getting very complicated, and I'd prefer to keep this test with much simpler semantics for what is a complicated situation. Do you feel strongly that it should be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant removing this in favor of what I thought would be a centralized reroute test , but I know see there is no suche thing and we do it on a case by case basis. So Ignore me, all ok.
I think this turned out well. I left some comments. I'm thinking we should get this in and then open a follow up issue to deal with the annoying request-reset issue. |
LGTM. |
This commit handles the scenario where a replication action fails on a
replica shard, the primary shard attempts to fail the replica shard
but the primary shard is notified of demotion by the master. In this
scenario, the demoted primary shard must be failed, and then the
request rerouted again to the new primary shard.
Relates #14252