
Fail demoted primary shards and retry request #16415

Closed
wants to merge 11 commits

Conversation

jasontedor (Member) commented:

This commit handles the scenario where a replication action fails on a
replica shard, the primary shard attempts to fail the replica shard,
but the primary shard is notified by the master that it has been
demoted. In this scenario, the demoted primary shard must be failed,
and then the request rerouted again to the new primary shard.

Relates #14252
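To make the flow concrete, here is a self-contained sketch of the scenario. Every type and name below is a simplified stand-in invented for illustration, not an actual Elasticsearch class:

// Self-contained model of the scenario above; all types are illustrative
// stand-ins, not Elasticsearch classes.
final class DemotedPrimarySketch {

    /** Thrown by the "master" when the caller is no longer the primary. */
    static class NoLongerPrimaryException extends RuntimeException {}

    /** Signals the reroute phase to re-resolve the primary and resend. */
    static class RetryOnPrimaryException extends RuntimeException {
        RetryOnPrimaryException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    interface Master {
        void markReplicaAsFailed(String replicaId);
    }

    interface PrimaryShard {
        void failShard(String reason, Throwable cause);
    }

    /**
     * Invoked when a replication action failed on a replica. The primary asks
     * the master to fail the replica; if the master reports that this shard
     * was demoted in the meantime, the demoted primary fails itself and then
     * signals a retry so the request is rerouted to the new primary.
     */
    static void onReplicaFailure(Master master, PrimaryShard primary,
                                 String replicaId, Throwable replicaFailure) {
        try {
            master.markReplicaAsFailed(replicaId); // normal path: master removes the failed replica
        } catch (NoLongerPrimaryException demotion) {
            // we are no longer the primary: fail ourselves, then start over
            primary.failShard("primary was demoted while failing replica [" + replicaId + "]", demotion);
            throw new RetryOnPrimaryException("retrying on the newly promoted primary", demotion);
        }
    }
}

The essential ordering is the one the description states: the demoted primary is failed locally first, and only then is the request sent back around for rerouting to the new primary.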

// TODO: handle catastrophic non-channel failures
onReplicaFailure(nodeId, exp);

public void onFailure(Throwable shardFailedError) {
    logger.error("[{}] catastrophic error while failing replica shard [{}] for [{}]", shardFailedError, shardId, shard, exp);
Contributor:

this is already logged in the shard state action

Member Author:

Addressed in 0aff805.

bleskes (Contributor) commented Feb 4, 2016:

Thanks Jason. Left some comments with suggestions.

This commit removes a logging statement from TransportReplicationAction
that occurs when handling a catastrophic non-channel failure while
failing a shard. The same information is already logged from
ShardStateAction when the master responds to the shard failure request.
This commit significantly simplifies the retry logic upon a demoted
primary shard. In particular, rather than failing the demoted primary via
the shard state action and starting a new reroute phase, this commit
fails the demoted primary via an index shard reference and sends a
retry-on-primary exception back to the original reroute phase.
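The reroute-phase side of that contract could look roughly like the following sketch; the method and type names here are assumptions for illustration, not the actual TransportReplicationAction code:

// Simplified stand-in for the reroute phase's failure handling; names are
// illustrative assumptions, not the actual Elasticsearch API.
final class ReroutePhaseSketch {

    static class RetryOnPrimaryException extends RuntimeException {
        RetryOnPrimaryException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    void onFailure(Throwable failure) {
        if (failure instanceof RetryOnPrimaryException) {
            retry(failure);          // wait for a fresh cluster state, then reroute to the new primary
        } else {
            finishAsFailed(failure); // non-retryable: complete the request exceptionally
        }
    }

    void retry(Throwable reason) {
        // observe the cluster state for a change (for example, a new primary
        // being assigned) and restart the reroute phase from the top
    }

    void finishAsFailed(Throwable reason) {
        // deliver the failure to the original listener exactly once
    }
}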
This commit adds a dedicated test for handling the situation of a
demoted primary shard that was trying to fail a replica shard during a
replication action.
This commit removes a dead request parameter from the constructor of
TransportReplicationAction.ReplicationPhase.
* master: (85 commits)
  Rename variables.
  BootstrapSettings final with private constructor
  simplify method signature
  Register bootstrap settings
  Fix asciidoc typo
  Log warning if max file descriptors too low
  Require that relocation source is marked as relocating before starting recovery to relocation target
  apply feedback from @colings86
  Speedup MessageDigestTests#testToHexString
  Fix serialization of `search_analyzer`.
  Cleanup JavaVersion
  Detach QueryShardContext from IndexShard and remove obsolete threadlocals
  Move sorting tests w/o scripting back to core
  Fix recovery translog stats totals when recovering from store
  [TEST] Don't assert on null value. It's fine to not always see an exception in this part of the test.
  Objects#requireNonNull guard in MessageDigests
  Fix trace logging statement in ZenDiscovery
  Avoid cloning MessageDigest instances
  Minor clean up.
  Make IndicesWarmer a private class of IndexService
  ...
jasontedor (Member Author) commented:

@bleskes I've pushed more commits.

ShardRouting primaryShard = indexShardReference.routingEntry();
String message = String.format(Locale.ROOT, "primary shard [%s] was demoted while failing replica shard [%s] for [%s]", primaryShard, shard, exp);
// we are no longer the primary, fail ourselves and start over
indexShardReference.failShard(message, shardFailedError);
Contributor:

we can use forceFinishAsFailed here, no?

Member Author:

Done in 29643b7.

@@ -882,14 +900,85 @@ public void testCounterDecrementedIfShardOperationThrowsException() throws InterruptedException {
assertPhase(task, "failed");
}

public void testReroutePhaseRetriedAfterDemotedPrimary() {
Contributor:

since we know that we return with retry on primary (and we test for it), do we need a dedicated test here? can't we rely on the generic retry test (which I hope checks the RetryOnPrimaryException semantics)?

Member Author:

@bleskes I think that runReplicateTest is getting very complicated, and I'd prefer to keep this test with much simpler semantics for what is a complicated situation. Do you feel strongly that it should be removed?

Contributor:

I meant removing this in favor of what I thought would be a centralized reroute test, but I now see there is no such thing and we do it on a case-by-case basis. So ignore me, all OK.
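For reference, here is a minimal sketch of the shape of such a dedicated test, written against the simplified DemotedPrimarySketch model above rather than the real TransportReplicationActionTests harness (JUnit-style fail/assertTrue assumed):

// Hypothetical test against the DemotedPrimarySketch model, not the actual
// TransportReplicationActionTests harness.
public void testReroutePhaseRetriedAfterDemotedPrimary() {
    java.util.concurrent.atomic.AtomicBoolean primaryFailed =
            new java.util.concurrent.atomic.AtomicBoolean();
    DemotedPrimarySketch.Master demotingMaster = replicaId -> {
        throw new DemotedPrimarySketch.NoLongerPrimaryException(); // master demoted us mid-request
    };
    DemotedPrimarySketch.PrimaryShard primary =
            (reason, cause) -> primaryFailed.set(true); // record the local shard failure
    try {
        DemotedPrimarySketch.onReplicaFailure(demotingMaster, primary, "replica-0",
                new RuntimeException("replica shard failed"));
        fail("expected a retry-on-primary signal");
    } catch (DemotedPrimarySketch.RetryOnPrimaryException expected) {
        assertTrue("demoted primary must fail itself before retrying", primaryFailed.get());
    }
}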

bleskes (Contributor) commented Feb 10, 2016:

I think this turned out well. I left some comments. I'm thinking we should get this in and then open a follow-up issue to deal with the annoying request-reset issue.

bleskes (Contributor) commented Feb 10, 2016:

LGTM.

@jasontedor deleted the waiting-is-the-hardest-part branch February 24, 2016 10:24
@clintongormley added the :Distributed/Distributed label and removed the :Cluster label Feb 13, 2018
Labels: :Distributed/Distributed, >enhancement, v5.0.0-alpha1