
Trigger replica recovery restarts by master when primary relocation completes #23926

Conversation

ywelsch
Contributor

@ywelsch ywelsch commented Apr 5, 2017

When a primary relocation completes while there are ongoing replica recoveries, the recoveries for these replicas need to be restarted (as a new primary is in charge of replicating changes). Before this PR, the need for a recovery restart was detected by the data nodes that had the replicas, by checking on each cluster state update if the recovery process had completed before the recovery source changed. That code had a race, however, which could lead to a not-fully recovered shard exposing itself as started (see #23904).

This PR takes a different approach: When the primary relocation completes and the master updates the cluster state to move the primary shard from relocating to started, it will reinitialize all initializing replica shards, by giving them a fresh allocation id. Data nodes that have the replica shard will simply detect that the allocation id changed and restart the recovery process (instead of trying to determine the need to restart based on ongoing recoveries).

Note: Removal of the code in IndicesClusterStateService that checks whether the recovery source has changed will not be backported to the 5.x branch. This ensures backward compatibility for the situation where the master node is older and does not have the code changes that have been introduced in this PR.

Closes #23904
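The approach described above can be illustrated with a minimal sketch. The `Replica` class and `reinitialize` method below are hypothetical stand-ins, not the actual Elasticsearch `ShardRouting`/`AllocationId` API; the only point is the mechanism: an initializing replica is rebuilt with a fresh allocation id, so a data node holding the replica sees the id change and restarts recovery.

```java
// Hypothetical sketch only: illustrates "reinitialize with a fresh
// allocation id", not the real Elasticsearch routing classes.
import java.util.UUID;

public class ReplicaReinit {
    /** Minimal stand-in for a replica shard routing entry. */
    static final class Replica {
        final String allocationId;
        final boolean initializing;
        Replica(String allocationId, boolean initializing) {
            this.allocationId = allocationId;
            this.initializing = initializing;
        }
    }

    /** Rebuild an initializing replica with a fresh allocation id. */
    static Replica reinitialize(Replica replica) {
        if (!replica.initializing) {
            throw new IllegalArgumentException("only initializing replicas are reinitialized");
        }
        return new Replica(UUID.randomUUID().toString(), true);
    }

    public static void main(String[] args) {
        Replica before = new Replica(UUID.randomUUID().toString(), true);
        Replica fresh = reinitialize(before);
        // The data node compares ids, sees a change, and restarts recovery.
        System.out.println(fresh.allocationId.equals(before.allocationId)); // prints false
    }
}
```

In the sketch, the id comparison replaces the racy "did the recovery source change before recovery completed?" check that this PR removes.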

@ywelsch ywelsch added :Allocation :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement v5.4.0 v6.0.0-alpha1 labels Apr 5, 2017
@ywelsch ywelsch requested a review from bleskes April 5, 2017 17:46
Contributor

@bleskes bleskes left a comment


Looks great. Left a bunch of nits and questions. Nothing major there

@@ -69,6 +69,11 @@
*/
void replicaPromoted(ShardRouting replicaShard);

/**
* Called when an initializing replica is reinitialized.
Contributor


can you add a little note as to when this happens?

Contributor Author


sure, I've pushed f53cded

@@ -120,6 +125,11 @@ public void startedPrimaryReinitialized(ShardRouting startedPrimaryShard, ShardR
public void replicaPromoted(ShardRouting replicaShard) {

}

@Override
public void initializedReplicaReinitialized(ShardRouting initializingReplica, ShardRouting reinitializedReplica) {
Contributor


wondering (unrelated to this change) - why not use default implementations on the interface?

Contributor


also the parameter naming can be very confusing - can we add java docs? maybe call it oldReplica & reinitializedReplica?

Contributor Author


  1. There is an implementation of the interface (RoutingNodesChangedObserver) where I want to be 100% sure that every method that is added to the interface is properly implemented in that class. With default methods, it's easy to miss this.
  2. I've renamed the method parameter.
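The trade-off in point 1 can be shown with a small sketch. The interface and classes below are simplified stand-ins (not the real `RoutingNodesChangedObserver`): an abstract method forces every implementor to handle a newly added event at compile time, while a default method is silently inherited and easy to miss.

```java
// Illustrative stand-ins for the design choice discussed above; names
// are hypothetical, not the actual Elasticsearch observer interface.
interface RoutingObserver {
    void replicaPromoted(String shard);          // abstract: implementors must handle it
    default void shardStarted(String shard) {}   // default: silently inherited if forgotten
}

// If another abstract method were added to the interface, this class
// would fail to compile; a forgotten default method gives no such signal.
class ChangedObserver implements RoutingObserver {
    int promoted;
    @Override public void replicaPromoted(String shard) { promoted++; }
}

public class ObserverDemo {
    public static void main(String[] args) {
        ChangedObserver o = new ChangedObserver();
        o.replicaPromoted("s0");
        o.shardStarted("s1"); // no-op: the default body was never overridden
        System.out.println(o.promoted); // prints 1
    }
}
```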

public void initializedReplicaReinitialized(ShardRouting initializingReplica, ShardRouting reinitializedReplica) {
assert initializingReplica.initializing() && initializingReplica.primary() == false :
"expected initializing replica shard " + initializingReplica;
assert reinitializedReplica.initializing() && reinitializedReplica.primary() == false :
Contributor


maybe add an assertion that the allocation id is changed?

Contributor Author


sure, I've pushed f53cded

@@ -416,6 +416,9 @@ public void updateRoutingEntry(ShardRouting newRouting) throws IOException {
// active primaries.
throw new IndexShardRelocatedException(shardId(), "Shard is marked as relocated, cannot safely move to state " + newRouting.state());
}
assert newRouting.active() == false || state == IndexShardState.STARTED || state == IndexShardState.RELOCATED ||
Contributor

@bleskes bleskes Apr 10, 2017


to make sure I don't miss something: this PR doesn't make this assertion pass, it was relevant before as well, right?

Contributor Author


yes, it was relevant before, and actually broken: If we had had this assertion, it would have triggered when we had the test failure here: #23904.

@@ -416,6 +416,9 @@ public void updateRoutingEntry(ShardRouting newRouting) throws IOException {
// active primaries.
throw new IndexShardRelocatedException(shardId(), "Shard is marked as relocated, cannot safely move to state " + newRouting.state());
}
assert newRouting.active() == false || state == IndexShardState.STARTED || state == IndexShardState.RELOCATED ||
state == IndexShardState.CLOSED :
"shard state is " + state + ", but routing is active " + newRouting;
Contributor


I'm not sure this is correct.

Contributor Author


can you elaborate? What's not correct?
What I want to check is newRouting.active() ==> state == XYZ, which is rewritten as newRouting.active() == false || state == XYZ
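For readers less used to this rewriting: Java has no implication operator, so the implication A ==> B is encoded as `!A || B`, which is false exactly when A holds and B does not. A minimal illustration (the class name is hypothetical; only the boolean identity matters):

```java
// Sketch of the standard encoding of logical implication in an assert.
public class Implication {
    static boolean implies(boolean a, boolean b) {
        return !a || b; // vacuously true whenever a is false
    }

    public static void main(String[] args) {
        System.out.println(implies(true, true));   // prints true
        System.out.println(implies(true, false));  // prints false (the only violating case)
        System.out.println(implies(false, false)); // prints true
    }
}
```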

Contributor


OK, got it. I got confused. I would prefer a message like "routing is active, but local shard state isn't. routing [], local state []". If you prefer yours, I'm good with keeping it as is.


logger.info("--> building initial cluster state");
AllocationId primaryId = AllocationId.newRelocation(AllocationId.newInitializing());
AllocationId replicaId = AllocationId.newRelocation(AllocationId.newInitializing());
Contributor


This doesn't need to be relocating per se? should we use the relocatingReplica boolean?

Contributor Author


right, I've pushed f53cded

assertNotEquals(replica.allocationId().getId(), startedReplica.allocationId().getId());
}

logger.info("--> test starting of relocating primary shard together with initializing / relocating replica");
Contributor


++

Contributor

@bleskes bleskes left a comment


LGTM. Thanks Yannick.

@@ -416,6 +416,9 @@ public void updateRoutingEntry(ShardRouting newRouting) throws IOException {
// active primaries.
throw new IndexShardRelocatedException(shardId(), "Shard is marked as relocated, cannot safely move to state " + newRouting.state());
}
assert newRouting.active() == false || state == IndexShardState.STARTED || state == IndexShardState.RELOCATED ||
state == IndexShardState.CLOSED :
"shard state is " + state + ", but routing is active " + newRouting;
Contributor


OK, got it. I got confused. I would prefer a message like "routing is active, but local shard state isn't. routing [], local state []". If you prefer yours, I'm good with keeping it as is.

@ywelsch ywelsch merged commit 88a54f1 into elastic:master Apr 11, 2017
ywelsch added a commit that referenced this pull request Apr 11, 2017
…ompletes (#23926)

@lcawl lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018