Cancel recoveries even if all shards assigned #46520

Merged · 13 commits merged into elastic:master on Oct 1, 2019

Conversation

@howardhuanghua (Contributor) commented Sep 10, 2019

Issue

Sometimes we need to perform a rolling restart so that static configuration changes take effect, or to upgrade the whole cluster. We tested this on a cluster with 6 nodes, 10TB of data in total, and 6000+ shards with 1 replica. Even though we performed a synced flush before restarting, each node took 10+ minutes for the cluster to return to GREEN, so a rolling restart of all nodes took more than 1 hour. For a cluster with 100+ nodes, an upgrade would need more than one day.

After digging into the related logic, we found an issue. Take 3 nodes A, B, C as an example:
Tested versions: 5.6.4, 6.4.3, 7.3.1.
Related settings:
"index.unassigned.node_left.delayed_timeout": 3m
"cluster.routing.allocation.node_concurrent_recoveries": 30
"indices.recovery.max_bytes_per_sec": 40mb

One node (A) restart flow:

  • Restart node A. All shards on node A become unassigned.
  • Before node A comes back, and before the delayed allocation timeout (3m) expires, no unassigned shards are relocated.
  • After node A starts up, all unassigned shards start recovering onto node A, and node A soon hits the concurrent-recovery throttle (30).
  • Then some unassigned shards start to be relocated to the other nodes (B, C), which are not throttled. Since a peer recovery from a remote node copies segment files and translogs, a shard of around 30GB takes 10+ minutes at the default 40mb/sec.

Solution

With this PR's optimization, the restart time for a single node in the case above drops from 10+ minutes to around 1 minute. The main logic:
After the restarted node comes back, if it is throttled, do not relocate its unassigned shards to other nodes before the delayed allocation timeout expires. In most cases this avoids copying segment files from a remote node.
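
For illustration only, here is a minimal, self-contained toy sketch of the proposed decision (this is not Elasticsearch code; the inputs delayRemainingMillis and restartedNodeThrottled are hypothetical stand-ins for the real allocation state):

enum Decision { ALLOCATE_TO_OTHER_NODE, WAIT_FOR_RESTARTED_NODE }

final class DelayedRelocationSketch {
    // Proposed behaviour: while the node_left delay has not expired and the returning
    // node is merely throttled, keep waiting instead of copying the shard elsewhere.
    static Decision decide(long delayRemainingMillis, boolean restartedNodeThrottled) {
        if (delayRemainingMillis > 0 && restartedNodeThrottled) {
            return Decision.WAIT_FOR_RESTARTED_NODE;
        }
        return Decision.ALLOCATE_TO_OTHER_NODE;
    }

    public static void main(String[] args) {
        System.out.println(decide(120_000L, true));  // WAIT_FOR_RESTARTED_NODE
        System.out.println(decide(0L, true));        // ALLOCATE_TO_OTHER_NODE
    }
}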

@elasticcla

Hi @howardhuanghua, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails to your GitHub profile (they can be hidden), so we can match your e-mails to your GitHub profile?

@howardhuanghua howardhuanghua marked this pull request as ready for review September 10, 2019 08:31
@DaveCTurner DaveCTurner added the :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label Sep 10, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@DaveCTurner (Contributor)

Thanks for the suggestion @howardhuanghua. I think we need to understand the situation you're describing a little more clearly. If your indices are successfully synced-flushed then it shouldn't matter if some of them start to recover onto other nodes because those recoveries should quickly be cancelled by recoveries onto the restarted node. Are you saying this is not the case? Can you share logs from such a restart with logger.org.elasticsearch.gateway.GatewayAllocator: TRACE so we can see why the recoveries are not being cancelled?
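
As an aside, that logger can be enabled dynamically via the transient cluster settings API. A minimal Java-client sketch (the client wiring here is assumed; any equivalent cluster-settings update works just as well):

client().admin().cluster().prepareUpdateSettings()
    .setTransientSettings(Settings.builder()
        .put("logger.org.elasticsearch.gateway.GatewayAllocator", "TRACE"))
    .get();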

Also can you supply tests that support your change?

@DaveCTurner (Contributor)

I think I see an issue:

// now allocate all the unassigned to available nodes
if (allocation.routingNodes().unassigned().size() > 0) {
    removeDelayMarkers(allocation);
    gatewayAllocator.allocateUnassigned(allocation);
}

We only consider cancelling ongoing recoveries if there are unassigned shards. I think this might explain why Elasticsearch spends time completing a recovery despite the synced-flushed shard on the restarted node. Is it the case that most of the recoveries are cancelled, but the last few run to completion?

@howardhuanghua (Contributor, Author) commented Sep 10, 2019

@DaveCTurner Yes, only a few shards run to completion.
I enabled logger.org.elasticsearch.gateway.GatewayAllocator: TRACE in our test cluster but could not see any useful log output. However, after the restarted node came back, and before the delay timeout, I could see the following INITIALIZING shards; 9.28.82.208 is the restarted node and 9.28.82.74 is a node that should not receive a shard.

lb_backend_server-300@1568044800000_7 2 r INITIALIZING 9.28.82.74 1527044744023702309
cvm_device-10@1568044800000_1 2 r INITIALIZING 9.28.82.208 1527044744023702209
vbcgw_broute_tunnel-60@1567872000000_3 1 r INITIALIZING 9.28.82.208 1527044744023702209
disk_iostat-10@1568044800000_1 2 r INITIALIZING 9.28.82.208 1527044744023702209

@DaveCTurner (Contributor)

@DaveCTurner Yes, only a few shards run to completion.

Thanks for confirming. In this case, I think it'd be better to contemplate cancelling recoveries even if there are no unallocated shards, rather than relying on the delayed allocation timeout as you propose.

@howardhuanghua (Contributor, Author)

@DaveCTurner Thanks. I will continue to look into cancelling the recoveries.

@howardhuanghua (Contributor, Author)

Hi @DaveCTurner, we have looked further into the slow recovery issue and have a new proposal.

Why are some shards allocated to other nodes?

// now allocate all the unassigned to available nodes
if (allocation.routingNodes().unassigned().size() > 0) {
    removeDelayMarkers(allocation);
    gatewayAllocator.allocateUnassigned(allocation);
}

The logic above tries to handle unassigned shards for which valid copies already exist.
However, the following shard-matching logic only considers the syncId or the matched file size between primary and replica:
if (replicaSyncId != null && replicaSyncId.equals(primarySyncId)) {
    // identical sync ids: treat this copy as a perfect match
    return Long.MAX_VALUE;
} else {
    // otherwise, count the bytes of segment files that the replica shares with the primary
    long sizeMatched = 0;
    for (StoreFileMetaData storeFileMetaData : storeFilesMetaData) {
        String metaDataFileName = storeFileMetaData.name();
        if (primaryStore.fileExists(metaDataFileName) && primaryStore.file(metaDataFileName).isSame(storeFileMetaData)) {
            sizeMatched += storeFileMetaData.length();
        }
    }
    return sizeMatched;
}

This has the following issues:

  • The syncId can change at any time when writes are ongoing.
  • Due to segment merging, primaryStore.fileExists(metaDataFileName) can be false for every file, so sizeMatched is 0. This causes the makeAllocationDecision function to return NOT_TAKEN, because matchingNodes.hasAnyData() is true (a toy illustration follows this list).
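
As a toy illustration of the second point (this is not Elasticsearch code; the file maps are hypothetical), after a force merge the primary's segment files no longer overlap with the replica's, so the matched size drops to zero:

import java.util.Map;

final class SizeMatchedSketch {
    // Mirrors the fileExists(...) && isSame(...) check from the snippet above,
    // simplified to a name-and-length comparison for illustration.
    static long sizeMatched(Map<String, Long> primaryFiles, Map<String, Long> replicaFiles) {
        long matched = 0;
        for (Map.Entry<String, Long> file : replicaFiles.entrySet()) {
            Long primaryLength = primaryFiles.get(file.getKey());
            if (primaryLength != null && primaryLength.equals(file.getValue())) {
                matched += file.getValue();
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        Map<String, Long> replica = Map.of("_0.cfs", 100L, "_1.cfs", 200L);
        Map<String, Long> mergedPrimary = Map.of("_2.cfs", 290L); // segments rewritten by the merge
        System.out.println(sizeMatched(mergedPrimary, replica)); // 0 -> no data match for this node
    }
}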

These issues cause the unassigned shard to fall through to the next allocation step: before the allocation delay timeout expires, these unassigned shards can be allocated to other nodes.

Why can the allocation not be cancelled?
The syncId is not equal between the primary and the replica, as we already checked above:

if (replicaSyncId != null && replicaSyncId.equals(primarySyncId)) {

so the following cancellation logic cannot take effect:

if (currentNode.equals(nodeWithHighestMatch) == false
    && Objects.equals(currentSyncId, primaryStore.syncId()) == false
    && matchingNodes.isNodeMatchBySyncID(nodeWithHighestMatch)) {

Because the allocation to the new node cannot be cancelled, the restart takes a long time.

Our proposal:
Sequence numbers (seqNo) were introduced in 6.x to speed up peer recovery, so we propose to stop allocating unassigned shards to other nodes before the allocation delay timeout; the recovery can then use seqNo against the returning node. This is what the PR above proposes.

We have another proposal to solve the issue, focused on this key code:

if (replicaSyncId != null && replicaSyncId.equals(primarySyncId)) {
    return Long.MAX_VALUE;
} else {
    long sizeMatched = 0;
    for (StoreFileMetaData storeFileMetaData : storeFilesMetaData) {
        String metaDataFileName = storeFileMetaData.name();
        if (primaryStore.fileExists(metaDataFileName) && primaryStore.file(metaDataFileName).isSame(storeFileMetaData)) {
            sizeMatched += storeFileMetaData.length();
        }
    }
    return sizeMatched;
}

There are several levels at which to check whether an unassigned shard should be relocated:

  • SyncId (already implemented).
  • SeqNo. We could introduce a minSeqNo for the primary shard, representing the minimum checkpoint of the primary's translog. We can currently read maxSeqNo from MetadataSnapshot.commitUserData directly, but there is no minSeqNo. Based on minSeqNo, we could compare against the replica's current local checkpoint to see whether we could recover from the translog even when the syncIds differ (see the toy sketch after this list). Possible minSeqNo implementations:
    1) Commit minSeqNo into the segments_N file, load it from there, and add it to MetadataSnapshot.commitUserData in the fetchData process. The problem is that the translog is cleaned asynchronously, so minSeqNo would not be the latest value.
    2) Get minSeqNo from the IndexShard (updated in real time) and put it into MetadataSnapshot.commitUserData in the fetchData process.
    3) Instead of checking minSeqNo, simply prefer the restarted node directly in the findMatchingNodes method.
  • Segment matching-size comparison (already implemented).
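
A toy sketch of the seqNo-based check mentioned in the second bullet (this is not Elasticsearch code; minRetainedSeqNo and replicaLocalCheckpoint are hypothetical inputs standing in for the primary's minimum retained sequence number and the replica's local checkpoint):

final class SeqNoRecoverySketch {
    // If every operation above the replica's local checkpoint is still retained on the
    // primary, the replica can be caught up by replaying operations instead of copying files.
    static boolean canRecoverFromHistory(long minRetainedSeqNo, long replicaLocalCheckpoint) {
        return replicaLocalCheckpoint + 1 >= minRetainedSeqNo;
    }

    public static void main(String[] args) {
        System.out.println(canRecoverFromHistory(100L, 250L)); // true: history covers the gap
        System.out.println(canRecoverFromHistory(300L, 250L)); // false: file-based recovery needed
    }
}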

Please help us evaluate these options; if one of them looks OK, we can provide a patch. Thanks.

@DaveCTurner (Contributor)

You are right that we could use sequence numbers to make a better allocation decision in the case that there is no sync id too, but we are already working on this in #46318.

@howardhuanghua (Contributor, Author)

Hi @DaveCTurner, we have checked sequence-number-based replica allocation. It can handle the first phase of rerouting unassigned shards. What do you think: do we still need to avoid allocating shards to other nodes before the node_left delay timeout, in case phase 1 has any issues, as this PR does in phase 2?

@DaveCTurner (Contributor)

I'm sorry I don't really understand the question. What is phase 2?

@howardhuanghua (Contributor, Author) commented Sep 16, 2019

Hi @DaveCTurner, in the reroute method, the first phase tries to allocate unassigned shards to a node that already has a data copy:

// now allocate all the unassigned to available nodes
if (allocation.routingNodes().unassigned().size() > 0) {
    removeDelayMarkers(allocation);
    gatewayAllocator.allocateUnassigned(allocation);
}

Sequence-number-based replica allocation would be part of phase 1, checking for an existing data copy; please correct me if I have something wrong.

Phase 2 then tries to allocate unassigned shards to a node that matches as well as possible, which may include a new node with no data copy.

So I opened the above PR to avoid allocating shards to other nodes before the node_left delay timeout in phase 2.

@DaveCTurner (Contributor)

Sorry @howardhuanghua I am perhaps misunderstanding the issue you are trying to fix with this PR. Can you add a test case that this change fixes? I think that would make things clearer. You might find it helpful to look at org.elasticsearch.cluster.routing.DelayedAllocationIT since this suite tests related properties.

@DaveCTurner (Contributor) left a review:
Thanks @howardhuanghua, I think this test is not realistic because of the skipAllocation flag you've added to the DelayedShardsMockGatewayAllocator. I have left a more detailed comment inline.

@@ -268,6 +271,9 @@ public void applyFailedShards(RoutingAllocation allocation, List<FailedShard> fa

    @Override
    public void allocateUnassigned(RoutingAllocation allocation) {
        if (this.skipAllocation) {
            return;
@DaveCTurner (Contributor) commented inline:

I don't understand this addition. As far as I can tell this makes this mock allocator behave quite differently from the real GatewayAllocator doesn't it? Your test only fails because of this difference in behaviour: if I comment this line out then your test passes without any changes to the production code. Can you provide a test case using the real allocator? I suggest adding to DelayedAllocationIT rather than here to ensure the test matches the production code more closely.

@howardhuanghua (Contributor, Author) replied:

I just want to simulate the case where an unassigned shard cannot be allocated by the GatewayAllocator and therefore also needs to be delayed by the ShardsAllocator. Suppose we skip the GatewayAllocator allocation: without this PR, if I remove the node that holds the replica shard, the unassigned shard is allocated immediately to the other node; with this PR, it is still delayed until delayed_timeout.

Considering that the sequence-number-based allocation decision can select the correct node, this PR would have the side effect of delaying phase 2 allocation until delayed_timeout. For more information please see #46520 (comment).

@howardhuanghua (Contributor, Author)

Sorry @DaveCTurner, I didn't add comments in time after committing the test case; let me try to explain our idea clearly.

Say we have a cluster with 3 nodes (A/B/C) and some indices, and one index named test has 1 primary shard and 1 replica shard. The test index shard layout is node A (shard p0), node B (shard r0), and node C. Set "index.unassigned.node_left.delayed_timeout" to 5 minutes.

The following steps create the scenario:

  1. Write some data to the test index.
  2. Do a synced flush.
  3. Take node B down.
  4. Do a force merge to merge the multiple segments of p0 into 1.
    This ensures that p0 has a different sync-id from r0 and also different segment files.
    Now r0's allocation is delayed until delayed_timeout:
    return AllocateUnassignedDecision.delayed(remainingDelayMillis, totalDelayMillis, nodeDecisions);
  5. Start node B back up.
    This is still before delayed_timeout; since both the sync-id and the segment files differ between p0 and r0, the makeAllocationDecision method returns NOT_TAKEN, so the gatewayAllocator cannot handle the unassigned r0 (what we called phase 1 before).

Next, the unassigned r0 is handled by the shardsAllocator (what we called phase 2 before). In this step, if node B is throttled by other shards, r0 can be allocated to node C and cannot be cancelled, so all the segment files have to be copied from p0.

Our PR above is intended to stop r0 from being allocated to node C before the allocation delay timeout in phase 2.

However, if the allocation decision is based on sequence numbers (#46318), r0 would be allocated to node B in phase 1 as long as p0 on node A still has a complete operation history. If r0 cannot be allocated in phase 1, that means there is no reusable data copy or complete operation history, and we should relocate it to a new node immediately. In that case our PR would still wait until delayed_timeout, which may not be appropriate.

So, all in all, we plan to fix only the issue that ongoing recoveries cannot be cancelled, as you mentioned in #46520 (comment). If you think that's OK, we will provide a patch for this issue only. Please share any advice you have, thanks a lot.

@DaveCTurner (Contributor)

Thanks @howardhuanghua for your patient explanations. I think I now understand the issue more clearly, and I am more certain that it will be fixed by the seqno-based replica shard allocator that we're working on.

This ensures that p0 has a different sync-id from r0 and also different segment files.

This is crucial to the issue you are seeing: the primary and replica must have absolutely nothing in common to hit this issue. If they share even a single segment (or a sync id) then we will hit the following code instead, and this will correctly throttle the allocation in the ReplicaShardAllocator rather than returning NOT_TAKEN:

} else if (matchingNodes.getNodeWithHighestMatch() != null) {
    RoutingNode nodeWithHighestMatch = allocation.routingNodes().node(matchingNodes.getNodeWithHighestMatch().getId());
    // we only check on THROTTLE since we checked before before on NO
    Decision decision = allocation.deciders().canAllocate(unassignedShard, nodeWithHighestMatch, allocation);
    if (decision.type() == Decision.Type.THROTTLE) {
        logger.debug("[{}][{}]: throttling allocation [{}] to [{}] in order to reuse its unallocated persistent store",
            unassignedShard.index(), unassignedShard.id(), unassignedShard, nodeWithHighestMatch.node());
        // we are throttling this, as we have enough other shards to allocate to this node, so ignore it for now
        return AllocateUnassignedDecision.throttle(nodeDecisions);
    } else {
        logger.debug("[{}][{}]: allocating [{}] to [{}] in order to reuse its unallocated persistent store",
            unassignedShard.index(), unassignedShard.id(), unassignedShard, nodeWithHighestMatch.node());
        // we found a match
        return AllocateUnassignedDecision.yes(nodeWithHighestMatch.node(), null, nodeDecisions, true);
    }

Once the ReplicaShardAllocator takes sequence numbers into account we will execute this code if the primary and replica have any operations in common, which they always do, so the shard will always be allocated back to the returning node.

So, all in all, we plan to fix only the issue that ongoing recoveries cannot be cancelled, as you mentioned in #46520 (comment). If you think that's OK, we will provide a patch for this issue only. Please share any advice you have, thanks a lot.

Yes, we'd very much appreciate a fix for that :)

@howardhuanghua (Contributor, Author)

Hi @DaveCTurner, thanks for the response. I am glad that we are now on the same page :). I will provide the fix soon.

@howardhuanghua (Contributor, Author)

Hi @DaveCTurner, I have updated the commit and added a hasInactiveShards check in both AllocationService.reroute and GatewayAllocator.innerAllocatedUnassigned. This handles the case where there are only initializing shards and they are peer-recovering to a new node rather than to the node with an existing data copy. Please take a look, thank you.

@DaveCTurner (Contributor) left a review:

Thanks @howardhuanghua, can you also supply some tests for this change? We need at least an ESIntegTestCase showing that it does cancel the last batch of recoveries. I would guess you could add this to org.elasticsearch.indices.recovery.IndexRecoveryIT.

-// now allocate all the unassigned to available nodes
-if (allocation.routingNodes().unassigned().size() > 0) {
+// now allocate all the unassigned to available nodes or cancel existing recoveries if we have a better match
+if (allocation.routingNodes().unassigned().size() > 0 || allocation.routingNodes().hasInactiveShards()) {
@DaveCTurner (Contributor) commented inline:

I wonder: why not remove this condition entirely?

@howardhuanghua (Contributor, Author) commented Sep 24, 2019

Thanks @DaveCTurner, I have updated the commit.

  1. Removed the unassigned/initializing shard condition entirely.
  2. Added testCancelNewShardRecoveryAndUsesExistingShardCopy to org.elasticsearch.indices.recovery.IndexRecoveryIT. This IT simulates a 3-node cluster in which one of the data nodes goes down and comes back; the recovery onto the new node should be cancelled and the existing shard copy used for the recovery instead.

@howardhuanghua (Contributor, Author)

Hi @DaveCTurner, could you please review the updated commit again? Thank you.

@DaveCTurner (Contributor) left a review:

Thanks @howardhuanghua. The test you provided failed when I ran it locally (see below for details). It's normally a good idea to run tests like this repeatedly since they are not fully deterministic and might not fail every time. That said, it looks like it's doing roughly the right things and I left some ideas for small improvements.

List<RecoveryState> nodeARecoveryStates = findRecoveriesForTargetNode(nodeA, recoveryStates);
assertThat(nodeARecoveryStates.size(), equalTo(1));
List<RecoveryState> nodeCRecoveryStates = findRecoveriesForTargetNode(nodeC, recoveryStates);
assertThat(nodeCRecoveryStates.size(), equalTo(1));
@DaveCTurner (Contributor) commented inline:

When I ran this test it failed here:

  2> REPRODUCE WITH: ./gradlew ':server:integTest' --tests "org.elasticsearch.indices.recovery.IndexRecoveryIT.testCancelNewShardRecoveryAndUsesExistingShardCopy {seed=[ECDF910E1F356F6D:FFC9E32BAD24745B]}" -Dtests.seed=ECDF910E1F356F6D -Dtests.security.manager=true -Dtests.jvms=4 -Dtests.locale=it -Dtests.timezone=America/Rio_Branco -Dcompiler.java=12 -Druntime.java=12
  2> java.lang.AssertionError: 
    Expected: <1>
         but: was <0>
        at __randomizedtesting.SeedInfo.seed([ECDF910E1F356F6D:FFC9E32BAD24745B]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)

assertBusy(() -> assertThat(client().admin().indices().prepareSyncedFlush(INDEX_NAME).get().failedShards(), equalTo(0)));

logger.info("--> slowing down recoveries");
slowDownRecovery(shardSize);
@DaveCTurner (Contributor) commented inline:

slowDownRecovery is for testing the throttling behaviour and is not sufficient here, as there is still a chance that the recovery finishes before it is cancelled, which would cause the test to fail. I think we must completely halt the recovery until it has been cancelled. I would do this by either capturing the START_RECOVERY action (see testRecoverLocallyUpToGlobalCheckpoint for instance) or one of the subsidiary requests (e.g. CLEAN_FILES as done in testOngoingRecoveryAndMasterFailOver).
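
A rough sketch of this blocking pattern, as a fragment inside the test method (it assumes the MockTransportService send-behavior hook and the PeerRecoveryTargetService action names used elsewhere in these tests; adjust as needed):

final CountDownLatch recoveryBlocked = new CountDownLatch(1);
final CountDownLatch releaseRecovery = new CountDownLatch(1);
final MockTransportService primaryTransportService =
    (MockTransportService) internalCluster().getInstance(TransportService.class, nodeA);
primaryTransportService.addSendBehavior(
    internalCluster().getInstance(TransportService.class, nodeC),
    (connection, requestId, action, request, options) -> {
        if (PeerRecoveryTargetService.Actions.CLEAN_FILES.equals(action)) {
            recoveryBlocked.countDown();
            try {
                releaseRecovery.await(); // hold the recovery here until the test releases it
            } catch (InterruptedException e) {
                throw new AssertionError(e);
            }
        }
        connection.sendRequest(requestId, action, request, options);
    });

The test can then await recoveryBlocked, restart node B, assert that the recovery onto node C is cancelled, and finally count down releaseRecovery.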

assertFalse(client().admin().cluster().prepareHealth().setWaitForNodes("3").get().isTimedOut());

// do sync flush to gen sync id
assertBusy(() -> assertThat(client().admin().indices().prepareSyncedFlush(INDEX_NAME).get().failedShards(), equalTo(0)));
@DaveCTurner (Contributor) commented inline:

Is an assertBusy necessary here? I think a failure of a synced flush is unexpected and should result in a test failure.

slowDownRecovery(shardSize);

logger.info("--> stop node B");
internalCluster().stopRandomNode(InternalTestCluster.nameFilter(nodeB));
@DaveCTurner (Contributor) commented inline:

It is better to use internalCluster().restartNode() which takes a RestartCallback whose onNodeStopped method is a good place to do things to the cluster while the node is stopped.
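
For reference, a minimal shape of that (using the RestartCallback API; the body here is just a placeholder):

internalCluster().restartNode(nodeB, new InternalTestCluster.RestartCallback() {
    @Override
    public Settings onNodeStopped(String nodeName) throws Exception {
        // act on the cluster while node B is down, e.g. wait for the recovery onto
        // node C to start and assert on the recovery states
        return super.onNodeStopped(nodeName);
    }
});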


logger.info("--> request recoveries");
// peer recovery from nodeA to nodeC should be canceled, replica should be allocated to nodeB that has the data copy
assertBusy(() -> {
@DaveCTurner (Contributor) commented inline:

Could we ensureGreen here instead of this assertBusy? I think all of the assertions in here should hold for sure once the cluster is green again.

@DaveCTurner (Contributor)

Also could you merge the latest master, because there are now some conflicts that need resolving.

@howardhuanghua (Contributor, Author)

Thanks @DaveCTurner, I have updated the test case based on your suggestions. While restarting the replica-shard node, the test holds the peer recovery from the primary shard to the new node, checks the peer recovery source/target while the replica-shard node is stopped, and finally makes sure the cluster goes green before releasing the held peer recovery. Please review again. Thanks for your help!

@DaveCTurner (Contributor) left a review:

Thanks @howardhuanghua, I left a few more comments, but this is looking very good.

new InternalTestCluster.RestartCallback() {
    @Override
    public Settings onNodeStopped(String nodeName) throws Exception {
        assertBusy(() -> {
@DaveCTurner (Contributor) commented inline:

😁 I was just about to note the missing wait here.

I think it'd be neater to wait for node A to send its CLEAN_FILES action instead of using an assertBusy. You can use another CountDownLatch for this.

RecoveryResponse response = client().admin().indices().prepareRecoveries(INDEX_NAME).execute().actionGet();

List<RecoveryState> recoveryStates = response.shardRecoveryStates().get(INDEX_NAME);
List<RecoveryState> nodeARecoveryStates = findRecoveriesForTargetNode(nodeA, recoveryStates);
@DaveCTurner (Contributor) commented inline:

I think we do not need to say anything about the recoveries on node A. These assertions are true, but not particularly important for this test.

}
});

// wait for peer recovering from nodeA to nodeB to be finished
@DaveCTurner (Contributor) commented inline:

It took me some time to work out why this works - I suggest this comment explaining it:

Suggested change
// wait for peer recovering from nodeA to nodeB to be finished
// wait for peer recovery from nodeA to nodeB which is a no-op recovery so it skips the CLEAN_FILES stage and hence is not blocked

final String nodeA = internalCluster().startNode();

logger.info("--> create index on node: {}", nodeA);
ByteSizeValue shardSize = createAndPopulateIndex(INDEX_NAME, 1, SHARD_COUNT, REPLICA_COUNT)
@DaveCTurner (Contributor) commented inline:

shardSize is unused:

Suggested change
ByteSizeValue shardSize = createAndPopulateIndex(INDEX_NAME, 1, SHARD_COUNT, REPLICA_COUNT)
createAndPopulateIndex(INDEX_NAME, 1, SHARD_COUNT, REPLICA_COUNT).getShards()[0].getStats().getStore().size();

logger.info("--> start node B");
// force a shard recovery from nodeA to nodeB
final String nodeB = internalCluster().startNode();
Settings nodeBDataPathSettings = internalCluster().dataPathSettings(nodeB);
@DaveCTurner (Contributor) commented inline:

nodeBDataPathSettings is unused:

Suggested change
Settings nodeBDataPathSettings = internalCluster().dataPathSettings(nodeB);


logger.info("--> start node C");
final String nodeC = internalCluster().startNode();
assertFalse(client().admin().cluster().prepareHealth().setWaitForNodes("3").get().isTimedOut());
@DaveCTurner (Contributor) commented inline:

I'd normally recommend the shorthand

Suggested change
assertFalse(client().admin().cluster().prepareHealth().setWaitForNodes("3").get().isTimedOut());
ensureStableCluster(3);

but I don't think this is necessary:

  • startNode() calls validateClusterFormed()
  • anyway it doesn't matter if node C takes a bit longer to join the cluster because we have to wait for its recovery to start which only happens after it's joined.

Therefore I think we can drop this:

Suggested change
assertFalse(client().admin().cluster().prepareHealth().setWaitForNodes("3").get().isTimedOut());

@howardhuanghua (Contributor, Author)

Hi @DaveCTurner, I appreciate your patient help! I have updated the test case; please take another look.

@DaveCTurner (Contributor)

@elasticmachine test this please

@DaveCTurner (Contributor) left a review:

LGTM thanks @howardhuanghua.

@DaveCTurner changed the title from "Optimize rolling restart efficiency." to "Cancel recoveries even if all shards assigned" on Oct 1, 2019
@DaveCTurner merged commit af930a7 into elastic:master on Oct 1, 2019
DaveCTurner added a commit that referenced this pull request Oct 1, 2019
DaveCTurner pushed a commit that referenced this pull request Oct 1, 2019
We cancel ongoing peer recoveries if a node joins the cluster with a completely
up-to-date copy of a shard, because we can use such a copy to recover a replica
instantly. However, today we only look for recoveries to cancel while there are
unassigned shards in the cluster. This means that we do not contemplate the
cancellation of the last few recoveries since recovering shards are not
unassigned.  It might take much longer for these recoveries to complete than
would be necessary if they were cancelled.

This commit fixes this by checking for cancellable recoveries even if all
shards are assigned.
Labels: >bug · :Distributed/Allocation · v7.5.0 · v8.0.0-alpha1
5 participants