Ensure cluster is stable in ShrinkIndexIT.testShrinkThenSplitWithFailedNode #44860

tlrx · 2019-07-25T15:23:21Z

The test ShrinkIndexIT.testShrinkThenSplitWithFailedNode sometimes fails because the resize operation is not acknowledged (see #44736). This resize operation creates a new index "splitagain", which results in a cluster state update (TransportResizeAction uses MetaDataCreateIndexService.createIndex() to create the resized index). This cluster state update is expected to be acknowledged by all nodes (see IndexCreationTask.onAllNodesAcked()), but this is not always the case: the data node that was stopped in the test just before the resize operation might still be considered a "faulty" node by the FollowersChecker, and not yet removed from the cluster nodes. The cluster state is then acked by all nodes but one, which results in a non-acknowledged resize operation.

This pull request adds an ensureStableCluster() check after stopping the node in the test. The goal is to ensure that the data node has been correctly removed from the cluster and that all nodes are fully connected to each other before moving forward with the resize operation.
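
To illustrate the race, here is a toy model in plain Java. It is not Elasticsearch code: the class AckSketch, the method isAcknowledged() and the node names are hypothetical, and the only assumption it encodes is that an update counts as acknowledged when every node still listed in the cluster state responds.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy model (not Elasticsearch code) of an "acked by all nodes" check. */
public class AckSketch {

    /**
     * Assumption for this sketch: an update is acknowledged only if every
     * node still listed in the cluster state is alive and can respond.
     */
    public static boolean isAcknowledged(Set<String> nodesInClusterState, Set<String> liveNodes) {
        return liveNodes.containsAll(nodesInClusterState);
    }

    public static void main(String[] args) {
        // Hypothetical three-node cluster.
        Set<String> clusterState = new LinkedHashSet<>(Set.of("master", "data-1", "data-2"));
        Set<String> live = new LinkedHashSet<>(clusterState);

        // The stopped node goes away immediately ...
        live.remove("data-2");
        // ... but is still listed in the cluster state, so it can never ack.
        System.out.println("before removal: acked=" + isAcknowledged(clusterState, live));

        // Waiting for a stable cluster of (nodeCount - 1) nodes corresponds,
        // in this model, to the stopped node having left the cluster state.
        clusterState.remove("data-2");
        System.out.println("after removal: acked=" + isAcknowledged(clusterState, live));
    }
}
```

In this model the first check fails and the second succeeds, which mirrors why the test waits for the smaller cluster to stabilise before issuing the resize.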

Closes #44736

elasticmachine · 2019-07-25T15:23:24Z

Pinging @elastic/es-distributed

original-brownbear

LGTM, this makes perfect sense. When I investigated this the failure situation was the only time where the shutdown of the node didn't fully go through before the next CS update I think.

original-brownbear · 2019-07-25T15:29:36Z

server/src/test/java/org/elasticsearch/action/admin/indices/create/ShrinkIndexIT.java

        internalCluster().stopRandomNode(InternalTestCluster.nameFilter(shrinkNode));
+        ensureStableCluster(nodeCount -1);


NIT: the formatting of the -1 is missing a space, it should be - 1 :)

@original-brownbear you've got eagle 👀 :)

DaveCTurner

Yep seems reasonable to me too. Good catch.

tlrx · 2019-07-26T07:33:30Z

@elasticmachine run elasticsearch-ci/2

tlrx · 2019-07-26T08:13:12Z

Thanks @original-brownbear and @DaveCTurner


Ensure cluster is stable in ShrinkIndexIT

e381e08

tlrx added >test :Distributed/Distributed v8.0.0 v7.4.0 v7.3.1 v6.8.3 labels Jul 25, 2019

tlrx requested a review from DaveCTurner July 25, 2019 15:23

original-brownbear approved these changes Jul 25, 2019


DaveCTurner approved these changes Jul 25, 2019


add space

aa4ee99

tlrx merged commit e3997c6 into elastic:master Jul 26, 2019

tlrx deleted the fix-44736 branch July 26, 2019 08:13

jpountz added v7.3.0 v7.3.1 and removed v7.3.1 v7.3.0 labels Jul 26, 2019

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
