Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure cluster is stable in ShrinkIndexIT.testShrinkThenSplitWithFailedNode #44860

Merged
merged 2 commits into from
Jul 26, 2019

Conversation

tlrx
Copy link
Member

@tlrx tlrx commented Jul 25, 2019

The test ShrinkIndexIT.testShrinkThenSplitWithFailedNode sometimes fails because the resize operation is not acknowledged (see #44736). This resize operation creates a new index "splitagain" and it results in a cluster state update (TransportResizeAction uses MetaDataCreateIndexService.createIndex() to create the resized index). This cluster state update is expected to be acknowledged by all nodes (see IndexCreationTask.onAllNodesAcked()) but this is not always true: the data node that was just stopped in the test before executing the resize operation might still be considered as a "faulty" node (and not yet removed from the cluster nodes) by the FollowersChecker. The cluster state is then acked on all nodes but one, and it results in a non acknowledged resize operation.

This pull request adds an ensureStableCluster() check after stopping the node in the test. The goal is to ensure that the data node has been correctly removed from the cluster and that all nodes are fully connected to each before moving forward with the resize operation.

Closes #44736

@tlrx tlrx added >test Issues or PRs that are addressing/adding tests :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. v8.0.0 v7.4.0 v7.3.1 v6.8.3 labels Jul 25, 2019
@tlrx tlrx requested a review from DaveCTurner July 25, 2019 15:23
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Member

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, this makes perfect sense. When I investigated this the failure situation was the only time where the shutdown of the node didn't fully go through before the next CS update I think.

internalCluster().stopRandomNode(InternalTestCluster.nameFilter(shrinkNode));
ensureStableCluster(nodeCount -1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: formatting of the -1 is missing a space - 1 :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@original-brownbear you've got eagle 👀 :)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep seems reasonable to me too. Good catch.

@tlrx
Copy link
Member Author

tlrx commented Jul 26, 2019

@elasticmachine run elasticsearch-ci/2

@tlrx tlrx merged commit e3997c6 into elastic:master Jul 26, 2019
@tlrx
Copy link
Member Author

tlrx commented Jul 26, 2019

Thanks @original-brownbear and @DaveCTurner

@tlrx tlrx deleted the fix-44736 branch July 26, 2019 08:13
tlrx added a commit that referenced this pull request Jul 26, 2019
…edNode (#44860)

The test ShrinkIndexIT.testShrinkThenSplitWithFailedNode sometimes fails 
because the resize operation is not acknowledged (see #44736). This resize 
operation creates a new index "splitagain" and it results in a cluster state 
update (TransportResizeAction uses MetaDataCreateIndexService.createIndex() 
to create the resized index). This cluster state update is expected to be 
acknowledged by all nodes (see IndexCreationTask.onAllNodesAcked()) but 
this is not always true: the data node that was just stopped in the test before 
executing the resize operation might still be considered as a "faulty" node
 (and not yet removed from the cluster nodes) by the FollowersChecker. The 
cluster state is then acked on all nodes but one, and it results in a non 
acknowledged resize operation.

This commit adds an ensureStableCluster() check after stopping the node in 
the test. The goal is to ensure that the data node has been correctly removed 
from the cluster and that all nodes are fully connected to each before moving 
forward with the resize operation.

Closes #44736
tlrx added a commit that referenced this pull request Jul 26, 2019
…edNode (#44860)

The test ShrinkIndexIT.testShrinkThenSplitWithFailedNode sometimes fails 
because the resize operation is not acknowledged (see #44736). This resize 
operation creates a new index "splitagain" and it results in a cluster state 
update (TransportResizeAction uses MetaDataCreateIndexService.createIndex() 
to create the resized index). This cluster state update is expected to be 
acknowledged by all nodes (see IndexCreationTask.onAllNodesAcked()) but 
this is not always true: the data node that was just stopped in the test before 
executing the resize operation might still be considered as a "faulty" node
 (and not yet removed from the cluster nodes) by the FollowersChecker. The 
cluster state is then acked on all nodes but one, and it results in a non 
acknowledged resize operation.

This commit adds an ensureStableCluster() check after stopping the node in 
the test. The goal is to ensure that the data node has been correctly removed 
from the cluster and that all nodes are fully connected to each before moving 
forward with the resize operation.

Closes #44736
tlrx added a commit that referenced this pull request Jul 26, 2019
…edNode (#44860)

The test ShrinkIndexIT.testShrinkThenSplitWithFailedNode sometimes fails 
because the resize operation is not acknowledged (see #44736). This resize 
operation creates a new index "splitagain" and it results in a cluster state 
update (TransportResizeAction uses MetaDataCreateIndexService.createIndex() 
to create the resized index). This cluster state update is expected to be 
acknowledged by all nodes (see IndexCreationTask.onAllNodesAcked()) but 
this is not always true: the data node that was just stopped in the test before 
executing the resize operation might still be considered as a "faulty" node
 (and not yet removed from the cluster nodes) by the FollowersChecker. The 
cluster state is then acked on all nodes but one, and it results in a non 
acknowledged resize operation.

This commit adds an ensureStableCluster() check after stopping the node in 
the test. The goal is to ensure that the data node has been correctly removed 
from the cluster and that all nodes are fully connected to each before moving 
forward with the resize operation.

Closes #44736
polyfractal pushed a commit to polyfractal/elasticsearch that referenced this pull request Jul 29, 2019
…edNode (elastic#44860)

The test ShrinkIndexIT.testShrinkThenSplitWithFailedNode sometimes fails 
because the resize operation is not acknowledged (see elastic#44736). This resize 
operation creates a new index "splitagain" and it results in a cluster state 
update (TransportResizeAction uses MetaDataCreateIndexService.createIndex() 
to create the resized index). This cluster state update is expected to be 
acknowledged by all nodes (see IndexCreationTask.onAllNodesAcked()) but 
this is not always true: the data node that was just stopped in the test before 
executing the resize operation might still be considered as a "faulty" node
 (and not yet removed from the cluster nodes) by the FollowersChecker. The 
cluster state is then acked on all nodes but one, and it results in a non 
acknowledged resize operation.

This commit adds an ensureStableCluster() check after stopping the node in 
the test. The goal is to ensure that the data node has been correctly removed 
from the cluster and that all nodes are fully connected to each before moving 
forward with the resize operation.

Closes elastic#44736
jkakavas pushed a commit that referenced this pull request Jul 31, 2019
…edNode (#44860)

The test ShrinkIndexIT.testShrinkThenSplitWithFailedNode sometimes fails 
because the resize operation is not acknowledged (see #44736). This resize 
operation creates a new index "splitagain" and it results in a cluster state 
update (TransportResizeAction uses MetaDataCreateIndexService.createIndex() 
to create the resized index). This cluster state update is expected to be 
acknowledged by all nodes (see IndexCreationTask.onAllNodesAcked()) but 
this is not always true: the data node that was just stopped in the test before 
executing the resize operation might still be considered as a "faulty" node
 (and not yet removed from the cluster nodes) by the FollowersChecker. The 
cluster state is then acked on all nodes but one, and it results in a non 
acknowledged resize operation.

This commit adds an ensureStableCluster() check after stopping the node in 
the test. The goal is to ensure that the data node has been correctly removed 
from the cluster and that all nodes are fully connected to each before moving 
forward with the resize operation.

Closes #44736
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >test Issues or PRs that are addressing/adding tests v6.8.3 v7.3.1 v7.4.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] ShrinkIndexIT testShrinkThenSplitWithFailedNode failure
6 participants