[CI] ShrinkIndexIT testShrinkThenSplitWithFailedNode failure #44736

Closed
jkakavas opened this issue Jul 23, 2019 · 1 comment · Fixed by #44860
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >test-failure Triaged test failures from CI

Comments

@jkakavas
Member

jkakavas commented Jul 23, 2019

Build scan: https://gradle.com/s/k5loufononht4
Console log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=zulu12,nodes=general-purpose/98/console

Failure:

07:53:55   2> REPRODUCE WITH: ./gradlew :server:integTest --tests "org.elasticsearch.action.admin.indices.create.ShrinkIndexIT.testShrinkThenSplitWithFailedNode" -Dtests.seed=7F9B0926CAB77C2 -Dtests.security.manager=true -Dtests.locale=nus -Dtests.timezone=America/Indiana/Knox -Dcompiler.java=12 -Druntime.java=12
07:53:55   2> java.lang.AssertionError: ResizeResponse failed - not acked
07:53:55     Expected: <true>
07:53:55          but: was <false>
07:53:55         at __randomizedtesting.SeedInfo.seed([7F9B0926CAB77C2:DAC471C7C22737B0]:0)
07:53:55         at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
07:53:55         at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked(ElasticsearchAssertions.java:112)
07:53:55         at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked(ElasticsearchAssertions.java:100)
07:53:55         at org.elasticsearch.action.admin.indices.create.ShrinkIndexIT.testShrinkThenSplitWithFailedNode(ShrinkIndexIT.java:589)
07:53:55   1> [2019-07-22T23:53:46,590][INFO ][o.e.a.a.i.c.ShrinkIndexIT] [testCreateShrinkIndexToN] before test
07:53:55   1> [2019-07-22T23:53:46,590][INFO ][o.e.a.a.i.c.ShrinkIndexIT] [testCreateShrinkIndexToN] [ShrinkIndexIT#testCreateShrinkIndexToN]: setting up test
07:53:55   1> [2019-07-22T23:53:46,590][INFO ][o.e.t.InternalTestCluster] [testCreateShrinkIndexToN] adding voting config exclusions [node_s2] prior to restart/shutdown
07:53:55   1> [2019-07-22T23:53:46,620][INFO ][o.e.n.Node               ] [testCreateShrinkIndexToN] stopping ...
07:53:55   1> [2019-07-22T23:53:46,621][INFO ][o.e.c.c.Coordinator      ] [node_s2] master node [{node_s0}{0Vn4EYvgS7aBtp3STN22mQ}{0vkYnVenRCm120uqiHNWPg}{127.0.0.1}{127.0.0.1:42453}{dim}] failed, restarting discovery
07:53:55   1> org.elasticsearch.transport.NodeDisconnectedException: [node_s0][127.0.0.1:42453][disconnected] disconnected
07:53:55   1> [2019-07-22T23:53:46,623][INFO ][o.e.c.s.MasterService    ] [node_s0] node-left[{node_s2}{27VqX-qiQaqfZcYaMp-jUQ}{ELWQ7z4oRhazcYurKCwzxg}{127.0.0.1}{127.0.0.1:40475}{dim} disconnected], term: 1, version: 250, reason: removed {{node_s2}{27VqX-qiQaqfZcYaMp-jUQ}{ELWQ7z4oRhazcYurKCwzxg}{127.0.0.1}{127.0.0.1:40475}{dim},}
07:53:55   1> [2019-07-22T23:53:46,623][INFO ][o.e.n.Node               ] [testCreateShrinkIndexToN] stopped
07:53:55   1> [2019-07-22T23:53:46,623][INFO ][o.e.n.Node               ] [testCreateShrinkIndexToN] closing ...
07:53:55   1> [2019-07-22T23:53:46,625][INFO ][o.e.n.Node               ] [testCreateShrinkIndexToN] closed

Reproduction with:

./gradlew :server:integTest --tests "org.elasticsearch.action.admin.indices.create.ShrinkIndexIT.testShrinkThenSplitWithFailedNode" \
  -Dtests.seed=7F9B0926CAB77C2 \
  -Dtests.security.manager=true \
  -Dtests.locale=nus \
  -Dtests.timezone=America/Indiana/Knox \
  -Dcompiler.java=12 \
  -Druntime.java=12

This does not reproduce locally. Pinging @original-brownbear because of #44214

@jkakavas jkakavas added >test-failure Triaged test failures from CI :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Jul 23, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

tlrx added a commit that referenced this issue Jul 26, 2019
…edNode (#44860)

The test ShrinkIndexIT.testShrinkThenSplitWithFailedNode sometimes fails 
because the resize operation is not acknowledged (see #44736). This resize 
operation creates a new index "splitagain" and it results in a cluster state 
update (TransportResizeAction uses MetaDataCreateIndexService.createIndex() 
to create the resized index). This cluster state update is expected to be 
acknowledged by all nodes (see IndexCreationTask.onAllNodesAcked()) but 
this is not always true: the data node that was just stopped in the test before 
executing the resize operation might still be considered a "faulty" node
(and not yet removed from the cluster nodes) by the FollowersChecker. The
cluster state is then acked by all nodes but one, which results in an
unacknowledged resize operation.

This commit adds an ensureStableCluster() check after stopping the node in 
the test. The goal is to ensure that the data node has been correctly removed 
from the cluster and that all nodes are fully connected to each other before
moving forward with the resize operation.

Closes #44736
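
The race described in the commit message can be illustrated with a small toy model. This is only a sketch and not Elasticsearch code: the class `AckSketch` and its methods are hypothetical, the node name `node_s1` is assumed (only `node_s0` and `node_s2` appear in the log above), and `awaitStableCluster()` merely stands in for the effect of waiting until the master has removed the stopped node. It assumes an update counts as acknowledged only if every node still listed in the cluster state responds.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy model of the ack race; all names are hypothetical, not Elasticsearch APIs. */
public class AckSketch {

    /** Nodes the master still lists in the cluster state. */
    private final Set<String> clusterNodes = new LinkedHashSet<>();

    /** Nodes whose process is running and able to respond. */
    private final Set<String> liveNodes = new LinkedHashSet<>();

    public AckSketch(String... nodes) {
        for (String node : nodes) {
            clusterNodes.add(node);
            liveNodes.add(node);
        }
    }

    /** Stops the node's process; the master has not removed it yet. */
    public void stopNode(String node) {
        liveNodes.remove(node);
    }

    /** Stand-in for waiting until the master has dropped unreachable nodes. */
    public void awaitStableCluster() {
        clusterNodes.retainAll(liveNodes);
    }

    /** Acked only if every node still in the cluster state can respond. */
    public boolean publishIsAcked() {
        return liveNodes.containsAll(clusterNodes);
    }

    public static void main(String[] args) {
        // Racy order: resize right after stopping the node.
        AckSketch racy = new AckSketch("node_s0", "node_s1", "node_s2");
        racy.stopNode("node_s2");
        System.out.println("without wait: acked=" + racy.publishIsAcked());

        // Fixed order: wait for the stopped node to leave the cluster first.
        AckSketch fixed = new AckSketch("node_s0", "node_s1", "node_s2");
        fixed.stopNode("node_s2");
        fixed.awaitStableCluster();
        System.out.println("with wait: acked=" + fixed.publishIsAcked());
    }
}
```

In this model the first case reports `acked=false` and the second `acked=true`, which mirrors why the test only needs an extra wait between stopping the node and issuing the resize request.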
jkakavas pushed a commit that referenced this issue Jul 31, 2019
…edNode (#44860)
