CI: test failure PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode #37345

alpar-t · 2019-01-11T08:41:57Z

Example build failure

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1198/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu&&virtual/174/console

Reproduction line

does not reproduce locally

./gradlew :server:integTest -Dtests.seed=ED29DBC5949B19E9 -Dtests.class=org.elasticsearch.cluster.routing.PrimaryAllocationIT -Dtests.method="testForceStaleReplicaToBePromotedToPrimaryOnWrongNode" -Dtests.security.manager=true -Dtests.locale=cs -Dtests.timezone=Europe/Kaliningrad -Dcompiler.java=11 -Druntime.java=8

Example relevant log:

10:28:30 ERROR   4.30s J3 | PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode <<< FAILURES!
10:28:30    > Throwable #1: java.lang.NullPointerException
10:28:30    > 	at __randomizedtesting.SeedInfo.seed([ED29DBC5949B19E9:D46FFB23117F3451]:0)
10:28:30    > 	at org.elasticsearch.cluster.routing.PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode(PrimaryAllocationIT.java:282)
10:28:30    > 	at java.lang.Thread.run(Thread.java:748)

Frequency

up to 8-9 a day.

Possibly related to: #35497

elasticmachine · 2019-01-11T08:41:58Z

Pinging @elastic/es-distributed

alpar-t · 2019-01-11T08:46:51Z

@original-brownbear could it be related to #37226 ?

alpar-t · 2019-01-11T08:47:36Z

Muted in 3e73911

* Ensure stable cluster after starting new datanodes before querying for shard store status
* Don't filter shard stores status request by status
* Closes elastic#37345

original-brownbear · 2019-01-11T10:13:42Z

@atorok yea that def. seems related, I should have a fix in a few minutes :)

Fix PrimaryAllocationIT Race Condition

* Forcing a stale primary allocation on a green index was tripping the assertion that was removed
* Added a test that this case still errors out correctly
* Made the ability to wipe a stopped data node's data public on the internal test cluster and used it to ensure correct behaviour in the fixed test
* Previously the test passed only because it finished before the index went green; when the index was already green at the time of the shard store status request, the response came back empty and the test hit an NPE
* Closes #37345
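To make the failure mode in the last bullet concrete, here is a minimal sketch in plain Java. It does not use the real Elasticsearch API: `StoreStatusLookup`, `firstNodeUnsafe` and `firstNodeSafe` are hypothetical names, and the `Map` from index name to node names is only an assumed stand-in for a shard store status response that can come back empty.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; none of these names exist in Elasticsearch.
final class StoreStatusLookup {

    // Mirrors the fragile pattern: assumes the response always has an entry
    // for the index. A missing entry makes get(index) return null, so the
    // chained get(0) throws a NullPointerException.
    static String firstNodeUnsafe(Map<String, List<String>> statuses, String index) {
        return statuses.get(index).get(0);
    }

    // Defensive variant: treats a missing or empty entry as "no store found".
    static Optional<String> firstNodeSafe(Map<String, List<String>> statuses, String index) {
        List<String> nodes = statuses.get(index);
        if (nodes == null || nodes.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(nodes.get(0));
    }

    public static void main(String[] args) {
        // An empty response, as assumed for the "index already green" case.
        Map<String, List<String>> empty = new HashMap<>();
        try {
            firstNodeUnsafe(empty, "test");
        } catch (NullPointerException e) {
            System.out.println("unsafe lookup threw NullPointerException");
        }
        System.out.println("safe lookup present: " + firstNodeSafe(empty, "test").isPresent());

        // A response that does contain a store for the index.
        Map<String, List<String>> populated = new HashMap<>();
        populated.put("test", Collections.singletonList("node_td1"));
        System.out.println("safe lookup value: " + firstNodeSafe(populated, "test").get());
    }
}
```

The guard only shows why an empty response turns into an NPE. The fix described in the bullets above takes a different route: it changes the test setup so that the status request no longer comes back empty.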

matriv · 2019-01-29T11:33:49Z

New failures: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=virtual&&linux/210/console

Reproduce:

./gradlew :server:integTest \
  -Dtests.seed=E77709B66B5E760A \
  -Dtests.class=org.elasticsearch.cluster.routing.PrimaryAllocationIT \
  -Dtests.method="testForceStaleReplicaToBePromotedToPrimaryOnWrongNode" \
  -Dtests.security.manager=true \
  -Dtests.locale=es-CL \
  -Dtests.timezone=Europe/Brussels \
  -Dcompiler.java=11 \
  -Druntime.java=11

Couldn't reproduce locally.

ERROR   1128s J4 | PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode <<< FAILURES!
  2> 	at java.base@11.0.2/java.lang.Thread.run(Thread.java:834)
  2> "elasticsearch[node_td4][[timer]]" ID=3401 TIMED_WAITING

matriv · 2019-01-29T12:19:18Z

And another one: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/210/console

ERROR   16.3s J4 | PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode <<< FAILURES!
   > Throwable #1: java.lang.NullPointerException
   > 	at __randomizedtesting.SeedInfo.seed([EF58BD1E4C92C848:D61E9DF8C976E5F0]:0)
   > 	at org.elasticsearch.cluster.routing.PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode(PrimaryAllocationIT.java:313)
   > 	at java.lang.Thread.run(Thread.java:748)
  1> [2019-01-29T11:27:02,923][INFO ][o.e.c.r.PrimaryAllocationIT] [testForceStaleReplicaToBePromotedToPrimary] before test
  1> [2019-01-29T11:27:02,923][INFO ][o.e.c.r.PrimaryAllocationIT] [testForceStaleReplicaToBePromotedToPrimary] [PrimaryAllocationIT#testForceStaleReplicaToBePromotedToPrimary]: setting up test
  1> [2019-01-29T11:27:02,924][INFO ][o.e.t.InternalTestCluster] [testForceStaleReplicaToBePromotedToPrimary] Setup InternalTestCluster [TEST-CHILD_VM=[4]-CLUSTER_SEED=[-4661995235852411665]-HASH=[1419D940F93]-cluster] with seed [BF4D440886741CEF] using [0] dedicated masters, [0] (data) nodes and [0] coord only nodes (min_master_nodes are [auto-managed])
  1> [2019-01-29T11:27:02,924][INFO ][o.e.c.r.PrimaryAllocationIT] [testForceStaleReplicaToBePromotedToPrimary] [PrimaryAllocationIT#testForceStaleReplicaToBePromotedToPrimary]: all set up test
  1> [2019-01-29T11:27:02,924][INFO ][o.e.c.r.PrimaryAllocationIT] [testForceStaleReplicaToBePromotedToPrimary] --> starting 3 nodes, 1 master, 2 data
  1> [2019-01-29T11:27:02,928][INFO ][o.e.e.NodeEnvironment    ] [testForceStaleReplicaToBePromotedToPrimary] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [465.7gb], net total_space [503.9gb], types [ext4]
  1> [2019-01-29T11:27:02,928][INFO ][o.e.e.NodeEnvironment    ] [testForceStaleReplicaToBePromotedToPrimary] heap size [491mb], compressed ordinary object pointers [true]
  1> [2019-01-29T11:27:02,929][INFO ][o.e.n.Node               ] [testForceStaleReplicaToBePromotedToPrimary] node name [node_tm0], node ID [uGBxeepJTWy3MSBE9paOtQ]
  1> [2019-01-29T11:27:02,929][INFO ][o.e.n.Node               ] [testForceStaleReplicaToBePromotedToPrimary] version[6.7.0-SNAPSHOT], pid[14028], build[unknown/unknown/Unknown/Unknown], OS[Linux/4.4.0-141-generic/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_202/25.202-b08]

original-brownbear · 2019-01-29T12:20:59Z

@matriv I'm on it, sorry for the noise.

matriv · 2019-01-29T12:33:28Z

@original-brownbear No worries :-)

original-brownbear · 2019-01-29T13:33:56Z

The NPE when calling org.elasticsearch.test.InternalTestCluster#getNodeNames is quite interesting. The only way I can see this happening is via some concurrent modification of org.elasticsearch.test.InternalTestCluster#nodes that is happening in 6.x but not in master.
I'm going through the differences between the two now to see where this could be coming from.

original-brownbear · 2019-02-20T15:25:28Z

This will be resolved by #39168

Simplify and Fix Synchronization in InternalTestCluster (#39168)

* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes #37965
* Closes #37275
* Closes #37345
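As a rough illustration of two of the bullets above (constant `Predicate`s and `synchronized` public methods guarding a shared node map), here is a small self-contained sketch. It is not the actual `InternalTestCluster` code: `TinyTestCluster`, its methods and the `node_td` name-prefix check are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Hypothetical stand-in for a test cluster; not the real InternalTestCluster.
final class TinyTestCluster {

    // Predicate kept as a constant instead of being rebuilt on every call.
    // The "node_td" prefix for data nodes is an assumption for this sketch.
    private static final Predicate<String> IS_DATA_NODE = name -> name.startsWith("node_td");

    // TreeMap is not thread-safe on its own, so every public method that
    // touches it is synchronized on the cluster instance.
    private final Map<String, String> nodes = new TreeMap<>();

    public synchronized void startNode(String name, String id) {
        nodes.put(name, id);
    }

    public synchronized void stopNode(String name) {
        nodes.remove(name);
    }

    // Returns a copy, so callers never iterate the live map.
    public synchronized List<String> getNodeNames() {
        return new ArrayList<>(nodes.keySet());
    }

    public synchronized List<String> getDataNodeNames() {
        List<String> result = new ArrayList<>();
        for (String name : nodes.keySet()) {
            if (IS_DATA_NODE.test(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        TinyTestCluster cluster = new TinyTestCluster();
        // One thread keeps adding nodes while the main thread reads names.
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                cluster.startNode("node_td" + i, "id" + i);
            }
        });
        writer.start();
        for (int i = 0; i < 1000; i++) {
            cluster.getNodeNames();
        }
        writer.join();
        System.out.println("nodes after concurrent access: " + cluster.getNodeNames().size());
    }
}
```

Without the `synchronized` keyword on the accessors, a reader walking the `TreeMap` while another thread mutates it has no defined behaviour, which matches the kind of concurrent modification suspected earlier in this thread as the source of the NPE in `getNodeNames`.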

alpar-t added the >test-failure, v7.0.0 and :Distributed/Allocation labels Jan 11, 2019

alpar-t added v6.7.0 and removed v6.7.0 labels Jan 11, 2019

alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jan 11, 2019

Mute PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode (8e5e385)

Tracking issue: elastic#37345

alpar-t added a commit that referenced this issue Jan 11, 2019

Mute PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode (3e73911)

Tracking issue: #37345

original-brownbear self-assigned this Jan 11, 2019

original-brownbear mentioned this issue Jan 11, 2019

Fix PrimaryAllocationIT Race Condition #37355

Merged

original-brownbear closed this as completed in #37355 Jan 11, 2019

matriv reopened this Jan 29, 2019

original-brownbear added v6.7.0 and removed v7.0.0 labels Jan 29, 2019

danielmitterdorfer added v7.2.0 and removed v6.7.0 labels Feb 7, 2019

original-brownbear mentioned this issue Feb 20, 2019

Simplify and Fix Synchronization in InternalTestCluster #39168

Merged

original-brownbear closed this as completed in #39168 Feb 21, 2019

original-brownbear mentioned this issue Feb 21, 2019

Simplify and Fix Synchronization in InternalTestCluster (#39168) #39241

Merged

original-brownbear mentioned this issue Mar 13, 2019

Simplify and Fix Synchronization in InternalTestCluster (#39168) #40013

Merged
