CI: test failure PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode #37345

Closed
alpar-t opened this issue Jan 11, 2019 · 10 comments · Fixed by #37355 or #39168
Labels: :Distributed/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes) · >test-failure (Triaged test failures from CI) · v7.2.0

Comments

@alpar-t
Contributor

alpar-t commented Jan 11, 2019

Example build failure

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1198/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu&&virtual/174/console

Reproduction line

Does not reproduce locally.

./gradlew :server:integTest -Dtests.seed=ED29DBC5949B19E9 -Dtests.class=org.elasticsearch.cluster.routing.PrimaryAllocationIT -Dtests.method="testForceStaleReplicaToBePromotedToPrimaryOnWrongNode" -Dtests.security.manager=true -Dtests.locale=cs -Dtests.timezone=Europe/Kaliningrad -Dcompiler.java=11 -Druntime.java=8

Example relevant log:

10:28:30 ERROR   4.30s J3 | PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode <<< FAILURES!
10:28:30    > Throwable #1: java.lang.NullPointerException
10:28:30    > 	at __randomizedtesting.SeedInfo.seed([ED29DBC5949B19E9:D46FFB23117F3451]:0)
10:28:30    > 	at org.elasticsearch.cluster.routing.PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode(PrimaryAllocationIT.java:282)
10:28:30    > 	at java.lang.Thread.run(Thread.java:748)

Frequency

Up to 8-9 failures a day.

Possibly related to: #35497

@alpar-t alpar-t added >test-failure Triaged test failures from CI v7.0.0 :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Jan 11, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@alpar-t alpar-t added v6.7.0 and removed v6.7.0 labels Jan 11, 2019
@alpar-t
Contributor Author

alpar-t commented Jan 11, 2019

@original-brownbear could this be related to #37226?

alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jan 11, 2019
@alpar-t
Contributor Author

alpar-t commented Jan 11, 2019

Muted in 3e73911

alpar-t added a commit that referenced this issue Jan 11, 2019
@original-brownbear original-brownbear self-assigned this Jan 11, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 11, 2019
* Ensure stable cluster after starting new datanodes before querying for shard store status
* Don't filter shard stores status request by status
* Closes elastic#37345
@original-brownbear
Member

@atorok yeah, that definitely seems related. I should have a fix in a few minutes :)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 11, 2019
* Forcing a stale primary allocation on a green index was tripping the assertion that was removed
   * Added a test that this case still errors out correctly
* Made the ability to wipe stopped datanode's data public on the internal test cluster and used it to ensure correct behaviour on the fixed test
   * Previously the test only passed because it finished before the index went green; when the index was already green at the time of the shard store status request, the response came up empty and the test hit an NPE
* Closes elastic#37345
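The failure mode described in these commit notes (a status lookup that comes back empty, followed by an NPE in the test) can be illustrated with a minimal plain-Java sketch. Everything here is hypothetical: `ShardStoreLookupSketch`, `fetchStoreStatuses`, and the map-shaped response are stand-ins rather than Elasticsearch APIs, and the assumption that the test dereferenced a lookup result directly is only a guess based on the stack trace.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative only: a plain-Java stand-in for a per-index "shard store status" response.
// None of these names are Elasticsearch APIs.
final class ShardStoreLookupSketch {

    /** Hypothetical response: index name -> store entries. A green index yields no entry. */
    static Map<String, List<String>> fetchStoreStatuses(boolean indexIsGreen) {
        Map<String, List<String>> response = new HashMap<>();
        if (!indexIsGreen) {
            response.put("test", List.of("node_t1:stale-copy"));
        }
        return response;
    }

    /** Fragile: dereferences the lookup result directly, so an empty response throws an NPE. */
    static int countStoresUnsafe(Map<String, List<String>> response, String index) {
        return response.get(index).size();
    }

    /** Safer: makes the "no entry" case explicit so the caller can fail with a clear message. */
    static Optional<List<String>> findStores(Map<String, List<String>> response, String index) {
        return Optional.ofNullable(response.get(index));
    }

    public static void main(String[] args) {
        Map<String, List<String>> notGreen = fetchStoreStatuses(false);
        Map<String, List<String>> green = fetchStoreStatuses(true);

        System.out.println(countStoresUnsafe(notGreen, "test"));
        System.out.println(findStores(green, "test").isPresent());
        try {
            countStoresUnsafe(green, "test");
        } catch (NullPointerException e) {
            System.out.println("NPE on empty response");
        }
    }
}
```

Whether the lookup succeeds depends purely on timing (has the index gone green yet?), which is consistent with a failure that shows up in CI but does not reproduce locally.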
original-brownbear added a commit that referenced this issue Jan 11, 2019
* Fix PrimaryAllocationIT Race Condition

* Forcing a stale primary allocation on a green index was tripping the assertion that was removed
   * Added a test that this case still errors out correctly
* Made the ability to wipe stopped datanode's data public on the internal test cluster and used it to ensure correct behaviour on the fixed test
   * Previously the test only passed because it finished before the index went green; when the index was already green at the time of the shard store status request, the response came up empty and the test hit an NPE
* Closes #37345
original-brownbear added a commit that referenced this issue Jan 29, 2019
@matriv
Contributor

matriv commented Jan 29, 2019

New failures: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=virtual&&linux/210/console

Reproduce:

./gradlew :server:integTest \
  -Dtests.seed=E77709B66B5E760A \
  -Dtests.class=org.elasticsearch.cluster.routing.PrimaryAllocationIT \
  -Dtests.method="testForceStaleReplicaToBePromotedToPrimaryOnWrongNode" \
  -Dtests.security.manager=true \
  -Dtests.locale=es-CL \
  -Dtests.timezone=Europe/Brussels \
  -Dcompiler.java=11 \
  -Druntime.java=11

Couldn't reproduce locally.

ERROR   1128s J4 | PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode <<< FAILURES!
  2> 	at java.base@11.0.2/java.lang.Thread.run(Thread.java:834)
  2> "elasticsearch[node_td4][[timer]]" ID=3401 TIMED_WAITING

@matriv matriv reopened this Jan 29, 2019
@matriv
Contributor

matriv commented Jan 29, 2019

And another one: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/210/console

ERROR   16.3s J4 | PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode <<< FAILURES!
   > Throwable #1: java.lang.NullPointerException
   > 	at __randomizedtesting.SeedInfo.seed([EF58BD1E4C92C848:D61E9DF8C976E5F0]:0)
   > 	at org.elasticsearch.cluster.routing.PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode(PrimaryAllocationIT.java:313)
   > 	at java.lang.Thread.run(Thread.java:748)
  1> [2019-01-29T11:27:02,923][INFO ][o.e.c.r.PrimaryAllocationIT] [testForceStaleReplicaToBePromotedToPrimary] before test
  1> [2019-01-29T11:27:02,923][INFO ][o.e.c.r.PrimaryAllocationIT] [testForceStaleReplicaToBePromotedToPrimary] [PrimaryAllocationIT#testForceStaleReplicaToBePromotedToPrimary]: setting up test
  1> [2019-01-29T11:27:02,924][INFO ][o.e.t.InternalTestCluster] [testForceStaleReplicaToBePromotedToPrimary] Setup InternalTestCluster [TEST-CHILD_VM=[4]-CLUSTER_SEED=[-4661995235852411665]-HASH=[1419D940F93]-cluster] with seed [BF4D440886741CEF] using [0] dedicated masters, [0] (data) nodes and [0] coord only nodes (min_master_nodes are [auto-managed])
  1> [2019-01-29T11:27:02,924][INFO ][o.e.c.r.PrimaryAllocationIT] [testForceStaleReplicaToBePromotedToPrimary] [PrimaryAllocationIT#testForceStaleReplicaToBePromotedToPrimary]: all set up test
  1> [2019-01-29T11:27:02,924][INFO ][o.e.c.r.PrimaryAllocationIT] [testForceStaleReplicaToBePromotedToPrimary] --> starting 3 nodes, 1 master, 2 data
  1> [2019-01-29T11:27:02,928][INFO ][o.e.e.NodeEnvironment    ] [testForceStaleReplicaToBePromotedToPrimary] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [465.7gb], net total_space [503.9gb], types [ext4]
  1> [2019-01-29T11:27:02,928][INFO ][o.e.e.NodeEnvironment    ] [testForceStaleReplicaToBePromotedToPrimary] heap size [491mb], compressed ordinary object pointers [true]
  1> [2019-01-29T11:27:02,929][INFO ][o.e.n.Node               ] [testForceStaleReplicaToBePromotedToPrimary] node name [node_tm0], node ID [uGBxeepJTWy3MSBE9paOtQ]
  1> [2019-01-29T11:27:02,929][INFO ][o.e.n.Node               ] [testForceStaleReplicaToBePromotedToPrimary] version[6.7.0-SNAPSHOT], pid[14028], build[unknown/unknown/Unknown/Unknown], OS[Linux/4.4.0-141-generic/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_202/25.202-b08]

@original-brownbear
Member

@matriv I'm on it, sorry for the noise.

@matriv
Contributor

matriv commented Jan 29, 2019

@original-brownbear No worries :-)

@original-brownbear
Member

The NPE when calling org.elasticsearch.test.InternalTestCluster#getNodeNames is quite interesting. The only way I can see this happening is via some concurrent modification of org.elasticsearch.test.InternalTestCluster#nodes that happens in 6.x but not in master.
I'm going through the differences between the two now to see where this could be coming from.
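As a rough illustration of the hazard suspected here, the sketch below guards a non-thread-safe map with `synchronized` accessors and hands out snapshot copies of the key set. `NodeRegistrySketch` and its methods are hypothetical and are not the actual `InternalTestCluster` code; this is one common way to rule out concurrent modification, not necessarily what the eventual fix does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: NodeRegistrySketch is a hypothetical stand-in, not InternalTestCluster.
final class NodeRegistrySketch {
    // TreeMap is not thread-safe, so every access goes through a synchronized method.
    private final TreeMap<String, Integer> nodes = new TreeMap<>();

    synchronized void addNode(String name, int port) {
        nodes.put(name, port);
    }

    synchronized void removeNode(String name) {
        nodes.remove(name);
    }

    /** Returns a snapshot copy so callers never iterate the live map. */
    synchronized List<String> getNodeNames() {
        return new ArrayList<>(nodes.keySet());
    }

    public static void main(String[] args) throws InterruptedException {
        NodeRegistrySketch registry = new NodeRegistrySketch();

        // One thread keeps adding and removing nodes while the main thread reads names.
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                registry.addNode("node_t" + (i % 8), 9300 + (i % 8));
                if (i % 3 == 0) {
                    registry.removeNode("node_t" + (i % 8));
                }
            }
            registry.addNode("node_tm0", 9300);
        });
        writer.start();

        for (int i = 0; i < 10_000; i++) {
            for (String name : registry.getNodeNames()) {
                if (name == null) {
                    throw new IllegalStateException("unexpected null node name");
                }
            }
        }
        writer.join();
        System.out.println(registry.getNodeNames().contains("node_tm0"));
    }
}
```

Without the `synchronized` keywords, a reader walking the `TreeMap` while another thread restructures it can observe a partially updated tree, which is one plausible route to an intermittent exception like the one reported.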

@original-brownbear
Member

This will be resolved by #39168

original-brownbear added a commit that referenced this issue Feb 21, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes #37965
* Closes #37275
* Closes #37345
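For readers unfamiliar with the "make `Predicate`s constants" item, here is a small sketch of the idea: filters that never change are declared once as `static final` fields instead of being rebuilt at each call site. `NodeInfo`, `DATA_NODE`, and `NON_DATA_NODE` are hypothetical names chosen for illustration and do not come from the Elasticsearch test framework.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative only: all names here are hypothetical.
final class PredicateConstantSketch {

    /** Minimal node descriptor used only for this example. */
    static final class NodeInfo {
        final String name;
        final boolean dataNode;

        NodeInfo(String name, boolean dataNode) {
            this.name = name;
            this.dataNode = dataNode;
        }
    }

    // Declared once and reused, rather than re-creating the lambda at each call site.
    static final Predicate<NodeInfo> DATA_NODE = node -> node.dataNode;
    static final Predicate<NodeInfo> NON_DATA_NODE = DATA_NODE.negate();

    /** Returns the names of the nodes matching the given filter, in input order. */
    static List<String> names(List<NodeInfo> nodes, Predicate<NodeInfo> filter) {
        return nodes.stream().filter(filter).map(n -> n.name).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<NodeInfo> nodes = List.of(
            new NodeInfo("node_tm0", false),
            new NodeInfo("node_td1", true),
            new NodeInfo("node_td2", true));

        System.out.println(names(nodes, DATA_NODE));     // [node_td1, node_td2]
        System.out.println(names(nodes, NON_DATA_NODE)); // [node_tm0]
    }
}
```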
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 21, 2019
original-brownbear added a commit that referenced this issue Feb 21, 2019
* Simplify and Fix Synchronization in InternalTestCluster (#39168)
weizijun pushed a commit to weizijun/elasticsearch that referenced this issue Feb 22, 2019
weizijun pushed a commit to weizijun/elasticsearch that referenced this issue Feb 22, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Mar 13, 2019
original-brownbear added a commit that referenced this issue Mar 14, 2019