[CI] testLeaderDisconnectionWithoutDisconnectEventDetectedQuickly failure #53271

hendrikmuhs · 2020-03-09T09:51:22Z

log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu-18.04&&immutable/622/console
gradle: https://gradle-enterprise.elastic.co/s/olunhuu5qmlyk

reproduces locally

failure:

 2> REPRODUCE WITH: ./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testLeaderDisconnectionWithoutDisconnectEventDetectedQuickly" -Dtests.seed=30C7DB320E81F553 -Dtests.security.manager=true -Dtests.locale=es-VE -Dtests.timezone=Australia/Currie -Dcompiler.java=13
  2> java.lang.AssertionError: node0 is a follower of node4
    Expected: is <FOLLOWER>
         but: was <CANDIDATE>
        at __randomizedtesting.SeedInfo.seed([30C7DB320E81F553:2208AF1950E0880]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.stabilise(AbstractCoordinatorTestCase.java:530)
        at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.stabilise(AbstractCoordinatorTestCase.java:490)
        at org.elasticsearch.cluster.coordination.CoordinatorTests.testLeaderDisconnectionWithoutDisconnectEventDetectedQuickly(CoordinatorTests.java:409)
  2> NOTE: leaving temporary files on disk at: /home/hendrik/work/git-elastic-prod/elasticsearch/server/build/testrun/test/temp/org.elasticsearch.cluster.coordination.CoordinatorTests_30C7DB320E81F553-003
  2> NOTE: test params are: codec=Lucene84, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@407d2d52), locale=es-VE, timezone=Australia/Currie
  2> NOTE: Linux 5.3.0-40-generic amd64/AdoptOpenJDK 13.0.2 (64-bit)/cpus=16,threads=1,free=464710688,total=536870912
  2> NOTE: All tests run in this JVM: [CoordinatorTests]

repro:

./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testLeaderDisconnectionWithoutDisconnectEventDetectedQuickly" \
  -Dtests.seed=30C7DB320E81F553 \
  -Dtests.security.manager=true \
  -Dtests.locale=es-VE \
  -Dtests.timezone=Australia/Currie \
  -Dcompiler.java=13

elasticmachine · 2020-03-09T09:51:24Z

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

DaveCTurner · 2020-03-09T18:33:36Z

Ohh this is tricky.

The test fails in the first stabilise() call, before any disruptions take place. node0 is rejecting publications from the leader (node4) because node0 is in term 2 but node4 is in term 1.

The trouble is that node4 tries to trigger an election in term 2 and broadcasts a start-join request, but fails to process it locally due to a (simulated) IO exception, so it remains in term 1. Meanwhile node0 receives and processes the start-join request successfully, votes for node4 and enters term 2. Then node4 becomes leader in term 1 anyway.

Publication#onMissingJoin deals with the case where a publication was accepted by a node that voted for a different master in this term, by triggering a term bump. Here, however, node0 is forced to reject the publication because the publication's term (1) is stale relative to node0's own term (2), so we do not trigger the term bump.
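
The sequence described above can be replayed in a toy model. This is only a sketch for illustration: StaleTermSketch, ToyNode, handleStartJoin, acceptsPublication and replayScenario are hypothetical names, not Elasticsearch classes or methods, and the rule that a publication is accepted only when its term equals the receiver's term is a simplifying assumption of the sketch.

```java
// Toy model only: every name here is hypothetical and none of it is Elasticsearch code.
public class StaleTermSketch {

    static final class ToyNode {
        final String name;
        long currentTerm;

        ToyNode(String name, long currentTerm) {
            this.name = name;
            this.currentTerm = currentTerm;
        }

        // Processing a start-join for a newer term moves the node into that term.
        // failLocally stands in for the simulated IO exception from the test.
        boolean handleStartJoin(long term, boolean failLocally) {
            if (failLocally || term <= currentTerm) {
                return false;
            }
            currentTerm = term;
            return true;
        }

        // Assumption of this sketch: a publication is accepted only if its term
        // matches the receiving node's current term.
        boolean acceptsPublication(long publicationTerm) {
            return publicationTerm == currentTerm;
        }
    }

    // Replays the failure: node4 broadcasts a start-join for term 2 but fails to
    // process it locally, while node0 processes it and moves to term 2.
    static boolean replayScenario() {
        ToyNode node4 = new ToyNode("node4", 1);
        ToyNode node0 = new ToyNode("node0", 1);

        node4.handleStartJoin(2, true);   // simulated IO exception, stays in term 1
        node0.handleStartJoin(2, false);  // succeeds, now in term 2

        // node4 becomes leader in term 1 and publishes with that term.
        return node0.acceptsPublication(node4.currentTerm);
    }

    public static void main(String[] args) {
        System.out.println("node0 accepts node4's publication: " + replayScenario());
    }
}
```

In this model node0 keeps rejecting every term-1 publication, and nothing ever tells node4 that its term is behind, which matches the stuck CANDIDATE state in the assertion failure.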

In rare circumstances it is possible for an isolated node to have a greater term than the currently-elected leader. Today such a node will attempt to join the cluster but will not offer a vote to the leader and will reject its cluster state publications due to their stale term. This situation persists since there is no mechanism for the joining node to inform the leader that its term is stale and a new election is required. This commit adds the current term of the joining node to the join request. Once the join has been validated, the leader will perform another election to increase its term far enough to allow the isolated node to join properly. Fixes elastic#53271
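
The idea in this commit message can be sketched as follows. This is a hedged illustration, not the code from the actual change: JoinTermBumpSketch, ToyJoinRequest and termAfterValidatedJoin are hypothetical names, and choosing "joiner's term plus one" as the new leader term is an assumption standing in for "far enough to allow the isolated node to join".

```java
// Sketch under stated assumptions: hypothetical names, not the real JoinRequest or Coordinator.
public class JoinTermBumpSketch {

    // A join request that also carries the joining node's current term.
    static final class ToyJoinRequest {
        final String sourceNode;
        final long joinerTerm;

        ToyJoinRequest(String sourceNode, long joinerTerm) {
            this.sourceNode = sourceNode;
            this.joinerTerm = joinerTerm;
        }
    }

    // Returns the term the leader should be in after handling a validated join.
    // If the joiner is ahead, the leader must hold a new election in a term
    // greater than the joiner's (assumed here to be joinerTerm + 1).
    static long termAfterValidatedJoin(long leaderTerm, ToyJoinRequest request) {
        if (request.joinerTerm > leaderTerm) {
            return request.joinerTerm + 1;
        }
        return leaderTerm; // joiner is not ahead, no term bump needed
    }

    public static void main(String[] args) {
        // The scenario from this issue: leader node4 in term 1, joiner node0 in term 2.
        long newTerm = termAfterValidatedJoin(1, new ToyJoinRequest("node0", 2));
        System.out.println("leader term after join from node0: " + newTerm); // prints 3
    }
}
```

Once the leader has moved past the joiner's term, its publications are no longer stale from the joiner's point of view, so the joiner can vote and follow as usual.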

hendrikmuhs added >test-failure Triaged test failures from CI :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Mar 9, 2020

DaveCTurner self-assigned this Mar 9, 2020

DaveCTurner mentioned this issue Mar 10, 2020

Allow joining node to trigger term bump #53338

Merged

DaveCTurner closed this as completed in #53338 Mar 11, 2020

codebrain mentioned this issue Apr 1, 2020

7.7.0 meta ticket (Part 3) elastic/elasticsearch-net#4534

Closed
