[CI] testLeaderDisconnectionWithoutDisconnectEventDetectedQuickly failure #53271

Closed
hendrikmuhs opened this issue Mar 9, 2020 · 2 comments · Fixed by #53338
Assignees
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI

Comments

@hendrikmuhs
Contributor

log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu-18.04&&immutable/622/console
gradle: https://gradle-enterprise.elastic.co/s/olunhuu5qmlyk

reproduces locally

failure:

 2> REPRODUCE WITH: ./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testLeaderDisconnectionWithoutDisconnectEventDetectedQuickly" -Dtests.seed=30C7DB320E81F553 -Dtests.security.manager=true -Dtests.locale=es-VE -Dtests.timezone=Australia/Currie -Dcompiler.java=13
  2> java.lang.AssertionError: node0 is a follower of node4
    Expected: is <FOLLOWER>
         but: was <CANDIDATE>
        at __randomizedtesting.SeedInfo.seed([30C7DB320E81F553:2208AF1950E0880]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.stabilise(AbstractCoordinatorTestCase.java:530)
        at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.stabilise(AbstractCoordinatorTestCase.java:490)
        at org.elasticsearch.cluster.coordination.CoordinatorTests.testLeaderDisconnectionWithoutDisconnectEventDetectedQuickly(CoordinatorTests.java:409)
  2> NOTE: leaving temporary files on disk at: /home/hendrik/work/git-elastic-prod/elasticsearch/server/build/testrun/test/temp/org.elasticsearch.cluster.coordination.CoordinatorTests_30C7DB320E81F553-003
  2> NOTE: test params are: codec=Lucene84, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@407d2d52), locale=es-VE, timezone=Australia/Currie
  2> NOTE: Linux 5.3.0-40-generic amd64/AdoptOpenJDK 13.0.2 (64-bit)/cpus=16,threads=1,free=464710688,total=536870912
  2> NOTE: All tests run in this JVM: [CoordinatorTests]

repro:

./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testLeaderDisconnectionWithoutDisconnectEventDetectedQuickly" \
  -Dtests.seed=30C7DB320E81F553 \
  -Dtests.security.manager=true \
  -Dtests.locale=es-VE \
  -Dtests.timezone=Australia/Currie \
  -Dcompiler.java=13
@hendrikmuhs hendrikmuhs added >test-failure Triaged test failures from CI :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Mar 9, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@DaveCTurner DaveCTurner self-assigned this Mar 9, 2020
@DaveCTurner
Contributor

Ohh this is tricky.

The test fails in the first stabilise() call, before any disruptions have taken place. node0 is rejecting publications from the leader (node4) because node0 is in term 2 but node4 is in term 1.

The trouble is that node4 tries to trigger an election in term 2 and broadcasts a start-join request, but fails to process it locally due to a (simulated) IO exception, so it remains in term 1. Meanwhile node0 receives and processes the start-join request successfully, votes for node4 and enters term 2. Then node4 becomes leader in term 1 anyway.

Publication#onMissingJoin deals with the case where a publication was accepted by a node that voted for a different master in this term, by triggering a term bump. However, here node0 is forced to reject the publication because the publication comes from a stale term, so we do not trigger the term bump.
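
The sequence above can be replayed with a toy model. This is only a sketch and not Elasticsearch code: `ToyNode`, `handleStartJoin`, `acceptsPublication` and the `failsLocalWrites` flag are hypothetical names, and the acceptance rule (reject any publication from an older term) is a simplifying assumption.

```java
/** Toy model of the term mismatch described above; this is not Elasticsearch code. */
final class StaleTermSketch {

    /** Hypothetical node that tracks nothing but its current term. */
    static final class ToyNode {
        final String name;
        final boolean failsLocalWrites; // stands in for the simulated IO exception
        long currentTerm;

        ToyNode(String name, long currentTerm, boolean failsLocalWrites) {
            this.name = name;
            this.currentTerm = currentTerm;
            this.failsLocalWrites = failsLocalWrites;
        }

        /** Adopts the term of a start-join request unless the local write fails. */
        boolean handleStartJoin(long term) {
            if (failsLocalWrites || term <= currentTerm) {
                return false; // term stays unchanged
            }
            currentTerm = term;
            return true;
        }

        /** Assumed rule: a publication from an older term than ours is rejected. */
        boolean acceptsPublication(long publicationTerm) {
            return publicationTerm >= currentTerm;
        }
    }

    public static void main(String[] args) {
        ToyNode node4 = new ToyNode("node4", 1, true);
        ToyNode node0 = new ToyNode("node0", 1, false);

        // node4 broadcasts a start-join for term 2; only node0 manages to process it.
        node4.handleStartJoin(2);
        node0.handleStartJoin(2);

        // node4 still publishes in term 1, which node0 now considers stale.
        System.out.println(node0.name + " accepts publication from " + node4.name + ": "
            + node0.acceptsPublication(node4.currentTerm));
    }
}
```

In this model node4 stays in term 1 while node0 moves to term 2, so nothing node4 publishes can be accepted by node0 until node4's term increases.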

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 10, 2020
In rare circumstances it is possible for an isolated node to have a greater
term than the currently-elected leader. Today such a node will attempt to join
the cluster but will not offer a vote to the leader and will reject its cluster
state publications due to their stale term. This situation persists since there
is no mechanism for the joining node to inform the leader that its term is
stale and a new election is required.

This commit adds the current term of the joining node to the join request. Once
the join has been validated, the leader will perform another election to
increase its term far enough to allow the isolated node to join properly.

Fixes elastic#53271
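
The idea in this commit message can be illustrated with a small toy model. This is a sketch under stated assumptions and not the actual change: `ToyJoinRequest`, `ToyLeader`, `onValidatedJoin` and `publicationAcceptedBy` are hypothetical names, and choosing the joiner's term plus one for the new election is an assumption made only for illustration.

```java
/** Toy model of the fix described in the commit message; not the actual Elasticsearch change. */
final class JoinTermSketch {

    /** Hypothetical join request that also carries the joining node's current term. */
    static final class ToyJoinRequest {
        final String sourceNode;
        final long sourceTerm;

        ToyJoinRequest(String sourceNode, long sourceTerm) {
            this.sourceNode = sourceNode;
            this.sourceTerm = sourceTerm;
        }
    }

    /** Hypothetical leader that tracks its term and how many re-elections it ran. */
    static final class ToyLeader {
        long currentTerm;
        int reElections;

        ToyLeader(long currentTerm) {
            this.currentTerm = currentTerm;
        }

        /** Called once the join has been validated. */
        void onValidatedJoin(ToyJoinRequest request) {
            if (request.sourceTerm > currentTerm) {
                // Assumption: the new election picks a term just above the joiner's.
                currentTerm = request.sourceTerm + 1;
                reElections++;
            }
        }

        /** Assumed rule: a follower rejects publications from a term older than its own. */
        boolean publicationAcceptedBy(long followerTerm) {
            return currentTerm >= followerTerm;
        }
    }

    public static void main(String[] args) {
        ToyLeader leader = new ToyLeader(1);
        ToyJoinRequest join = new ToyJoinRequest("node0", 2);

        System.out.println("accepted before join: " + leader.publicationAcceptedBy(join.sourceTerm));
        leader.onValidatedJoin(join);
        System.out.println("leader term after join: " + leader.currentTerm);
        System.out.println("accepted after join: " + leader.publicationAcceptedBy(join.sourceTerm));
    }
}
```

The point of carrying the term in the join request is that the leader learns its own term is stale from the join itself, rather than waiting for a signal that never arrives.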
DaveCTurner added a commit that referenced this issue Mar 11, 2020
Fixes #53271