Allow joining node to trigger term bump #53338

Merged 2 commits on Mar 11, 2020

DaveCTurner
Contributor

In rare circumstances it is possible for an isolated node to have a greater
term than the currently-elected leader. Today such a node will attempt to join
the cluster but will not offer a vote to the leader and will reject its cluster
state publications due to their stale term. This situation persists since there
is no mechanism for the joining node to inform the leader that its term is
stale and a new election is required.

This commit adds the current term of the joining node to the join request. Once
the join has been validated, the leader will perform another election to
increase its term far enough to allow the isolated node to join properly.

Fixes #53271
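
For readers unfamiliar with the mechanism, the behaviour described above can be sketched roughly as follows. This is a minimal, hypothetical model and not the code in this PR: the class names (`TermBumpSketch`, `Node`, `Leader`, `JoinRequest`), the `minimumTerm` field name, and the choice to bump the leader to exactly one more than the joiner's term are all assumptions made for illustration. A real election involves votes from a quorum, which is omitted here.

```java
// Hypothetical, simplified model of the term bump; these are NOT Elasticsearch classes.
final class TermBumpSketch {

    /** A join request that also carries the joining node's current term. */
    static final class JoinRequest {
        final String sourceNode;
        final long minimumTerm; // assumed field name for the term added to the request

        JoinRequest(String sourceNode, long minimumTerm) {
            this.sourceNode = sourceNode;
            this.minimumTerm = minimumTerm;
        }
    }

    /** A node that may have been isolated and ended up with a higher term. */
    static final class Node {
        final String name;
        long currentTerm;

        Node(String name, long currentTerm) {
            this.name = name;
            this.currentTerm = currentTerm;
        }

        /** Publications from a stale term are rejected; otherwise the term is adopted. */
        boolean acceptPublication(long publicationTerm) {
            if (publicationTerm < currentTerm) {
                return false;
            }
            currentTerm = publicationTerm;
            return true;
        }

        JoinRequest joinRequest() {
            return new JoinRequest(name, currentTerm);
        }
    }

    /** The elected leader. */
    static final class Leader {
        long currentTerm;

        Leader(long currentTerm) {
            this.currentTerm = currentTerm;
        }

        /**
         * Returns true if the join is accepted. Returns false if the request revealed
         * a higher term: the leader then moves to a newer term (a stand-in for running
         * another election) and the joining node has to retry.
         */
        boolean handleJoin(JoinRequest request) {
            if (request.minimumTerm > currentTerm) {
                currentTerm = request.minimumTerm + 1; // assumption: any greater term would do
                return false;
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Leader leader = new Leader(5);
        Node isolated = new Node("isolated", 7);

        // Without the term in the join request, the node keeps rejecting publications.
        System.out.println("accepts stale publication: " + isolated.acceptPublication(leader.currentTerm));

        // The first join triggers the term bump and fails; the retry succeeds.
        System.out.println("first join accepted: " + leader.handleJoin(isolated.joinRequest()));
        System.out.println("retry accepted: " + leader.handleJoin(isolated.joinRequest()));
        System.out.println("accepts new publication: " + isolated.acceptPublication(leader.currentTerm));
    }
}
```

In this sketch the first join attempt is sacrificed to carry the term to the leader, so the joining node only gets in on a retry.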

@DaveCTurner DaveCTurner added >bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.7.0 labels Mar 10, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@DaveCTurner
Contributor Author

Note to reviewers: I contemplated a few alternative mechanisms for telling the master that it needs to bump its term in these circumstances. I chose this one because it's a small and reasonably clear change, but it has the disadvantage that the join attempt that triggers the term bump will fail and require a retry. IMO this is rare enough that that's OK. I also considered:

  • a more specialised exception from the publication indicating that the term bump is needed.
  • piggybacking the term on the fault detection mechanism instead of the join.
  • a whole new transport message specifically for this situation, sent prior to/instead of the join.

Liveness is hard.

Contributor

@ywelsch ywelsch left a comment

LGTM

@DaveCTurner DaveCTurner merged commit 9dcd88e into elastic:master Mar 11, 2020
@DaveCTurner DaveCTurner deleted the 2020-03-10-bump-term-at-join branch March 11, 2020 09:03
DaveCTurner added a commit that referenced this pull request Mar 11, 2020
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 11, 2020
Labels
>bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v7.7.0 v8.0.0-alpha1
Successfully merging this pull request may close these issues.

[CI] testLeaderDisconnectionWithoutDisconnectEventDetectedQuickly failure
4 participants