Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve reliability of elections with message delays #98354

Conversation

DaveCTurner
Copy link
Contributor

@DaveCTurner DaveCTurner commented Aug 10, 2023

Today we close the election scheduler when the coordinator leaves mode
CANDIDATE, before even starting the publication that establishes the
election winner as the cluster master. If this publication subsequently
fails then we start a new election scheduler with the original, short,
timeout, and do not back off. With very high numbers of master-eligible
nodes this can lead to constant election clashes that never resolve. We
must count such failed publications as failed election attempts for
election scheduling and backoff purposes.

This commit keeps the election scheduler open until a published state is
applied, which means we continue to back off until a publication has
completed.

Closes #97909

Today we close the election scheduler when the coordinator leaves mode
`CANDIDATE`, before even starting the publication that establishes the
election winner as the cluster master. If this publication subsequently
fails then we start a new election scheduler with the original, short,
timeout, and do not back off. With very high numbers of master-eligible
nodes this can lead to constant election clashes that never resolve. We
must count such failed publications as failed election attempts for
election scheduling and backoff purposes.

This commit keeps the election scheduler open until a published state is
applied, which means we continue to back off until a publication has
completed.
@DaveCTurner DaveCTurner added >bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.10.0 labels Aug 10, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Aug 10, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner
Copy link
Contributor Author

DaveCTurner commented Aug 10, 2023

Note that this situation is pretty delicate to achieve, because we suppress election attempts on our peers with follower checks as soon as we become LEADER. That's why the test delays both publications and follower checks. It was very hard to reproduce this with 3 nodes, and even with the 7 nodes used in the test here it would eventually stabilise with a long enough timeout.

Also note that it uses a somewhat simpler strategy from the one described in #97909. I found it so hard to reproduce this that I'm fairly confident just waiting until the end of the first publication will be enough in practice.

Closes ES-6502

@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've updated the changelog YAML for you.

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

A single question though for my education: could we not have closed the election scheduler as soon as we know the publication is committed rather than wait until after state has been applied? What is the benefit of waiting?

Copy link
Contributor

@fcofdez fcofdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DaveCTurner
Copy link
Contributor Author

A single question though for my education: could we not have closed the election scheduler as soon as we know the publication is committed rather than wait until after state has been applied? What is the benefit of waiting?

It's a good question. This never becomes truly (i.e. 100% provably) reliable, but the general principle is that longer we wait the more reliable it becomes. We just have to be sure to reset the scheduler when the cluster is "stable", whatever that might mean. I suspect you're right that waiting for commit would be about as strong as waiting for apply.

Formally speaking, the liveness argument is based on the protocol being eventually quiescent, which we achieve by extending the timeout whenever an election fails before reaching a quiescent state. We could probably come up with a more refined liveness argument if we tried (e.g. being more discriminating about which messages are or aren't part of the election process) but I don't see the need.

In fact even the implementation given here isn't quite enough to satisfy eventual quiescence, for at least a couple of reasons:

  • the followers reset their schedulers once they have locally applied the state, but there are still other messages in flight at this point so we have not yet reached the quiescent state.
  • if we are missing a join vote from a node (i.e. it voted for a different master in the current term) then we bump the term and run another election. Note that this happens after the previous state is applied everywhere, which is kind of why I consider the apply-commit messages to be important for quiescence.

Fixing these is possible too, but adds quite some complexity, and given how tricky it is to reproduce the problem fixed here I don't believe it will be necessary in practice.

@DaveCTurner DaveCTurner added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Aug 11, 2023
@DaveCTurner
Copy link
Contributor Author

NB subtle change in 28b7d58: if we have discovered a quorum and are running elections, and then we lose contact with the nodes we discovered, we have more fundamental problems than election collisions, so we should stop the election scheduler.

@henningandersen
Copy link
Contributor

Thank for the explanation and the link to your excellent blog post, it all makes sense to me.

@DaveCTurner
Copy link
Contributor Author

NB subtle change in 28b7d58: if we have discovered a quorum and are running elections, and then we lose contact with the nodes we discovered, we have more fundamental problems than election collisions, so we should stop the election scheduler.

Bah, unfortunately this breaks things again. When we become LEADER we clear out the PeerFinder, so if the election fails and we go back to CANDIDATE then we start discovery from scratch again which (with this change) resets the election scheduler. Yet I think we do need something like this change to avoid keeping the election scheduler running if discovery is broken for an extended period.

I think it should be possible to restore the previous PeerFinder state in this situation. I'll explore that next week.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 13, 2023
Adds a test for a subtle problem that the initial version of elastic#98354
would have introduced: we must disable the election scheduler while we
cannot even discover a quorum of peers, so that when discovery starts
working again the election can happen quickly.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 13, 2023
Adds a test for a subtle problem that the initial version of elastic#98354
would have introduced: we must disable the election scheduler while we
cannot even discover a quorum of peers, so that when discovery starts
working again the election can happen quickly.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 13, 2023
Adds a test showing that when an extended discovery outage eventually
clears up, an election can happen very promptly (and fixes the
calculation of `prevElectionWon` to avoid a situation where that would
not happen today). This demonstrates a subtle problem that the initial
version of elastic#98354 would have introduced.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 13, 2023
Adds a test showing that when an extended discovery outage eventually
clears up, an election can happen very promptly (and fixes the
calculation of `prevElectionWon` to avoid a situation where that would
not happen today). This demonstrates a subtle problem that the initial
version of elastic#98354 would have introduced.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 13, 2023
Adds a test showing that when an extended discovery outage eventually
clears up, an election can happen very promptly (and fixes the
calculation of `prevElectionWon` to avoid a situation where that would
not happen today). This demonstrates a subtle problem that the initial
version of elastic#98354 would have introduced.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 13, 2023
Adds a test showing that when an extended discovery outage eventually
clears up, an election can happen very promptly (and fixes the
calculation of `prevElectionWon` to avoid a situation where that would
not happen today). This demonstrates a subtle problem that the initial
version of elastic#98354 would have introduced.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 13, 2023
Adds a test showing that when an extended discovery outage eventually
clears up, an election can happen very promptly (and fixes the
calculation of `prevElectionWon` to avoid a situation where that would
not happen today). This demonstrates a subtle problem that the initial
version of elastic#98354 would have introduced.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 13, 2023
Adds a test showing that when an extended discovery outage eventually
clears up, an election can happen very promptly (and fixes the
calculation of `prevElectionWon` to avoid a situation where that would
not happen today). This demonstrates a subtle problem that the initial
version of elastic#98354 would have introduced.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 14, 2023
Today the `PeerFinder` releases all its peers on deactivation, but to
complete elastic#98354 we will need to delay the release until the cluster has
reached a more stable state, which happens some time later. This commit
separates the deactivate and release steps within the `PeerFinder` in
preparation for that change, although it still does both steps at once
in all production callers.
elasticsearchmachine pushed a commit that referenced this pull request Aug 14, 2023
Today the `PeerFinder` releases all its peers on deactivation, but to
complete #98354 we will need to delay the release until the cluster has
reached a more stable state, which happens some time later. This commit
separates the deactivate and release steps within the `PeerFinder` in
preparation for that change, although it still does both steps at once
in all production callers.
DaveCTurner added a commit that referenced this pull request Aug 14, 2023
Adds a test showing that when an extended discovery outage eventually
clears up, an election can happen very promptly (and fixes the
calculation of `prevElectionWon` to avoid a situation where that would
not happen today). This demonstrates a subtle problem that the initial
version of #98354 would have introduced.
@elasticsearchmachine elasticsearchmachine merged commit ef50770 into elastic:main Aug 14, 2023
12 checks passed
@DaveCTurner DaveCTurner deleted the 2023/08/10/elections-with-message-delays branch August 14, 2023 14:43
@DaveCTurner
Copy link
Contributor Author

Oh, I forgot to remove the auto-merge label, I was going to ask for another final review. Nothing major has changed tho, apart from things reviewed elsewhere:

I'll leave it merged unless I hear any objections.

@henningandersen
Copy link
Contributor

Still LGTM.

csoulios pushed a commit to csoulios/elasticsearch that referenced this pull request Aug 18, 2023
Today the `PeerFinder` releases all its peers on deactivation, but to
complete elastic#98354 we will need to delay the release until the cluster has
reached a more stable state, which happens some time later. This commit
separates the deactivate and release steps within the `PeerFinder` in
preparation for that change, although it still does both steps at once
in all production callers.
csoulios pushed a commit to csoulios/elasticsearch that referenced this pull request Aug 18, 2023
Adds a test showing that when an extended discovery outage eventually
clears up, an election can happen very promptly (and fixes the
calculation of `prevElectionWon` to avoid a situation where that would
not happen today). This demonstrates a subtle problem that the initial
version of elastic#98354 would have introduced.
csoulios pushed a commit to csoulios/elasticsearch that referenced this pull request Aug 18, 2023
Today we close the election scheduler when the coordinator leaves mode
`CANDIDATE`, before even starting the publication that establishes the
election winner as the cluster master. If this publication subsequently
fails then we start a new election scheduler with the original, short,
timeout, and do not back off. With very high numbers of master-eligible
nodes this can lead to constant election clashes that never resolve. We
must count such failed publications as failed election attempts for
election scheduling and backoff purposes.

This commit keeps the election scheduler open until a published state is
applied, which means we continue to back off until a publication has
completed.

Closes elastic#97909
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Meta label for distributed team v8.10.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Election scheduler should not reset before publication complete
4 participants