Improve reliability of elections with message delays #98354
Conversation
Today we close the election scheduler when the coordinator leaves mode `CANDIDATE`, before even starting the publication that establishes the election winner as the cluster master. If this publication subsequently fails then we start a new election scheduler with the original, short timeout, and do not back off. With very high numbers of master-eligible nodes this can lead to constant election clashes that never resolve. We must count such failed publications as failed election attempts for election scheduling and backoff purposes. This commit keeps the election scheduler open until a published state is applied, which means we continue to back off until a publication has completed.
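The backoff behaviour described here can be sketched roughly as follows. The class, constant names, and default values below are illustrative assumptions for this discussion, not Elasticsearch's actual scheduler:

```java
// Illustrative sketch only: counting failed publications as failed election
// attempts so that the retry delay keeps backing off. Names and defaults are
// assumptions, not the real implementation.
import java.util.Random;

class ElectionBackoff {
    // Assumed defaults, loosely modeled on the cluster.election.* settings.
    static final long INITIAL_TIMEOUT_MS = 100;
    static final long BACK_OFF_MS = 100;
    static final long MAX_TIMEOUT_MS = 10_000;

    private long failedAttempts = 0;
    private final Random random = new Random();

    // The fix described above amounts to also counting a publication that
    // fails after a won election as a failed attempt, so backoff keeps growing.
    void recordFailedAttempt() {
        failedAttempts++;
    }

    // Delay before the next attempt: grows linearly with failures, capped at
    // the maximum, with random jitter so that colliding candidates separate.
    long nextDelayMillis() {
        long upperBound = Math.min(INITIAL_TIMEOUT_MS + failedAttempts * BACK_OFF_MS, MAX_TIMEOUT_MS);
        return (long) (random.nextDouble() * upperBound) + 1; // uniform in [1, upperBound]
    }
}
```

Without the fix, a failed publication would reset `failedAttempts` to zero, so two clashing candidates would keep retrying on the same short timescale instead of drifting apart.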
Pinging @elastic/es-distributed (Team:Distributed)
Hi @DaveCTurner, I've created a changelog YAML for you.
Note that this situation is pretty delicate to achieve, because we suppress election attempts on our peers with follower checks as soon as we become the leader. Also note that it uses a somewhat simpler strategy than the one described in #97909. I found it so hard to reproduce this that I'm fairly confident just waiting until the end of the first publication will be enough in practice. Closes ES-6502
Hi @DaveCTurner, I've updated the changelog YAML for you.
LGTM.
One question, though, for my education: could we not have closed the election scheduler as soon as we know the publication is committed, rather than waiting until after the state has been applied? What is the benefit of waiting?
LGTM
It's a good question. This never becomes truly (i.e. 100% provably) reliable, but the general principle is that the longer we wait the more reliable it becomes. We just have to be sure to reset the scheduler when the cluster is "stable", whatever that might mean. I suspect you're right that waiting for commit would be about as strong as waiting for apply. Formally speaking, the liveness argument is based on the protocol being eventually quiescent, which we achieve by extending the timeout whenever an election fails before reaching a quiescent state. We could probably come up with a more refined liveness argument if we tried (e.g. by being more discriminating about which messages are or aren't part of the election process) but I don't see the need. In fact even the implementation given here isn't quite enough to satisfy eventual quiescence, for at least a couple of reasons:
Fixing these is possible too, but it adds quite some complexity, and given how tricky it is to reproduce the problem fixed here I don't believe it will be necessary in practice.
NB subtle change in 28b7d58: if we have discovered a quorum and are running elections, and then we lose contact with the nodes we discovered, we have more fundamental problems than election collisions, so we should stop the election scheduler.
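The rule in this comment can be illustrated with a small sketch. The class and method names here are hypothetical, not the actual coordinator code:

```java
// Hypothetical illustration: only keep the election scheduler running while
// we can still discover a quorum of master-eligible peers. If we cannot even
// reach a majority, elections cannot succeed and scheduling them adds noise.
import java.util.Set;

class ElectionQuorumGate {
    // discoveredPeers excludes the local node; masterEligibleCount includes it.
    static boolean keepSchedulingElections(Set<String> discoveredPeers, int masterEligibleCount) {
        int quorum = masterEligibleCount / 2 + 1;      // strict majority
        return discoveredPeers.size() + 1 >= quorum;   // +1 counts the local node
    }
}
```

For example, in a three-node cluster a candidate that can still see one peer (two of three nodes) keeps scheduling elections, but one that has lost contact with both peers stops.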
Thanks for the explanation and the link to your excellent blog post; it all makes sense to me.
Bah, unfortunately this breaks things again. When we become […] I think it should be possible to restore the previous […]
Adds a test for a subtle problem that the initial version of elastic#98354 would have introduced: we must disable the election scheduler while we cannot even discover a quorum of peers, so that when discovery starts working again the election can happen quickly.
Adds a test showing that when an extended discovery outage eventually clears up, an election can happen very promptly (and fixes the calculation of `prevElectionWon` to avoid a situation where that would not happen today). This demonstrates a subtle problem that the initial version of elastic#98354 would have introduced.
Today the `PeerFinder` releases all its peers on deactivation, but to complete elastic#98354 we will need to delay the release until the cluster has reached a more stable state, which happens some time later. This commit separates the deactivate and release steps within the `PeerFinder` in preparation for that change, although it still does both steps at once in all production callers.
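The deactivate/release split described in this commit message can be sketched as follows. `PeerFinderSketch` and its members are hypothetical names, not the actual `PeerFinder` API:

```java
// Hypothetical sketch of separating deactivation from peer release, as the
// commit message above describes; the real PeerFinder differs in detail.
import java.util.ArrayList;
import java.util.List;

class PeerFinderSketch {
    private boolean active = true;
    private final List<AutoCloseable> heldPeerConnections = new ArrayList<>();

    void addPeer(AutoCloseable connection) {
        heldPeerConnections.add(connection);
    }

    // Step 1: stop probing for new peers, but keep existing connections open
    // so they can be released later, once the cluster is stable.
    void deactivate() {
        active = false;
    }

    // Step 2: performed some time after deactivation, e.g. once the newly
    // elected master's first publication has completed.
    void releasePeers() {
        for (AutoCloseable connection : heldPeerConnections) {
            try {
                connection.close();
            } catch (Exception e) {
                // best effort: continue releasing the remaining peers
            }
        }
        heldPeerConnections.clear();
    }

    boolean isActive() {
        return active;
    }

    int peerCount() {
        return heldPeerConnections.size();
    }
}
```

The point of the split is that callers which previously did both steps at once can keep doing so, while the coordinator gains the option of deferring the release step.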
Oh, I forgot to remove the […]
I'll leave it merged unless I hear any objections.
Still LGTM. |
Today we close the election scheduler when the coordinator leaves mode `CANDIDATE`, before even starting the publication that establishes the election winner as the cluster master. If this publication subsequently fails then we start a new election scheduler with the original, short, timeout, and do not back off. With very high numbers of master-eligible nodes this can lead to constant election clashes that never resolve. We must count such failed publications as failed election attempts for election scheduling and backoff purposes. This commit keeps the election scheduler open until a published state is applied, which means we continue to back off until a publication has completed. Closes elastic#97909