Improve reliability of elections with message delays #98354
Conversation
Today we close the election scheduler when the coordinator leaves mode `CANDIDATE`, before even starting the publication that establishes the election winner as the cluster master. If this publication subsequently fails then we start a new election scheduler with the original, short timeout, and do not back off. With very high numbers of master-eligible nodes this can lead to constant election clashes that never resolve. We must count such failed publications as failed election attempts for election scheduling and backoff purposes. This commit keeps the election scheduler open until a published state is applied, which means we continue to back off until a publication has completed.
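The backoff behaviour described here can be sketched roughly as follows. The class, constant names, and default values below are illustrative assumptions for this discussion, not Elasticsearch's actual scheduler:

```java
// Illustrative sketch only: counting failed publications as failed election
// attempts so that the retry delay keeps backing off. Names and defaults are
// assumptions, not the real implementation.
import java.util.Random;

class ElectionBackoff {
    // Assumed defaults, loosely modeled on the cluster.election.* settings.
    static final long INITIAL_TIMEOUT_MS = 100;
    static final long BACK_OFF_MS = 100;
    static final long MAX_TIMEOUT_MS = 10_000;

    private long failedAttempts = 0;
    private final Random random = new Random();

    // The fix described above amounts to also counting a publication that
    // fails after a won election as a failed attempt, so backoff keeps growing.
    void recordFailedAttempt() {
        failedAttempts++;
    }

    // Delay before the next attempt: grows linearly with failures, capped at
    // the maximum, with random jitter so that colliding candidates separate.
    long nextDelayMillis() {
        long upperBound = Math.min(INITIAL_TIMEOUT_MS + failedAttempts * BACK_OFF_MS, MAX_TIMEOUT_MS);
        return (long) (random.nextDouble() * upperBound) + 1; // uniform in [1, upperBound]
    }
}
```

Without the fix, a failed publication would reset `failedAttempts` to zero, so two clashing candidates would keep retrying on the same short timescale instead of drifting apart.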
Pinging @elastic/es-distributed (Team:Distributed)
Hi @DaveCTurner, I've created a changelog YAML for you.
Note that this situation is pretty delicate to achieve, because we suppress election attempts on our peers with follower checks as soon as we become the leader. Also note that it uses a somewhat simpler strategy than the one described in #97909. I found it so hard to reproduce this that I'm fairly confident just waiting until the end of the first publication will be enough in practice. Closes ES-6502
Hi @DaveCTurner, I've updated the changelog YAML for you.
LGTM.
One question, though, for my education: could we not have closed the election scheduler as soon as we know the publication is committed, rather than waiting until after the state has been applied? What is the benefit of waiting?
LGTM
It's a good question. This never becomes truly (i.e. 100% provably) reliable, but the general principle is that the longer we wait the more reliable it becomes. We just have to be sure to reset the scheduler when the cluster is "stable", whatever that might mean. I suspect you're right that waiting for commit would be about as strong as waiting for apply. Formally speaking, the liveness argument is based on the protocol being eventually quiescent, which we achieve by extending the timeout whenever an election fails before reaching a quiescent state. We could probably come up with a more refined liveness argument if we tried (e.g. by being more discriminating about which messages are or aren't part of the election process) but I don't see the need. In fact even the implementation given here isn't quite enough to satisfy eventual quiescence, for at least a couple of reasons:
Fixing these is possible too, but it adds quite some complexity, and given how tricky it is to reproduce the problem fixed here I don't believe it will be necessary in practice.
NB subtle change in 28b7d58: if we have discovered a quorum and are running elections, and then we lose contact with the nodes we discovered, we have more fundamental problems than election collisions, so we should stop the election scheduler.
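The rule in this comment can be illustrated with a small sketch. The class and method names here are hypothetical, not the actual coordinator code:

```java
// Hypothetical illustration: only keep the election scheduler running while
// we can still discover a quorum of master-eligible peers. If we cannot even
// reach a majority, elections cannot succeed and scheduling them adds noise.
import java.util.Set;

class ElectionQuorumGate {
    // discoveredPeers excludes the local node; masterEligibleCount includes it.
    static boolean keepSchedulingElections(Set<String> discoveredPeers, int masterEligibleCount) {
        int quorum = masterEligibleCount / 2 + 1;      // strict majority
        return discoveredPeers.size() + 1 >= quorum;   // +1 counts the local node
    }
}
```

For example, in a three-node cluster a candidate that can still see one peer (two of three nodes) keeps scheduling elections, but one that has lost contact with both peers stops.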
Thanks for the explanation and the link to your excellent blog post; it all makes sense to me.
Bah, unfortunately this breaks things again. When we become […] I think it should be possible to restore the previous […]
Adds a test for a subtle problem that the initial version of elastic#98354 would have introduced: we must disable the election scheduler while we cannot even discover a quorum of peers, so that when discovery starts working again the election can happen quickly.
Adds a test showing that when an extended discovery outage eventually clears up, an election can happen very promptly (and fixes the calculation of `prevElectionWon` to avoid a situation where that would not happen today). This demonstrates a subtle problem that the initial version of elastic#98354 would have introduced.
Today the `PeerFinder` releases all its peers on deactivation, but to complete elastic#98354 we will need to delay the release until the cluster has reached a more stable state, which happens some time later. This commit separates the deactivate and release steps within the `PeerFinder` in preparation for that change, although it still does both steps at once in all production callers.
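The deactivate/release split described in this commit message can be sketched as follows. `PeerFinderSketch` and its members are hypothetical names, not the actual `PeerFinder` API:

```java
// Hypothetical sketch of separating deactivation from peer release, as the
// commit message above describes; the real PeerFinder differs in detail.
import java.util.ArrayList;
import java.util.List;

class PeerFinderSketch {
    private boolean active = true;
    private final List<AutoCloseable> heldPeerConnections = new ArrayList<>();

    void addPeer(AutoCloseable connection) {
        heldPeerConnections.add(connection);
    }

    // Step 1: stop probing for new peers, but keep existing connections open
    // so they can be released later, once the cluster is stable.
    void deactivate() {
        active = false;
    }

    // Step 2: performed some time after deactivation, e.g. once the newly
    // elected master's first publication has completed.
    void releasePeers() {
        for (AutoCloseable connection : heldPeerConnections) {
            try {
                connection.close();
            } catch (Exception e) {
                // best effort: continue releasing the remaining peers
            }
        }
        heldPeerConnections.clear();
    }

    boolean isActive() {
        return active;
    }

    int peerCount() {
        return heldPeerConnections.size();
    }
}
```

The point of the split is that callers which previously did both steps at once can keep doing so, while the coordinator gains the option of deferring the release step.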
Oh, I forgot to remove the […]
I'll leave it merged unless I hear any objections.
Still LGTM. |
Today we close the election scheduler when the coordinator leaves mode `CANDIDATE`, before even starting the publication that establishes the election winner as the cluster master. If this publication subsequently fails then we start a new election scheduler with the original, short, timeout, and do not back off. With very high numbers of master-eligible nodes this can lead to constant election clashes that never resolve. We must count such failed publications as failed election attempts for election scheduling and backoff purposes. This commit keeps the election scheduler open until a published state is applied, which means we continue to back off until a publication has completed. Closes elastic#97909