New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid counting votes from master-ineligible nodes #43688
Avoid counting votes from master-ineligible nodes #43688
Conversation
Today if a master-eligible node is converted to a master-ineligible node it may remain in the voting configuration, meaning that the master node may count its publish responses as an indication that it has properly persisted the cluster state. However master-ineligible nodes do not properly persist the cluster state, so it is not safe to count these votes. This change adjusts `CoordinationState` to take account of this from a safety point of view, and also adjusts the `Coordinator` to prevent such nodes from joining the cluster. Instead, it triggers a reconfiguration to remove from the voting configuration a node that now appears to be master-ineligible before processing its join.
Pinging @elastic/es-distributed |
I attempted to update the formal model to match this here too, allowing a node to switch its master-eligibility at reboot time and fall back to an arbitrary persisted state if so, but my attempts did very bad things to the state space. |
Wouldn't it be simpler to adapt |
I wasn't as happy with the safety argument for that approach, or the failure reporting. Votes from nodes that used to be master-eligible can still be in play after the node comes back as a non-master node, and I'm worried about a future surprise from that fact. Similarly I don't think we technically need to remove master-ineligible nodes from the config before allowing them to join, but I think that's a nice invariant to have. However, I've also implemented the alternative on this branch. |
@elasticmachine please run elasticsearch-ci/docbldesx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. I've left some comments.
.filter(this::hasJoinVoteFrom).collect(Collectors.toSet()); | ||
final VotingConfiguration newConfig = reconfigurator.reconfigure(liveNodes, | ||
clusterState.getVotingConfigExclusions().stream().map(VotingConfigExclusion::getNodeId).collect(Collectors.toSet()), | ||
.filter(DiscoveryNode::isMasterNode).filter(this::hasJoinVoteFrom).collect(Collectors.toSet()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this change still necessary? hasJoinVoteFrom
will only return true for a master node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, hasJoinVoteFrom
just looks at the node ID, so if you've restarted a master node as a non-master node then as it stands today we will still return true
.
But this exposes another issue: we must keep track of the joins from the non-master nodes too, because we bump our term (and hold another election) if we think the node voted for a different master in this term. I added a test for this in 89584a8.
In addressing this, it turns out that everywhere else we care about the master nodes that haven't sent us a vote, so I think keeping the isMasterNode
filter here makes the most sense. See c998eb1 for other related changes.
final VotingConfiguration newConfig = reconfigurator.reconfigure(liveNodes, Stream.concat( | ||
clusterState.getVotingConfigExclusions().stream().map(VotingConfigExclusion::getNodeId), | ||
StreamSupport.stream(clusterState.nodes().spliterator(), false) | ||
.filter(Predicate.not(DiscoveryNode::isMasterNode)).map(DiscoveryNode::getId)).collect(Collectors.toSet()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment to say "auto-exclude non master-eligible nodes that are in the voting config" and only set those that are actually in the config? This will make the logging less confusing in Reconfigurator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, see fb274a9.
@@ -907,13 +910,13 @@ private void handleJoin(Join join) { | |||
|
|||
if (coordinationState.get().electionWon()) { | |||
// If we have already won the election then the actual join does not matter for election purposes, so swallow any exception | |||
final boolean isNewJoin = handleJoinIgnoringExceptions(join); | |||
final boolean isNewMasterEligibleNode = handleJoinIgnoringExceptions(join); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps isNewJoinFromMasterEligibleNode
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, see b2db083.
|
||
assert publicationCompletedIffAllTargetsInactiveOrCancelled(); | ||
try { | ||
handlePublishResponse(response.getPublishResponse()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we move this exception handling to the handlePublishResponse
method? It should only wrap the Publication.this.handlePublishResponse
part
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, see b38833d.
server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
Show resolved
Hide resolved
@@ -875,7 +878,8 @@ ClusterNode restartedNode(Function<MetaData, MetaData> adaptGlobalMetaData, Func | |||
final DiscoveryNode newLocalNode = new DiscoveryNode(localNode.getName(), localNode.getId(), | |||
UUIDs.randomBase64UUID(random()), // generated deterministically for repeatable tests | |||
address.address().getHostString(), address.getAddress(), address, Collections.emptyMap(), | |||
localNode.isMasterNode() ? DiscoveryNodeRole.BUILT_IN_ROLES : emptySet(), Version.CURRENT); | |||
localNode.isMasterNode() && Node.NODE_MASTER_SETTING.get(nodeSettings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make it so that the settings always properly reflect the roles and use DiscoveryNode.getRolesFromSettings(nodeSettings)
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -102,6 +105,26 @@ public static long value(ClusterState clusterState) { | |||
} | |||
|
|||
void reboot() { | |||
if (localNode.isMasterNode() == false && rarely()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we do something similar for CoordinatorTests to check liveness aspects of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I added a thing to arbitrarily roll back the state on data nodes in 0fa253d.
@@ -1217,6 +1218,30 @@ public void assertMatched() { | |||
} | |||
} | |||
|
|||
public void testReconfiguresToExcludeMasterIneligibleVotingNodes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testReconfiguresToExcludeMasterIneligibleNodesInVotingConfig
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you wish :) see eca9468
This reverts commit 9a0158c.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments. This is ready for another look.
.filter(this::hasJoinVoteFrom).collect(Collectors.toSet()); | ||
final VotingConfiguration newConfig = reconfigurator.reconfigure(liveNodes, | ||
clusterState.getVotingConfigExclusions().stream().map(VotingConfigExclusion::getNodeId).collect(Collectors.toSet()), | ||
.filter(DiscoveryNode::isMasterNode).filter(this::hasJoinVoteFrom).collect(Collectors.toSet()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, hasJoinVoteFrom
just looks at the node ID, so if you've restarted a master node as a non-master node then as it stands today we will still return true
.
But this exposes another issue: we must keep track of the joins from the non-master nodes too, because we bump our term (and hold another election) if we think the node voted for a different master in this term. I added a test for this in 89584a8.
In addressing this, it turns out that everywhere else we care about the master nodes that haven't sent us a vote, so I think keeping the isMasterNode
filter here makes the most sense. See c998eb1 for other related changes.
@@ -875,7 +878,8 @@ ClusterNode restartedNode(Function<MetaData, MetaData> adaptGlobalMetaData, Func | |||
final DiscoveryNode newLocalNode = new DiscoveryNode(localNode.getName(), localNode.getId(), | |||
UUIDs.randomBase64UUID(random()), // generated deterministically for repeatable tests | |||
address.address().getHostString(), address.getAddress(), address, Collections.emptyMap(), | |||
localNode.isMasterNode() ? DiscoveryNodeRole.BUILT_IN_ROLES : emptySet(), Version.CURRENT); | |||
localNode.isMasterNode() && Node.NODE_MASTER_SETTING.get(nodeSettings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final VotingConfiguration newConfig = reconfigurator.reconfigure(liveNodes, Stream.concat( | ||
clusterState.getVotingConfigExclusions().stream().map(VotingConfigExclusion::getNodeId), | ||
StreamSupport.stream(clusterState.nodes().spliterator(), false) | ||
.filter(Predicate.not(DiscoveryNode::isMasterNode)).map(DiscoveryNode::getId)).collect(Collectors.toSet()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, see fb274a9.
@@ -907,13 +910,13 @@ private void handleJoin(Join join) { | |||
|
|||
if (coordinationState.get().electionWon()) { | |||
// If we have already won the election then the actual join does not matter for election purposes, so swallow any exception | |||
final boolean isNewJoin = handleJoinIgnoringExceptions(join); | |||
final boolean isNewMasterEligibleNode = handleJoinIgnoringExceptions(join); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, see b2db083.
|
||
assert publicationCompletedIffAllTargetsInactiveOrCancelled(); | ||
try { | ||
handlePublishResponse(response.getPublishResponse()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, see b38833d.
@@ -1217,6 +1218,30 @@ public void assertMatched() { | |||
} | |||
} | |||
|
|||
public void testReconfiguresToExcludeMasterIneligibleVotingNodes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you wish :) see eca9468
@@ -102,6 +105,26 @@ public static long value(ClusterState clusterState) { | |||
} | |||
|
|||
void reboot() { | |||
if (localNode.isMasterNode() == false && rarely()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I added a thing to arbitrarily roll back the state on data nodes in 0fa253d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a few more minor comments and questions on the tests (mainly for my understanding), but looks very good.
not(hasItem(chosenNode.getId()))); | ||
} | ||
|
||
public void testDoesNotPerformElectionWhenRestartingNonMasterNode() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make sense to generalize this test so that restarting any node that is not the current leader and without which there is still a publication quorum does not lead to an election?
We could then just call this testDoesNotPerformUnnecessaryElection()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact I think the condition you described is a tautology: the term shouldn't change when restarting any node that isn't the current leader. If there's ≥3 master-eligible nodes then the cluster is fault tolerant; if there's 1 then we don't restart the unique master-eligible node, but if there's 2 then we know the leader is in the voting config and the other node isn't, so in all cases we can restart every node except the leader without triggering an election. Implemented this in 4f4cfe8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hah, I thought about this, but dismissed the case of 2 nodes too quickly. You're right, cool that this simplifies the conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bah except we randomly disable auto-shrinking the voting configuration so in a 2-node cluster we sometimes we have a 2-node config. Fixed that in a40f8f6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💃
final long persistedCurrentTerm; | ||
|
||
if ((oldState.getLastAcceptedState().nodes().getLocalNode().isMasterNode() && newLocalNode.isMasterNode()) == false | ||
&& (oldState.getLastAcceptedState().term() > 0L || oldState.getLastAcceptedState().version() > 0L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this extra condition here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're rolling back the state to a strictly earlier version, which only exists if this condition is true. If the state on disk has the right term and version then it's otherwise correct too. I added comments in 81e162d to clarify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks
|
||
final long persistedCurrentTerm; | ||
|
||
if ((oldState.getLastAcceptedState().nodes().getLocalNode().isMasterNode() && newLocalNode.isMasterNode()) == false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only exercised by the new test that you've added, right? We're not switching roles o.w.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not reset state when restarting non-master node? why only case if we switch from master to non-master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition is true if the node was master-ineligible either before or after the restart. In particular it's true when restarting a master-ineligible node without changing its roles. I think you're parsing it wrong? I added comments in 81e162d to clarify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I parsed it wrong indeed, too many brackets 🎱
newLastAcceptedVersion = randomLongBetween(0L, | ||
newLastAcceptedTerm == lastAcceptedTerm ? lastAcceptedVersion - 1 : Long.MAX_VALUE); | ||
} | ||
final VotingConfiguration newVotingConfiguration = new VotingConfiguration(singleton(randomAlphaOfLength(10))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
randomly set this empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ done in 81e162d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
@elasticmachine please run elasticsearch-ci/packaging-sample |
Today if a master-eligible node is converted to a master-ineligible node it may remain in the voting configuration, meaning that the master node may count its publish responses as an indication that it has properly persisted the cluster state. However master-ineligible nodes do not properly persist the cluster state, so it is not safe to count these votes. This change adjusts `CoordinationState` to take account of this from a safety point of view, and also adjusts the `Coordinator` to prevent such nodes from joining the cluster. Instead, it triggers a reconfiguration to remove from the voting configuration a node that now appears to be master-ineligible before processing its join.
Today if a master-eligible node is converted to a master-ineligible node it may remain in the voting configuration, meaning that the master node may count its publish responses as an indication that it has properly persisted the cluster state. However master-ineligible nodes do not properly persist the cluster state, so it is not safe to count these votes. This change adjusts `CoordinationState` to take account of this from a safety point of view, and also adjusts the `Coordinator` to prevent such nodes from joining the cluster. Instead, it triggers a reconfiguration to remove from the voting configuration a node that now appears to be master-ineligible before processing its join. Backport of #43688, see #44260.
Today if a master-eligible node is converted to a master-ineligible node it may
remain in the voting configuration, meaning that the master node may count its
publish responses as an indication that it has properly persisted the cluster
state. However master-ineligible nodes do not properly persist the cluster
state, so it is not safe to count these votes.
This change adjusts
CoordinationState
to take account of this from a safetypoint of view, and also adjusts the
Coordinator
to prevent such nodes fromjoining the cluster. Instead, it triggers a reconfiguration to remove from the
voting configuration a node that now appears to be master-ineligible before
processing its join.