Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid counting votes from master-ineligible nodes #43688

Conversation

DaveCTurner
Copy link
Contributor

Today if a master-eligible node is converted to a master-ineligible node it may
remain in the voting configuration, meaning that the master node may count its
publish responses as an indication that it has properly persisted the cluster
state. However master-ineligible nodes do not properly persist the cluster
state, so it is not safe to count these votes.

This change adjusts CoordinationState to take account of this from a safety
point of view, and also adjusts the Coordinator to prevent such nodes from
joining the cluster. Instead, it triggers a reconfiguration to remove from the
voting configuration a node that now appears to be master-ineligible before
processing its join.

Today if a master-eligible node is converted to a master-ineligible node it may
remain in the voting configuration, meaning that the master node may count its
publish responses as an indication that it has properly persisted the cluster
state. However master-ineligible nodes do not properly persist the cluster
state, so it is not safe to count these votes.

This change adjusts `CoordinationState` to take account of this from a safety
point of view, and also adjusts the `Coordinator` to prevent such nodes from
joining the cluster. Instead, it triggers a reconfiguration to remove from the
voting configuration a node that now appears to be master-ineligible before
processing its join.
@DaveCTurner DaveCTurner added >bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.3.0 labels Jun 27, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Copy link
Contributor Author

I attempted to update the formal model to match this here too, allowing a node to switch its master-eligibility at reboot time and fall back to an arbitrary persisted state if so, but my attempts did very bad things to the state space.

@ywelsch
Copy link
Contributor

ywelsch commented Jun 27, 2019

Wouldn't it be simpler to adapt VoteCollection to ignore votes from master-inelegible nodes?

@DaveCTurner
Copy link
Contributor Author

DaveCTurner commented Jun 27, 2019

I wasn't as happy with the safety argument for that approach, or the failure reporting. Votes from nodes that used to be master-eligible can still be in play after the node comes back as a non-master node, and I'm worried about a future surprise from that fact.

Similarly I don't think we technically need to remove master-ineligible nodes from the config before allowing them to join, but I think that's a nice invariant to have.

However, I've also implemented the alternative on this branch.

@DaveCTurner
Copy link
Contributor Author

@elasticmachine please run elasticsearch-ci/docbldesx

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. I've left some comments.

.filter(this::hasJoinVoteFrom).collect(Collectors.toSet());
final VotingConfiguration newConfig = reconfigurator.reconfigure(liveNodes,
clusterState.getVotingConfigExclusions().stream().map(VotingConfigExclusion::getNodeId).collect(Collectors.toSet()),
.filter(DiscoveryNode::isMasterNode).filter(this::hasJoinVoteFrom).collect(Collectors.toSet());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change still necessary? hasJoinVoteFrom will only return true for a master node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, hasJoinVoteFrom just looks at the node ID, so if you've restarted a master node as a non-master node then as it stands today we will still return true.

But this exposes another issue: we must keep track of the joins from the non-master nodes too, because we bump our term (and hold another election) if we think the node voted for a different master in this term. I added a test for this in 89584a8.

In addressing this, it turns out that everywhere else we care about the master nodes that haven't sent us a vote, so I think keeping the isMasterNode filter here makes the most sense. See c998eb1 for other related changes.

final VotingConfiguration newConfig = reconfigurator.reconfigure(liveNodes, Stream.concat(
clusterState.getVotingConfigExclusions().stream().map(VotingConfigExclusion::getNodeId),
StreamSupport.stream(clusterState.nodes().spliterator(), false)
.filter(Predicate.not(DiscoveryNode::isMasterNode)).map(DiscoveryNode::getId)).collect(Collectors.toSet()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment to say "auto-exclude non master-eligible nodes that are in the voting config" and only set those that are actually in the config? This will make the logging less confusing in Reconfigurator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, see fb274a9.

@@ -907,13 +910,13 @@ private void handleJoin(Join join) {

if (coordinationState.get().electionWon()) {
// If we have already won the election then the actual join does not matter for election purposes, so swallow any exception
final boolean isNewJoin = handleJoinIgnoringExceptions(join);
final boolean isNewMasterEligibleNode = handleJoinIgnoringExceptions(join);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps isNewJoinFromMasterEligibleNode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, see b2db083.


assert publicationCompletedIffAllTargetsInactiveOrCancelled();
try {
handlePublishResponse(response.getPublishResponse());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move this exception handling to the handlePublishResponse method? It should only wrap the Publication.this.handlePublishResponse part

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, see b38833d.

@@ -875,7 +878,8 @@ ClusterNode restartedNode(Function<MetaData, MetaData> adaptGlobalMetaData, Func
final DiscoveryNode newLocalNode = new DiscoveryNode(localNode.getName(), localNode.getId(),
UUIDs.randomBase64UUID(random()), // generated deterministically for repeatable tests
address.address().getHostString(), address.getAddress(), address, Collections.emptyMap(),
localNode.isMasterNode() ? DiscoveryNodeRole.BUILT_IN_ROLES : emptySet(), Version.CURRENT);
localNode.isMasterNode() && Node.NODE_MASTER_SETTING.get(nodeSettings)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make it so that the settings always properly reflect the roles and use DiscoveryNode.getRolesFromSettings(nodeSettings) here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but I won't do that here as it increases the scope of this PR rather a lot. I started down this path in 9a0158c but then started chasing failures and decided it went too far. Reverted in a9cd443, let's do this in a follow-up.

@@ -102,6 +105,26 @@ public static long value(ClusterState clusterState) {
}

void reboot() {
if (localNode.isMasterNode() == false && rarely()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we do something similar for CoordinatorTests to check liveness aspects of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I added a thing to arbitrarily roll back the state on data nodes in 0fa253d.

@@ -1217,6 +1218,30 @@ public void assertMatched() {
}
}

public void testReconfiguresToExcludeMasterIneligibleVotingNodes() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testReconfiguresToExcludeMasterIneligibleNodesInVotingConfig

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you wish :) see eca9468

@jpountz jpountz added v7.4.0 and removed v7.3.0 labels Jul 3, 2019
Copy link
Contributor Author

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. This is ready for another look.

.filter(this::hasJoinVoteFrom).collect(Collectors.toSet());
final VotingConfiguration newConfig = reconfigurator.reconfigure(liveNodes,
clusterState.getVotingConfigExclusions().stream().map(VotingConfigExclusion::getNodeId).collect(Collectors.toSet()),
.filter(DiscoveryNode::isMasterNode).filter(this::hasJoinVoteFrom).collect(Collectors.toSet());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, hasJoinVoteFrom just looks at the node ID, so if you've restarted a master node as a non-master node then as it stands today we will still return true.

But this exposes another issue: we must keep track of the joins from the non-master nodes too, because we bump our term (and hold another election) if we think the node voted for a different master in this term. I added a test for this in 89584a8.

In addressing this, it turns out that everywhere else we care about the master nodes that haven't sent us a vote, so I think keeping the isMasterNode filter here makes the most sense. See c998eb1 for other related changes.

@@ -875,7 +878,8 @@ ClusterNode restartedNode(Function<MetaData, MetaData> adaptGlobalMetaData, Func
final DiscoveryNode newLocalNode = new DiscoveryNode(localNode.getName(), localNode.getId(),
UUIDs.randomBase64UUID(random()), // generated deterministically for repeatable tests
address.address().getHostString(), address.getAddress(), address, Collections.emptyMap(),
localNode.isMasterNode() ? DiscoveryNodeRole.BUILT_IN_ROLES : emptySet(), Version.CURRENT);
localNode.isMasterNode() && Node.NODE_MASTER_SETTING.get(nodeSettings)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but I won't do that here as it increases the scope of this PR rather a lot. I started down this path in 9a0158c but then started chasing failures and decided it went too far. Reverted in a9cd443, let's do this in a follow-up.

final VotingConfiguration newConfig = reconfigurator.reconfigure(liveNodes, Stream.concat(
clusterState.getVotingConfigExclusions().stream().map(VotingConfigExclusion::getNodeId),
StreamSupport.stream(clusterState.nodes().spliterator(), false)
.filter(Predicate.not(DiscoveryNode::isMasterNode)).map(DiscoveryNode::getId)).collect(Collectors.toSet()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, see fb274a9.

@@ -907,13 +910,13 @@ private void handleJoin(Join join) {

if (coordinationState.get().electionWon()) {
// If we have already won the election then the actual join does not matter for election purposes, so swallow any exception
final boolean isNewJoin = handleJoinIgnoringExceptions(join);
final boolean isNewMasterEligibleNode = handleJoinIgnoringExceptions(join);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, see b2db083.


assert publicationCompletedIffAllTargetsInactiveOrCancelled();
try {
handlePublishResponse(response.getPublishResponse());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, see b38833d.

@@ -1217,6 +1218,30 @@ public void assertMatched() {
}
}

public void testReconfiguresToExcludeMasterIneligibleVotingNodes() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you wish :) see eca9468

@@ -102,6 +105,26 @@ public static long value(ClusterState clusterState) {
}

void reboot() {
if (localNode.isMasterNode() == false && rarely()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I added a thing to arbitrarily roll back the state on data nodes in 0fa253d.

@DaveCTurner DaveCTurner requested a review from ywelsch July 10, 2019 15:47
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few more minor comments and questions on the tests (mainly for my understanding), but looks very good.

not(hasItem(chosenNode.getId())));
}

public void testDoesNotPerformElectionWhenRestartingNonMasterNode() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to generalize this test so that restarting any node that is not the current leader and without which there is still a publication quorum does not lead to an election?

We could then just call this testDoesNotPerformUnnecessaryElection()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact I think the condition you described is a tautology: the term shouldn't change when restarting any node that isn't the current leader. If there's ≥3 master-eligible nodes then the cluster is fault tolerant; if there's 1 then we don't restart the unique master-eligible node, but if there's 2 then we know the leader is in the voting config and the other node isn't, so in all cases we can restart every node except the leader without triggering an election. Implemented this in 4f4cfe8.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, I thought about this, but dismissed the case of 2 nodes too quickly. You're right, cool that this simplifies the conditions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bah except we randomly disable auto-shrinking the voting configuration so in a 2-node cluster we sometimes we have a 2-node config. Fixed that in a40f8f6.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💃

final long persistedCurrentTerm;

if ((oldState.getLastAcceptedState().nodes().getLocalNode().isMasterNode() && newLocalNode.isMasterNode()) == false
&& (oldState.getLastAcceptedState().term() > 0L || oldState.getLastAcceptedState().version() > 0L)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this extra condition here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're rolling back the state to a strictly earlier version, which only exists if this condition is true. If the state on disk has the right term and version then it's otherwise correct too. I added comments in 81e162d to clarify.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks


final long persistedCurrentTerm;

if ((oldState.getLastAcceptedState().nodes().getLocalNode().isMasterNode() && newLocalNode.isMasterNode()) == false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is only exercised by the new test that you've added, right? We're not switching roles o.w.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not reset state when restarting non-master node? why only case if we switch from master to non-master

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is true if the node was master-ineligible either before or after the restart. In particular it's true when restarting a master-ineligible node without changing its roles. I think you're parsing it wrong? I added comments in 81e162d to clarify.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I parsed it wrong indeed, too many brackets 🎱

newLastAcceptedVersion = randomLongBetween(0L,
newLastAcceptedTerm == lastAcceptedTerm ? lastAcceptedVersion - 1 : Long.MAX_VALUE);
}
final VotingConfiguration newVotingConfiguration = new VotingConfiguration(singleton(randomAlphaOfLength(10)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

randomly set this empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ done in 81e162d.

@DaveCTurner DaveCTurner requested a review from ywelsch July 11, 2019 12:38
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@DaveCTurner
Copy link
Contributor Author

@elasticmachine please run elasticsearch-ci/packaging-sample

@DaveCTurner DaveCTurner merged commit cc51a93 into elastic:master Jul 12, 2019
@DaveCTurner DaveCTurner deleted the 2019-06-27-prevent-votes-from-master-ineligible-nodes branch July 12, 2019 07:32
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 12, 2019
Today if a master-eligible node is converted to a master-ineligible node it may
remain in the voting configuration, meaning that the master node may count its
publish responses as an indication that it has properly persisted the cluster
state. However master-ineligible nodes do not properly persist the cluster
state, so it is not safe to count these votes.

This change adjusts `CoordinationState` to take account of this from a safety
point of view, and also adjusts the `Coordinator` to prevent such nodes from
joining the cluster. Instead, it triggers a reconfiguration to remove from the
voting configuration a node that now appears to be master-ineligible before
processing its join.
DaveCTurner added a commit that referenced this pull request Jul 12, 2019
Today if a master-eligible node is converted to a master-ineligible node it may
remain in the voting configuration, meaning that the master node may count its
publish responses as an indication that it has properly persisted the cluster
state. However master-ineligible nodes do not properly persist the cluster
state, so it is not safe to count these votes.

This change adjusts `CoordinationState` to take account of this from a safety
point of view, and also adjusts the `Coordinator` to prevent such nodes from
joining the cluster. Instead, it triggers a reconfiguration to remove from the
voting configuration a node that now appears to be master-ineligible before
processing its join.

Backport of #43688, see #44260.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v7.4.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants