
Bootstrap a Zen2 cluster once quorum is discovered #37463

Conversation

@DaveCTurner DaveCTurner commented Jan 15, 2019

Today when bootstrapping a Zen2 cluster we wait for every node in the
initial_master_nodes setting to be discovered, so that we can map the
node names or addresses in the initial_master_nodes list to their IDs for
inclusion in the initial voting configuration. This means that if any of
the expected master-eligible nodes fails to start then bootstrapping will
not occur and the cluster will not form. This is not ideal, and we would
prefer the cluster to bootstrap even if some of the master-eligible nodes
do not start.

Safe bootstrapping requires that all pairs of quorums of all initial
configurations overlap, and this is particularly troublesome to ensure
given that nodes may be concurrently and independently attempting to
bootstrap the cluster. The solution is to bootstrap using an initial
configuration whose size matches the size of the expected set of
master-eligible nodes, but with the unknown IDs replaced by "placeholder"
IDs that can never belong to any node. Any quorum of received votes in any
of these placeholder-laden initial configurations is also a quorum of the
"true" initial set of master-eligible nodes, giving the guarantee that it
intersects all other quorums as required.
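
As a rough sketch of the idea (illustrative only; the names discoveryNodes, unsatisfiedRequirements, VotingConfiguration and BOOTSTRAP_PLACEHOLDER_PREFIX follow the snippets discussed in the review below, not necessarily the final implementation):

// Suppose initial_master_nodes lists 5 nodes but only 3 of them have been discovered so far.
// We bootstrap with a 5-entry configuration: the 3 real node IDs plus 2 placeholder IDs that
// can never belong to a real node. A quorum of this configuration is any 3 of its 5 entries,
// and since placeholders never cast votes, a quorum of received votes must consist of 3 of the
// expected master-eligible nodes, which intersects every other quorum of the true 5-node set.
final Set<String> nodeIds = Stream.concat(
        discoveryNodes.stream().map(DiscoveryNode::getId),
        unsatisfiedRequirements.stream().map(name -> BOOTSTRAP_PLACEHOLDER_PREFIX + "-" + name))
    .collect(Collectors.toSet());
doBootstrap(new VotingConfiguration(nodeIds));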

Note that this change means that the initial configuration is not
necessarily robust to any node failures. Normally the cluster will form and
then auto-reconfigure to a more robust configuration in which the
placeholder IDs are replaced by the IDs of genuine nodes as they join the
cluster; however, if a node fails between bootstrapping and this
auto-reconfiguration then the cluster may become unavailable. We feel this is
less likely than a node failing to start at all.

This commit also enormously simplifies the cluster bootstrapping process.
Today, the cluster bootstrapping process involves two (local) transport actions
in order to support a flexible bootstrapping API and to make it easily
accessible to plugins. However, this flexibility is not required for the current
design, so it adds a good deal of unnecessary complexity. Here we remove
this complexity in favour of a much simpler ClusterBootstrapService
implementation that does all the work itself.

@DaveCTurner DaveCTurner added the >non-issue, v7.0.0, and :Distributed/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection) labels on Jan 15, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor Author

The merge conflict is trivial - the file can be deleted - but I am not using the current HEAD of master since it is not currently passing CI. I'll merge a more recent master once things go green there again.

Contributor

@ywelsch ywelsch left a comment

I like the simplification here. I've left some comments on ClusterBootstrapService.


private void startBootstrap(Set<DiscoveryNode> discoveryNodes) {
    if (bootstrappingPermitted.compareAndSet(true, false)) {
        doBootstrap(new VotingConfiguration(discoveryNodes.stream().map(DiscoveryNode::getId).collect(Collectors.toSet())));
Contributor

I'm confused. Why does it put all discovered nodes into the bootstrapping configuration, and not only those that actually match the initialMasterNodes requirements? For the best-effort bootstrapping I understand, but for the settings-based bootstrapping this is a bit unexpected.

Contributor Author

We know these nodes are live (the PeerFinder heard from them within the last few seconds) and the cluster will reconfigure to an optimal configuration as soon as a master is elected, so I don't think it makes much difference. More master nodes is generally more robust, right?

Contributor

@ywelsch ywelsch Jan 16, 2019

It might also be unsafe. Assume I start up 5 nodes (A,B,C,D,E) with initial_master_nodes set to A. Then it's possible that B will bootstrap with A,B,C and that D will bootstrap with A,D,E, which have the non-overlapping quorums (B,C) and (D,E). I don't think we should do this when initialMasterNodes is used; we should stick exactly to the formal model.

Contributor Author

Oh yes, you're quite right. In fact it's unsafe with 3 nodes and initial_master_nodes: A, because A might bootstrap using only itself and B might bootstrap with A,B,C so that {B,C} is a quorum disjoint from A's. Fixed in 526acba.

return new GetDiscoveredNodesResponse(in);
}
});
doBootstrap(votingConfiguration);
Contributor

Is it so important to use the same voting configuration for retrying? As long as it matches the requirements it should be good, no? In the case of an automated best-effort bootstrap we have no strong guarantees anyway, so keeping the same voting config does not matter either?

Contributor Author

Probably not, but this is how we've modelled it. Retrying here is a fairly weird thing to be doing anyway, requiring one of the following:

  • the node is not bootstrapped, and also not a CANDIDATE
  • the node is no longer in touch with a quorum of peers
  • an I/O error occurred

I'm in two minds about whether to retry automatically at all, as opposed to simply logging the issue and letting the operator restart the node if they really need to.

Contributor Author

I thought about this some more and concluded that retrying was not a particularly helpful thing to do on an exception, so as of 2ff94a0 we no longer do so.

Contributor

I think we should have retries, for the sake of making bootstrapping fault-tolerant. It would be bad if, in a 3-node cluster, one node went down during bootstrapping and neither of the other 2 nodes completed the bootstrapping. Yet this is a possibility, with node 1 going down and nodes 2 and 3 both using node 1 and themselves as the initial voting config (because node 1 was briefly available during discovery).
Similarly I feel that a node turning into a follower (because of a follower check, so it has not accepted a cluster state yet) is not a good reason to abort setting the initial config. Given that we're not changing the cluster state anymore when setting the initial config, I think we should change the check from (not be CANDIDATE) to (not have a voting configuration).
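
For concreteness, a minimal sketch of the check being proposed (isInitialConfigurationSet() is an assumed stand-in for "the last-accepted state already contains a voting configuration", not necessarily the real accessor):

// Illustrative only: proceed with bootstrapping as long as no voting configuration has been
// accepted yet, rather than requiring that this node is still a CANDIDATE.
if (coordinator.isInitialConfigurationSet() == false) {
    doBootstrap(votingConfiguration);
}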

Contributor Author

Ok, I reverted that commit in 9d2bc20.

We already only bootstrap if we do (not have a voting configuration) - this check is in addition to the check on the mode. And yet we do change the cluster state when setting the initial config, so I don't follow the reasoning. Do you mean to say that we don't change the cluster state version on bootstrapping?

Contributor

Do you mean to say that we don't change the cluster state version on bootstrapping?

yes.

The mode check might be superfluous.

Contributor

Note that this comment is still unaddressed:

I think we should have retries, for the sake of making bootstrapping fault-tolerant. It would be bad if, in a 3-node cluster, one node went down during bootstrapping and neither of the other 2 nodes completed the bootstrapping. Yet this is a possibility, with node 1 going down and nodes 2 and 3 both using node 1 and themselves as the initial voting config (because node 1 was briefly available during discovery).

I feel that this is an important situation to cover and we should also have CoordinatorTests for it, i.e., start a 3-node cluster with initial_master_nodes set to all nodes, do a little runRandomly, then completely isolate one of the nodes, and see if the other 2 can form a cluster (stabilization).
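
Roughly, in the style of the CoordinatorTests harness, that scenario might look like the sketch below (Cluster, runRandomly, stabilise and the disconnect call are quoted from memory and are illustrative only; the real test will differ in detail):

public void testClusterFormsWithOneInitialMasterNodeIsolated() {
    // three master-eligible nodes, all listed in initial_master_nodes
    final Cluster cluster = new Cluster(3);
    cluster.runRandomly();              // a little random activity first
    cluster.getAnyNode().disconnect();  // completely isolate one of the nodes
    cluster.stabilise();                // the remaining two nodes should still bootstrap and form a cluster
}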

Contributor Author

I removed the mode check in f30ab6e and we agreed to follow up on the potential availability issue later since it requires some careful thought.

@DaveCTurner
Contributor Author

Build failure looks like #37275; @elasticmachine please run the Gradle build tests 2

@DaveCTurner
Contributor Author

@elasticmachine please run the Gradle build tests 2

@DaveCTurner
Contributor Author

@elasticmachine please:

  • run the Gradle build tests 1
  • run the Gradle build tests 2

@DaveCTurner
Contributor Author

@elasticmachine please:

  • run the Gradle build tests 1
  • run the Gradle build tests 2
  • run the default distro tests

@DaveCTurner
Contributor Author

@elasticmachine please:

  • run the Gradle build tests 1
  • run the Gradle build tests 2

@DaveCTurner
Contributor Author

@elasticmachine please:

  • run the Gradle build tests 1
  • run the Gradle build tests 2

@DaveCTurner DaveCTurner changed the title from "Simplify ClusterBootstrapService" to "Bootstrap a Zen2 cluster once quorum is discovered" on Jan 18, 2019
SNAPSHOT_IN_PROGRESS_EXCEPTION(org.elasticsearch.snapshots.SnapshotInProgressException.class,
-    org.elasticsearch.snapshots.SnapshotInProgressException::new, 152, Version.V_7_0_0);
+    org.elasticsearch.snapshots.SnapshotInProgressException::new, 151, Version.V_7_0_0);
Contributor

👍

assert placeholderCount < discoveryNodes.size() : discoveryNodes.size() + " <= " + placeholderCount;
if (bootstrappingPermitted.compareAndSet(true, false)) {
    doBootstrap(new VotingConfiguration(Stream.concat(discoveryNodes.stream().map(DiscoveryNode::getId),
        Stream.generate(() -> BOOTSTRAP_PLACEHOLDER_PREFIX + UUIDs.randomBase64UUID(random)).limit(placeholderCount))
Contributor

Instead of randomly-generated UUIDs here, I think we should combine BOOTSTRAP_PLACEHOLDER_PREFIX with the initial_master_nodes value. This will make it clearer which entry each placeholder stands for.

Contributor Author

Very well, I did so in c716734 and also ensured that there are no duplicate requirements in eadc797.

}

final Set<DiscoveryNode> nodesMatchingRequirements = requirementMatchingResult.v1();
final List<String> unsatisfiedRequirements = requirementMatchingResult.v2();
Contributor

can you output matching and unsatisfiedRequirements at trace level here?

Contributor Author

Yep, see 2878bd6
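
Presumably something along these lines (the exact message in 2878bd6 may differ):

logger.trace("nodesMatchingRequirements={}, unsatisfiedRequirements={} from {}",
    nodesMatchingRequirements, unsatisfiedRequirements, discoveryNodes);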

assert unsatisfiedRequirements.size() < discoveryNodes.size() : discoveryNodes + " smaller than " + unsatisfiedRequirements;
if (bootstrappingPermitted.compareAndSet(true, false)) {
    doBootstrap(new VotingConfiguration(Stream.concat(discoveryNodes.stream().map(DiscoveryNode::getId),
        unsatisfiedRequirements.stream().map(s -> BOOTSTRAP_PLACEHOLDER_PREFIX + s))
Contributor

add a - or something between BOOTSTRAP_PLACEHOLDER_PREFIX and requirement? Alternatively, add it directly to the PREFIX.

Contributor Author

Yep, see b40144f. It already ended in punctuation (i.e. }) but a hyphen can't hurt either.

@DaveCTurner DaveCTurner merged commit 5db7ed2 into elastic:master Jan 22, 2019
@DaveCTurner DaveCTurner deleted the 2018-01-09-simple-cluster-bootstrap-service branch January 22, 2019 11:03
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 22, 2019
* elastic/master: (43 commits)
  Remove remaining occurances of "include_type_name=true" in docs (elastic#37646)
  SQL: Return Intervals in SQL format for CLI (elastic#37602)
  Publish to masters first (elastic#37673)
  Un-assign persistent tasks as nodes exit the cluster (elastic#37656)
  Fail start of non-data node if node has data (elastic#37347)
  Use cancel instead of timeout for aborting publications (elastic#37670)
  Follow stats api should return a 404 when requesting stats for a non existing index (elastic#37220)
  Remove deprecated FieldNamesFieldMapper.Builder#index (elastic#37305)
  Document that date math is locale independent
  Bootstrap a Zen2 cluster once quorum is discovered (elastic#37463)
  Upgrade to lucene-8.0.0-snapshot-83f9835. (elastic#37668)
  Mute failing test
  Fix java time formatters that round up (elastic#37604)
  Removes awaits fix as the fix is in. (elastic#37676)
  Mute failing test
  Ensure that max seq # is equal to the global checkpoint when creating ReadOnlyEngines (elastic#37426)
  Mute failing discovery disruption tests
  Add note about how the body is referenced (elastic#33935)
  Define constants for REST requests endpoints in tests (elastic#37610)
  Make prepare engine step of recovery source non-blocking (elastic#37573)
  ...
@ywelsch ywelsch mentioned this pull request Jan 22, 2019
61 tasks
Labels
:Distributed/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection), >enhancement, v7.0.0-beta1

5 participants