Wait on incoming joins before electing local node as master #12161

Closed
wants to merge 10 commits into
from

Projects

None yet

4 participants

@bleskes
Member
bleskes commented Jul 9, 2015

During master election each node pings in order to discover other nodes and validate the liveness of existing nodes. Based on this information the node either discovers an existing master or, if enough nodes are found (based on `discovery.zen.minimum_master_nodes>>) a new master will be elected.

Currently, the node that is elected as master will currently update it the cluster state to indicate the result of the election. Other nodes will submit a join request to the newly elected master node. Instead of immediately processing the election result, the elected master
node should wait for the incoming joins from other nodes, thus validating the elections result is properly applied. As soon as enough nodes have sent their joins request (based on the minimum_master_nodes settings) the cluster state is modified.

Note that if minimum_master_nodes is not set, this change has no effect.

bleskes added some commits Jul 6, 2015
@bleskes bleskes wip 33c9078
@bleskes bleskes wip 850acc4
@bleskes bleskes wip cc5a881
@bleskes bleskes adding tests, some concurrency issues with double election winning 267e061
@bleskes bleskes Discovery: wait on incoming joins before electing local node as master
During master election each node pings in order to discover other nodes and validate the liveness of existing
nodes. Based on this information the node either discovers an existing master or, if enough nodes are found
(based on `discovery.zen.minimum_master_nodes>>) a new master will be elected.  Currently, the node that is
elected as master will currently update it the cluster state to indicate the result of the election. Other nodes will submit
a join request to the newly elected master node. Instead of immediately processing the election result, the elected master
node should wait for the incoming joins from other nodes, thus validating the elections result is properly applied. As soon as enough
nodes have sent their joins request (based on the `minimum_master_nodes` settings) the cluster state is modified.

Note that if `minimum_master_nodes` is not set, this change has no effect.
d7de212
@bleskes
Member
bleskes commented Jul 9, 2015

@kimchy @martijnvg can you take a look ? thx...

@martijnvg martijnvg commented on the diff Jul 10, 2015
...h/discovery/DiscoveryWithServiceDisruptionsTests.java
@@ -240,12 +240,16 @@ public void failWithMinimumMasterNodesConfigured() throws Exception {
/** Verify that nodes fault detection works after master (re) election */
@Test
public void testNodesFDAfterMasterReelection() throws Exception {
- startCluster(3);
+ startCluster(4);
@martijnvg
martijnvg Jul 10, 2015 Member

what is the reason for changing this test?

@bleskes
bleskes Jul 12, 2015 Member

In the end of the tests we brought the cluster down to one node, which required setting min master node to 1 (which we weren't doing). I felt uncomfortable setting it to 1, I chose to start with 4 nodes bringing the cluster down to 2 (and min master node to 2 as well).

@martijnvg
martijnvg Jul 13, 2015 Member

makes sense, I thought that this was done specifically for this change, but this is just a test improvement.

@martijnvg martijnvg commented on the diff Jul 10, 2015
...ava/org/elasticsearch/discovery/zen/ZenDiscovery.java
@@ -353,6 +358,7 @@ public boolean joiningCluster() {
private void innerJoinCluster() {
DiscoveryNode masterNode = null;
final Thread currentThread = Thread.currentThread();
+ nodeJoinController.startAccumulatingJoins();
@martijnvg
martijnvg Jul 10, 2015 Member

why are we going to accumulate join request while we don't know that we are going to become master master?

@bleskes
bleskes Jul 12, 2015 Member

since the election process is async, it may be that other nodes finish it before we do. It would be a shame to loose their joins. This can't really hurt because 99% of the times, an incoming join will mean the local node will become a master soon and will process it correctly.

@martijnvg
martijnvg Jul 13, 2015 Member

cool, can you add this a comment?

@martijnvg martijnvg and 1 other commented on an outdated diff Jul 10, 2015
...g/elasticsearch/discovery/zen/NodeJoinController.java
+ }
+
+ @Override
+ public void onFailure(Throwable t) {
+ if (closed.compareAndSet(false, true)) {
+ try {
+ onClose();
+ } finally {
+ callback.onFailure(t);
+ }
+ }
+ }
+ }
+
+
+ class ProcessJoinsTask extends ProcessedClusterStateUpdateTask {
@martijnvg
martijnvg Jul 10, 2015 Member

maybe make this class abstract? I assume the idea to extract code from the checkPendingJoinsAndElectIfNeeded() method, is that it otherwise gets difficult to read?

@martijnvg
martijnvg Jul 10, 2015 Member

nevermind... it is being used in two places

@martijnvg martijnvg and 1 other commented on an outdated diff Jul 10, 2015
...g/elasticsearch/discovery/zen/NodeJoinController.java
+ assert electionContext.get() == null : "startAccumulatingJoins() called, but there is an ongoing election context";
+ }
+
+ public void stopAccumulatingJoins() {
+ logger.trace("stopping join accumulation");
+ assert electionContext.get() == null : "stopAccumulatingJoins() called, but there is an ongoing election context";
+ boolean b = accumulateJoins.getAndSet(false);
+ assert b : "stopAccumulatingJoins() called but not accumulating";
+ synchronized (pendingJoinRequests) {
+ if (pendingJoinRequests.size() > 0) {
+ processJoins("stopping to accumulate joins");
+ }
+ }
+ }
+
+ public void handleJoinRequest(final DiscoveryNode node, final MembershipAction.JoinCallback callback) {
@martijnvg
martijnvg Jul 10, 2015 Member

With this change we don't response immediately to a node's join request, if the soon to be elected master node received enough join request it responds to those nodes. I wonder what the downsides are for potentially waiting for some time before responding. (in terms of timing on the joining side and networking)

@martijnvg
martijnvg Jul 10, 2015 Member

I mean the fact that we don't elect a master if enough join requests have been received doesn't mean we need to also wait response to the nodes sending a join request. Those nodes can just wait until they get the next cluster state.

@bleskes
bleskes Jul 12, 2015 Member

I'm not sure I follow you but the semantic on of the join process is such that when it finishes the joining node has to have a master. We check for it here: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java#L426

@martijnvg
martijnvg Jul 13, 2015 Member

true, the node trying join, will then just try again.

What I was wondering about is, if there are any downsides with waiting for return the join request calls? Obviously there is a big upside, that is why this PR exists :) but I'm trying to think about any downsides I'm maybe overlooking, because prior to this change we used return immediately.

@martijnvg martijnvg and 1 other commented on an outdated diff Jul 10, 2015
...ava/org/elasticsearch/discovery/zen/ZenDiscovery.java
@@ -888,11 +865,27 @@ static boolean shouldIgnoreOrRejectNewClusterState(ESLogger logger, ClusterState
}
}
+ /**
+ * In the case we follow an elected master the new cluster state needs to have the same elected master
+ * This method checks for this and throws an exception if needed
+ */
+ public static void rejectNewClusterStateIfNeeded(ESLogger logger, DiscoveryNodes currentNodes, ClusterState newClusterState) {
+ if (currentNodes.masterNodeId() == null) {
+ return;
+ }
+ if (!currentNodes.masterNodeId().equals(newClusterState.nodes().masterNodeId())) {
+ logger.warn("received a cluster state from a different master then the current one, rejecting (received {}, current {})", newClusterState.nodes().masterNode(), currentNodes.masterNode());
@martijnvg
martijnvg Jul 10, 2015 Member

maybe instead log: received a cluster state the follows a different elected master than we are following?

@martijnvg martijnvg commented on the diff Jul 10, 2015
...ava/org/elasticsearch/discovery/zen/ZenDiscovery.java
void handleJoinRequest(final DiscoveryNode node, final MembershipAction.JoinCallback callback) {
if (!transportService.addressSupported(node.address().getClass())) {
// TODO, what should we do now? Maybe inform that node that its crap?
logger.warn("received a wrong address type from [{}], ignoring...", node);
+ } else if (nodeJoinController == null) {
+ throw new IllegalStateException("discovery module is not yet started");
@martijnvg
martijnvg Jul 10, 2015 Member

is it really possible that we receive join requests if ZenDiscover#doStart() hasn't been completed yet? this feels odd to me

@bleskes
bleskes Jul 12, 2015 Member

We register the membership action in the constructor. If the transport is started before ZenDisco (which I think it is) we can get requests at any moment. I think it's there is no harm in being defensive here..

@martijnvg
martijnvg Jul 13, 2015 Member

yes, we can't do too much about this, so it is better be defensive here.

@martijnvg martijnvg and 1 other commented on an outdated diff Jul 10, 2015
...sticsearch/discovery/zen/NodeJoinControllerTests.java
+import org.elasticsearch.test.cluster.TestClusterService;
+import org.elasticsearch.test.junit.annotations.TestLogging;
+import org.junit.Before;
+
+import java.util.*;
+import java.util.concurrent.*;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.concurrent.atomic.AtomicReference;
+
+import static org.hamcrest.Matchers.equalTo;
+import static org.hamcrest.Matchers.instanceOf;
+
+@TestLogging("discovery.zen:TRACE")
+public class NodeJoinControllerTests extends ElasticsearchTestCase {
+
+ TestClusterService clusterService;
@martijnvg
martijnvg Jul 10, 2015 Member

maybe make these fields private?

@martijnvg martijnvg commented on the diff Jul 10, 2015
...sticsearch/discovery/zen/NodeJoinControllerTests.java
+
+ private void joinNode(final DiscoveryNode node) throws InterruptedException, ExecutionException {
+ joinNodeAsync(node).get();
+ }
+
+ protected DiscoveryNode newNode(int i) {
+ return newNode(i, randomBoolean());
+ }
+
+ protected DiscoveryNode newNode(int i, boolean master) {
+ Map<String, String> attributes = new HashMap<>();
+ attributes.put("master", Boolean.toString(master));
+ final String prefix = master ? "master_" : "data_";
+ return new DiscoveryNode(prefix + i, i + "", new LocalTransportAddress("test_" + i), attributes, Version.CURRENT);
+ }
+}
@martijnvg
martijnvg Jul 10, 2015 Member

good stuff! all these unit tests

@martijnvg
Member

@bleskes Left a couple of questions. This is a good improvement to the zen disco!

@bleskes
Member
bleskes commented Jul 12, 2015

@martijnvg Thx. Pushed a new commit.

@martijnvg
Member

@bleskes LTGM

@kimchy
Member
kimchy commented Jul 14, 2015

LGTM, two minor changes: more javadoc on node controller, and break the test for 0 expected and more to 2 tests

@clintongormley clintongormley commented on an outdated diff Jul 14, 2015
docs/resiliency/index.asciidoc
@@ -89,6 +89,19 @@ Further issues remain with the retry mechanism:
See {GIT}9967[#9967]. (STATUS: ONGOING)
[float]
+=== Wait on incoming joins before electing local node as master (STATUS: ONGOING)
+
+During master election each node pings in order to discover other nodes and validate the liveness of existing
+nodes. Based on this information the node either discovers an existing master or, if enough nodes are found
+(see <<master-election,`discovery.zen.minimum_master_nodes>>) a new master will be elected. Currently, the node that is
@clintongormley
clintongormley Jul 14, 2015 Member

the master-election ID is in the ref docs, not in the resiliency page. you need an absolute link for that to work.

@clintongormley clintongormley commented on an outdated diff Jul 14, 2015
docs/resiliency/index.asciidoc
@@ -89,6 +89,19 @@ Further issues remain with the retry mechanism:
See {GIT}9967[#9967]. (STATUS: ONGOING)
[float]
+=== Wait on incoming joins before electing local node as master (STATUS: ONGOING)
+
+During master election each node pings in order to discover other nodes and validate the liveness of existing
+nodes. Based on this information the node either discovers an existing master or, if enough nodes are found
+(see <<master-election,`discovery.zen.minimum_master_nodes>>) a new master will be elected. Currently, the node that is
+elected as master will currently update it the cluster state to indicate the result of the election. Other nodes will submit
@clintongormley
clintongormley Jul 14, 2015 Member

"will currently update it" -> "will update"

@clintongormley clintongormley commented on an outdated diff Jul 14, 2015
docs/resiliency/index.asciidoc
@@ -89,6 +89,19 @@ Further issues remain with the retry mechanism:
See {GIT}9967[#9967]. (STATUS: ONGOING)
[float]
+=== Wait on incoming joins before electing local node as master (STATUS: ONGOING)
+
+During master election each node pings in order to discover other nodes and validate the liveness of existing
+nodes. Based on this information the node either discovers an existing master or, if enough nodes are found
+(see <<master-election,`discovery.zen.minimum_master_nodes>>) a new master will be elected. Currently, the node that is
+elected as master will currently update it the cluster state to indicate the result of the election. Other nodes will submit
+a join request to the newly elected master node. Instead of immediately processing the election result, the elected master
+node should wait for the incoming joins from other nodes, thus validating the elections result is properly applied. As soon as enough
@clintongormley
clintongormley Jul 14, 2015 Member

"the elections result" -> "that the result of the election"

@clintongormley clintongormley and 1 other commented on an outdated diff Jul 14, 2015
docs/resiliency/index.asciidoc
@@ -89,6 +89,19 @@ Further issues remain with the retry mechanism:
See {GIT}9967[#9967]. (STATUS: ONGOING)
[float]
+=== Wait on incoming joins before electing local node as master (STATUS: ONGOING)
+
+During master election each node pings in order to discover other nodes and validate the liveness of existing
+nodes. Based on this information the node either discovers an existing master or, if enough nodes are found
+(see <<master-election,`discovery.zen.minimum_master_nodes>>) a new master will be elected. Currently, the node that is
+elected as master will currently update it the cluster state to indicate the result of the election. Other nodes will submit
+a join request to the newly elected master node. Instead of immediately processing the election result, the elected master
+node should wait for the incoming joins from other nodes, thus validating the elections result is properly applied. As soon as enough
+nodes have sent their joins request (based on the `minimum_master_nodes` settings) the cluster state is modified.
@clintongormley
clintongormley Jul 14, 2015 Member

modified -> updated?

@bleskes bleskes added a commit that closed this pull request Jul 15, 2015
@bleskes bleskes Discovery: wait on incoming joins before electing local node as master
During master election each node pings in order to discover other nodes and validate the liveness of existing nodes. Based on this information the node either discovers an existing master or, if enough nodes are found (based on `discovery.zen.minimum_master_nodes>>) a new master will be elected.

Currently, the node that is elected as master will currently update it the cluster state to indicate the result of the election. Other nodes will submit a join request to the newly elected master node. Instead of immediately processing the election result, the elected master
node should wait for the incoming joins from other nodes, thus validating the elections result is properly applied. As soon as enough nodes have sent their joins request (based on the `minimum_master_nodes` settings) the cluster state is modified.

Note that if `minimum_master_nodes` is not set, this change has no effect.

Closes #12161
1e35bf3
@bleskes bleskes closed this in 1e35bf3 Jul 15, 2015
@bleskes bleskes deleted the bleskes:discovery_wait_for_join_on_elections branch Jul 15, 2015
@clintongormley clintongormley removed the review label Aug 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment