
Add current cluster state version to zen pings and use them in master election #20384

Merged
merged 16 commits into from
Sep 15, 2016

Conversation

bleskes
Contributor

@bleskes bleskes commented Sep 8, 2016

During a network partition, cluster state updates (like mapping changes or shard assignments)
are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and receive the changes they couldn't receive before. However, if a second partition happens while the cluster
is still recovering from the previous one and the old master ends up on the minority side, a new master may be elected that has not yet caught up. If that happens, cluster state updates can be lost.

This commit fixes 95% of this rare problem by adding the current cluster state version to PingResponse and using it when deciding which master to join (and thus casting the node's vote).

Note: this doesn't fully mitigate the problem, as a cluster state update issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two-phase commit of cluster state updates) from reaching any single node on the majority side while still allowing the master to acknowledge the change. We are working on a more comprehensive fix, but that requires considerable work and is targeted at 6.0.
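The election preference described in this PR can be sketched as a comparator: prefer the candidate with the highest cluster state version, and fall back to the node-id ordering only to break ties. This is a simplified illustration with made-up names (`Candidate`, `electMaster`), not the actual `ElectMasterService` code:

```java
import java.util.Comparator;
import java.util.List;

// Simplified sketch (not the actual Elasticsearch classes): a master candidate
// carries the node id and the cluster state version it has applied.
class Candidate {
    final String nodeId;
    final long clusterStateVersion;
    Candidate(String nodeId, long clusterStateVersion) {
        this.nodeId = nodeId;
        this.clusterStateVersion = clusterStateVersion;
    }
}

public class ElectionSketch {
    // Prefer the candidate with the highest cluster state version; break ties
    // by the lowest node id (the pre-existing election criterion).
    static final Comparator<Candidate> BEST_MASTER_FIRST =
        Comparator.comparingLong((Candidate c) -> c.clusterStateVersion).reversed()
                  .thenComparing((Candidate c) -> c.nodeId);

    static Candidate electMaster(List<Candidate> candidates) {
        return candidates.stream().min(BEST_MASTER_FIRST).orElse(null);
    }

    public static void main(String[] args) {
        Candidate stale = new Candidate("node-a", 5);    // lower id, but behind
        Candidate caughtUp = new Candidate("node-b", 9); // newer cluster state
        Candidate winner = electMaster(List.of(stale, caughtUp));
        System.out.println(winner.nodeId); // node-b wins despite the higher id
    }
}
```

The point of the sketch is the ordering: a node that has applied a newer cluster state always beats a node with a lower id, which is what prevents a stale node from winning the election after a partition.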

PS: this PR contains and depends on #20348 , which was required for the long-running tests. That part doesn't need to be reviewed.

LongGCDisruption simulates a long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely, the logging infrastructure, which is shared between the nodes, can cause deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at the log4j2 classes introduced in elastic#20235

This commit also fixes improper handling of retry logic in LongGCDisruption and adds protection against deadlocking the test code that activates the disruption (which uses logging too! :)).
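The shared-lock hazard can be illustrated abstractly: a thread frozen mid-"GC" while holding a lock shared between nodes (standing in here for the logging infrastructure) blocks every other thread that needs that lock. This is a hypothetical sketch of the hazard, not the actual LongGCDisruption code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of the deadlock hazard: a thread frozen during a
// simulated GC pause while holding a lock shared between nodes blocks
// every other node that needs the lock.
public class SharedLockHazard {
    static final ReentrantLock sharedLogLock = new ReentrantLock();

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch lockHeld = new CountDownLatch(1);

        Thread frozenNodeThread = new Thread(() -> {
            sharedLogLock.lock(); // "node A" logs, taking the shared lock
            lockHeld.countDown();
            try {
                Thread.sleep(Long.MAX_VALUE); // simulated long GC pause
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                sharedLogLock.unlock();
            }
        });
        frozenNodeThread.start();
        lockHeld.await();

        // "node B" now tries to log: it cannot acquire the lock while the
        // frozen thread holds it, which is exactly why the disruption must
        // avoid suspending threads inside such shared infrastructure.
        boolean acquired = sharedLogLock.tryLock(200, TimeUnit.MILLISECONDS);
        System.out.println("node B acquired logging lock: " + acquired); // false

        frozenNodeThread.interrupt();
        frozenNodeThread.join();
    }
}
```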

On top of that we have some new, evil and nasty tests.
@bleskes bleskes added resiliency :Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure v5.0.0-beta1 labels Sep 8, 2016
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)

During a networking partition, cluster states updates (like mapping changes or shard assignments)
are committed if a majority of the masters node received the update correctly. This means that the current master has access


masters -> master-eligible

Member

As @clintongormley said, masters -> master-eligible but then node -> nodes so it reads majority of the master-eligible nodes....

Contributor Author

check

@jasontedor
Member

jasontedor commented Sep 15, 2016

Would you mind merging master in after you integrate #20348?

}

/**
* compares two candidate to indicate who's the a better master.
Member

Nit: candidate -> candidates

Member

Nit: who's the a better -> which is the better

Contributor Author

changed. You know that to me the nodes are human...

private volatile int minimumMasterNodes;

public static class Candidate {
Member

Can this class have Javadocs please?

Member

I wonder if the class should be called something like MasterCandidate or CandidateMaster?

Contributor Author

It's an inner class of ElectMaster, but sure. Can do MasterCandidate

return sortedCandidates.get(0);
}

/** selects the best active master to join, where multiple are discovered (oh noes) */
Member

Drop the "oh noes"?

Contributor Author

party pooper. removed.

out.writeLong(id);
}

@Override
public String toString() {
return "ping_response{node [" + node + "], id[" + id + "], master [" + master + "], hasJoinedOnce [" + hasJoinedOnce + "], cluster_name[" + clusterName.value() + "]}";
return "ping_response{node [" + node + "], id[" + id + "], master [" + master + "], cs version [" + clusterStateVersion
Member

Nit: cs version -> cluster_state_version, please.

Contributor Author

sooo long. replaced.

@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th
all new scenarios and will report issues that we find on this page and in our GitHub repository.

[float]
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)

During a networking partition, cluster states updates (like mapping changes or shard assignments)
Member

states -> state

Contributor Author

yep

public void testIsolateAll() {
Set<String> nodes = generateRandomStringSet(1, 10);
NetworkDisruption.DisruptedLinks topology = new NetworkDisruption.IsolateAllNodes(nodes);
for (int i = 0; i < 10; i++) {
Member

Why not test all possible pairs, it's only 10 choose 2?

Contributor Author

yeah, it's a balancing act between speed and the chance that the test fails if you get something wrong. I also just hate the resulting double loop for a "check all combinations" test
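For reference, the exhaustive alternative being discussed is a double loop over all unordered pairs, which for 10 nodes is only 45 checks. A minimal sketch with illustrative names (not the actual test code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "check all combinations" double loop: enumerating every
// unordered pair of nodes is n*(n-1)/2 checks, so 45 pairs for 10 nodes.
public class AllPairsSketch {
    static List<String[]> allPairs(List<String> nodes) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < nodes.size(); i++) {
            for (int j = i + 1; j < nodes.size(); j++) {
                pairs.add(new String[] { nodes.get(i), nodes.get(j) });
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            nodes.add("node-" + i);
        }
        System.out.println(allPairs(nodes).size()); // 45 = 10 choose 2
    }
}
```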

ElectMasterService service = electMasterService();
int min_master_nodes = randomIntBetween(0, nodes.size());
int min_master_nodes = randomIntBetween(0, candidates.size());
Member

While we are here, can we give this a proper Java variable name (minMasterNodes)?

Contributor Author

changed.

} else if (min_master_nodes > 0 && master_nodes < min_master_nodes) {
assertNull(master);
} else {
Candidate master = service.electMaster(candidates);
assertNotNull(master);
Member

The indentation is off here and the rest of the way through this test.

Contributor Author

fixed

assertTrue(master.getId().compareTo(node.getId()) <= 0);
for (Candidate candidate : candidates) {
if (candidate.getNode().equals(master.getNode())) {
// meh
Member

Maybe a more descriptive comment? 😄

Contributor Author

I made a longer but just as meaningless text :)

assertThat("candidate " + candidate + " has a lower or equal id than master " + master, candidate.getNode().getId(),
greaterThan(master.getNode().getId()));
} else {
assertThat("candidate " + master + " has a higher id than candidate " + candidate, master.getClusterStateVersion(),
Member

This should say higher cluster state version instead of higher id.

Contributor Author

oops

@@ -1189,6 +1192,61 @@ public void testIndicesDeleted() throws Exception {
assertFalse(client().admin().indices().prepareExists(idxName).get().isExists());
}

public void testElectMasterWithLatestVersion() throws Exception {
Member

This is a beautiful test.

Member

@jasontedor jasontedor left a comment


Thanks @bleskes, I left some feedback. In general, it looks sound.

final AtomicBoolean counted = new AtomicBoolean();
try {
zenPing.ping(pings -> {
response.addPings(pings);
Member

Should the add pings only be done inside the guard?

Contributor Author

tja - doesn't really matter. I figured every extra bit of information, if we manage to get it in, counts
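The pattern under discussion can be sketched as follows: the callback records every batch of pings, while an AtomicBoolean guard ensures the completion signal fires only once. The names here are illustrative, not the actual ZenPing API:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the pattern: a ping callback accumulates responses, while an
// AtomicBoolean guard makes the "done" latch count down only once even if
// the callback fires multiple times.
public class PingCollectorSketch {
    final Set<String> responses = Collections.synchronizedSet(new HashSet<>());
    final AtomicBoolean counted = new AtomicBoolean();
    final CountDownLatch done = new CountDownLatch(1);

    // Called whenever a batch of pings arrives. Adding responses outside the
    // guard means late batches are still recorded, matching the author's
    // point that every extra bit of information counts.
    void onPings(Set<String> pings) {
        responses.addAll(pings);
        if (counted.compareAndSet(false, true)) {
            done.countDown();
        }
    }

    public static void main(String[] args) {
        PingCollectorSketch collector = new PingCollectorSketch();
        collector.onPings(Set.of("node-1"));
        collector.onPings(Set.of("node-2")); // late batch: still recorded
        System.out.println(collector.responses.size()); // 2
        System.out.println(collector.done.getCount());  // 0, counted down once
    }
}
```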

@bleskes
Contributor Author

bleskes commented Sep 15, 2016

thx @jasontedor, @s1monw and @clintongormley. I addressed all the comments.

Member

@jasontedor jasontedor left a comment


LGTM

@bleskes bleskes merged commit 577dcb3 into elastic:master Sep 15, 2016
@bleskes bleskes deleted the zen_elect_by_version branch September 15, 2016 21:39
bleskes added a commit that referenced this pull request Sep 15, 2016
… election (#20384)

During a network partition, cluster state updates (like mapping changes or shard assignments)
are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and receive the changes they couldn't receive before. However, if a second partition happens while the cluster
is still recovering from the previous one *and* the old master ends up on the minority side, a new master may be elected that has not yet caught up. If that happens, cluster state updates can be lost.

This commit fixes 95% of this rare problem by adding the current cluster state version to `PingResponse` and using it when deciding which master to join (and thus casting the node's vote).

Note: this doesn't fully mitigate the problem, as a cluster state update issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two-phase commit of cluster state updates) from reaching any single node on the majority side *and* the partition does allow the master to acknowledge the change. We are working on a more comprehensive fix, but that requires considerable work and is targeted at 6.0.
bleskes added a commit that referenced this pull request Sep 15, 2016
… election (#20384)

@makeyang
Contributor

makeyang commented Jan 19, 2017

this one plus logical time plus this PR: #13062 really makes a Raft.
Then you guys will make PacificA; your log replication method will be really solid, because PacificA requires Raft/Paxos to maintain the replication set config.

@bleskes
Contributor Author

bleskes commented Jan 19, 2017

@makeyang there are a lot of similarities between ZenDiscovery and Raft if you look at it the right way (although ZenDiscovery was built before Raft existed). I'm not 100% sure what you mean, but a PacificA-like log replication model requires an external consensus oracle. For PacificA it indeed doesn't matter which consensus algorithm you use.

@makeyang
Contributor

makeyang commented Jan 19, 2017

@bleskes that's all I mean: as long as leader election is a solid consensus, you can make log replication solid.
Just wondering: no matter what you call it, ZenDiscovery or whatever, and no matter whether it came before or after Raft, what really matters is that it used to be wrong and it is still wrong. So why not just make it Raft, especially since it is really similar to Raft?

@makeyang
Contributor

@bleskes just another question, more serious than the last one: in its current condition, ES won't pass a Jepsen-like test, right?

@bleskes
Contributor Author

bleskes commented Jan 19, 2017

@makeyang sadly there is no "just implement it" in distributed systems. It's a long process consisting of small steps. This and other PRs you follow are part of that journey.

@bleskes
Contributor Author

bleskes commented Jan 19, 2017

ES won't pass jepsen-like test, right?

That's a broad question. ES 5.0 is light years ahead of 1.x, but there are still known issues. You can read about them in our documentation here.

@makeyang
Contributor

makeyang commented Jan 19, 2017

@bleskes I agree with what you said, but please make these critical small steps, which impact data safety, faster and faster, before ES ruins its reputation over data loss like MongoDB did.

@bleskes
Contributor Author

bleskes commented Jan 19, 2017

@makeyang we're making them as fast as we can responsibly make them.

careless to data loss.

I think this very conversation shows otherwise. If you are speaking from experience, please do share your problem so we can see if it has already been solved or we need to fix something and add it to the working queue. Abstract claims are dangerous and hard to address.

@makeyang
Contributor

@bleskes what I mentioned is MongoDB; just google "mongodb loses data" and you'll see it. I'm not saying ES.
I'll share anything related to ES on GitHub or the discussion forum.

@makeyang
Contributor

makeyang commented May 2, 2017

@bleskes I have a question related to PacificA.
According to the "Change of Primary" section of the paper:
"During reconciliation, p sends prepare messages for uncommitted requests in its prepared list and have them committed on the new configuration"
Assume the following scenario:
the primary sends an update request (call it RN), with its configuration version and serial number, in a prepare message to all replicas, and a network partition happens; some replicas get this prepare message and put it into their prepared lists while others do not.
The old primary will ack failure to the client.
Then one secondary which got RN acquires the primary lease and "sends prepare messages for uncommitted requests in its prepared list and have them committed on the new configuration".
So by accident, although the system achieves consistency, it ends up with data it shouldn't have.
Do I miss something?

@jasontedor
Member

jasontedor commented May 2, 2017

@makeyang Sorry, but this is not the place for discussion, we have the forum for that. However, general questions that are not at all specific to Elasticsearch are outside what you can expect to be answered there.

Labels
>bug :Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure resiliency v5.0.0-beta1 v6.0.0-alpha1