Add current cluster state version to zen pings and use them in master election #20384

Merged
merged 16 commits into elastic:master from bleskes:zen_elect_by_version on Sep 15, 2016

5 participants

@bleskes
Member
bleskes commented Sep 8, 2016

During a network partition, cluster state updates (like mapping changes or shard assignments)
are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and receive the changes they couldn't get before. However, if a second partition occurs while the cluster
is still recovering from the previous one and the old master falls on the minority side, it may be that a new master is elected which has not yet caught up. If that happens, cluster state updates can be lost.

This commit fixes 95% of this rare problem by adding the current cluster state version to PingResponse and using it when deciding which master to join (and thus casting the node's vote).
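As a rough illustration of the new election preference (a sketch only, not the PR's actual code; the helper name compareCandidates is made up here), candidates are ordered by cluster state version first and fall back to node id on a tie:

// Sketch: prefer the candidate with the highest committed cluster state version;
// on a tie, fall back to the lowest node id, as the election did before this change.
static int compareCandidates(DiscoveryNode n1, long version1, DiscoveryNode n2, long version2) {
    int byVersion = Long.compare(version2, version1); // higher version sorts first
    return byVersion != 0 ? byVersion : n1.getId().compareTo(n2.getId());
}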

Note: this doesn't fully mitigate the problem, as a cluster state update issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two-phase commit of cluster state updates) from reaching any single node on the majority side and the partition does allow the master to acknowledge the change. We are working on a more comprehensive fix, but that requires considerable work and is targeted at 6.0.

PS: this PR contains and depends on #20348, which was required for long test runs. That part doesn't need to be reviewed.

bleskes added some commits Sep 5, 2016
@bleskes bleskes failing test 41044ea
@bleskes bleskes first version compiles 221f124
@bleskes bleskes remove unneeded package 588da25
@bleskes bleskes deal with the case where no master nodes are found 5b8971e
@bleskes bleskes more removal of incompatibleMinVersion f158640
@bleskes bleskes Fix LongGCDisruption to be aware of log4j
LongGCDisruption simulates a long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely, the logging infrastructure, which is shared between the nodes, can cause deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at the log4j2 classes introduced in #20235.

This commit also fixes improper handling of retry logic in LongGCDisruption and adds protection against deadlocking the test code which activates the disruption (and uses logging too! :)).

On top of that we have some new, evil and nasty tests.
7ea445a
@bleskes bleskes tweaks
b546a62
@bleskes bleskes Merge remote-tracking branch 'upstream/master' into zen_elect_by_version 9f012eb
@bleskes bleskes Merge branch 'gc_disrupt_log4j' into zen_elect_by_version 9f90a1f
@bleskes bleskes resiliency page update
942523d
@jasontedor jasontedor was assigned by bleskes Sep 8, 2016
@clintongormley clintongormley and 2 others commented on an outdated diff Sep 8, 2016
docs/resiliency/index.asciidoc
@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th
all new scenarios and will report issues that we find on this page and in our GitHub repository.
[float]
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
+
+During a networking partition, cluster states updates (like mapping changes or shard assignments)
+are committed if a majority of the masters node received the update correctly. This means that the current master has access
@clintongormley
clintongormley Sep 8, 2016 Member

masters -> master-eligible

@jasontedor
jasontedor Sep 15, 2016 edited Contributor

As @clintongormley said, masters -> master-eligible but then node -> nodes so it reads majority of the master-eligible nodes....

@clintongormley clintongormley commented on an outdated diff Sep 8, 2016
docs/resiliency/index.asciidoc
@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th
all new scenarios and will report issues that we find on this page and in our GitHub repository.
[float]
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
+
+During a networking partition, cluster states updates (like mapping changes or shard assignments)
+are committed if a majority of the masters node received the update correctly. This means that the current master has access
+to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
+up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
@clintongormley
clintongormley Sep 8, 2016 Member

get the changes they couldn't receive before -> receive the previously missed changes.

@clintongormley
clintongormley Sep 8, 2016 Member

happens -> occurs

@clintongormley clintongormley and 1 other commented on an outdated diff Sep 8, 2016
docs/resiliency/index.asciidoc
@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th
all new scenarios and will report issues that we find on this page and in our GitHub repository.
[float]
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
+
+During a networking partition, cluster states updates (like mapping changes or shard assignments)
+are committed if a majority of the masters node received the update correctly. This means that the current master has access
+to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
+up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
+is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected
@clintongormley
clintongormley Sep 8, 2016 Member

is put in -> falls on

@bleskes
bleskes Sep 15, 2016 Member

changed

@clintongormley clintongormley and 1 other commented on an outdated diff Sep 8, 2016
docs/resiliency/index.asciidoc
@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th
all new scenarios and will report issues that we find on this page and in our GitHub repository.
[float]
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
+
+During a networking partition, cluster states updates (like mapping changes or shard assignments)
+are committed if a majority of the masters node received the update correctly. This means that the current master has access
+to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
+up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
+is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected
+which did not yet catch up. If that happens, cluster state updates can be lost.
@clintongormley
clintongormley Sep 8, 2016 Member

did not -> has not

@bleskes
bleskes Sep 15, 2016 Member

changed

@clintongormley clintongormley and 2 others commented on an outdated diff Sep 8, 2016
docs/resiliency/index.asciidoc
@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th
all new scenarios and will report issues that we find on this page and in our GitHub repository.
[float]
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
+
+During a networking partition, cluster states updates (like mapping changes or shard assignments)
+are committed if a majority of the masters node received the update correctly. This means that the current master has access
+to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
+up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
+is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected
+which did not yet catch up. If that happens, cluster state updates can be lost.
+
+This problem is mostly fixed by {GIT}TBD[#TBD] (v5.0.0), which takes committed cluster states updates into account during master
@clintongormley
clintongormley Sep 8, 2016 Member

states updates -> state updates

@jasontedor
jasontedor Sep 15, 2016 Contributor

The TBD can be updated with a link to this PR now.

@bleskes
bleskes Sep 15, 2016 Member

adapted and replaced

@clintongormley clintongormley and 1 other commented on an outdated diff Sep 8, 2016
docs/resiliency/index.asciidoc
@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th
all new scenarios and will report issues that we find on this page and in our GitHub repository.
[float]
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
+
+During a networking partition, cluster states updates (like mapping changes or shard assignments)
+are committed if a majority of the masters node received the update correctly. This means that the current master has access
+to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
+up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
+is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected
+which did not yet catch up. If that happens, cluster state updates can be lost.
+
+This problem is mostly fixed by {GIT}TBD[#TBD] (v5.0.0), which takes committed cluster states updates into account during master
+election. This considerably reduces the chance of this rare problem to occur but does not fully mitigate it. If the second partition
@clintongormley
clintongormley Sep 8, 2016 Member

to occur -> occurring

@bleskes
bleskes Sep 15, 2016 Member

changed

@clintongormley clintongormley and 2 others commented on an outdated diff Sep 8, 2016
docs/resiliency/index.asciidoc
@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th
all new scenarios and will report issues that we find on this page and in our GitHub repository.
[float]
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
+
+During a networking partition, cluster states updates (like mapping changes or shard assignments)
+are committed if a majority of the masters node received the update correctly. This means that the current master has access
+to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
+up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
+is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected
+which did not yet catch up. If that happens, cluster state updates can be lost.
+
+This problem is mostly fixed by {GIT}TBD[#TBD] (v5.0.0), which takes committed cluster states updates into account during master
+election. This considerably reduces the chance of this rare problem to occur but does not fully mitigate it. If the second partition
+happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be
+that the in flight update will be lost. If the, now isolated, master can still acknowledge the cluster state update to the client this
@clintongormley
clintongormley Sep 8, 2016 Member

, now isolated, -> now isolated

@jasontedor
jasontedor Sep 15, 2016 Contributor

Although it should be now-isolated.

@bleskes
bleskes Sep 15, 2016 Member

went with jason's version

@clintongormley clintongormley and 1 other commented on an outdated diff Sep 8, 2016
docs/resiliency/index.asciidoc
@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th
all new scenarios and will report issues that we find on this page and in our GitHub repository.
[float]
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
+
+During a networking partition, cluster states updates (like mapping changes or shard assignments)
+are committed if a majority of the masters node received the update correctly. This means that the current master has access
+to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
+up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
+is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected
+which did not yet catch up. If that happens, cluster state updates can be lost.
+
+This problem is mostly fixed by {GIT}TBD[#TBD] (v5.0.0), which takes committed cluster states updates into account during master
+election. This considerably reduces the chance of this rare problem to occur but does not fully mitigate it. If the second partition
+happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be
+that the in flight update will be lost. If the, now isolated, master can still acknowledge the cluster state update to the client this
+will amount to a loss of an acknowledge changed. Fixing that last scenario needs considerate work and is currently targeted at (v6.0.0).
@clintongormley
clintongormley Sep 8, 2016 Member

a loss -> the loss

@clintongormley
clintongormley Sep 8, 2016 Member

acknowledge changed -> acknowledged change

@clintongormley
clintongormley Sep 8, 2016 Member

considerate -> considerable

@s1monw s1monw and 1 other commented on an outdated diff Sep 8, 2016
...g/elasticsearch/discovery/zen/ElectMasterService.java
+ @Override
+ public String toString() {
+ return "Candidate{" +
+ "node=" + node +
+ ", clusterStateVersion=" + clusterStateVersion +
+ '}';
+ }
+
+ /**
+ * compares two candidate to indicate who's the a better master.
+ * A higher cluster state version is better
+ *
+ * @return -1 if c1 is a batter candidate, 1 if c2.
+ */
+ public static int compare(Candidate c1, Candidate c2) {
+ int ret = -1 * Long.compare(c1.clusterStateVersion, c2.clusterStateVersion);
@s1monw
s1monw Sep 8, 2016 Contributor

can't you just swap c1 and c2 ? and add a comment in there that it's intentional?

@s1monw
s1monw Sep 15, 2016 Contributor

any updates?

@bleskes
bleskes Sep 15, 2016 Member

changed and added a comment
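For context, the requested change amounts to something along these lines (an illustrative sketch, not the exact committed code; it assumes the existing compareNodes(DiscoveryNode, DiscoveryNode) tie-breaker):

/**
 * Compares two candidates to indicate which is the better master.
 * A higher cluster state version is better.
 */
public static int compare(Candidate c1, Candidate c2) {
    // the arguments are deliberately swapped so that a *higher* cluster state
    // version sorts first, i.e. is considered the better candidate
    int ret = Long.compare(c2.clusterStateVersion, c1.clusterStateVersion);
    if (ret == 0) {
        ret = compareNodes(c1.getNode(), c2.getNode());
    }
    return ret;
}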

@s1monw s1monw and 1 other commented on an outdated diff Sep 8, 2016
...g/elasticsearch/discovery/zen/ElectMasterService.java
+ }
+
+ /**
+ * Elects a new master out of the possible nodes, returning it. Returns <tt>null</tt>
+ * if no master has been elected.
+ */
+ public Candidate electMaster(Collection<Candidate> candidates) {
+ assert hasEnoughCandidates(candidates);
+ List<Candidate> sortedCandidates = new ArrayList<>(candidates);
+ sortedCandidates.sort(Candidate::compare);
+ return sortedCandidates.get(0);
+ }
+
+ /** selects the best active master to join, where multiple are discovered (oh noes) */
+ public DiscoveryNode tieBreakActiveMasters(Collection<DiscoveryNode> activeMasters) {
+ List<DiscoveryNode> tmp = new ArrayList<>(activeMasters);
@s1monw
s1monw Sep 8, 2016 Contributor

activeMasters.stream().min(ElectMasterService::compareNodes);?

@s1monw
s1monw Sep 15, 2016 Contributor

any updates
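The suggestion would turn the tie-break into a one-liner, roughly (sketch; assumes compareNodes can be used as a Comparator and that callers pass at least one active master):

/** selects the best active master to join when multiple are discovered */
public DiscoveryNode tieBreakActiveMasters(Collection<DiscoveryNode> activeMasters) {
    return activeMasters.stream()
        .min(ElectMasterService::compareNodes)
        .get(); // callers guarantee the collection is non-empty
}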

@s1monw s1monw and 2 others commented on an outdated diff Sep 8, 2016
...ava/org/elasticsearch/discovery/zen/ping/ZenPing.java
for (PingResponse ping : pings) {
addPing(ping);
}
}
- /** serialize current pings to an array */
- public synchronized PingResponse[] toArray() {
- return pings.values().toArray(new PingResponse[pings.size()]);
+ /** serialize current pings to an array. It is guaranteed that the array contains one ping response per node */
+ public synchronized List<PingResponse> toList() {
+ return new ArrayList<>(pings.values());
@s1monw
s1monw Sep 8, 2016 Contributor

why do you copy it? Can't you just use Collections.unmodifiableCollection(pings.values()) and let the user do it if needed, or do we have problems with concurrent modifications here? Maybe a CopyOnWriteHashMap would be the right thing to do?

@jasontedor
jasontedor Sep 15, 2016 Contributor

The return value does get modified by a caller in ZenDiscovery#findMaster; I think there is risk of a modification while the caller is copying, so it's better to do the copying under the synchronized lock.

@bleskes
bleskes Sep 15, 2016 Member

This is not high-performance code (especially not when this is called, by which point things have almost always settled down), so I opted for the safest, simplest option (imo): return a new list and don't worry about what people do with it. Going with CopyOnWriteHashMap is a shame because the map is updated quite frequently during the pinging phase.
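In other words, the snapshot is taken while holding the collection's monitor, so callers get an independent copy they can sort or mutate freely. Roughly (a sketch of the approach, not the verbatim class):

static class PingCollection {
    // one ping response per node, updated as responses come in during the pinging rounds
    private final Map<DiscoveryNode, PingResponse> pings = new HashMap<>();

    synchronized void addPing(PingResponse ping) {
        pings.put(ping.node(), ping);
    }

    /** snapshot of the current pings; copied under the lock so later additions can't interfere */
    synchronized List<PingResponse> toList() {
        return new ArrayList<>(pings.values());
    }
}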

@s1monw s1monw commented on the diff Sep 8, 2016
.../elasticsearch/discovery/zen/ping/ZenPingService.java
- latch.countDown();
+ public ZenPing.PingCollection pingAndWait(TimeValue timeout) {
+ final ZenPing.PingCollection response = new ZenPing.PingCollection();
+ final CountDownLatch latch = new CountDownLatch(zenPings.size());
+ for (ZenPing zenPing : zenPings) {
+ final AtomicBoolean counted = new AtomicBoolean();
+ try {
+ zenPing.ping(pings -> {
+ response.addPings(pings);
+ if (counted.compareAndSet(false, true)) {
+ latch.countDown();
+ }
+ }, timeout);
+ } catch (Exception ex) {
+ logger.warn("Ping execution failed", ex);
+ if (counted.compareAndSet(false, true)) {
@s1monw
s1monw Sep 8, 2016 Contributor

isn't it an error condition when it's already counted?

@s1monw
s1monw Sep 15, 2016 Contributor

any updates?

@bleskes
bleskes Sep 15, 2016 Member

at the moment no - but I'd just rather be defensive than worry about it.

@s1monw s1monw and 2 others commented on an outdated diff Sep 8, 2016
...ava/org/elasticsearch/discovery/zen/ping/ZenPing.java
for (PingResponse ping : pings) {
addPing(ping);
}
}
- /** serialize current pings to an array */
- public synchronized PingResponse[] toArray() {
- return pings.values().toArray(new PingResponse[pings.size()]);
+ /** serialize current pings to an array. It is guaranteed that the array contains one ping response per node */
@s1monw
s1monw Sep 8, 2016 Contributor

s/to an array/to a collection/

@jasontedor
jasontedor Sep 15, 2016 Contributor

that the array -> that the collection

@bleskes
bleskes Sep 15, 2016 Member

adapted, but went with List

@jasontedor
Contributor
jasontedor commented Sep 15, 2016 edited

Would you mind merging master in after you integrate #20348?

bleskes added some commits Sep 15, 2016
@bleskes bleskes Merge branch 'master' into zen_elect_by_version
18a6492
@bleskes bleskes removed formatting changes
db9a137
@bleskes bleskes but keep the import change
9ad5c0e
@bleskes
Member
bleskes commented Sep 15, 2016

@jasontedor thx. I merged #20348 and folded master into this PR.

@s1monw

I looked over it one more time; it looks pretty good, but you haven't addressed my last review round's comments yet?

+
+ public Candidate(DiscoveryNode node, long clusterStateVersion) {
+ Objects.requireNonNull(node);
+ assert clusterStateVersion >= -1;
@s1monw
s1monw Sep 15, 2016 edited Contributor

can you please add the clusterStateVersion to the assert message
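e.g. something along the lines of (sketch):

assert clusterStateVersion >= -1 : "expected a cluster state version >= -1 but got [" + clusterStateVersion + "]";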

+ * @return -1 if c1 is a batter candidate, 1 if c2.
+ */
+ public static int compare(Candidate c1, Candidate c2) {
+ int ret = -1 * Long.compare(c1.clusterStateVersion, c2.clusterStateVersion);
@s1monw
s1monw Sep 15, 2016 Contributor

any updates?

+ return count > 0 && (minimumMasterNodes < 0 || count >= minimumMasterNodes);
+ }
+
+ public boolean hasEnoughCandidates(Collection<Candidate> candidates) {
@s1monw
s1monw Sep 15, 2016 Contributor

a javadoc comment would be awesome
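For example (sketch; the Javadoc wording is illustrative, the body mirrors the quoted check above):

/**
 * Returns true iff the given candidates are enough to form a quorum for master election,
 * i.e. there is at least one candidate and, if minimum_master_nodes is configured,
 * the number of candidates meets that requirement.
 */
public boolean hasEnoughCandidates(Collection<Candidate> candidates) {
    return candidates.size() > 0
        && (minimumMasterNodes < 0 || candidates.size() >= minimumMasterNodes);
}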

+
+ /** selects the best active master to join, where multiple are discovered (oh noes) */
+ public DiscoveryNode tieBreakActiveMasters(Collection<DiscoveryNode> activeMasters) {
+ List<DiscoveryNode> tmp = new ArrayList<>(activeMasters);
@s1monw
s1monw Sep 15, 2016 Contributor

any updates

+ /**
+ * the current cluster state version of that node ({@link ElectMasterService.Candidate#UNRECOVERED_CLUSTER_VERSION}
+ * for not recovered) */
+ public long clusterStateVersion() {
@s1monw
s1monw Sep 15, 2016 Contributor

can we use get prefix when these are getters?

+ }, timeout);
+ } catch (Exception ex) {
+ logger.warn("Ping execution failed", ex);
+ if (counted.compareAndSet(false, true)) {
@s1monw
s1monw Sep 15, 2016 Contributor

any updates?

@bleskes
Member
bleskes commented Sep 15, 2016

@s1monw yeah - As the comments were minor, I wanted to go and do it in one sweep with whatever @jasontedor finds. They will be addressed 👍

@bleskes bleskes fix compilation issues
0ef5ec6
+ }
+
+ /**
+ * compares two candidate to indicate who's the a better master.
@jasontedor
jasontedor Sep 15, 2016 Contributor

Nit: candidate -> candidates

@jasontedor
jasontedor Sep 15, 2016 Contributor

Nit: who's the a better -> which is the better

@bleskes
bleskes Sep 15, 2016 Member

changed. You know that to me the nodes are human...

private volatile int minimumMasterNodes;
+ public static class Candidate {
@jasontedor
jasontedor Sep 15, 2016 Contributor

Can this class have Javadocs please?

@jasontedor
jasontedor Sep 15, 2016 edited Contributor

I wonder if the class should be called something like MasterCandidate or CandidateMaster?

@bleskes
bleskes Sep 15, 2016 Member

It's an inner class of ElectMaster, but sure. Can do MasterCandidate
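Which could end up looking roughly like this (sketch, assuming the rename goes through; Javadoc wording is illustrative):

/**
 * A master-eligible node and the version of the last cluster state it knows of,
 * used to decide which node makes the best master during election.
 */
public static class MasterCandidate {
    /** cluster state version used when a node has not yet recovered its state */
    public static final long UNRECOVERED_CLUSTER_VERSION = -1;

    private final DiscoveryNode node;
    private final long clusterStateVersion;

    public MasterCandidate(DiscoveryNode node, long clusterStateVersion) {
        Objects.requireNonNull(node);
        assert clusterStateVersion >= -1 : "got [" + clusterStateVersion + "]";
        this.node = node;
        this.clusterStateVersion = clusterStateVersion;
    }

    public DiscoveryNode getNode() {
        return node;
    }

    public long getClusterStateVersion() {
        return clusterStateVersion;
    }
}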

+ return sortedCandidates.get(0);
+ }
+
+ /** selects the best active master to join, where multiple are discovered (oh noes) */
@jasontedor
jasontedor Sep 15, 2016 Contributor

Drop the "oh noes"?

@bleskes
bleskes Sep 15, 2016 Member

party pooper. removed.

out.writeLong(id);
}
@Override
public String toString() {
- return "ping_response{node [" + node + "], id[" + id + "], master [" + master + "], hasJoinedOnce [" + hasJoinedOnce + "], cluster_name[" + clusterName.value() + "]}";
+ return "ping_response{node [" + node + "], id[" + id + "], master [" + master + "], cs version [" + clusterStateVersion
@jasontedor
jasontedor Sep 15, 2016 Contributor

Nit: cs version -> cluster_state_version, please.

@bleskes
bleskes Sep 15, 2016 Member

sooo long. replaced.

- /** serialize current pings to an array */
- public synchronized PingResponse[] toArray() {
- return pings.values().toArray(new PingResponse[pings.size()]);
+ /** serialize current pings to an array. It is guaranteed that the array contains one ping response per node */
@jasontedor
jasontedor Sep 15, 2016 Contributor

that the array -> that the collection

- return pings.values().toArray(new PingResponse[pings.size()]);
+ /** serialize current pings to an array. It is guaranteed that the array contains one ping response per node */
+ public synchronized List<PingResponse> toList() {
+ return new ArrayList<>(pings.values());
@jasontedor
jasontedor Sep 15, 2016 Contributor

The return value does get modified by a caller in ZenDiscovery#findMaster; I think there is risk of a modification while the caller is copying, so it's better to do the copying under the synchronized lock.

docs/resiliency/index.asciidoc
@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th
all new scenarios and will report issues that we find on this page and in our GitHub repository.
[float]
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
+
+During a networking partition, cluster states updates (like mapping changes or shard assignments)
@jasontedor
jasontedor Sep 15, 2016 Contributor

states -> state

docs/resiliency/index.asciidoc
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
+
+During a networking partition, cluster states updates (like mapping changes or shard assignments)
+are committed if a majority of the masters node received the update correctly. This means that the current master has access
@jasontedor
jasontedor Sep 15, 2016 edited Contributor

As @clintongormley said, masters -> master-eligible but then node -> nodes so it reads majority of the master-eligible nodes....

docs/resiliency/index.asciidoc
+is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected
+which did not yet catch up. If that happens, cluster state updates can be lost.
+
+This problem is mostly fixed by {GIT}TBD[#TBD] (v5.0.0), which takes committed cluster states updates into account during master
@jasontedor
jasontedor Sep 15, 2016 Contributor

The TBD can be updated with a link to this PR now.

docs/resiliency/index.asciidoc
+This problem is mostly fixed by {GIT}TBD[#TBD] (v5.0.0), which takes committed cluster states updates into account during master
+election. This considerably reduces the chance of this rare problem to occur but does not fully mitigate it. If the second partition
+happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be
+that the in flight update will be lost. If the, now isolated, master can still acknowledge the cluster state update to the client this
@jasontedor
jasontedor Sep 15, 2016 Contributor

Although it should be now-isolated.

+ public void testIsolateAll() {
+ Set<String> nodes = generateRandomStringSet(1, 10);
+ NetworkDisruption.DisruptedLinks topology = new NetworkDisruption.IsolateAllNodes(nodes);
+ for (int i = 0; i < 10; i++) {
@jasontedor
jasontedor Sep 15, 2016 edited Contributor

Why not test all possible pairs, it's only 10 choose 2?

@bleskes
bleskes Sep 15, 2016 Member

yeah, it's a balancing act between speed and the chance the test fails if you get something wrong. I also just hate the resulting double loop for a "check all combinations"
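For reference, the exhaustive variant being discussed would be a small double loop, roughly (sketch; it assumes DisruptedLinks#disrupt(String, String) reports whether the link between two nodes is cut):

List<String> nodeList = new ArrayList<>(nodes);
for (int i = 0; i < nodeList.size(); i++) {
    for (int j = i + 1; j < nodeList.size(); j++) {
        // with IsolateAllNodes every distinct pair of nodes should be disrupted
        assertTrue(topology.disrupt(nodeList.get(i), nodeList.get(j)));
    }
}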

ElectMasterService service = electMasterService();
- int min_master_nodes = randomIntBetween(0, nodes.size());
+ int min_master_nodes = randomIntBetween(0, candidates.size());
@jasontedor
jasontedor Sep 15, 2016 Contributor

While we are here, can we give this a proper Java variable name (minMasterNodes)?

@bleskes
bleskes Sep 15, 2016 Member

changed.

- } else if (min_master_nodes > 0 && master_nodes < min_master_nodes) {
- assertNull(master);
- } else {
+ Candidate master = service.electMaster(candidates);
assertNotNull(master);
@jasontedor
jasontedor Sep 15, 2016 Contributor

The indentation is off here and the rest of the way through this test.

- assertTrue(master.getId().compareTo(node.getId()) <= 0);
+ for (Candidate candidate : candidates) {
+ if (candidate.getNode().equals(master.getNode())) {
+ // meh
@jasontedor
jasontedor Sep 15, 2016 Contributor

Maybe a more descriptive comment? 😄

@bleskes
bleskes Sep 15, 2016 Member

I made a longer but just as meaningless text :)

+ assertThat("candidate " + candidate + " has a lower or equal id than master " + master, candidate.getNode().getId(),
+ greaterThan(master.getNode().getId()));
+ } else {
+ assertThat("candidate " + master + " has a higher id than candidate " + candidate, master.getClusterStateVersion(),
@jasontedor
jasontedor Sep 15, 2016 Contributor

This should say higher cluster state version instead of higher id.

@@ -1189,6 +1192,61 @@ public void testIndicesDeleted() throws Exception {
assertFalse(client().admin().indices().prepareExists(idxName).get().isExists());
}
+ public void testElectMasterWithLatestVersion() throws Exception {
@jasontedor
jasontedor Sep 15, 2016 Contributor

This is a beautiful test.

@jasontedor

Thanks @bleskes, I left some feedback. In general, it looks sound.

+ final AtomicBoolean counted = new AtomicBoolean();
+ try {
+ zenPing.ping(pings -> {
+ response.addPings(pings);
@jasontedor
jasontedor Sep 15, 2016 Contributor

Should the add pings only be done inside the guard?

@bleskes
bleskes Sep 15, 2016 Member

well - it doesn't really matter. I figured every extra bit of information, if we manage to get it in, counts

bleskes added some commits Sep 15, 2016
@bleskes bleskes Merge branch 'master' of github.com:elastic/elasticsearch into zen_elect_by_version
6cde8b3
@bleskes bleskes feedback
2394902
@bleskes
Member
bleskes commented Sep 15, 2016

thx @jasontedor, @s1monw and @clintongormley. I addressed all the comments.

@jasontedor

LGTM

@bleskes bleskes merged commit 577dcb3 into elastic:master Sep 15, 2016

1 of 2 checks passed

elasticsearch-ci Build finished.
CLA Commit author is a member of Elasticsearch
@bleskes bleskes deleted the bleskes:zen_elect_by_version branch Sep 15, 2016
@bleskes bleskes added a commit that referenced this pull request Sep 15, 2016
@bleskes bleskes Add current cluster state version to zen pings and use them in master election (#20384)

During a network partition, cluster state updates (like mapping changes or shard assignments)
are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and receive the changes they couldn't get before. However, if a second partition occurs while the cluster
is still recovering from the previous one *and* the old master falls on the minority side, it may be that a new master is elected which has not yet caught up. If that happens, cluster state updates can be lost.

This commit fixes 95% of this rare problem by adding the current cluster state version to `PingResponse` and using it when deciding which master to join (and thus casting the node's vote).

Note: this doesn't fully mitigate the problem, as a cluster state update issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two-phase commit of cluster state updates) from reaching any single node on the majority side *and* the partition does allow the master to acknowledge the change. We are working on a more comprehensive fix, but that requires considerable work and is targeted at 6.0.
3ad66ba
@bleskes bleskes added a commit that referenced this pull request Sep 15, 2016
@bleskes bleskes Add current cluster state version to zen pings and use them in master election (#20384)

During a network partition, cluster state updates (like mapping changes or shard assignments)
are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and receive the changes they couldn't get before. However, if a second partition occurs while the cluster
is still recovering from the previous one *and* the old master falls on the minority side, it may be that a new master is elected which has not yet caught up. If that happens, cluster state updates can be lost.

This commit fixes 95% of this rare problem by adding the current cluster state version to `PingResponse` and using it when deciding which master to join (and thus casting the node's vote).

Note: this doesn't fully mitigate the problem, as a cluster state update issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two-phase commit of cluster state updates) from reaching any single node on the majority side *and* the partition does allow the master to acknowledge the change. We are working on a more comprehensive fix, but that requires considerable work and is targeted at 6.0.
95992a1
@clintongormley clintongormley added the bug label Sep 19, 2016
@clintongormley clintongormley removed the v5.1.1 label Dec 8, 2016
@makeyang
Contributor
makeyang commented Jan 19, 2017 edited

this one, plus logical time, plus PR #13062 really makes a Raft.
Then you guys will make PacificA, your log replication method, really solid, because PacificA requires Raft/Paxos to maintain the replica set configuration.

@bleskes
Member
bleskes commented Jan 19, 2017

@makeyang there are a lot of similarities between ZenDiscovery and Raft if you look at it the right way (although ZenDiscovery was built before Raft existed). I'm not 100% sure what you mean, but a PacificA-like log replication model requires an external consensus oracle. For PacificA it indeed doesn't matter which consensus algorithm you use.

@makeyang
Contributor
makeyang commented Jan 19, 2017 edited

@bleskes that's all I mean: as long as leader election is a solid consensus, you make log replication solid.
Just wondering: no matter what you call it, ZenDiscovery or whatever, and no matter whether it came before or after Raft, what really matters is that it used to go wrong and it still goes wrong. So why not just make it Raft, especially since it is really similar to Raft?

@makeyang
Contributor

@bleskes just another question, which is more serious than the last one: in its current state, ES won't pass a Jepsen-like test, right?

@bleskes
Member
bleskes commented Jan 19, 2017

@makeyang sadly there is no "just implement it" in distributed systems. It's a long process consisting of small steps. This and other PRs you follow are part of that journey.

@bleskes
Member
bleskes commented Jan 19, 2017

ES won't pass a Jepsen-like test, right?

That's a broad question. ES 5.0 is light years ahead of 1.x, but there are still known issues. You can read about them in our documentation here.

@makeyang
Contributor
makeyang commented Jan 19, 2017 edited

@bleskes I agree with what you said, but please make these critical small steps, which impact data safety, faster and faster, before ES ruins its reputation the way MongoDB did through its carelessness about data loss.

@bleskes
Member
bleskes commented Jan 19, 2017

@makeyang we're making them as fast as we can responsibly make them.

carelessness about data loss.

I think this very conversation shows otherwise. If you are speaking from experience, please do share your problem so we can see if it has already been solved or we need to fix something and add it to the working queue. Abstract claims are dangerous and hard to address.

@makeyang
Contributor

@bleskes what I mentioned is MongoDB; just google "mongodb loses data" and you'll see that. I'm not saying ES.
I'll share anything related to ES on GitHub or in the discussion forums.
