Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add current cluster state version to zen pings and use them in master election #20384

Merged
merged 16 commits into from Sep 15, 2016

Conversation

Projects
None yet
5 participants
@bleskes
Copy link
Member

commented Sep 8, 2016

During a networking partition, cluster states updates (like mapping changes or shard assignments)
are committed if a majority of the masters node received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
is still recovering from the previous one and the old master is put in the minority side, it may be that a new master is elected which did not yet catch up. If that happens, cluster state updates can be lost.

This commit fixed 95% of this rare problem by adding the current cluster state version to PingResponse and use them when deciding which master to join (and thus casting the node's vote).

Note: this doesn't fully mitigate the problem as a cluster state update which is issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two phased commit of cluster state updates) from reaching any single node in the majority side and the partition does allow for the master to acknowledge the change. We are working on a more comprehensive fix but that requires considerate work and is targeted at 6.0.

PS this PR contains and depends on #20348 , which was required for long testing. That part doesn't need to be reviewed.

bleskes added some commits Sep 5, 2016

Fix LongGCDisruption to be aware of log4j
LongGCDisruption simulates a Long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely the logging infrastructure, which is shared between the nodes, can cause some deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at log4j2 classes, introduced in #20235

This commit also fixes improper handling of retry logic in LongGCDisruption and adds a protection against deadlocking the test code which activates the disruption (and uses logging too! :)).

On top of that we have some new, evil and nasty tests.
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)

During a networking partition, cluster states updates (like mapping changes or shard assignments)
are committed if a majority of the masters node received the update correctly. This means that the current master has access

This comment has been minimized.

Copy link
@clintongormley

clintongormley Sep 8, 2016

Member

masters -> master-eligible

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

As @clintongormley said, masters -> master-eligible but then node -> nodes so it reads majority of the master-eligible nodes....

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

check

During a networking partition, cluster states updates (like mapping changes or shard assignments)
are committed if a majority of the masters node received the update correctly. This means that the current master has access
to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster

This comment has been minimized.

Copy link
@clintongormley

clintongormley Sep 8, 2016

Member

get the changes they couldn't receive before -> receive the previously missed changes.

This comment has been minimized.

Copy link
@clintongormley

clintongormley Sep 8, 2016

Member

happens -> occurs

are committed if a majority of the masters node received the update correctly. This means that the current master has access
to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected

This comment has been minimized.

Copy link
@clintongormley

clintongormley Sep 8, 2016

Member

is put in -> falls on

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

changed

to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected
which did not yet catch up. If that happens, cluster state updates can be lost.

This comment has been minimized.

Copy link
@clintongormley

clintongormley Sep 8, 2016

Member

did not -> has not

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

changed

is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected
which did not yet catch up. If that happens, cluster state updates can be lost.

This problem is mostly fixed by {GIT}TBD[#TBD] (v5.0.0), which takes committed cluster states updates into account during master

This comment has been minimized.

Copy link
@clintongormley

clintongormley Sep 8, 2016

Member

states updates -> state updates

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

The TBD can be updated with a link to this PR now.

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

adapted and replaced

which did not yet catch up. If that happens, cluster state updates can be lost.

This problem is mostly fixed by {GIT}TBD[#TBD] (v5.0.0), which takes committed cluster states updates into account during master
election. This considerably reduces the chance of this rare problem to occur but does not fully mitigate it. If the second partition

This comment has been minimized.

Copy link
@clintongormley

clintongormley Sep 8, 2016

Member

to occur -> occurring

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

changed

This problem is mostly fixed by {GIT}TBD[#TBD] (v5.0.0), which takes committed cluster states updates into account during master
election. This considerably reduces the chance of this rare problem to occur but does not fully mitigate it. If the second partition
happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be
that the in flight update will be lost. If the, now isolated, master can still acknowledge the cluster state update to the client this

This comment has been minimized.

Copy link
@clintongormley

clintongormley Sep 8, 2016

Member

, now isolated, -> now isolated

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

Although it should be now-isolated.

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

went with jason's version

election. This considerably reduces the chance of this rare problem to occur but does not fully mitigate it. If the second partition
happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be
that the in flight update will be lost. If the, now isolated, master can still acknowledge the cluster state update to the client this
will amount to a loss of an acknowledge changed. Fixing that last scenario needs considerate work and is currently targeted at (v6.0.0).

This comment has been minimized.

Copy link
@clintongormley

clintongormley Sep 8, 2016

Member

a loss -> the loss

This comment has been minimized.

Copy link
@clintongormley

clintongormley Sep 8, 2016

Member

acknowledge changed -> acknowledged change

This comment has been minimized.

Copy link
@clintongormley

clintongormley Sep 8, 2016

Member

considerate -> considerable

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

fixed

* @return -1 if c1 is a batter candidate, 1 if c2.
*/
public static int compare(Candidate c1, Candidate c2) {
int ret = -1 * Long.compare(c1.clusterStateVersion, c2.clusterStateVersion);

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 8, 2016

Contributor

can't you just swap c1 and c2 ? and add a comment in there that it's intentional?

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 15, 2016

Contributor

any updates?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

changed and added a comment


/** selects the best active master to join, where multiple are discovered (oh noes) */
public DiscoveryNode tieBreakActiveMasters(Collection<DiscoveryNode> activeMasters) {
List<DiscoveryNode> tmp = new ArrayList<>(activeMasters);

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 8, 2016

Contributor

activeMasters.stream().min(ElectMasterService::compareNodes);?

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 15, 2016

Contributor

any updates

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

done

return pings.values().toArray(new PingResponse[pings.size()]);
/** serialize current pings to an array. It is guaranteed that the array contains one ping response per node */
public synchronized List<PingResponse> toList() {
return new ArrayList<>(pings.values());

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 8, 2016

Contributor

why do you copy it? can't you just use Collections.unmodifiableCollection(ping.values()) and let the user do it if needed or do we have problems with concurrent modifications here? Maybe a CopyOnWriteHashMap would be the right thing todo?

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

The return value does get modified by a caller in ZenDiscovery#findMaster; I think there is risk of a modification while the caller is copying, so it's better to do the copying under the synchronized lock.

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

This is not a high performance code (especially not when this is called, where almost always things settled down) so I opted for the safest simplest option (imo) - return a new list and not worry what people do with it. Going with CopyOnWriteHashMap is a shame because it is updated quite frequently during the pinging phase.

}, timeout);
} catch (Exception ex) {
logger.warn("Ping execution failed", ex);
if (counted.compareAndSet(false, true)) {

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 8, 2016

Contributor

isn't it an error condition when it's already counted?

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 15, 2016

Contributor

any updates?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

at the moment no - but just rather be defensive then worry about it.

/** serialize current pings to an array */
public synchronized PingResponse[] toArray() {
return pings.values().toArray(new PingResponse[pings.size()]);
/** serialize current pings to an array. It is guaranteed that the array contains one ping response per node */

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 8, 2016

Contributor

s/to an array/to a collection/

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

that the array -> that the collection

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

adapted, but when with List

@jasontedor

This comment has been minimized.

Copy link
Member

commented Sep 15, 2016

Would you mind merging master in after you integrate #20348?

@bleskes bleskes force-pushed the bleskes:zen_elect_by_version branch to 0ef5ec6 Sep 15, 2016

}

/**
* compares two candidate to indicate who's the a better master.

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

Nit: candidate -> candidates

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

Nit: who's the a better -> which is the better

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

changed. You know that to me the nodes are human...

private volatile int minimumMasterNodes;

public static class Candidate {

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

Can this class have Javadocs please?

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

I wonder if the class should be called something like MasterCandidate or CandidateMaster?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

It's a inner class of ElectMaster, but sure. Can do MasterCandidate

return sortedCandidates.get(0);
}

/** selects the best active master to join, where multiple are discovered (oh noes) */

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

Drop the "oh noes"?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

party pooper. removed.

out.writeLong(id);
}

@Override
public String toString() {
return "ping_response{node [" + node + "], id[" + id + "], master [" + master + "], hasJoinedOnce [" + hasJoinedOnce + "], cluster_name[" + clusterName.value() + "]}";
return "ping_response{node [" + node + "], id[" + id + "], master [" + master + "], cs version [" + clusterStateVersion

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

Nit: cs version -> cluster_state_version, please.

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

sooo long. replaced.

@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th
all new scenarios and will report issues that we find on this page and in our GitHub repository.

[float]
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)

During a networking partition, cluster states updates (like mapping changes or shard assignments)

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

states -> state

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

yep

public void testIsolateAll() {
Set<String> nodes = generateRandomStringSet(1, 10);
NetworkDisruption.DisruptedLinks topology = new NetworkDisruption.IsolateAllNodes(nodes);
for (int i = 0; i < 10; i++) {

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

Why not test all possible pairs, it's only 10 choose 2?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

yeah , balancing act between speed and chances the test fails if you get something wrong. I also just hate the resulting double loop for a "check all combinations"

ElectMasterService service = electMasterService();
int min_master_nodes = randomIntBetween(0, nodes.size());
int min_master_nodes = randomIntBetween(0, candidates.size());

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

While we are here, can we give this a proper Java variable name (minMasterNodes)?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

changed.

} else if (min_master_nodes > 0 && master_nodes < min_master_nodes) {
assertNull(master);
} else {
Candidate master = service.electMaster(candidates);
assertNotNull(master);

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

The indentation is off here and the rest of the way through this test.

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

fixed

assertTrue(master.getId().compareTo(node.getId()) <= 0);
for (Candidate candidate : candidates) {
if (candidate.getNode().equals(master.getNode())) {
// meh

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

Maybe a more descriptive comment? 😄

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

I made a longer but just as meaningless text :)

assertThat("candidate " + candidate + " has a lower or equal id than master " + master, candidate.getNode().getId(),
greaterThan(master.getNode().getId()));
} else {
assertThat("candidate " + master + " has a higher id than candidate " + candidate, master.getClusterStateVersion(),

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

This should say higher cluster state version instead of higher id.

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

oops

@@ -1189,6 +1192,61 @@ public void testIndicesDeleted() throws Exception {
assertFalse(client().admin().indices().prepareExists(idxName).get().isExists());
}

public void testElectMasterWithLatestVersion() throws Exception {

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

This is a beautiful test.

@jasontedor
Copy link
Member

left a comment

Thanks @bleskes, I left some feedback. In general, it looks sound.

final AtomicBoolean counted = new AtomicBoolean();
try {
zenPing.ping(pings -> {
response.addPings(pings);

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 15, 2016

Member

Should the add pings only be done inside the guard?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 15, 2016

Author Member

tja - doesn't really matter. I figured every extra bit of information, if we manage to get it in, counts

bleskes added some commits Sep 15, 2016

@bleskes

This comment has been minimized.

Copy link
Member Author

commented Sep 15, 2016

thx @jasontedor , @s1monw and @clintongormley . I addressed all the comments.

@jasontedor
Copy link
Member

left a comment

LGTM

@bleskes bleskes merged commit 577dcb3 into elastic:master Sep 15, 2016

1 of 2 checks passed

elasticsearch-ci Build finished.
Details
CLA Commit author is a member of Elasticsearch
Details

@bleskes bleskes deleted the bleskes:zen_elect_by_version branch Sep 15, 2016

bleskes added a commit that referenced this pull request Sep 15, 2016

Add current cluster state version to zen pings and use them in master…
… election (#20384)

During a networking partition, cluster states updates (like mapping changes or shard assignments)
are committed if a majority of the masters node received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected which did not yet catch up. If that happens, cluster state updates can be lost.

This commit fixed 95% of this rare problem by adding the current cluster state version to `PingResponse` and use them when deciding which master to join (and thus casting the node's vote).

Note: this doesn't fully mitigate the problem as a cluster state update which is issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two phased commit of cluster state updates) from reaching any single node in the majority side *and* the partition does allow for the master to acknowledge the change. We are working on a more comprehensive fix but that requires considerate work  and is targeted at 6.0.

bleskes added a commit that referenced this pull request Sep 15, 2016

Add current cluster state version to zen pings and use them in master…
… election (#20384)

During a networking partition, cluster states updates (like mapping changes or shard assignments)
are committed if a majority of the masters node received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected which did not yet catch up. If that happens, cluster state updates can be lost.

This commit fixed 95% of this rare problem by adding the current cluster state version to `PingResponse` and use them when deciding which master to join (and thus casting the node's vote).

Note: this doesn't fully mitigate the problem as a cluster state update which is issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two phased commit of cluster state updates) from reaching any single node in the majority side *and* the partition does allow for the master to acknowledge the change. We are working on a more comprehensive fix but that requires considerate work  and is targeted at 6.0.

@clintongormley clintongormley added the >bug label Sep 19, 2016

@clintongormley clintongormley removed the v5.1.1 label Dec 8, 2016

@makeyang

This comment has been minimized.

Copy link
Contributor

commented Jan 19, 2017

this one plus logical time plus this PR: #13062 is really makeing a raft.
then u guys will make pacificA, your log replication method really solid because pacificA requires raft/paxos to maintain replication set config.

@bleskes

This comment has been minimized.

Copy link
Member Author

commented Jan 19, 2017

@makeyang there are a lot of similarities between ZenDiscovery and Raft if you look at it the right way (although ZenDiscovery was built before Raft was there). I'm not sure 100% what you mean, but a pacificA like log replication model requires an external concensus oracle. For pacifica it indeed doesn't matter which algorithm you use,.

@makeyang

This comment has been minimized.

Copy link
Contributor

commented Jan 19, 2017

@bleskes that's all I mean: as long as leader election is solid concensus, then your make log replication solid.
just wonder: no matter what u call it, ZenDiscovery or whatever it is and no matter whether it is before or after raft, what really matter is: it used to to wrong and it is still wrong. so why not just make it raft espencially it is really similarity to raft?

@makeyang

This comment has been minimized.

Copy link
Contributor

commented Jan 19, 2017

@bleskes just another question which is more serious than last one: based on the current condition, ES won't pass jepsen-like test, right?

@bleskes

This comment has been minimized.

Copy link
Member Author

commented Jan 19, 2017

@makeyang sadly there is no "just implement it" in distributed systems. It's a long process consisting of small steps. This and and other PRs you follow are part of that journey.

@bleskes

This comment has been minimized.

Copy link
Member Author

commented Jan 19, 2017

ES won't pass jepsen-like test, right?

That's broad question. ES 5.0 is light years ahead of 1.x but there are still known issues. You can read about them in our documentation here.

@makeyang

This comment has been minimized.

Copy link
Contributor

commented Jan 19, 2017

@bleskes agree what u said. but please make these critical small setps, which is impact data safety, faster and faster before ES ruin reputation like MongoDB does before due to MongoDB's careless to data loss.

@bleskes

This comment has been minimized.

Copy link
Member Author

commented Jan 19, 2017

@makeyang we're making as a fast as we can responsibly make them.

careless to data loss.

I think this very conversation shows otherwise. If you are speaking from experience, please do share your problem so we can see if it has already been solved or we need to fix something and add it to the working queue. Abstract claims are dangerous and hard to address.

@makeyang

This comment has been minimized.

Copy link
Contributor

commented Jan 19, 2017

@bleskes what I mentioned is mongdb, just google "mongodb loses data" and u'll see that. I'm not saying ES.
I'll share anything related to ES to github or discussion.

@makeyang

This comment has been minimized.

Copy link
Contributor

commented May 2, 2017

@bleskes I have a question related pacificA:
according to section of "Change of Primary" of paper:
"During reconciliation, p sends prepare messages for uncommitted requests in its prepared list and have them committed on the new configuration"
assume below scenario:
primary sends a update request(we call it RN) with its configuration version and the serial number in a prepare message to all replicas and network partition happens, some replicas get this prepare message and put it into their prepared list while othres not.
the old master will acked failed to client.
then one secondary which get RN get master lease and "sends prepare messages for uncommitted requests in its prepared list and have them committed on the new configuration"
by accidently, although the system achieve consistency, but the system have the date it shouldn't have.
do I miss something?

@jasontedor

This comment has been minimized.

Copy link
Member

commented May 2, 2017

@makeyang Sorry, but this is not the place for discussion, we have the forum for that. However, general questions that are not at all specific to Elasticsearch are outside what you can expect to be answered there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.