
Add current cluster state version to zen pings and use them in master election #20384

Merged
merged 16 commits into from
Sep 15, 2016

Conversation

bleskes
Contributor

@bleskes bleskes commented Sep 8, 2016

During a network partition, cluster state updates (like mapping changes or shard assignments)
are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and receive the changes they couldn't receive before. However, if a second partition happens while the cluster
is still recovering from the previous one and the old master ends up on the minority side, a new master may be elected that has not yet caught up. If that happens, cluster state updates can be lost.

This commit fixes 95% of this rare problem by adding the current cluster state version to PingResponse and using it when deciding which master to join (and thus casting the node's vote).

Note: this doesn't fully mitigate the problem, as a cluster state update issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two-phase commit of cluster state updates) from reaching any single node on the majority side while still allowing the master to acknowledge the change. We are working on a more comprehensive fix, but that requires considerable work and is targeted at 6.0.
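The election preference described in this PR can be sketched as a comparator: prefer the candidate with the highest cluster state version, and fall back to the node-id ordering only to break ties. This is a simplified illustration with made-up names (`Candidate`, `electMaster`), not the actual `ElectMasterService` code:

```java
import java.util.Comparator;
import java.util.List;

// Simplified sketch (not the actual Elasticsearch classes): a master candidate
// carries the node id and the cluster state version it has applied.
class Candidate {
    final String nodeId;
    final long clusterStateVersion;
    Candidate(String nodeId, long clusterStateVersion) {
        this.nodeId = nodeId;
        this.clusterStateVersion = clusterStateVersion;
    }
}

public class ElectionSketch {
    // Prefer the candidate with the highest cluster state version; break ties
    // by the lowest node id (the pre-existing election criterion).
    static final Comparator<Candidate> BEST_MASTER_FIRST =
        Comparator.comparingLong((Candidate c) -> c.clusterStateVersion).reversed()
                  .thenComparing((Candidate c) -> c.nodeId);

    static Candidate electMaster(List<Candidate> candidates) {
        return candidates.stream().min(BEST_MASTER_FIRST).orElse(null);
    }

    public static void main(String[] args) {
        Candidate stale = new Candidate("node-a", 5);    // lower id, but behind
        Candidate caughtUp = new Candidate("node-b", 9); // newer cluster state
        Candidate winner = electMaster(List.of(stale, caughtUp));
        System.out.println(winner.nodeId); // node-b wins despite the higher id
    }
}
```

The point of the sketch is the ordering: a node that has applied a newer cluster state always beats a node with a lower id, which is what prevents a stale node from winning the election after a partition.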

PS: this PR contains and depends on #20348 , which was required for the long-running tests. That part doesn't need to be reviewed.

LongGCDisruption simulates a long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely, the logging infrastructure, which is shared between the nodes, can cause deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at the log4j2 classes introduced in elastic#20235

This commit also fixes improper handling of retry logic in LongGCDisruption and adds protection against deadlocking the test code that activates the disruption (which uses logging too! :)).
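The shared-lock hazard can be illustrated abstractly: a thread frozen mid-"GC" while holding a lock shared between nodes (standing in here for the logging infrastructure) blocks every other thread that needs that lock. This is a hypothetical sketch of the hazard, not the actual LongGCDisruption code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of the deadlock hazard: a thread frozen during a
// simulated GC pause while holding a lock shared between nodes blocks
// every other node that needs the lock.
public class SharedLockHazard {
    static final ReentrantLock sharedLogLock = new ReentrantLock();

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch lockHeld = new CountDownLatch(1);

        Thread frozenNodeThread = new Thread(() -> {
            sharedLogLock.lock(); // "node A" logs, taking the shared lock
            lockHeld.countDown();
            try {
                Thread.sleep(Long.MAX_VALUE); // simulated long GC pause
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                sharedLogLock.unlock();
            }
        });
        frozenNodeThread.start();
        lockHeld.await();

        // "node B" now tries to log: it cannot acquire the lock while the
        // frozen thread holds it, which is exactly why the disruption must
        // avoid suspending threads inside such shared infrastructure.
        boolean acquired = sharedLogLock.tryLock(200, TimeUnit.MILLISECONDS);
        System.out.println("node B acquired logging lock: " + acquired); // false

        frozenNodeThread.interrupt();
        frozenNodeThread.join();
    }
}
```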

On top of that we have some new, evil and nasty tests.
@bleskes bleskes added resiliency :Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure v5.0.0-beta1 labels Sep 8, 2016
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)

During a networking partition, cluster states updates (like mapping changes or shard assignments)
are committed if a majority of the masters node received the update correctly. This means that the current master has access


masters -> master-eligible

Member

As @clintongormley said, masters -> master-eligible but then node -> nodes so it reads majority of the master-eligible nodes....

Contributor Author

check

@jasontedor
Member

jasontedor commented Sep 15, 2016

Would you mind merging master in after you integrate #20348?

}

/**
* compares two candidate to indicate who's the a better master.
Member

Nit: candidate -> candidates

Member

Nit: who's the a better -> which is the better

Contributor Author

changed. You know that to me the nodes are human...

private volatile int minimumMasterNodes;

public static class Candidate {
Member

Can this class have Javadocs please?

Member

I wonder if the class should be called something like MasterCandidate or CandidateMaster?

Contributor Author

It's an inner class of ElectMaster, but sure. Can do MasterCandidate

return sortedCandidates.get(0);
}

/** selects the best active master to join, where multiple are discovered (oh noes) */
Member

Drop the "oh noes"?

Contributor Author

party pooper. removed.

out.writeLong(id);
}

@Override
public String toString() {
return "ping_response{node [" + node + "], id[" + id + "], master [" + master + "], hasJoinedOnce [" + hasJoinedOnce + "], cluster_name[" + clusterName.value() + "]}";
return "ping_response{node [" + node + "], id[" + id + "], master [" + master + "], cs version [" + clusterStateVersion
Member

Nit: cs version -> cluster_state_version, please.

Contributor Author

sooo long. replaced.

@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th
all new scenarios and will report issues that we find on this page and in our GitHub repository.

[float]
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)

During a networking partition, cluster states updates (like mapping changes or shard assignments)
Member

states -> state

Contributor Author

yep

public void testIsolateAll() {
Set<String> nodes = generateRandomStringSet(1, 10);
NetworkDisruption.DisruptedLinks topology = new NetworkDisruption.IsolateAllNodes(nodes);
for (int i = 0; i < 10; i++) {
Member

Why not test all possible pairs, it's only 10 choose 2?

Contributor Author

yeah, it's a balancing act between speed and the chance that the test fails if you get something wrong. I also just hate the resulting double loop for a "check all combinations" test
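For reference, the exhaustive alternative being discussed is a double loop over all unordered pairs, which for 10 nodes is only 45 checks. A minimal sketch with illustrative names (not the actual test code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "check all combinations" double loop: enumerating every
// unordered pair of nodes is n*(n-1)/2 checks, so 45 pairs for 10 nodes.
public class AllPairsSketch {
    static List<String[]> allPairs(List<String> nodes) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < nodes.size(); i++) {
            for (int j = i + 1; j < nodes.size(); j++) {
                pairs.add(new String[] { nodes.get(i), nodes.get(j) });
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            nodes.add("node-" + i);
        }
        System.out.println(allPairs(nodes).size()); // 45 = 10 choose 2
    }
}
```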

ElectMasterService service = electMasterService();
int min_master_nodes = randomIntBetween(0, nodes.size());
int min_master_nodes = randomIntBetween(0, candidates.size());
Member

While we are here, can we give this a proper Java variable name (minMasterNodes)?

Contributor Author

changed.

} else if (min_master_nodes > 0 && master_nodes < min_master_nodes) {
assertNull(master);
} else {
Candidate master = service.electMaster(candidates);
assertNotNull(master);
Member

The indentation is off here and the rest of the way through this test.

Contributor Author

fixed

assertTrue(master.getId().compareTo(node.getId()) <= 0);
for (Candidate candidate : candidates) {
if (candidate.getNode().equals(master.getNode())) {
// meh
Member

Maybe a more descriptive comment? 😄

Contributor Author

I made a longer but just as meaningless text :)

assertThat("candidate " + candidate + " has a lower or equal id than master " + master, candidate.getNode().getId(),
greaterThan(master.getNode().getId()));
} else {
assertThat("candidate " + master + " has a higher id than candidate " + candidate, master.getClusterStateVersion(),
Member

This should say higher cluster state version instead of higher id.

Contributor Author

oops

@@ -1189,6 +1192,61 @@ public void testIndicesDeleted() throws Exception {
assertFalse(client().admin().indices().prepareExists(idxName).get().isExists());
}

public void testElectMasterWithLatestVersion() throws Exception {
Member

This is a beautiful test.

Member

@jasontedor jasontedor left a comment


Thanks @bleskes, I left some feedback. In general, it looks sound.

final AtomicBoolean counted = new AtomicBoolean();
try {
zenPing.ping(pings -> {
response.addPings(pings);
Member

Should the add pings only be done inside the guard?

Contributor Author

tja - doesn't really matter. I figured every extra bit of information, if we manage to get it in, counts
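The pattern under discussion can be sketched as follows: the callback records every batch of pings, while an AtomicBoolean guard ensures the completion signal fires only once. The names here are illustrative, not the actual ZenPing API:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the pattern: a ping callback accumulates responses, while an
// AtomicBoolean guard makes the "done" latch count down only once even if
// the callback fires multiple times.
public class PingCollectorSketch {
    final Set<String> responses = Collections.synchronizedSet(new HashSet<>());
    final AtomicBoolean counted = new AtomicBoolean();
    final CountDownLatch done = new CountDownLatch(1);

    // Called whenever a batch of pings arrives. Adding responses outside the
    // guard means late batches are still recorded, matching the author's
    // point that every extra bit of information counts.
    void onPings(Set<String> pings) {
        responses.addAll(pings);
        if (counted.compareAndSet(false, true)) {
            done.countDown();
        }
    }

    public static void main(String[] args) {
        PingCollectorSketch collector = new PingCollectorSketch();
        collector.onPings(Set.of("node-1"));
        collector.onPings(Set.of("node-2")); // late batch: still recorded
        System.out.println(collector.responses.size()); // 2
        System.out.println(collector.done.getCount());  // 0, counted down once
    }
}
```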

@bleskes
Contributor Author

bleskes commented Sep 15, 2016

thx @jasontedor, @s1monw and @clintongormley. I addressed all the comments.

Member

@jasontedor jasontedor left a comment


LGTM

@bleskes bleskes merged commit 577dcb3 into elastic:master Sep 15, 2016
@bleskes bleskes deleted the zen_elect_by_version branch September 15, 2016 21:39
bleskes added a commit that referenced this pull request Sep 15, 2016
… election (#20384)

During a network partition, cluster state updates (like mapping changes or shard assignments)
are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and receive the changes they couldn't receive before. However, if a second partition happens while the cluster
is still recovering from the previous one *and* the old master ends up on the minority side, a new master may be elected that has not yet caught up. If that happens, cluster state updates can be lost.

This commit fixes 95% of this rare problem by adding the current cluster state version to `PingResponse` and using it when deciding which master to join (and thus casting the node's vote).

Note: this doesn't fully mitigate the problem, as a cluster state update issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two-phase commit of cluster state updates) from reaching any single node on the majority side *and* the partition does allow the master to acknowledge the change. We are working on a more comprehensive fix, but that requires considerable work and is targeted at 6.0.
bleskes added a commit that referenced this pull request Sep 15, 2016
… election (#20384)

@makeyang
Contributor

makeyang commented Jan 19, 2017

this one plus logical time plus this PR: #13062 really makes a Raft.
Then you guys will make PacificA; your log replication method will be really solid, because PacificA requires Raft/Paxos to maintain the replication set config.

@bleskes
Contributor Author

bleskes commented Jan 19, 2017

@makeyang there are a lot of similarities between ZenDiscovery and Raft if you look at it the right way (although ZenDiscovery was built before Raft existed). I'm not 100% sure what you mean, but a PacificA-like log replication model requires an external consensus oracle. For PacificA it indeed doesn't matter which consensus algorithm you use.

@makeyang
Contributor

makeyang commented Jan 19, 2017

@bleskes that's all I mean: as long as leader election is a solid consensus, you can make log replication solid.
Just wondering: no matter what you call it, ZenDiscovery or whatever, and no matter whether it came before or after Raft, what really matters is that it used to be wrong and it is still wrong. So why not just make it Raft, especially since it is really similar to Raft?

@makeyang
Contributor

@bleskes just another question, more serious than the last one: in its current condition, ES won't pass a Jepsen-like test, right?

@bleskes
Contributor Author

bleskes commented Jan 19, 2017

@makeyang sadly there is no "just implement it" in distributed systems. It's a long process consisting of small steps. This and other PRs you follow are part of that journey.

@bleskes
Contributor Author

bleskes commented Jan 19, 2017

ES won't pass jepsen-like test, right?

That's a broad question. ES 5.0 is light years ahead of 1.x, but there are still known issues. You can read about them in our documentation here.

@makeyang
Contributor

makeyang commented Jan 19, 2017

@bleskes I agree with what you said, but please make these critical small steps, which impact data safety, faster and faster, before ES ruins its reputation over data loss like MongoDB did.

@bleskes
Contributor Author

bleskes commented Jan 19, 2017

@makeyang we're making them as fast as we can responsibly make them.

careless to data loss.

I think this very conversation shows otherwise. If you are speaking from experience, please do share your problem so we can see if it has already been solved or we need to fix something and add it to the working queue. Abstract claims are dangerous and hard to address.

@makeyang
Contributor

@bleskes what I mentioned is MongoDB; just google "mongodb loses data" and you'll see it. I'm not saying ES.
I'll share anything related to ES on GitHub or the discussion forum.

@makeyang
Contributor

makeyang commented May 2, 2017

@bleskes I have a question related to PacificA.
According to the "Change of Primary" section of the paper:
"During reconciliation, p sends prepare messages for uncommitted requests in its prepared list and have them committed on the new configuration"
Assume the following scenario:
the primary sends an update request (call it RN), with its configuration version and serial number, in a prepare message to all replicas, and a network partition happens; some replicas get this prepare message and put it into their prepared lists while others do not.
The old primary will ack failure to the client.
Then one secondary which got RN acquires the primary lease and "sends prepare messages for uncommitted requests in its prepared list and have them committed on the new configuration".
So by accident, although the system achieves consistency, it ends up with data it shouldn't have.
Do I miss something?

@jasontedor
Member

jasontedor commented May 2, 2017

@makeyang Sorry, but this is not the place for discussion, we have the forum for that. However, general questions that are not at all specific to Elasticsearch are outside what you can expect to be answered there.

Labels
>bug :Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure resiliency v5.0.0-beta1 v6.0.0-alpha1