split brain condition after second network disconnect - even with minimum_master_nodes set #2117
Split brain can occur on the second network disconnect of a node, even when minimum_master_nodes is configured correctly (n/2+1). The split brain occurs if the nodeId (a UUID) of the disconnected node is such that the disconnected node picks itself as the next logical master while pinging the other nodes (NodeFaultDetection). The split brain only occurs the second time that the node is disconnected/isolated.
Using ZenDiscovery, node IDs are randomly generated UUIDs: ZenDiscovery:169.
When the node is disconnected/isolated, the ElectMasterService uses a list of the nodes ordered by nodeId to determine a new potential master. It picks the first entry of the ordered list: ElectMasterService:95
Because the nodeIds are random, it's possible for the disconnected/isolated node to be first in the ordered list, electing itself as a possible master.
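To make the ordering problem concrete, here's a simplified sketch (our own illustration, not the actual Elasticsearch source) of the election behaviour described above: master-eligible nodes are sorted by nodeId and the first entry wins, so a node whose random UUID happens to sort first will pick itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ElectMasterSketch {
    static final class DiscoveryNode {
        final String id;
        DiscoveryNode(String id) { this.id = id; }
    }

    // Sorts the master-eligible nodes by their id and returns the first one,
    // mirroring the "pick the first of the ordered list" behaviour described above.
    static DiscoveryNode electMaster(List<DiscoveryNode> candidates) {
        if (candidates.isEmpty()) {
            return null;
        }
        List<DiscoveryNode> sorted = new ArrayList<DiscoveryNode>(candidates);
        Collections.sort(sorted, new Comparator<DiscoveryNode>() {
            public int compare(DiscoveryNode a, DiscoveryNode b) {
                // The ordering is effectively random because the ids are UUIDs.
                return a.id.compareTo(b.id);
            }
        });
        return sorted.get(0);
    }
}
```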
The first time the network is disconnected, the minimum_master_nodes property is honored and the disconnected/isolated node goes into a "ping" mode, where it simply tries to ping for other nodes. Once the network is reconnected, the node re-joins the cluster successfully.
The second time the network is disconnected, the minimum_master_nodes intent is not honored. The disconnected/isolated node fails to realise that it's not connected to the remaining nodes in the 3-node cluster and elects itself as master, still thinking it's connected.
It feels like there is a failure in the transition between MasterFaultDetection and NodeFaultDetection, because it works the first time!
The fault only occurs if the nodeId is ordered such that the disconnected node picks itself as the master while isolated. If the nodeIds are ordered such that it picks one of the other 2 nodes to be potential master, then the isolated node honors the minimum_master_nodes intent every time.
Because the nodeIds are randomly generated (UUIDs), the probability of this occurring drops as the number of nodes in the cluster goes up. For our 3-node cluster it's ~50% (with one node detected as gone, it comes down to the ordering of the remaining two nodeIds).
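For clarity, here is a minimal sketch of the quorum check that minimum_master_nodes is supposed to enforce (the names are ours, not ZenDiscovery's). On the second disconnect the isolated node behaves as if this check passed, even though it can only see itself.

```java
// Illustrative only -- not ZenDiscovery code. Shows the quorum arithmetic behind
// minimum_master_nodes: with n/2+1 = 2 for a 3-node cluster, an isolated node
// that can only see itself should stay in ping mode rather than elect a master.
class QuorumSketch {
    static boolean hasQuorum(int visibleMasterEligibleNodes, int minimumMasterNodes) {
        return visibleMasterEligibleNodes >= minimumMasterNodes;
    }

    public static void main(String[] args) {
        int clusterSize = 3;
        int minimumMasterNodes = clusterSize / 2 + 1; // = 2
        System.out.println(hasQuorum(1, minimumMasterNodes)); // false -> must not elect itself
        System.out.println(hasQuorum(2, minimumMasterNodes)); // true  -> election is allowed
    }
}
```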
Note: while we were trying to track this down, we found that the cluster.service TRACE-level logging (which outputs the cluster state) does not list the nodes in election order, i.e. the first node in that printed list is not necessarily the one that will be elected as master by the isolated node.
Detailed steps to reproduce:
Because the ordering of the nodeIds is random (UUIDs), we were having trouble getting a consistently reproducible test case. To fix the ordering, we made a patch to ZenDiscovery that allows us to optionally configure a nodeId. This let us set the nodeId of the disconnected/isolated node to guarantee its ordering, so we could reproduce consistently.
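The actual patch is in the gist linked below; as a rough illustration of the idea, it boils down to something like the following (the setting name "node.forced_id" is made up for this sketch and is not an Elasticsearch setting):

```java
import java.util.Map;
import java.util.UUID;

class NodeIdSketch {
    // Returns a configured node id if one is present, otherwise falls back to a
    // random UUID as ZenDiscovery normally does. Pinning the id of the node we
    // isolate pins its position in the election ordering.
    static String resolveNodeId(Map<String, String> settings) {
        String configured = settings.get("node.forced_id"); // hypothetical setting name
        if (configured != null && configured.length() > 0) {
            return configured;
        }
        return UUID.randomUUID().toString();
    }
}
```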
We've tested this scenario on the 0.19.4, 0.19.7, and 0.19.8 distributions and saw the error when the nodeIds were ordered just right.
We also tested this scenario on the current git master with the supplied patch.
In this scenario, node3 will be the node we disconnect/isolate, so we start the nodes up in numerical order to ensure node3 doesn't start as master.
At this point, everything is working as expected.
Here's a gist with the patch to ZenDiscovery and the 3 node configs.
Logs for the three nodes here:
The disconnected/isolated node is the top file, log named "splitbrain-isolatednode.log".
Timestamps of note (the clocks of the 3 nodes are within a second of each other):
14:42:49 -> first network disconnect
There are a few index status request errors logged because we used elasticsearch-head to check the status on the isolated node.
Looks like we are facing a similar issue...
[2012-08-28 06:54:20,729][INFO ][discovery.zen ] [TES3] master_left [[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/18.104.22.168:9300]]], reason [failed to ping, tried  times, each with maximum [30s] timeout]
I have been watching this thread for a while since I have encountered the issue as well (running 0.20RC1). I tend to avoid adding a +1 comment on GitHub, but I wanted to lend my support to this issue.
Another related problem this week: https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/erpa7mMT5DM
My hope is that the new field cache changes will alleviate memory pressure, causing fewer garbage collections (which might have been our issue as well).
Yes. I didn't have time to identify the problematic node, so I restarted the node I thought was having issues (I chose correctly, I guess I should have played PowerBall after all). One shard (on a different node) was stuck in a RECOVERING state, which I fixed by reducing the number of replicas from 2 to 1 and then increasing the number again.
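In case it helps anyone hitting the same stuck shard, the workaround amounts to dropping and restoring number_of_replicas. A rough sketch of that with the Java client of this era (index name and client setup are assumed, and the exact builder calls may differ by version) would look roughly like this:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

class ReplicaResetSketch {
    // Drop number_of_replicas to 1 and then raise it back to 2 so the stuck
    // RECOVERING replica is discarded and re-allocated.
    static void resetReplicas(Client client, String index) {
        client.admin().indices().prepareUpdateSettings(index)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.number_of_replicas", 1)
                        .build())
                .execute().actionGet();
        client.admin().indices().prepareUpdateSettings(index)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.number_of_replicas", 2)
                        .build())
                .execute().actionGet();
    }
}
```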
Simon mentioned that the issue is with zen discovery, so I am assuming that switching from multicast to unicast will not alleviate the problem.