split brain condition after second network disconnect - even with minimum_master_nodes set #2117

Closed
owenbutler opened this Issue Jul 25, 2012 · 18 comments


Summary:

Split brain can occur on the second network disconnect of a node, even when minimum_master_nodes is configured correctly (n/2+1). The split brain occurs if the nodeId (UUID) of the disconnected node sorts such that the disconnected node picks itself as the next logical master while pinging the other nodes (NodeFaultDetection). The split brain only occurs the second time that the node is disconnected/isolated.
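
For reference, the n/2+1 rule mentioned above uses integer division. A minimal sketch (the function name is hypothetical and is not an Elasticsearch API):

```python
def quorum_size(master_eligible_nodes: int) -> int:
    """Return floor(n / 2) + 1, the value intended for minimum_master_nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


# For the 3 node cluster in this report the quorum is 2, so a single
# isolated node should never be allowed to act as master on its own.
for n in (1, 2, 3, 5):
    print(n, quorum_size(n))
```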

Detail:

Using ZenDiscovery, node IDs are randomly generated (a UUID): ZenDiscovery:169.

When the node is disconnected/isolated, the ElectMasterService uses an ordered list of the nodes (ordered by nodeId) to determine a new potential master. It picks the first node in the ordered list: ElectMasterService:95

Because the nodeId's are random, it's possible for the disconnected/isolated node to be first in the ordered list, electing itself as a possible master.
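
The selection rule described above can be sketched roughly as follows. This is a simplified illustration, not the actual ElectMasterService code: the `Node` type, the `elect_master` function, and the single-letter IDs are hypothetical, and it assumes node1 was the original master.

```python
from typing import NamedTuple, Optional, Sequence


class Node(NamedTuple):
    node_id: str  # stands in for the random UUID
    name: str


def elect_master(visible: Sequence[Node], minimum_master_nodes: int) -> Optional[Node]:
    """Pick the node with the smallest node_id, or None when there is no quorum."""
    if len(visible) < minimum_master_nodes:
        return None
    return min(visible, key=lambda node: node.node_id)


node1 = Node("m", "node1")  # assumed to be the original master
node2 = Node("k", "node2")
node3 = Node("a", "node3")  # the isolated node; its ID happens to sort first

# node3 has removed the old master from its view but still lists node2,
# so it sees "enough" nodes and its own ID sorts first.
stale_view = [node2, node3]
print(elect_master(stale_view, minimum_master_nodes=2))

# Once node2 is also detected as gone, the quorum check blocks the election.
print(elect_master([node3], minimum_master_nodes=2))
```

The sketch only shows why the ordering matters. It does not model the fault-detection transition that appears to fail on the second disconnect.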

The first time the network is disconnected, the minimum_master_nodes property is honored and the disconnected/isolated node goes into a "ping" mode, where it simply tries to ping for other nodes. Once the network is re-connected, the node re-joins the cluster successfully.

The second time the network is disconnected, the minimum_master_nodes intent is not honored. The disconnected/isolated node fails to realise that it's not connected to the remaining node in the 3 node cluster and elects itself as master, still thinking it's connected.

It feels like there is a failure in the transition between MasterFaultDetection and NodeFaultDetection, because it works the first time!

The fault only occurs if the nodeId is ordered such that the disconnected node picks itself as the master while isolated. If the nodeId's are ordered such that it picks one of the other 2 nodes to be potential master then the isolated node honors the minimum_master_nodes intent every time.

Because the nodeId's are randomly (UUID) generated, the probability of this occurring drops as the number of nodes in the cluster goes up. For our 3 node cluster it's ~50% (with one node detected as gone, it's up to the ordering of the remaining two nodeId's).
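
The ~50% figure can be checked with a small calculation. This is a sketch under an assumption not stated explicitly above: node IDs are independent and uniformly random, so once the old master has been removed, the isolated node is equally likely to sort anywhere among the remaining n - 1 IDs. The function names are hypothetical.

```python
import random
from fractions import Fraction


def p_isolated_sorts_first(cluster_size: int) -> Fraction:
    """Chance that the isolated node has the smallest ID among the n - 1 nodes left."""
    if cluster_size < 2:
        raise ValueError("need at least two nodes")
    return Fraction(1, cluster_size - 1)


def simulate(cluster_size: int, trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability, using random floats as IDs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Index 0 is the isolated node; the old master is already removed.
        ids = [rng.random() for _ in range(cluster_size - 1)]
        if ids[0] == min(ids):
            hits += 1
    return hits / trials


for n in (3, 5, 10):
    print(n, p_isolated_sorts_first(n), round(simulate(n), 3))
```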

Note: while we were trying to track this down, we found that the cluster.service TRACE level logging (which outputs the cluster state) does not list the nodes in election order. I.e., the first node in that printed list is not necessarily going to be elected as master by the isolated node.

Detail Steps to reproduce:

Because the ordering of the nodeId's is random (UUID), we were having trouble getting a consistently reproducible test case. To fix the ordering, we made a patch to ZenDiscovery to allow us to optionally configure a nodeId. This allowed us to set the nodeId of the disconnected/isolated node to guarantee its ordering, allowing us to consistently reproduce the problem.

We've tested this scenario on the 0.19.4, 0.19.7, 0.19.8 distributions and see the error when the nodeId's were ordered just right.

We also tested this scenario on the current git master with the supplied patch.

In this scenario, node3 will be the node we disconnect/isolate, so we start the nodes up in numerical order to ensure node3 doesn't start as master.

  1. Configure nodes with attached configs (one is provided for each node)
  2. Start up nodes 1 and 2. After they are attached and one is master, start node 3
  3. Create a blank index with default shard/replica(5/1) settings
  4. Pull network cable from node 3
  5. Node 3 detects master has gone (MasterFaultDetection)
  6. Node 3 elects itself as master (Because the nodeId's are ordered just right)
  7. Node 3 detects the remaining node has gone, enters ZenDiscovery minimum_master_nodes mode, prints a message indicating not enough nodes
  8. Node 3 goes into a ping state looking for nodes
  9. At this point, node 1 and node 2 report a valid cluster, they know about each other but not about node 3.
  10. Reconnect network to node 3
  11. Node 3 rejoins the cluster correctly, seeing that there is already a master in the cluster.

At this point, everything is working as expected.

  1. Pull network cable from node 3 again
  2. Node 3 detects master has gone (MasterFaultDetection)
  3. Node 3 elects itself as master (Because the nodeId's are ordered just right)
  4. Node 3 now fails to detect that the remaining node in the cluster is not accessible. It starts throwing a number of Netty NoRouteToHostExceptions about the remaining node.
  5. According to node 3, cluster health is yellow and cluster state shows 2 data nodes
  6. Reconnect network to node 3
  7. Node 3 appears to connect to the node that it thinks it's still connected to. (can see that via the cluster state api). The other nodes log nothing and do not show the disconnected node as connected in any way.
  8. Node 3 at this point accepts indexing and search requests, a classic split brain.

Here's a gist with the patch to ZenDiscovery and the 3 node configs.

https://gist.github.com/3174651

kimchy Jul 30, 2012

Member

Thanks for the detailed explanation. Do you have the logs for the 3 nodes around? I would love to have a look at them.

owenbutler Aug 1, 2012

Hi Shay,

Logs for the three nodes here:

https://gist.github.com/3223822

The disconnected/isolated node is the top file, log named "splitbrain-isolatednode.log".

Timestamps of note (the clocks of the 3 nodes are within a second of each other):

14:42:49 -> first network disconnect
14:44:30 -> Reconnect
14:45:13 -> Second network disconnect
14:46:11 -> Split brain begins (the isolated node still thinks it's connected to one of the others at this point)
14:47:41 -> After second reconnect the isolated node now just sees one node.

There are a few index status request errors logged because we used elasticsearch-head to check status on the isolated node.
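
To make the timeline easier to read, the gaps between the events listed above can be computed as below. This is only arithmetic on the timestamps quoted in this comment; the variable and function names are hypothetical.

```python
from datetime import datetime

# Timestamps copied from the list above (same day, isolated node's log).
EVENTS = [
    ("14:42:49", "first network disconnect"),
    ("14:44:30", "reconnect"),
    ("14:45:13", "second network disconnect"),
    ("14:46:11", "split brain begins"),
    ("14:47:41", "isolated node sees one node after second reconnect"),
]


def gaps_in_seconds(events):
    """Return the number of seconds between each pair of consecutive events."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp, _ in events]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


for (_, label), gap in zip(EVENTS[1:], gaps_in_seconds(EVENTS)):
    print(f"+{gap:>3}s  {label}")
```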

praveenbm5 Aug 28, 2012

Looks like we are facing a similar issue...

[2012-08-28 06:54:20,729][INFO ][discovery.zen ] [TES3] master_left [[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2012-08-28 06:54:20,741][INFO ][cluster.service ] [TES3] master {new [TES3][DIhQQmanQOOgo1qVFz8tSA][inet[/178.238.237.240:9300]], previous [TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]]}, removed {[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]],}, reason: zen-disco-master_failed ([TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]])
[2012-08-28 06:54:50,707][INFO ][cluster.service ] [TES3] added {[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]],}, reason: zen-disco-receive(join from node[[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]]])
[2012-08-28 07:06:03,238][INFO ][cluster.service ] [TES3] removed {[TES2][kT7r8dFxSt6bjVKUAvdcdg][inet[/178.238.237.239:9300]],}, reason: zen-disco-node_failed([TES2][kT7r8dFxSt6bjVKUAvdcdg][inet[/178.238.237.239:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2012-08-28 07:06:03,278][WARN ][discovery.zen ] [TES3] not enough master nodes, current nodes: {[TES3][DIhQQmanQOOgo1qVFz8tSA][inet[/178.238.237.240:9300]],}
[2012-08-28 07:06:03,279][INFO ][cluster.service ] [TES3] removed {[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]],}, reason: zen-disco-node_failed([TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2012-08-28 07:06:33,290][INFO ][cluster.service ] [TES3] detected_master [TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]], added {[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]],[TES2][kT7r8dFxSt6bjVKUAvdcdg][inet[/178.238.237.239:9300]],}, reason: zen-disco-receive(from master [[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]]])
[2012-08-28 08:36:45,100][INFO ][discovery.zen ] [TES3] master_left [[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2012-08-28 08:36:45,112][INFO ][cluster.service ] [TES3] master {new [TES3][DIhQQmanQOOgo1qVFz8tSA][inet[/178.238.237.240:9300]], previous [TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]]}, removed {[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]],}, reason: zen-disco-master_failed ([TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]])
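
Log lines in this format can be scanned for the moments where a node names itself as the new master. The following is a rough sketch: the regular expression is inferred only from the lines pasted above, and `parse_line` and `is_self_election` are hypothetical helper names, not part of Elasticsearch.

```python
import re
from datetime import datetime

# Layout inferred from the pasted lines: [timestamp][LEVEL][logger] [node] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]"
    r"\[(?P<level>\w+)\s*\]"
    r"\[(?P<logger>[^\]]+?)\s*\]"
    r"\s+\[(?P<node>[^\]]+)\]"
    r"\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return record


def is_self_election(record):
    """True when the logging node reports itself as the new master."""
    return record["msg"].startswith("master {new [" + record["node"] + "]")


# Three lines copied verbatim from the log excerpt above.
SAMPLE = [
    "[2012-08-28 06:54:20,729][INFO ][discovery.zen ] [TES3] master_left "
    "[[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]]], reason "
    "[failed to ping, tried [3] times, each with maximum [30s] timeout]",
    "[2012-08-28 06:54:20,741][INFO ][cluster.service ] [TES3] master {new "
    "[TES3][DIhQQmanQOOgo1qVFz8tSA][inet[/178.238.237.240:9300]], previous "
    "[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]]}, removed "
    "{[TES1][-G0NH7iwRQevY3La-zlxaA][inet[/178.238.237.241:9300]],}, reason: "
    "zen-disco-master_failed ([TES1][-G0NH7iwRQevY3La-zlxaA]"
    "[inet[/178.238.237.241:9300]])",
    "[2012-08-28 07:06:03,278][WARN ][discovery.zen ] [TES3] not enough master "
    "nodes, current nodes: {[TES3][DIhQQmanQOOgo1qVFz8tSA]"
    "[inet[/178.238.237.240:9300]],}",
]

records = [parse_line(line) for line in SAMPLE]
for record in records:
    if is_self_election(record):
        print(record["ts"], record["node"], "elected itself as master")
```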

tallpsmith Dec 18, 2012

@kimchy et al have you had a chance to try to reproduce this with the steps outlined above? I see a fair number of Split Brain style issues appearing on the mailing lists, and likely a smaller cluster size for them is exacerbating this issue?

tallpsmith Mar 21, 2013

@kimchy @s1monw anyone able to comment on whether they've even tried the steps above? I still think people are vulnerable to this condition.

It is a pain to set up the test, I know, but trust me, once you see it happen, it's bad.

brusic Mar 21, 2013

Contributor

I have been watching this thread for a while since I have encountered the issue as well (running 0.20RC1). I tend to avoid adding a +1 comment on GitHub, but I wanted to lend my support to this issue.

Another related problem this week: https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/erpa7mMT5DM

My hope is that the new field cache changes will alleviate memory pressure, causing fewer garbage collections (which might have been our issue as well).

s1monw Mar 21, 2013

Contributor

@tallpsmith thanks for pinging me on this one again. I started looking into this today but it might take until next week to get some results / comments on this. I haven't had this issue on my radar so it's great that you pinged again!

@ghost ghost assigned s1monw Mar 21, 2013

synhershko Apr 4, 2013

Contributor

radar blip

s1monw Apr 8, 2013

Contributor

hey folks,
sorry to come back to this so late. I have worked out a setup where I can reproduce this issue. Unfortunately, this situation can in fact occur with zen discovery at this point. We are working on a fix for this issue, which might take a bit until we have a solid solution.

jprante May 8, 2013

Contributor

Maybe it should be possible to add an alternative (selectable) Leader Election algorithm? For example, Chang-Roberts or Hirschberg-Sinclair?

brusic May 23, 2013

Contributor

Our live cluster experienced the split-brain scenario once again after a node was unresponsive during garbage collection. Plenty of logs if wanted.

synhershko May 23, 2013

Contributor

Ivan, how do you connect it to the cluster again - simple restart to that node?

brusic May 23, 2013

Contributor

Yes. I didn't have time to identify the problematic node, so I restarted the node I thought was having issues (I chose correctly; I guess I should have played PowerBall after all). One shard (on a different node) was stuck in a RECOVERING state, which I fixed by reducing the number of replicas from 2 to 1 and then increasing the number again.

Simon mentioned that the issue is with zen discovery, so I am assuming switching from Multicast to Unicast will not alleviate the problem.

brusic Jul 11, 2013

Contributor

Have I mentioned how serious this problem is? Our production cluster has pretty much gone away. It is like a game of whack-a-mole trying to kill instances that think they are the master.

fasher Jul 11, 2013

I agree, this issue is critical; please give this priority.
We have gone to a single master node in our production to avoid the split brains we had in the past that corrupted our index.

bitsofinfo Jan 2, 2014

Contributor

Any update on a timeline for a fix for this?

XANi May 28, 2014

Any progress on that?

kimchy Jun 16, 2014

Member

this is the same issue as #2488, so closing this in favor of the other one. Please comment if you think it's different.

@kimchy kimchy closed this Jun 16, 2014

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Jan 31, 2018

martijnvg added a commit that referenced this issue Feb 5, 2018

Merge remote-tracking branch 'es/6.x' into ccr-6.x
* es/6.x: (155 commits)
  Make persistent tasks work. Made persistent tasks executors pluggable.
  Removed ClientHelper dependency from PersistentTasksService.
  Added AllocatedPersistentTask#waitForPersistentTaskStatus(...) that delegates to PersistentTasksService#waitForPersistentTaskStatus(...)
  Add adding ability to associate an ID with tasks.
  Remove InternalClient and InternalSecurityClient (#3054)
  Make the persistent task status available to PersistentTasksExecutor.nodeOperation(...) method
  Refactor/to x content fragments2 (#2329)
  Make AllocatedPersistentTask members volatile (#2297)
  Moves more classes over to ToXContentObject/Fragment (#2283)
  Adapt to upstream changes made to AbstractStreamableXContentTestCase (#2117)
  Move tribe to a module (#2088)
  Persistent Tasks: remove unused isCurrentStatus method (#2076)
  Call initialising constructor of BaseTasksRequest (#1771)
  Always Accumulate Transport Exceptions (#1619)
  Pass down the provided timeout.
  Fix static / version based BWC tests (#1456)
  Don't call ClusterService.state() in a ClusterStateUpdateTask
  Separate publishing from applying cluster states
  Persistent tasks: require allocation id on task completion (#1107)
  Fixes compile errors in Eclipse due to generics
  ...