
Failure to recover from 1-way network partition #6469

Closed
magnhaug opened this issue Jun 11, 2014 · 5 comments

@magnhaug
After a 1-way network partition scenario, two nodes can end up stuck with different views of which nodes are currently in the cluster.

I.e. with 5 nodes:
Node 1 sees node 1, 3 and 5. This is a RED cluster.
Node 2 sees node 1, 2, 4 and 5. This is a GREEN cluster.
All of this according to the nodes endpoint.

The cluster is stuck in this state, even though every node could reach every other node (verified with curl on the ES HTTP port and binary port).

Some of these nodes give an NPE for the _status endpoint:
{"error":"NullPointerException[null]","status":500}

All nodes are configured to discover all the others through unicast.
Observed on ES 0.90.13 on RHEL 6.5.
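
For reference, the checks above were done with commands roughly like these ("node1" is a placeholder host, default HTTP port 9200 assumed, endpoint names as I remember them from the 0.90 API):

```sh
# Ask a node for its own view of cluster health and membership
# (run against each node's HTTP port in turn).
curl -s 'http://node1:9200/_cluster/health?pretty'
curl -s 'http://node1:9200/_cluster/nodes?pretty'

# The call that returns {"error":"NullPointerException[null]","status":500}
# on some of the nodes:
curl -s 'http://node1:9200/_status?pretty'
```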

@bleskes bleskes self-assigned this Jun 16, 2014
@bleskes
Contributor

bleskes commented Jun 16, 2014

It sounds like a similar issue to #2488, which we are working on. I assume this one is different due to the asymmetric nature of the network partition, but in essence it is the same: you have two masters in your cluster (correct?).

> The cluster is stuck in this state, even though every node could reach every other node (verified with curl on the ES HTTP port and binary port).

Once a cluster has formed, ES will verify that all the nodes of the cluster are active (via pinging), but it will not continuously look for new nodes. The assumption is that if a new node comes up, it will actively join the cluster. The downside is that if you end up with two stable "sub" clusters, they will not discover each other.
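
A quick way to see that you have two such sub-clusters is to ask every node which master it currently follows; in a split cluster the answers will disagree. A sketch (hypothetical host names, default HTTP port assumed):

```sh
# Print the master each node currently follows; the two sub-clusters
# will report different node ids here.
for host in node1 node2 node3 node4 node5; do
  echo -n "$host: "
  curl -s "http://$host:9200/_cluster/state?pretty" | grep '"master_node"'
done
```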

> Some of these nodes give an NPE for the _status endpoint:
> {"error":"NullPointerException[null]","status":500}

Do you have a stack trace in your ES logs?

@magnhaug
Author

Yes, the issue looks similar to #2488; it might be wise to use this as another test case on the improved_zen branch.
As far as I remember, we saw two master nodes in the two different clusters. I cannot verify this now, as we had to reboot the cluster to continue other tests.

> Do you have a stack trace in your ES logs?

No, unfortunately the logs were completely silent when _status failed with an NPE.
I tried to reconfigure the logger to log everything from TRACE and up, but the node seemed to reload itself when I curled a new root logger level to it (?), which cleared up the issue (it joined the other master).
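
For the record, the logger change was made over HTTP, roughly along these lines; the exact root logger setting name and whether the dynamic logger.* settings behave like this on 0.90.13 are assumptions on my part:

```sh
# Sketch: raise the root logger to TRACE via the cluster update settings API.
# "logger._root" and its dynamic updatability on 0.90.13 are assumptions.
curl -XPUT 'http://node1:9200/_cluster/settings' -d '{
  "transient": { "logger._root": "TRACE" }
}'
```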

@SeanTAllen

We've seen this multiple times. We've since moved to dedicated master nodes and haven't seen it since, but I'm sure the possibility still exists.
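
For reference, a dedicated-master setup of that era looks roughly like this (a sketch, not our exact configuration; the same settings can live in elasticsearch.yml):

```sh
# Sketch (0.90/1.x-era setting names): master-eligible nodes that hold no data.
bin/elasticsearch -Des.node.master=true -Des.node.data=false

# Data nodes that can never be elected master:
bin/elasticsearch -Des.node.master=false -Des.node.data=true
```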

@AeroNotix

Using dedicated master nodes is not a fix!

What dedicated master nodes do is lighten the load on the nodes that act as masters, which makes a split less likely.

@bleskes
Contributor

bleskes commented Jan 26, 2015

I'm going to close this assuming it is indeed a duplicate of #2488.

@bleskes bleskes closed this as completed Jan 26, 2015