Failure to recover from 1-way network partition #6469

Closed
magnhaug opened this Issue · 5 comments

4 participants

@magnhaug

After a one-way network partition, two nodes can end up stuck with different views of which nodes are currently in the cluster.

For example, with 5 nodes:
Node 1 sees nodes 1, 3 and 5. This is a RED cluster.
Node 2 sees nodes 1, 2, 4 and 5. This is a GREEN cluster.
All of this according to the nodes endpoint.

The cluster is stuck in this state, even when every node could reach every other node (verified with curl on the es http port and binary port).
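
For illustration, the kind of checks involved might look like this (a sketch; the hostnames node1/node2/node3 and the default ports 9200/9300 are assumptions, not taken from the report):

    # Per-node view of cluster health and membership
    curl -s 'http://node1:9200/_cluster/health?pretty'
    curl -s 'http://node1:9200/_nodes?pretty'
    curl -s 'http://node2:9200/_cluster/health?pretty'
    curl -s 'http://node2:9200/_nodes?pretty'
    # Raw reachability of the HTTP (9200) and transport (9300) ports
    curl -s 'http://node3:9200/' > /dev/null && echo "http reachable"
    nc -z node3 9300 && echo "transport reachable"

If the health responses disagree (one GREEN, one RED) while all of the reachability checks succeed from every node, the per-node views have diverged as described.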

Some of these nodes give an NPE for the _status endpoint:
{"error":"NullPointerException[null]","status":500}

All nodes are configured to discover all the others through unicast.
Observed on ES 0.90.13 on RHEL 6.5.
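
For context, a unicast discovery setup of this kind is typically expressed roughly as follows (a sketch with hypothetical hostnames, not the reporter's actual configuration):

    # elasticsearch.yml (0.90.x-era settings)
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["node1", "node2", "node3", "node4", "node5"]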

@bleskes bleskes self-assigned this
@bleskes
Owner

It sounds like a similar issue to #2488, which we are working on. I assume this one is different due to the asymmetric nature of the network partition, but in essence it is the same: you have two masters in your cluster (correct?).

The cluster is stuck in this state, even when every node could reach every other node (verified with curl on the es http port and binary port).

Once a cluster has formed, ES will verify that all the nodes of the cluster are active (via pinging), but it will not continuously look for new nodes. The assumption is that if a new node comes up, it will actively join the cluster. The downside is that if you end up with two stable "sub" clusters, they will not discover each other.
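
One way to confirm that two masters have formed (a sketch; the hostnames are assumptions) is to compare the elected master reported by a node from each sub-cluster:

    # Each side reports its own view of the elected master
    curl -s 'http://node1:9200/_cluster/state?pretty' | grep '"master_node"'
    curl -s 'http://node2:9200/_cluster/state?pretty' | grep '"master_node"'
    # Two different node IDs here mean the cluster has split into two masters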

Some of these nodes give an NPE for the _status endpoint:
{"error":"NullPointerException[null]","status":500}

Do you have a stack trace in your ES logs?

@magnhaug

Yes, the issue looks similar to #2488; it might be wise to use this as another test case on the improved_zen branch.
As far as I remember, we saw two master nodes in the two different clusters. I cannot verify, as we had to reboot the cluster to continue other tests.

Do you have a stack trace in your ES logs?

No, unfortunately the logs were completely silent when _status failed with an NPE.
I tried to re-configure the logger to log everything from TRACE and up, but the node seemed to reload itself when I curled a new root logger level to it(?), which cleared up the issue (it joined the other master).

@SeanTAllen

We've seen this multiple times. We've since moved to dedicated master nodes and haven't seen it since, but I'm sure the possibility still exists.

@AeroNotix

Using dedicated master nodes is not a fix!

What dedicated master nodes do is lighten the load on the nodes acting as masters, which makes a split less likely.
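
For reference, the dedicated-master setup being discussed is usually expressed roughly like this (a sketch; the node counts and quorum value are assumptions, not from this thread):

    # elasticsearch.yml on the dedicated master-eligible nodes
    node.master: true
    node.data: false

    # elasticsearch.yml on the data nodes
    node.master: false
    node.data: true

    # With 3 master-eligible nodes, require a quorum of 2 to reduce split-brain risk
    discovery.zen.minimum_master_nodes: 2

Note that it is minimum_master_nodes, not the dedicated masters themselves, that limits the chance of two masters being elected.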

@bleskes
Owner

I'm going to close this, assuming it is indeed a duplicate of #2488.

@bleskes bleskes closed this