
Failure to recover from 1-way network partition #6469

Closed
magnhaug opened this issue Jun 11, 2014 · 5 comments

@magnhaug
After a 1-way network partition scenario, two nodes can end up stuck with different views of which nodes are currently in the cluster.

I.e. with 5 nodes:
Node 1 sees node 1, 3 and 5. This is a RED cluster.
Node 2 sees node 1, 2, 4 and 5. This is a GREEN cluster.
All of this according to the nodes endpoint.

The cluster is stuck in this state, even though every node could reach every other node (verified with curl on the ES HTTP port and binary port).

Some of these nodes give an NPE for the _status endpoint:
{"error":"NullPointerException[null]","status":500}

All nodes are configured to discover all the others through unicast.
Observed on ES 0.90.13 on RHEL 6.5.
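
For reference, the checks above were done with commands roughly like these ("node1" is a placeholder host, default HTTP port 9200 assumed, endpoint names as I remember them from the 0.90 API):

```sh
# Ask a node for its own view of cluster health and membership
# (run against each node's HTTP port in turn).
curl -s 'http://node1:9200/_cluster/health?pretty'
curl -s 'http://node1:9200/_cluster/nodes?pretty'

# The call that returns {"error":"NullPointerException[null]","status":500}
# on some of the nodes:
curl -s 'http://node1:9200/_status?pretty'
```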

@bleskes bleskes self-assigned this Jun 16, 2014
@bleskes
Contributor

bleskes commented Jun 16, 2014

It sounds like a similar issue to #2488, which we are working on. I assume this one is different due to the asymmetric nature of the network partition, but in essence it is the same: you have two masters in your cluster (correct?).

> The cluster is stuck in this state, even though every node could reach every other node (verified with curl on the ES HTTP port and binary port).

Once a cluster has formed, ES will verify that all the nodes of the cluster are active (via pinging), but it will not continuously look for new nodes. The assumption is that if a new node comes up, it will actively join the cluster. The downside is that if you end up with two stable "sub" clusters, they will not discover each other.
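
A quick way to see that you have two such sub-clusters is to ask every node which master it currently follows; in a split cluster the answers will disagree. A sketch (hypothetical host names, default HTTP port assumed):

```sh
# Print the master each node currently follows; the two sub-clusters
# will report different node ids here.
for host in node1 node2 node3 node4 node5; do
  echo -n "$host: "
  curl -s "http://$host:9200/_cluster/state?pretty" | grep '"master_node"'
done
```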

> Some of these nodes give an NPE for the _status endpoint:
> {"error":"NullPointerException[null]","status":500}

Do you have a stack trace in your ES logs?

@magnhaug
Author

Yes, the issue looks similar to #2488; it might be wise to use this as another test case on the improved_zen branch.
As far as I remember, we saw two master nodes in the two different clusters. I cannot verify this now, as we had to reboot the cluster to continue other tests.

> Do you have a stack trace in your ES logs?

No, unfortunately the logs were completely silent when _status failed with an NPE.
I tried to reconfigure the logger to log everything from TRACE and up, but the node seemed to reload itself when I curled a new root logger level to it (?), which cleared up the issue (it joined the other master).
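
For the record, the logger change was made over HTTP, roughly along these lines; the exact root logger setting name and whether the dynamic logger.* settings behave like this on 0.90.13 are assumptions on my part:

```sh
# Sketch: raise the root logger to TRACE via the cluster update settings API.
# "logger._root" and its dynamic updatability on 0.90.13 are assumptions.
curl -XPUT 'http://node1:9200/_cluster/settings' -d '{
  "transient": { "logger._root": "TRACE" }
}'
```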

@SeanTAllen

We've seen this multiple times. We've since moved to dedicated master nodes and haven't seen it since, but I'm sure the possibility still exists.
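
For reference, a dedicated-master setup of that era looks roughly like this (a sketch, not our exact configuration; the same settings can live in elasticsearch.yml):

```sh
# Sketch (0.90/1.x-era setting names): master-eligible nodes that hold no data.
bin/elasticsearch -Des.node.master=true -Des.node.data=false

# Data nodes that can never be elected master:
bin/elasticsearch -Des.node.master=false -Des.node.data=true
```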

@AeroNotix

Using dedicated master nodes is not a fix!

What dedicated master nodes do is lighten the load on the nodes that act as masters, which makes a split less likely.

@bleskes
Contributor

bleskes commented Jan 26, 2015

I'm going to close this assuming it is indeed a duplicate of #2488.

@bleskes bleskes closed this as completed Jan 26, 2015