Failure to recover from 1-way network partition #6469

Closed
magnhaug opened this Issue Jun 11, 2014 · 5 comments

After a 1-way network partition, two nodes can end up stuck with different views of which nodes are currently in the cluster.

E.g. with 5 nodes:
Node 1 sees nodes 1, 3 and 5. This is a RED cluster.
Node 2 sees nodes 1, 2, 4 and 5. This is a GREEN cluster.
All of this according to the nodes endpoint.

The cluster is stuck in this state, even though every node could reach every other node (verified with curl against the ES HTTP port and the binary transport port).

Some of these nodes give an NPE for the _status endpoint:
{"error":"NullPointerException[null]","status":500}

All nodes are configured to discover all the others through unicast.
Discovered on ES 0.90.13 on RHEL 6.5.
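
One way to compare each node's view of the cluster from the shell (a sketch; the hostnames and port below are placeholders, not the actual setup):

    # Ask every node for its own view of the cluster. number_of_nodes and
    # status will differ between the two sides once the cluster has split.
    # es-node1..es-node5 and port 9200 are placeholder values.
    for host in es-node1 es-node2 es-node3 es-node4 es-node5; do
      echo "== $host =="
      curl -s "http://$host:9200/_cluster/health?pretty"
    done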

bleskes self-assigned this Jun 16, 2014

bleskes (Member) commented Jun 16, 2014

It sounds like a similar issue to #2488, which we are working on. I assume this one is different due to the asymmetric nature of the network partition, but in essence it is the same: you have two masters in your cluster (correct?).

The cluster is stuck in this state, even though every node could reach every other node (verified with curl against the ES HTTP port and the binary transport port).

Once a cluster has formed, ES will verify that all the nodes of the cluster are active (via pinging), but it will not continuously look for new nodes. The assumption is that if a new node comes up, it will actively join the cluster. The downside is that if you end up with two stable "sub" clusters, they will not discover each other.
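
A quick way to see whether the two sub-clusters really do follow different masters (a sketch; the hostnames and port are placeholders) is to compare the master_node field of the cluster state on one node from each side:

    # Each node reports the node id of the master it currently follows; after a
    # split, nodes from the two sides return different ids. es-node1/es-node2
    # stand in for one node from each sub-cluster.
    for host in es-node1 es-node2; do
      echo "== $host =="
      curl -s "http://$host:9200/_cluster/state?pretty" | grep '"master_node"'
    done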

Some of these nodes give an NPE for the _status endpoint:
{"error":"NullPointerException[null]","status":500}

Do you have a stack trace in your ES logs?

magnhaug commented Jun 16, 2014

Yes, the issue looks similar to #2488; it might be wise to use this as another test case on the improved_zen branch.
As far as I remember, we saw two master nodes in the two different clusters. I cannot verify this, as we had to reboot the cluster to continue other tests.

Do you have a stack trace in your ES logs?

No, unfortunately the logs were completely silent when _status failed with an NPE.
I tried to reconfigure the logger to log everything from TRACE and up, but the node seemed to reload itself when I curled a new root logger level to it (?), which cleared up the issue (it joined the other master).
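
For reference, on later releases discovery logging can also be raised at runtime through the cluster settings API; whether this dynamic logger.* setting applies to 0.90.13 is an assumption (a sketch, with localhost:9200 as a placeholder):

    # Sketch: bump discovery logging without restarting the node. This uses the
    # 1.x-style dynamic "logger.*" cluster setting; availability on 0.90.13 is
    # an assumption. localhost:9200 is a placeholder.
    curl -XPUT "http://localhost:9200/_cluster/settings" -d '{
      "transient": { "logger.discovery": "TRACE" }
    }'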

SeanTAllen commented Jun 19, 2014

We've seen this multiple times. We've since moved to dedicated master nodes and haven't seen it again, but I'm sure the possibility still exists.

AeroNotix commented Jul 9, 2014

Using dedicated master nodes is not a fix!!!!!!111!!!

What dedicated master nodes do is lighten the load on the nodes that act as masters, which makes a split less likely.
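
For completeness, dedicated masters are usually paired with a quorum requirement on master-eligible nodes (a sketch; the config path and the assumption of three master-eligible nodes are not from this thread):

    # Sketch of a dedicated-master setup on 0.90.x; the config path is an
    # assumption (RPM default). With 3 master-eligible nodes, a quorum of 2
    # must be visible before a master is elected.
    printf '%s\n' \
      'node.master: true' \
      'node.data: false' \
      'discovery.zen.minimum_master_nodes: 2' \
      >> /etc/elasticsearch/elasticsearch.yml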

bleskes (Member) commented Jan 26, 2015

I'm going to close this assuming it is indeed a duplicate of #2488.

bleskes closed this Jan 26, 2015
