Failure to recover from 1-way network partition #6469
Comments
This sounds similar to issue #2488, which we are working on. I assume this one differs because the network partition is asymmetric, but in essence it is the same problem: you end up with two masters in your cluster (correct?).
Once a cluster has formed, ES verifies that all the nodes of the cluster are active (via pinging), but it does not continuously look for new nodes. The assumption is that a new node coming up will actively join the cluster. The downside is that if you end up with two stable "sub" clusters, they will not discover each other.
Do you have a stack trace in your ES logs?
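The behaviour described above (known members are pinged, new nodes are not searched for) can be illustrated with a toy model. This is only a sketch under stated assumptions, not ES's actual discovery code: `fault_detection_round` and the node names are hypothetical, and a "round" here just means dropping members that do not answer a ping.

```python
def fault_detection_round(known_members, reachable):
    """Return the members that survive one ping round (toy model).

    Only nodes already in known_members are pinged, so the result can
    shrink but can never gain a node that was not already known.
    """
    return {node for node in known_members if node in reachable}


# Two sub-clusters that formed during a partition (hypothetical names).
cluster_a = {"node1", "node3", "node5"}
cluster_b = {"node2", "node4"}

# The partition has healed: every node can reach every other node.
all_reachable = cluster_a | cluster_b

after_a = fault_detection_round(cluster_a, all_reachable)
after_b = fault_detection_round(cluster_b, all_reachable)

# Neither membership set has grown, so the sub-clusters stay separate.
print(sorted(after_a))
print(sorted(after_b))
```

In this model the two sets stay apart until some node actively tries to join the other side, which matches the assumption stated in the comment.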
Yes, the issue looks similar to #2488; it might be wise to use this as another test case on the improved_zen branch.
No, unfortunately the logs were completely silent when _status failed with an NPE.
We've seen this multiple times. We've since moved to dedicated master nodes. Since then we haven't seen it again, but I'm sure the possibility still exists.
Using dedicated master nodes is not a fix! What dedicated master nodes do is lighten the load on the nodes that act as masters, which makes a split less likely.
I'm going to close this assuming it is indeed a duplicate of #2488.
After a 1-way network partition, two nodes can end up stuck with different views of which nodes are currently in the cluster.
For example, with 5 nodes:
Node 1 sees nodes 1, 3 and 5. This is a RED cluster.
Node 2 sees nodes 1, 2, 4 and 5. This is a GREEN cluster.
All of this according to the `nodes` endpoint. The cluster is stuck in this state, even when every node could reach every other node (verified with curl on the ES HTTP port and binary port).
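To make the disagreement above easier to spot, here is a minimal sketch that compares the membership views reported by each node. It assumes the views have already been collected by hand (here only the two views given in this report); `group_by_view` and `one_way_pairs` are hypothetical helper names, not part of ES.

```python
from collections import defaultdict


def group_by_view(views):
    """Map each distinct membership view to the nodes reporting it."""
    groups = defaultdict(list)
    for reporter, seen in views.items():
        groups[frozenset(seen)].append(reporter)
    return dict(groups)


def one_way_pairs(views):
    """Return (a, b) pairs where a lists b as a member but b does not list a.

    Only nodes whose own view is known can be checked as b.
    """
    pairs = []
    for a, seen_by_a in views.items():
        for b in seen_by_a:
            if b != a and b in views and a not in views[b]:
                pairs.append((a, b))
    return sorted(pairs)


# Views taken from this report; the views of nodes 3, 4 and 5 are unknown.
views = {
    1: {1, 3, 5},
    2: {1, 2, 4, 5},
}

groups = group_by_view(views)
print(len(groups) > 1)       # True when the nodes disagree on membership
print(one_way_pairs(views))  # node 2 lists node 1, but not the other way round
```

More than one group means the nodes disagree on membership, and a non-empty `one_way_pairs` result points at the asymmetric relationship left behind by the 1-way partition.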
Some of these nodes give an NPE for the `_status` endpoint: `{"error":"NullPointerException[null]","status":500}`
All nodes are configured to discover all the others through unicast.
Discovered on ES 0.90.13 on RHEL 6.5.