Failure to recover from 1-way network partition #6469
After a 1-way network partition, two nodes can get stuck with different views of which nodes are currently in the cluster.
I.e. with 5 nodes:
The cluster stays stuck in this state even though every node can reach every other node (verified with curl on the ES HTTP port and the binary port).
Some of these nodes give an NPE for the
All nodes are configured to discover all the others through unicast.
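To make the "different opinions" symptom concrete, here is a minimal sketch of how one might compare the membership view reported by each node. The node names and views are made up for illustration and are not the state from this report, and `group_by_view` and `is_consistent` are hypothetical helpers, not part of any Elasticsearch client.

```python
from collections import defaultdict


def group_by_view(views):
    """Group reporting nodes by the member set each one believes forms the cluster.

    `views` maps a node name to the set of node names it reports as members.
    """
    groups = defaultdict(list)
    for reporter, members in views.items():
        groups[frozenset(members)].append(reporter)
    return {view: sorted(reporters) for view, reporters in groups.items()}


def is_consistent(views):
    """True when every node reports the same membership."""
    return len(group_by_view(views)) <= 1


# Hypothetical views, as one might collect them by asking each node in turn.
views = {
    "node1": {"node1", "node2", "node3"},
    "node2": {"node1", "node2", "node3"},
    "node3": {"node3", "node4", "node5"},
    "node4": {"node3", "node4", "node5"},
    "node5": {"node3", "node4", "node5"},
}

for view, reporters in group_by_view(views).items():
    print(sorted(view), "reported by", reporters)
print("consistent:", is_consistent(views))
```

More than one group in the output means the nodes disagree about who is in the cluster.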
It sounds similar to issue #2488, which we are working on. I assume this one differs because of the asymmetric nature of the network partition, but in essence it is the same problem: you have two masters in your cluster (correct?).
Once a cluster has formed, ES will verify that all the nodes of the cluster are active (via pinging), but it will not keep looking for new nodes. The assumption is that if a new node comes up, it will actively join the cluster. The downside is that if you end up with two stable "sub" clusters, they will not discover each other.
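The behaviour described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Elasticsearch code: `fault_detection_round` and `simulate` are hypothetical names, and the only rule modelled is that a formed cluster pings the members it already knows and never searches for others.

```python
def fault_detection_round(members, reachable):
    """Ping only the nodes already known as members; drop those that do not answer.

    Nodes outside `members` are never contacted, so they can never be added here.
    """
    return {node for node in members if reachable(node)}


def simulate(sub_clusters, rounds=3):
    """Run several ping rounds after the partition has healed."""
    def everything_reachable(node):
        # Assumption: the network is fully healthy again.
        return True

    for _ in range(rounds):
        sub_clusters = [
            fault_detection_round(members, everything_reachable)
            for members in sub_clusters
        ]
    return sub_clusters


# Two stable sub-clusters left behind by the partition (made-up node names).
before = [{"node1", "node2"}, {"node3", "node4", "node5"}]
after = simulate(before)
print(after == before)
```

In this model the two member sets never change, however many rounds run, because nothing ever introduces one sub-cluster to the other.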
Do you have a stack trace in your ES logs?
Yes, the issue looks similar to #2488. It might be wise to use this as another test case on the improved_zen branch.
No, unfortunately the logs were completely silent when _status failed with an NPE.