Nodes give up trying to discover another node after 3x30s and never try again #1799

Yesterday we had some network upgrades which made 1/3 of our nodes unavailable for some time. We're using unicast discovery with a manual host list on 0.18.7.

I think that if a node can't discover its peer nodes, it should retry them periodically to re-establish contact with the rest of the cluster. As it stands, if the network has problems for 3x30s, the cluster can split completely without any further attempt to re-establish itself.

Here are the relevant logs from the node that got disconnected from the rest:

$ zgrep removed /var/log/elasticsearch/elasticsearch.log-20120320.gz
[2012-03-19 15:31:46,039][INFO ][cluster.service          ] [dc01search-02] removed {[dc01search-03][i_Dqb2XLRi-CqqZgOIpSdA][inet[/]],}, reason: zen-disco-node_failed([dc01search-03][i_Dqb2XLRi-CqqZgOIpSdA][inet[/]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2012-03-19 15:31:46,451][INFO ][cluster.service          ] [dc01search-02] removed {[dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/]],}, reason: zen-disco-node_failed([dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout

And here are the logs from one of the other nodes:

$ zcat /var/log/elasticsearch/elasticsearch.log-20120320.gz
[2012-03-19 15:31:46,038][INFO ][discovery.zen            ] [dc01search-03] master_left [[dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/]]], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2012-03-19 15:31:46,039][INFO ][cluster.service          ] [dc01search-03] master {new [dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/]], previous [dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/]]}, removed {[dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/]],}, reason: zen-disco-master_failed ([dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/]])

So at the time, dc01search-02 was the master. It became unavailable due to network issues and dc01search-01 was elected master. Until I manually restarted ES on dc01search-02 today, the cluster was in a split-brain state.


I am also facing the same issue. ES version 0.19.2


I'm suffering from this issue also in version 0.19.2


I've seen this with 0.19.5


This issue happens to us often in EC2, with 0.19.9.


After upgrading to 0.20.1 + JDK 7 in EC2, we have somehow hit this twice in a week, though we need more information to find the root cause.


Seeing the same issue with 20.5. Any hints on the "best strategy" for dealing with it? (The way I deal with it is a Python script that monitors for cluster wreck and forces the disconnected nodes to restart and re-discover the master.)
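
For anyone building a similar monitor, here is a minimal sketch of just the detection step. It assumes you already have some way to ask each node which master it currently sees (how you collect that is up to you and is not shown). The function names and the sample data are hypothetical, with node names borrowed from the logs in the original report.

```python
from collections import defaultdict


def group_by_master(views):
    """Group node names by the master each node reports.

    `views` maps node name -> name of the master that node believes in.
    Collecting these views from the cluster is assumed, not shown.
    """
    groups = defaultdict(list)
    for node, master in views.items():
        groups[master].append(node)
    return {master: sorted(nodes) for master, nodes in groups.items()}


def is_split(views):
    """True when the nodes disagree about who the master is."""
    return len(group_by_master(views)) > 1


# Hypothetical snapshot matching the situation described in this issue:
# dc01search-02 still thinks it is master, the other two elected dc01search-01.
views = {
    "dc01search-01": "dc01search-01",
    "dc01search-02": "dc01search-02",
    "dc01search-03": "dc01search-01",
}

if is_split(views):
    for master, nodes in sorted(group_by_master(views).items()):
        print(f"master {master} is seen by: {', '.join(nodes)}")
```

The smaller group is usually the one to restart, but that decision is deliberately left out of the sketch.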


Also seeing the same issue in 0.20.5, JRE 6 and 7, unicast discovery. The cluster pretty much goes split brain every night and we sometimes have to do a full cluster restart to get the nodes to join up again. I can't remember having this issue with 0.19. Back in those days we used EC2 API discovery, but we're not on AWS anymore and now use unicast instead. Could this be a unicast-related bug?

The idea is to set the minimum_master_nodes setting. With it set, a disconnected node will not elect itself as master; instead it goes back to a state where it keeps retrying to form the cluster.
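
A small sketch of how the value is usually picked, assuming the common majority rule of (master-eligible nodes / 2) + 1. The function name is hypothetical, and the result would go into the node configuration as `discovery.zen.minimum_master_nodes`.

```python
def minimum_master_nodes(master_eligible):
    """Majority quorum for a given number of master-eligible nodes.

    Assumes the usual rule of thumb: floor(n / 2) + 1.
    """
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# The cluster in this issue has three nodes, so the quorum is 2:
# a lone disconnected node cannot reach it and cannot elect itself.
quorum = minimum_master_nodes(3)
print(quorum)

# Two disjoint partitions can never both hold a majority.
for n in range(1, 10):
    assert 2 * minimum_master_nodes(n) > n
```

Note that with only two master-eligible nodes the quorum is also 2, so losing either node leaves the cluster without a master.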


@kimchy That's a workaround, but is there a way to increase the ping timeout?
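
For context on the numbers in the log messages ("tried [3] times, each with maximum [30s] timeout"), here is a rough sketch of the worst-case time before a peer is declared failed. It assumes the retries run back to back and ignores the interval between pings. The function and parameter names are hypothetical; as far as I know the matching zen fault-detection settings are `discovery.zen.fd.ping_retries` and `discovery.zen.fd.ping_timeout`, but check the docs for your version before relying on them.

```python
def worst_case_detection_seconds(ping_retries=3, ping_timeout_s=30.0):
    """Upper bound on how long failed pings take before a node is dropped.

    Simplification: every attempt is assumed to use its full timeout,
    and the pause between pings is not counted.
    """
    if ping_retries < 1 or ping_timeout_s <= 0:
        raise ValueError("retries must be >= 1 and timeout must be positive")
    return ping_retries * ping_timeout_s


# Values taken from the log lines above: 3 tries of 30s each.
print(worst_case_detection_seconds())

# Doubling the timeout doubles the window a network blip has to last.
print(worst_case_detection_seconds(ping_timeout_s=60.0))
```

Raising the timeout only widens the window; it does not make a node that has already been dropped rejoin.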

As @kimchy commented, minimum_master_nodes is the right answer here. Closing.

