Description
Yesterday we had some network upgrades which made 1/3 of our nodes unavailable for some time. We're using unicast discovery with a manual host list on 0.18.7.
I think that if a node can't discover its peer nodes, it should retry them periodically to re-establish contact with the rest of the cluster. As it stands, if the network has problems for 3x30s the cluster can split completely without any attempt to re-establish itself.
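For reference, here is a minimal sketch of the discovery settings involved, as they would appear in elasticsearch.yml (the hostnames are examples, and the explicit fd values are what I believe the 0.18.x defaults are; they match the "tried [3] times, each with maximum [30s] timeout" in the logs below):

# unicast discovery with a manual host list
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["dc01search-01", "dc01search-02", "dc01search-03"]
# node fault detection: 3 pings with a 30s timeout each, i.e. the 3x30s window mentioned above
discovery.zen.fd.ping_retries: 3
discovery.zen.fd.ping_timeout: 30s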
Here are the relevant logs from the node that got disconnected from the rest:
$ zgrep removed /var/log/elasticsearch/elasticsearch.log-20120320.gz
[2012-03-19 15:31:46,039][INFO ][cluster.service ] [dc01search-02] removed {[dc01search-03][i_Dqb2XLRi-CqqZgOIpSdA][inet[/10.149.206.44:9300]],}, reason: zen-disco-node_failed([dc01search-03][i_Dqb2XLRi-CqqZgOIpSdA][inet[/10.149.206.44:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2012-03-19 15:31:46,451][INFO ][cluster.service ] [dc01search-02] removed {[dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/10.149.206.41:9300]],}, reason: zen-disco-node_failed([dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/10.149.206.41:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
And here are the logs from one of the other nodes:
$ zcat /var/log/elasticsearch/elasticsearch.log-20120320.gz
[2012-03-19 15:31:46,038][INFO ][discovery.zen ] [dc01search-03] master_left [[dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2012-03-19 15:31:46,039][INFO ][cluster.service ] [dc01search-03] master {new [dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/10.149.206.41:9300]], previous [dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]]}, removed {[dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]],}, reason: zen-disco-master_failed ([dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]])
So at the time, dc01search-02 was the master. It became unavailable due to the network issues, dc01search-01 was elected as the new master, and until I manually restarted ES on dc01search-02 today the cluster remained in a split-brain state.
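For what it's worth, a quick way to see the split is to ask each node who it thinks the master is, something like the following (hostnames are the ones from the logs above; the HTTP port 9200 is an assumption, the logs only show the transport port 9300):

# if the cluster is split, the two sides report different master_node ids
$ curl -s 'http://dc01search-02:9200/_cluster/state' | grep -o '"master_node":"[^"]*"'
$ curl -s 'http://dc01search-03:9200/_cluster/state' | grep -o '"master_node":"[^"]*"'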