Description
Yesterday we had some network upgrades which made 1/3 of our nodes unavailable for some time. We're using unicast discovery with a manual host list on 0.18.7.
I think that if a node can't discover its peer nodes, it should retry them periodically to re-establish contact with the rest of the cluster. As it stands, if the network has problems for 3x30s the cluster can split completely without any attempt to re-establish itself.
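For reference, here is a minimal sketch of the discovery settings involved, as they would appear in elasticsearch.yml (the hostnames are examples, and the explicit fd values are what I believe the 0.18.x defaults are; they match the "tried [3] times, each with maximum [30s] timeout" in the logs below):

# unicast discovery with a manual host list
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["dc01search-01", "dc01search-02", "dc01search-03"]
# node fault detection: 3 pings with a 30s timeout each, i.e. the 3x30s window mentioned above
discovery.zen.fd.ping_retries: 3
discovery.zen.fd.ping_timeout: 30s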
Here are the relevant logs from the node that got disconnected from the rest:
$ zgrep removed /var/log/elasticsearch/elasticsearch.log-20120320.gz
[2012-03-19 15:31:46,039][INFO ][cluster.service ] [dc01search-02] removed {[dc01search-03][i_Dqb2XLRi-CqqZgOIpSdA][inet[/10.149.206.44:9300]],}, reason: zen-disco-node_failed([dc01search-03][i_Dqb2XLRi-CqqZgOIpSdA][inet[/10.149.206.44:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2012-03-19 15:31:46,451][INFO ][cluster.service ] [dc01search-02] removed {[dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/10.149.206.41:9300]],}, reason: zen-disco-node_failed([dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/10.149.206.41:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
And here are the logs from one of the other nodes:
$ zcat /var/log/elasticsearch/elasticsearch.log-20120320.gz
[2012-03-19 15:31:46,038][INFO ][discovery.zen ] [dc01search-03] master_left [[dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2012-03-19 15:31:46,039][INFO ][cluster.service ] [dc01search-03] master {new [dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/10.149.206.41:9300]], previous [dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]]}, removed {[dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]],}, reason: zen-disco-master_failed ([dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]])
So at the time, dc01search-02 was the master. It became unavailable due to the network issues, dc01search-01 was elected as the new master, and until I manually restarted ES on dc01search-02 today the cluster remained in a split-brain state.
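For what it's worth, a quick way to see the split is to ask each node who it thinks the master is, something like the following (hostnames are the ones from the logs above; the HTTP port 9200 is an assumption, the logs only show the transport port 9300):

# if the cluster is split, the two sides report different master_node ids
$ curl -s 'http://dc01search-02:9200/_cluster/state' | grep -o '"master_node":"[^"]*"'
$ curl -s 'http://dc01search-03:9200/_cluster/state' | grep -o '"master_node":"[^"]*"'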