
Nodes give up trying to discover another node after 3x30s and never try again #1799

Closed
@avar

Description


Yesterday we had some network upgrades which made 1/3 of our nodes unavailable for some time. We're using unicast discovery with a manual host list on 0.18.7.

I think that if a node can't discover its peer nodes it should retry periodically to re-establish contact with the rest of the cluster. As it stands, if the network has problems for 3x30s the cluster can split completely without any attempt to re-establish itself.
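
As far as I can tell the 3x30s comes from the zen fault detection settings, discovery.zen.fd.ping_retries and discovery.zen.fd.ping_timeout. Here's a sketch of the relevant parts of elasticsearch.yml, with the fd.* values spelled out at what I believe are the defaults, just to show where the "tried [3] times, each with maximum [30s] timeout" in the logs comes from (the host list uses our three nodes from the logs below):

# elasticsearch.yml -- sketch; fd.* values are what I believe are the defaults
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.149.206.41:9300", "10.149.206.44:9300", "10.149.208.42:9300"]
discovery.zen.fd.ping_interval: 1s   # how often each node is pinged
discovery.zen.fd.ping_timeout: 30s   # the [30s] in the log message
discovery.zen.fd.ping_retries: 3     # the [3] in the log message

Raising ping_retries or ping_timeout would only have postponed the removal; the problem I'm reporting is that once a node has been removed there's no periodic attempt to get it back.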

Here are the relevant logs from the node that got disconnected from the rest:

$ zgrep removed /var/log/elasticsearch/elasticsearch.log-20120320.gz
[2012-03-19 15:31:46,039][INFO ][cluster.service          ] [dc01search-02] removed {[dc01search-03][i_Dqb2XLRi-CqqZgOIpSdA][inet[/10.149.206.44:9300]],}, reason: zen-disco-node_failed([dc01search-03][i_Dqb2XLRi-CqqZgOIpSdA][inet[/10.149.206.44:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2012-03-19 15:31:46,451][INFO ][cluster.service          ] [dc01search-02] removed {[dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/10.149.206.41:9300]],}, reason: zen-disco-node_failed([dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/10.149.206.41:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout

And here are the logs from one of the other nodes:

$ zcat /var/log/elasticsearch/elasticsearch.log-20120320.gz
[2012-03-19 15:31:46,038][INFO ][discovery.zen            ] [dc01search-03] master_left [[dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]]], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2012-03-19 15:31:46,039][INFO ][cluster.service          ] [dc01search-03] master {new [dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/10.149.206.41:9300]], previous [dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]]}, removed {[dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]],}, reason: zen-disco-master_failed ([dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]])

So at the time dc01search-02 was the master. It became unavailable due to network issues, dc01search-01 was elected master in its place, and until I manually restarted ES on dc01search-02 today the cluster was in a split-brain state.
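
For what it's worth, requiring a master quorum should at least keep an isolated node from continuing to act as master on its own. A minimal sketch, assuming discovery.zen.minimum_master_nodes is honoured on 0.18.7 (I haven't verified which release introduced it):

# elasticsearch.yml on all three nodes -- sketch
# with 3 master-eligible nodes, require a majority (2) before electing or keeping a master
discovery.zen.minimum_master_nodes: 2

That wouldn't have brought dc01search-02 back into the cluster by itself, but it should have stopped both sides from acting as masters at the same time.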
