Nodes give up trying to discover another node after 3x30s and never try again #1799

Closed
avar opened this Issue Mar 21, 2012 · 10 comments

@avar

Yesterday we had some network upgrades which made 1/3 of our nodes unavailable for some time. We're using unicast discovery with a manual host list on 0.18.7.

I think if a node can't discover its peer nodes, it should retry them periodically to re-establish contact with the rest of the cluster. As it stands, if the network has problems for 3x30s, the cluster can split completely without any attempt to re-establish itself.
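For reference, our unicast setup in elasticsearch.yml looks roughly like this (the exact host list here is illustrative):

# elasticsearch.yml -- unicast discovery with a manual host list
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["dc01search-01", "dc01search-02", "dc01search-03"]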

Here are the relevant logs from the node that got disconnected from the rest:

$ zgrep removed /var/log/elasticsearch/elasticsearch.log-20120320.gz
[2012-03-19 15:31:46,039][INFO ][cluster.service          ] [dc01search-02] removed {[dc01search-03][i_Dqb2XLRi-CqqZgOIpSdA][inet[/10.149.206.44:9300]],}, reason: zen-disco-node_failed([dc01search-03][i_Dqb2XLRi-CqqZgOIpSdA][inet[/10.149.206.44:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2012-03-19 15:31:46,451][INFO ][cluster.service          ] [dc01search-02] removed {[dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/10.149.206.41:9300]],}, reason: zen-disco-node_failed([dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/10.149.206.41:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout

And here are the logs from one of the other nodes:

$ zcat /var/log/elasticsearch/elasticsearch.log-20120320.gz
[2012-03-19 15:31:46,038][INFO ][discovery.zen            ] [dc01search-03] master_left [[dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]]], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2012-03-19 15:31:46,039][INFO ][cluster.service          ] [dc01search-03] master {new [dc01search-01][SGsefxoKSX2EdFYS2cku_w][inet[/10.149.206.41:9300]], previous [dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]]}, removed {[dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]],}, reason: zen-disco-master_failed ([dc01search-02][drJcyXbfSr687MW4PSN7Fg][inet[/10.149.208.42:9300]])

So: dc01search-02 was the master at the time; it became unavailable due to network issues, dc01search-01 was elected master in its place, and until I manually restarted ES on dc01search-02 today the cluster was in a split-brain state.

@Jagdeep1

I am also facing the same issue, on ES version 0.19.2.

@moliware

I'm also suffering from this issue in version 0.19.2.

@artemredkin

I've seen this with 0.19.5

@jimdickinson

This issue happens to us often in EC2, with 0.19.9.

@ferhatsb

After upgrading to 0.20.1 + JDK 7 on EC2, we have somehow faced this twice in a week, though we need more information to find out the root cause.

@fygrave

Seeing the same issue with 0.20.5. Any hints on the best strategy for dealing with it? (The way I deal with it is a Python script that monitors for a wrecked cluster and forces disconnected nodes to restart and re-discover the master; a rough sketch is below.)
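Roughly, a watchdog along those lines looks like this; the health URL, expected node count, and restart command are placeholders for your own environment:

#!/usr/bin/env python
# Hypothetical watchdog in the spirit of the script described above: poll the
# local node's cluster health API and restart it when it has fallen out of the
# cluster. URL, node count, and restart command are assumptions -- adapt them.
import json
import subprocess
import time
import urllib2

HEALTH_URL = "http://localhost:9200/_cluster/health"
EXPECTED_NODES = 3                                     # full cluster size (placeholder)
RESTART_CMD = ["service", "elasticsearch", "restart"]  # distro-specific (placeholder)

def visible_nodes():
    """Return the node count the local node sees, or None if it is unreachable."""
    try:
        health = json.load(urllib2.urlopen(HEALTH_URL, timeout=10))
        return health.get("number_of_nodes")
    except Exception:
        return None

while True:
    nodes = visible_nodes()
    if nodes is not None and nodes < EXPECTED_NODES:
        # The node sees a partial cluster; restarting forces it to re-run
        # unicast discovery and rejoin the elected master.
        subprocess.call(RESTART_CMD)
        time.sleep(300)  # let the node come back up and rejoin before re-checking
    else:
        time.sleep(60)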

@nordbergm

Also seeing the same issue in 0.20.5, on JRE 6 and 7, with unicast discovery. The cluster pretty much goes split-brain every night, and we sometimes have to do a full cluster restart to get the nodes to join up again. I can't remember having this issue with 0.19. Back in those days we used EC2 API discovery, but we're not on AWS anymore and now use unicast instead. Could this be a unicast-related bug?

@kimchy
elastic member

The idea is to set the minimum_master_nodes setting, in which case a disconnected node will not elect itself master, but will instead get back to a state where it keeps retrying to form the cluster again.
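For a three-node cluster like the one in the logs above, that means requiring a majority, i.e. (number of master-eligible nodes / 2) + 1 (a sketch; adjust to your own node count):

# elasticsearch.yml -- with 3 master-eligible nodes, require 2 to elect a master
discovery.zen.minimum_master_nodes: 2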

@HenleyChiu

@kimchy That's a workaround, but is there a way to increase the ping timeout?
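For reference, the zen fault-detection settings look like the relevant knobs here; the values below are illustrative, the defaults being the [30s] timeout and [3] retries visible in the logs above:

# elasticsearch.yml -- fault-detection pings (illustrative values)
discovery.zen.fd.ping_timeout: 60s   # default 30s
discovery.zen.fd.ping_retries: 6     # default 3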

@clintongormley
elastic member

As @kimchy commented, minimum_master_nodes is the right answer here. Closing.
