
A possibly related issue: if the join process crashes on multiple machines, leaving the leader spinning and complaining about missing heartbeats from the down nodes, the leader never responds to a join request again. Every join attempt, concurrent or not, results in:
ubuntu@n2:/opt/etcd$ sudo bin/etcd -peer-addr n2:7001 -addr n2:4001 -peer-bind-addr 0.0.0.0:7001 -bind-addr 0.0.0.0:4001 -data-dir /var/lib/etcd -name n2 -peers n1:7001,n2:7001,n3:7001,n4:7001,n5:7001
[etcd] Apr 13 23:00:11.748 INFO | etcd server [name n2, listen on 0.0.0.0:4001, advertised url http://n2:4001]
[etcd] Apr 13 23:00:11.751 INFO | URLs: /_etcd/machines: / n2 ()
[etcd] Apr 13 23:00:11.753 INFO | Send Join Request to http://n1:7001/v2/admin/machines/n2
[etcd] Apr 13 23:00:12.103 WARNING | Attempt to join via n1:7001 failed: Unable to join: Put http://n1:7001/v2/admin/machines/n2: read tcp 192.168.122.11:7001: i/o timeout
[etcd] Apr 13 23:00:12.104 WARNING | Attempt to join via n3:7001 failed: Error during join version check: Get http://n3:7001/version: dial tcp 192.168.122.13:7001: connection refused
[etcd] Apr 13 23:00:12.104 WARNING | Attempt to join via n4:7001 failed: Error during join version check: Get http://n4:7001/version: dial tcp 192.168.122.14:7001: connection refused
[etcd] Apr 13 23:00:12.105 INFO | Send Join Request to http://n5:7001/v2/admin/machines/n2
[etcd] Apr 13 23:00:12.106 INFO | »»»» 307
[etcd] Apr 13 23:00:12.107 INFO | »»»» 404
[etcd] Apr 13 23:00:12.107 WARNING | Attempt to join via n5:7001 failed: Unable to join
[etcd] Apr 13 23:00:12.107 WARNING | No living peers are found!
[etcd] Apr 13 23:00:12.107 CRITICAL | No available peers in backup list, and no log data
These are in LXC containers; ping time between nodes is on the order of a tenth of a millisecond.
Will do! Happy to provide logs, try test branches, or anything else that might be of use. :)
@xiangli-cmu @philips
To fix this bug, we could merge #707 as a short-term solution. It makes a new node retry the join 3 times, so n3, n4, and n5 could then be added to the cluster.
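For reference, a rough sketch of that retry idea (illustrative only, not the actual code in #707; `sendJoinRequest`, `joinWithRetry`, and the retry count are hypothetical names — only the join URL shape is taken from the logs above):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// sendJoinRequest PUTs this node's join request to one peer's admin
// endpoint, matching the URL shape seen in the logs above.
func sendJoinRequest(peer, name string) error {
	url := fmt.Sprintf("http://%s/v2/admin/machines/%s", peer, name)
	req, err := http.NewRequest("PUT", url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return fmt.Errorf("join rejected by %s: %s", peer, resp.Status)
	}
	return nil
}

// joinWithRetry walks the whole peer list several times before giving up,
// instead of failing permanently after a single pass.
func joinWithRetry(peers []string, name string, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		for _, peer := range peers {
			if err := sendJoinRequest(peer, name); err != nil {
				lastErr = err
				continue // try the next peer in the list
			}
			return nil // joined successfully
		}
		// Every peer failed this round; pause briefly in case the
		// cluster is still electing a leader.
		time.Sleep(time.Second)
	}
	return fmt.Errorf("unable to join after %d attempts: %v", attempts, lastErr)
}

func main() {
	peers := []string{"n1:7001", "n3:7001", "n4:7001", "n5:7001"}
	if err := joinWithRetry(peers, "n2", 3); err != nil {
		fmt.Println(err)
	}
}
```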
Also, our request timeout is currently incorrect; please review #624, which should set requestTimeout to a more reasonable value.
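To illustrate what "more reasonable" could mean (the values and the formula below are assumptions for illustration, not necessarily what #624 does): tie the request timeout to the election timing instead of a small fixed constant, so a request can survive a leader change.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative raft timing values; the real defaults may differ.
	heartbeatInterval := 50 * time.Millisecond
	electionTimeout := 200 * time.Millisecond

	// Let a request outlive at least one election round plus some slack,
	// so a transient leader change does not immediately fail it.
	requestTimeout := 2*electionTimeout + heartbeatInterval
	fmt.Println("requestTimeout =", requestTimeout) // 450ms with these values
}
```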
For tests, cluster creation currently starts the first instance before starting the rest in parallel, to save time, but this makes the tests less powerful. I would try to remove that special case.
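A sketch of what a fully parallel start could look like in a test harness (hypothetical helper, not the actual etcd test code; only the `-name` and `-data-dir` flags come from the real invocation above):

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// startClusterInParallel launches every instance at once, with no special
// treatment for the first node, so the tests exercise concurrent startup.
func startClusterInParallel(names []string) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(names))
	for _, name := range names {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			cmd := exec.Command("bin/etcd",
				"-name", name,
				"-data-dir", "/tmp/"+name) // remaining flags omitted for brevity
			if err := cmd.Start(); err != nil {
				errs <- fmt.Errorf("%s: %v", name, err)
			}
		}(name)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		return err // surface the first launch failure, if any
	}
	return nil
}

func main() {
	if err := startClusterInParallel([]string{"n1", "n2", "n3", "n4", "n5"}); err != nil {
		fmt.Println(err)
	}
}
```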
For a long-term solution, nodes should wait longer if the cluster is in the middle of a leader election.
I would also increase the timeout for join commands; joining is an important operation for the cluster and deserves a longer deadline.
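As an illustration of those last two points (hypothetical names and values, not the actual etcd configuration): give the join request its own, much longer deadline than ordinary requests, so it can ride out a leader election instead of giving up after a few hundred milliseconds.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Hypothetical per-request deadlines: ordinary requests keep a short
// timeout, while the join request gets a much longer one.
const (
	defaultRequestTimeout = 300 * time.Millisecond
	joinRequestTimeout    = 10 * time.Second
)

func clientFor(join bool) *http.Client {
	timeout := defaultRequestTimeout
	if join {
		timeout = joinRequestTimeout
	}
	return &http.Client{Timeout: timeout}
}

func main() {
	fmt.Println("join client timeout:   ", clientFor(true).Timeout)
	fmt.Println("default client timeout:", clientFor(false).Timeout)
}
```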
I can confirm that master dramatically reduces the probability of getting the cluster into a broken state during startup; I can reliably stand up a cluster on the first try now (staggered start, 2 seconds between each process launch), instead of taking 5-6 tries.
When joining fresh nodes to the cluster, I've found a reproducible, though somewhat random, crash where the node reports that either the version check or the join request timed out, or that its connection was refused. Running the joins simultaneously reliably crashes all but one joiner; running them with a 5-second delay usually works but still crashes sometimes. I'm not sure whether this is an artifact of aggressive I/O timeouts or something else; it looks like the nodes gave up after only a few hundred milliseconds. Re-running the join process from the crashed node has a chance to reconnect, so this whole thing smells of a timing bug or race condition.
n1
n2
n3
n4
n5