Possible race condition in cluster join #716
When joining fresh nodes to the cluster, I've found a reproducible but somewhat random crash where the node claims that either the version or join request timed out, or its connection was refused. Running the joins simultaneously reliably crashes all but one joiner; running them with a 5-second delay usually works but still crashes sometimes. I'm not sure whether this is an artifact of aggressive IO timeouts or what; it looks like the nodes gave up after only a few hundred milliseconds. Re-running the join process from the crashed node has a chance to reconnect, so this whole thing smells of a timing bug or race condition.
A possibly related issue: if the join process crashes on multiple machines, leaving the leader spinning and complaining about missing heartbeats from the down nodes, the leader never responds to a join request again. Every join attempt, concurrent or not, results in
These are in LXC containers; ping time between nodes is on the order of a tenth of a millisecond.
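Since re-running the join from a crashed node sometimes succeeds, a client-side retry with jittered backoff might serve as a stopgap for the transient failures. The following is a minimal sketch under stated assumptions: `join` is a hypothetical callable (not this project's API) that raises `TimeoutError` or `ConnectionError` when the version or join request times out or is refused, and the attempt count and delays are illustrative. It would not help with the wedged-leader case, where every join fails.

```python
import random
import time


def join_with_retry(join, attempts=5, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Call join() until it succeeds, backing off with jitter between tries.

    `join` is a hypothetical stand-in for whatever performs the cluster join;
    it is assumed to raise TimeoutError or ConnectionError on failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return join()
        except (TimeoutError, ConnectionError) as err:
            last_error = err
            if attempt == attempts - 1:
                break  # out of attempts; re-raise below
            # Exponential backoff capped at max_delay. The random jitter keeps
            # simultaneous joiners from retrying in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise last_error
```

Each joiner would wrap its own join call, for example `join_with_retry(lambda: my_join(leader_addr))`, where `my_join` and `leader_addr` are placeholders. The jitter matters more than the exact delays here, since the failures appear when several nodes join at the same moment.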