"No node has been set up yet" during concurrent cluster join #2145
On the current nightly (2018/02/19), concurrently launching a five-node cluster can occasionally deadlock with errors like...
... when n3, the node n1 is trying to join, is in some sort of timeout loop on its join RPC:
See full logs, attached:
This one is hard to reproduce (I ran the upsert test 100 times without hitting it), though I can see how it could happen.
This happens because
Removing the timeout fixes this. Though it doesn't fix the case where
Sometimes EntryConfChange proposals were silently dropped by Raft, and without a timeout JoinCluster would be stuck waiting for a response.
JoinCluster is already in a loop; the problem is that after the first "context deadline exceeded" error, all subsequent requests fail because there is already a
I think the solution here is simple: check on the leader whether the node being added is already part of the