On dgraph 5b93fb4 (v1.0.5-dev, 2018-05-03 06:17:34 -0700), roughly one test in 20 winds up with a node stuck in the cluster join process for over a minute, refusing to serve requests. In 20180507T151028.000-0500.zip, alpha on n2 gets stuck calling JoinCluster:
...
2018/05/07 13:10:53 draft.go:180: Node ID: 2 with GroupID: 1
2018/05/07 13:10:53 node.go:240: Group 1 found 0 entries
2018/05/07 13:10:53 draft.go:930: Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
2018/05/07 13:10:54 pool.go:108: == CONNECT ==> Setting n3:5080
2018/05/07 13:10:54 draft.go:930: Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
2018/05/07 13:10:55 draft.go:895: Calling IsPeer
2018/05/07 13:10:55 draft.go:900: Done with IsPeer call
2018/05/07 13:10:55 draft.go:947: New Node for group: 1
2018/05/07 13:10:55 draft.go:952: Retrieving snapshot.
2018/05/07 13:10:55 draft.go:955: Trying to join peers.
2018/05/07 13:10:55 draft.go:878: Calling JoinCluster
... where other nodes (e.g. n4) concurrently make it through JoinCluster, or don't seem to call JoinCluster at all. I haven't seen this cluster recover yet, but my automation gives up after a little over a minute, so this might just be a slow (60s?) timeout rather than a genuine deadlock.
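For what it's worth, one way to check the slow-timeout hypothesis would be to keep polling the stuck node well past the one-minute mark where my automation gives up. A minimal sketch of such a probe is below; it's hypothetical, and assumes the stuck alpha's HTTP /health endpoint is reachable at n2:8080 (the default HTTP port), which may not match this cluster's config:

```go
// Hypothetical probe, not part of the test harness: poll the stuck node's
// /health endpoint for several minutes to see whether it eventually
// finishes JoinCluster and starts serving, or stays wedged.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	const target = "http://n2:8080/health" // assumed default alpha HTTP port
	deadline := time.Now().Add(5 * time.Minute)
	client := &http.Client{Timeout: 2 * time.Second}

	for time.Now().Before(deadline) {
		resp, err := client.Get(target)
		if err == nil {
			ok := resp.StatusCode == http.StatusOK
			resp.Body.Close()
			if ok {
				fmt.Printf("%s became healthy at %s\n", target, time.Now().Format(time.RFC3339))
				return
			}
		}
		time.Sleep(time.Second)
	}
	fmt.Printf("%s still unhealthy after 5m; looks like a genuine deadlock\n", target)
}
```

If the node comes up after a couple of minutes, this is probably just a long retry/timeout in the join path; if it never does, it's a real deadlock.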