Another deadlock in cluster join? #2376

Closed
aphyr opened this Issue May 7, 2018 · 3 comments

aphyr commented May 7, 2018

On dgraph 5b93fb4 (v1.0.5-dev, 2018-05-03 06:17:34 -0700), roughly one test in 20 winds up with a node stuck in the cluster join process for over a minute, refusing to serve requests. In 20180507T151028.000-0500.zip, alpha on n2 gets stuck calling JoinCluster:

...
2018/05/07 13:10:53 draft.go:180: Node ID: 2 with GroupID: 1
2018/05/07 13:10:53 node.go:240: Group 1 found 0 entries
2018/05/07 13:10:53 draft.go:930: Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
2018/05/07 13:10:54 pool.go:108: == CONNECT ==> Setting n3:5080
2018/05/07 13:10:54 draft.go:930: Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
2018/05/07 13:10:55 draft.go:895: Calling IsPeer
2018/05/07 13:10:55 draft.go:900: Done with IsPeer call
2018/05/07 13:10:55 draft.go:947: New Node for group: 1
2018/05/07 13:10:55 draft.go:952: Retrieving snapshot.
2018/05/07 13:10:55 draft.go:955: Trying to join peers.
2018/05/07 13:10:55 draft.go:878: Calling JoinCluster

... where other nodes (e.g. n4) concurrently make it through JoinCluster, or don't seem to call JoinCluster at all. I haven't seen this cluster recover yet, but my automation gives up after a little over a minute, so this might just be a slow (60s?) timeout or something.
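One way to tell a hung join apart from a slow one is to check whether the node's last log line is still "Calling JoinCluster" once some threshold has passed. The following is only a rough sketch of that check, not part of dgraph or of the test harness: `parse_line` and `stuck_in_join` are hypothetical helper names, the 60-second default simply mirrors the guess above, and it assumes a node that gets past JoinCluster would log something afterwards (the excerpt does not confirm this). `now` must be given in the same clock and time zone as the log timestamps.

```python
from datetime import datetime, timedelta

# Layout of the timestamps in the log excerpt above, e.g. "2018/05/07 13:10:55".
TS_FORMAT = "%Y/%m/%d %H:%M:%S"


def parse_line(line):
    """Split one log line into (timestamp, source, message); None if it does not match."""
    parts = line.strip().split(" ", 3)
    if len(parts) < 4:
        return None
    try:
        ts = datetime.strptime(parts[0] + " " + parts[1], TS_FORMAT)
    except ValueError:
        return None
    # parts[2] looks like "draft.go:878:"; drop the trailing colon.
    return ts, parts[2].rstrip(":"), parts[3]


def stuck_in_join(log_text, now, threshold=timedelta(seconds=60)):
    """True if the last parseable line is "Calling JoinCluster" and is at least `threshold` old.

    Assumption: a node that completes the join logs at least one more line afterwards.
    """
    last = None
    for line in log_text.splitlines():
        parsed = parse_line(line)
        if parsed is not None:
            last = parsed
    if last is None:
        return False
    ts, _source, message = last
    return message == "Calling JoinCluster" and now - ts >= threshold


excerpt = """\
...
2018/05/07 13:10:55 draft.go:955: Trying to join peers.
2018/05/07 13:10:55 draft.go:878: Calling JoinCluster
"""

# 65 seconds after the last line, with the default 60-second threshold.
print(stuck_in_join(excerpt, now=datetime(2018, 5, 7, 13, 12, 0)))
```

Running the same check with a longer threshold (a few minutes, say) would help show whether the node eventually recovers or stays stuck indefinitely.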

@aphyr aphyr changed the title from Another deadlock in cluster join to Another deadlock in cluster join? May 7, 2018

@manishrjain manishrjain self-assigned this Jun 14, 2018

@manishrjain manishrjain removed their assignment Aug 14, 2018

mkcp commented Aug 25, 2018

Looks like it's resolved in 1.0.8-rc1! We can close this out

Member

manishrjain commented Aug 25, 2018

Thanks for confirming, @mkcp !

Member

manishrjain commented Aug 31, 2018

If this was not already fixed, the commit 8779066 fixed this issue.