Another deadlock in cluster join? #2376

Closed
aphyr opened this issue May 7, 2018 · 3 comments
@aphyr aphyr commented May 7, 2018

On dgraph 5b93fb4 (v1.0.5-dev, 2018-05-03 06:17:34 -0700), roughly one test in 20 winds up with a node stuck in the cluster join process for over a minute, refusing to serve requests. In 20180507T151028.000-0500.zip, alpha on n2 gets stuck calling JoinCluster:

...
2018/05/07 13:10:53 draft.go:180: Node ID: 2 with GroupID: 1
2018/05/07 13:10:53 node.go:240: Group 1 found 0 entries
2018/05/07 13:10:53 draft.go:930: Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
2018/05/07 13:10:54 pool.go:108: == CONNECT ==> Setting n3:5080
2018/05/07 13:10:54 draft.go:930: Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
2018/05/07 13:10:55 draft.go:895: Calling IsPeer
2018/05/07 13:10:55 draft.go:900: Done with IsPeer call
2018/05/07 13:10:55 draft.go:947: New Node for group: 1
2018/05/07 13:10:55 draft.go:952: Retrieving snapshot.
2018/05/07 13:10:55 draft.go:955: Trying to join peers.
2018/05/07 13:10:55 draft.go:878: Calling JoinCluster

... where other nodes (e.g. n4) concurrently make it through JoinCluster, or don't seem to call JoinCluster at all. I haven't seen this cluster recover yet, but my automation gives up after a little over a minute, so this might just be a slow (60s?) timeout or something.
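
For triaging runs like this one, here is a minimal sketch of a check that flags a node whose log ends at the JoinCluster call. It assumes the log line format shown in the excerpt above (date, time, `file:line:`, message); the function names `parse_log` and `stuck_in_join` and the 60-second threshold are illustrative assumptions, not part of dgraph or of the test harness.

```python
import re
from datetime import datetime

# Matches lines in the format shown in the excerpt, e.g.
# "2018/05/07 13:10:55 draft.go:878: Calling JoinCluster"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<file>[\w.]+):(?P<line>\d+): (?P<msg>.*)$"
)


def parse_log(lines):
    """Return (timestamp, source file, source line, message) for each parsable line."""
    events = []
    for raw in lines:
        m = LOG_LINE.match(raw.strip())
        if m is None:
            continue  # skip elided ("...") or otherwise unparsable lines
        ts = datetime.strptime(m.group("ts"), "%Y/%m/%d %H:%M:%S")
        events.append((ts, m.group("file"), int(m.group("line")), m.group("msg")))
    return events


def stuck_in_join(lines, now, threshold_seconds=60):
    """True if the last parsed event is 'Calling JoinCluster' and it is at
    least threshold_seconds old (hypothetical threshold, chosen to match the
    'over a minute' observation in the report)."""
    events = parse_log(lines)
    if not events:
        return False
    ts, _file, _line, msg = events[-1]
    age = (now - ts).total_seconds()
    return msg == "Calling JoinCluster" and age >= threshold_seconds
```

Passing the node's log lines and the time the test gave up to `stuck_in_join` separates "still in JoinCluster after a minute" from nodes that logged further progress. It cannot tell a true deadlock from a slow timeout; that would need a longer observation window.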

@aphyr aphyr changed the title Another deadlock in cluster join Another deadlock in cluster join? May 7, 2018
@manishrjain manishrjain self-assigned this Jun 14, 2018
@manishrjain manishrjain removed their assignment Aug 14, 2018
@mkcp mkcp commented Aug 25, 2018

Looks like it's resolved in 1.0.8-rc1! We can close this out.

@manishrjain manishrjain commented Aug 25, 2018

Thanks for confirming, @mkcp!

@manishrjain manishrjain commented Aug 31, 2018

If this was not already fixed, commit 8779066 fixed it.
