Cluster convergence can fail when nodes in a group start at the same time. #2144

Closed
pawanrawal opened this issue Feb 19, 2018 · 4 comments

Contributor

@pawanrawal pawanrawal commented Feb 19, 2018

Say we set --replicas to 3 and start three Dgraph nodes at the same time. Zero may add all of them to the group 1 state before any leader has been elected. Because each node then sees a peer, they all try to get the snapshot from a leader, but no leader exists and none will ever be elected.

Similar logs can be observed on all three servers.

2018-02-19 05:20:59 Jepsen starting dgraph :server :--memory_mb 1024 :--idx 1 :--my n1:7080 :--zero n1:5080
2018/02/19 05:21:00 gRPC server started.  Listening on port 9080
2018/02/19 05:21:00 HTTP server started.  Listening on port 8080
2018/02/19 05:21:00 worker.go:99: Worker listening at address: [::]:7080
2018/02/19 05:21:00 groups.go:86: Current Raft Id: 1
2018/02/19 05:21:00 pool.go:118: == CONNECT ==> Setting n1:5080
2018/02/19 05:21:00 groups.go:109: Connected to group zero. Assigned group: 1
2018/02/19 05:21:00 pool.go:118: == CONNECT ==> Setting n3:7080
2018/02/19 05:21:00 pool.go:118: == CONNECT ==> Setting n4:7080
2018/02/19 05:21:00 pool.go:118: == CONNECT ==> Setting n2:5080
2018/02/19 05:21:00 pool.go:118: == CONNECT ==> Setting n4:5080
2018/02/19 05:21:00 pool.go:118: == CONNECT ==> Setting n3:5080
2018/02/19 05:21:00 draft.go:139: Node ID: 1 with GroupID: 1
2018/02/19 05:21:00 node.go:258: Group 1 found 0 entries
2018/02/19 05:21:00 draft.go:679: New Node for group: 1
2018/02/19 05:21:00 draft.go:684: Retrieving snapshot.
2018/02/19 05:21:00 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:01 pool.go:118: == CONNECT ==> Setting n2:7080
2018/02/19 05:21:01 pool.go:118: == CONNECT ==> Setting n5:5080
2018/02/19 05:21:01 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:02 pool.go:118: == CONNECT ==> Setting n5:7080
2018/02/19 05:21:02 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:03 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:04 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:05 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:06 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:07 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:08 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:09 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...

One solution would be for Zero to hold off on adding further nodes to a group until that group has elected a leader.
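
To make the proposal concrete, here is a minimal sketch of that gating rule. It is a toy model written in Python for brevity, not Dgraph's actual code (Dgraph is written in Go), and the names `GatedZero`, `connect`, and `report_leader` are hypothetical. It assumes the first node may join an empty group and bootstrap it, while later nodes are told to retry until a leader has been reported.

```python
class GatedZero:
    """Toy stand-in for Zero's group membership (hypothetical, not Dgraph's API)."""

    def __init__(self):
        self.members = []   # node ids admitted to the group so far
        self.leader = None  # set once the group reports a leader

    def connect(self, node_id):
        """Return True if the node is admitted, False if it should retry later."""
        if node_id in self.members:
            return True
        # The first node joins an empty group and can bootstrap it.
        # Any later node is deferred until a leader exists to serve snapshots.
        if self.members and self.leader is None:
            return False
        self.members.append(node_id)
        return True

    def report_leader(self, node_id):
        """Record that an already admitted node has become leader."""
        if node_id not in self.members:
            raise ValueError(f"node {node_id} is not a member")
        self.leader = node_id


if __name__ == "__main__":
    zero = GatedZero()
    print(zero.connect(1))  # first node joins the empty group
    print(zero.connect(2))  # deferred, since no leader has been reported yet
    zero.report_leader(1)
    print(zero.connect(2))  # admitted now that a leader exists
```

With this rule a node never joins a non-empty group that has no leader, so it never waits on a snapshot that nobody can serve.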

@pawanrawal pawanrawal self-assigned this Feb 19, 2018
Member

@manishrjain manishrjain commented Feb 19, 2018

The leader would eventually be elected (which should be fast). So, is this a temporary error?

Contributor Author

@pawanrawal pawanrawal commented Feb 19, 2018

No, it's never elected. Imagine a situation where every alpha node sees some peer while starting up because Zero added all of them to group 1. Each of them then tries to retrieve a snapshot from the leader of group 1, and that call blocks. None of them will elect a leader until the snapshot retrieval completes, and it never completes because there is no leader.
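
The circular wait can be shown with a small, deliberately simplified model. This is a Python sketch, not Dgraph code, and `startup_outcome` is a hypothetical helper. It assumes that a node which sees no peers at startup bootstraps the group and can become leader, while a node which sees any peer blocks on the snapshot call instead.

```python
def startup_outcome(peers_seen):
    """Model one startup round of a group.

    peers_seen maps node id -> number of peers in the group state that
    the node received from Zero when it joined.
    """
    # Assumption: only a node that saw an empty group is free to bootstrap
    # and become leader. Every other node blocks on the snapshot call.
    bootstrappers = sorted(n for n, peers in peers_seen.items() if peers == 0)
    if not bootstrappers:
        # Everyone waits on a leader that nobody is free to become.
        return {"leader": None, "blocked": sorted(peers_seen)}
    # Once a leader exists, the waiting nodes can fetch their snapshot.
    return {"leader": bootstrappers[0], "blocked": []}


if __name__ == "__main__":
    # All three nodes were added to the group before any of them started.
    print(startup_outcome({1: 2, 2: 2, 3: 2}))
    # Nodes were added one at a time, so node 1 saw an empty group.
    print(startup_outcome({1: 0, 2: 1, 3: 2}))
```

In the first call no node saw an empty group, so all three stay blocked, which matches the repeated "Unable to reach leader in group 1" lines in the logs.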

Contributor Author

@pawanrawal pawanrawal commented Feb 19, 2018

As discussed, Zero needs to ensure that it's handling only one request at a time when adding peers.
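
A minimal sketch of what one request at a time could look like, again as a Python toy model rather than Dgraph's implementation. `SerializedZero`, `connect`, and `start_nodes` are hypothetical names. The assumption is that reading the current membership and adding the new node happen under a single lock, so exactly one node sees an empty group even when all of them connect concurrently.

```python
import threading


class SerializedZero:
    """Toy membership service that handles one connect request at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._members = []

    def connect(self, node_id):
        """Add node_id and return the peers that were already in the group."""
        with self._lock:
            # Read and update happen atomically, so no two callers can
            # observe the same membership snapshot.
            peers = list(self._members)
            self._members.append(node_id)
            return peers


def start_nodes(zero, node_ids):
    """Connect all nodes concurrently and return what each one saw."""
    seen = {}

    def join(node_id):
        seen[node_id] = zero.connect(node_id)

    threads = [threading.Thread(target=join, args=(n,)) for n in node_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    for node_id, peers in sorted(start_nodes(SerializedZero(), [1, 2, 3]).items()):
        print(node_id, "saw peers", peers)
```

Which node sees the empty group depends on thread scheduling, but exactly one always does, and under the assumption above that node is free to bootstrap the group instead of waiting for a snapshot.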

Member

@manishrjain manishrjain commented Mar 9, 2018

Fixed by @janardhan1993.
