Cluster convergence can fail when nodes in a group start at the same time. #2144

Closed
pawanrawal opened this Issue Feb 19, 2018 · 4 comments

pawanrawal commented Feb 19, 2018

Say we have --replicas set to 3 and we start three Dgraph nodes at the same time. It is possible that Zero adds all of them to the group 1 state before any leader has been elected. Since each of these nodes then sees a peer, they all try to get the snapshot from a leader, but no leader has been elected and none ever will be.

Similar logs can be observed on all three servers.

2018-02-19 05:20:59 Jepsen starting dgraph :server :--memory_mb 1024 :--idx 1 :--my n1:7080 :--zero n1:5080
2018/02/19 05:21:00 gRPC server started.  Listening on port 9080
2018/02/19 05:21:00 HTTP server started.  Listening on port 8080
2018/02/19 05:21:00 worker.go:99: Worker listening at address: [::]:7080
2018/02/19 05:21:00 groups.go:86: Current Raft Id: 1
2018/02/19 05:21:00 pool.go:118: == CONNECT ==> Setting n1:5080
2018/02/19 05:21:00 groups.go:109: Connected to group zero. Assigned group: 1
2018/02/19 05:21:00 pool.go:118: == CONNECT ==> Setting n3:7080
2018/02/19 05:21:00 pool.go:118: == CONNECT ==> Setting n4:7080
2018/02/19 05:21:00 pool.go:118: == CONNECT ==> Setting n2:5080
2018/02/19 05:21:00 pool.go:118: == CONNECT ==> Setting n4:5080
2018/02/19 05:21:00 pool.go:118: == CONNECT ==> Setting n3:5080
2018/02/19 05:21:00 draft.go:139: Node ID: 1 with GroupID: 1
2018/02/19 05:21:00 node.go:258: Group 1 found 0 entries
2018/02/19 05:21:00 draft.go:679: New Node for group: 1
2018/02/19 05:21:00 draft.go:684: Retrieving snapshot.
2018/02/19 05:21:00 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:01 pool.go:118: == CONNECT ==> Setting n2:7080
2018/02/19 05:21:01 pool.go:118: == CONNECT ==> Setting n5:5080
2018/02/19 05:21:01 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:02 pool.go:118: == CONNECT ==> Setting n5:7080
2018/02/19 05:21:02 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:03 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:04 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:05 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:06 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:07 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:08 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...
2018/02/19 05:21:09 draft.go:655: Error while calling fn: Unable to reach leader in group 1. Retrying...

A possible solution: Zero shouldn't add a node to a group until a leader has been elected for that group.
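
A minimal sketch of that gating idea, for illustration only. `groupState` and `tryAdd` are hypothetical names, not Dgraph's actual types, and the sketch assumes the first node is admitted into an empty group so that it can bootstrap and become leader:

```go
package main

import "fmt"

// groupState is a hypothetical view of one group inside Zero.
type groupState struct {
	members []int // node IDs admitted so far
	leader  int   // 0 means no leader has been elected yet
}

// tryAdd admits id unless the group already has members but no leader.
// Assumption: an empty group accepts its first node so it can bootstrap.
func (g *groupState) tryAdd(id int) bool {
	if len(g.members) > 0 && g.leader == 0 {
		return false // caller is expected to retry later
	}
	g.members = append(g.members, id)
	return true
}

func main() {
	g := &groupState{}
	fmt.Println("node 1 admitted:", g.tryAdd(1))
	fmt.Println("node 2 admitted before election:", g.tryAdd(2))
	g.leader = 1 // node 1 wins the election
	fmt.Println("node 2 admitted after election:", g.tryAdd(2))
}
```

With this rule a rejected node has to retry, so at most one node per group is ever waiting without a leader to talk to.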

@pawanrawal pawanrawal added the bug label Feb 19, 2018

@pawanrawal pawanrawal self-assigned this Feb 19, 2018

manishrjain commented Feb 19, 2018

The leader would eventually be elected (which should be fast). So this is a temporary error?

pawanrawal commented Feb 19, 2018

No, it's never elected. Imagine a situation where all alpha nodes see some peer while starting up, because Zero added all of them to group 1. They all try to retrieve a snapshot from the leader of group 1, and that call is blocking. None of them will elect a leader until the snapshot retrieval completes, and it never completes because there is no leader.
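
To make the circular wait concrete, here is a toy model of the startup rule described above. It is not Dgraph's code: `node` and `electsLeader` are made-up names, and it assumes a node that sees no peer skips the snapshot step and is free to campaign.

```go
package main

import "fmt"

// node is a toy model of an alpha at startup, not Dgraph's actual type.
type node struct {
	id       int
	seesPeer bool // true if Zero reported at least one peer in the group
}

// electsLeader reports whether the group can ever elect a leader.
// A node that sees a peer blocks on retrieving a snapshot from the leader
// and takes part in an election only afterwards. A node that sees no peer
// is assumed to bootstrap the group and campaign right away.
func electsLeader(nodes []node) bool {
	for _, n := range nodes {
		if !n.seesPeer {
			return true // someone is free to campaign
		}
	}
	// Every node waits for a snapshot from a leader that does not exist.
	return false
}

func main() {
	simultaneous := []node{
		{id: 1, seesPeer: true},
		{id: 2, seesPeer: true},
		{id: 3, seesPeer: true},
	}
	staggered := []node{
		{id: 1, seesPeer: false},
		{id: 2, seesPeer: true},
		{id: 3, seesPeer: true},
	}
	fmt.Println("simultaneous start elects leader:", electsLeader(simultaneous))
	fmt.Println("staggered start elects leader:", electsLeader(staggered))
}
```

The simultaneous case returns false no matter how long the nodes retry, which matches the repeated "Unable to reach leader" lines in the logs.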

pawanrawal commented Feb 19, 2018

As discussed, Zero needs to ensure that it's handling one request at a time when adding peers.

manishrjain commented Mar 9, 2018

Fixed by @janardhan1993.

@manishrjain manishrjain closed this Mar 9, 2018

@manishrjain manishrjain added the bug label Mar 21, 2018
