Node locks up after partitions #2273

Closed
aphyr opened this Issue Mar 28, 2018 · 1 comment

aphyr commented Mar 28, 2018

While trying to fix a bug in the tests for #2152, I managed to lock up dgraph into a state where every request to one node (n1) timed out, even though the process was still running on all nodes. Here are the complete logs and data files from all five nodes.

dgraph-n1-lockup.zip

This occurs on

Dgraph version : v1.0.4
Commit SHA-1 : 807976c
Commit timestamp : 2018-03-22 14:55:24 +1100
Branch : master

and involves a series of network partitions; it looks as if the final partition healing might have left n1 in a state where it believed the leader was... possibly a node which was not the leader?

2018/03/27 12:22:51 node.go:344: Error while sending message to node with addr: n2:7080, err: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/03/27 12:22:51 groups.go:702: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/03/27 12:22:51 Error while retrieving timestamps: rpc error: code = Unknown desc = Assigning IDs is only allowed on leader.. Will retry...
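
For anyone digging through the attached logs, here is a minimal Python sketch that counts the three error messages quoted above and records when each was first and last seen in a log file. It assumes every relevant line starts with a `YYYY/MM/DD HH:MM:SS` timestamp, as in the excerpt. The category names and the `summarize` helper are made up for illustration and are not part of Dgraph or Jepsen. The layout of the zip file is not described here, so log paths are passed on the command line.

```python
import re
import sys
from collections import Counter

# Substrings copied from the log excerpt above.
# The category names on the left are hypothetical labels, not Dgraph terms.
PATTERNS = {
    "deadline_exceeded": "context deadline exceeded",
    "not_leader": "Node is no longer leader",
    "ids_need_leader": "Assigning IDs is only allowed on leader",
}

# Assumes each line starts with a "YYYY/MM/DD HH:MM:SS" timestamp.
TIMESTAMP = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def summarize(lines):
    """Count matching errors and track the (first, last) timestamp of each."""
    counts = Counter()
    spans = {}
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip lines without the expected timestamp prefix
        ts = match.group(1)
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[name] += 1
                first, _ = spans.get(name, (ts, ts))
                spans[name] = (first, ts)
    return counts, spans


if __name__ == "__main__":
    # Usage: python summarize_logs.py <log file> [<log file> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            counts, spans = summarize(handle)
        print(path, dict(counts), spans)
```

Running it over each node's log and comparing the last "not_leader" timestamp against the time the final partition healed might help show whether n1 kept talking to a stale leader afterwards.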

To reproduce this, run with Jepsen e31b29d1a5302766c2c83454eeed9124ef9820f5:

lein run test --package-url https://transfer.sh/TjHBo/dgraph-linux-amd64.tar.gz --force-download -w set --time-limit 300 --concurrency 2n --nemesis partition-random-halves

This appears to be a semi-rare fault; I've only seen it once so far.

Contributor

janardhan1993 commented Apr 2, 2018

I could also reproduce this. Happens rarely.
