Cluster unresponsive following node crashes #2312

Closed
aphyr opened this Issue Apr 6, 2018 · 1 comment

aphyr commented Apr 6, 2018

On recent nightlies (up to 2018-04-06), I've seen dgraph sporadically lock up when testing with a combination of randomized alpha and zero process crashes & restarts. This can leave the cluster in a state where all queries throw DEADLINE_EXCEEDED or context deadline exceeded indefinitely.

Dgraph version   : v1.0.4-dev
Commit SHA-1     : 8333f725
Commit timestamp : 2018-04-06 13:20:06 +1000
Branch           : HEAD

dgraph-lockup.zip

In this case, all processes are running, and all ports are bound (at least, as per netstat) on every node. All nodes have uninterrupted network connectivity, and no network failures occurred during the test. However, alpha will complain:

2018-04-06 10:14:39 Jepsen starting dgraph :server :--lru_mb 1024 :--idx 1 :--my n1:7080 :--zero n1:5080
...
2018/04/06 12:04:38 groups.go:701: WARNING: We don't have address of any dgraphzero server.
2018/04/06 12:04:39 groups.go:701: WARNING: We don't have address of any dgraphzero server.
2018/04/06 12:04:40 Error while retrieving timestamps: No connection exists. Will retry...
2018/04/06 12:04:40 groups.go:701: WARNING: We don't have address of any dgraphzero server.
2018/04/06 12:04:41 groups.go:701: WARNING: We don't have address of any dgraphzero server.

Note that the zero server we passed on startup was the local zero, which has bound its internal and public ports! However, zero logs:

2018/04/06 10:14:14 pool.go:158: Echo error from n2:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/04/06 10:14:14 pool.go:158: Echo error from n1:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/04/06 10:14:14 pool.go:158: Echo error from n4:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/04/06 10:14:14 pool.go:158: Echo error from n3:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/04/06 10:14:14 pool.go:158: Echo error from n5:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/04/06 10:14:14 pool.go:158: Echo error from n4:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
...
2018/04/06 10:14:24 oracle.go:373: No healthy connection found to leader of group 1

This is particularly odd because n1 is the local node! It's odd for n1 to complain about not being able to talk to itself, no? Makes me wonder if maybe a reconnect loop got stuck?

You can reproduce this with Jepsen 6e80c469ce6f6c361b6343b78964ee1bfab04fb8 by running

lein run test --package-url https://github.com/dgraph-io/dgraph/releases/download/nightly/dgraph-linux-amd64.tar.gz --time-limit 300 --concurrency 10 --nemesis kill-alpha,fix-alpha,kill-zero --test-count 20 --workload set

I don't have older builds to test against, but I strongly suspect this problem was introduced sometime after v1.0.4, and before 8333f72.

@janardhan1993 janardhan1993 self-assigned this Apr 9, 2018

@pawanrawal pawanrawal added the bug label Apr 16, 2018

@janardhan1993 janardhan1993 modified the milestone: Sprint-001 Apr 16, 2018

janardhan1993 commented Apr 16, 2018

Fixed in master now.
