
Possible deadlock in JoinCluster #2286

Closed
aphyr opened this Issue Mar 30, 2018 · 6 comments


aphyr commented Mar 30, 2018

On recent builds, I've seen new clusters get stuck during concurrent join, where one or two nodes refuse to service requests. It looks like they might get stuck in JoinCluster for 20+ seconds, after which my test gives up. This is on

Dgraph version   : v1.0.4
Commit SHA-1     : 224b560
Commit timestamp : 2018-03-27 16:56:17 +0530
Branch           : fix/jepsen_delete

and can be reproduced with jepsen 6b87441dbd1db2aca039226690f74b50ac78ef15 by running any test, e.g.

lein run test --package-url https://transfer.sh/10WC2g/dgraph-linux-amd64.tar.gz --force-download -w sequential --time-limit 300 --nemesis partition-random-halves --concurrency 10

@manishrjain manishrjain added the bug label Mar 30, 2018

@manishrjain manishrjain added this to the Sprint-001 milestone Mar 30, 2018

@janardhan1993 (Contributor) commented Apr 2, 2018

This bug was introduced after removing the timeout in the joinCluster call (807976c).

I can see the proposal being dropped/ignored silently.

When you start three nodes and two followers call joinCluster concurrently, one of the EntryConfChange proposals gets dropped, because etcd/raft by design allows only one pending EntryConfChange. So one of the nodes was stuck waiting for joinCluster to finish.
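The single-pending-conf-change behavior described above can be illustrated with a simplified model. This is not the real etcd/raft API (the `raftModel` type and its methods are invented for illustration); it only mimics the observed behavior: a second EntryConfChange proposed before the first is applied gets silently ignored.

```go
// Simplified model of why concurrent conf-change proposals are lossy:
// raft tracks at most one pending, unapplied EntryConfChange and
// silently ignores any proposal that arrives before it is applied.
package main

import "fmt"

type raftModel struct {
	pendingConf bool // true while an EntryConfChange is proposed but unapplied
}

// proposeConfChange mirrors the drop: it returns false (ignored) if a
// conf change is already in flight.
func (r *raftModel) proposeConfChange(node string) bool {
	if r.pendingConf {
		fmt.Printf("proposal from %s ignored: pending unapplied configuration\n", node)
		return false
	}
	r.pendingConf = true
	return true
}

// applyConfChange clears the pending flag, allowing the next proposal.
func (r *raftModel) applyConfChange() { r.pendingConf = false }

func main() {
	r := &raftModel{}
	// Two followers join concurrently: both propose before either applies.
	fmt.Println("dg2 accepted:", r.proposeConfChange("dg2"))
	fmt.Println("dg3 accepted:", r.proposeConfChange("dg3")) // dropped
	r.applyConfChange()
	// Once the first change applies, a retry of dg3's join succeeds.
	fmt.Println("dg3 retry accepted:", r.proposeConfChange("dg3"))
}
```

In the real system the caller never sees the drop, which is exactly why JoinCluster could block forever waiting for a proposal that raft had already discarded.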

@manishrjain (Member) commented Apr 2, 2018

Does it return an error when it drops the proposal?

@janardhan1993 (Contributor) commented Apr 2, 2018

It pushes to a channel and returns; the drop happens later, asynchronously. We can probably serialise JoinCluster calls.

@manishrjain manishrjain removed the size: 8 label Apr 3, 2018

@manishrjain (Member) commented Apr 3, 2018

We should talk to the etcd/raft maintainers before we make any changes here. There has to be a better solution here than another quick fix.

@manishrjain (Member) commented Jun 22, 2018

bank-dg3 | 2018/06/22 04:01:19 raft.go:815: INFO: propose conf Type:EntryConfChange Data:"\010\235\300\230\310\274\340\271\240\024\020\000\030\003\"\025\t\003\000\000\000\000\000\000\000\020\001\032\010dg1:7180"  ignored since pending unapplied configuration
bank-dg3 | 2018/06/22 04:01:19 node.go:116: Setting conf state to nodes:1 nodes:2 
bank-dg2 | 2018/06/22 04:01:19 groups.go:494: Got address of a Zero server: zero1:5080
bank-dg3 | 2018/06/22 04:01:19 node.go:301: Error while sending message to node with addr: dg2:7181, err: rpc error: code = Unknown desc = No node has been set up yet

This can cause a server to just never join the cluster.

@manishrjain (Member) commented Jul 1, 2018

INFO [2018-07-01 02:51:35,704] jepsen node n1 - jepsen.control.util starting dgraph
INFO [2018-07-01 02:51:45,911] jepsen node n4 - jepsen.control.util starting dgraph
INFO [2018-07-01 02:51:45,911] jepsen node n5 - jepsen.control.util starting dgraph
INFO [2018-07-01 02:51:45,911] jepsen node n3 - jepsen.control.util starting dgraph
INFO [2018-07-01 02:51:45,911] jepsen node n2 - jepsen.control.util starting dgraph
INFO [2018-07-01 02:51:46,341] jepsen node n3 - jepsen.control.util starting dgraph
INFO [2018-07-01 02:51:46,342] jepsen node n2 - jepsen.control.util starting dgraph
INFO [2018-07-01 02:51:46,344] jepsen node n5 - jepsen.control.util starting dgraph
INFO [2018-07-01 02:51:46,347] jepsen node n1 - jepsen.control.util starting dgraph
INFO [2018-07-01 02:51:46,370] jepsen node n4 - jepsen.control.util starting dgraph
INFO [2018-07-01 02:51:46,656] jepsen node n1 - jepsen.dgraph.support Waiting for cluster convergence
INFO [2018-07-01 02:51:47,668] jepsen node n1 - jepsen.dgraph.support Cluster converged

Cluster converges smoothly with #2467.

@manishrjain manishrjain closed this in #2467 Jul 1, 2018

manishrjain added a commit that referenced this issue Jul 1, 2018

Fix deadlock in 10-node cluster convergence (#2467)
This PR fixes #2286 .

- CheckQuorum was causing us multiple issues. When bootstrapping a 5-node Zero
  cluster, it would cause the leader to step down while the cluster size was
  still 2, blocking all the remaining joins indefinitely. It would also cause
  leader step-down in a seemingly healthy cluster that was processing
  proposals. CheckQuorum was mandated by raft.ReadOnlyLeaseBased, which is a
  less safe option for linearizable reads. Switch ReadOnlyOption back to
  raft.ReadOnlySafe. Moreover, we don't need to do quorum-based lin reads in
  the Alpha servers, because of the switch to proposing and then applying
  transaction updates.
- raft.ReadIndex is not working for some reason. So, commented out its usage in
  Zero (and removed it from Alpha permanently). Needs to be fixed when the
  following issue is resolved. etcd-io/etcd#9893
- The logic to do lin reads was replicated in both Zero and Alpha. Refactor that
  into one place in conn/node.go.
- Retry conf change proposals if they timeout. This mechanism is similar to the
  one introduced for normal proposals in a previous commit 06ea4c.
- Use a lock to only allow one JoinCluster call at a time. Block JoinCluster
  until node.AddToCluster is successful (or return the error).
- Set raft library to 3.2.23. Before upgrade, we were at 3.2.6.

Commit log:

* Trying to understand why JoinCluster doesn't work properly.
* Fucking works. Fucking works.
* It all works now.
* More Dgraph servers. Found a new issue where requesting read quorum doesn't respond.
* Refactor wait lin read code and move it to conn/node.go
* Remove lin read wait for server, because txn timestamp should be sufficient for waiting. Also, for the time being, comment out lin read wait from Zero as well.