
Predicate moves can take down Dgraph indefinitely #2397

Closed
aphyr opened this issue May 17, 2018 · 5 comments
aphyr commented May 17, 2018

In the build @manishrjain provided for testing on 2018-05-16 (sorry, I don't have a SHA!), with predicate moves occurring roughly every 10 seconds, Dgraph went down: 20180516T194304.000-0500.zip

[attached plot: latency-raw]

In particular, it looks like a predicate move caused a single node's Alpha to crash with the following message:

2018/05/16 17:47:21 predicate_move.go:300: Move predicate request for pred: [amount_0], src: [2], dst: [1]
2018/05/16 17:47:21 predicate_move.go:308: Trying to abort pending mutations. Loop: 0
2018/05/16 17:47:21 predicate_move.go:176: Sent 2 keys for predicate amount_0
2018/05/16 17:47:21 predicate_move.go:244: Got ReceivePredicate. Group: 2. Am leader: true
2018/05/16 17:47:21 predicate_move.go:193: Receiving predicate. Batching and proposing key values
2018/05/16 17:47:21 predicate_move.go:213: Predicate being received: type_0
2018/05/16 17:47:21 index.go:934: Dropping predicate: [type_0]
2018/05/16 17:47:21 schema.go:63: Deleting schema for predicate: [type_0]
2018/05/16 17:47:22 predicate_move.go:43: Writing 3 keys
2018/05/16 17:47:22 schema.go:252: Setting schema for attr type_0: string, tokenizer: [exact], directive: INDEX, count: false
2018/05/16 17:47:22 predicate_move.go:278: Proposed 3 keys. Error: <nil>
2018/05/16 17:47:22 predicate_move.go:300: Move predicate request for pred: [type_1], src: [2], dst: [1]
2018/05/16 17:47:22 predicate_move.go:308: Trying to abort pending mutations. Loop: 0
2018/05/16 17:47:22 predicate_move.go:308: Trying to abort pending mutations. Loop: 1
2018/05/16 17:47:22 unexpected meta: 0
github.com/dgraph-io/dgraph/x.Fatalf
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/x/error.go:100
github.com/dgraph-io/dgraph/posting.ReadPostingList
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/posting/mvcc.go:424
github.com/dgraph-io/dgraph/worker.movePredicateHelper
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/worker/predicate_move.go:124
github.com/dgraph-io/dgraph/worker.(*grpcWorker).MovePredicate
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/worker/predicate_move.go:321
github.com/dgraph-io/dgraph/protos/intern._Worker_MovePredicate_Handler
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/protos/intern/internal.pb.go:2772
google.golang.org/grpc.(*Server).processUnaryRPC
	/home/mrjn/go/src/google.golang.org/grpc/server.go:1011
google.golang.org/grpc.(*Server).handleStream
	/home/mrjn/go/src/google.golang.org/grpc/server.go:1249
google.golang.org/grpc.(*Server).serveStreams.func1.1
	/home/mrjn/go/src/google.golang.org/grpc/server.go:680
runtime.goexit
	/usr/lib/go-1.10/src/runtime/asm_amd64.s:2361

Oddly, the cluster was never able to recover from this state, even though five Zero nodes and four Alpha nodes were running and connected. All Alpha nodes logged

2018/05/16 17:47:24 oracle.go:425: Error while fetching minTs from group 2, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"
2018/05/16 17:47:28 pool.go:158: Echo error from n3:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"

... indefinitely, and Zero nodes logged the same thing:

2018/05/16 17:47:24 oracle.go:425: Error while fetching minTs from group 2, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"
2018/05/16 17:47:28 pool.go:158: Echo error from n3:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"

Every request either timed out (on n3) or returned UNAVAILABLE. Attempts to perform further predicate moves timed out as well, though Zero would still answer /state queries:

{:counter "1079",
 :groups
 {1
  {:members
   {1 {:id "1", :groupId 1, :addr "n1:7080"},
    2
    {:id "2",
     :groupId 1,
     :addr "n2:7080",
     :leader true,
     :lastUpdate "1526517895"},
    4 {:id "4", :groupId 1, :addr "n4:7080"}},
   :tablets
   {:amount_0 {:groupId 1, :predicate "amount_0", :force true},
    :key_2 {:groupId 1, :predicate "key_2", :force true},
    :type_2 {:groupId 1, :predicate "type_2", :force true}}},
  2
  {:members
   {3
    {:id "3",
     :groupId 2,
     :addr "n3:7080",
     :leader true,
     :lastUpdate "1526517898"},
    5 {:id "5", :groupId 2, :addr "n5:7080"}},
   :tablets
   {:_predicate_ {:groupId 2, :predicate "_predicate_", :force true},
    :amount_1 {:groupId 2, :predicate "amount_1", :force true},
    :amount_2 {:groupId 2, :predicate "amount_2", :force true},
    :key_0 {:groupId 2, :predicate "key_0", :force true},
    :key_1 {:groupId 2, :predicate "key_1", :force true},
    :type_0 {:groupId 2, :predicate "type_0", :force true},
    :type_1 {:groupId 2, :predicate "type_1", :force true}}}},
 :zeros
 {1 {:id "1", :addr "n1:5080"},
  2 {:id "2", :addr "n2:5080"},
  3 {:id "3", :addr "n3:5080", :leader true},
  4 {:id "4", :addr "n4:5080"},
  5 {:id "5", :addr "n5:5080"}},
 :maxLeaseId "20000",
 :maxTxnTs "20000"}
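(The map above is Jepsen's Clojure rendering of Zero's /state response; Zero itself serves JSON over HTTP. As a minimal sketch of reading tablet placement out of that response — with the field names transcribed from the output above, the members trimmed, and any endpoint details assumed rather than confirmed — one could summarize which group serves which predicates like this:)

```python
import json

# A trimmed /state response, transcribed from the map above. This shape
# mirrors that output but is an assumption, not an exact API contract.
STATE = json.loads("""
{
  "counter": "1079",
  "groups": {
    "1": {"tablets": {
      "amount_0": {"groupId": 1, "predicate": "amount_0", "force": true},
      "key_2":    {"groupId": 1, "predicate": "key_2",    "force": true},
      "type_2":   {"groupId": 1, "predicate": "type_2",   "force": true}}},
    "2": {"tablets": {
      "_predicate_": {"groupId": 2, "predicate": "_predicate_", "force": true},
      "amount_1": {"groupId": 2, "predicate": "amount_1", "force": true},
      "amount_2": {"groupId": 2, "predicate": "amount_2", "force": true},
      "key_0":    {"groupId": 2, "predicate": "key_0",    "force": true},
      "key_1":    {"groupId": 2, "predicate": "key_1",    "force": true},
      "type_0":   {"groupId": 2, "predicate": "type_0",   "force": true},
      "type_1":   {"groupId": 2, "predicate": "type_1",   "force": true}}}
  }
}
""")

def tablets_by_group(state):
    """Map each group id to the sorted list of predicates it currently serves."""
    return {gid: sorted(g.get("tablets", {}))
            for gid, g in state.get("groups", {}).items()}

print(tablets_by_group(STATE))
```

This makes the stuck state easy to see at a glance: group 2 still owns type_0 and type_1 even though the log above shows the move for type_1 was already in flight when the Alpha crashed.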
manishrjain self-assigned this May 17, 2018
manishrjain (Member) commented May 17, 2018

Any suggestions on how quickly it can be reproduced? And with which test?

aphyr (Author) commented May 17, 2018

I haven't figured out a fast path to reproduce this yet, but the original case was

lein run test --local-binary ~/tmp/dgraph --force-download --nemesis move-tablet --sequencing server --upsert-schema --time-limit 300 --concurrency 1n --nemesis-interval 10 --workload bank --retry-db-setup --test-count 50

And it took about 4 hours to appear. My guess is that the bank test is more likely to trigger it, since it has more predicates. I'm also going to explore shorter intervals between nemeses and longer test runs, so we spend more time migrating.
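A parameter sweep along those lines could look like the following. This is only a sketch: the flags are copied from the command above, the interval and time-limit values are guesses, and the `echo` makes it a dry run (drop it to actually launch Jepsen).

```shell
# Sketch: sweep shorter nemesis intervals to spend more time migrating
# and (hopefully) hit the crash faster. Values are assumptions.
run_sweep() {
  for interval in 5 3 1; do
    # Prefixed with `echo` as a dry run; remove `echo` to run for real.
    echo lein run test --local-binary ~/tmp/dgraph --force-download \
      --nemesis move-tablet --sequencing server --upsert-schema \
      --time-limit 600 --concurrency 1n --nemesis-interval "$interval" \
      --workload bank --retry-db-setup --test-count 10
  done
}
run_sweep
```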

manishrjain (Member) commented May 17, 2018

Okay. I'll try running it for a long time at my end as well.

manishrjain added a commit that referenced this issue May 17, 2018
manishrjain (Member) commented May 17, 2018

Ah.. this is a stupid one. I could see it in Dgraph's own integration tests. I've pushed a potential fix to master: 7f21b82

This could be killing off many Dgraph Alphas, which is why the entire cluster is failing. Testing the fix right now.

manishrjain (Member) commented May 18, 2018

Commit 7205275 also helps fix this one. Ran it on my computer with the command above; all tests passed.

Closing this. Will send you a binary to test as well at your end.
