Predicate moves can take down Dgraph indefinitely #2397

Closed
aphyr opened this Issue May 17, 2018 · 5 comments

aphyr commented May 17, 2018

In the build @manishrjain provided for testing on 2018-05-16 (sorry, don't have a SHA!), with predicate moves roughly every 10 seconds, Dgraph went down: 20180516T194304.000-0500.zip

[latency-raw plot attached]

In particular, it looks like a predicate move caused a single node's Alpha to crash with the following message:

2018/05/16 17:47:21 predicate_move.go:300: Move predicate request for pred: [amount_0], src: [2], dst: [1]
2018/05/16 17:47:21 predicate_move.go:308: Trying to abort pending mutations. Loop: 0
2018/05/16 17:47:21 predicate_move.go:176: Sent 2 keys for predicate amount_0
2018/05/16 17:47:21 predicate_move.go:244: Got ReceivePredicate. Group: 2. Am leader: true
2018/05/16 17:47:21 predicate_move.go:193: Receiving predicate. Batching and proposing key values
2018/05/16 17:47:21 predicate_move.go:213: Predicate being received: type_0
2018/05/16 17:47:21 index.go:934: Dropping predicate: [type_0]
2018/05/16 17:47:21 schema.go:63: Deleting schema for predicate: [type_0]
2018/05/16 17:47:22 predicate_move.go:43: Writing 3 keys
2018/05/16 17:47:22 schema.go:252: Setting schema for attr type_0: string, tokenizer: [exact], directive: INDEX, count: false
2018/05/16 17:47:22 predicate_move.go:278: Proposed 3 keys. Error: <nil>
2018/05/16 17:47:22 predicate_move.go:300: Move predicate request for pred: [type_1], src: [2], dst: [1]
2018/05/16 17:47:22 predicate_move.go:308: Trying to abort pending mutations. Loop: 0
2018/05/16 17:47:22 predicate_move.go:308: Trying to abort pending mutations. Loop: 1
2018/05/16 17:47:22 unexpected meta: 0
github.com/dgraph-io/dgraph/x.Fatalf
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/x/error.go:100
github.com/dgraph-io/dgraph/posting.ReadPostingList
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/posting/mvcc.go:424
github.com/dgraph-io/dgraph/worker.movePredicateHelper
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/worker/predicate_move.go:124
github.com/dgraph-io/dgraph/worker.(*grpcWorker).MovePredicate
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/worker/predicate_move.go:321
github.com/dgraph-io/dgraph/protos/intern._Worker_MovePredicate_Handler
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/protos/intern/internal.pb.go:2772
google.golang.org/grpc.(*Server).processUnaryRPC
	/home/mrjn/go/src/google.golang.org/grpc/server.go:1011
google.golang.org/grpc.(*Server).handleStream
	/home/mrjn/go/src/google.golang.org/grpc/server.go:1249
google.golang.org/grpc.(*Server).serveStreams.func1.1
	/home/mrjn/go/src/google.golang.org/grpc/server.go:680
runtime.goexit
	/usr/lib/go-1.10/src/runtime/asm_amd64.s:2361
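
For readers unfamiliar with the code path, the fatal call originates in posting.ReadPostingList (mvcc.go:424), which evidently encounters a key whose meta byte is zero and treats it as unrecoverable; x.Fatalf aborts the process, so a single unexpected key seen during a predicate move takes the whole Alpha down. A minimal, self-contained Go sketch of that pattern (the constants and decode paths below are illustrative assumptions, not Dgraph's actual code):

package main

import "log"

// Illustrative meta-byte values; the real constants live in Dgraph's posting package.
const (
	bitDeltaPosting    byte = 0x04
	bitCompletePosting byte = 0x08
	bitEmptyPosting    byte = 0x10
)

// readPostingList mimics the shape of the check in the stack trace: any meta
// byte outside the known set is treated as fatal for the entire process.
func readPostingList(meta byte) {
	switch meta {
	case bitDeltaPosting, bitCompletePosting, bitEmptyPosting:
		// Normal decode paths would go here.
	default:
		// x.Fatalf behaves like log.Fatalf: it logs and exits, so the Alpha dies here.
		log.Fatalf("unexpected meta: %d", meta)
	}
}

func main() {
	readPostingList(0) // a zero meta byte, as in the crash above
}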

Oddly, the cluster was never able to recover from this state, even though five Zero nodes and four Alpha nodes were running and connected. All Alpha nodes logged

2018/05/16 17:47:24 oracle.go:425: Error while fetching minTs from group 2, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"
2018/05/16 17:47:28 pool.go:158: Echo error from n3:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"

... indefinitely, and Zero nodes logged the same thing:

2018/05/16 17:47:24 oracle.go:425: Error while fetching minTs from group 2, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"
2018/05/16 17:47:28 pool.go:158: Echo error from n3:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"

Every request either timed out (on n3) or returned UNAVAILABLE. Attempts to perform further predicate moves timed out as well, though Zero would still answer /state queries:

{:counter "1079",
 :groups
 {1
  {:members
   {1 {:id "1", :groupId 1, :addr "n1:7080"},
    2
    {:id "2",
     :groupId 1,
     :addr "n2:7080",
     :leader true,
     :lastUpdate "1526517895"},
    4 {:id "4", :groupId 1, :addr "n4:7080"}},
   :tablets
   {:amount_0 {:groupId 1, :predicate "amount_0", :force true},
    :key_2 {:groupId 1, :predicate "key_2", :force true},
    :type_2 {:groupId 1, :predicate "type_2", :force true}}},
  2
  {:members
   {3
    {:id "3",
     :groupId 2,
     :addr "n3:7080",
     :leader true,
     :lastUpdate "1526517898"},
    5 {:id "5", :groupId 2, :addr "n5:7080"}},
   :tablets
   {:_predicate_ {:groupId 2, :predicate "_predicate_", :force true},
    :amount_1 {:groupId 2, :predicate "amount_1", :force true},
    :amount_2 {:groupId 2, :predicate "amount_2", :force true},
    :key_0 {:groupId 2, :predicate "key_0", :force true},
    :key_1 {:groupId 2, :predicate "key_1", :force true},
    :type_0 {:groupId 2, :predicate "type_0", :force true},
    :type_1 {:groupId 2, :predicate "type_1", :force true}}}},
 :zeros
 {1 {:id "1", :addr "n1:5080"},
  2 {:id "2", :addr "n2:5080"},
  3 {:id "3", :addr "n3:5080", :leader true},
  4 {:id "4", :addr "n4:5080"},
  5 {:id "5", :addr "n5:5080"}},
 :maxLeaseId "20000",
 :maxTxnTs "20000"}
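
For reference, the dump above appears to be the JSON that Zero serves on its /state HTTP endpoint, re-printed as EDN by the test harness. Something like the following fetches the raw JSON directly; the host and HTTP port are assumptions for this cluster, not values taken from the logs:

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumed Zero HTTP address; adjust the host and port to match the cluster.
	resp, err := http.Get("http://n1:6080/state")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}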

manishrjain added the bug label on May 17, 2018

manishrjain self-assigned this on May 17, 2018

manishrjain commented May 17, 2018

Any suggestions on how quickly it can be reproduced? And with which test?

aphyr commented May 17, 2018

I haven't figured out a fast path to reproduce this yet, but the original case was:

lein run test --local-binary ~/tmp/dgraph --force-download --nemesis move-tablet --sequencing server --upsert-schema --time-limit 300 --concurrency 1n --nemesis-interval 10 --workload bank --retry-db-setup --test-count 50

And it took about 4 hours to appear. My guess is that the bank test is more likely to be interesting, since there are more predicates. I'm also going to explore reducing the time between nemeses and running longer tests, so we spend more time migrating.

manishrjain commented May 17, 2018

Okay. I'll try running it for a long stretch at my end as well.

manishrjain added a commit that referenced this issue May 17, 2018

manishrjain commented May 17, 2018

Ah... this is a stupid one. I could see this in Dgraph's own integration tests. I've pushed a potential fix to master: 7f21b82

This could be killing off many Dgraph Alphas, which is why the entire cluster is failing. I'm testing the fix right now.

manishrjain commented May 18, 2018

Commit 7205275 also helps fix this one. I ran it on my computer with the command above; all tests passed.

Closing this. I'll send you a binary to test at your end as well.
