New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All reads time out indefinitely, possibly due to stalled predicate migration? #2405

Open
aphyr opened this Issue May 22, 2018 · 0 comments

Comments

Projects
None yet
2 participants
@aphyr

aphyr commented May 22, 2018

On the build @manishrjain submitted for testing on Friday, May 18, 2018, single-record set tests can stall indefinitely during reading. Although the cluster is stable, all nodes are running, and the network is totally connected, and all test-initiated predicate migrations appear to have completed, every read request will time out. This condition can last for at least an hour.

For instance, see 20180522T125649.000-0500.zip, where in the middle of the final reads, transactions just... start timing out.

2018-05-22 13:04:41,203{GMT}	INFO	[jepsen worker 7] jepsen.dgraph.set: Forcing conflict by inserting 7517
2018-05-22 13:04:41,297{GMT}	INFO	[jepsen worker 6] jepsen.dgraph.set: Forcing conflict by inserting 5551
2018-05-22 13:04:41,297{GMT}	INFO	[jepsen worker 5] jepsen.dgraph.set: Forcing conflict by inserting 6854
2018-05-22 13:04:41,297{GMT}	INFO	[jepsen worker 0] jepsen.dgraph.set: Forcing conflict by inserting 7368
2018-05-22 13:04:46,198{GMT}	INFO	[jepsen worker 4] jepsen.util: 4	:info	:read	nil	:timeout
2018-05-22 13:04:46,202{GMT}	INFO	[jepsen worker 9] jepsen.util: 9	:info	:read	nil	:timeout
2018-05-22 13:04:46,230{GMT}	INFO	[jepsen worker 7] jepsen.util: 7	:info	:read	nil	:timeout
2018-05-22 13:04:46,234{GMT}	INFO	[jepsen worker 2] jepsen.util: 2	:info	:read	nil	:timeout
2018-05-22 13:04:46,271{GMT}	INFO	[jepsen worker 1] jepsen.util: 1	:info	:read	nil	:timeout

Although each process goes on to retry the read process, no subsequent query ever returns:

8-05-22 13:52:27,172{GMT}	INFO	[jepsen worker 9] jepsen.util: 839	:invoke	:read	nil
2018-05-22 13:52:28,607{GMT}	INFO	[jepsen worker 8] jepsen.util: 798	:info	:read	nil	:timeout
2018-05-22 13:52:32,179{GMT}	INFO	[jepsen worker 9] jepsen.util: 839	:info	:read	nil	:timeout
2018-05-22 13:52:38,642{GMT}	INFO	[jepsen worker 2] jepsen.util: 892	:invoke	:read	nil
2018-05-22 13:52:40,683{GMT}	INFO	[jepsen worker 3] jepsen.util: 793	:invoke	:read	nil
2018-05-22 13:52:43,650{GMT}	INFO	[jepsen worker 2] jepsen.util: 892	:info	:read	nil	:timeout
2018-05-22 13:52:45,689{GMT}	INFO	[jepsen worker 3] jepsen.util: 793	:info	:read	nil	:timeout

About ten seconds after operations start timing out, Zero on n1 logs:

2018/05/22 11:04:57 tablet.go:179: 

Groups sorted by size: [{gid:1 size:17} {gid:2 size:2235}]

2018/05/22 11:04:57 tablet.go:184: size_diff 2218
2018/05/22 11:04:57 tablet.go:73: Going to move predicate: [value], size: [1.1 kB] from group 2 to 1
2018/05/22 11:24:57 tablet.go:221: Got error during move: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/05/22 11:24:57 tablet.go:60: Error while trying to move predicate value from 2 to 1: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded

2018/05/22 11:24:57 tablet.go:179: 

Groups sorted by size: [{gid:1 size:17} {gid:2 size:2235}]

2018/05/22 11:24:57 tablet.go:184: size_diff 2218
2018/05/22 11:24:57 tablet.go:73: Going to move predicate: [value], size: [1.1 kB] from group 2 to 1
2018/05/22 11:44:57 tablet.go:221: Got error during move: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/05/22 11:44:57 tablet.go:60: Error while trying to move predicate value from 2 to 1: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded

2018/05/22 11:44:57 tablet.go:179: 

Groups sorted by size: [{gid:1 size:17} {gid:2 size:2235}]

2018/05/22 11:44:57 tablet.go:184: size_diff 2218
2018/05/22 11:44:57 tablet.go:73: Going to move predicate: [value], size: [1.1 kB] from group 2 to 1

And n4 logs corresponding predicate moves:

2018/05/22 11:04:57 predicate_move.go:301: Move predicate request for pred: [value], src: [2], dst: [1]
2018/05/22 11:04:57 predicate_move.go:311: Trying to abort pending mutations. Loop: 0
2018/05/22 11:05:11 draft.go:825: Writing snapshot at index: 13819, applied mark: 26530
2018/05/22 11:24:57 predicate_move.go:301: Move predicate request for pred: [value], src: [2], dst: [1]
2018/05/22 11:24:57 predicate_move.go:311: Trying to abort pending mutations. Loop: 0
2018/05/22 11:44:57 predicate_move.go:301: Move predicate request for pred: [value], src: [2], dst: [1]
2018/05/22 11:44:57 predicate_move.go:311: Trying to abort pending mutations. Loop: 0

No other node logs anything after 11:04. Is it possible that an automatic predicate migration started, then got stuck somehow? The 20-minute intervals between migrations suggest that it's retrying the migration, at least, but timing out every request looks... odd.

You can reproduce this with Jepsen b0b458d32e43c072f257b75ea786431ea0d0c7a5 by running:

lein run test --local-binary ~/tmp/dgraph --force-download --nemesis move-tablet --sequencing client --upsert-schema --time-limit 300 --concurrency 2n --nemesis-interval 5 --workload uid-set --retry-db-setup --test-count 10

@aphyr aphyr changed the title from All reads time out to All reads time out indefinitely, possibly due to stalled predicate migration May 22, 2018

@aphyr aphyr changed the title from All reads time out indefinitely, possibly due to stalled predicate migration to All reads time out indefinitely, possibly due to stalled predicate migration? May 22, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment