On the build @manishrjain submitted for testing on Friday, May 18, 2018, single-record set tests can stall indefinitely during reads. Although the cluster is stable, all nodes are running, the network is fully connected, and all test-initiated predicate migrations appear to have completed, every read request times out. This condition can last for at least an hour.
For instance, see 20180522T125649.000-0500.zip, where in the middle of the final reads, transactions just... start timing out.
2018-05-22 13:04:41,203{GMT} INFO [jepsen worker 7] jepsen.dgraph.set: Forcing conflict by inserting 7517
2018-05-22 13:04:41,297{GMT} INFO [jepsen worker 6] jepsen.dgraph.set: Forcing conflict by inserting 5551
2018-05-22 13:04:41,297{GMT} INFO [jepsen worker 5] jepsen.dgraph.set: Forcing conflict by inserting 6854
2018-05-22 13:04:41,297{GMT} INFO [jepsen worker 0] jepsen.dgraph.set: Forcing conflict by inserting 7368
2018-05-22 13:04:46,198{GMT} INFO [jepsen worker 4] jepsen.util: 4 :info :read nil :timeout
2018-05-22 13:04:46,202{GMT} INFO [jepsen worker 9] jepsen.util: 9 :info :read nil :timeout
2018-05-22 13:04:46,230{GMT} INFO [jepsen worker 7] jepsen.util: 7 :info :read nil :timeout
2018-05-22 13:04:46,234{GMT} INFO [jepsen worker 2] jepsen.util: 2 :info :read nil :timeout
2018-05-22 13:04:46,271{GMT} INFO [jepsen worker 1] jepsen.util: 1 :info :read nil :timeout
Although each process goes on to retry the read, no subsequent query ever returns:
About ten seconds after operations start timing out, Zero on n1 logs:
And n4 logs corresponding predicate moves:
No other node logs anything after 11:04. Is it possible that an automatic predicate migration started, then got stuck somehow? The 20-minute intervals between migrations suggest that it is at least retrying the migration, but timing out every request looks... odd.
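One way to check whether a predicate move is actually stuck, rather than merely slow, would be to watch which group Zero thinks serves each predicate while the reads are timing out. A rough sketch in Clojure, assuming Zero's default HTTP port of 6080 and the test's node names (the endpoint and port are assumptions on my part, not something shown in these logs):

(defn zero-state
  "Fetch Dgraph Zero's cluster-state JSON, which lists group membership and
  tablet (predicate) assignments. Assumes Zero's HTTP port is the default 6080."
  [host]
  (slurp (str "http://" host ":6080/state")))

;; e.g. poll n1's Zero once a minute and eyeball the tablet assignments:
(dotimes [_ 5]
  (println (zero-state "n1"))
  (Thread/sleep 60000))

If a predicate's assignment never settles across successive polls, that would support the stalled-migration theory.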
You can reproduce this with Jepsen b0b458d32e43c072f257b75ea786431ea0d0c7a5 by running:
The reads no longer time out, but the test never finishes. It keeps spitting out:
2019-02-01 23:22:16,808{GMT} INFO [jepsen worker 1] jepsen.dgraph.set: Forcing conflict by deleting ...
It looks like these conflicts never actually happen, and the workers just keep looping over them forever. This might be logic based on an old Dgraph behavior that no longer occurs.
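A minimal sketch of the kind of retry loop that would produce this, assuming the workload only stops once a delete is aborted by a transaction conflict (illustrative only; this is not the actual jepsen.dgraph.set code, and delete! is a hypothetical mutation function):

(defn force-conflict!
  "Hypothetical retry loop: keep deleting value until the transaction is
  aborted by a conflict. If the server never reports a conflict for these
  deletes, the exit condition is never met and the loop runs forever,
  printing 'Forcing conflict by deleting ...' on every pass."
  [delete! value]
  (loop []
    (println "Forcing conflict by deleting" value)
    (let [conflicted? (try (delete! value)
                           false                      ; delete applied cleanly
                           (catch Exception _ true))] ; txn aborted by conflict
      (when-not conflicted?
        (recur)))))

If Dgraph used to abort these deletes with a conflict but no longer does, a loop shaped like this would spin indefinitely, matching the repeated log line above.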