Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.Sign up
All reads time out indefinitely, possibly due to stalled predicate migration? #2405
On the build @manishrjain submitted for testing on Friday, May 18, 2018, single-record set tests can stall indefinitely during reading. Although the cluster is stable, all nodes are running, and the network is totally connected, and all test-initiated predicate migrations appear to have completed, every read request will time out. This condition can last for at least an hour.
For instance, see 20180522T125649.000-0500.zip, where in the middle of the final reads, transactions just... start timing out.
Although each process goes on to retry the read process, no subsequent query ever returns:
About ten seconds after operations start timing out, Zero on n1 logs:
And n4 logs corresponding predicate moves:
No other node logs anything after 11:04. Is it possible that an automatic predicate migration started, then got stuck somehow? The 20-minute intervals between migrations suggest that it's retrying the migration, at least, but timing out every request looks... odd.
You can reproduce this with Jepsen b0b458d32e43c072f257b75ea786431ea0d0c7a5 by running:
The reads no longer time out. But, the test never finishes. It keeps on spitting:
Looks like these conflicts don't happen, and the workers just keep looping over them forever. Might be a logic based on an old behavior of Dgraph which no longer happens.