Cluster unresponsive following node crashes #2312
On recent nightlies (up to 2018-04-06), I've seen dgraph sporadically lock up when testing with a combination of randomized alpha and zero process crashes & restarts. This can leave the cluster in a state where all queries throw
In this case, all processes are running, and all ports are bound (at least, as per
Note that the zero server we passed on startup was the local zero, which has bound its internal and public ports! However, zero logs:
This is particularly odd because n1 is the local node! It's odd for n1 to complain about not being able to talk to itself, no? It makes me wonder whether a reconnect loop got stuck.
You can reproduce this with Jepsen 6e80c469ce6f6c361b6343b78964ee1bfab04fb8 by running
I don't have older builds to test against, but I strongly suspect this problem was introduced sometime after v1.0.4, and before 8333f72.