When an Alpha node starts up without its associated Zero, Alpha tries to connect to Zero several times, then panics with a message like:
2018/03/30 13:32:28 pool.go:168: Echo error from n1:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.11:5080: getsockopt: connection refused"
2018/03/30 13:32:33 Unable to join cluster via dgraphzero
This makes it hard to keep a Dgraph cluster running reliably, because Alpha servers can't be trusted to stay up. Even if a watchdog like monit or systemd watches Alpha, it may conclude after several repeated crashes that the service is truly broken and should not be restarted. May I suggest having Alpha retry forever, instead of panicking?
I think this might also make it possible to start Zero and Alpha concurrently, which would simplify cluster startup. :)