When an Alpha node starts up without its associated Zero, it tries to connect to Zero several times, then panics with a message like:
2018/03/30 13:32:28 pool.go:168: Echo error from n1:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.11:5080: getsockopt: connection refused"
2018/03/30 13:32:33 Unable to join cluster via dgraphzero
github.com/dgraph-io/dgraph/x.Fatalf
/home/janardhan/go/src/github.com/dgraph-io/dgraph/x/error.go:103
github.com/dgraph-io/dgraph/worker.StartRaftNodes
/home/janardhan/go/src/github.com/dgraph-io/dgraph/worker/groups.go:107
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:2337
This makes it hard to keep a Dgraph cluster running reliably, because Alpha servers can't be trusted to stay up. Even if a watchdog like monit or systemd supervises Alpha, it may conclude after several repeated crashes that the service is truly broken and stop restarting it. May I suggest having Alpha retry forever instead of panicking?
I think this might also make it possible to start Zero and Alpha concurrently, which would simplify cluster startup. :)