Alpha panics when started without its Zero node #2289

Closed
aphyr opened this Issue Mar 30, 2018 · 2 comments

aphyr commented Mar 30, 2018

On version

Dgraph version   : v1.0.4
Commit SHA-1     : 224b560
Commit timestamp : 2018-03-27 16:56:17 +0530
Branch           : fix/jepsen_delete

When starting up an Alpha node without its associated Zero, Alpha will try to connect to Zero several times, then panic with a message like

2018/03/30 13:32:28 pool.go:168: Echo error from n1:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.11:5080: getsockopt: connection refused"
2018/03/30 13:32:33 Unable to join cluster via dgraphzero
github.com/dgraph-io/dgraph/x.Fatalf
        /home/janardhan/go/src/github.com/dgraph-io/dgraph/x/error.go:103
github.com/dgraph-io/dgraph/worker.StartRaftNodes
        /home/janardhan/go/src/github.com/dgraph-io/dgraph/worker/groups.go:107
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:2337

This makes it hard to keep a Dgraph cluster running reliably, because Alpha servers can't be trusted to stay up. Even if a watchdog like monit or systemd supervises Alpha, it may conclude after several repeated crashes that the service is truly broken and stop restarting it. May I suggest having Alpha retry forever, instead of panicking?

I think this might also make it possible to start Zero and Alpha concurrently, which would simplify cluster startup. :)
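
To illustrate the suggestion, here is a minimal sketch in Go of retrying a connection indefinitely with capped exponential backoff instead of exiting. It is not Dgraph's actual code: `retryForever`, `nextBackoff`, `errZeroUnavailable`, and the `connect` callback are hypothetical names introduced for illustration, and the doubling backoff policy is an assumption rather than anything the thread specifies.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// errZeroUnavailable is a hypothetical stand-in for the
// "connection refused" error shown in the log above.
var errZeroUnavailable = errors.New("zero unavailable")

// nextBackoff doubles the current delay and caps it at max.
func nextBackoff(cur, max time.Duration) time.Duration {
	next := cur * 2
	if next > max {
		return max
	}
	return next
}

// retryForever calls connect until it succeeds or stop is closed.
// It returns the number of attempts made and whether connect succeeded.
// Passing a nil stop channel means it never gives up.
func retryForever(stop <-chan struct{}, connect func() error,
	initial, max time.Duration) (int, bool) {
	delay := initial
	for attempts := 1; ; attempts++ {
		err := connect()
		if err == nil {
			return attempts, true
		}
		log.Printf("attempt %d failed: %v; retrying in %v", attempts, err, delay)

		// Wait for the backoff delay, but return early on shutdown.
		select {
		case <-stop:
			return attempts, false
		case <-time.After(delay):
		}
		delay = nextBackoff(delay, max)
	}
}
```

A caller would pass its own dial function as `connect`, for example with an initial delay of one second and a cap of thirty seconds. Capping the delay keeps a recovering node from waiting too long once Zero comes up, and the `stop` channel lets the process shut down cleanly while it is still waiting.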

@manishrjain manishrjain added the cleanup label Mar 30, 2018

@manishrjain manishrjain added this to the Sprint-001 milestone Mar 30, 2018

Member

manishrjain commented Mar 30, 2018

> May I suggest having Zero retry forever, instead of panicking?

Just to clarify, you mean having Alpha retry forever, instead of panicking.

aphyr commented Mar 30, 2018

Whoops, yes, got my names jumbled. :)
