server: draining hangs when quorum is lost #14620

@jseldess

This isn't an issue for production clusters, which will be upgraded in a rolling fashion, but it is a usability issue for quick test clusters.

Once you lose quorum, cockroach quit hangs on the remaining nodes, so they can't be shut down gracefully. Instead, you have to force-kill the process.

~/src/github.com/cockroachdb/cockroach$ cockroach start --background --store=repdemo-node1
CockroachDB node starting at 2017-04-04 18:08:22.607233193 -0400 EDT
build:      CCL 274f7e5 @ 2017/04/04 04:16:48 (go1.8)
admin:      http://localhost:8080
sql:        postgresql://root@localhost:26257?sslmode=disable
logs:       repdemo-node1/logs
store[0]:   path=repdemo-node1
status:     initialized new cluster
clusterID:  5f41d0b5-c814-40b8-a356-6def69281b92
nodeID:     1
~/src/github.com/cockroachdb/cockroach$ cockroach start --background --store=repdemo-node2 --port=26258 --http-port=8081 --join=localhost:26257
CockroachDB node starting at 2017-04-04 18:08:32.159974578 -0400 EDT
build:      CCL 274f7e5 @ 2017/04/04 04:16:48 (go1.8)
admin:      http://localhost:8081
sql:        postgresql://root@localhost:26258?sslmode=disable
logs:       repdemo-node2/logs
store[0]:   path=repdemo-node2
status:     initialized new node, joined pre-existing cluster
clusterID:  5f41d0b5-c814-40b8-a356-6def69281b92
nodeID:     2
~/src/github.com/cockroachdb/cockroach$ cockroach start --background --store=repdemo-node3 --port=26259 --http-port=8082 --join=localhost:26257
CockroachDB node starting at 2017-04-04 18:08:39.068806601 -0400 EDT
build:      CCL 274f7e5 @ 2017/04/04 04:16:48 (go1.8)
admin:      http://localhost:8082
sql:        postgresql://root@localhost:26259?sslmode=disable
logs:       repdemo-node3/logs
store[0]:   path=repdemo-node3
status:     initialized new node, joined pre-existing cluster
clusterID:  5f41d0b5-c814-40b8-a356-6def69281b92
nodeID:     3
~/src/github.com/cockroachdb/cockroach$ cockroach quit --port=26259
initiating graceful shutdown of server
server drained and shutdown completed
ok
~/src/github.com/cockroachdb/cockroach$ cockroach quit --port=26258
initiating graceful shutdown of server
ok
server drained and shutdown completed
~/src/github.com/cockroachdb/cockroach$ cockroach quit --port=26257 --logtostderr

Note that the third quit, issued against the last remaining node (node 1, port 26257), never returns. Here's what you see toward the end of that node's logs:

I170404 18:09:49.786408 390 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26258: getsockopt: connection refused"; Reconnecting to {localhost:26258 <nil>}
I170404 18:09:50.644194 576 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26259: getsockopt: connection refused"; Reconnecting to {localhost:26259 <nil>}
I170404 18:09:50.840000 390 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26258: getsockopt: connection refused"; Reconnecting to {localhost:26258 <nil>}
I170404 18:09:51.706244 576 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26259: getsockopt: connection refused"; Reconnecting to {localhost:26259 <nil>}
I170404 18:09:51.983215 390 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26258: getsockopt: connection refused"; Reconnecting to {localhost:26258 <nil>}
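
Until this is fixed, the "try a graceful quit, then force kill" fallback described above can be scripted for throwaway test clusters. The following is only a rough sketch, not an official workaround: the function name quit_or_kill is made up for illustration, it assumes the GNU coreutils timeout utility is installed, and it assumes you already know the PID of the node process.

```shell
#!/bin/sh
# quit_or_kill PID SECONDS CMD...
# Hypothetical helper: run CMD (the graceful shutdown command) with a
# deadline of SECONDS. If CMD fails or does not finish in time, send
# SIGKILL to PID. Assumes GNU coreutils `timeout` is available.
quit_or_kill() {
  pid=$1
  secs=$2
  shift 2
  if timeout "$secs" "$@"; then
    echo "graceful shutdown completed"
  else
    echo "graceful shutdown failed or timed out; force killing $pid"
    kill -9 "$pid"
  fi
}

# Example for the last node in the session above (commented out so that
# the sketch does not touch a running cluster; NODE1_PID is a placeholder
# for however you look up the process ID):
#   quit_or_kill "$NODE1_PID" 30 cockroach quit --port=26257
```

A force kill skips draining entirely, so this only makes sense for local test clusters where losing in-flight work doesn't matter.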
