locality-advertise-addr is not working #42741
I suspect that this may be related to the gossip bootstrap persistence feature:

cockroach/pkg/gossip/gossip.go Line 845 in a7f0af0

This uses

This feature was built with the assumption that the primary/public address for the node would be reachable from anywhere; the locality-specific address would just be an optimization. We haven't done much testing in cases where the primary address is sometimes unreachable and so it's possible that we're relying on the primary address in some bootstrapping cases (or maybe it's not just bootstrapping, which would be a more significant bug in this feature).

It looks like you're (intentionally) in a single region/AZ for now, but what is your plan when you go to multiple regions? Will there be a public IP that works across regions or will you be using multiple private IPs? Assuming the former is your goal (which is what we usually see), you'll need to adjust your firewall to get there, and making that adjustment now should get things working (although we'll need to confirm that after bootstrapping it is transitioning onto the more efficient private IPs).
This is redundant - it's a list of rules with first-match-wins, so you only want to specify the level that determines access to the private IP (typically |
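To make the first-match-wins point concrete, here is a minimal Python sketch of how such a rule list could be evaluated. It is an illustration under stated assumptions, not CockroachDB's actual implementation: the function names `parse_locality_addrs` and `pick_address` and the example IPs are hypothetical.

```python
def parse_locality_addrs(flag_value):
    """Parse 'key=value@addr,...' into an ordered list of ((key, value), addr)."""
    rules = []
    for item in flag_value.split(","):
        tier, addr = item.split("@", 1)
        key, value = tier.split("=", 1)
        rules.append(((key, value), addr))
    return rules


def pick_address(rules, client_locality, default_addr):
    """Return the address of the first rule whose tier matches the client.

    Rules are checked in order; later rules are never consulted once one
    matches. If nothing matches, fall back to the default (public) address.
    """
    for (key, value), addr in rules:
        if client_locality.get(key) == value:
            return addr
    return default_addr


# Hypothetical example: three rules that all point at the same private IP.
rules = parse_locality_addrs(
    "cloud=gcp@10.0.0.4,region=us-east1@10.0.0.4,datacenter=us-east1-c@10.0.0.4"
)
peer = {"cloud": "gcp", "region": "us-east1", "datacenter": "us-east1-c"}
print(pick_address(rules, peer, "203.0.113.4"))              # 10.0.0.4
print(pick_address(rules, {"cloud": "aws"}, "203.0.113.4"))  # 203.0.113.4
```

In this sketch the `cloud=gcp` rule already matches every peer in the cluster, so the two rules after it never influence the result, which is why listing every tier with the same address adds nothing.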
Running a 9-node cluster on GCP with 19.2.0.
./cockroach start --cache=25% --max-sql-memory=35% --background --locality=cloud=gcp,region=us-east1,datacenter=us-east1-c --store=path=/mnt/d1,attrs=ssd,size=90% --log-dir=log --certs-dir=certs --max-disk-temp-storage=100GB --locality-advertise-addr=cloud=gcp@{Private IP},region=us-east1@{Private IP},datacenter=us-east1-c@{Private IP} --join={N1 Private IP},{N2 Private IP},{Nx Private IP} --advertise-addr={Public IP}
Start all nodes, and it looks like all nodes are healthy

However, in the network diagnostics pages

Confirmed that all nodes are in the same region

If I shut down the cluster and restart it, the network diagnostics page becomes

On the problematic node, the log is flooded with entries like these:
W191125 17:58:47.883009 19657 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N3}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp {Public IP N3}:26257: i/o timeout". Reconnecting...
I191125 17:58:48.663426 20622 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n4] circuitbreaker: gossip [::]:26257->{Public IP N9}:26257 tripped: initial connection heartbeat failed: operation "rpc heartbeat" timed out after 6s: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I191125 17:58:48.663437 20622 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n4] circuitbreaker: gossip [::]:26257->{Public IP N9}:26257 event: BreakerTripped
W191125 17:58:48.883192 19657 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N3}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
I191125 17:58:52.045207 187 server/status/runtime.go:498 [n4] runtime stats: 5.0 GiB RSS, 363 goroutines, 174 MiB/60 MiB/271 MiB GO alloc/idle/total, 4.1 GiB/4.8 GiB CGO alloc/total, 91.6 CGO/sec, 14.8/0.8 %(u/s)time, 0.0 %gc (1x), 606 KiB/456 KiB (r/w)net
W191125 17:58:52.057512 182 server/node.go:745 [n4] [n4,s4]: unable to compute metrics: [n4,s4]: system config not yet available
W191125 17:58:52.217886 161 storage/replica_range_lease.go:554 can't determine lease status due to node liveness error: node not in the liveness table github.com/cockroachdb/cockroach/pkg/storage.init.ializers /go/src/github.com/cockroachdb/cockroach/pkg/storage/node_liveness.go:44 runtime.main /usr/local/go/src/runtime/proc.go:188 runtime.goexit /usr/local/go/src/runtime/asm_amd64.s:1337
W191125 17:58:57.217893 162 storage/replica_range_lease.go:554 can't determine lease status due to node liveness error: node not in the liveness table github.com/cockroachdb/cockroach/pkg/storage.init.ializers /go/src/github.com/cockroachdb/cockroach/pkg/storage/node_liveness.go:44 runtime.main /usr/local/go/src/runtime/proc.go:188 runtime.goexit /usr/local/go/src/runtime/asm_amd64.s:1337
W191125 17:58:58.008692 20241 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N2}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp {Public IP N2}:26257: i/o timeout". Reconnecting...
I191125 17:58:58.361063 19445 storage/store_snapshot.go:978 [n4,raftsnapshot,s4,r262/3:/Table/60/2/"5{9aaca…-b073e…}] sending LEARNER snapshot fcabe123 at applied index 2404159
I191125 17:58:58.517305 155 storage/store_remove_replica.go:129 [n4,s4,r262/3:/Table/60/2/"5{9aaca…-b073e…}] removing replica r262/3
W191125 17:58:59.008852 20241 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N2}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W191125 17:58:59.008859 20391 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N8}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp {Public IP N8}:26257: i/o timeout". Reconnecting...
I191125 17:58:59.010597 21293 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n4] circuitbreaker: gossip [::]:26257->{Public IP N3}:26257 tripped: initial connection heartbeat failed: operation "rpc heartbeat" timed out after 6s: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I191125 17:58:59.010610 21293 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n4] circuitbreaker: gossip [::]:26257->{Public IP N3}:26257 event: BreakerTripped
If I start all nodes with --advertise-addr={Private IP}, everything goes back to normal.
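The `i/o timeout` errors in the log are all dials to public IPs on port 26257. One way to confirm whether that path is blocked between nodes is a plain TCP check run from the problematic node. The following is a minimal Python sketch; the helper name `can_connect` and the 3-second timeout are assumptions chosen for illustration, and a successful connect only shows the port is open, not that the node is healthy.

```python
import socket
import sys


def can_connect(host, port=26257, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False


if __name__ == "__main__":
    # Pass the addresses to probe as arguments, e.g. each peer's public
    # and private IP.
    for host in sys.argv[1:]:
        state = "reachable" if can_connect(host) else "unreachable"
        print(f"{host}:26257 {state}")
```

If the private IPs report reachable and the public IPs do not, that would be consistent with the connections above being attempted against the `--advertise-addr` value rather than the locality-specific address.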