
rpc: automatically maintain RPC connections across cluster #70111

Open
erikgrinaker opened this issue Sep 13, 2021 · 3 comments

erikgrinaker commented Sep 13, 2021

Currently, components must dial remote nodes themselves whenever they interact with them, to make sure a connection is established. This mostly applies to Raft, DistSender, DistSQL, and the closed timestamp side transport. These components also interact with the RPC health checks and circuit breakers.

This is problematic because it can introduce very high latency (tens of seconds) when interacting with unresponsive nodes (e.g. when the server/VM is shut down or during network unavailability), which is not acceptable for many performance-critical code paths (see #70017). It also muddies the waters as to who is responsible for running health checks and circuit breaker probes (see #68419).

Instead, a single actor (i.e. a goroutine) should be responsible for maintaining RPC connections to other nodes and performing health checks. Connection health should be exposed in such a way that RPC clients can fail fast whenever they try to interact with a known-bad node (a minimal sketch follows below).

@tbg has some thoughts on the health check/breaker issues in #68419 (comment), but we should consider extending that proposal to also manage all RPC connections by dialing remote nodes as appropriate.
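
To make the proposal concrete, here is a minimal sketch, assuming a per-node probe function and invented names throughout (this is not the actual `rpc` package API):

```go
// Hypothetical sketch of the single-actor design: one goroutine per remote
// node maintains the connection and probes its health, and clients consult
// the shared result to fail fast. All names here are invented.
package connmgr

import (
	"context"
	"errors"
	"sync"
	"time"
)

type NodeID int32

type connManager struct {
	mu      sync.RWMutex
	healthy map[NodeID]bool
}

func newConnManager() *connManager {
	return &connManager{healthy: make(map[NodeID]bool)}
}

// run is the per-node actor. A real implementation would dial the node's
// gRPC connection and run the heartbeat service; here the probe is abstract.
func (m *connManager) run(ctx context.Context, id NodeID, probe func(context.Context) error) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			err := probe(ctx)
			m.mu.Lock()
			m.healthy[id] = err == nil
			m.mu.Unlock()
		}
	}
}

var errUnhealthy = errors.New("node is known to be unhealthy; failing fast")

// ConnHealth lets clients (Raft, DistSender, DistSQL, the closed timestamp
// side transport) fail fast on a known-bad node instead of paying tens of
// seconds of dial/timeout latency themselves.
func (m *connManager) ConnHealth(id NodeID) error {
	m.mu.RLock()
	defer m.mu.RUnlock()
	if !m.healthy[id] {
		return errUnhealthy
	}
	return nil
}
```

The key property is that dialing and probing happen only in the actor; clients only ever read cached state, so a dead node costs them a map lookup rather than a dial timeout.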

/cc @cockroachdb/kv

Jira issue: CRDB-9948

gz#13169

Epic CRDB-32137

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-server Relating to the KV-level RPC server A-server-networking Pertains to network addressing,routing,initialization labels Sep 13, 2021
@erikgrinaker erikgrinaker added the T-kv KV Team label Sep 13, 2021
@blathers-crl blathers-crl bot added this to Incoming in KV Sep 13, 2021
@lunevalex lunevalex added this to To do in DB Server & Security via automation Sep 14, 2021
@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Sep 14, 2021
irfansharif (Contributor) commented:

It would be nice to address this while we're here. I think I was seeing this happen in an internal support case:

// TODO(bdarnell): Reconcile the different health checks and circuit breaker
// behavior in this file. Note that this different behavior causes problems
// for higher-levels in the system. For example, DistSQL checks for
// ConnHealth when scheduling processors, but can then see attempts to send
// RPCs fail when dial fails due to an open breaker. Reset the breaker here
// as a stop-gap before the reconciliation occurs.
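
As a toy model of the inconsistency the TODO describes (invented names, not the real `rpc.Context`), the two calls consult different state and can therefore disagree:

```go
// Toy model: ConnHealth reads heartbeat results while Dial reads the
// circuit breaker, so a caller can pass the health check and still fail
// to dial. All names here are invented for illustration.
package dialcheck

import "errors"

type breaker struct{ tripped bool }

type nodeDialer struct {
	heartbeatOK bool    // what ConnHealth consults
	br          breaker // what Dial consults
}

func (d *nodeDialer) ConnHealth() error {
	if !d.heartbeatOK {
		return errors.New("unhealthy")
	}
	return nil // healthy as far as heartbeats are concerned...
}

func (d *nodeDialer) Dial() (conn interface{}, err error) {
	if d.br.tripped {
		// ...yet the dial fails anyway, because the breaker has its own view.
		return nil, errors.New("breaker open")
	}
	return struct{}{}, nil
}
```

A DistSQL-style scheduler that calls ConnHealth before Dial can therefore still see dial failures, which is what the stop-gap breaker reset papers over.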

tbg added a commit to tbg/cockroach that referenced this issue Sep 30, 2021
See cockroachdb#68419 (comment) for the original discussion.

This commit adds a new `circuit` package that uses probing-based
circuit breakers. This breaker does *not* recruit the occasional
request to carry out the probing. Instead, the circuit breaker
is configured with an "asynchronous probe" that effectively
determines when the breaker should reset.

We prefer this approach precisely because it avoids recruiting
regular traffic, which is often tied to end-user requests, and
led to unacceptable latencies there.

The potential downside of the probing approach is that the breaker setup
is more complex and there is residual risk of configuring the probe
differently from the actual client requests. In the worst case, the
breaker would be perpetually tripped even though everything should be
fine. This isn't expected - our two uses of circuit breakers are
pretty clear about what they protect - but it is worth mentioning
as this consideration likely influenced the design of the original
breaker.

Touches cockroachdb#69888
Touches cockroachdb#70111
Touches cockroachdb#53410

Also, this breaker was designed to be a good fit for:
cockroachdb#33007

Release note: None
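
A compressed sketch of that idea follows, assuming invented names rather than the eventual `circuit` package API: the breaker exposes a cheap signal for clients, and only a dedicated probe goroutine ever resets it.

```go
// Sketch of a probing-based circuit breaker (invented API, modeled loosely
// on the commit message above). Client requests are never recruited to
// probe; a dedicated goroutine decides when the breaker resets.
package circuitsketch

import (
	"context"
	"sync"
	"time"
)

type Breaker struct {
	mu    sync.RWMutex
	err   error // non-nil while the breaker is tripped
	probe func(ctx context.Context) error
}

// Report trips the breaker. Clients report failures but never probe.
func (b *Breaker) Report(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.err = err
}

// Signal is the fast path: a tripped breaker makes clients fail fast.
func (b *Breaker) Signal() error {
	b.mu.RLock()
	defer b.mu.RUnlock()
	return b.err
}

// runProbe is the "asynchronous probe": while the breaker is tripped, it
// periodically checks the protected resource and resets on success, so
// end-user traffic never pays the probing latency.
func (b *Breaker) runProbe(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if b.Signal() == nil {
				continue // not tripped, nothing to probe
			}
			if b.probe(ctx) == nil {
				b.mu.Lock()
				b.err = nil // reset: the probe succeeded
				b.mu.Unlock()
			}
		}
	}
}
```

The downside named in the commit message shows up here directly: if the probe exercises a different path than real client requests, the breaker can stay tripped (or reset) when it shouldn't.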
nvanbenschoten (Member) commented:

> I think I was seeing this happen for an internal support case.

We no longer believe that this was the issue in the referenced support case.

tbg added a commit to tbg/cockroach that referenced this issue Oct 1, 2021
(Same commit message as the Sep 30 commit above, with one addition after the reference to cockroachdb#33007: "which will use the `Signal()` call.")
@lunevalex lunevalex moved this from Incoming to Current Milestone in KV Oct 1, 2021

tbg commented Oct 4, 2021

There's a push to rework some of the RPC circuit breaking, which is outlined in #70485 (comment).

@nvanbenschoten nvanbenschoten removed this from Current Milestone in KV Oct 29, 2021
@nvanbenschoten nvanbenschoten added this to Incoming in Replication via automation Oct 29, 2021
@nvanbenschoten nvanbenschoten moved this from Incoming to Current Milestone in Replication Oct 29, 2021
@erikgrinaker erikgrinaker moved this from Current Milestone to Backlog in Replication Nov 13, 2021
@erikgrinaker erikgrinaker removed this from Backlog in Replication Feb 25, 2022
@knz knz removed this from To do in DB Server & Security Jun 16, 2022
@knz knz added this to Incoming in KV via automation Jun 16, 2022
@knz knz removed the T-server-and-security DB Server & Security label Jun 16, 2022
@nvanbenschoten nvanbenschoten moved this from Incoming to On Hold in KV Jul 18, 2022