
rpc: automatically maintain RPC connections across cluster #70111

Open
erikgrinaker opened this issue Sep 13, 2021 · 3 comments

erikgrinaker commented Sep 13, 2021

Currently, components must dial remote nodes themselves whenever they interact with them, to make sure a connection is established. This mostly applies to Raft, DistSender, DistSQL, and the closed timestamp side transport. These components also interact with the RPC health checks and circuit breakers.

This is problematic because it can introduce very high latency (tens of seconds) when interacting with unresponsive nodes (e.g. when the server/VM is shut down or during network unavailability), which is not acceptable for many performance-critical code paths (see #70017). It also muddies the waters as to who is responsible for running health checks and circuit breaker probes (see #68419).

Instead, a single actor (i.e. a goroutine) should be responsible for maintaining RPC connections to other nodes and performing health checks. Connection health should be exposed in such a way that RPC clients can fail fast whenever they try to interact with a known-bad node (a minimal sketch follows below).

@tbg has some thoughts on the health check/breaker issues in #68419 (comment), but we should consider extending that proposal to also manage all RPC connections by dialing remote nodes as appropriate.
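
To make the proposal concrete, here is a minimal sketch, assuming a per-node probe function and invented names throughout (this is not the actual `rpc` package API):

```go
// Hypothetical sketch of the single-actor design: one goroutine per remote
// node maintains the connection and probes its health, and clients consult
// the shared result to fail fast. All names here are invented.
package connmgr

import (
	"context"
	"errors"
	"sync"
	"time"
)

type NodeID int32

type connManager struct {
	mu      sync.RWMutex
	healthy map[NodeID]bool
}

func newConnManager() *connManager {
	return &connManager{healthy: make(map[NodeID]bool)}
}

// run is the per-node actor. A real implementation would dial the node's
// gRPC connection and run the heartbeat service; here the probe is abstract.
func (m *connManager) run(ctx context.Context, id NodeID, probe func(context.Context) error) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			err := probe(ctx)
			m.mu.Lock()
			m.healthy[id] = err == nil
			m.mu.Unlock()
		}
	}
}

var errUnhealthy = errors.New("node is known to be unhealthy; failing fast")

// ConnHealth lets clients (Raft, DistSender, DistSQL, the closed timestamp
// side transport) fail fast on a known-bad node instead of paying tens of
// seconds of dial/timeout latency themselves.
func (m *connManager) ConnHealth(id NodeID) error {
	m.mu.RLock()
	defer m.mu.RUnlock()
	if !m.healthy[id] {
		return errUnhealthy
	}
	return nil
}
```

The key property is that dialing and probing happen only in the actor; clients only ever read cached state, so a dead node costs them a map lookup rather than a dial timeout.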

/cc @cockroachdb/kv

Jira issue: CRDB-9948

gz#13169

Epic CRDB-32137

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-server Relating to the KV-level RPC server A-server-networking Pertains to network addressing,routing,initialization labels Sep 13, 2021
@erikgrinaker erikgrinaker added the T-kv KV Team label Sep 13, 2021
@blathers-crl blathers-crl bot added this to Incoming in KV Sep 13, 2021
@lunevalex lunevalex added this to To do in DB Server & Security via automation Sep 14, 2021
@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Sep 14, 2021
irfansharif (Contributor) commented:

It would be nice to address this while we're here. I think I was seeing this happen in an internal support case:

// TODO(bdarnell): Reconcile the different health checks and circuit breaker
// behavior in this file. Note that this different behavior causes problems
// for higher-levels in the system. For example, DistSQL checks for
// ConnHealth when scheduling processors, but can then see attempts to send
// RPCs fail when dial fails due to an open breaker. Reset the breaker here
// as a stop-gap before the reconciliation occurs.
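
As a toy model of the inconsistency the TODO describes (invented names, not the real `rpc.Context`), the two calls consult different state and can therefore disagree:

```go
// Toy model: ConnHealth reads heartbeat results while Dial reads the
// circuit breaker, so a caller can pass the health check and still fail
// to dial. All names here are invented for illustration.
package dialcheck

import "errors"

type breaker struct{ tripped bool }

type nodeDialer struct {
	heartbeatOK bool    // what ConnHealth consults
	br          breaker // what Dial consults
}

func (d *nodeDialer) ConnHealth() error {
	if !d.heartbeatOK {
		return errors.New("unhealthy")
	}
	return nil // healthy as far as heartbeats are concerned...
}

func (d *nodeDialer) Dial() (conn interface{}, err error) {
	if d.br.tripped {
		// ...yet the dial fails anyway, because the breaker has its own view.
		return nil, errors.New("breaker open")
	}
	return struct{}{}, nil
}
```

A DistSQL-style scheduler that calls ConnHealth before Dial can therefore still see dial failures, which is what the stop-gap breaker reset papers over.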

tbg added a commit to tbg/cockroach that referenced this issue Sep 30, 2021
See cockroachdb#68419 (comment) for the original discussion.

This commit adds a new `circuit` package that uses probing-based
circuit breakers. This breaker does *not* recruit the occasional
request to carry out the probing. Instead, the circuit breaker
is configured with an "asynchronous probe" that effectively
determines when the breaker should reset.

We prefer this approach precisely because it avoids recruiting
regular traffic, which is often tied to end-user requests, and
led to unacceptable latencies there.

The potential downside of the probing approach is that the breaker setup
is more complex and there is residual risk of configuring the probe
differently from the actual client requests. In the worst case, the
breaker would be perpetually tripped even though everything should be
fine. This isn't expected - our two uses of circuit breakers are
pretty clear about what they protect - but it is worth mentioning
as this consideration likely influenced the design of the original
breaker.

Touches cockroachdb#69888
Touches cockroachdb#70111
Touches cockroachdb#53410

Also, this breaker was designed to be a good fit for:
cockroachdb#33007

Release note: None
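
A compressed sketch of that idea follows, assuming invented names rather than the eventual `circuit` package API: the breaker exposes a cheap signal for clients, and only a dedicated probe goroutine ever resets it.

```go
// Sketch of a probing-based circuit breaker (invented API, modeled loosely
// on the commit message above). Client requests are never recruited to
// probe; a dedicated goroutine decides when the breaker resets.
package circuitsketch

import (
	"context"
	"sync"
	"time"
)

type Breaker struct {
	mu    sync.RWMutex
	err   error // non-nil while the breaker is tripped
	probe func(ctx context.Context) error
}

// Report trips the breaker. Clients report failures but never probe.
func (b *Breaker) Report(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.err = err
}

// Signal is the fast path: a tripped breaker makes clients fail fast.
func (b *Breaker) Signal() error {
	b.mu.RLock()
	defer b.mu.RUnlock()
	return b.err
}

// runProbe is the "asynchronous probe": while the breaker is tripped, it
// periodically checks the protected resource and resets on success, so
// end-user traffic never pays the probing latency.
func (b *Breaker) runProbe(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if b.Signal() == nil {
				continue // not tripped, nothing to probe
			}
			if b.probe(ctx) == nil {
				b.mu.Lock()
				b.err = nil // reset: the probe succeeded
				b.mu.Unlock()
			}
		}
	}
}
```

The downside named in the commit message shows up here directly: if the probe exercises a different path than real client requests, the breaker can stay tripped (or reset) when it shouldn't.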
nvanbenschoten (Member) commented:

> I think I was seeing this happen for an internal support case.

We no longer believe that this was the issue in the referenced support case.

tbg added a commit to tbg/cockroach that referenced this issue Oct 1, 2021
(Same commit message as the Sep 30 commit above, with one addition after the reference to cockroachdb#33007: "which will use the `Signal()` call.")
@lunevalex lunevalex moved this from Incoming to Current Milestone in KV Oct 1, 2021

tbg commented Oct 4, 2021

There's a push to rework some of the RPC circuit breaking, which is outlined in #70485 (comment).

@nvanbenschoten nvanbenschoten removed this from Current Milestone in KV Oct 29, 2021
@nvanbenschoten nvanbenschoten added this to Incoming in Replication via automation Oct 29, 2021
@nvanbenschoten nvanbenschoten moved this from Incoming to Current Milestone in Replication Oct 29, 2021
@erikgrinaker erikgrinaker moved this from Current Milestone to Backlog in Replication Nov 13, 2021
@erikgrinaker erikgrinaker removed this from Backlog in Replication Feb 25, 2022
@knz knz removed this from To do in DB Server & Security Jun 16, 2022
@knz knz added this to Incoming in KV via automation Jun 16, 2022
@knz knz removed the T-server-and-security DB Server & Security label Jun 16, 2022
@nvanbenschoten nvanbenschoten moved this from Incoming to On Hold in KV Jul 18, 2022