
kvclient: DistSender circuit breakers for unreachable leaseholders #93501

Closed
erikgrinaker opened this issue Dec 13, 2022 · 3 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Dec 13, 2022

We should implement per-range circuit breakers in the DistSender. The motivation is to avoid clients getting "stuck" and hanging forever when the SQL gateway is unable to reach a leaseholder, e.g. because of a partial network partition where the link between the SQL gateway and the leaseholder is down while the rest of the network is functional, allowing the leaseholder to retain its lease and Raft leadership (see internal document). Clients that get stuck like this without client-side timeouts can quickly exhaust connection pools or worker pools, causing a complete outage of the entire application even though requests to other gateways or ranges might work just fine.

The DistSender should maintain per-range circuit breakers, shared between DistSender instances on a node. When the DistSender is unable to reach a range's leaseholder after some time, it should trip the circuit breaker and fail fast on any pending and future requests to it. In the background, it should regularly probe the range and reset the breaker when it recovers. This should be built using pkg/util/circuit.Breaker.
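
For illustration, here is a rough sketch of what a shared per-range breaker registry might look like. This is not the actual pkg/util/circuit.Breaker API; the breaker type, rangeBreakers, forRange, and the probe interval below are simplified stand-ins invented for this example.

```go
// Illustrative sketch only: a per-range breaker registry shared by all
// DistSender instances on a node. The breaker type is a simplified stand-in
// for pkg/util/circuit.Breaker, not its real API.
package kvclient

import (
	"sync"
	"time"
)

// breaker trips with an error, fails fast while tripped, and is reset by a
// background probe once the range becomes reachable again.
type breaker struct {
	mu  sync.Mutex
	err error // non-nil while tripped
}

// Err returns the trip error, or nil if the breaker is not tripped. Callers
// check this before sending and fail fast if it is non-nil.
func (b *breaker) Err() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.err
}

// trip marks the breaker as tripped and launches a background probe that
// resets it once the probe succeeds.
func (b *breaker) trip(err error, probe func() error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.err != nil {
		return // already tripped; a probe is already running
	}
	b.err = err
	go func() {
		for {
			time.Sleep(time.Second) // arbitrary probe interval for the sketch
			if probe() == nil {
				b.mu.Lock()
				b.err = nil // range is reachable again; reset
				b.mu.Unlock()
				return
			}
		}
	}()
}

// rangeBreakers maps a range ID (int64 stands in for roachpb.RangeID) to its
// breaker, shared across DistSenders on the node.
type rangeBreakers struct {
	mu       sync.Mutex
	breakers map[int64]*breaker
}

// forRange returns the breaker for the given range, creating it on first use.
func (rb *rangeBreakers) forRange(rangeID int64) *breaker {
	rb.mu.Lock()
	defer rb.mu.Unlock()
	if rb.breakers == nil {
		rb.breakers = map[int64]*breaker{}
	}
	b, ok := rb.breakers[rangeID]
	if !ok {
		b = &breaker{}
		rb.breakers[rangeID] = b
	}
	return b
}
```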

We should only trip the breaker on very specific failures. Notably, we should not trip it just because some request is slow to process (e.g. a big scan or something blocked on a latch); we should only trip it when we know for certain that we're unable to reach the leaseholder. This typically means that we receive network errors from the current leaseholder (i.e. we're unable to establish an RPC connection to it), and/or we keep getting NotLeaseHolderError from all reachable replicas in the range and no one else eventually acquires a lease and serves requests. This includes the case where there is no leaseholder and no one is able to acquire a lease.
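
To make the distinction concrete, here is a hedged sketch of the kind of error classification this implies. The error type and helper below (notLeaseHolderError, countsTowardTrip) are invented stand-ins for the real KV errors and DistSender logic, not actual CockroachDB code.

```go
// Illustrative only: deciding whether an error from a replica should count
// toward tripping the per-range breaker.
package kvclient

import (
	"errors"
	"net"
)

// notLeaseHolderError stands in for the real NotLeaseHolderError.
type notLeaseHolderError struct{}

func (notLeaseHolderError) Error() string { return "not leaseholder" }

// countsTowardTrip reports whether err indicates that we may be unable to
// reach a leaseholder, as opposed to a request that is merely slow or stuck.
func countsTowardTrip(err error) bool {
	// Network-level failures reaching the replica (connection refused,
	// timeouts, etc.) count toward tripping.
	var netErr net.Error
	if errors.As(err, &netErr) {
		return true
	}
	// NotLeaseHolderErrors only count in aggregate: the caller must have
	// observed them from all reachable replicas without anyone eventually
	// acquiring a lease and serving the request (that bookkeeping is not
	// modeled here).
	var nlhe notLeaseHolderError
	if errors.As(err, &nlhe) {
		return true
	}
	// Anything else (including slow or stuck requests) never trips the
	// breaker, since we can't tell whether it is making progress.
	return false
}
```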

This is similar to existing circuit breakers we have at the replica level and RPC level. Its purpose here is specifically to terminate the otherwise-indefinite retry loops in the DistSender when we're unable to reach a leaseholder.

This should be integrated with requests via followers (#93503) if we choose to implement that, in which case the circuit breaker should only trip when no followers were able to process the request either.

Jira issue: CRDB-22368

Epic CRDB-25200

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-client Relating to the KV client and the KV interface. T-kv KV Team labels Dec 13, 2022
@erikgrinaker
Contributor Author

erikgrinaker commented Apr 14, 2023

It's possible that we can do something cheap here.

In general, we only want to fail fast on permanent errors -- in particular, we shouldn't fail fast on slow/stuck requests, since there's no way to determine whether they're actually making progress or not.

Does this mean that the only errors we can fail fast on are network-level errors, e.g. DNS timeouts, TCP timeouts, connection refused, and so on? If so, then we already have RPC-level circuit breakers which will fail fast on network errors. In particular, these prevent clients from getting stuck waiting for network timeouts. Unfortunately, they still hijack client requests to probe nodes, which contributes to tail latency, but this is being resolved by the new circuit breaker infrastructure in #96566.

If all permanent network errors fail fast via RPC circuit breakers, and those are the only errors we can fail fast on, can we then simply bound the number of DistSender retries for permanent errors like in #93501? We'll probably want to be careful about which cases are/aren't covered by this logic, but it seems viable (and cheap!).
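
For illustration, a minimal sketch of what bounding retries on permanent errors could look like, assuming a hypothetical isPermanent classifier and an arbitrary retry limit (none of these names or values come from the actual DistSender).

```go
// Hypothetical sketch: bound DistSender retries only for errors classified as
// permanent, and keep retrying otherwise. sendToReplicas and isPermanent are
// stand-ins for the real DistSender machinery.
package kvclient

import (
	"context"
	"errors"
	"fmt"
)

const maxPermanentErrAttempts = 3 // arbitrary limit for the sketch

func sendWithBoundedRetries(
	ctx context.Context,
	sendToReplicas func(context.Context) error,
	isPermanent func(error) bool,
) error {
	permanentErrs := 0
	for {
		err := sendToReplicas(ctx)
		if err == nil {
			return nil
		}
		if isPermanent(err) {
			permanentErrs++
			if permanentErrs >= maxPermanentErrAttempts {
				// Fail fast back to the client instead of retrying forever.
				return fmt.Errorf("giving up after %d permanent errors: %w",
					permanentErrs, err)
			}
		}
		if ctxErr := ctx.Err(); ctxErr != nil {
			return errors.Join(ctxErr, err)
		}
		// Otherwise retry; backoff elided for brevity.
	}
}
```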

See also internal Slack thread.

@erikgrinaker
Contributor Author

erikgrinaker commented Apr 14, 2023

I guess I answered my own question in the description above. We probably want to handle this case:

and/or we keep getting NotLeaseHolderError from all reachable replicas in the range and no one else eventually acquires a lease and serves requests. This includes the case where there is no leaseholder and no one is able to acquire a lease.

We could handle that at the replica level instead, though, similar to the replica circuit breakers on the write path.

@tbg points out that the lease acquisition should already be covered by the replication circuit breakers, which may be sufficient here. We need to make sure this is properly propagated to the DistSender as a permanent error (rather than a NLHE), and that we also handle cases where there is a leaseholder but we're unable to reach it.

@exalate-issue-sync exalate-issue-sync bot added O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months labels Feb 29, 2024
@erikgrinaker erikgrinaker removed their assignment Mar 27, 2024
@arulajmani
Collaborator

This was addressed by #118943.
