kvclient: DistSender circuit breakers for unreachable leaseholders #93501
Labels
A-kv-client
Relating to the KV client and the KV interface.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
P-2
Issues/test failures with a fix SLA of 3 months
T-kv
KV Team
We should implement per-range circuit breakers in the DistSender. The motivation is to avoid clients getting "stuck" and hanging forever when the SQL gateway is unable to reach a leaseholder, e.g. because of a partial network connection where the link between the SQL gateway and leaseholder is down but the rest of the network is functional, allowing the leaseholder to maintain its lease and Raft leadership (see internal document). Clients getting stuck like this without client-side timeouts can quickly exhaust connection pools or worker pools, causing a complete outage of the entire application even though requests to other gateways or ranges might work just fine.
The DistSender should maintain per-range circuit breakers, shared between DistSender instances on a node. When the DistSender is unable to reach a range's leaseholder after some time, it should trip the circuit breaker and fail fast on any pending and future requests to it. In the background, it should regularly probe the range and reset the breaker when it recovers. This should be built using `pkg/util/circuit.Breaker`.
We should only trip the breaker on very specific failures. Notably, we should not trip it just because some request is slow to process (e.g. a big scan or something blocked on a latch); we should only trip it when we know for certain that we're unable to reach the leaseholder. This typically implies that we receive network errors from the current leaseholder (i.e. we're unable to establish an RPC connection to it), and/or we keep getting `NotLeaseHolderError` from all reachable replicas in the range and no one else eventually acquires a lease and serves requests. This includes the case where there is no leaseholder and no one is able to acquire a lease.
This is similar to existing circuit breakers we have at the replica level and RPC level. Its purpose here is specifically to terminate the otherwise-indefinite retry loops in the DistSender when we're unable to reach a leaseholder.
This should be integrated with requests via followers (#93503) if we choose to implement that, in which case the circuit breaker should only trip when no followers were able to process the request either.
Jira issue: CRDB-22368
Epic CRDB-25200