
kvclient: DistSender circuit breakers for unreachable leaseholders #93501

Closed
erikgrinaker opened this issue Dec 13, 2022 · 3 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Dec 13, 2022

We should implement per-range circuit breakers in the DistSender. The motivation is to avoid clients getting "stuck" and hanging forever when the SQL gateway is unable to reach a leaseholder, e.g. because of a partial network partition where the link between the SQL gateway and the leaseholder is down while the rest of the network is functional, allowing the leaseholder to retain its lease and Raft leadership (see internal document). Clients that get stuck like this without client-side timeouts can quickly exhaust connection pools or worker pools, causing a complete outage of the entire application even though requests to other gateways or ranges might work just fine.

The DistSender should maintain per-range circuit breakers, shared between DistSender instances on a node. When the DistSender is unable to reach a range's leaseholder after some time, it should trip the circuit breaker and fail fast on any pending and future requests to it. In the background, it should regularly probe the range and reset the breaker when it recovers. This should be built using pkg/util/circuit.Breaker.
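
For illustration, here is a rough sketch of what a shared per-range breaker registry might look like. This is not the actual pkg/util/circuit.Breaker API; the breaker type, rangeBreakers, forRange, and the probe interval below are simplified stand-ins invented for this example.

```go
// Illustrative sketch only: a per-range breaker registry shared by all
// DistSender instances on a node. The breaker type is a simplified stand-in
// for pkg/util/circuit.Breaker, not its real API.
package kvclient

import (
	"sync"
	"time"
)

// breaker trips with an error, fails fast while tripped, and is reset by a
// background probe once the range becomes reachable again.
type breaker struct {
	mu  sync.Mutex
	err error // non-nil while tripped
}

// Err returns the trip error, or nil if the breaker is not tripped. Callers
// check this before sending and fail fast if it is non-nil.
func (b *breaker) Err() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.err
}

// trip marks the breaker as tripped and launches a background probe that
// resets it once the probe succeeds.
func (b *breaker) trip(err error, probe func() error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.err != nil {
		return // already tripped; a probe is already running
	}
	b.err = err
	go func() {
		for {
			time.Sleep(time.Second) // arbitrary probe interval for the sketch
			if probe() == nil {
				b.mu.Lock()
				b.err = nil // range is reachable again; reset
				b.mu.Unlock()
				return
			}
		}
	}()
}

// rangeBreakers maps a range ID (int64 stands in for roachpb.RangeID) to its
// breaker, shared across DistSenders on the node.
type rangeBreakers struct {
	mu       sync.Mutex
	breakers map[int64]*breaker
}

// forRange returns the breaker for the given range, creating it on first use.
func (rb *rangeBreakers) forRange(rangeID int64) *breaker {
	rb.mu.Lock()
	defer rb.mu.Unlock()
	if rb.breakers == nil {
		rb.breakers = map[int64]*breaker{}
	}
	b, ok := rb.breakers[rangeID]
	if !ok {
		b = &breaker{}
		rb.breakers[rangeID] = b
	}
	return b
}
```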

We should only trip the breaker on very specific failures. Notably, we should not trip it just because some request is slow to process (e.g. a big scan or something blocked on a latch); we should only trip it when we know for certain that we're unable to reach the leaseholder. This typically means that we receive network errors from the current leaseholder (i.e. we're unable to establish an RPC connection to it), and/or we keep getting NotLeaseHolderError from all reachable replicas in the range and no one else eventually acquires a lease and serves requests. This includes the case where there is no leaseholder and no one is able to acquire a lease.
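
To make the distinction concrete, here is a hedged sketch of the kind of error classification this implies. The error type and helper below (notLeaseHolderError, countsTowardTrip) are invented stand-ins for the real KV errors and DistSender logic, not actual CockroachDB code.

```go
// Illustrative only: deciding whether an error from a replica should count
// toward tripping the per-range breaker.
package kvclient

import (
	"errors"
	"net"
)

// notLeaseHolderError stands in for the real NotLeaseHolderError.
type notLeaseHolderError struct{}

func (notLeaseHolderError) Error() string { return "not leaseholder" }

// countsTowardTrip reports whether err indicates that we may be unable to
// reach a leaseholder, as opposed to a request that is merely slow or stuck.
func countsTowardTrip(err error) bool {
	// Network-level failures reaching the replica (connection refused,
	// timeouts, etc.) count toward tripping.
	var netErr net.Error
	if errors.As(err, &netErr) {
		return true
	}
	// NotLeaseHolderErrors only count in aggregate: the caller must have
	// observed them from all reachable replicas without anyone eventually
	// acquiring a lease and serving the request (that bookkeeping is not
	// modeled here).
	var nlhe notLeaseHolderError
	if errors.As(err, &nlhe) {
		return true
	}
	// Anything else (including slow or stuck requests) never trips the
	// breaker, since we can't tell whether it is making progress.
	return false
}
```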

This is similar to existing circuit breakers we have at the replica level and RPC level. Its purpose here is specifically to terminate the otherwise-indefinite retry loops in the DistSender when we're unable to reach a leaseholder.

This should be integrated with requests via followers (#93503) if we choose to implement that, in which case the circuit breaker should only trip when no followers were able to process the request either.

Jira issue: CRDB-22368

Epic CRDB-25200

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-client Relating to the KV client and the KV interface. T-kv KV Team labels Dec 13, 2022
@erikgrinaker
Contributor Author

erikgrinaker commented Apr 14, 2023

It's possible that we can do something cheap here.

In general, we only want to fail fast on permanent errors -- in particular, we shouldn't fail fast on slow/stuck requests, since there's no way to determine whether they're actually making progress or not.

Does this mean that the only errors we can fail fast on are network-level errors, e.g. DNS timeouts, TCP timeouts, connection refused, and so on? If so, then we already have RPC-level circuit breakers which will fail fast on network errors. In particular, these prevent clients from getting stuck waiting for network timeouts. Unfortunately, they still hijack client requests to probe nodes, which contributes to tail latency, but this is being resolved by the new circuit breaker infrastructure in #96566.

If all permanent network errors fail fast via RPC circuit breakers, and those are the only errors we can fail fast on, can we then simply bound the number of DistSender retries for permanent errors like in #93501? We'll probably want to be careful about which cases are/aren't covered by this logic, but it seems viable (and cheap!).
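
For illustration, a minimal sketch of what bounding retries on permanent errors could look like, assuming a hypothetical isPermanent classifier and an arbitrary retry limit (none of these names or values come from the actual DistSender).

```go
// Hypothetical sketch: bound DistSender retries only for errors classified as
// permanent, and keep retrying otherwise. sendToReplicas and isPermanent are
// stand-ins for the real DistSender machinery.
package kvclient

import (
	"context"
	"errors"
	"fmt"
)

const maxPermanentErrAttempts = 3 // arbitrary limit for the sketch

func sendWithBoundedRetries(
	ctx context.Context,
	sendToReplicas func(context.Context) error,
	isPermanent func(error) bool,
) error {
	permanentErrs := 0
	for {
		err := sendToReplicas(ctx)
		if err == nil {
			return nil
		}
		if isPermanent(err) {
			permanentErrs++
			if permanentErrs >= maxPermanentErrAttempts {
				// Fail fast back to the client instead of retrying forever.
				return fmt.Errorf("giving up after %d permanent errors: %w",
					permanentErrs, err)
			}
		}
		if ctxErr := ctx.Err(); ctxErr != nil {
			return errors.Join(ctxErr, err)
		}
		// Otherwise retry; backoff elided for brevity.
	}
}
```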

See also internal Slack thread.

@erikgrinaker
Contributor Author

erikgrinaker commented Apr 14, 2023

I guess I answered my own question in the description above. We probably want to handle this case:

and/or we keep getting NotLeaseHolderError from all reachable replicas in the range and no one else eventually acquires a lease and serves requests. This includes the case where there is no leaseholder and no one is able to acquire a lease.

We could handle that at the replica level instead, though, similar to the replica circuit breakers on the write path.

@tbg points out that the lease acquisition should already be covered by the replication circuit breakers, which may be sufficient here. We need to make sure this is properly propagated to the DistSender as a permanent error (rather than a NLHE), and that we also handle cases where there is a leaseholder but we're unable to reach it.

@exalate-issue-sync exalate-issue-sync bot added O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months labels Feb 29, 2024
@erikgrinaker erikgrinaker removed their assignment Mar 27, 2024
@arulajmani
Collaborator

This was addressed by #118943.
