kvcoord: handle follower reads with DistSender circuit breakers #119923

erikgrinaker · 2024-03-05T13:21:55Z

The DistSender circuit breakers from #118943 should possibly handle follower reads better. The two main cases are:

A replica partitioned away from the leader will fail to acquire a lease, which may cause repeated client timeout errors, but it may still be able to serve follower reads below the closed timestamp.
A stalled replica may not be able to serve follower reads either.

There are two interactions to keep in mind:

Successful follower reads may mix with lease timeout errors, preventing the breaker from tripping, so requests that need the lease keep failing.
It may be possible for a replica with a tripped breaker to serve follower reads.

Potential follower reads use RoutingPolicy_NEAREST (this includes meta range lookups). We should consider adding specialized handling for these. RoutingPolicy_LEASEHOLDER requests may also be served as follower reads if the timestamp falls below the replica's closed timestamp, but this typically implies that the replica is in fact receiving closed timestamp updates from the leader: it typically takes 6 seconds to trip the breaker, and the replica must have been failing since the start of that interval, so 6-second-old follower reads will fail as well.

It may be reasonable to assume that the follower read timestamp is in the recent past, and that a faulty replica will remain faulty for longer than the follower read lag. Given that, we can likely ignore successful follower reads when considering recent error runs, and trip the breaker regardless of them. This will incur a latency penalty for follower reads that now have to hit other replicas, but once the follower read timestamp moves above the replica's closed timestamp (typically a few seconds later) they will incur that penalty anyway.

Alternatively, we can differentiate handling of follower reads and other requests, and e.g. use separate breakers and probes for them, but it isn't clear that this is worthwhile.

Jira issue: CRDB-36391

erikgrinaker · 2024-03-05T13:22:06Z

cc @nvanbenschoten in case you have opinions on this.

erikgrinaker · 2024-03-11T11:41:38Z

We discussed this recently, and concluded that as a first step we should simply add a setting (enabled by default) which makes the circuit breaker ignore RoutingPolicy_NEAREST requests entirely -- they will not be considered at all for replica activity tracking or error/stall detection, nor will they be rejected when the breaker is tripped.
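
Roughly, the behavior described above could look like the sketch below. This is only an illustration under assumed names: `breaker`, `ignoreNearest`, `send`, and `errBreakerTripped` are hypothetical and do not correspond to the actual setting or DistSender types.

```go
package main

import "errors"

// RoutingPolicy is a local stand-in for the routing policy named in this
// issue; it is not the real kvpb type.
type RoutingPolicy int

const (
	RoutingPolicy_LEASEHOLDER RoutingPolicy = iota
	RoutingPolicy_NEAREST
)

// errBreakerTripped is a hypothetical rejection error.
var errBreakerTripped = errors.New("replica circuit breaker tripped")

// breaker is a hypothetical per-replica circuit breaker.
type breaker struct {
	ignoreNearest bool // hypothetical setting, assumed to default to true
	tripped       bool
	tracked       int // number of requests seen by activity tracking
}

// send routes a request through the breaker. sendFn performs the actual
// RPC to the replica.
func (b *breaker) send(policy RoutingPolicy, sendFn func() error) error {
	if b.ignoreNearest && policy == RoutingPolicy_NEAREST {
		// Bypass the breaker completely: no tracking, no error/stall
		// detection, and no rejection even if the breaker is tripped.
		return sendFn()
	}
	if b.tripped {
		return errBreakerTripped
	}
	b.tracked++
	// Error and stall detection on the result would go here.
	return sendFn()
}
```

With the setting enabled and the breaker tripped, a RoutingPolicy_NEAREST request still reaches the replica, while a RoutingPolicy_LEASEHOLDER request is rejected with `errBreakerTripped` without being sent.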

blathers-crl · 2024-03-27T13:21:26Z

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this GA-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

erikgrinaker · 2024-03-27T13:22:49Z

Currently, the circuit breakers don't do anything in particular about follower reads. There is an initial PR in #120198 to ignore them during tracking, but it isn't clear that this is the right approach.

We may not need to or be able to do anything here for 24.1 (my inclination is to not have any special handling for them, at least for now), but I'm tentatively marking this as a GA blocker so that the KV team can make a determination.

arulajmani · 2024-04-08T20:44:01Z

> We may not need to or be able to do anything here for 24.1 (my inclination is to not have any special handling for them, at least for now)

Spoke to @andrewbaptist about this. We're fine not having any special case handling for them in 24.1. Removing the GA-blocker label.

erikgrinaker added the C-enhancement, A-kv-client, and T-kv labels Mar 5, 2024

erikgrinaker self-assigned this Mar 5, 2024

blathers-crl bot added this to Incoming in KV Mar 5, 2024

erikgrinaker mentioned this issue Mar 11, 2024

kvcoord: ignore follower reads in DistSender circuit breakers #120198

Closed

erikgrinaker added the GA-blocker label Mar 27, 2024

erikgrinaker removed their assignment Mar 27, 2024

erikgrinaker added the branch-release-24.1 label Mar 27, 2024

erikgrinaker mentioned this issue Mar 27, 2024

kvcoord: DistSender circuit breaker testing #119918

Closed

arulajmani removed the GA-blocker and branch-release-24.1 labels Apr 8, 2024