Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvcoord: handle follower reads with DistSender circuit breakers #119923

Open
erikgrinaker opened this issue Mar 5, 2024 · 5 comments
Open

kvcoord: handle follower reads with DistSender circuit breakers #119923

erikgrinaker opened this issue Mar 5, 2024 · 5 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Mar 5, 2024

The DistSender circuit breakers from #118943 should possibly handle follower reads better. The two main cases are:

  • A replica partitioned away from the leader will fail to acquire a lease which may cause repeated client timeout errors, but it may still be able to serve follower reads below the closed timestamp.
  • A stalled replica may not be able to serve follower reads either.

There are two interactions to keep in mind:

  • Successful follower reads may mix with lease timeout errors, preventing the breaker from tripping and causing leased request failures.
  • It may be possible for a replica with a tripped breaker to serve follower reads.

Potential follower reads use RoutingPolicy_NEAREST (this includes meta range lookups). We should consider adding specialized handling for these. RoutingPolicy_LEASEHOLDER may also be served as follower reads if the timestamp falls below the replica's closed timestamp, but this typically implies that the replica is in fact receiving closed timestamp updates from the leader -- it will typically take 6 seconds to trip the breaker, and the replica must have been failing at the start of this interval, implying that 6 second old follower reads will fail.

It may be reasonable to assume that the follower read timestamp is in the recent past, and that a faulty replica will remain faulty for longer than the follower read lag. Given that, we can likely omit successful follower reads when considering recent error runs, and trip the breaker regardless of successful follower reads. This will incur a latency penalty for follower reads that now have to hit other replicas, but once the follower read lag moves above the closed timestamp (typically a few seconds later) it will incur that penalty anyway.

Alternatively, we can differentiate handling of follower reads and other requests, and e.g. use separate breakers and probes for them, but it isn't clear that this is worthwhile.

Jira issue: CRDB-36391

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-client Relating to the KV client and the KV interface. T-kv KV Team labels Mar 5, 2024
@erikgrinaker erikgrinaker self-assigned this Mar 5, 2024
@blathers-crl blathers-crl bot added this to Incoming in KV Mar 5, 2024
@erikgrinaker
Copy link
Contributor Author

cc @nvanbenschoten in case you have opinions on this.

@erikgrinaker
Copy link
Contributor Author

We discussed this recently, and concluded that as a first step we should simply add a setting (enabled by default) which ignores RoutingPolicy_NEAREST entirely in the circuit breaker -- they will not be considered at all for replica activity tracking or error/stall detection, nor will they be rejected when the breaker is tripped.

Copy link

blathers-crl bot commented Mar 27, 2024

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this GA-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@erikgrinaker erikgrinaker removed their assignment Mar 27, 2024
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Mar 27, 2024

Currently, the circuit breakers don't do anything in particular about follower reads. There is an initial PR in #120198 to ignore them during tracking, but it isn't clear that this is the right approach.

We may not need to or be able to do anything here for 24.1 (my inclination is to not have any special handling for them at lease for now), but I'm tentatively marking this as a GA blocker so that the KV team can make a determination.

@erikgrinaker erikgrinaker added the branch-release-24.1 Used to mark GA and release blockers and technical advisories for 24.1 label Mar 27, 2024
@arulajmani
Copy link
Collaborator

We may not need to or be able to do anything here for 24.1 (my inclination is to not have any special handling for them at lease for now)

Spoke to @andrewbaptist about this. We're fine not having any special case handling for them in 24.1. Removing the GA-blocker label.

@arulajmani arulajmani removed GA-blocker branch-release-24.1 Used to mark GA and release blockers and technical advisories for 24.1 labels Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
KV
Incoming
Development

Successfully merging a pull request may close this issue.

2 participants