-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvcoord: handle follower reads with DistSender circuit breakers #119923
Comments
cc @nvanbenschoten in case you have opinions on this. |
We discussed this recently, and concluded that as a first step we should simply add a setting (enabled by default) which ignores |
Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this GA-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Currently, the circuit breakers don't do anything in particular about follower reads. There is an initial PR in #120198 to ignore them during tracking, but it isn't clear that this is the right approach. We may not need to or be able to do anything here for 24.1 (my inclination is to not have any special handling for them at lease for now), but I'm tentatively marking this as a GA blocker so that the KV team can make a determination. |
Spoke to @andrewbaptist about this. We're fine not having any special case handling for them in 24.1. Removing the GA-blocker label. |
The DistSender circuit breakers from #118943 should possibly handle follower reads better. The two main cases are:
There are two interactions to keep in mind:
Potential follower reads use
RoutingPolicy_NEAREST
(this includes meta range lookups). We should consider adding specialized handling for these.RoutingPolicy_LEASEHOLDER
may also be served as follower reads if the timestamp falls below the replica's closed timestamp, but this typically implies that the replica is in fact receiving closed timestamp updates from the leader -- it will typically take 6 seconds to trip the breaker, and the replica must have been failing at the start of this interval, implying that 6 second old follower reads will fail.It may be reasonable to assume that the follower read timestamp is in the recent past, and that a faulty replica will remain faulty for longer than the follower read lag. Given that, we can likely omit successful follower reads when considering recent error runs, and trip the breaker regardless of successful follower reads. This will incur a latency penalty for follower reads that now have to hit other replicas, but once the follower read lag moves above the closed timestamp (typically a few seconds later) it will incur that penalty anyway.
Alternatively, we can differentiate handling of follower reads and other requests, and e.g. use separate breakers and probes for them, but it isn't clear that this is worthwhile.
Jira issue: CRDB-36391
The text was updated successfully, but these errors were encountered: