
kvcoord: retry ReplicaUnavailableError against other replicas #118737

Merged

Conversation

@erikgrinaker (Contributor) commented Feb 3, 2024

kvserver: add test cases for ReplicaUnavailableError mark

kvpb: add error type for ReplicaUnavailableError

This allows accessing the error via kvpb.Error.GetDetail().

kvcoord: retry ReplicaUnavailableError against other replicas

Previously, a ReplicaUnavailableError returned by a tripped replica circuit breaker was not retried by the DistSender, instead failing fast back to the client. This was sort of by design, since these errors are intended to avoid clients getting stuck in infinite retry loops.

However, a single replica having a tripped circuit breaker does not imply that the range is unavailable -- it's possible that e.g. this replica is partially partitioned away from the quorum, or is a stale replica waiting to be caught up. If a DistSender attempts this replica first, e.g. because it was the last-known leaseholder or it's the lowest latency replica, then it will always error out, giving the appearance of an unavailable range to the client.

This patch adds DistSender retry handling of ReplicaUnavailableError, which attempts to contact other replicas in case a functional lease exists elsewhere, while also taking care to avoid retry loops where other replicas return a NotLeaseHolderError pointing back to the unavailable replica.

Resolves #118266.
Resolves #105167.
Epic: none
Release note (bug fix): if an individual replica's circuit breaker had tripped but the range was otherwise functional, e.g. because the replica was partially partitioned away from the leaseholder, it was possible for a gateway to persistently error when contacting this replica instead of retrying against a functional leaseholder elsewhere. The gateway will now retry such errors against other replicas once.
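
For illustration, here is a minimal, self-contained Go sketch of the retry idea described above. It is not the PR's actual diff: every type and helper below is a hypothetical stand-in for the DistSender's internals.

```go
package main

import (
	"errors"
	"fmt"
)

type replicaID int

// errReplicaUnavailable stands in for kvpb.ReplicaUnavailableError.
var errReplicaUnavailable = errors.New("replica circuit breaker tripped")

// notLeaseHolderError stands in for kvpb.NotLeaseHolderError: the contacted
// replica redirects the sender to its view of the leaseholder.
type notLeaseHolderError struct{ leaseholder replicaID }

func (e *notLeaseHolderError) Error() string {
	return fmt.Sprintf("not leaseholder; try r%d", e.leaseholder)
}

// sendToReplicas tries each replica in order. A ReplicaUnavailableError no
// longer fails fast: it is remembered and the remaining replicas are tried,
// in case a functional lease exists elsewhere. A NotLeaseHolderError that
// points back at a replica which already returned ReplicaUnavailableError is
// ignored, so the sender does not bounce between the two.
func sendToReplicas(replicas []replicaID, send func(replicaID) (string, error)) (string, error) {
	var deferredErr error
	unavailable := map[replicaID]bool{}
	for _, r := range replicas {
		resp, err := send(r)
		if err == nil {
			return resp, nil
		}
		if errors.Is(err, errReplicaUnavailable) {
			unavailable[r] = true
			deferredErr = err
			continue // try the next replica instead of failing fast
		}
		var nlhe *notLeaseHolderError
		if errors.As(err, &nlhe) && unavailable[nlhe.leaseholder] {
			continue // don't chase a lease pointing back at the tripped replica
		}
		deferredErr = err
	}
	return "", deferredErr
}

func main() {
	// r2 has a tripped breaker; r3 redirects to r2; r4 serves the request.
	send := func(r replicaID) (string, error) {
		switch r {
		case 2:
			return "", errReplicaUnavailable
		case 3:
			return "", &notLeaseHolderError{leaseholder: 2}
		default:
			return "ok", nil
		}
	}
	resp, err := sendToReplicas([]replicaID{2, 3, 4}, send)
	fmt.Println(resp, err) // ok <nil>
}
```

The two points mirrored from the description: a tripped breaker on one replica defers the error rather than failing fast, and a lease redirect that points back at a replica already known to be unavailable is not followed.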

@erikgrinaker added the backport-23.1.x and backport-23.2.x labels on Feb 3, 2024
@erikgrinaker self-assigned this on Feb 3, 2024
@erikgrinaker requested review from a team as code owners on February 3, 2024 19:07
@cockroach-teamcity (Member) commented

This change is Reviewable

@erikgrinaker (Contributor, Author) commented

I have to look into what the error type addition means for backports. It's a bit unfortunate, but I don't know if it can be avoided.

@nvanbenschoten I don't think your idea of rejecting lease requests if we're not the leader will work with replica circuit breakers in their current form, see #118435 (comment).

@erikgrinaker force-pushed the distsender-retry-replica-unavailable branch 3 times, most recently from 47ba91e to e2f1316, on February 4, 2024 11:43
@erikgrinaker (Contributor, Author) commented

> I have to look into what the error type addition means for backports. It's a bit unfortunate, but I don't know if it can be avoided.

This shouldn't affect backports, afaict. We just implement ErrorDetailInterface for ReplicaUnavailableError, which allows retrieving it locally via GetDetail() and adds a DistSender error metric for it. It doesn't affect the wire format, so it doesn't matter if the ReplicaUnavailableError originates from a node with or without this patch.
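
For illustration only, and assuming the kvpb types behave as described in this thread, retrieving the detail locally could look roughly like this sketch (not code from the PR; `isReplicaUnavailable` is a hypothetical helper):

```go
// isReplicaUnavailable reports whether pErr carries a ReplicaUnavailableError
// detail. Sketch only: it assumes kvpb.Error.GetDetail() returns the
// structured error detail, as described above.
func isReplicaUnavailable(pErr *kvpb.Error) bool {
	_, ok := pErr.GetDetail().(*kvpb.ReplicaUnavailableError)
	return ok
}
```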

@andrewbaptist (Collaborator) left a comment

Reviewed 1 of 1 files at r1, 3 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender.go line 2673 at r3 (raw file):

							if rue := (*kvpb.ReplicaUnavailableError)(nil); errors.As(err, &rue) {
								if lh.IsSame(rue.Replica) {
									return nil, replicaUnavailableError

There is a problematic case worth considering as it could be common.

Initial state - r1-r5 healthy, r2 is the leaseholder, and r3 is lagging.

Fault: r2 fails (becomes unavailable)
New state:
r1 - leaseholder
r2 - unavailable
r3 - lagging
r4, r5 - not-lagging

Client has ordered replicas as r2, r3, ...
Hits r2 - gets replica unavailable
Hits r3 - gets NotLeaseholder with leaseholder = r2.

Client hits this check and returns the ReplicaUnavailableError.

Future requests will hit the same replicas (r2, r3) and keep returning error until r3 catches up.


pkg/kv/kvclient/kvcoord/dist_sender.go line 2941 at r3 (raw file):

		return noMoreReplicasErr(
			ambiguousError,
			nil, // ignore the replicaUnavailableError, retry with new routing info

I'm not sure if this matters, but should we also clear the replicaUnavailableError since we are re-creating the routing? I'm not sure what the implications of not clearing it would be.

@erikgrinaker (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender.go line 2673 at r3 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

> There is a problematic case worth considering as it could be common.
>
> Initial state - r1-r5 healthy, r2 is the leaseholder, and r3 is lagging.
>
> Fault: r2 fails (becomes unavailable)
> New state:
> r1 - leaseholder
> r2 - unavailable
> r3 - lagging
> r4, r5 - not-lagging
>
> Client has ordered replicas as r2, r3, ...
> Hits r2 - gets replica unavailable
> Hits r3 - gets NotLeaseholder with leaseholder = r2.
>
> Client hits this check and returns the ReplicaUnavailableError.
>
> Future requests will hit the same replicas (r2, r3) and keep returning error until r3 catches up.

I didn't think this case would be any different than what already happens with other errors, e.g. in the case of a dead node where we'll hit a connection refused. But I think it is, because MoveToFront() will simply swap the assumed leaseholder with the next pending replica. So if we hit r3 -> r2, and r2 fails again, we'll move onto r4 instead of spinning on r3.

There is another way we could handle this, by ignoring NLHEs that point back to a RUE, and instead keep trying replicas until we run out. I don't think this should end up in NLHE loops either (if nodes disagree on the NLHE leaseholder), but should doublecheck.

Does that sound reasonable? I'll update the PR when I get a chance.


pkg/kv/kvclient/kvcoord/dist_sender.go line 2941 at r3 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

> I'm not sure if this matters, but should we also clear the replicaUnavailableError since we are re-creating the routing? I'm not sure what the implications of not clearing it would be.

We're in a different scope, so we can't clear it here. It doesn't matter either, since the error will be returned back up past sendToReplicas, and we'll re-enter it with a new, empty replicaUnavailableError.

@erikgrinaker force-pushed the distsender-retry-replica-unavailable branch from e2f1316 to f94e018 on February 19, 2024 11:35
@erikgrinaker (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender.go line 2673 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

> I didn't think this case would be any different than what already happens with other errors, e.g. in the case of a dead node where we'll hit a connection refused. But I think it is, because MoveToFront() will simply swap the assumed leaseholder with the next pending replica. So if we hit r3 -> r2, and r2 fails again, we'll move onto r4 instead of spinning on r3.
>
> There is another way we could handle this, by ignoring NLHEs that point back to a RUE, and instead keep trying replicas until we run out. I don't think this should end up in NLHE loops either (if nodes disagree on the NLHE leaseholder), but should doublecheck.
>
> Does that sound reasonable? I'll update the PR when I get a chance.

Updated the PR to simply skip NLHEs pointing back to a RUE. I believe this should avoid retry loops since we only update the leaseholder with newer generations and don't retry replicas that haven't had a NLHE point to them.

Nice catch, thanks!
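
Roughly, the skip described here might look like the following simplified sketch (not the PR's exact diff; `trippedBreaker` is a hypothetical map of replicas that already returned a ReplicaUnavailableError, and the kvpb field names are from memory):

```go
// Don't follow a lease redirect that points back at a replica whose circuit
// breaker we already saw tripped; fall through and try the remaining replicas.
var nlhe *kvpb.NotLeaseHolderError
if errors.As(err, &nlhe) && nlhe.Lease != nil {
	if trippedBreaker[nlhe.Lease.Replica.ReplicaID] {
		continue
	}
}
```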

@andrewbaptist (Collaborator) left a comment

:lgtm:

The complexity of this code and looping makes me a little sad and nervous, but I don't see any issues with it. I'm hoping I can build some dist_sender_nemesis-type testing in the future and simplify the way the transport is used.

@erikgrinaker (Contributor, Author) commented

> The complexity of this code and looping makes me a little sad and nervous, but I don't see any issues with it. I'm hoping I can build some dist_sender_nemesis-type testing in the future and simplify the way the transport is used.

Yeah, I never feel particularly confident about changes here, and I forget how it works between each time I touch it. Could definitely do with a better test harness.

TFTR!

bors r+

@erikgrinaker (Contributor, Author) commented

Hm, I'll have to doublecheck the interactions with AmbiguousError here -- I think we have to wrap these as ambiguous if they are commits, since the breaker can trip while the command is in flight.

bors r-

@craig (bot) commented Feb 20, 2024

Canceled.

@erikgrinaker force-pushed the distsender-retry-replica-unavailable branch from f94e018 to 2523c1c on February 21, 2024 09:20
@erikgrinaker (Contributor, Author) commented

> Hm, I'll have to doublecheck the interactions with AmbiguousError here -- I think we have to wrap these as ambiguous if they are commits, since the breaker can trip while the command is in flight.

Indeed, updated the PR to mark these as ambiguous when used with commits.
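
As a rough sketch of the idea (hypothetical, not the PR's code: `batchHasCommit` stands in for however the DistSender detects a committing batch, and the exact error wrapping differs in the PR):

```go
// The breaker can trip after the commit was already proposed, so the commit's
// outcome is unknown: surface it as ambiguous rather than as a clean failure.
if batchHasCommit && errors.HasType(err, (*kvpb.ReplicaUnavailableError)(nil)) {
	err = kvpb.NewAmbiguousResultErrorf("replica unavailable: %v", err)
}
```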

@erikgrinaker (Contributor, Author) commented

bors r+

@craig (bot) commented Feb 21, 2024

Build succeeded:

@craig (bot) merged commit 7bf2235 into cockroachdb:master on Feb 21, 2024
14 checks passed
@blathers-crl (bot) commented Feb 21, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from d83eed2 to blathers/backport-release-23.1-118737: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
