
kvcoord: retry ReplicaUnavailableError against other replicas #118737

Merged

Conversation

@erikgrinaker (Contributor) commented Feb 3, 2024

kvserver: add test cases for ReplicaUnavailableError mark

kvpb: add error type for ReplicaUnavailableError

This allows accessing the error via kvpb.Error.GetDetail().

kvcoord: retry ReplicaUnavailableError against other replicas

Previously, a ReplicaUnavailableError returned by a tripped replica circuit breaker was not retried by the DistSender, instead failing fast back to the client. This was sort of by design, since these errors are intended to avoid clients getting stuck in infinite retry loops.

However, a single replica having a tripped circuit breaker does not imply that the range is unavailable -- it's possible that e.g. this replica is partially partitioned away from the quorum, or is a stale replica waiting to be caught up. If a DistSender attempts this replica first, e.g. because it was the last-known leaseholder or it's the lowest latency replica, then it will always error out, giving the appearance of an unavailable range to the client.

This patch adds DistSender retry handling of ReplicaUnavailableError, which attempts to contact other replicas in case a functional lease exists elsewhere, while also taking care to avoid retry loops where other replicas return a NotLeaseHolderError pointing back to the unavailable replica.

Resolves #118266.
Resolves #105167.
Epic: none
Release note (bug fix): if an individual replica's circuit breaker had tripped but the range was otherwise functional, e.g. because the replica was partially partitioned away from the leaseholder, it was possible for a gateway to persistently error when contacting this replica instead of retrying against a functional leaseholder elsewhere. The gateway will now retry such errors against other replicas once.
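
For illustration, here is a minimal, self-contained Go sketch of the retry idea described above. It is not the PR's actual diff: every type and helper below is a hypothetical stand-in for the DistSender's internals.

```go
package main

import (
	"errors"
	"fmt"
)

type replicaID int

// errReplicaUnavailable stands in for kvpb.ReplicaUnavailableError.
var errReplicaUnavailable = errors.New("replica circuit breaker tripped")

// notLeaseHolderError stands in for kvpb.NotLeaseHolderError: the contacted
// replica redirects the sender to its view of the leaseholder.
type notLeaseHolderError struct{ leaseholder replicaID }

func (e *notLeaseHolderError) Error() string {
	return fmt.Sprintf("not leaseholder; try r%d", e.leaseholder)
}

// sendToReplicas tries each replica in order. A ReplicaUnavailableError no
// longer fails fast: it is remembered and the remaining replicas are tried,
// in case a functional lease exists elsewhere. A NotLeaseHolderError that
// points back at a replica which already returned ReplicaUnavailableError is
// ignored, so the sender does not bounce between the two.
func sendToReplicas(replicas []replicaID, send func(replicaID) (string, error)) (string, error) {
	var deferredErr error
	unavailable := map[replicaID]bool{}
	for _, r := range replicas {
		resp, err := send(r)
		if err == nil {
			return resp, nil
		}
		if errors.Is(err, errReplicaUnavailable) {
			unavailable[r] = true
			deferredErr = err
			continue // try the next replica instead of failing fast
		}
		var nlhe *notLeaseHolderError
		if errors.As(err, &nlhe) && unavailable[nlhe.leaseholder] {
			continue // don't chase a lease pointing back at the tripped replica
		}
		deferredErr = err
	}
	return "", deferredErr
}

func main() {
	// r2 has a tripped breaker; r3 redirects to r2; r4 serves the request.
	send := func(r replicaID) (string, error) {
		switch r {
		case 2:
			return "", errReplicaUnavailable
		case 3:
			return "", &notLeaseHolderError{leaseholder: 2}
		default:
			return "ok", nil
		}
	}
	resp, err := sendToReplicas([]replicaID{2, 3, 4}, send)
	fmt.Println(resp, err) // ok <nil>
}
```

The two points mirrored from the description: a tripped breaker on one replica defers the error rather than failing fast, and a lease redirect that points back at a replica already known to be unavailable is not followed.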

@erikgrinaker added the backport-23.1.x and backport-23.2.x labels on Feb 3, 2024
@erikgrinaker self-assigned this on Feb 3, 2024
@erikgrinaker requested review from a team as code owners on February 3, 2024 19:07
@cockroach-teamcity (Member) commented

This change is Reviewable

@erikgrinaker (Contributor, Author) commented

I have to look into what the error type addition means for backports. It's a bit unfortunate, but I don't know if it can be avoided.

@nvanbenschoten I don't think your idea of rejecting lease requests if we're not the leader will work with replica circuit breakers in their current form, see #118435 (comment).

@erikgrinaker force-pushed the distsender-retry-replica-unavailable branch 3 times, most recently from 47ba91e to e2f1316, on February 4, 2024 11:43
@erikgrinaker (Contributor, Author) commented

> I have to look into what the error type addition means for backports. It's a bit unfortunate, but I don't know if it can be avoided.

This shouldn't affect backports, afaict. We just implement ErrorDetailInterface for ReplicaUnavailableError, which allows retrieving it locally via GetDetail() and adds a DistSender error metric for it. It doesn't affect the wire format, so it doesn't matter if the ReplicaUnavailableError originates from a node with or without this patch.
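
For illustration only, and assuming the kvpb types behave as described in this thread, retrieving the detail locally could look roughly like this sketch (not code from the PR; `isReplicaUnavailable` is a hypothetical helper):

```go
// isReplicaUnavailable reports whether pErr carries a ReplicaUnavailableError
// detail. Sketch only: it assumes kvpb.Error.GetDetail() returns the
// structured error detail, as described above.
func isReplicaUnavailable(pErr *kvpb.Error) bool {
	_, ok := pErr.GetDetail().(*kvpb.ReplicaUnavailableError)
	return ok
}
```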

@andrewbaptist (Collaborator) left a comment

Reviewed 1 of 1 files at r1, 3 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender.go line 2673 at r3 (raw file):

							if rue := (*kvpb.ReplicaUnavailableError)(nil); errors.As(err, &rue) {
								if lh.IsSame(rue.Replica) {
									return nil, replicaUnavailableError

There is a problematic case worth considering as it could be common.

Initial state - r1-r5 healthy, r2 is the leaseholder, and r3 is lagging.

Fault: r2 fails (becomes unavailable)
New state:
r1 - leaseholder
r2 - unavailable
r3 - lagging
r4, r5 - not-lagging

Client has ordered replicas as r2, r3, ...
Hits r2 - gets replica unavailable
Hits r3 - gets NotLeaseholder with leaseholder = r2.

Client hits this check and returns the ReplicaUnavailableError.

Future requests will hit the same replicas (r2, r3) and keep returning error until r3 catches up.


pkg/kv/kvclient/kvcoord/dist_sender.go line 2941 at r3 (raw file):

		return noMoreReplicasErr(
			ambiguousError,
			nil, // ignore the replicaUnavailableError, retry with new routing info

I'm not sure if this matters, but should we also clear the replicaUnavailableError since we are re-creating the routing? I'm not sure what the implications of not clearing it would be.

@erikgrinaker (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender.go line 2673 at r3 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

> There is a problematic case worth considering as it could be common.
>
> Initial state - r1-r5 healthy, r2 is the leaseholder, and r3 is lagging.
>
> Fault: r2 fails (becomes unavailable)
> New state:
> r1 - leaseholder
> r2 - unavailable
> r3 - lagging
> r4, r5 - not-lagging
>
> Client has ordered replicas as r2, r3, ...
> Hits r2 - gets replica unavailable
> Hits r3 - gets NotLeaseholder with leaseholder = r2.
>
> Client hits this check and returns the ReplicaUnavailableError.
>
> Future requests will hit the same replicas (r2, r3) and keep returning error until r3 catches up.

I didn't think this case would be any different than what already happens with other errors, e.g. in the case of a dead node where we'll hit a connection refused. But I think it is, because MoveToFront() will simply swap the assumed leaseholder with the next pending replica. So if we hit r3 -> r2, and r2 fails again, we'll move onto r4 instead of spinning on r3.

There is another way we could handle this, by ignoring NLHEs that point back to a RUE, and instead keep trying replicas until we run out. I don't think this should end up in NLHE loops either (if nodes disagree on the NLHE leaseholder), but should doublecheck.

Does that sound reasonable? I'll update the PR when I get a chance.


pkg/kv/kvclient/kvcoord/dist_sender.go line 2941 at r3 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

> I'm not sure if this matters, but should we also clear the replicaUnavailableError since we are re-creating the routing? I'm not sure what the implications of not clearing it would be.

We're in a different scope, so we can't clear it here. It doesn't matter either, since the error will be returned back up past sendToReplicas, and we'll re-enter it with a new, empty replicaUnavailableError.

@erikgrinaker force-pushed the distsender-retry-replica-unavailable branch from e2f1316 to f94e018 on February 19, 2024 11:35
@erikgrinaker (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender.go line 2673 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

> I didn't think this case would be any different than what already happens with other errors, e.g. in the case of a dead node where we'll hit a connection refused. But I think it is, because MoveToFront() will simply swap the assumed leaseholder with the next pending replica. So if we hit r3 -> r2, and r2 fails again, we'll move onto r4 instead of spinning on r3.
>
> There is another way we could handle this, by ignoring NLHEs that point back to a RUE, and instead keep trying replicas until we run out. I don't think this should end up in NLHE loops either (if nodes disagree on the NLHE leaseholder), but should doublecheck.
>
> Does that sound reasonable? I'll update the PR when I get a chance.

Updated the PR to simply skip NLHEs pointing back to a RUE. I believe this should avoid retry loops since we only update the leaseholder with newer generations and don't retry replicas that haven't had a NLHE point to them.

Nice catch, thanks!
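
Roughly, the skip described here might look like the following simplified sketch (not the PR's exact diff; `trippedBreaker` is a hypothetical map of replicas that already returned a ReplicaUnavailableError, and the kvpb field names are from memory):

```go
// Don't follow a lease redirect that points back at a replica whose circuit
// breaker we already saw tripped; fall through and try the remaining replicas.
var nlhe *kvpb.NotLeaseHolderError
if errors.As(err, &nlhe) && nlhe.Lease != nil {
	if trippedBreaker[nlhe.Lease.Replica.ReplicaID] {
		continue
	}
}
```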

@andrewbaptist (Collaborator) left a comment

:lgtm:

The complexity of this code and looping makes me a little sad and nervous, but I don't see any issues with it. I'm hoping I can build some dist_sender_nemesis-type testing in the future and simplify the way the transport is used.

@erikgrinaker (Contributor, Author) commented

> The complexity of this code and looping makes me a little sad and nervous, but I don't see any issues with it. I'm hoping I can build some dist_sender_nemesis-type testing in the future and simplify the way the transport is used.

Yeah, I never feel particularly confident about changes here, and I forget how it works between each time I touch it. Could definitely do with a better test harness.

TFTR!

bors r+

@erikgrinaker (Contributor, Author) commented

Hm, I'll have to doublecheck the interactions with AmbiguousError here -- I think we have to wrap these as ambiguous if they are commits, since the breaker can trip while the command is in flight.

bors r-

@craig (bot) commented Feb 20, 2024

Canceled.

@erikgrinaker force-pushed the distsender-retry-replica-unavailable branch from f94e018 to 2523c1c on February 21, 2024 09:20
@erikgrinaker (Contributor, Author) commented

> Hm, I'll have to doublecheck the interactions with AmbiguousError here -- I think we have to wrap these as ambiguous if they are commits, since the breaker can trip while the command is in flight.

Indeed, updated the PR to mark these as ambiguous when used with commits.
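
As a rough sketch of the idea (hypothetical, not the PR's code: `batchHasCommit` stands in for however the DistSender detects a committing batch, and the exact error wrapping differs in the PR):

```go
// The breaker can trip after the commit was already proposed, so the commit's
// outcome is unknown: surface it as ambiguous rather than as a clean failure.
if batchHasCommit && errors.HasType(err, (*kvpb.ReplicaUnavailableError)(nil)) {
	err = kvpb.NewAmbiguousResultErrorf("replica unavailable: %v", err)
}
```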

@erikgrinaker (Contributor, Author) commented

bors r+

@craig (bot) commented Feb 21, 2024

Build succeeded:

@craig (bot) merged commit 7bf2235 into cockroachdb:master on Feb 21, 2024
14 checks passed
@blathers-crl (bot) commented Feb 21, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from d83eed2 to blathers/backport-release-23.1-118737: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
