
kvserver: proxy requests that return a NotLeaseholderError #117340

Merged
merged 5 commits into cockroachdb:master from 2024-01-04-partial-partition on Mar 23, 2024

Conversation

@andrewbaptist andrewbaptist (Collaborator) commented Jan 4, 2024

Previously, a request that required a direct connection to the leaseholder
would fail if the client was unable to initiate that connection. This change
adds a new field, ProxyRangeInfo, to the BatchRequest header. The field is set
by a client that wants the server to proxy the request to the leaseholder on
its behalf.

There are three main changes in this PR:

  1. In DistSender, if the client needs to send a request to the leaseholder
    but its transport has already moved on to the next replica, it adds the
    ProxyRangeInfo header to the request and sends it to that follower.
  2. In DistSender, if the request already has the ProxyRangeInfo header set,
    it short-circuits most of the retry logic and sends the request only to
    the leaseholder the original client intended.
  3. On the server side, the node normally evaluates a request even if
    ProxyRangeInfo is set. If its local evaluation results in a
    NotLeaseHolderError and its range info matches what the client set in
    the ProxyRangeInfo header, it attempts to proxy the request to
    the leaseholder and returns the response from that request instead.

The impact of this change is that a BatchRequest can succeed as long
as there is connectivity between the client and at least one replica
that has a direct connection to the leaseholder.
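For illustration, here is a minimal, self-contained Go sketch of the client-side decision in point 1, not CockroachDB's actual code: attach ProxyRangeInfo only when the request is not already a proxy request and the transport has moved past the believed leaseholder. The types and field names are simplified stand-ins for kvpb.BatchRequest and roachpb.RangeInfo.

```go
package main

import "fmt"

// RangeInfo is a stand-in for the client's view of the range (lease and
// descriptor); the real type would be roachpb.RangeInfo.
type RangeInfo struct {
	LeaseholderNodeID int
}

// BatchRequest is a stand-in for kvpb.BatchRequest; ProxyRangeInfo mirrors
// the new header field described above.
type BatchRequest struct {
	ProxyRangeInfo *RangeInfo
}

// maybeAttachProxyHeader marks the request for proxying when the transport
// has already moved past the leaseholder and now points at a follower.
func maybeAttachProxyHeader(ba *BatchRequest, ri RangeInfo, nextReplicaNodeID int) {
	if ba.ProxyRangeInfo != nil {
		return // already a proxy request; only a single proxy hop is allowed
	}
	if nextReplicaNodeID == ri.LeaseholderNodeID {
		return // we can still reach the leaseholder directly, no proxying needed
	}
	// Ask the follower to forward the request to the leaseholder we believe in.
	ba.ProxyRangeInfo = &ri
}

func main() {
	ba := &BatchRequest{}
	maybeAttachProxyHeader(ba, RangeInfo{LeaseholderNodeID: 1}, 2 /* follower n2 */)
	fmt.Println("proxy requested:", ba.ProxyRangeInfo != nil) // true
}
```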

Epic: none

Fixes: #93503

Release note (ops change): This PR adds a new cluster setting
kv.dist_sender.proxy.enabled, which defaults to true. When it is
enabled, requests are proxied through a follower replica when the
client cannot reach the leaseholder directly.

blathers-crl bot commented Jan 4, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member)

This change is Reviewable

@andrewbaptist andrewbaptist force-pushed the 2024-01-04-partial-partition branch 7 times, most recently from a0dffdd to 369e786 Compare February 5, 2024 19:06
@andrewbaptist andrewbaptist force-pushed the 2024-01-04-partial-partition branch 12 times, most recently from 2995265 to 4eab574 Compare February 9, 2024 18:28
@andrewbaptist andrewbaptist force-pushed the 2024-01-04-partial-partition branch 3 times, most recently from 774f3cb to 255f599 Compare February 21, 2024 23:11
@andrewbaptist andrewbaptist marked this pull request as ready for review February 22, 2024 03:05
@andrewbaptist andrewbaptist requested review from a team as code owners February 22, 2024 03:05
@erikgrinaker erikgrinaker (Contributor) left a comment

We discussed a few high-level concerns here previously; did you give these any consideration?

  • The client should opt-in to proxying. At the very least, this gives us an escape hatch if we find problems with this later and have to backport a fix, without requiring protocol changes. It also allows us to be more deliberate about when it is and isn't safe to do this. This could be done e.g. by making RedirectCount specify the number of remaining proxy hops, decremented by 1 for each hop and returning when it reaches 0.

  • Making the proxy determination on the server in response to a local NLHE limits the proxy capability to nodes that have replicas. While this is how we envision this to work right now, it seems like an artificial restriction.

  • Splitting this logic across the DistSender and maybeProxyRequest appears like it could be a bit brittle and harder to discover or reason about. If we simply had Stores.SendWithWriteBytes() pass the request on to its local DistSender whenever it received a proxy request, the DistSender could make the determination about which replica to use (including a local one if it exists) itself, and we could collect all of the proxying logic (both client- and server-side) in one place. This might allow us to reuse more of the sophisticated DistSender processing like range cache updates if we find that we need it. It would also avoid processing the request on the local replica at all if the DistSender believed a different node has the lease, which is the common case.

cc @nvanbenschoten in case you have thoughts on the above.

Reviewed 1 of 1 files at r7, 17 of 17 files at r10, 1 of 1 files at r11, 19 of 20 files at r12, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)


pkg/kv/db.go line 1117 at r12 (raw file):

	// directly. We don't need to create a txnCoordinator at this level since the
	// original client will manage the transaction.
	return db.factory.NonTransactionalSender().Send(ctx, ba)

NonTransactionalSender doesn't handle transactional batches. How does this work with transactions?

cockroach/pkg/kv/db.go

Lines 225 to 227 in 340cc72

if _, ok := pErr.GetDetail().(*kvpb.OpRequiresTxnError); !ok {
return br, pErr
}

I think I'd be partial to plumbing the DistSender down into the store instead, to be more deliberate about this.


pkg/kv/kvpb/api.go line 36 at r12 (raw file):

//go:generate mockgen -package=kvpbmock -destination=kvpbmock/mocks_generated.go . InternalClient,Internal_RangeFeedClient,Internal_MuxRangeFeedClient

var ProxyBatchRequest = settings.RegisterBoolSetting(

nit: should this be in kvserver or something?


pkg/kv/kvpb/api.go line 38 at r12 (raw file):

var ProxyBatchRequest = settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kv.proxy.enabled",

nit: consider making this a bit more specific, since I could imagine other uses of proxying in KV as well.


pkg/kv/kvpb/api.proto line 2936 at r12 (raw file):

  // prevent infinite retries we cap this count based on the number of replicas
  // for a range.
  int32 redirect_count = 34;

nit: minor preference for using the proxy nomenclature here. We often use the term redirect to refer to NLHEs.


pkg/kv/kvpb/batch.go line 70 at r10 (raw file):

	if txn := ba.Txn; txn != nil {
		if !ba.Timestamp.IsEmpty() && ba.Timestamp != txn.ReadTimestamp {
			return errors.Newf("transactional request must not set batch timestamp %d != %d", ba.Timestamp, txn.ReadTimestamp)

nit: comment needs updating.


pkg/rpc/context_testutils.go line 108 at r11 (raw file):

}

// Embed the isPartitioned function into the stream and check it when we are

nit: isPartitioned or partitionCheck?


pkg/rpc/context_testutils.go line 173 at r11 (raw file):

// return errors if EnablePartition has been called.
func (p *Partitioner) CreateTestingKnobs(
	id roachpb.NodeID, partition [][2]roachpb.NodeID,

nit: should we decouple this from the ContextTestingKnobs by simply returning the interceptors? Alternatively taking a *ContextTestingKnobs to allow combining it with other knobs?

@erikgrinaker erikgrinaker (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)


pkg/server/node.go line 1525 at r30 (raw file):

This presents an additional problem though: this error is propagated back as a br.Error rather than an error, and this will cause the original DistSender to not consider this failure an ambiguous error, instead retrying it on a different replica.

This isn't true: a br.Error will just be returned to the client without retries. We'll need to mark the original error as an ambiguous error though, but only for requests where it matters (the withCommit condition). The proxying DistSender will do this for us I believe. So I think we just have to check for AmbiguousResultError here and propagate the error in that case, or something similar.

@andrewbaptist andrewbaptist (Collaborator, Author) left a comment

Thanks - and thanks for all the reviews - I feel like it's finally close!

Also, I removed the other commit this was on top of and will merge it separately. It handles a specific case where we don't retry proxy errors when we should, but it is OK to merge independently.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvclient/kvcoord/dist_sender.go line 2342 at r30 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I see. I'm not intimately familiar with the range cache token interactions -- I see there is some handling of this in sendPartialBatch now, but could imagine cases where there aren't. Regardless, we should handle these cases.

TODO: Try removing this.

I ended up removing this case now that there is handling in sendPartialBatch. The problematic case before was if we got all the way into sendToReplicas but our token didn't match either the remote client or the local node. At this point this should never happen, and we have handling for it.


pkg/kv/kvclient/kvcoord/dist_sender.go line 2363 at r30 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Thanks -- we could make it an errors.AssertionFailedf then, which will include extra debug information if we ever do hit it.

changed to AssertionFailedf - good point!


pkg/kv/kvclient/kvcoord/dist_sender.go line 308 at r33 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Should this have WithPublic, or do we consider this an internal setting?

I wanted to leave it non-public for now, as I don't know when we would tell customers to turn it off. If we later decide it is problematic and find cases where exposing it makes sense, we can switch it.


pkg/kv/kvclient/kvcoord/dist_sender.go line 1993 at r33 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: s/requests/request

Fixed


pkg/kv/kvclient/kvcoord/dist_sender.go line 2337 at r33 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: s/requested proxy/requested replica/

changed


pkg/kv/kvclient/kvcoord/dist_sender.go line 2340 at r33 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: let's add in a summary of the ClientRangeInfo, so we can easily see what the client requested in traces. Or maybe we already have a separate event with that info elsewhere?

Also, this should be VEventf, not VErrEventf.

Changed to VEventf and added the ProxyRangeInfo information.


pkg/kv/kvclient/kvcoord/dist_sender.go line 2347 at r33 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: overly long line, let's break it up. This goes for a few other very long lines as well.

Fixed


pkg/kv/kvclient/kvcoord/dist_sender.go line 2352 at r33 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: we're only using this in one place, so consider moving it down to the NewNotLeaseHolderError.

Moved


pkg/kv/kvpb/api.proto line 2950 at r33 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: let's add a comment saying that this is only supported by 24.1 nodes, and that older nodes will simply ignore proxying. This should be fine with the current logic, but if we make any further changes to the DistSender logic in 24.1 we need to be careful to not assume that remote nodes respect it.

If we think we'll need this we could consider a version gate, but I don't think it will be necessary, and it's better to eagerly enable this in mixed-version clusters such that operators can still downgrade if they find a problem.

I added a lot more text here - please give it a read over.


pkg/server/node.go line 1473 at r30 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I see, so the tenant is propagated via the context, and we reuse the client context through the entire proxy chain. If that's accurate then that should address the security concern.

Yes - it should be handled correctly from a security perspective, but it is a good thing to validate. I'll make sure to run some additional tests on multi-tenant systems. Note that the TestPartialPartition test currently runs in both single-tenant and multi-tenant modes, so it already covers some of this.


pkg/server/node.go line 1525 at r30 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This presents an additional problem though: this error is propagated back as a br.Error rather than an error, and this will cause the original DistSender to not consider this failure an ambiguous error, instead retrying it on a different replica.

This isn't true: a br.Error will just be returned to the client without retries. We'll need to mark the original error as an ambiguous error though, but only for requests where it matters (the withCommit condition). The proxying DistSender will do this for us I believe. So I think we just have to check for AmbiguousResultError here and propagate the error in that case, or something similar.

Thanks for noticing this! This was a hazard in case the error contained an ambiguous error internally. The handling of these errors is still a little messy, but at least it seems correct.

I tightened it to only look for RequestDidNotStart errors now. In practice this won't exclude many requests from proxying.

We might want to discuss adding a test for this specific case, as it is pretty tricky.
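To make the distinction being discussed here concrete, a hedged, self-contained sketch follows (stand-in types, not the PR's implementation): a proxy-side send failure is only safe to absorb when the request provably never started on the leaseholder; anything else has to be surfaced, since the outcome may be ambiguous for commits. The errDidNotStart marker stands in for a predicate like RequestDidNotStart.

```go
package main

import (
	"errors"
	"fmt"
)

// errDidNotStart is a stand-in marker meaning "the RPC never reached the
// remote server", in the spirit of a RequestDidNotStart predicate.
var errDidNotStart = errors.New("request did not start")

// safeToAbsorb reports whether the proxying node can swallow the send error
// and let the original DistSender keep retrying via other replicas; any
// other failure must be propagated because the outcome may be ambiguous.
func safeToAbsorb(sendErr error) bool {
	return errors.Is(sendErr, errDidNotStart)
}

func main() {
	fmt.Println(safeToAbsorb(fmt.Errorf("dial failed: %w", errDidNotStart))) // true
	fmt.Println(safeToAbsorb(errors.New("connection reset mid-stream")))     // false: possibly ambiguous
}
```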


pkg/server/node.go line 1497 at r33 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: we could consider merging this with maybeProxyRequest, since both methods are pretty small and it's only called in one place. We could also consider dropping the bool return argument and return nil when not proxied, since we only return true together with a non-nil response anyway.

I merged this back. It was originally doing a bit more, but there isn't much need for that anymore, and I changed it to remove the bool return value.


pkg/server/node.go line 1504 at r33 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Something smells a bit off to me about this.

For non-transactional requests without a timestamp, we set this in Store.Send() via SetActiveTimestamp(). In Node.batchInternal() we're sitting above the store, and if we proxy the request via the DistSender the request won't go via a store until it reaches the remote node. In other words, no one above us in the call chain should have set Timestamp here.

I think this happens because Store.Send() doesn't make a shallow copy of the batch request before calling SetActiveTimestamp() on it, which modifies our upstream copy of the request. This violates the sender contract. I think it would be better to fix Store.Send() to make a shallow copy, which lets us remove this reset.

I had a brief conversation with Nathan about it. The underlying problem is that the server DOES modify the request, even though the contract of Send states that it isn't allowed to ("the callee has to treat everything inside the BatchRequest as read-only"). See some of the other notes, but specifically, for now I'm always shallow copying any proxiable request to avoid the need to manually clear fields like this.
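A small illustration of the shallow-copy approach described above, using simplified stand-in types rather than kvpb.BatchRequest: copy the header before handing a proxied request to a code path that mutates it, so the caller's copy stays untouched. Field names here are hypothetical.

```go
package main

import "fmt"

// BatchRequest is a stand-in; Timestamp plays the role of the field that
// something like SetActiveTimestamp would mutate downstream.
type BatchRequest struct {
	Timestamp int64
	// ... other header fields and the request slice, shared by the copy.
}

// shallowCopy returns a new header that can be mutated without affecting the
// original; the (omitted) request slice would still be shared read-only.
func shallowCopy(ba *BatchRequest) *BatchRequest {
	cp := *ba
	return &cp
}

func main() {
	orig := &BatchRequest{}
	proxied := shallowCopy(orig)
	proxied.Timestamp = 42 // downstream mutation on the proxied copy only
	fmt.Println(orig.Timestamp, proxied.Timestamp) // 0 42: caller's copy unchanged
}
```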

@erikgrinaker erikgrinaker (Contributor) left a comment

Reviewed 4 of 4 files at r34, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)


pkg/server/node.go line 1525 at r30 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

Thanks for noticing this! This was a hazard in case the error contained an ambiguous error internally. The handling of these errors is still a little messy but at least it seems correct.

I tightened it to only look for RequestDidNotStart errors now. In practice this won't exclude many requests from proxying.

We might want to discuss adding a test for this specific case, as it is pretty tricky.

Isn't this going to result in more errors than before? If we ignore those that satisfy RequestDidNotStart, we're now basically promoting any send error (where transport.SendNext() returns an error rather than a br.Error) to a br.Error, which the original DistSender will interpret as a final error returned by the server and thus won't be retried.

In particular, won't e.g. read-only requests (which can be retried at will) which see a connection failure in the middle of processing now return an error back to the client instead of retrying it?

I think it would be better if we could somehow mark this state in the proxying DistSender, which does have access to the br.Error / error distinction, and preserve its retry policy here. Barring that, we can at least allow retries for read-only requests for certain error types like connection closures.


pkg/server/node.go line 1504 at r33 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

I had a brief conversation with Nathan about it. The underlying problem is that the server DOES modify the request even though the contract of Send states that it isn't allowed to "the callee has to treat everything inside the BatchRequest as read-only." See some of the other notes, but specifically for now I'm always shallow copying any proxiable request to prevent any need to manually clear fields like this.

Why do we have to copy the request though? We're not modifying it. Isn't the bug here that Store.Send() modifies the request without copying it, which ends up modifying our request too?

@andrewbaptist andrewbaptist (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/server/node.go line 1525 at r30 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Isn't this going to result in more errors than before? If we ignore those that satisfy RequestDidNotStart, we're now basically promoting any send error (where transport.SendNext() returns an error rather than a br.Error) to a br.Error, which the original DistSender will interpret as a final error returned by the server and thus won't be retried.

In particular, won't e.g. read-only requests (which can be retried at will) which see a connection failure in the middle of processing now return an error back to the client instead of retrying it?

I think it would be better if we could somehow mark this state in the proxying DistSender, which does have access to the br.Error / error distinction, and preserve its retry policy here. Barring that, we can at least allow retries for read-only requests for certain error types like connection closures.

Yes, it will increase the error rate, but I don't think this matters in practice. We only get to this code after we have failed against the leaseholder, returned an NLHE from the local server, and returned a different error from the proxy attempt. So we would return an error either way. I'm not sure it makes a material difference to split this out further with a new error mark. I wasn't too worried about failing requests that should have otherwise succeeded; I wanted to make sure we don't lose any information about errors that come back.

I agree the error and br.Error distinction makes this hard, but it isn't very easy to clean up.


pkg/server/node.go line 1504 at r33 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Why do we have to copy the request though? We're not modifying it. Isn't the bug here that Store.Send() modifies the request without copying it, which ends up modifying our request too?

I'm not sure whether it is considered a bug that Store.Send() modifies the request; however, there are a number of places internally that do modify it, and changing those is non-trivial. I could add a TODO that we should fix this, but I'm not sure there is a good alternative.

Note that we ONLY copy on proxy requests, so it won't impact normal writes at all.

@erikgrinaker erikgrinaker (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)


pkg/server/node.go line 1525 at r30 (raw file):

So we would return an error either way.

We would, but the question is whether we return an error that the original DistSender will retry until successful, or if it terminates the retry loop and returns an error back to the client.

Currently, when we're proxying, errors that would normally be retried are now getting promoted to non-retriable errors that will be propagated back to the caller. In other words, clients may see increased error rates when proxying compared to when not proxying. If we instead returned a NLHE, the original DistSender would continue to retry.


pkg/server/node.go line 1504 at r33 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

I'm not sure if it is consider a bug or not that Store.Send() modifies the request, however there are a number of places internally that do modify it, and changing those is non-trivial. I could add a TODO that we should fix this, but I'm not sure there is a good alternative.

Note that we ONLY copy on proxy requests, so it won't impact normal writes at all.

The Sender contract says that the passed request shouldn't be modified. Stores.Send() does. If it didn't, and instead made a shallow copy before modifying it as it's supposed to, then we wouldn't have this problem here.

@andrewbaptist andrewbaptist force-pushed the 2024-01-04-partial-partition branch 6 times, most recently from ad09eaa to f890ee0 Compare March 21, 2024 13:29
@andrewbaptist andrewbaptist force-pushed the 2024-01-04-partial-partition branch 4 times, most recently from 25d8050 to 64b4d48 Compare March 22, 2024 13:38
@erikgrinaker erikgrinaker (Contributor) left a comment

I'm not confident that this won't introduce subtle error handling bugs, e.g. due to loss of information as the error is propagated across the network, or due to mixing of send errors with other errors returned by the DistSender. I'm not aware of any as-is though (except IsAuthError, but see #120884). Let's get this in, and do another pass over this later once we get some experience with it -- consider a TODO so we don't forget.

For posterity, we considered an alternative approach where we ignored most proxy errors (except e.g. ambiguous result errors) and returned a NLHE instead, but this could incorrectly omit some errors that should be propagated and would also require protocol additions to determine ambiguous errors (which needs access to the entire pre-split batch request).

Reviewed 4 of 4 files at r38, 3 of 4 files at r39, 2 of 2 files at r40, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)

Previously, DistSender would short-circuit its attempts against the remaining
replicas if it hit an auth failure. The assumption was that the only reason
for an auth error was that the client had bad information, so attempts against
any other replicas would fail with the same error. This assumption is
incorrect in several cases:

1) The auth error was caused by a server-side certificate expiration.
2) The receiver may have stale information.
  a) The receiver didn't know this node had joined yet.
  b) The receiver was using stale tenant authorization information.
3) The receiver proxied the request and received an auth error on the
   other side of the connection.
4) The remote server was removed from the cluster and can't validate our
   auth information.

There may be other reasons as well, either today or in the future. We
still want to terminate the loop after one pass through the replicas,
but we can't short-circuit on the first auth error.

Epic: none

Release note: None
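A rough sketch of the retry behavior the commit above describes, using stand-in types rather than DistSender's real interfaces: an auth error from one replica no longer aborts the pass, but we still give up after a single pass over the transport.

```go
package main

import (
	"errors"
	"fmt"
)

var errAuth = errors.New("authentication error")

// sendToReplicasOnce tries each replica once, remembering an auth error but
// continuing to the next replica instead of returning immediately.
func sendToReplicasOnce(replicas []string, sendOne func(string) error) error {
	var lastErr error
	for _, r := range replicas {
		err := sendOne(r)
		if err == nil {
			return nil
		}
		// Previously an auth error returned here; now we keep going, since the
		// error may be specific to this replica (stale membership, expired
		// certs, a proxied auth failure, etc.).
		lastErr = err
	}
	// One full pass failed; give up with the last error rather than looping.
	return lastErr
}

func main() {
	err := sendToReplicasOnce([]string{"n1", "n2", "n3"}, func(r string) error {
		if r == "n1" {
			return errAuth // only n1 rejects us
		}
		return nil
	})
	fmt.Println(err) // <nil>: n2 succeeded despite n1's auth error
}
```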
This commit adds a new field, ProxyRangeInfo, to the BatchRequest header. If a
server receives a request with the header set and local evaluation fails with
a NotLeaseHolderError, it can decide to proxy the request. Before proxying, it
should validate that the ProxyRangeInfo matches its local replica's RangeInfo.
A server that doesn't know how to proxy a request can safely ignore the
header.

Epic: none

Release note: None
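A simplified, self-contained sketch of the server-side check described in the commit above: proxy only when local evaluation returned a NotLeaseHolderError and the client's ProxyRangeInfo agrees with the local replica's view of the range. All types here are stand-ins, not the real kvpb/roachpb types.

```go
package main

import "fmt"

// RangeInfo is a stand-in for the range's lease/descriptor view.
type RangeInfo struct {
	Generation  int
	Leaseholder int // node ID the lease is believed to be on
}

type notLeaseHolderError struct{}

func (notLeaseHolderError) Error() string { return "not leaseholder" }

// shouldProxy reports whether a request that failed locally should be
// forwarded to the leaseholder named in the client's ProxyRangeInfo.
func shouldProxy(localErr error, proxyInfo, localInfo RangeInfo) bool {
	if _, ok := localErr.(notLeaseHolderError); !ok {
		return false // only proxy around NotLeaseHolderError
	}
	// If the client's view of the range disagrees with ours, return the NLHE
	// (with updated info) instead of proxying somewhere stale.
	return proxyInfo == localInfo
}

func main() {
	local := RangeInfo{Generation: 7, Leaseholder: 3}
	stale := RangeInfo{Generation: 6, Leaseholder: 3}
	fmt.Println(shouldProxy(notLeaseHolderError{}, local, local)) // true
	fmt.Println(shouldProxy(notLeaseHolderError{}, stale, local)) // false
}
```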
A ProxyFailedError can be returned in a BatchResponse if a proxy request
fails with a send error. The originator of the proxy request can use this
error to decide how to proceed. The error is needed for cases where a network
connection is broken mid-stream, since the client may need to behave
differently depending on whether the server received the message.

Epic: none

Release note: None
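A minimal sketch of the idea in the commit above, with illustrative stand-in names rather than the actual kvpb error types: carry the proxy node's send error back to the originator inside the response, so the original DistSender can tell "the proxy attempt failed" apart from "the server evaluated and returned an error".

```go
package main

import (
	"errors"
	"fmt"
)

// ProxyFailedError wraps a send error encountered while forwarding a request.
type ProxyFailedError struct {
	Cause error
}

func (e *ProxyFailedError) Error() string { return "proxy failed: " + e.Cause.Error() }
func (e *ProxyFailedError) Unwrap() error { return e.Cause }

// BatchResponse stands in for the response carrying a structured error.
type BatchResponse struct {
	Err error
}

// proxyResult packages a send error from the proxy attempt into the response.
func proxyResult(sendErr error) BatchResponse {
	return BatchResponse{Err: &ProxyFailedError{Cause: sendErr}}
}

func main() {
	br := proxyResult(errors.New("connection reset mid-stream"))
	var pfe *ProxyFailedError
	// The originator inspects the response and decides whether to retry,
	// return the error, or treat the result as ambiguous for commits.
	fmt.Println(errors.As(br.Err, &pfe), br.Err)
}
```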
Previously, a request that required a direct connection to the leaseholder
would fail if the client was unable to initiate that connection. This PR adds
the client and server sides of proxy handling to work around partial
partitions.
There are three main changes in this PR:
1) In DistSender, if the client needs to send a request to the leaseholder
   but the transport has already moved on to the next replica, it adds the
   ProxyRangeInfo header to the request and sends it to each follower.
2) In DistSender, if the request already has the ProxyRangeInfo header set,
   it short-circuits most of the retry logic and sends the request only to
   the leaseholder named in the ProxyRangeInfo header.
3) On the server side, the node normally evaluates a request even if
   ProxyRangeInfo is set. If local evaluation results in a
   NotLeaseHolderError and the range info matches what the client set in
   the ProxyRangeInfo header, it attempts to proxy the request to the
   leaseholder and returns the response from that request rather than its
   local response.

The impact of this change is to allow a BatchRequest to succeed as long
as there is connectivity between the client and at least one replica
that has a direct connection to the leaseholder.

Epic: none

Fixes: cockroachdb#93503

Release note (ops change): This PR adds a new cluster setting
kv.dist_sender.proxy.enabled, which defaults to true. When it is
enabled, requests are proxied through a follower replica when the
client cannot reach the leaseholder directly.
When we receive a NotLeaseHolderError and change the leaseholder based
on updated lease information, we want to run through the entire
transport again before giving up and returning to sendPartialBatch.
On proxy requests, we need to retry the non-leaseholder replicas with
the updated leaseholder information as they may be able to proxy the
request to the leaseholder.

Epic: none

Release note: None
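A rough, self-contained sketch of the retry behavior the commit above describes: when an NLHE carries updated lease information, restart the pass over the transport rather than giving up, so followers can be retried (and, for proxy requests, asked to forward to the new leaseholder). Everything here is a stand-in for DistSender's real transport and error types.

```go
package main

import "fmt"

type result struct {
	ok             bool
	newLeaseholder string // non-empty when an NLHE pointed us at a new lease
}

// sendWithLeaseRetries runs through the replicas, restarting the pass
// whenever it learns about a new leaseholder, bounded to avoid looping on a
// flapping lease.
func sendWithLeaseRetries(replicas []string, leaseholder string,
	send func(replica, leaseholder string) result) bool {
	for attempt := 0; attempt < len(replicas); attempt++ {
		restarted := false
		for _, r := range replicas {
			res := send(r, leaseholder)
			if res.ok {
				return true
			}
			if res.newLeaseholder != "" && res.newLeaseholder != leaseholder {
				leaseholder = res.newLeaseholder
				restarted = true
				break // run the whole transport again with the new lease info
			}
		}
		if !restarted {
			return false // exhausted the transport without learning anything new
		}
	}
	return false
}

func main() {
	// n1 redirects us to n3; on the restarted pass n2 can act on the new lease.
	ok := sendWithLeaseRetries([]string{"n1", "n2", "n3"}, "n1",
		func(replica, leaseholder string) result {
			if replica == "n1" {
				return result{newLeaseholder: "n3"}
			}
			if replica == "n2" && leaseholder == "n3" {
				return result{ok: true}
			}
			return result{}
		})
	fmt.Println(ok) // true
}
```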
@andrewbaptist (Collaborator, Author) commented

bors r=erikgrinaker

TFTR!


craig bot commented Mar 23, 2024

@craig craig bot merged commit 794a129 into cockroachdb:master Mar 23, 2024
20 of 22 checks passed
@andrewbaptist andrewbaptist deleted the 2024-01-04-partial-partition branch March 23, 2024 23:32
Labels
branch-release-24.1: Used to mark GA and release blockers and technical advisories for 24.1
release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvclient: route requests via followers when leaseholder is unreachable
3 participants