Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
120884: kvclient: remove auth short-circuit r=erikgrinaker a=andrewbaptist

Previously DistSender would short circuit attempting replicas if there was an auth failure. THe assumption was that the only reason for an auth error was due to the client having bad information and attempts against any other replicas would fail with the same error. This is an incorrect assumption for several reasons.

1) The auth error was caused by a server side certificate expiration. 
2) The receiver may have stale information.
  a) The receiver didn't know this node had yet joined.
	b) The receiver was using stale tenant authorization information.
3) The receiver needed to make an extra request to complete this batch
	 request and that request encountered an auth error which was part of
	 the returned error.
4) The receiver proxied the request and received an auth error on the
	 other side of the connection.
5) The remote server is removed from the cluster and can't validate our
	 auth information.

There may be other reasons as well either today or in the future. We still want to terminate the loop after one pass through the replicas, but we can't short circuit on the first auth error.

Epic: none

Release note: None

120903: changefeedccl: deflake TestChangefeedWithSimpleDistributionStrategy r=jayshrivastava a=andyyang890

This patch deflakes `TestChangefeedWithSimpleDistributionStrategy`
by disabling the new `changefeed.random_replica_selection.enabled`
cluster setting, since leaving it enabled could lead to ranges being
assigned to follower replicas, which is in contradiction with the
test assumption that ranges are always assigned to the leaseholder.

Fixes #120870

Release note: None

Co-authored-by: Andrew Baptist <baptist@cockroachlabs.com>
Co-authored-by: Andy Yang <yang@cockroachlabs.com>
  • Loading branch information
3 people committed Mar 22, 2024
3 parents a2e6ec5 + 3d154e5 + b69e890 commit 5a5a248
Show file tree
Hide file tree
Showing 4 changed files with 34 additions and 30 deletions.
4 changes: 2 additions & 2 deletions pkg/ccl/changefeedccl/changefeed_dist_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -496,8 +496,6 @@ func TestChangefeedWithSimpleDistributionStrategy(t *testing.T) {
defer leaktest.AfterTest(t)()
defer log.Scope(t).Close(t)

skip.WithIssue(t, 120870)

// The test is slow and will time out under deadlock/race/stress.
skip.UnderShort(t)
skip.UnderDuress(t)
Expand All @@ -512,6 +510,8 @@ func TestChangefeedWithSimpleDistributionStrategy(t *testing.T) {
tester := newRangeDistributionTester(t, noLocality)
defer tester.cleanup()
tester.sqlDB.Exec(t, "SET CLUSTER SETTING changefeed.default_range_distribution_strategy = 'balanced_simple'")
// We need to disable the bulk oracle in order to ensure the leaseholder is selected.
tester.sqlDB.Exec(t, "SET CLUSTER SETTING changefeed.random_replica_selection.enabled = false")
tester.sqlDB.Exec(t, "CREATE CHANGEFEED FOR x INTO 'null://' WITH initial_scan='no'")
partitions := tester.getPartitions()
counts := tester.countRangesPerNode(partitions)
Expand Down
14 changes: 6 additions & 8 deletions pkg/kv/kvclient/kvcoord/dist_sender.go
Original file line number Diff line number Diff line change
Expand Up @@ -2300,6 +2300,12 @@ func noMoreReplicasErr(ambiguousErr, replicaUnavailableErr, lastAttemptErr error
return replicaUnavailableErr
}

// Authentication and authorization errors should be propagated up rather than
// wrapped in a SendError and retried as they are likely to be fatal if they
// are returned from multiple servers.
if grpcutil.IsAuthError(lastAttemptErr) {
return lastAttemptErr
}
// TODO(bdarnell): The error from the last attempt is not necessarily the best
// one to return; we may want to remember the "best" error we've seen (for
// example, a NotLeaseHolderError conveys more information than a
Expand Down Expand Up @@ -2574,14 +2580,6 @@ func (ds *DistSender) sendToReplicas(

} else if err != nil {
log.VErrEventf(ctx, 2, "RPC error: %s", err)
if grpcutil.IsAuthError(err) {
// Authentication or authorization error. Propagate.
if ambiguousError != nil {
return nil, kvpb.NewAmbiguousResultErrorf("error=%v [propagate] (last error: %v)",
ambiguousError, err)
}
return nil, err
}

// For most connection errors, we cannot tell whether or not the request
// may have succeeded on the remote server (exceptions are captured in the
Expand Down
18 changes: 6 additions & 12 deletions pkg/kv/kvclient/kvcoord/dist_sender_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1353,12 +1353,11 @@ func TestDistSenderRetryOnTransportErrors(t *testing.T) {
defer stopper.Stop(ctx)

for _, spec := range []struct {
errorCode codes.Code
shouldRetry bool
errorCode codes.Code
}{
{codes.FailedPrecondition, true},
{codes.PermissionDenied, false},
{codes.Unauthenticated, false},
{codes.FailedPrecondition},
{codes.PermissionDenied},
{codes.Unauthenticated},
} {
t.Run(fmt.Sprintf("retry_after_%v", spec.errorCode), func(t *testing.T) {
clock := hlc.NewClockForTesting(nil)
Expand Down Expand Up @@ -1424,13 +1423,8 @@ func TestDistSenderRetryOnTransportErrors(t *testing.T) {

get := kvpb.NewGet(roachpb.Key("a"))
_, pErr := kv.SendWrapped(ctx, ds, get)
if spec.shouldRetry {
require.True(t, secondReplicaTried, "Second replica was not retried")
require.Nil(t, pErr, "Call should not fail")
} else {
require.False(t, secondReplicaTried, "DistSender did not abort retry loop")
require.NotNil(t, pErr, "Call should fail")
}
require.True(t, secondReplicaTried, "Second replica was not retried")
require.Nil(t, pErr, "Call should not fail")
})
}
}
Expand Down
28 changes: 20 additions & 8 deletions pkg/util/grpcutil/grpc_util.go
Original file line number Diff line number Diff line change
Expand Up @@ -123,15 +123,27 @@ func IsConnectionRejected(err error) bool {
return false
}

// IsAuthError returns true if err's Cause is an error produced by
// gRPC due to an authentication or authorization error for the operation.
// AuthErrors should generally be considered non-retriable. They indicate
// that the operation would not succeed even if directed at another node
// in the cluster.
// IsAuthError returns true if err's Cause is an authentication or authorization
// error for the operation.
//
// As a special case, an AuthError (PermissionDenied) is returned on outbound
// dialing when the source node is in the process of terminating (see
// rpc.errDialRejected).
// AuthErrors can be due to a problem on either the client or the server. An
// AuthError received from multiple different servers can be used an indication
// that the issue is most likely on the client should not be retried.
//
// A non-exhaustive list of reasons from the client side are:
// 1) The client has a bad TLS certificate.
// 2) The client does not have permission to run this batch request.
// 3) The client has been decommissioned.
// 4) The client is in the process of terminating.
//
// A non-exhaustive list of reasons fro the server side is:
// 1) The server has a bad TLS certificate.
// 2) The server has stale liveness information for this node.
// 3) The server has stale tenant authorization information.
// 4) The server sent the request as a proxy and returned a ProxyFailedError
// 5) The server was removed from the cluster and can't validate our auth info.
// TODO(baptist): Validate uses of this check. Prior code assumed an auth error
// from one sever should always be treated as terminal.
func IsAuthError(err error) bool {
if s, ok := status.FromError(errors.UnwrapAll(err)); ok {
switch s.Code() {
Expand Down

0 comments on commit 5a5a248

Please sign in to comment.