Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: fail gracefully in TestLeaseTransferRejectedIfTargetNeedsSnapshot #107526

Merged

Conversation

tbg
Copy link
Member

@tbg tbg commented Jul 25, 2023

We saw this test hang in CI. What likely happened (according to the stacks) is
that a lease transfer that was supposed to be caught by an interceptor never
showed up in the interceptor. The most likely explanation is that it errored
out before it got to evaluation. It then signaled a channel the test was only
prepared to check later, so the test hung (waiting for a channel that was now
never to be touched).

This test is hard to maintain. It would be great (though, for now, out of reach)
to write tests like it in a deterministic framework1

For now, fix the test so that when the (so far unknown) error rears its
head again, it will fail properly, so we get to see the error and can
take another pass at fixing the test (separately). Stressing
this commit2, we get:

transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected
error: refusing to transfer lease to (n3,s3):3 because target may need a Raft
snapshot: replica in StateProbe

This makes sense. The test wants to exercise the below-raft mechanism, but
the above-raft mechanism also exists and while we didn't want to interact
with it, we sometimes do1

The second commit introduces a testing knob that disables the above-raft
mechanism selectively. I've stressed the test for 15 minutes without issues
after this change.

Fixes #106383.

Epic: None
Release note: None

Footnotes

  1. see https://github.com/cockroachdb/cockroach/issues/105177. 2

  2. ./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/ on gceworker, 285s

@cockroach-teamcity
Copy link
Member

This change is Reviewable

…apshot

We saw this test hang in CI. What likely happened (according to the stacks) is
that a lease transfer that was supposed to be caught by an interceptor never
showed up in the interceptor. The most likely explanation is that it errored
out before it got to evaluation. It then signaled a channel the test was only
prepared to check later, so the test hung (waiting for a channel that was now
never to be touched).

This test is hard to maintain. It would be great (though, for now, out of reach)
to write tests like it in a deterministic framework[^1]

[^1]: see cockroachdb#105177.

For now, fix the test so that when the (so far unknown) error rears its
head again, it will fail properly, so we get to see the error and can
take another pass at fixing the test (separately). Stressing
this commit[^2], we get:

> transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected
> error: refusing to transfer lease to (n3,s3):3 because target may need a Raft
> snapshot: replica in StateProbe

This makes sense. The test wants to exercise the below-raft mechanism, but
the above-raft mechanism also exists and while we didn't want to interact
with it, we sometimes do[^1]

[^1]: somewhat related to cockroachdb#107524
[^2]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s

Touches cockroachdb#106383.

Epic: None
Release note: None
@tbg tbg force-pushed the TestLeaseTransferRejectedIfTargetNeedsSnapshot branch from 3f72ae1 to 608c949 Compare July 25, 2023 12:38
@tbg tbg requested a review from a team July 25, 2023 12:39
@tbg tbg marked this pull request as ready for review July 25, 2023 12:39
@tbg tbg requested a review from a team as a code owner July 25, 2023 12:39
See previous commit. We sometimes hit the above-raft check when we wanted
to hit only the below-raft one.

This commit fixes this by selectively disabling the above-raft check in
this test.

Note that not all tests using lease transfers are susceptible to this problem
in the way that this test is. This is because this test also lowers the
`LeaseTransferRejectedRetryLoopCount`[^1] because it is intentionally
manufacturing failed lease transfers and doesn't want to sit out the retry
loop. It is that time-saving optimization that also allows the spurious error
to bubble up.

[^1]: https://github.com/cockroachdb/cockroach/blob/66c9f93ae86bddd7ba4c5f6a6b8b6cb700ca23ce/pkg/kv/kvserver/testing_knobs.go#L375

Epic: none
Release note: None
@tbg tbg force-pushed the TestLeaseTransferRejectedIfTargetNeedsSnapshot branch from ef1302b to 1c8c503 Compare July 25, 2023 13:26
@tbg tbg added the db-cy-23 label Jul 25, 2023
@tbg tbg requested a review from erikgrinaker July 26, 2023 08:07
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new testing knob isn't necessary, you can use AdminTransferLeaseBypassingSafetyChecks:

cockroach/pkg/kv/db.go

Lines 666 to 675 in adfa97f

// AdminTransferLeaseBypassingSafetyChecks is like AdminTransferLease, but
// configures the lease transfer to bypass safety checks. See the comment on
// AdminTransferLeaseRequest.BypassSafetyChecks for details.
func (db *DB) AdminTransferLeaseBypassingSafetyChecks(
ctx context.Context, key interface{}, target roachpb.StoreID,
) error {
b := &Batch{}
b.adminTransferLease(key, target, true /* bypassSafetyChecks */)
return getOneErr(db.Run(ctx, b), b)
}

@tbg
Copy link
Member Author

tbg commented Jul 26, 2023

The new testing knob isn't necessary, you can use AdminTransferLeaseBypassingSafetyChecks:

That disables both checks (above and below raft), but I need the below-raft to stay active.

TFTR!

bors r=erikgrinaker

@craig
Copy link
Contributor

craig bot commented Jul 26, 2023

Build succeeded:

@craig craig bot merged commit f147c2b into cockroachdb:master Jul 26, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv/kvserver: TestLeaseTransferRejectedIfTargetNeedsSnapshot timed out [brittle test]
3 participants