
Conversation

tbg (Member) commented Nov 8, 2024

The test sets up an environment in which 40 replicas of interest are
supposed to enter the lease queue purgatory. The test was waiting for
this to happen before proceeding, but was doing so incorrectly: it
checked that the number of replicas in the purgatory matched 40, rather
than checking directly that all ranges of interest had entered it.
Since other ranges could slip in, the test would occasionally proceed
too early, remove the condition that causes ranges to enter the
purgatory, and then find that a few ranges were never processed (since
they had never entered the purgatory in the first place).

This commit fixes this by waiting explicitly for the RangeIDs of
interest to be represented in the lease queue purgatory.
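
In Go terms, the corrected wait amounts to something like the sketch below. This is an illustration, not the PR's literal diff: `waitForPurgatory`, the `purgatoryRangeIDs` accessor, and `rangeIDsOfInterest` are assumed names; `testutils.SucceedsSoon` is CockroachDB's standard poll-until-no-error helper.

```go
import (
	"fmt"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// waitForPurgatory polls until every RangeID of interest is present in
// the lease queue purgatory. Unlike the old size-based check, unrelated
// ranges slipping into purgatory can no longer satisfy the condition
// while a range of interest is still missing.
func waitForPurgatory(
	t *testing.T,
	purgatoryRangeIDs func() map[int64]struct{}, // hypothetical accessor
	rangeIDsOfInterest []int64,
) {
	testutils.SucceedsSoon(t, func() error {
		inPurgatory := purgatoryRangeIDs()
		for _, id := range rangeIDsOfInterest {
			if _, ok := inPurgatory[id]; !ok {
				return fmt.Errorf("r%d not yet in lease queue purgatory", id)
			}
		}
		return nil
	})
}
```

The point of keying on RangeIDs rather than on the count is that the condition only becomes true once every range of interest is parked, regardless of how many other ranges happen to be in purgatory at the same time.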

I was able to reproduce the flake in a few minutes on my gceworker via

```
./dev test --count 10000 --stress ./pkg/kv/kvserver \
	--filter TestLeaseQueueLeasePreferencePurgatoryError -- \
	--jobs 100 --local_resources=cpu=100 --local_resources=memory=HOST_RAM 2>&1
```

This no longer reproduces as of this PR:

```
INFO: Elapsed time: 2091.413s, Critical Path: 166.06s
INFO: 3356 processes: 2 internal, 3354 linux-sandbox.
INFO: Build completed successfully, 3356 total actions
INFO:
//pkg/kv/kvserver:kvserver_test                                          PASSED in 50.3s
  Stats over 3000 runs: max = 50.3s, min = 12.3s, avg = 26.6s, dev = 6.6s
```

Fixes #134578.
Fixes #134768.

The backports will fix, but not close:

Touches #134765.

Epic: none
Release note: None

cockroach-teamcity (Member) commented

This change is Reviewable

@tbg tbg requested review from nvb and kvoli November 8, 2024 13:10
@tbg tbg marked this pull request as ready for review November 8, 2024 14:50
@tbg tbg requested a review from a team as a code owner November 8, 2024 14:50
tbg (Member, Author) commented Nov 8, 2024

@kvoli side comment: does it make sense for this error to kick ranges into the purgatory?

I241108 12:19:59.151356 1512 kv/kvserver/queue.go:1223 ⋮ [T1,Vsystem,n1,lease,s1,r16/1:‹/Table/1{3-4}›] 81 purgatory: can't acquire allocator token, held by ‹replicate›

(see #134578)

kvoli (Contributor) commented Nov 8, 2024

> @kvoli side comment: does it make sense for this error to kick ranges into the purgatory?
>
> I241108 12:19:59.151356 1512 kv/kvserver/queue.go:1223 ⋮ [T1,Vsystem,n1,lease,s1,r16/1:‹/Table/1{3-4}›] 81 purgatory: can't acquire allocator token, held by ‹replicate›
>
> (see #134578)

We are somewhat abusing purgatory here, but it was intentional: purgatory is a convenient retry mechanism for leaseholders that are not currently the raft leader and will therefore fail a lease transfer check in the allocator.
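
For readers unfamiliar with the mechanism: roughly, a replica queue's processing step can return an error marked as a "purgatory error", in which case the replica is parked in a purgatory map and retried when the purgatory channel fires, rather than being dropped until the next scanner pass. The sketch below is a simplified illustration with assumed names and types; the real logic lives in pkg/kv/kvserver/queue.go.

```go
package sketch

import "errors"

type rangeID int64

type replica struct{ id rangeID }

// purgatoryError marks errors that should park a replica for retry
// (marker method name assumed for illustration).
type purgatoryError interface {
	error
	purgatoryErrorMarker()
}

type queue struct {
	// Replicas awaiting retry, keyed by range, with the reason they
	// were parked.
	purgatory map[rangeID]error
}

// finishProcessing decides what to do with a replica after processing
// returned err.
func (q *queue) finishProcessing(r *replica, err error) {
	var pe purgatoryError
	if errors.As(err, &pe) {
		// Park the replica. A purgatory timer/channel later retries
		// everything in this map, e.g. once the replicate queue has
		// released the allocator token that the lease queue failed
		// to acquire in the log line quoted above.
		q.purgatory[r.id] = pe
		return
	}
	// Successes and ordinary errors follow the normal queue paths.
}
```

This is what makes purgatory a convenient retry loop for the "not yet raft leader" case: the lease transfer check fails now, the replica is parked, and it is reprocessed automatically once conditions change.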

tbg added the backport-24.1.x, backport-24.2.x, and backport-24.3.x labels (flagging PRs that need to be backported to the respective release branches) Nov 11, 2024
kvoli (Contributor) left a comment


:lgtm:

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

tbg (Member, Author) commented Nov 11, 2024

bors r+
TFTR!

@craig craig bot merged commit dcc832a into cockroachdb:master Nov 11, 2024
23 checks passed
blathers-crl bot commented Nov 11, 2024

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.

- Issue #134578: branch-release-24.1, branch-release-24.3.
- Issue #134768: branch-release-24.1, branch-release-24.2.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

tbg added a commit to tbg/cockroach that referenced this pull request Sep 11, 2025
Accidentally missed this in cockroachdb#134653.

Epic: none
craig bot pushed a commit that referenced this pull request Sep 12, 2025
153413: kvserver: fix missing locking in test helper r=tbg a=tbg

Accidentally missed this in #134653.

Epic: none

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Labels: backport-24.1.x (flags PRs that need to be backported to 24.1), backport-24.3.x (flags PRs that need to be backported to 24.3)
Development

Successfully merging this pull request may close these issues:

- kv/kvserver: TestLeaseQueueLeasePreferencePurgatoryError failed (#134578)
- kv/kvserver: TestLeaseQueueLeasePreferencePurgatoryError failed (#134768)

3 participants