Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: lease-preferences/manual-violating-transfer failed #123998

Closed
cockroach-teamcity opened this issue May 12, 2024 · 6 comments · Fixed by #124223
Closed

roachtest: lease-preferences/manual-violating-transfer failed #123998

cockroach-teamcity opened this issue May 12, 2024 · 6 comments · Fixed by #124223
Assignees
Labels
branch-master Failures on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team
Projects

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented May 12, 2024

roachtest.lease-preferences/manual-violating-transfer failed with artifacts on master @ 4c2e7761acd050aaee565443932b6b0eca55620b:

(test_runner.go:1237).runTest: test timed out (30m0s)
test artifacts and logs in: /artifacts/lease-preferences/manual-violating-transfer/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-38653

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels May 12, 2024
@cockroach-teamcity cockroach-teamcity added this to roachtest/unit test backlog in KV May 12, 2024
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 13, 2024
Informs cockroachdb#123998.

The attempt to limit how long waitForLeasePreferences was broken.

Release note: None
@nvanbenschoten
Copy link
Member

We see the test transfer leases away from the preferred zone at 10:32:45:

10:32:45 lease_preferences.go:77: transferring leases to node 5 and waiting for lease preference conformance

We then see lease transfers over the next minute:

I240512 10:32:56.116834 134690 13@kv/kvserver/lease_queue.go:143 ⋮ [T1,Vsystem,n5,lease,s5,r313/1:‹/Table/106/1/39{71302…-89730…}›] 13954  transferring lease to s2 usage=[batches/s=0.0 request_cpu/s=0µs raft_cpu/s=0µs write(keys)/s=0.0 write(bytes)/s=0 B read(keys)/s=0.0 read(bytes)/s=0 B]
I240512 10:32:56.118355 134690 13@kv/kvserver/lease_queue.go:143 ⋮ [T1,Vsystem,n5,lease,s5,r705/3:‹/Table/106/1/-87{4423…-2580…}›] 13955  transferring lease to s2 usage=[batches/s=0.0 request_cpu/s=0µs raft_cpu/s=0µs write(keys)/s=0.0 write(bytes)/s=0 B read(keys)/s=0.0 read(bytes)/s=0 B]
I240512 10:32:56.119643 134690 13@kv/kvserver/lease_queue.go:143 ⋮ [T1,Vsystem,n5,lease,s5,r702/2:‹/Table/106/1/-57{5884…-4042…}›] 13956  transferring lease to s2 usage=[batches/s=0.0 request_cpu/s=0µs raft_cpu/s=0µs write(keys)/s=0.0 write(bytes)/s=0 B read(keys)/s=0.0 read(bytes)/s=0 B]
I240512 10:32:56.120688 134690 13@kv/kvserver/lease_queue.go:143 ⋮ [T1,Vsystem,n5,lease,s5,r229/2:‹/Table/106/1/-50{2171…-0328…}›] 13957  transferring lease to s2 usage=[batches/s=0.0 request_cpu/s=0µs raft_cpu/s=0µs write(keys)/s=0.0 write(bytes)/s=0 B read(keys)/s=0.0 read(bytes)/s=0 B]
I240512 10:32:56.122001 134690 13@kv/kvserver/lease_queue.go:138 ⋮ [T1,Vsystem,n5,lease,s5,r439/1:‹/Table/106/1/66{06551…-24979…}›] 13958  err=can't transfer r439 lease violating preferences, no suitable target
I240512 10:32:56.122123 134690 13@kv/kvserver/lease_queue.go:143 â‹® [T1,Vsystem,n5,lease,s5,r

A few of these hit can't transfer r439 lease violating preferences, no suitable target errors, but those seem to be retried.

At the end, 997 leases transfer back, but 3 remain (r607, r165, r518). These then trickle over the next 15 minutes:

240512 10:33:47.341477 144304 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r46/4:‹/Table/4{3-4}›] 13964  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:33:49.596118 144427 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r66/4:‹/Table/6{4-5}›] 13965  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:35:40.091754 149072 13@kv/kvserver/allocator/plan/lease.go:91 ⋮ [T1,Vsystem,n5,lease,s5,r607/1:‹/Table/106/1/54{45567…-63995…}›] 13966  lease transfer needed transfer(preferences), enqueuing
I240512 10:35:40.091972 149045 13@kv/kvserver/lease_queue.go:143 ⋮ [T1,Vsystem,n5,lease,s5,r607/1:‹/Table/106/1/54{45567…-63995…}›] 13967  transferring lease to s2 usage=[batches/s=0.0 request_cpu/s=2µs raft_cpu/s=7µs write(keys)/s=0.0 write(bytes)/s=0 B read(keys)/s=0.0 read(bytes)/s=0 B]
I240512 10:36:22.928571 150829 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r50/4:‹/Table/4{7-8}›] 13968  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:36:28.002169 151042 13@kv/kvserver/allocator/plan/lease.go:91 ⋮ [T1,Vsystem,n5,lease,s5,r165/1:‹/Table/106/1/16{12477…-30905…}›] 13969  lease transfer needed transfer(preferences), enqueuing
I240512 10:36:28.002335 150998 13@kv/kvserver/lease_queue.go:143 ⋮ [T1,Vsystem,n5,lease,s5,r165/1:‹/Table/106/1/16{12477…-30905…}›] 13970  transferring lease to s2 usage=[batches/s=0.0 request_cpu/s=2µs raft_cpu/s=5µs write(keys)/s=0.0 write(bytes)/s=0 B read(keys)/s=0.0 read(bytes)/s=0 B]
I240512 10:37:08.574011 152688 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r17/4:‹/Table/1{4-5}›] 13971  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:38:00.961573 155013 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r7/4:‹/Table/{3-4}›] 13972  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:38:39.261765 156632 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r52/4:‹/Table/5{0-1}›] 13973  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:38:43.767165 156889 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r2/4:‹/System/NodeLiveness{-Max}›] 13974  no lease transfer needed no-transfer(no valid targets), not enqueueing
I240512 10:39:15.862536 158294 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r42/4:‹/Table/{39-40}›] 13975  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:39:45.133467 159500 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r22/4:‹/Table/{19-20}›] 13976  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:40:37.450481 161716 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r67/4:‹/Table/6{5-6}›] 13977  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:40:53.750843 162543 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r15/4:‹/Table/1{2-3}›] 13978  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:41:11.156499 163319 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r31/4:‹/Table/2{8-9}›] 13979  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:41:29.769833 164075 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r52/4:‹/Table/5{0-1}›] 13980  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:41:34.283110 164258 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r42/4:‹/Table/{39-40}›] 13981  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:42:06.430961 165591 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r50/4:‹/Table/4{7-8}›] 13982  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:42:53.800595 167532 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r36/4:‹/Table/3{3-4}›] 13983  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:43:50.754262 169866 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r26/5:‹/Table/2{3-4}›] 13984  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:44:25.707303 171289 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r31/4:‹/Table/2{8-9}›] 13985  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:44:41.495371 171990 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r7/4:‹/Table/{3-4}›] 13986  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:44:47.694553 172243 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r33/4:‹/NamespaceTable/{30-Max}›] 13987  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:45:16.440319 173401 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r2/4:‹/System/NodeLiveness{-Max}›] 13988  no lease transfer needed no-transfer(no valid targets), not enqueueing
I240512 10:45:33.350405 174103 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r17/4:‹/Table/1{4-5}›] 13989  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:45:57.026280 175081 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r66/4:‹/Table/6{4-5}›] 13990  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:46:32.528680 176555 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r15/4:‹/Table/1{2-3}›] 13991  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:46:46.053365 177098 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r67/4:‹/Table/6{5-6}›] 13992  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:47:46.339336 179643 13@kv/kvserver/allocator/plan/lease.go:96 ⋮ [T1,Vsystem,n5,lease,s5,r44/4:‹/Table/4{1-2}›] 13993  no lease transfer needed no-transfer(balanced), not enqueueing
I240512 10:48:02.149325 180311 13@kv/kvserver/allocator/plan/lease.go:91 ⋮ [T1,Vsystem,n5,lease,s5,r518/1:‹/Table/106/1/52{24427…-42855…}›] 13994  lease transfer needed transfer(preferences), enqueuing
I240512 10:48:02.149521 180291 13@kv/kvserver/lease_queue.go:143 ⋮ [T1,Vsystem,n5,lease,s5,r518/1:‹/Table/106/1/52{24427…-42855…}›] 13995  transferring lease to s1 usage=[batches/s=0.0 request_cpu/s=0µs raft_cpu/s=1µs write(keys)/s=0.0 write(bytes)/s=0 B read(keys)/s=0.0 read(bytes)/s=0 B]

This should have failed after 5 minutes, but the test's timeout was broken (see #124034), so instead, the test harness fires after 30 minutes, just before the last lease is transferred over. I'd guess that this will fail more often once that PR merges.

So it looks like the reactive transfers from ba13d19 aren't working for these 3 leases. This is similar to what we saw in #123866, though I closed that with some suspicion that Azure was to blame.

@kvoli since you recently touched this test and this area, can I ask you to take a closer look at why these three ranges weren't eagerly enqueued? Is it surprising that we don't see this log line fire?

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure P-2 Issues/test failures with a fix SLA of 3 months and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 13, 2024
@kvoli
Copy link
Collaborator

kvoli commented May 13, 2024

So it looks like the reactive transfers from ba13d19 aren't working for these 3 leases. This is similar to what we saw in #123866, though I closed that with some suspicion that Azure was to blame.

None of the leases should be less-preferred, only violating the preference from being on n5 (dc=3), so ba13d19 won't apply.

preferences: `[+dc=1],[+dc=2]`,
ranges: 1000,
replFactor: 5,
eventFn: makeTransferLeasesEventFn(
5 /* gateway */, 5 /* target */),

can I ask you to take a closer look at why these three ranges weren't eagerly enqueued? Is it surprising that we don't see this log line fire?

Will do, I'm surprised as well. The log line is firing when looking at these logs:

r607

I240512 10:32:55.653744 224 kv/kvserver/replica_proposal.go:560 ⋮ [T1,Vsystem,n5,s5,r607/1:‹/Table/106/1/54{45567…-63995…}›,raft] 43012  acquired lease violates lease preferences, enqueuing for transfer [lease=repl=(n5,s5):1 seq=8 start=1715509975.652418247,0 exp=1715509981.652374547,0 pro=1715509975.652374547,0 preferences=[{[‹+dc=1›]} {[‹+dc=2›]}]]

r165

I240512 10:32:55.296963 233 kv/kvserver/replica_proposal.go:560 ⋮ [T1,Vsystem,n5,s5,r165/1:‹/Table/106/1/16{12477…-30905…}›,raft] 41525  acquired lease violates lease preferences, enqueuing for transfer [lease=repl=(n5,s5):1 seq=8 start=1715509975.295690176,0 exp=1715509981.295654521,0 pro=1715509975.295654521,0 preferences=[{[‹+dc=1›]} {[‹+dc=2›]}]]

r518

I240512 10:32:55.634014 212 kv/kvserver/replica_proposal.go:560 ⋮ [T1,Vsystem,n5,s5,r518/1:‹/Table/106/1/52{24427…-42855…}›,raft] 42926  acquired lease violates lease preferences, enqueuing for transfer [lease=repl=(n5,s5):1 seq=6 start=1715509975.632817914,0 exp=1715509981.632774257,0 pro=1715509975.632774257,0 preferences=[{[‹+dc=1›]} {[‹+dc=2›]}]]

The enqueue fails with a LeaseRejectedError and isn't retried because it doesn't have the purgatory marker. This explains why these three didn't transfer, or get retried from purgatory. TBD why the lease queue processing is returning the LeaseRejectedError:

E240512 10:32:55.637160 140508 kv/kvserver/queue.go:1202 ⋮ [T1,Vsystem,n5,lease,s5,r518/1:‹/Table/106/1/52{24427…-42855…}›] 42938  cannot replace lease repl=(n5,s5):1 seq=6 start=1715509975.632817914,0 epo=1 pro=1715509975.633970829,0 with repl=(n5,s5):1 seq=0 start=1715509975.636799733,0 epo=1 pro=1715509975.636799733,0: ‹expected previous lease repl=(n5,s5):1 seq=6 start=1715509975.632817914,0 exp=1715509981.632774257,0 pro=1715509975.632774257,0, found repl=(n5,s5):1 seq=6 start=1715509975.632817914,0 epo=1 pro=1715509975.633970829,0›

From

if prevLease.Sequence != args.PrevLease.Sequence || !prevLease.ProposedTS.Equal(args.PrevLease.ProposedTS) {
rErr.Message = fmt.Sprintf("expected previous lease %s, found %s", args.PrevLease, prevLease)
return newFailedLeaseTrigger(false /* isTransfer */), rErr
}

craig bot pushed a commit that referenced this issue May 13, 2024
123648: eval: optimize UnwrapDatum a bit r=yuzefovich a=yuzefovich

This commit avoids redundant type check for `tree.Datum` being `*tree.Placeholder` in `UnwrapDatum`. Noticed this when looking at CPU profile of evaluating `IN (...)` expression via the row-by-row engine.

Informs: #52802.
Epic: None

Release note: None

123703: invertedidx: right-size pre-allocation for trigram rule r=yuzefovich a=yuzefovich

When encoding each string, I think we always need to add at least three
extra bytes:
- prefix `bytesMarker`
- escape
- terminator.

(We need to add more escapes if the string itself has the escape character.)

Previously, we would pre-allocate extra capacity only for two bytes, so we'd always have to reallocate. This is now fixed.

The same problem is also fixed when encoding inverted index keys for trigrams. Additionally, this commit fixes a couple of typos in the comments.

Epic: None
Informs: #112675.

Release note: None

124034: roachtest: fix maxWaitDuration in lease-preferences r=kvoli a=nvanbenschoten

Informs #123998.

The attempt to limit how long waitForLeasePreferences was broken.

Release note: None

124053: roachtest: run sequelize on insecure mode and default sql port r=annrpom a=DarrylWong

The test suite expects insecure connections and that the SQL
port is on 26257.

Fixes: none
Release note: none
Epic: none

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: DarrylWong <darryl@cockroachlabs.com>
@kvoli
Copy link
Collaborator

kvoli commented May 13, 2024

This is likely a race between the expiration to epoch lease upgrade here:

_ = r.requestLeaseLocked(ctx, st, nil /* limiter */)

and replicaCanBeProcessed, called by the queue before processing a replica:

pErr = realRepl.maybeSwitchLeaseType(ctx, leaseStatus)

This lines up with the other logging we see, using r518 as an example:

# n5 first acquires lease after manual transfer
I240512 10:32:55.633877 212 kv/kvserver/replica_proposal.go:384 â‹® [T1,Vsystem,n5,s5,r518/1:‹/Table/106/1/52{24427…-42855…}âº,raft] 42924  new range lease repl=(n5,s5):1 seq=6 start=1715509975.632817914,0 exp=1715509981.632774257,0 pro=1715509975.632774257,0 following repl=(n2,s2):4 seq=5 start=1715509932.464289154,0 epo=1 pro=1715509932.466375520,0
# expiration to epoch lease upgrade starts within leasePostApply (same function as above)
I240512 10:32:55.633979 212 kv/kvserver/replica_proposal.go:488 â‹® [T1,Vsystem,n5,s5,r518/1:‹/Table/106/1/52{24427…-42855…}âº,raft] 42925  upgrading expiration lease repl=(n5,s5):1 seq=6 start=1715509975.632817914,0 exp=1715509981.632774257,0 pro=1715509975.632774257,0 to an epoch-based one
# replica is enqueued for transfer async
I240512 10:32:55.634014 212 kv/kvserver/replica_proposal.go:560 â‹® [T1,Vsystem,n5,s5,r518/1:‹/Table/106/1/52{24427…-42855…}âº,raft] 42926  acquired lease violates lease preferences, enqueuing for transfer [lease=repl=(n5,s5):1 seq=6 start=1715509975.632817914,0 exp=1715509981.632774257,0 pro=1715509975.632774257,0 preferences=[{[‹+dc=1âº]} {[‹+dc=2âº]}]]
# expiration lease is upgraded to an epoch lease
I240512 10:32:55.636812 216 kv/kvserver/replica_proposal.go:384 â‹® [T1,Vsystem,n5,s5,r518/1:‹/Table/106/1/52{24427…-42855…}âº,raft] 42936  new range lease repl=(n5,s5):1 seq=6 start=1715509975.632817914,0 epo=1 pro=1715509975.633970829,0 following repl=(n5,s5):1 seq=6 start=1715509975.632817914,0 exp=1715509981.632774257,0 pro=1715509975.632774257,0
# replica process attempt, failing the pre-process check
E240512 10:32:55.637160 140508 kv/kvserver/queue.go:1202 â‹® [T1,Vsystem,n5,lease,s5,r518/1:‹/Table/106/1/52{24427…-42855…}âº] 42938  cannot replace lease repl=(n5,s5):1 seq=6 start=1715509975.632817914,0 epo=1 pro=1715509975.633970829,0 with repl=(n5,s5):1 seq=0 start=1715509975.636799733,0 epo=1 pro=1715509975.636799733,0 : ‹expected previous lease repl=(n5,s5):1 seq=6 start=1715509975.632817914,0 exp=1715509981.632774257,0 pro=1715509975.632774257,0, found repl=(n5,s5):1 seq=6 start=1715509975.632817914,0 epo=1 pro=1715509975.633970829,0âº

@kvoli kvoli removed the A-testing Testing tools and infrastructure label May 13, 2024
@kvoli
Copy link
Collaborator

kvoli commented May 13, 2024

Removing A-testing, this is a bug.

blathers-crl bot pushed a commit that referenced this issue May 13, 2024
Informs #123998.

The attempt to limit how long waitForLeasePreferences was broken.

Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue May 13, 2024
The replica queues will attempt to switch a replica's lease type if it
doesn't match the configured lease type. TODO(kvoli): ...

Resolves: cockroachdb#123998
Release note: None
@cockroach-teamcity
Copy link
Member Author

roachtest.lease-preferences/manual-violating-transfer failed with artifacts on master @ b185402d34768c898d63a61de3e585e4723fbe81:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:225
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:280
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:157
	            				main/pkg/cmd/roachtest/test_runner.go:1208
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	timed out before lease preferences satisfied
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.waitForLeasePreferences
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:375
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runLeasePreferences.func2
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:223
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runLeasePreferences
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:280
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerLeasePreferences.func3
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:157
	            	  | main.(*testRunner).runTest.func2
	            	  | 	main/pkg/cmd/roachtest/test_runner.go:1208
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1695
	            	Wraps: (2) timed out before lease preferences satisfied
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	lease-preferences/manual-violating-transfer
	Messages:   	violating(true) duration=300.37s: [n1: 0, n2: 0, n3: 0, n4: 0, n5: 2] less-preferred(false) duration=0.00s: [n1: 0, n2: 0, n3: 0, n4: 0, n5: 0]
(require.go:1360).NoError: FailNow called
test artifacts and logs in: /artifacts/lease-preferences/manual-violating-transfer/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

kvoli added a commit to kvoli/cockroach that referenced this issue May 14, 2024
The replica queues will attempt to switch a replica's lease type if it
doesn't match the configured lease type. TODO(kvoli): ...

Resolves: cockroachdb#123998
Release note: None
@cockroach-teamcity
Copy link
Member Author

roachtest.lease-preferences/manual-violating-transfer failed with artifacts on master @ 6300c3c3367ad46ac48bf24915cf0d73cae446a0:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:225
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:280
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:157
	            				main/pkg/cmd/roachtest/test_runner.go:1214
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	timed out before lease preferences satisfied
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.waitForLeasePreferences
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:375
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runLeasePreferences.func2
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:223
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runLeasePreferences
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:280
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerLeasePreferences.func3
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/lease_preferences.go:157
	            	  | main.(*testRunner).runTest.func2
	            	  | 	main/pkg/cmd/roachtest/test_runner.go:1214
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1695
	            	Wraps: (2) timed out before lease preferences satisfied
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	lease-preferences/manual-violating-transfer
	Messages:   	violating(true) duration=299.88s: [n1: 0, n2: 0, n3: 0, n4: 0, n5: 4] less-preferred(false) duration=0.00s: [n1: 0, n2: 0, n3: 0, n4: 0, n5: 0]
(require.go:1360).NoError: FailNow called
test artifacts and logs in: /artifacts/lease-preferences/manual-violating-transfer/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

kvoli added a commit to kvoli/cockroach that referenced this issue May 15, 2024
A race could occur when a replica queue and post lease application both
attempted to switch the lease type. This race would cause the queue to
not process the replica because the lease type had already changed. As a
result, lease preference violations might not have been quickly
resolved by the lease queue.

Read the lease under the same mutex used for requesting the lease, when
possibly switching the lease type.

Resolves: cockroachdb#123998
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue May 15, 2024
A race could occur when a replica queue and post lease application both
attempted to switch the lease type. This race would cause the queue to
not process the replica because the lease type had already changed. As a
result, lease preference violations might not have been quickly
resolved by the lease queue.

Read the lease under the same mutex used for requesting the lease, when
possibly switching the lease type.

Resolves: cockroachdb#123998
Release note: None
craig bot pushed a commit that referenced this issue May 17, 2024
124223: kvserver: read lease under mutex when switching lease type r=nvanbenschoten a=kvoli

A race could occur when a replica queue and post lease application both attempted to switch the lease type. This race would cause the queue to not process the replica because the lease type had already changed. As a result, lease preference violations might not have been quickly resolved by the lease queue.

Read the lease under the same mutex used for requesting the lease, when possibly switching the lease type.

Resolves: #123998
Release note: None

Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
@craig craig bot closed this as completed in 574f667 May 17, 2024
blathers-crl bot pushed a commit that referenced this issue May 17, 2024
A race could occur when a replica queue and post lease application both
attempted to switch the lease type. This race would cause the queue to
not process the replica because the lease type had already changed. As a
result, lease preference violations might not have been quickly
resolved by the lease queue.

Read the lease under the same mutex used for requesting the lease, when
possibly switching the lease type.

Resolves: #123998
Release note: None
miyamo2 added a commit to miyamo2/cockroach that referenced this issue May 17, 2024
mod comment

lease: remove outdated NightlyStress call

This test config is not used anymore.

Release note: None

roachtest: increase timeout of acceptance/multitenant-multiregion

This has flaked due to timeout in CI.

Release note: None

sqlerrors: move gloval var errNoZoneConfigApplies into sqlerrors package

This commit is a mechnical change that moves the gloval variable
`errNoZoneConfigApplies` from package `sql` to package `sqlerrors`, as
it will be used by the DSC when we support zone config in it.

Informs: cockroachdb#117574
Release note: None

scbuildstmt: Rerewrite a few function with better APIs

This commit cleaned up a few functions using the preferred .FilterXxx()
API to retrieve element from an element set. No functional change is
introduced.

Informs: cockroachdb#117574
Release note: None

roachtest: update dsc job compat to mixed version v241-v242

This patch updates the dsc job compat roachtest to
`declarative_schema_changer/job-compatibility-mixed-version-V241-V242`.

Epic: CRDB-35306
Informs: cockroachdb#116395

Release note: None

admission: introduce elastic io token exhausted duration metric

This patch adds a new metric `elastic_io_tokens_exhausted_duration.kv`.
This is similar to the existing
`admission.granter.io_tokens_exhausted_duration.kv`, but for elastic
traffic.

The patch also does some code refactoring to make it easier to use both
regular and elastic equivalent metrics.

Informs cockroachdb#121574.

Release note: None

sql: expose an ability to request redacted stmt bundles

This commit extends the stmt bundle collection infrastructure to support
requesting redacted stmt bundles. Previously, this was only possible via
an explicit `EXPLAIN ANALYZE (DEBUG, REDACT)` stmt, but this commit adds
the support via overloads to `crdb_internal.request_statement_bundle`
builtin as well as the server API. This paves the way for requesting the
redacted bundles via the DB Console, but it'll be done separately.

To support this we need to add another boolean column to
`system.statement_diagnostics_requests` table to indicate whether
a request is for the redacted bundle or not. This requires adding
a migration in which we populate the existing rows with all `false`
values.

In order to reduce the amount of code churn and simplify this change,
this commit repurposes some of the existing code that we introduced the
last time we added a migration for plan-gist matching in stmt bundles.
In other words, this commit simultenously removes some stale code and
handles the new column.

Release note: None

sql: simplify usage of some enum settings

This commit adjusts a few enum settings to use `EnumSetting.String` when
defining the session variables to clean up the code a bit.

Release note: None

execinfra: add a sanity check that DistSQL version is not bumped

The last bump was in 23.1, and we won't ever bump DistSQL version in
backwards-incompatible going forward.

Epic: None

Release note: None

Updated the AUTHORS file to include Naveen Setlur

sqlstats: remove unnecessary StatementFingerprintID construction

Previously we were constructing the stmt fingerprint id unnecessarily
in 2 functions in sqlstats:
- The Next function for the StmtStatsIterator
- PopAllStats, which returns the sql stats collected thus far and
resets the in-memory sql stats containers.

The ID is available in the recorded statement in both of these cases-
there's no need to reconstruct this value and doing is needlessly
expensive.

Epic: none

Release note: None

pgwire: add a metric for number of pipelined requests

Release note (ops change): Added the sql.pgwire.pipeline.count metric,
which is a gauge that measures how many wire protocol commands have been
received by the server, but have not yet begun processing. This metric
will only grow if clients are using the "pipeline mode" of the Postgres
wire protocol.

roachtest: skip multi-store-remove while debugging

This roachtest has failed consistently for the past few days, since
being enabled in cockroachdb#123694. Skip while we debug, to avoid noise.

Touches: cockroachdb#123989.

Release note: None.

clusterversion: remove deprecated internal versions

Remove deprecated 23.2 gates - these features will always be enabled
in any cluster that a 24.2 node is part of.

`TODODelete_V23_1` has a lot of uses and will be removed separately.

Epic: none
Release note: None

sql: fix the type of copy_num_retries_per_batch

We were incorrectly using the boolean getter for this integer variable.
The bug was introduced when this variable was added in
1c1d313.

Release note: None

sql: fix cost_scans_with_default_col_size

This commit fixes an omission where `cost_scans_with_default_col_size`
session variable didn't actually use the corresponding cluster setting
for its default value. In other words, the cluster setting actually had
no effect. The bug has been present since this variable was introduced
in af32d9d.

Release note: None

sql: fix up defaults for a few session variables

This commit adjusts global defaults for a few session variables to use
the postgres style (so that the global default value and session
variable value would be stored alike).

Release note: None

sql: improve env collection in stmt bundles

This commit refactors how we collect session variables and cluster
settings into `env.sql` file in the stmt bundle. Previously, we used
a hard-coded list of session variables that were always included while
ignoring all variables not in the list. If a variable had a value
different from the "binary default" (meaning the value that you got
right after the cluster startup, before any cluster setting
modifications are applied), then it would be in a SET statement that
would run on the `bundle recreate`; if the value was the same as
"binary default", then the SET statement was commented out. Also, all
cluster setting values were included into the file as a comment.

This commit makes it so that we include all session variables and
cluster settings that differ from their defaults. This allows us to
recreate the environment while also reducing the overhead of including
everything into `env.sql`.

For session variables further clarification is warranted:
- all read-only variables have been audited, and most are explicitly
excluded. Out of 23 currently present session variables, only 9 were
deemed to be possibly useful during investigations, so they are not
excluded (one of 9 is `crdb_version` which we print out separately, so
it's actually excluded too).
- for writable variables, we include it if its value differs from the
"binary" default (the one that you get on a fresh cluster) or from the
"cluster" default (the one that you get on a fresh session on the
current cluster). The latter kind is obtained via the provided
`settings.Values` container whereas the former is obtained via a global
singleton container.

Additionally, this commit adjusts the SET and SET CLUSTER SETTING
statements to have single quotes around the setting values that need them.
For cluster settings all types work with having single quotes, but for
session variables some (like integers) don't work with quotes while
others (like strings) need them. This commit adds a map for those that
need them as well as a simple test to catch some of the missing ones
(the list might be incomplete given that the test exercises the default
config). All values are adjusted to fit on a single line (we have some
cluster settings that might span multiple lines).

It also adjusts `EXPLAIN (OPT, ENV)` to include the list of cluster
settings with non-default values.

Release note: None

roachprod: fetch secrets from cloud store

Due to the complexity of fetching the secrets from the secrets
manager, the secrets are now maintained in cloud storage.

Fixes: cockroachdb#117125
Epic: none

roachprod: deploy cockroach

Previously, to upgrade a cluster it had to be done manually by running several
commands to copy over new binaries, drain and stop cockroach and then running
the same start command.

This change introduces a `deploy` command that works similar as `stage` but also
deploys the new version to the cluster by doing all of the above. It does it
node for node to ensure a cluster remains in a healthy state especially if a
load balancer is used to manage connections.

Epic: None
Release Note: None

roachprod: add deploy applications explanation

Epic: None
Release Notes: None

roachtest: add cluster settings operations

This change adds operations to mutate cluster settings. The values are mutated
based on a preset frequency and RNG. For example, if the frequency of a cluster
setting operation is set to "30 minutes" it will only change to a new value
every 30 minutes, even if the operation has been run multiple times within the
same 30-minute window. Running in the same window will just set the same value
again.

Epic: None
Release Note: None

roachtest: fix `OperationRequiresNodes` dependency check

Epic: None
Release Note: None

roachprod: cluster settings operation owners

Epic: None
Release Notes: None

roachprod: fix UT to ignore the yaml format

currently, the test is looking for the exact match of the yaml.
this can fail as the yaml is generated from a map and the values
may not be in the same sequence. a better approach is to use
a yaml parser, which will be done very soon.

Fixes: cockroachdb#124276
Epic: none

pgwire: deflake TestPipelineMetric

The wrong context was used for the server, which meant that it could get
cancelled too early.

Release note: None

roachprod, roachtest: use same cluster name sanitization

Previously, roachtest had it's own function to sanitize
cluster names, while roachprod had it's own function to
verify cluster names. This change removes both and opts
instead to use vm.DNSSafeName.

This change also introduces MalformedClusterNameError
which gives a hint on what is wrong with the name and
tells roachtest not to retry cluster creation.

intentresolver: mark the test heavy to prevent OOM

Epic: none
Release note: none

ui: move enum type to avoid circular dependencies

When importing the ViewMode enum into a unit test, an error is thrown in
localStorage.reducer.ts relating to ViewMode being undefined. This is
occurring due to a circular dependency. To fix, ViewMode is moved to
a separate types.ts file

Release note: None

ui: prevent /_status/nodes calls for secondary tenants

/_status/nodes is not implemented for secondary tenants, resulting in 501
responses in cloud for serverless tenants. To fix, this api is now gated by
a tenant check and will no longer be called by serverless / secondary tenants.

Fixes: CC-26900
Epic: None
Release note: None

cli/doctor: doctor should read from the right jobs table

In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: cockroachdb#122675

Release note: None

doctor: skip validation for dropped descriptors

In some cases, dropped descriptors appear in our `system.descriptors`
table with dangling job mutations without an associated job.
This patch teaches debug doctor examine to skip validation
on such dropped descriptors.

Epic: none

Fixes: cockroachdb#123477
Fixes: cockroachdb#122956

Release note: none

backupccl: test that we actually downloaded all data during online restore

Release note: none

Epic: none

roachtest: check downloaded spans in online restore roundtrip test

Epic: none

Release note: none

workflows: set timeouts for GitHub Actions Essential CI jobs

I've selected the timeouts according to how long each job seems to run
in practice, with a big buffer to allow for load on the remote execution
cluster, etc. The longest-running jobs I've given 60 minutes, with less
time for the shorter jobs.

Epic: CRDB-8308
Release note: None

master: Update pkg/testutils/release/cockroach_releases.yaml

Update pkg/testutils/release/cockroach_releases.yaml with recent values.

Epic: None
Release note: None
Release justification: test-only updates

cdctest: simplify orderValidator.NoteRow code

This patch eliminates unnecessary nesting in orderValidator's
NoteRow method.

Release note: None

cdctest: rename MakeCountValidator to NewCountValidator

This patch renames `MakeCountValidator` to `NewCountValidator` to
reflect that it returns a pointer, not a struct.

Release note: None

roachtest: rename cdc/sink-chaos to cdc/kafka-chaos and add more logging

This patch renames the `cdc/sink-chaos` test to `cdc/kafka-chaos` to
more accurately reflect that it is a Kafka-only test. It also adds
logging for the Kafka chaos loop iteration number so that we can tell
when the Kafka cluster is restarting from the logs.

Release note: None

roachtest: add order validation to CDC Kafka roachtests

This patch adds order validation to CDC Kafka roachtests so that we
can build more confidence in our ordering guarantees. It can be enabled
for a roachtest either by directly setting the `validateOrder` flag on a
`kafkaManager` before creating consumers, or indirectly by setting the
`validateOrder` flag on `kafkaFeedArgs` for tests that use `cdcTester`.

Release note: None

sql: alter primary key hangs if index references exist

Previously, ALTER PRIMARY KEY could hang if an index reference existed
from either a view or function using the index selection syntax. This
was becasue neither the declarative schema changer or legacy schema
changer support updating these references. To address this, this patch
will prevent ALTER PRIMARY KEY if such references exist to any indexes
that will be recreated.

Fixes: cockroachdb#123017

Release note (bug fix): ALTER PRIMARY KEY could hang if the table had
any indexes which were referred to by views or functions using the force
index syntax.

roachprod: remove logging from updatePrometheusTargets

It's hard to tell what the logging is about when it happens in the
context of a roachtest. It's also verbose as it prints one log line
for each VM whenever a test starts cockroach.

Epic: none

Release note: None

roachtest: explicitly close monitor reader after closing session

Previously the reader did not close, even though the session is closed. This
change ensures the reader is closed by closing it after the session has been
closed. This seems to have only been a problem with local clusters.

Fixes: cockroachdb#116314

Epic: None
Release Note: None

roachtest: lower max-upgrades in `tpcc/mixed-headroom`

Avoids timeouts.

Fixes: cockroachdb#124264

Release note: None

scripts: update `drtprod` create drt-large

Bring the creation in line with what is currently running on cockroach-drt. Uses
zone expansion to balance the nodes evenly across regions.

Epic: None
Release Note: None

scripts: update `drtprod` drt-large region

In an effort to save cost on regional network transfers the Warsaw Europe Zone
will now be switched to Columbus.

Epic: None
Release Note: None

backupccl: deflake TestDataDriven_external_connections_privileges

This test already tests the permissions end-to-end by asserting that
the user can or cannot complete the expected actions. No need to print
the system database as well.

Fixes cockroachdb#123379
Release note: None

externalconn: implement `SHOW EXTERNAL CONNECTIONS`.

This new query displays redacted URIs of external connections along
with other valuable information.

Epic: CRDB-15001
Fixes: cockroachdb#85905

Release note (sql change): New queries `SHOW EXTERNAL CONNECTIONS` and
`SHOW EXTERNAL CONNECTION <connection name>` have been added. These queries
display redacted connection URIs and other useful information such as the
connection type. Access to these queries is restricted to the owner of the
connection or users with `USAGE` privilege. Admin or root users will have
unrestricted access to all connections.

kvstorage: speed up scan for range descriptors

On startup we currently scan through all `/Local/Range/` keys in order
to find the range descriptors. This involves going through all
`/Local/Range/Transaction` records and tombstones so we end up reading
through a lot of data.

This change switches to using `SeekGE`:
 - when we encounter a key that has a suffix before `rdsc` like
   `/Local/Range/foo/prbe`, we seek directly to the corresponding
   descriptor key `/Local/Range/foo/rdsc`;
 - after we decode a descriptor, we seek straight to the next possible
   range descriptor `/Local/Range/EndKey/rdsc` using the `EndKey` in
   the descriptor.

Note that inside Pebble, seeks to keys that are very close to the
current position are optimized to try to iterate through the next few
keys before doing a real seek. So this change should not be
detrimental even when there aren't a lot of keys to skip.

I experimented on a store captured after running kv0 with batch size 2
against a single node. Before this change we step through 353 keys but
the iterator has to internally step through 160k tombstones:
```
range descriptor iteration done: 339 keys, 65 range descriptors (by suffix: map[qlpt:81 rdsc:65 txn-:193]);
  scan stats: stepped 353 times (169.743k internal); seeked 2 times (2 internal)...
```

After, we seek around the transaction records and end up doing just a
seek per range:
```
range descriptor iteration done: 65 range descriptors, 0 intents, 0 tombstones
+‹(interface (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)),›
+‹(internal-stats: (block-bytes: (total 563KB, cached 26KB, read-time 755.128µs))...
```

Fixes: cockroachdb#109740

kvserver: read lease under mutex when switching lease type

A race could occur when a replica queue and post lease application both
attempted to switch the lease type. This race would cause the queue to
not process the replica because the lease type had already changed. As a
result, lease preference violations might not have been quickly
resolved by the lease queue.

Read the lease under the same mutex used for requesting the lease, when
possibly switching the lease type.

Resolves: cockroachdb#123998
Release note: None

logictest: skip flaky select_for_update test

Informs: cockroachdb#124197.
Epic: None

Release note: None

mod comment

roachtest: save cluster earlier if debug-always is set

When the --debug-always flag is passed, we mark the cluster
as saved to let the cluster destroy know not to destroy it.

Previously, we only marked a cluster as saved at the end of
a test. However, if the caller terminated the test through
ctrl c, which also destroys all clusters, it would lead to a
race condition where sometimes the cluster would not be saved
in time and get destroyed.

This change moves the cluster save to immediantly after the
cluster is created.

Release note: none
Fixes: none
Epic: none
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue May 17, 2024
A race could occur when a replica queue and post lease application both
attempted to switch the lease type. This race would cause the queue to
not process the replica because the lease type had already changed. As a
result, lease preference violations might not have been quickly
resolved by the lease queue.

Read the lease under the same mutex used for requesting the lease, when
possibly switching the lease type.

Resolves: cockroachdb#123998
Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team
Projects
KV
roachtest/unit test backlog
Development

Successfully merging a pull request may close this issue.

3 participants