
roachtest: import/tpcc/warehouses=1000/nodes=32 failed [replication failures and clock latency jumps] #122348

Closed
cockroach-teamcity opened this issue Apr 14, 2024 · 8 comments
Labels

  • branch-release-24.1: Used to mark GA and release blockers and technical advisories for 24.1
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • C-test-failure: Broken test (automatically or manually discovered).
  • GA-blocker
  • O-roachtest
  • O-robot: Originated from a bot.
  • P-1: Issues/test failures with a fix SLA of 1 month
  • T-kv: KV Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Apr 14, 2024

roachtest.import/tpcc/warehouses=1000/nodes=32 failed with artifacts on release-24.1 @ 6e3e9e6012dc04dabc64333a87e77a80d9f9d46b:

(monitor.go:154).Wait: monitor failure: pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10.008s (given timeout 10s): internal error while retrieving user account: get default settings error: interrupted during singleflight load-value:defaultsettings-roachprod-100-1: context deadline exceeded
test artifacts and logs in: /artifacts/import/tpcc/warehouses=1000/nodes=32/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-37821

@cockroach-teamcity added the branch-release-24.1, C-test-failure, O-roachtest, O-robot, release-blocker, and T-sql-queries labels Apr 14, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Apr 14, 2024
@rytaft
Collaborator

rytaft commented Apr 17, 2024

Looks like SQL Foundations

@rytaft added the T-sql-foundations label and removed the T-sql-queries label Apr 17, 2024
@blathers-crl blathers-crl bot added this to Triage in SQL Foundations Apr 17, 2024
@rytaft rytaft changed the title roachtest: import/tpcc/warehouses=1000/nodes=32 failed roachtest: [operation "get-user-session" timed out] import/tpcc/warehouses=1000/nodes=32 failed Apr 17, 2024
@rafiss
Collaborator

rafiss commented Apr 17, 2024

The error message is the workload health check failing at 15:59:07:

health: 15:59:07 health_checker.go:84: health check terminated on node 4 with pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10.008s (given timeout 10s): internal error while retrieving user account: get default settings error: interrupted during singleflight load-value:defaultsettings-roachprod-100-1: context deadline exceeded

The first step of the health check is to connect to the cluster. Part of authenticating is to run the get-user-session operation internally, which has a 10-second timeout. If it fails, it means that some part of the cluster is unhealthy and prevented the get-user-session operation from loading the user data it needed.

The logs from n4 show some replication errors happening earlier:

I240414 15:58:20.656955 11824 kv/kvserver/replica_command.go:1860 ⋮ [T1,Vsystem,n4,replicate,s4,r78/3:‹/Table/110/1{-/29/3/…}›] 1139  could not successfully add and upreplicate LEARNER replica(s) on [n26,s26], rolling back: operation ‹"send-snapshot"› timed out after 38.43s (given timeout 1h0m0s): ‹range_id:78 coordinator_replica:<node_id:4 store_id:4 replica_id:3 type:VOTER_FULL > recipient_replica:<node_id:26 store_id:26 replica_id:7 type:LEARNER > delegated_sender:<node_id:4 store_id:4 replica_id:3 type:VOTER_FULL > term:6 first_index:112 sender_queue_name:REPLICATE_QUEUE descriptor_generation:24 queue_on_delegate_len:-1 snap_id:81c2d9ef-481e-4e4d-b08b-58056427f2b3 ›: remote failed to send snapshot: recv msg error: grpc: ‹context deadline exceeded› [code 4/DeadlineExceeded]
I240414 15:58:20.671192 11824 kv/kvserver/replica_command.go:2137 ⋮ [T1,Vsystem,n4,replicate,s4,r78/3:‹/Table/110/1{-/29/3/…}›] 1140  rolled back LEARNER n26,s26 in r78:‹/Table/110/1{-/29/3/-2767}› [(n12,s12):6, (n3,s3):2, (n4,s4):3, (n26,s26):7LEARNER, next=8, gen=25, sticky=1713113828.131410469,0]
E240414 15:58:20.671494 11824 kv/kvserver/queue.go:1202 ⋮ [T1,Vsystem,n4,replicate,s4,r78/3:‹/Table/110/1{-/29/3/…}›] 1141  operation ‹"replicate queue process replica 78"› timed out after 1m0.015s (given timeout 1m0s): failed to replicate after 5 retries
E240414 15:58:20.672481 296 kv/kvserver/queue.go:1202 ⋮ [T1,Vsystem,n4,replicate,s4,r287/6:‹/Table/108/1/81{2/10/…-3/3/3…}›] 1142  needs lease, not adding: repl=(n10,s10):4 seq=5 start=1713110295.001470725,0 epo=1 pro=1713110295.003861671,0
I240414 15:58:20.672498 296 kv/kvserver/queue.go:918 ⋮ [T1,Vsystem,n4,replicate,s4,r287/6:‹/Table/108/1/81{2/10/…-3/3/3…}›] 1143  ‹queue.replicate: processing replica [outer]›: skipping 6 since replica can't be processed needs lease, not adding: repl=(n10,s10):4 seq=5 start=1713110295.001470725,0 epo=1 pro=1713110295.003861671,0
E240414 15:58:20.673842 296 kv/kvserver/queue.go:1202 ⋮ [T1,Vsystem,n4,replicate,s4,r370/4:‹/Table/108/1/532/{5/10…-7/22…}›] 1144  needs lease, not adding: repl=(n29,s29):1 seq=5 start=1713110299.001662446,0 epo=1 pro=1713110299.064308260,0

There are also some clock latency jumps:

W240414 15:59:03.903171 13356 kv/kvserver/spanlatch/manager.go:610 ⋮ [T1,Vsystem,n4,s4,r267/3:‹/Table/110/1/2{9/3/-…-15/9/…}›] 1429  have been waiting 15s to acquire write latch ‹/Local/Range/Table/110/1/29/3/-2767/Transaction/"00000000-0000-0000-0000-000000000000"›@0,0 for request RecomputeStats [/Table/110/1/‹29›/‹3›/‹-2767›], held by write latch ‹/Local/Range/Table/110/1/29/3/-2767/Transaction/"00000000-0000-0000-0000-000000000000"›@0,0 for request RecomputeStats [/Table/110/1/‹29›/‹3›/‹-2767›]
W240414 15:59:03.904504 13150 kv/kvserver/spanlatch/manager.go:610 ⋮ [T1,Vsystem,n4,s4,r267/3:‹/Table/110/1/2{9/3/-…-15/9/…}›] 1430  have been waiting 15s to acquire write latch ‹/Local/Range/Table/110/1/29/3/-2767/Transaction/"00000000-0000-0000-0000-000000000000"›@0,0 for request RecomputeStats [/Table/110/1/‹29›/‹3›/‹-2767›], held by write latch ‹/Local/Range/Table/110/1/29/3/-2767/Transaction/"00000000-0000-0000-0000-000000000000"›@0,0 for request RecomputeStats [/Table/110/1/‹29›/‹3›/‹-2767›]
W240414 15:59:03.914373 14505 kv/kvserver/spanlatch/manager.go:610 ⋮ [T1,Vsystem,n4,s4,r267/3:‹/Table/110/1/2{9/3/-…-15/9/…}›] 1431  have been waiting 15s to acquire write latch ‹/Local/Range/Table/110/1/29/3/-2767/Transaction/"00000000-0000-0000-0000-000000000000"›@0,0 for request RecomputeStats [/Table/110/1/‹29›/‹3›/‹-2767›], held by write latch ‹/Local/Range/Table/110/1/29/3/-2767/Transaction/"00000000-0000-0000-0000-000000000000"›@0,0 for request RecomputeStats [/Table/110/1/‹29›/‹3›/‹-2767›]
W240414 15:59:03.969015 13469 kv/kvserver/spanlatch/manager.go:610 ⋮ [T1,Vsystem,n4,s4,r267/3:‹/Table/110/1/2{9/3/-…-15/9/…}›] 1432  have been waiting 15s to acquire write latch ‹/Local/Range/Table/110/1/29/3/-2767/Transaction/"00000000-0000-0000-0000-000000000000"›@0,0 for request RecomputeStats [/Table/110/1/‹29›/‹3›/‹-2767›], held by write latch ‹/Local/Range/Table/110/1/29/3/-2767/Transaction/"00000000-0000-0000-0000-000000000000"›@0,0 for request RecomputeStats [/Table/110/1/‹29›/‹3›/‹-2767›]
W240414 15:59:04.058812 15381 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=6,raddr=‹10.142.0.253:29000›,class=default,rpc] 1433  latency jump (prev avg 172.50ms, current 338.57ms)
I240414 15:59:04.517880 43559 kv/kvserver/replica_raftstorage.go:607 ⋮ [T1,Vsystem,n4,s4,r1810/5:‹/Table/114/1/97{3/10/…-4/2/-…}›] 1434  applied snapshot 84be03c3 from (n26,s26):3 at applied index 26 (total=26ms data=2.0 KiB excise=true ingestion=6@25ms)
W240414 15:59:04.650433 14588 kv/kvserver/spanlatch/manager.go:610 ⋮ [T1,Vsystem,n4,s4,r267/3:‹/Table/110/1/2{9/3/-…-15/9/…}›] 1435  have been waiting 15s to acquire write latch ‹/Local/Range/Table/110/1/29/3/-2767/Transaction/"00000000-0000-0000-0000-000000000000"›@0,0 for request RecomputeStats [/Table/110/1/‹29›/‹3›/‹-2767›], held by write latch ‹/Local/Range/Table/110/1/29/3/-2767/Transaction/"00000000-0000-0000-0000-000000000000"›@0,0 for request RecomputeStats [/Table/110/1/‹29›/‹3›/‹-2767›]
W240414 15:59:05.321603 16453 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=14,raddr=‹10.142.0.163:29000›,class=raft,rpc] 1436  latency jump (prev avg 183.06ms, current 277.63ms)
W240414 15:59:05.430425 8455 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=10,raddr=‹10.142.0.16:29000›,class=default,rpc] 1437  latency jump (prev avg 166.39ms, current 354.80ms)
W240414 15:59:05.433493 4436 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=17,raddr=‹10.142.0.245:29000›,class=default,rpc] 1438  latency jump (prev avg 178.82ms, current 299.18ms)
W240414 15:59:05.504544 13129 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=14,raddr=‹10.142.0.163:29000›,class=default,rpc] 1439  latency jump (prev avg 194.30ms, current 309.77ms)
W240414 15:59:05.646795 7599 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=18,raddr=‹10.142.0.210:29000›,class=default,rpc] 1440  latency jump (prev avg 187.55ms, current 297.18ms)
W240414 15:59:05.668718 2522 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=5,raddr=‹10.142.0.85:29000›,class=raft,rpc] 1441  latency jump (prev avg 208.74ms, current 357.35ms)
W240414 15:59:05.711929 12983 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=22,raddr=‹10.142.0.80:29000›,class=default,rpc] 1442  latency jump (prev avg 200.82ms, current 301.83ms)
W240414 15:59:05.734415 19491 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=30,raddr=‹10.142.0.213:29000›,class=raft,rpc] 1443  latency jump (prev avg 192.03ms, current 304.17ms)
W240414 15:59:05.756897 5154 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=19,raddr=‹10.142.0.72:29000›,class=rangefeed,rpc] 1444  latency jump (prev avg 174.77ms, current 325.34ms)
W240414 15:59:05.773111 5122 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=6,raddr=‹10.142.0.253:29000›,class=rangefeed,rpc] 1445  latency jump (prev avg 199.78ms, current 392.87ms)
I240414 15:59:05.774899 43728 kv/kvserver/replica_raftstorage.go:607 ⋮ [T1,Vsystem,n4,s4,r3188/4:‹/Table/113/1/2{53/82…-81/25…}›] 1446  applied snapshot 341387ce from (n32,s32):2 at applied index 17 (total=142ms data=901 B excise=true ingestion=6@140ms)
W240414 15:59:05.803060 15220 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=13,raddr=‹10.142.0.199:29000›,class=default,rpc] 1447  latency jump (prev avg 175.47ms, current 302.25ms)
W240414 15:59:05.803116 11134 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=20,raddr=‹10.142.0.81:29000›,class=raft,rpc] 1448  latency jump (prev avg 176.72ms, current 302.51ms)
I240414 15:59:05.812669 243 kv/kvserver/store_raft.go:699 ⋮ [T1,Vsystem,n4,s4,r2877/6:‹/Table/113/1/721/{20941-33238}›,raft] 1449  raft ready handling: 0.50s [append=0.32s, apply=0.00s, non-blocking-sync=0.18s, other=0.00s], wrote [append-batch=222 B, append-sst=807 KiB (1), ]; node might be overloaded
W240414 15:59:05.927763 2719 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=8,raddr=‹10.142.0.192:29000›,class=system,rpc] 1450  latency jump (prev avg 197.27ms, current 407.13ms)
W240414 15:59:05.959638 7061 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=30,raddr=‹10.142.0.213:29000›,class=system,rpc] 1451  latency jump (prev avg 198.32ms, current 439.01ms)
W240414 15:59:05.968020 7681 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=4,raddr=‹10.142.0.147:29000›,class=system,rpc] 1452  latency jump (prev avg 196.96ms, current 327.21ms)
W240414 15:59:06.051436 9848 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=11,raddr=‹10.142.1.32:29000›,class=raft,rpc] 1453  latency jump (prev avg 187.92ms, current 363.86ms)
W240414 15:59:06.051519 11773 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=31,raddr=‹10.142.0.152:29000›,class=default,rpc] 1454  latency jump (prev avg 166.64ms, current 363.84ms)
W240414 15:59:06.051586 11909 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=26,raddr=‹10.142.1.3:29000›,class=rangefeed,rpc] 1455  latency jump (prev avg 269.59ms, current 425.52ms)
W240414 15:59:06.090414 5151 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=21,raddr=‹10.142.1.1:29000›,class=raft,rpc] 1456  latency jump (prev avg 180.60ms, current 317.54ms)
W240414 15:59:06.262189 16271 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=3,raddr=‹10.142.1.20:29000›,class=default,rpc] 1457  latency jump (prev avg 185.04ms, current 338.64ms)
W240414 15:59:06.323990 2860 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=9,raddr=‹10.142.0.175:29000›,class=system,rpc] 1458  latency jump (prev avg 165.17ms, current 271.30ms)
W240414 15:59:06.519444 7734 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=29,raddr=‹10.142.1.4:29000›,class=default,rpc] 1459  latency jump (prev avg 187.07ms, current 331.56ms)
W240414 15:59:06.667912 4851 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=20,raddr=‹10.142.0.81:29000›,class=system,rpc] 1460  latency jump (prev avg 190.67ms, current 305.02ms)
W240414 15:59:06.678668 327 server/node.go:1301 ⋮ [T1,Vsystem,n4,summaries] 1461  health alerts detected: {Alerts:[{StoreID:4 Category:‹METRICS› Description:‹requests.slow.latch› Value:5}]}
W240414 15:59:06.745774 281 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=1,raddr=‹10.142.0.15:29000›,class=system,rpc] 1462  latency jump (prev avg 189.40ms, current 347.14ms)
W240414 15:59:06.746673 5672 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=24,raddr=‹10.142.0.63:29000›,class=system,rpc] 1463  latency jump (prev avg 215.13ms, current 347.94ms)
W240414 15:59:06.836284 15381 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=6,raddr=‹10.142.0.253:29000›,class=default,rpc] 1464  latency jump (prev avg 219.42ms, current 396.75ms)
W240414 15:59:06.867859 3558 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=13,raddr=‹10.142.0.199:29000›,class=system,rpc] 1465  latency jump (prev avg 188.66ms, current 469.20ms)
W240414 15:59:06.867931 3820 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=15,raddr=‹10.142.0.221:29000›,class=system,rpc] 1466  latency jump (prev avg 204.09ms, current 469.24ms)
W240414 15:59:06.896167 4436 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=17,raddr=‹10.142.0.245:29000›,class=default,rpc] 1467  latency jump (prev avg 202.13ms, current 376.45ms)
W240414 15:59:06.896602 9647 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=12,raddr=‹10.142.0.204:29000›,class=default,rpc] 1468  latency jump (prev avg 182.73ms, current 386.62ms)
W240414 15:59:06.896644 17629 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=25,raddr=‹10.142.0.44:29000›,class=default,rpc] 1469  latency jump (prev avg 192.70ms, current 386.72ms)
W240414 15:59:06.991562 7599 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=18,raddr=‹10.142.0.210:29000›,class=default,rpc] 1470  latency jump (prev avg 203.18ms, current 341.56ms)
W240414 15:59:07.059973 18283 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=28,raddr=‹10.142.0.212:29000›,class=raft,rpc] 1471  latency jump (prev avg 201.74ms, current 314.57ms)
W240414 15:59:07.142753 15504 kv/kvserver/replica_command.go:684 ⋮ [T1,Vsystem,n4,s4,r267/3:‹/Table/110/1/2{9/3/-…-15/9/…}›,*kvpb.AdminSplitRequest] 1472  retrying split after err: split at key /Table/110/1/‹312›/‹6›/‹-3001› failed: descriptor changed: [expected] r267:‹/Table/110/{1/29/3/-2767-2}› [(n12,s12):1, (n3,s3):2, (n4,s4):3, (n26,s26):4LEARNER, next=5, gen=25, sticky=1713110851.315439149,0] != [actual] r267:‹/Table/110/1/2{9/3/-2767-15/9/-2665}› [(n12,s12):1, (n3,s3):2, (n4,s4):3, (n26,s26):4LEARNER, next=5, gen=26, sticky=1713110851.315439149,0]: unexpected value: ‹raw_bytes:"\305\231\321/\003\010\213\002\022\010\366n\211\245\213\206\3651\032\t\366n\211\366\327\221\206\365\227\"\010\010\014\020\014\030\001 \000\"\010\010\003\020\003\030\002 \000\"\010\010\004\020\004\030\003 \000\"\010\010\032\020\032\030\004 \001(\0050\032:\n\010\255\254\230\314\342\247\214\343\027" timestamp:<wall_time:1713110296825040740 > ›
W240414 15:59:07.298011 7681 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n4,rnode=4,raddr=‹10.142.0.147:29000›,class=system,rpc] 1473  latency jump (prev avg 210.32ms, current 315.91ms)
W240414 15:59:07.459183 42359 sql/user.go:280 ⋮ [T1,Vsystem,n4,client=35.230.168.164:37914,hostssl,user=‹roachprod›] 1474  user lookup for ‹"roachprod"› failed: get default settings error: interrupted during singleflight ‹load-value:defaultsettings-roachprod-100-1›: context deadline exceeded

This might mean the cluster is overloaded, or there's some other problem at a lower level. @rytaft I'm moving this back to SQL Queries so you can decide which direction to pursue.

@rafiss rafiss changed the title roachtest: [operation "get-user-session" timed out] import/tpcc/warehouses=1000/nodes=32 failed roachtest: import/tpcc/warehouses=1000/nodes=32 failed [replication failures and clock latency jumps] Apr 17, 2024
@rytaft added the GA-blocker label and removed the release-blocker label Apr 17, 2024
@rytaft
Collaborator

rytaft commented Apr 17, 2024

This shouldn't block a beta, but marking as GA-blocker for now since we'd like to dig a bit deeper.

@rytaft added the P-1 label Apr 18, 2024
@yuzefovich
Member

Looking closer at n4, here are some interesting messages:

I240414 15:57:39.385090 247 kv/kvserver/store_raft.go:699 ⋮ [T1,Vsystem,n4,s4,r287/6:‹/Table/108/1/81{2/10/…-3/3/3…}›,raft] 441  raft ready handling: 0.65s [append=0.00s, apply=0.22s, , other=0.00s], wrote [append-batch=51 B, apply=312 B (1)] pebble stats: [commit-wait 107ns sem 442ns]; node might be overloaded
...
I240414 15:58:48.719835 13712 kv/kvserver/replica_command.go:474 ⋮ [T1,Vsystem,n4,s4,r267/3:‹/Table/110/1/2{9/3/-…-15/9/…}›,*kvpb.AdminSplitRequest] 1257  initiating a split of this range at key /Table/110/1/‹375›/‹1›/‹-3001› [r1273] (manual); delayed by 45.1s to resolve: replica r267/4 not caught up: StateSnapshot match=0 next=12 learner paused (without success); r267/4 is waiting for a Raft snapshot

Then at 15:59:07 we get this error. I think the KV team should look closer at whether this behavior is expected.

@yuzefovich yuzefovich added this to Incoming in KV via automation Apr 18, 2024
@yuzefovich yuzefovich removed this from Triage in SQL Foundations Apr 18, 2024
@blathers-crl bot added the T-kv label Apr 18, 2024
@yuzefovich removed the T-sql-foundations label Apr 18, 2024
@cockroach-teamcity
Member Author

roachtest.import/tpcc/warehouses=1000/nodes=32 failed with artifacts on release-24.1 @ 88c9d88cb0093d35f2d630137fd178518aa48569:

(monitor.go:154).Wait: monitor failure: pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10.019s (given timeout 10s): interrupted during singleflight load-value:/global/-1: context deadline exceeded
test artifacts and logs in: /artifacts/import/tpcc/warehouses=1000/nodes=32/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@kvoli
Collaborator

kvoli commented Apr 23, 2024

This overload is likely due to #122309 which #122824 will address.

@miraradeva
Contributor

I managed to reproduce the issue on a GCE worker running off master, and it definitely looks like the same issue as #122309. I also ran the test with #122824 and confirmed it results in a more stable cluster (e.g., no slow RPCs, no slow latch requests).

@miraradeva
Contributor

The backported fix just merged; this should now be fixed.

@miraradeva added the C-bug label Apr 25, 2024