
ccl/schemachangerccl: TestBackupMixedVersionElements failed [node unavailable in KV NodeLiveness even though it started up] #120521

Closed
cockroach-teamcity opened this issue Mar 14, 2024 · 23 comments · Fixed by #124288
Labels: A-testing · branch-release-23.1 · C-bug · C-test-failure · O-robot · P-2 · T-kv

cockroach-teamcity commented Mar 14, 2024

ccl/schemachangerccl.TestBackupMixedVersionElements_base_alter_table_alter_primary_key_drop_rowid failed with artifacts on release-23.1 @ 689c53764c44c609313c00f7317f1e747e13a152:

I240314 21:19:02.138565 1303481 3@pebble/event.go:701  [n2,s2,pebble] 41  [JOB 14] sstable deleted 000008
I240314 21:19:02.155747 1300521 13@kv/kvserver/replica_raft.go:381  [T1,n1,s1,r36/1:/Table/3{5-6}] 42  proposing SIMPLE(v2) [(n2,s2):2]: after=[(n1,s1):1 (n2,s2):2] next=3
I240314 21:19:02.156702 1300521 13@kv/kvserver/replicate_queue.go:1283  [T1,n1,replicate,s1,r50/1:/Table/5{0-1}] 43  adding voter n3,s3: [1*:12]
I240314 21:19:02.157280 1300521 13@kv/kvserver/replica_command.go:2376  [T1,n1,replicate,s1,r50/1:/Table/5{0-1}] 44  change replicas (add [(n3,s3):2LEARNER] remove []): existing descriptor r50:/Table/5{0-1} [(n1,s1):1, next=2, gen=0]
I240314 21:19:02.160757 1300521 13@kv/kvserver/replica_raft.go:381  [T1,n1,s1,r50/1:/Table/5{0-1}] 45  proposing SIMPLE(l2) [(n3,s3):2LEARNER]: after=[(n1,s1):1 (n3,s3):2LEARNER] next=3
I240314 21:19:02.176306 1303491 13@kv/kvserver/store_snapshot.go:1579  [T1,n1,s1,r50/1:/Table/5{0-1}] 46  streamed INITIAL snapshot 82cb4430 at applied index 15 to (n3,s3):2LEARNER with 700 B in 0.00s @ 4.4 MiB/s: kvs=9 rangeKVs=0, rate-limit: 32 MiB/s, queued: 0.01s
I240314 21:19:02.176952 1303441 13@kv/kvserver/replica_raftstorage.go:480  [T1,n3,s3,r50/2:{-}] 47  applying INITIAL snapshot 82cb4430 from (n1,s1):1 at applied index 15
I240314 21:19:02.177400 1303441 3@pebble/event.go:697  [n3,s3,pebble] 48  [JOB 15] ingesting: sstable created 000015
I240314 21:19:02.177528 1303441 3@pebble/event.go:697  [n3,s3,pebble] 49  [JOB 15] ingesting: sstable created 000020
I240314 21:19:02.177618 1303441 3@pebble/event.go:697  [n3,s3,pebble] 50  [JOB 15] ingesting: sstable created 000016
I240314 21:19:02.177643 1303441 3@pebble/event.go:697  [n3,s3,pebble] 51  [JOB 15] ingesting: sstable created 000017
I240314 21:19:02.177659 1303441 3@pebble/event.go:697  [n3,s3,pebble] 52  [JOB 15] ingesting: sstable created 000018
I240314 21:19:02.177674 1303441 3@pebble/event.go:697  [n3,s3,pebble] 53  [JOB 15] ingesting: sstable created 000019
I240314 21:19:02.177734 1303441 3@pebble/event.go:717  [n3,s3,pebble] 54  [JOB 16] WAL created 000021 (recycled 000002)
I240314 21:19:02.177845 1303523 3@pebble/event.go:677  [n3,s3,pebble] 55  [JOB 17] flushing 1 memtable to L0
I240314 21:19:02.177891 1303523 3@pebble/event.go:697  [n3,s3,pebble] 56  [JOB 17] flushing: sstable created 000022
I240314 21:19:02.178068 1303523 3@pebble/event.go:681  [n3,s3,pebble] 57  [JOB 17] flushed 1 memtable to L0 [000022] (1.7 K), in 0.0s (0.0s total), output rate 9.6 M/s
I240314 21:19:02.178171 1303526 3@pebble/event.go:665  [n3,s3,pebble] 58  [JOB 18] compacting(default) L0 [000022] (1.7 K) + L6 [000013 000012] (2.8 K)
I240314 21:19:02.178319 1303526 3@pebble/event.go:697  [n3,s3,pebble] 59  [JOB 18] compacting: sstable created 000023
I240314 21:19:02.178563 1303526 3@pebble/event.go:669  [n3,s3,pebble] 60  [JOB 18] compacted(default) L0 [000022] (1.7 K) + L6 [000013 000012] (2.8 K) -> L6 [000023] (2.3 K), in 0.0s (0.0s total), output rate 7.0 M/s
I240314 21:19:02.178816 1303441 3@pebble/event.go:705  [n3,s3,pebble] 61  [JOB 15] ingested L0:000015 (1.4 K), L0:000020 (1.1 K), L0:000016 (1.5 K), L6:000017 (1.2 K), L6:000018 (1.1 K), L6:000019 (1.1 K)
I240314 21:19:02.178933 1303529 3@pebble/event.go:701  [n3,s3,pebble] 62  [JOB 18] sstable deleted 000012
I240314 21:19:02.178979 1303529 3@pebble/event.go:701  [n3,s3,pebble] 63  [JOB 18] sstable deleted 000013
I240314 21:19:02.178995 1303529 3@pebble/event.go:701  [n3,s3,pebble] 64  [JOB 18] sstable deleted 000022
I240314 21:19:02.179066 1303409 3@pebble/event.go:665  [n3,s3,pebble] 65  [JOB 20] compacting(default) L0 [000015 000020 000016] (4.1 K) + L6 [000023] (2.3 K)
I240314 21:19:02.179189 1303409 3@pebble/event.go:697  [n3,s3,pebble] 66  [JOB 20] compacting: sstable created 000024
I240314 21:19:02.179309 1303441 kv/kvserver/replica_raftstorage.go:491  [T1,n3,s3,r50/2:/Table/5{0-1}] 67  applied INITIAL snapshot 82cb4430 from (n1,s1):1 at applied index 15 (total=2ms data=757 B ingestion=6@2ms)
I240314 21:19:02.179466 1303409 3@pebble/event.go:669  [n3,s3,pebble] 68  [JOB 20] compacted(default) L0 [000015 000020 000016] (4.1 K) + L6 [000023] (2.3 K) -> L6 [000024] (2.6 K), in 0.0s (0.0s total), output rate 7.8 M/s
I240314 21:19:02.179576 1303557 3@pebble/event.go:665  [n3,s3,pebble] 69  [JOB 21] compacting(elision-only) L6 [000017] (1.2 K) + L6 [] (0 B)
I240314 21:19:02.179676 1303557 3@pebble/event.go:669  [n3,s3,pebble] 70  [JOB 21] compacted(elision-only) L6 [000017] (1.2 K) + L6 [] (0 B) -> L6 [] (0 B), in 0.0s (0.0s total), output rate 0 B/s
I240314 21:19:02.179718 1303558 3@pebble/event.go:665  [n3,s3,pebble] 71  [JOB 22] compacting(elision-only) L6 [000018] (1.1 K) + L6 [] (0 B)
I240314 21:19:02.179780 1303558 3@pebble/event.go:669  [n3,s3,pebble] 72  [JOB 22] compacted(elision-only) L6 [000018] (1.1 K) + L6 [] (0 B) -> L6 [] (0 B), in 0.0s (0.0s total), output rate 0 B/s
I240314 21:19:02.179827 1303560 3@pebble/event.go:665  [n3,s3,pebble] 73  [JOB 23] compacting(elision-only) L6 [000019] (1.1 K) + L6 [] (0 B)
I240314 21:19:02.179901 1303560 3@pebble/event.go:669  [n3,s3,pebble] 74  [JOB 23] compacted(elision-only) L6 [000019] (1.1 K) + L6 [] (0 B) -> L6 [] (0 B), in 0.0s (0.0s total), output rate 0 B/s
I240314 21:19:02.179956 1303559 3@pebble/event.go:701  [n3,s3,pebble] 75  [JOB 22] sstable deleted 000018
I240314 21:19:02.179987 1303556 3@pebble/event.go:701  [n3,s3,pebble] 76  [JOB 20] sstable deleted 000015
I240314 21:19:02.180017 1303556 3@pebble/event.go:701  [n3,s3,pebble] 77  [JOB 20] sstable deleted 000016
I240314 21:19:02.179998 1303532 3@pebble/event.go:701  [n3,s3,pebble] 78  [JOB 24] sstable deleted 000017
I240314 21:19:02.180034 1303556 3@pebble/event.go:701  [n3,s3,pebble] 79  [JOB 20] sstable deleted 000020
I240314 21:19:02.180043 1303532 3@pebble/event.go:701  [n3,s3,pebble] 80  [JOB 24] sstable deleted 000019
I240314 21:19:02.180046 1303556 3@pebble/event.go:701  [n3,s3,pebble] 81  [JOB 20] sstable deleted 000023
I240314 21:19:02.180403 1300521 13@kv/kvserver/replica_command.go:2376  [T1,n1,replicate,s1,r50/1:/Table/5{0-1}] 82  change replicas (add [(n3,s3):2] remove []): existing descriptor r50:/Table/5{0-1} [(n1,s1):1, (n3,s3):2LEARNER, next=3, gen=1]
I240314 21:19:02.183884 1300521 13@kv/kvserver/replica_raft.go:381  [T1,n1,s1,r50/1:/Table/5{0-1}] 83  proposing SIMPLE(v2) [(n3,s3):2]: after=[(n1,s1):1 (n3,s3):2] next=3
I240314 21:19:02.184837 1300521 13@kv/kvserver/replicate_queue.go:1283  [T1,n1,replicate,s1,r22/1:/Table/2{1-2}] 84  adding voter n2,s2: [1*:18]
I240314 21:19:02.185345 1300521 13@kv/kvserver/replica_command.go:2376  [T1,n1,replicate,s1,r22/1:/Table/2{1-2}] 85  change replicas (add [(n2,s2):2LEARNER] remove []): existing descriptor r22:/Table/2{1-2} [(n1,s1):1, next=2, gen=0]
I240314 21:19:02.188777 1300521 13@kv/kvserver/replica_raft.go:381  [T1,n1,s1,r22/1:/Table/2{1-2}] 86  proposing SIMPLE(l2) [(n2,s2):2LEARNER]: after=[(n1,s1):1 (n2,s2):2LEARNER] next=3
--- FAIL: TestBackupMixedVersionElements_base_alter_table_alter_primary_key_drop_rowid (274.72s)
=== RUN   TestBackupMixedVersionElements_base_alter_table_alter_primary_key_drop_rowid/backup/restore_stage_17_of_19
    testcluster.go:429: pq: n3 required, but unavailable
    --- FAIL: TestBackupMixedVersionElements_base_alter_table_alter_primary_key_drop_rowid/backup/restore_stage_17_of_19 (0.82s)

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-foundations


Jira issue: CRDB-36728

cockroach-teamcity added the branch-release-23.1, C-test-failure, O-robot, release-blocker, and T-sql-foundations labels Mar 14, 2024
cockroach-teamcity added this to the 23.1 milestone Mar 14, 2024

rafiss commented Mar 15, 2024

The error is "n3 required, but unavailable". It happens when the upgrades framework uses KV node liveness to check which nodes are up:

// NodesFromNodeLiveness returns the IDs and epochs for all nodes that are
// currently part of the cluster (i.e. they haven't been decommissioned away).
// Migrations have the pre-requisite that all nodes are up and running so that
// we're able to execute all relevant node-level operations on them. If any of
// the nodes are found to be unavailable, an error is returned.
//
// It's important to note that this makes no guarantees about new nodes
// being added to the cluster. It's entirely possible for that to happen
// concurrently with the retrieval of the current set of nodes. Appropriate
// usage of this entails wrapping it under a stabilizing loop, like we do in
// EveryNode.
func NodesFromNodeLiveness(ctx context.Context, nl NodeLiveness) (Nodes, error) {
	var ns []Node
	ls, err := nl.GetLivenessesFromKV(ctx)
	if err != nil {
		return nil, err
	}
	for _, l := range ls {
		if l.Membership.Decommissioned() {
			continue
		}
		live, err := nl.IsLive(l.NodeID)
		if err != nil {
			return nil, err
		}
		if !live {
			return nil, errors.Newf("n%d required, but unavailable", l.NodeID)
		}
		ns = append(ns, Node{ID: l.NodeID, Epoch: l.Epoch})
	}
	return ns, nil
}
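To make the failure mode concrete, here is a toy re-implementation of that check (all types and names are invented for illustration; this is not CockroachDB code): a node whose liveness record exists but which is not yet considered live aborts the entire scan.

```go
package main

import "fmt"

// liveness is a toy stand-in for a KV liveness record.
type liveness struct {
	nodeID         int
	decommissioned bool
}

// fakeNodeLiveness models the two inputs the real check consults:
// the persisted records and the per-node liveness verdict.
type fakeNodeLiveness struct {
	records []liveness
	live    map[int]bool
}

// nodesFromNodeLiveness mirrors the structure of the function above:
// skip decommissioned nodes, and fail hard on any non-live node.
func nodesFromNodeLiveness(nl fakeNodeLiveness) ([]int, error) {
	var ns []int
	for _, l := range nl.records {
		if l.decommissioned {
			continue
		}
		if !nl.live[l.nodeID] {
			return nil, fmt.Errorf("n%d required, but unavailable", l.nodeID)
		}
		ns = append(ns, l.nodeID)
	}
	return ns, nil
}

func main() {
	nl := fakeNodeLiveness{
		records: []liveness{{nodeID: 1}, {nodeID: 2}, {nodeID: 3}},
		// n3 has a liveness record but is not yet seen as live.
		live: map[int]bool{1: true, 2: true, 3: false},
	}
	_, err := nodesFromNodeLiveness(nl)
	fmt.Println(err) // n3 required, but unavailable
}
```

This shows why a single node that has a record but no successful heartbeat yet is enough to fail the whole upgrade step.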

In the CRDB logs, it looks like n3 is in fact up, but n1 doesn't think it is.

I240314 21:19:01.921826 1299534 server/node.go:533 ⋮ [T1,n3] 37325  initialized store s3
I240314 21:19:01.921855 1299534 kv/kvserver/stores.go:264 ⋮ [T1,n3] 37326  read 0 node addresses from persistent storage
I240314 21:19:01.921899 1299534 server/node.go:627 ⋮ [T1,n3] 37327  started with engine type ‹2›
I240314 21:19:01.921907 1299534 server/node.go:629 ⋮ [T1,n3] 37328  started with attributes []
I240314 21:19:01.921996 1299534 1@server/server.go:1845 ⋮ [T1,n3] 37329  starting https server at ‹127.0.0.1:44033› (use: ‹127.0.0.1:44033›)
I240314 21:19:01.922009 1299534 1@server/server.go:1850 ⋮ [T1,n3] 37330  starting postgres server at ‹127.0.0.1:38171› (use: ‹127.0.0.1:38171›)
I240314 21:19:01.922017 1299534 1@server/server.go:1853 ⋮ [T1,n3] 37331  starting grpc server at ‹127.0.0.1:41507›
I240314 21:19:01.922024 1299534 1@server/server.go:1854 ⋮ [T1,n3] 37332  advertising CockroachDB node at ‹127.0.0.1:41507›
I240314 21:19:01.970295 1302322 gossip/client.go:124 ⋮ [T1,n3] 37333  started gossip client to n0 (‹127.0.0.1:39703›)
I240314 21:19:01.970759 1302314 1@server/server.go:1683 ⋮ [T1,n3] 37334  node connected via gossip
I240314 21:19:01.970827 1300132 kv/kvserver/stores.go:283 ⋮ [T1,n3] 37335  wrote 1 node addresses to persistent storage
I240314 21:19:01.970863 1300132 kv/kvserver/stores.go:283 ⋮ [T1,n3] 37336  wrote 2 node addresses to persistent storage
I240314 21:19:01.971190 1299630 kv/kvserver/stores.go:283 ⋮ [T1,n1] 37337  wrote 2 node addresses to persistent storage
I240314 21:19:01.973313 1299881 kv/kvserver/stores.go:283 ⋮ [T1,n2] 37338  wrote 2 node addresses to persistent storage
I240314 21:19:01.996971 1302660 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n3] 37339  spanconfig-subscriber: established range feed cache
I240314 21:19:01.997327 1299534 1@util/log/event_log.go:32 ⋮ [T1,n3] 37340 ={"Timestamp":1710451141997325089,"EventType":"node_join","NodeID":3,"StartedAt":1710451141920492875,"LastUp":1710451141920492875}
I240314 21:19:01.997433 1302663 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n3] 37341  settings-watcher: established range feed cache
I240314 21:19:01.999090 1299534 sql/sqlliveness/slinstance/slinstance.go:434 ⋮ [T1,n3] 37342  starting SQL liveness instance
I240314 21:19:02.000683 1302694 sql/sqlliveness/slstorage/slstorage.go:540 ⋮ [T1,n3] 37343  inserted sqlliveness session 010180857f00f1c05947a2b8c8fdf9f7527f34
I240314 21:19:02.000711 1302694 sql/sqlliveness/slinstance/slinstance.go:258 ⋮ [T1,n3] 37344  created new SQL liveness session 010180857f00f1c05947a2b8c8fdf9f7527f34
I240314 21:19:02.000735 1299534 sql/sqlinstance/instancestorage/instancestorage.go:342 ⋮ [T1,n3] 37345  assigning instance id to rpc addr ‹127.0.0.1:41507› and sql addr ‹127.0.0.1:38171›
I240314 21:19:02.001733 1299534 server/server_sql.go:1530 ⋮ [T1,n3] 37346  bound sqlinstance: Instance{RegionPrefix: gA==, InstanceID: 3, SQLAddr: ‹127.0.0.1:38171›, RPCAddr: ‹127.0.0.1:41507›, SessionID: 010180857f00f1c05947a2b8c8fdf9f7527f34, Locality: ‹region=us-east3›, BinaryVersion: 23.1}
I240314 21:19:02.002158 1302726 sql/temporary_schema.go:486 ⋮ [T1,n3] 37347  skipping temporary object cleanup run as it is not the leaseholder
I240314 21:19:02.002178 1302726 sql/temporary_schema.go:487 ⋮ [T1,n3] 37348  completed temporary object cleanup job
I240314 21:19:02.002192 1302726 sql/temporary_schema.go:639 ⋮ [T1,n3] 37349  temporary object cleaner next scheduled to run at 2024-03-14 21:49:02.002140061 +0000 UTC m=+2242.917556323
I240314 21:19:02.002243 1302727 sql/sqlstats/persistedsqlstats/provider.go:170 ⋮ [T1,n3] 37350  starting sql-stats-worker with initial delay: 10m43.100369651s
I240314 21:19:02.002335 1299534 upgrade/upgrademanager/manager.go:170 ⋮ [T1,n3] 37351  running permanent upgrades up to version: 22.2
I240314 21:19:02.002590 1302698 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n3] 37352  system-config-cache: established range feed cache
I240314 21:19:02.015450 1299534 upgrade/upgrademanager/manager.go:238 ⋮ [T1,n3] 37353  the last permanent upgrade (v22.1-42) does not appear to have completed; attempting to run all upgrades
I240314 21:19:02.015803 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37354  running permanent upgrade for version 0.0-2
I240314 21:19:02.018029 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37355  running permanent upgrade for version 0.0-4
I240314 21:19:02.019798 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37356  running permanent upgrade for version 0.0-6
I240314 21:19:02.021752 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37357  running permanent upgrade for version 0.0-8
I240314 21:19:02.023716 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37358  running permanent upgrade for version 0.0-10
I240314 21:19:02.031511 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37359  running permanent upgrade for version 0.0-12
I240314 21:19:02.035368 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37360  running permanent upgrade for version 22.1-42
I240314 21:19:02.036857 1299534 server/server_sql.go:1638 ⋮ [T1,n3] 37361  done ensuring all necessary startup migrations have run
I240314 21:19:02.037407 1303079 jobs/job_scheduler.go:407 ⋮ [T1,n3] 37362  waiting 3m0s before scheduled jobs daemon start
I240314 21:19:02.037538 1303083 server/auto_upgrade.go:43 ⋮ [T1,n3] 37363  auto upgrade disabled by testing
I240314 21:19:02.039156 1303099 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n3] 37364  tenant-settings-watcher: established range feed cache
I240314 21:19:02.039836 1303057 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n3] 37365  tenant-capability-watcher: established range feed cache
I240314 21:19:02.041329 1303057 multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go:149 ⋮ [T1,n3] 37366  received results of a full table scan for tenant capabilities
I240314 21:19:02.047374 1303081 sql/syntheticprivilegecache/cache.go:204 ⋮ [T1,n3] 37367  warmed privileges for virtual tables in 9.940078ms
I240314 21:19:02.056691 1299534 1@server/server_sql.go:1756 ⋮ [T1,n3] 37368  serving sql connections
I240314 21:19:02.073947 1303241 upgrade/upgrademanager/manager.go:397 ⋮ [T1,n1,client=127.0.0.1:59982,hostssl,user=root,migration-mgr] 37369  migrating cluster from 22.2 to 22.2-48 (stepping through [22.2-2 22.2-4 22.2-6 22.2-8 22.2-10 22.2-12 22.2-14 22.2-16 22.2-18 22.2-20 22.2-22 22.2-24 22.2-26 22.2-28 22.2-30 22.2-32 22.2-34 22.2-36 22.2-38 22.2-40 22.2-42 22.2-44 22.2-46 22.2-48])
I240314 21:19:02.074000 1303241 upgrade/upgrademanager/manager.go:657 ⋮ [T1,n1,client=127.0.0.1:59982,hostssl,user=root,migration-mgr] 37370  executing operation validate-cluster-version=22.2-48
W240314 21:19:02.074168 1303241 upgrade/upgrademanager/manager.go:370 ⋮ [T1,n1,client=127.0.0.1:59982,hostssl,user=root,migration-mgr] 37371  error encountered during version upgrade: n3 required, but unavailable

Checking if KV can investigate why this disagreement might occur.

rafiss changed the title from "ccl/schemachangerccl: TestBackupMixedVersionElements_base_alter_table_alter_primary_key_drop_rowid failed" to "ccl/schemachangerccl: TestBackupMixedVersionElements failed [node unavailable in KV NodeLiveness even though it started up]" Mar 15, 2024
rafiss added the T-kv label and removed the T-sql-foundations label Mar 15, 2024
blathers-crl bot added this to Incoming in KV Mar 15, 2024

rafiss commented Mar 15, 2024

Is the solution here to improve the UntilClusterStable function so it can tolerate temporary unavailability of a node? I'm not sure if that's just a band-aid we should try to avoid, or if it would be expected usage of the NodeLiveness system, so I'm hoping the KV team can advise.

// UntilClusterStable is part of the upgrade.Cluster interface.
func (c *Cluster) UntilClusterStable(ctx context.Context, fn func() error) error {
	ns, err := NodesFromNodeLiveness(ctx, c.c.NodeLiveness)
	if err != nil {
		return err
	}
	for {
		if err := fn(); err != nil {
			return err
		}
		curNodes, err := NodesFromNodeLiveness(ctx, c.c.NodeLiveness)
		if err != nil {
			return err
		}
		if ok, diffs := ns.Identical(curNodes); !ok {
			log.Infof(ctx, "%s, retrying", diffs)
			ns = curNodes
			continue
		}
		break
	}
	return nil
}


kvoli commented Mar 21, 2024

I'm confused about why this series of tests started failing recently (<2 weeks ago) without any related changes. I attempted to reproduce for an hour with no success at 12a8659895a33f4a29900e6b9d83bd624e120216:

dev test pkg/ccl/schemachangerccl -f TestBackupMixedVersionElements/TestBackupMixedVersionElements_base_alter_table_alter_primary_key_drop_rowid -v --count=1000

Speculatively, the cause of the failure could be a race between the node joining and heartbeating its liveness, and the upgrade check.

We know from the logs that the node created a liveness record but not that it had heartbeated its liveness.

func (nl *NodeLiveness) CreateLivenessRecord(ctx context.Context, nodeID roachpb.NodeID) error {

I240318 21:33:38.679735 1264692 kv/kvserver/liveness/liveness.go:617 ⋮ [T1,n1] 1974  created liveness record for n3

I'll continue to try and reproduce.


kvoli commented Mar 21, 2024

This hasn't reproduced over 2 hours. I'll try a different linked test.

@andrewbaptist

In the test, the window between n3 being fully up and n1 checking whether it is alive is very narrow. n3 probably published its liveness a few ms earlier, but the gossip delay could explain this miss.

I240314 21:19:02.056691 1299534 1@server/server_sql.go:1756 ⋮ [T1,n3] 37368  serving sql connections
W240314 21:19:02.074168 1303241 upgrade/upgrademanager/manager.go:370 ⋮ [T1,n1,client=127.0.0.1:59982,hostssl,user=root,migration-mgr] 37371  error encountered during version upgrade: n3 required, but unavailable

These are only 20ms apart, so it is definitely possible that n1 hadn't seen the liveness record for n3 yet. I'm actually surprised this doesn't fail more often.


kvoli commented Mar 21, 2024

> n3 probably published its liveness a few ms earlier, but the gossip delay could explain this miss.

It should be consistent since it is scanning the liveness range, as opposed to using the in-memory gossiped version?

ls, err := nl.GetLivenessesFromKV(ctx)

I was assuming the issue was that the first heartbeat after creating the record didn't complete before the upgrade began.

@andrewbaptist

Good point, it does read from KV. However, the write to KV happens here:
https://github.com/cockroachdb/cockroach/blob/release-23.1/pkg/kv/kvserver/liveness/liveness.go#L798
inside the RunAsyncTaskEx loop, so it is possible the write hadn't happened by the time the check was made. I haven't gone through this in detail, so I'm not sure when n1 decides to do the check.
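As an illustration only (none of this is CockroachDB code, and the type names are invented), the suspected ordering problem — a reader scanning for a record that an async task has not yet persisted — can be modeled like this:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// kvStore is a toy stand-in for the liveness range: a map of
// node -> record-exists, guarded by a mutex.
type kvStore struct {
	mu      sync.Mutex
	records map[string]bool
}

func (s *kvStore) write(node string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.records[node] = true
}

func (s *kvStore) isLive(node string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.records[node]
}

func main() {
	kv := &kvStore{records: map[string]bool{}}

	// Async task, analogous to the RunAsyncTaskEx loop persisting
	// the liveness record some time after node startup.
	go func() {
		time.Sleep(20 * time.Millisecond)
		kv.write("n3")
	}()

	// A reader that checks immediately (like the upgrade manager)
	// can miss the record, even though the node is "up".
	fmt.Println("immediately:", kv.isLive("n3")) // likely false
	time.Sleep(50 * time.Millisecond)
	fmt.Println("after wait: ", kv.isLive("n3")) // true
}
```

The consistent KV read doesn't help here: the read is correct, it is the write that hasn't landed yet.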

kvoli added a commit to kvoli/cockroach that referenced this issue May 16, 2024
Nodes can be transiently unavailable (failing a heartbeat), in which
case the upgrade manager will error out. Retry `UntilClusterStable` for
up to 10 times when there are unavailable nodes before returning an
error.

Resolves: cockroachdb#120521
Resolves: cockroachdb#121069
Resolves: cockroachdb#119696
Release note: None

kvoli commented May 16, 2024

@andrewbaptist I implemented a (less-than-desirable) retry mechanism for unavailable nodes during the upgrade's UntilClusterStable function, which is where these tests are failing. WDYT about this approach?

#124288

craig bot pushed a commit that referenced this issue Jun 4, 2024
124288: upgradecluster: retry until cluster stable with unavailable nodes r=rafiss a=kvoli

Nodes can be transiently unavailable (failing a heartbeat), in which case the upgrade manager will error out. Retry `UntilClusterStable` for up to 10 times when there are unavailable nodes before returning an error.

Resolves: #120521
Resolves: #121069
Resolves: #119696
Release note: None

Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
craig bot closed this as completed in ab76750 Jun 4, 2024
blathers-crl bot pushed a commit that referenced this issue Jun 4, 2024
blathers-crl bot pushed a commit that referenced this issue Jun 4, 2024
Dhruv-Sachdev1313 pushed a commit to Dhruv-Sachdev1313/cockroach that referenced this issue Jun 7, 2024