kv,storage: replica inconsistency in 23.2 scale test #114421

Closed
kvoli opened this issue Nov 14, 2023 · 28 comments
Assignees: jbowens
Labels: A-storage, C-bug, O-23.2-scale-testing, O-testcluster, T-storage



kvoli commented Nov 14, 2023

Describe the problem

Replica inconsistency found on the 23.2 scale testing cluster:

Panic

I231111 11:54:12.341739 640128435 1@kv/kvserver/replica_consistency.go:850 ⋮ [T1,Vsystem,n41,s404,r1081383/5:‹/Table/106/1/15822{257…-339…}›] 8259919  the server is terminating due to a fatal error (see the DEV channel for details)
F231111 11:54:12.343159 640128435 kv/kvserver/replica_consistency.go:850 ⋮ [T1,Vsystem,n41,s404,r1081383/5:‹/Table/106/1/15822{257…-339…}›] 8259920  ATTENTION:
F231111 11:54:12.343159 640128435 kv/kvserver/replica_consistency.go:850 ⋮ [T1,Vsystem,n41,s404,r1081383/5:‹/Table/106/1/15822{257…-339…}›] 8259920 +
F231111 11:54:12.343159 640128435 kv/kvserver/replica_consistency.go:850 ⋮ [T1,Vsystem,n41,s404,r1081383/5:‹/Table/106/1/15822{257…-339…}›] 8259920 +This node is terminating because a replica inconsistency was detected between [n41,s404,r1081383/5:‹/Table/106/1/15822{257…-339…}›]
F231111 11:54:12.343159 640128435 kv/kvserver/replica_consistency.go:850 ⋮ [T1,Vsystem,n41,s404,r1081383/5:‹/Table/106/1/15822{257…-339…}›] 8259920 +and its other replicas: (n39,s382):6,(n41,s404):5,(n18,s189):4. Please check your cluster-wide log files for more

The diff'd checkpoints:

$ diff debug-range-data-n18.out debug-range-data-n41.out
57a58
> /Local/Lock/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor 03c22060f6faba4ff7baf02d658c6f8bf1 (0x017a6b12016b12f289fd15f531ea7e9b97f800ff017264736300010003c22060f6faba4ff7baf02d658c6f8bf112): 1699683422.833100427,0 txn={id=c22060f6 key=/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor iso=Serializable pri=0.00813128 epo=0 ts=1699683422.833100427,0 min=1699683422.833100427,0 seq=1} ts=1699683422.833100427,0 del=false klen=12 vlen=102 mergeTs=<nil> txnDidNotUpdateMeta=true

$ diff debug-range-data-n39.out debug-range-data-n41.out
57a58
> /Local/Lock/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor 03c22060f6faba4ff7baf02d658c6f8bf1 (0x017a6b12016b12f289fd15f531ea7e9b97f800ff017264736300010003c22060f6faba4ff7baf02d658c6f8bf112): 1699683422.833100427,0 txn={id=c22060f6 key=/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor iso=Serializable pri=0.00813128 epo=0 ts=1699683422.833100427,0 min=1699683422.833100427,0 seq=1} ts=1699683422.833100427,0 del=false klen=12 vlen=102 mergeTs=<nil> txnDidNotUpdateMeta=true

Additional data / screenshots

See slack thread (internal) for connection details.

Environment:

  • CockroachDB version v23.2.0-alpha.6-dev
  • Server OS: Ubuntu (Linux)

Jira issue: CRDB-33505

@kvoli kvoli added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-storage Storage Team O-23.2-scale-testing issues found during 23.2 scale testing labels Nov 14, 2023
@blathers-crl blathers-crl bot added this to Incoming in (Deprecated) Storage Nov 14, 2023
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Nov 14, 2023
@jbowens jbowens self-assigned this Nov 14, 2023

jbowens commented Nov 14, 2023

There's just one SET for this key on n41 in sstable 779345:

$ ~/cockroach debug pebble find /mnt/data4/cockroach 'hex:017a6b12016b12f289fd15f531ea7e9b97f800ff017264736300010003c22060f6faba4ff7baf02d658c6f8bf112'
779345.sst [/Local/Range/Table/106/1/1761183660943698008/RangeDescriptor/1699688485.520930672,0#0,SET-/System/tsd/cr.store.intentresolver.async.throttled/10s/2023-11-09T15:00:00Z/434/0,0#0,SET]
    (compacted L5 [...] + L6 [...])
    /Local/Lock/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor/03c22060f6faba4ff7baf02d658c6f8bf1#0,SET 1699683422.833100427,0 txn={id=c22060f6 key=/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor iso=Serializable pri=0.00813128 epo=0 ts=1699683422.833100427,0 min=1699683422.833100427,0 seq=1} ts=1699683422.833100427,0 del=false klen=12 vlen=102 mergeTs=<nil> txnDidNotUpdateMeta=true
    /Local/Lock/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor/03c22060f6faba4ff7baf02d658c6f8bf1#0,SET 1699683422.833100427,0 txn={id=c22060f6 key=/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor iso=Serializable pri=0.00813128 epo=0 ts=1699683422.833100427,0 min=1699683422.833100427,0 seq=1} ts=1699683422.833100427,0 del=false klen=12 vlen=102 mergeTs=<nil> txnDidNotUpdateMeta=true
$ ~/cockroach debug pebble sstable scan '/mnt/data4/cockroach/779345.sst' | grep '/Local/Lock/Local/Range/Table/106/1/1582225727306831864'
 
/Local/Lock/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor/03c22060f6faba4ff7baf02d658c6f8bf1#0,SET 1699683422.833100427,0 txn={id=c22060f6 key=/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor iso=Serializable pri=0.00813128 epo=0 ts=1699683422.833100427,0 min=1699683422.833100427,0 seq=1} ts=1699683422.833100427,0 del=false klen=12 vlen=102 mergeTs=<nil> txnDidNotUpdateMeta=true

No reference in the other two checkpoints, checked with:

~/cockroach debug pebble find /mnt/data8/cockroach/auxiliary/checkpoints/r1081383_at_3230/ 'hex:017a6b12016b12f289fd15f531ea7e9b97f800ff017264736300010003c22060f6faba4ff7baf02d658c6f8bf112'
~/cockroach debug pebble find /mnt/data2/cockroach/auxiliary/checkpoints/r1081383_at_3230/ 'hex:017a6b12016b12f289fd15f531ea7e9b97f800ff017264736300010003c22060f6faba4ff7baf02d658c6f8bf112'

@jbowens jbowens moved this from Incoming to In Progress (this milestone) in (Deprecated) Storage Nov 14, 2023
@nicktrav nicktrav added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 labels Nov 14, 2023

jbowens commented Nov 14, 2023

The txnDidNotUpdateMeta=true in the intent means this intent would/should have been cleared through a SINGLEDEL, leaving open the possibility that a violation of the SINGLEDEL invariants, or a bug in the SINGLEDEL implementation, could be the origin of the key.
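To make the suspicion above concrete, here is a toy model of SINGLEDEL semantics (a sketch only, not Pebble's implementation): a SINGLEDEL cancels exactly one older SET, so if the invariant is violated and the key was SET more than once, an older SET resurfaces as a "resurrected" key.

```go
package main

import "fmt"

// op is one entry in a toy LSM history, ordered newest first.
type op struct {
	kind  string // "SET" or "SINGLEDEL"
	value string
}

// read walks the history newest-to-oldest with toy SINGLEDEL
// semantics: each SINGLEDEL consumes exactly one older SET, and
// anything beneath that pair remains visible.
func read(history []op) (string, bool) {
	pending := 0
	for _, o := range history {
		switch o.kind {
		case "SINGLEDEL":
			pending++
		case "SET":
			if pending > 0 {
				pending-- // this SET is consumed by a SINGLEDEL
				continue
			}
			return o.value, true
		}
	}
	return "", false
}

func main() {
	// Invariant respected: the intent was SET exactly once
	// (txnDidNotUpdateMeta=true), so SINGLEDEL fully removes it.
	once := []op{{"SINGLEDEL", ""}, {"SET", "intent"}}
	v, ok := read(once)
	fmt.Println(v, ok) // prints " false"

	// Invariant violated: two SETs under one SINGLEDEL. The older
	// SET resurfaces -- a deleted intent reappearing is exactly the
	// kind of resurrection suspected here.
	twice := []op{{"SINGLEDEL", ""}, {"SET", "intent-v2"}, {"SET", "intent-v1"}}
	v, ok = read(twice)
	fmt.Println(v, ok) // prints "intent-v1 true"
}
```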

Some logs around the time:

n39,s382
4574:I231111 06:11:37.936904 548889411 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n39,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 870491  applying snapshot 5d47af51 from (n18,s189):4 at applied index 3021
16200:I231111 06:17:02.833039 550087248 13@kv/kvserver/allocator/plan/replicate.go:457 ⋮ [T1,Vsystem,n39,replicate,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 882117  replacing voter (n41,s404):5 with n44,s437: ‹[no raft progress]›
16201:I231111 06:17:02.839133 550087248 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n39,replicate,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 882118  change replicas (add [(n44,s437):7LEARNER] remove []): existing descriptor r1081383:‹/Table/106/1/15822{25727306831864-33925856109712}› [(n39,s382):6, (n41,s404):5, (n18,s189):4, next=7, gen=160, sticky=9223372036.854775807,2147483647]
16416:I231111 06:17:10.685377 550102113 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n39,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 882333  applying snapshot 290bc9ee from (n18,s189):4 at applied index 3041
17608:I231111 06:17:38.068668 550087248 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n39,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 883525  proposing SIMPLE(l7) [(n44,s437):7LEARNER]: after=[(n39,s382):6 (n41,s404):5 (n18,s189):4 (n44,s437):7LEARNER] next=8
18192:I231111 06:17:55.734669 550252775 13@kv/kvserver/store_snapshot.go:1863 ⋮ [T1,Vsystem,n39,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 884109  streamed snapshot c538d70f at applied index 3058 to (n44,s437):7LEARNER with 1.8 MiB in 0.18s @ 9.8 MiB/s: kvs=1419 rangeKVs=0 sharedSSTs=0, rate-limit: 32 MiB/s, queued: 0.06s
18193:I231111 06:17:55.864516 550087248 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n39,replicate,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 884110  change replicas (add [(n44,s437):7VOTER_INCOMING] remove [(n41,s404):5VOTER_DEMOTING_LEARNER]): existing descriptor r1081383:‹/Table/106/1/15822{25727306831864-33925856109712}› [(n39,s382):6, (n41,s404):5, (n18,s189):4, (n44,s437):7LEARNER, next=8, gen=161, sticky=9223372036.854775807,2147483647]
18844:I231111 06:18:12.830516 550087248 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n39,replicate,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 884761  error processing replica: change replicas of r1081383 failed: context deadline exceeded
19835:I231111 06:18:40.157775 550320207 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n39,replicate,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 885752  change replicas (add [] remove [(n44,s437):7LEARNER]): existing descriptor r1081383:‹/Table/106/1/15822{25727306831864-33925856109712}› [(n39,s382):6, (n41,s404):5, (n18,s189):4, (n44,s437):7LEARNER, next=8, gen=161, sticky=9223372036.854775807,2147483647]
21244:I231111 06:19:16.875882 550320207 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n39,replicate,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 887161  error processing replica: removing learners from r1081383:‹/Table/106/1/15822{25727306831864-33925856109712}› [(n39,s382):6, (n41,s404):5, (n18,s189):4, (n44,s437):7LEARNER, next=8, gen=161, sticky=9223372036.854775807,2147483647]: change replicas of r1081383 failed: txn exec: context deadline exceeded
26047:I231111 06:21:25.312836 550983277 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n39,replicate,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 891964  change replicas (add [] remove [(n44,s437):7LEARNER]): existing descriptor r1081383:‹/Table/106/1/15822{25727306831864-33925856109712}› [(n39,s382):6, (n41,s404):5, (n18,s189):4, (n44,s437):7LEARNER, next=8, gen=161, sticky=9223372036.854775807,2147483647]
26337:I231111 06:21:30.847886 550983277 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n39,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 892254  proposing SIMPLE(r7) [(n44,s437):7LEARNER]: after=[(n39,s382):6 (n41,s404):5 (n18,s189):4] next=8
n41,s404
10150:I231110 22:42:50.655325 462222788 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n41,s404,r1081383/5:‹/Table/106/1/15822{257…-339…}›] 438259  applying snapshot 4889a900 from (n18,s189):4 at applied index 2827
n18,s189
4973:I231111 05:56:48.295565 130620943 13@kv/kvserver/store_snapshot.go:1863 ⋮ [T1,Vsystem,n18,s189,r1081383/4:‹/Table/106/1/15822{257…-339…}›] 381691  streamed snapshot 9f150162 at applied index 2979 to (n39,s382):6 with 1.8 MiB in 0.06s @ 30 MiB/s: kvs=1417 rangeKVs=0 sharedSSTs=0, rate-limit: 32 MiB/s, queued: 3.28s
9559:I231111 06:00:57.308741 131748325 13@kv/kvserver/store_snapshot.go:1863 ⋮ [T1,Vsystem,n18,s189,r1081383/4:‹/Table/106/1/15822{257…-339…}›] 386277  streamed snapshot 7faca7c7 at applied index 2985 to (n39,s382):6 with 1.8 MiB in 0.05s @ 36 MiB/s: kvs=1417 rangeKVs=0 sharedSSTs=0, rate-limit: 32 MiB/s, queued: 0.07s
19394:I231111 06:11:37.921347 134615050 13@kv/kvserver/store_snapshot.go:1863 ⋮ [T1,Vsystem,n18,s189,r1081383/4:‹/Table/106/1/15822{257…-339…}›] 396112  streamed snapshot 5d47af51 at applied index 3021 to (n39,s382):6 with 1.8 MiB in 0.05s @ 37 MiB/s: kvs=1417 rangeKVs=0 sharedSSTs=0, rate-limit: 32 MiB/s, queued: 0.50s
24932:I231111 06:17:10.675576 135897447 13@kv/kvserver/store_snapshot.go:1863 ⋮ [T1,Vsystem,n18,s189,r1081383/4:‹/Table/106/1/15822{257…-339…}›] 401650  streamed snapshot 290bc9ee at applied index 3041 to (n39,s382):6 with 1.8 MiB in 0.05s @ 36 MiB/s: kvs=1420 rangeKVs=0 sharedSSTs=0, rate-limit: 32 MiB/s, queued: 2.46s
26100:W231111 06:18:28.035623 136184737 13@kv/kvserver/replica_command.go:2908 ⋮ [T1,Vsystem,n18,raftsnapshot,s189,r1081383/4:‹/Table/106/1/15822{257…-339…}›] 402818  attempt 1: delegate snapshot ‹range_id:1081383 coordinator_replica:<node_id:18 store_id:189 replica_id:4 type:VOTER_FULL > recipient_replica:<node_id:44 store_id:437 replica_id:7 type:LEARNER > delegated_sender:<node_id:41 store_id:404 replica_id:5 type:VOTER_FULL > deprecated_priority:RECOVERY term:689 first_index:3077 sender_queue_name:RAFT_SNAPSHOT_QUEUE descriptor_generation:161 queue_on_delegate_len:3 snap_id:1e861500-faf4-4ae8-a165-122af7808a12 › request failed error sending couldn't accept ‹range_id:1081383 coordinator_replica:<node_id:18 store_id:189 replica_id:4 type:VOTER_FULL > recipient_replica:<node_id:44 store_id:437 replica_id:7 type:LEARNER > delegated_sender:<node_id:41 store_id:404 replica_id:5 type:VOTER_FULL > deprecated_priority:RECOVERY term:689 first_index:3077 sender_queue_name:RAFT_SNAPSHOT_QUEUE descriptor_generation:161 queue_on_delegate_len:3 snap_id:1e861500-faf4-4ae8-a165-122af7808a12 ›: sender: (n41,s404):5 is not fit to send snapshot; sender first index: 3077, coordinator log truncation constraint: 3077
26101:I231111 06:18:28.154697 136184791 13@kv/kvserver/store_snapshot.go:1863 ⋮ [T1,Vsystem,n18,s189,r1081383/4:‹/Table/106/1/15822{257…-339…}›] 402819  streamed snapshot 1e861500 at applied index 3077 to (n44,s437):7LEARNER with 1.8 MiB in 0.05s @ 37 MiB/s: kvs=1420 rangeKVs=0 sharedSSTs=0, rate-limit: 32 MiB/s, queued: 0.07s

On minority node n41 (0021):

ubuntu@erikg-crdb232-0021:/mnt/data4/cockroach$ ~/cockroach debug range-descriptors /mnt/data4/cockroach/ | grep -A 2 -B 1 'Local/Range/Table/106/1/1582225727306831864/'
Output
0,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f800017264736300): 1699683422.833100427,0 txn={id=c22060f6 key=/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor iso=Serializable pri=0.00813128 epo=0 ts=1699683422.833100427,0 min=1699683422.833100427,0 seq=1} ts=1699683422.833100427,0 del=false klen=12 vlen=102 mergeTs=<nil> txnDidNotUpdateMeta=true
1699683685.311701993,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f80001726473630017967d4e7587e3e909): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n39,s382):6, (n41,s404):5, (n18,s189):4, next=8, gen=162, sticky=9223372036.854775807,2147483647]

1699683422.833100427,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f80001726473630017967d115896468b09): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n39,s382):6, (n41,s404):5, (n18,s189):4, (n44,s437):7LEARNER, next=8, gen=161, sticky=9223372036.854775807,2147483647]

1699460981.297086781,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795b2c223d9813d09): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n39,s382):6, (n41,s404):5, (n18,s189):4, next=7, gen=160, sticky=9223372036.854775807,2147483647]

1699460980.770933176,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795b2c2047d09b809): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n25,s228):1LEARNER, (n41,s404):5, (n18,s189):4, (n39,s382):6, next=7, gen=159, sticky=9223372036.854775807,2147483647]

1699460979.857520266,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795b2c1ce0b768a09): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n25,s228):1VOTER_DEMOTING_LEARNER, (n41,s404):5, (n18,s189):4, (n39,s382):6VOTER_INCOMING, next=7, gen=158, sticky=9223372036.854775807,2147483647]

1699460979.107784275,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795b2c1a15b665309): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n25,s228):1, (n41,s404):5, (n18,s189):4, (n39,s382):6LEARNER, next=7, gen=157, sticky=9223372036.854775807,2147483647]

1699443122.845031648,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795a28425289ce009): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n25,s228):1, (n41,s404):5, (n18,s189):4, next=6, gen=156, sticky=9223372036.854775807,2147483647]

1699443122.824586021,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795a28423f0a32509): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n25,s228):1, (n40,s400):2LEARNER, (n18,s189):4, (n41,s404):5, next=6, gen=155, sticky=9223372036.854775807,2147483647]

1699443122.817777433,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795a2842388bf1909): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n25,s228):1, (n40,s400):2VOTER_DEMOTING_LEARNER, (n18,s189):4, (n41,s404):5VOTER_INCOMING, next=6, gen=154, sticky=9223372036.854775807,2147483647]

1699443122.784801777,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795a284219193f109): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n25,s228):1, (n40,s400):2, (n18,s189):4, (n41,s404):5LEARNER, next=6, gen=153, sticky=9223372036.854775807,2147483647]

1699443122.766706852,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795a284207d78a409): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n25,s228):1, (n40,s400):2, (n18,s189):4, next=5, gen=152, sticky=9223372036.854775807,2147483647]

1699443122.750157373,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795a2841f80f23d09): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n25,s228):1, (n40,s400):2, (n13,s125):3LEARNER, (n18,s189):4, next=5, gen=151, sticky=9223372036.854775807,2147483647]

1699443122.738361061,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795a2841eccf2e509): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n25,s228):1, (n40,s400):2, (n13,s125):3VOTER_DEMOTING_LEARNER, (n18,s189):4VOTER_INCOMING, next=5, gen=150, sticky=9223372036.854775807,2147483647]

1699443122.705043337,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795a2841cd08f8909): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n25,s228):1, (n40,s400):2, (n13,s125):3, (n18,s189):4LEARNER, next=5, gen=149, sticky=9223372036.854775807,2147483647]

1699443122.670534168,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f8000172647363001795a2841ac1fe1809): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n25,s228):1, (n40,s400):2, (n13,s125):3, next=4, gen=148, sticky=9223372036.854775807,2147483647]

It looks like the intent was for this range descriptor:

1699683422.833100427,0 /Local/Range/Table/106/1/1582225727306831864/RangeDescriptor (0x016b12f289fd15f531ea7e9b97f80001726473630017967d115896468b09): [/Table/106/1/1582225727306831864, /Table/106/1/1582233925856109712)
	Raw:r1081383:/Table/106/1/15822{25727306831864-33925856109712} [(n39,s382):6, (n41,s404):5, (n18,s189):4, (n44,s437):7LEARNER, next=8, gen=161, sticky=9223372036.854775807,2147483647]

Which seems to correspond to these replication log lines.

16200:I231111 06:17:02.833039 550087248 13@kv/kvserver/allocator/plan/replicate.go:457 ⋮ [T1,Vsystem,n39,replicate,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 882117  replacing voter (n41,s404):5 with n44,s437: ‹[no raft progress]›
16201:I231111 06:17:02.839133 550087248 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n39,replicate,s382,r1081383/6:‹/Table/106/1/15822{257…-339…}›] 882118  change replicas (add [(n44,s437):7LEARNER] remove []): existing descriptor r1081383:‹/Table/106/1/15822{25727306831864-33925856109712}› [(n39,s382):6, (n41,s404):5, (n18,s189):4, next=7, gen=160, sticky=9223372036.854775807,2147483647]

Is there any context KV can infer from this? There's very little for storage to go on from n41's engine state or the other nodes' checkpoints, since the original write of the key was ~5 hrs earlier. All history of the internal keys is gone.


kvoli commented Nov 14, 2023

Is there any context KV can infer from this?

The leaseholder n39,s382 allocator attempted to replace n41,s404 with n44,s437 because it saw n41 as either dead or decommissioning. The [no raft progress] means that n39 has the lease but has not yet acquired Raft leadership, probably because the lease was only just transferred to it.

Which is odd, because I don't think s404 was either decommissioning or dead. See these actions, of which only ReplaceDecommissioningVoter and ReplaceDeadVoter would cause a remove+add rather than just an add.

erikgrinaker (Contributor) commented

Any chance this has something to do with #104539?

sumeerbhola (Collaborator) commented

Is there a log entry corresponding to log.Infof(ctx, "applied %s %s(%s)", inSnap, appliedAsWriteStr, logDetails) on n41? It will tell us whether the snapshot was applied as an ingest or written to the memtable.

sumeerbhola (Collaborator) commented

Maybe my previous question is irrelevant. IIUC, the intent is from a RangeDescriptor change at 231111 06:17:02.833039, while the snapshot application on n41 was much earlier, at 231110 22:42:50.655325. And since the RangeDescriptor has been mutated after the spurious intent, that intent must have been resolved. Is this accurate?


jbowens commented Nov 15, 2023

Any chance this has something to do with #104539?

It's possible. I double-checked that our metamorphic tests are exercising those code paths well; a bug in the integration into the metamorphic tests previously obscured an issue with #104539. Based on the intent, I think this key should have been deleted with a SINGLEDEL rather than a DELSIZED, and the SINGLEDEL code paths saw minimal change in #104539. I'm going to keep looking at it from this angle.

And the RangeDescriptor has been mutated after the spurious intent, so that intent must have been resolved. Is this accurate?

That matches my understanding.

From the singledel angle: How do we ensure idempotence of raft application? If committing the batch errors, I guess we need to read the applied index back from the engine to determine where to pick up application?

nvanbenschoten (Member) commented

How do we ensure idempotence of raft application? If committing the batch errors, I guess we need to read the applied index back from the engine to determine where to pick up application?

Correct. Raft application is not immediately fsynced, so we rely on the atomicity of a WriteBatch (containing both the update to the applied index and the write itself) to ensure that a loss of durability rolls back the application of a suffix of the raft log without tearing individual raft log entries. We then read the applied index on startup and begin (re-)applying entries from there.
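The mechanism described above can be sketched as follows (a toy model with illustrative names like applyEntry and recoverApply, not CockroachDB's actual API): each entry's writes commit in the same atomic batch as the applied-index bump, so replay after a crash is idempotent.

```go
package main

import "fmt"

// engine is a toy atomic KV store standing in for Pebble. A batch
// commit is all-or-nothing, mirroring WriteBatch atomicity.
type engine struct{ kv map[string]string }

func (e *engine) commit(batch map[string]string) {
	for k, v := range batch {
		e.kv[k] = v
	}
}

type entry struct {
	index int
	key   string
	value string
}

// applyEntry stages the entry's write together with the applied-index
// update in one batch, so a crash can never persist one without the
// other (no torn entries).
func applyEntry(e *engine, ent entry) {
	e.commit(map[string]string{
		ent.key:        ent.value,
		"appliedIndex": fmt.Sprint(ent.index),
	})
}

// recoverApply reads the durable applied index back from the engine
// and re-applies every log entry above it.
func recoverApply(e *engine, log []entry) {
	durable := 0
	fmt.Sscan(e.kv["appliedIndex"], &durable)
	for _, ent := range log {
		if ent.index > durable {
			applyEntry(e, ent)
		}
	}
}

func main() {
	log := []entry{{1, "a", "1"}, {2, "b", "2"}, {3, "c", "3"}}

	e := &engine{kv: map[string]string{"appliedIndex": "0"}}
	applyEntry(e, log[0])
	applyEntry(e, log[1])
	// Suppose entry 3's batch was lost before it became durable.
	// On restart, appliedIndex reads back as 2, so we resume there.
	recoverApply(e, log)
	fmt.Println(e.kv["appliedIndex"], e.kv["c"]) // prints "3 3"
}
```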


jbowens commented Nov 16, 2023

Miraculously, the MANIFESTs actually capture quite a bit of information in the file boundaries of memtable flushes. The problematic key is /Local/Lock/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor.
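Why the MANIFEST preserves this key at all: a flush's manifest entry records only the resulting sstable's smallest and largest keys, so a key that happens to sort last in a flush survives verbatim as the file's upper boundary. A minimal sketch of that bookkeeping, using plain string ordering and hypothetical keys rather than Pebble's comparator or manifest format:

```go
package main

import (
	"fmt"
	"sort"
)

// flushBoundaries models the one piece of manifest bookkeeping relied
// on here: a flush records only the smallest and largest keys of the
// resulting sstable, not its full contents.
func flushBoundaries(memtableKeys []string) (smallest, largest string) {
	sorted := append([]string(nil), memtableKeys...)
	sort.Strings(sorted)
	return sorted[0], sorted[len(sorted)-1]
}

func main() {
	// Hypothetical pretty-printed keys; the lock-table key sorts last
	// in this toy ordering, so it becomes the file's largest key and
	// is preserved verbatim in the manifest entry.
	keys := []string{
		"b/raft-hard-state",
		"a/liveness-record",
		"z/lock-table/range-descriptor",
	}
	lo, hi := flushBoundaries(keys)
	fmt.Println(lo) // prints "a/liveness-record"
	fmt.Println(hi) // prints "z/lock-table/range-descriptor"
}
```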

n41
On n41, the node with the extra key, we ingested six sstables for this range during a snapshot application:

11463978/38494
  next-file-num: 608508
  last-seq-num:  214764488
  added:         L4 608502:1403<#214764483-#214764483>[/Local/RangeID/1081383/r""/0,0#214764483,RANGEKEYDEL-/Local/RangeID/1081383/s""/0,0#inf,RANGEDEL] (2023-11-10T22:42:50Z)
  added:         L0 608507:1074<#214764484-#214764484>[/Local/RangeID/1081383/u""/0,0#214764484,RANGEDEL-/Local/RangeID/1081383/v""/0,0#inf,RANGEDEL] (2023-11-10T22:42:50Z)
  added:         L3 608503:1990<#214764485-#214764485>[/Local/Range/Table/106/1/1582225727306831864/0,0#214764485,RANGEKEYDEL-/Local/Range/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-10T22:42:50Z)
  added:         L3 608504:1136<#214764486-#214764486>[/Local/Lock/Local/Range/Table/106/1/1582225727306831864/0,0#214764486,RANGEKEYDEL-/Local/Lock/Local/Range/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-10T22:42:50Z)
  added:         L3 608505:1112<#214764487-#214764487>[/Local/Lock/Table/106/1/1582225727306831864/0,0#214764487,RANGEKEYDEL-/Local/Lock/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-10T22:42:50Z)
  added:         L0 608506:1776198<#214764488-#214764488>[/Table/106/1/1582225727306831864/0,0#214764488,RANGEKEYDEL-/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-10T22:42:50Z)

Later, the manifest also captured the problematic key's flush from the memtable, because it happened to be the largest key flushed at the time.

23497698/75055
  log-num:       708753
  next-file-num: 708756
  last-seq-num:  244503908
  added:         L0 708755:25116<#244502908-#244503892>[/Local/RangeID/1317/u/RaftHardState/0,0#244503857,SET-/Local/Lock/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor/03c22060f6faba4ff7baf02d658c6f8bf1#244502947,SET] (2023-11-11T06:17:05Z)

A short time later, we see a different intent for the same range descriptor, again as the largest key flushed:

23621819/75457
  log-num:       709704
  next-file-num: 709707
  last-seq-num:  244553113
  added:         L0 709706:125793<#244548136-#244553107>[/Local/RangeID/881/u/RaftHardState/0,0#244549118,SET-/Local/Lock/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor/03197f483467454a63866e675f30b89730#244552985,SET] (2023-11-11T06:17:58Z)

A short time later, we see a DELSIZED for a third unique intent on the same range descriptor:

23838429/76109
  log-num:       711513
  next-file-num: 711516
  last-seq-num:  244644925
  added:         L0 711515:93871<#244641299-#244644872>[/Local/RangeID/881/u/RaftHardState/0,0#244643573,SET-/Local/Lock/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor/037f1ef289a5604000b814c0c86737a67d#244644142,DELSIZED] (2023-11-11T06:19:41Z)

n39

We see the snapshot ingestion in the checkpoint's manifest.

34930064/110192
  next-file-num: 1114464
  last-seq-num:  261690078
  added:         L0 1114458:1396<#261690073-#261690073>[/Local/RangeID/1081383/r""/0,0#261690073,RANGEKEYDEL-/Local/RangeID/1081383/s""/0,0#inf,RANGEDEL] (2023-11-11T05:56:48Z)
  added:         L0 1114463:1075<#261690074-#261690074>[/Local/RangeID/1081383/u""/0,0#261690074,RANGEDEL-/Local/RangeID/1081383/v""/0,0#inf,RANGEDEL] (2023-11-11T05:56:48Z)
  added:         L5 1114459:1990<#261690075-#261690075>[/Local/Range/Table/106/1/1582225727306831864/0,0#261690075,RANGEKEYDEL-/Local/Range/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T05:56:48Z)
  added:         L5 1114460:1136<#261690076-#261690076>[/Local/Lock/Local/Range/Table/106/1/1582225727306831864/0,0#261690076,RANGEKEYDEL-/Local/Lock/Local/Range/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T05:56:48Z)
  added:         L5 1114461:1112<#261690077-#261690077>[/Local/Lock/Table/106/1/1582225727306831864/0,0#261690077,RANGEKEYDEL-/Local/Lock/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T05:56:48Z)
  added:         L5 1114462:1861810<#261690078-#261690078>[/Table/106/1/1582225727306831864/0,0#261690078,RANGEKEYDEL-/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T05:56:48Z)

And then 4 minutes later, again?

34930064/110192
  next-file-num: 1114464
  last-seq-num:  261690078
  added:         L0 1114458:1396<#261690073-#261690073>[/Local/RangeID/1081383/r""/0,0#261690073,RANGEKEYDEL-/Local/RangeID/1081383/s""/0,0#inf,RANGEDEL] (2023-11-11T05:56:48Z)
  added:         L0 1114463:1075<#261690074-#261690074>[/Local/RangeID/1081383/u""/0,0#261690074,RANGEDEL-/Local/RangeID/1081383/v""/0,0#inf,RANGEDEL] (2023-11-11T05:56:48Z)
  added:         L5 1114459:1990<#261690075-#261690075>[/Local/Range/Table/106/1/1582225727306831864/0,0#261690075,RANGEKEYDEL-/Local/Range/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T05:56:48Z)
  added:         L5 1114460:1136<#261690076-#261690076>[/Local/Lock/Local/Range/Table/106/1/1582225727306831864/0,0#261690076,RANGEKEYDEL-/Local/Lock/Local/Range/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T05:56:48Z)
  added:         L5 1114461:1112<#261690077-#261690077>[/Local/Lock/Table/106/1/1582225727306831864/0,0#261690077,RANGEKEYDEL-/Local/Lock/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T05:56:48Z)
  added:         L5 1114462:1861810<#261690078-#261690078>[/Table/106/1/1582225727306831864/0,0#261690078,RANGEKEYDEL-/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T05:56:48Z)

And then 15 minutes later (a few minutes before the problematic intent is written), again:

38449262/120122
  next-file-num: 1142136
  last-seq-num:  262506327
  added:         L4 1142129:1390<#262506322-#262506322>[/Local/RangeID/1081383/r""/0,0#262506322,RANGEKEYDEL-/Local/RangeID/1081383/s""/0,0#inf,RANGEDEL] (2023-11-11T06:11:37Z)
  added:         L4 1142134:1075<#262506323-#262506323>[/Local/RangeID/1081383/u""/0,0#262506323,RANGEDEL-/Local/RangeID/1081383/v""/0,0#inf,RANGEDEL] (2023-11-11T06:11:38Z)
  added:         L4 1142130:1990<#262506324-#262506324>[/Local/Range/Table/106/1/1582225727306831864/0,0#262506324,RANGEKEYDEL-/Local/Range/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T06:11:37Z)
  added:         L4 1142131:1136<#262506325-#262506325>[/Local/Lock/Local/Range/Table/106/1/1582225727306831864/0,0#262506325,RANGEKEYDEL-/Local/Lock/Local/Range/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T06:11:37Z)
  added:         L4 1142132:1112<#262506326-#262506326>[/Local/Lock/Table/106/1/1582225727306831864/0,0#262506326,RANGEKEYDEL-/Local/Lock/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T06:11:37Z)
  added:         L4 1142133:1861810<#262506327-#262506327>[/Table/106/1/1582225727306831864/0,0#262506327,RANGEKEYDEL-/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T06:11:38Z)

And then ~6 minutes later (~8 seconds after the problematic intent's timestamp), again:

39646428/123317
  next-file-num: 1151938
  last-seq-num:  262824530
  added:         L0 1151932:1390<#262824524-#262824524>[/Local/RangeID/1081383/r""/0,0#262824524,RANGEKEYDEL-/Local/RangeID/1081383/s""/0,0#inf,RANGEDEL] (2023-11-11T06:17:10Z)
  added:         L0 1151937:1075<#262824525-#262824525>[/Local/RangeID/1081383/u""/0,0#262824525,RANGEDEL-/Local/RangeID/1081383/v""/0,0#inf,RANGEDEL] (2023-11-11T06:17:10Z)
  added:         L0 1151933:2119<#262824526-#262824526>[/Local/Range/Table/106/1/1582225727306831864/0,0#262824526,RANGEKEYDEL-/Local/Range/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T06:17:10Z)
  added:         L5 1151934:1326<#262824527-#262824527>[/Local/Lock/Local/Range/Table/106/1/1582225727306831864/0,0#262824527,RANGEKEYDEL-/Local/Lock/Local/Range/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T06:17:10Z)
  added:         L5 1151935:1112<#262824528-#262824528>[/Local/Lock/Table/106/1/1582225727306831864/0,0#262824528,RANGEKEYDEL-/Local/Lock/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T06:17:10Z)
  added:         L4 1151936:1861810<#262824529-#262824529>[/Table/106/1/1582225727306831864/0,0#262824529,RANGEKEYDEL-/Table/106/1/1582233925856109712/0,0#inf,RANGEDEL] (2023-11-11T06:17:10Z)

Then approximately 30 seconds later, we actually flush a memtable which contains the intent 🤔 :

39776053/123654
  log-num:       1153028
  next-file-num: 1153032
  last-seq-num:  262851090
  added:         L0 1153031:101982<#262847562-#262851076>[/Local/RangeID/86/u/RaftHardState/0,0#262848894,SET-/Local/Lock/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor/03c22060f6faba4ff7baf02d658c6f8bf1#262847744,SET] (2023-11-11T06:17:42Z)

Then ~13 seconds later, we see a memtable flush containing the SINGLEDEL that deletes this intent:

39832572/123828
  log-num:       1153502
  next-file-num: 1153505
  last-seq-num:  262862907
  added:         L0 1153504:126116<#262858156-#262862901>[/Local/RangeID/86/u/RaftHardState/0,0#262862517,SET-/Local/Lock/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor/03c22060f6faba4ff7baf02d658c6f8bf1#262862525,SINGLEDEL] (2023-11-11T06:17:55Z)
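For context on why this SINGLEDEL matters, here is a minimal, self-contained sketch (not Pebble's implementation) of SINGLEDEL merge semantics: a SINGLEDEL is consumed by the single newest SET beneath it, so if the key was ever SET more than once, compaction can resurrect the older value:

```go
package main

import "fmt"

// entry is an illustrative LSM point operation for a single key.
type entry struct {
	kind  string // "SET" or "SINGLEDEL"
	value string
}

// compact applies LSM-style merging semantics to a newest-first history
// of entries for one key, returning the visible value (ok=false if the
// key is deleted or absent).
func compact(history []entry) (string, bool) {
	pendingSingleDel := false
	for _, e := range history { // newest first
		switch e.kind {
		case "SINGLEDEL":
			pendingSingleDel = true
		case "SET":
			if pendingSingleDel {
				// The SINGLEDEL consumes exactly this one SET.
				pendingSingleDel = false
				continue
			}
			return e.value, true
		}
	}
	return "", false
}

func main() {
	// Correct use: exactly one SET beneath the SINGLEDEL -> key is gone.
	_, ok := compact([]entry{{"SINGLEDEL", ""}, {"SET", "intent"}})
	fmt.Println(ok) // false

	// Invariant violation: the same intent key written twice (e.g. by a
	// hypothetical replayed raft application) -> the older SET resurfaces.
	v, ok := compact([]entry{{"SINGLEDEL", ""}, {"SET", "intent-v2"}, {"SET", "intent-v1"}})
	fmt.Println(v, ok) // intent-v1 true
}
```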

n19

The relevant range's start key doesn't appear in any of the manifest entries.

@jbowens
Collaborator

jbowens commented Nov 16, 2023

Not sure there's too much actionable there, although there's confirmation that a SINGLEDEL was used to delete the intent. I'm not sure what to make of the repeated snapshot applications for this range on n39.

@sumeerbhola
Collaborator

sumeerbhola commented Nov 16, 2023

I'm not sure what to make of the repeated snapshot applications for this range on n39.

Can we tell from the logs which nodes were the source of these snapshots?
And who was the leaseholder when this intent was being written and resolved?

@sumeerbhola
Collaborator

sumeerbhola commented Nov 16, 2023

I think we need to clean up the comments around singleDelOptimizationHelper. They seem very stale and predate SETWITHDEL.

@nvanbenschoten Can we infer anything about whether there could have been multiple writes to the same key by a txn based on the MVCCMetadata value printed:

txn={id=c22060f6 key=/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor iso=Serializable pri=0.00813128 epo=0 ts=1699683422.833100427,0 min=1699683422.833100427,0 seq=1} ts=1699683422.833100427,0 del=false klen=12 vlen=102 mergeTs= txnDidNotUpdateMeta=true

seq starts with 0, yes? So seq=1 doesn't preclude that this key was also written at seq=0?
singleDelOptimizationHelper._hasIgnoredSeqs is initialized with len(roachpb.LockUpdate.IgnoredSeqNums) -- can we have a situation where one LockUpdate before the txn commits cleans up the ignored seqnums and then a later LockUpdate does not bother specifying the IgnoredSeqNums? Or is this the full list of ignored seqnums tracked by the transaction?

If raft application succeeded but some code in the KV layer thought it had failed and didn't advance the in-memory Replica.mu.state.RaftAppliedIndex, and did not crash the node, isn't it possible that the application could be repeated? I worry that the invariants needed for this optimization cut across multiple layers, so we would be safer by turning off the SingleDel optimization.

@sumeerbhola
Collaborator

If raft application succeeded but some code in the KV layer thought it had failed and didn't advance the in-memory Replica.mu.state.RaftAppliedIndex, and did not crash the node, isn't it possible that the application could be repeated? I worry that the invariants needed for this optimization cut across multiple layers, so we would be safer by turning off the SingleDel optimization.

I realize state machine replication in general does not require idempotent application, but I am curious what would go wrong in our implementation. I think there are at least two kinds of state that will diverge: (a) timeseries, where we use MERGE, (b) MVCCStats where we carry the delta and below raft do the addition in replicaAppBatch.stageTrivialReplicatedEvalResult and then place it in the Pebble batch in replicaAppBatch.addAppliedStateKeyToBatch.
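The two divergence cases named above can be sketched as follows (illustrative names only, not CockroachDB's actual MERGE or MVCCStats code); in both, applying the same command twice leaves different state than applying it once:

```go
package main

import "fmt"

// mergeTS models a Pebble MERGE on a timeseries value: append the operand.
// A replayed application appends the same operand a second time.
func mergeTS(existing []string, operand string) []string {
	return append(existing, operand)
}

// applyStatsDelta models below-raft MVCCStats application: the command
// carries a delta, and application adds it to the stored total. A replay
// double-counts the delta.
func applyStatsDelta(total, delta int64) int64 {
	return total + delta
}

func main() {
	// (a) timeseries MERGE: replay duplicates the sample.
	ts := mergeTS(nil, "sample@t1")
	ts = mergeTS(ts, "sample@t1") // hypothetical replayed application
	fmt.Println(len(ts))          // 2 instead of 1

	// (b) MVCCStats delta: replay double-counts.
	lb := applyStatsDelta(100, 10)
	lb = applyStatsDelta(lb, 10) // hypothetical replayed application
	fmt.Println(lb)              // 120 instead of 110
}
```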

@nvanbenschoten
Member

Can we infer anything about whether there could have been multiple writes to the same key by a txn based on the MVCCMetadata value printed:

txn={id=c22060f6 key=/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor iso=Serializable pri=0.00813128 epo=0 ts=1699683422.833100427,0 min=1699683422.833100427,0 seq=1} ts=1699683422.833100427,0 del=false klen=12 vlen=102 mergeTs= txnDidNotUpdateMeta=true

seq starts with 0, yes? So seq=1 doesn't preclude that this key was also written at seq=0?

seq starts with 0, but is incremented to 1 before the first write. So 0 is used to allow a transaction to read below all of its writes. Here's the code that handles this. It checks out that the range descriptor write has sequence 1, because that's the first key this transaction writes to.

So (epo=0,seq=1) does preclude that key having been written to before. It doesn't preclude the key being written to again by the same transaction (and hypothetically incorrectly keeping the txnDidNotUpdateMeta=true flag), either at a later sequence number or a later epoch. I don't see any other writes to the same key by the transaction, so I don't see where a write with a higher sequence would come from. We also don't see evidence of this transaction restarting at a later epoch, but we can't yet exclude that possibility.

singleDelOptimizationHelper._hasIgnoredSeqs is initialized with len(roachpb.LockUpdate.IgnoredSeqNums) -- can we have a situation where one LockUpdate before the txn commits cleans up the ignored seqnums and then a later LockUpdate does not bother specifying the IgnoredSeqNums? Or is this the full list of ignored seqnums tracked by the transaction?

Once an ignored seq num is established by a transaction, it will be included in all later LockUpdates. We also don't see any use of savepoints in this transaction, so ignored seq nums should not be in play.
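Putting the conditions from this thread together, a hedged sketch of the SINGLEDEL-eligibility decision might look like the following. The names mirror the discussion (txnDidNotUpdateMeta, IgnoredSeqNums), but this is not the actual singleDelOptimizationHelper:

```go
package main

import "fmt"

// lockUpdate carries the full list of ignored seqnums tracked by the
// transaction, per the discussion above.
type lockUpdate struct {
	ignoredSeqNums []int
}

// canSingleDelete returns whether intent resolution may use SINGLEDEL:
// only when the intent's meta key is certain to have received exactly
// one SET. Ignored seqnums (savepoint rollbacks) or a later epoch imply
// the meta may have been rewritten, so a plain DEL must be used.
func canSingleDelete(txnDidNotUpdateMeta bool, epoch int, lu lockUpdate) bool {
	return txnDidNotUpdateMeta && epoch == 0 && len(lu.ignoredSeqNums) == 0
}

func main() {
	fmt.Println(canSingleDelete(true, 0, lockUpdate{}))                       // true
	fmt.Println(canSingleDelete(true, 0, lockUpdate{ignoredSeqNums: []int{3}})) // false
	fmt.Println(canSingleDelete(false, 0, lockUpdate{}))                      // false
}
```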

If raft application succeeded but some code in the KV layer thought it had failed and didn't advance the in-memory Replica.mu.state.RaftAppliedIndex, and did not crash the node, isn't it possible that the application could be repeated?

It seems difficult for a fully repeated raft application to go unnoticed, but it depends on how much of the in-memory state from the prior application is lost. Replica.mu.state.RaftAppliedIndex would need to be lost, as would Replica.mu.state.LeaseAppliedIndex, and ReplicaState.RaftClosedTimestamp.

I worry that the invariants needed for this optimization cut across multiple layers, so we would be safer by turning off the SingleDel optimization.

Agreed. Although this isn't the only operation which requires exactly-once application to avoid replica divergence. Another example is the Merge operation, which we write in the tsdb.

@nvanbenschoten
Member

We also don't see evidence of this transaction restarting at a later epoch, but we can't yet exclude that possibility.

Actually, we can exclude that possibility. The log line on replica_command.go:2352 is within the txn's retry loop. If we don't see multiple log lines for this transaction, it did not retry at epoch 1.

@ajstorm ajstorm added the O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster label Nov 17, 2023
@jbowens
Collaborator

jbowens commented Nov 17, 2023

Another example is the Merge operation, which we write in the tsdb.

Am I understanding correctly that the MergeTimestamp field is intended to help detect the case where a Merge operation is committed multiple times? If so, do you know where we use that field to actually detect the violation? I couldn't find it.

I agree that if we learn nothing else and decide we cannot continue to block a beta, we should disable the single delete optimization. But in the interim, I think we should keep it enabled and add additional misuse detection:

  1. We could merge db: add CanDeterministicallySingleDelete pebble#3077 and use it to validate that there's at most one SET beneath the intent in MVCCResolveWriteIntent.
  2. We could add an off-by-default cluster setting that enables an extra layer of validation during memtable application: eg, validate that there's at most one SET beneath every SINGLEDEL.

Anything else?
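Validation idea (2) could be sketched as a check over the full newest-first history of point keys for a single key (illustrative only; not Pebble's actual CanDeterministicallySingleDelete):

```go
package main

import "fmt"

// singleDelValid takes the newest-first kinds of all point operations on
// one key and reports whether every SINGLEDEL has at most one SET
// beneath it. A SINGLEDEL over two or more SETs could resurrect the
// older value after compaction.
func singleDelValid(kinds []string) bool {
	setsBelow := 0
	for i := len(kinds) - 1; i >= 0; i-- { // walk oldest -> newest
		switch kinds[i] {
		case "SET":
			setsBelow++
		case "SINGLEDEL":
			if setsBelow > 1 {
				return false // could expose an older SET
			}
			setsBelow = 0 // the one SET beneath is consumed
		}
	}
	return true
}

func main() {
	fmt.Println(singleDelValid([]string{"SINGLEDEL", "SET"}))        // true
	fmt.Println(singleDelValid([]string{"SINGLEDEL", "SET", "SET"})) // false
}
```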

Within Pebble I want to give the metamorphic tests' use of SINGLEDEL a more critical look, to see if we might be too conservative in writing SINGLEDELs such that we're leaving subtle interactions unexercised.

@nvanbenschoten
Member

Am I understanding correctly that the MergeTimestamp field is intended to help in detecting this case where a Merge operation is committed multiple times? If so, do you know where we use that field to actually detect the violation, because I couldn't find it.

I wasn't aware of this. It was introduced in 1fe6808, where we can see that the MergeValues function short-circuits on a replay. It was then removed by b1373fb, which stated that the replay protection "did not work".

This was all before proposer-evaluated KV, which pretty dramatically changed how request evaluation and application interact with Raft. That RFC contains a discussion about replay protection, alluding to the protection we now have with the RaftAppliedIndex and LeaseAppliedIndex. The RaftAppliedIndex prevents any single log entry from applying twice, and the LeaseAppliedIndex prevents any single proposal from resulting in more than one log entry which isn't discarded during application.
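A sketch of that LeaseAppliedIndex-based replay protection, with illustrative names: each proposal carries the lease applied index assigned to it, and during application any command at or below the replica's current index is discarded rather than reapplied:

```go
package main

import "fmt"

// replicaState holds the illustrative per-replica applied-index state.
type replicaState struct{ leaseAppliedIndex uint64 }

// apply attempts to apply a command whose proposal was assigned cmdLAI.
// A replayed or reordered proposal (cmdLAI at or below the current
// index) is rejected instead of being applied a second time.
func apply(s *replicaState, cmdLAI uint64) bool {
	if cmdLAI <= s.leaseAppliedIndex {
		return false // replay: discard during application
	}
	s.leaseAppliedIndex = cmdLAI
	return true
}

func main() {
	s := &replicaState{}
	fmt.Println(apply(s, 1)) // true
	fmt.Println(apply(s, 2)) // true
	fmt.Println(apply(s, 2)) // false: the same proposal cannot apply twice
}
```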

I'm not aware of any changes in this area recently. @erikgrinaker are you?

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Dec 15, 2023
Handlers are added for SingleDeleteInvariantViolationCallback and
IneffectualSingleDeleteCallback options in Pebble. By default, they crash
the process. If crashing is turned off, they increment metrics,
storage.single-delete.invariant-violation and
storage.single-delete.ineffectual.

Informs cockroachdb#115881
Informs cockroachdb#114421

Epic: none

Release note (ops change): The cluster settings
storage.single_delete.crash_on_invariant_violation.enabled and
storage.single_delete.crash_on_ineffectual.enabled default to true
since these can be indicative of a serious bug that could corrupt
data.
craig bot pushed a commit that referenced this issue Dec 15, 2023
116544: storage: add single delete callbacks to crash, or increment metrics r=jbowens a=sumeerbhola

Handlers are added for SingleDeleteInvariantViolationCallback and IneffectualSingleDeleteCallback options in Pebble. By default, they crash the process. If crashing is turned off, they increment metrics, storage.single-delete.invariant-violation and
storage.single-delete.ineffectual.

Informs #115881
Informs #114421

Epic: none

Release note (ops change): The cluster settings
storage.single_delete.crash_on_invariant_violation.enabled and storage.single_delete.crash_on_ineffectual.enabled default to true since these can be indicative of a serious bug that could corrupt data.

116549: sql,opt: don't validate AOST during session migration r=rafiss a=rafiss

When re-preparing a statement for a session migration, we want to skip evaluating and validating the AS OF SYSTEM TIME clause. During session migrations, we know that the statement will just be prepared, and not executed, and each statement could have different AOST timestamps. Therefore it is incorrect to evaluate the AOST clause and fix the transaction timestamp.

No release note since this fixes a bug that only affects Serverless. See [here](https://cockroachlabsgcp.splunkcloud.com/en-US/app/search/search?earliest=1702141875&latest=1702401075&q=search%20index%3Dcc-app-crdb*%20%20%22could%20not%20prepare%20statement%20during%20session%20migration%22%20host_ip%3D%2210.4.0.19%22&display.page.search.mode=verbose&dispatch.sample_ratio=1&display.general.type=events&display.page.search.tab=events&workload_pool=standard_perf&sid=1702579453.274591) for an example of the error.

Epic: None
Release note: None

116567: roachtest: run workload after 8tb OR roachtest r=adityamaru a=msbutler

Previously the tpce workload would not run after the 8tb OR test, because of a misconfiguration in the test specs. This fixes this bug.

Epic: none

Release note: none

116580: roachtest: fix mininum version calculation in `UploadWorkload` r=rickystewart,DarrylWong a=renatolabs

In ARM64 builds, the `workload` binary is only available in 23.2+.

Epic: none

Release note: None

116587: roachtest: don't run `schemachange` workload with upgrade migrations r=annrpom,DarrylWong a=renatolabs

This has been flaking on all branches.

Informs: #116586.
Fixes: #116304.
Fixes: #116425.
Fixes: #116357.

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Renato Costa <renato@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Dec 15, 2023
Handlers are added for SingleDeleteInvariantViolationCallback and
IneffectualSingleDeleteCallback options in Pebble. By default, they crash
the process. If crashing is turned off, they increment metrics,
storage.single-delete.invariant-violation and
storage.single-delete.ineffectual.

Informs #115881
Informs #114421

Epic: none

Release note (ops change): The cluster settings
storage.single_delete.crash_on_invariant_violation.enabled and
storage.single_delete.crash_on_ineffectual.enabled default to true
since these can be indicative of a serious bug that could corrupt
data.
@sumeerbhola sumeerbhola removed branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 blocks-23.2.0-beta.1 blocks-23.2.0-rc.1 labels Dec 18, 2023
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Dec 20, 2023
As elaborated in code comments, these can be false positives caused by
delete-only compactions. The metrics are not disabled, under the rare
chance that we see a problem in a cluster and one of the metrics (if
it happens to not be a false positive) gives us a hint about the cause
of the problem. For the same reason, we log such callbacks every 5
minutes, so we can correlate the key in such logs with an actual
problem. False positives should be rare, especially for the invariant
violation callback.

Informs cockroachdb#115881
Informs cockroachdb#114421

Epic: none

Release note (ops change): The cluster settings
storage.single_delete.crash_on_invariant_violation.enabled and
storage.single_delete.crash_on_ineffectual.enabled are disabled and
must not be enabled.
jbowens pushed a commit to jbowens/cockroach that referenced this issue Dec 20, 2023
As elaborated in code comments, these can be false positives caused by
delete-only compactions. The metrics are not disabled, under the rare
chance that we see a problem in a cluster and one of the metrics (if
it happens to not be a false positive) gives us a hint about the cause
of the problem. For the same reason, we log such callbacks every 5
minutes, so we can correlate the key in such logs with an actual
problem. False positives should be rare, especially for the invariant
violation callback.

Informs cockroachdb#115881
Informs cockroachdb#114421

Epic: none

Release note (ops change): The cluster settings
storage.single_delete.crash_on_invariant_violation.enabled and
storage.single_delete.crash_on_ineffectual.enabled are disabled and
must not be enabled.
craig bot pushed a commit that referenced this issue Dec 21, 2023
116889: storage: disable crashing for single delete callbacks r=sumeerbhola a=sumeerbhola

As elaborated in code comments, these can be false positives caused by delete-only compactions. The metrics are not disabled, under the rare chance that we see a problem in a cluster and one of the metrics (if it happens to not be a false positive) gives us a hint about the cause of the problem. For the same reason, we log such callbacks every 5 minutes, so we can correlate the key in such logs with an actual problem. False positives should be rare, especially for the invariant violation callback.

Informs #115881
Informs #114421

Epic: none

Release note (ops change): The cluster settings
storage.single_delete.crash_on_invariant_violation.enabled and storage.single_delete.crash_on_ineffectual.enabled are disabled and must not be enabled.

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Dec 21, 2023
As elaborated in code comments, these can be false positives caused by
delete-only compactions. The metrics are not disabled, under the rare
chance that we see a problem in a cluster and one of the metrics (if
it happens to not be a false positive) gives us a hint about the cause
of the problem. For the same reason, we log such callbacks every 5
minutes, so we can correlate the key in such logs with an actual
problem. False positives should be rare, especially for the invariant
violation callback.

Informs #115881
Informs #114421

Epic: none

Release note (ops change): The cluster settings
storage.single_delete.crash_on_invariant_violation.enabled and
storage.single_delete.crash_on_ineffectual.enabled are disabled and
must not be enabled.
craig bot pushed a commit that referenced this issue Jan 2, 2024
116319: kvnemesis: add Raft application assertions r=erikgrinaker a=erikgrinaker

This adds basic kvnemesis assertions for Raft command application:

* A request is only applied once, at a single log index and lease applied index across all replicas (i.e. no double-applies or replays).

* Commands do not regress the Raft index/term, lease applied index, and closed timestamp (assuming no node restarts).

* All replicas apply the same commands in the same order at the same positions.

Resolves #115771.
Touches #114421.
Epic: none
Release note: None

117014: kv: add mutex tracing to tscache r=erikgrinaker a=nvanbenschoten

We added the ability to trace mutex acquisition in afb73d2. This commit uses this capability to trace the mutex acquisition the timestamp cache.

Epic: None
Release note: None

117138: sctest: remove t.Parallel calls and cleanup r=rafiss a=rafiss

In 943a7fd and f397c13 these tests were made parallel. This makes debugging harder, and now that the tests run in a remote execution environment, the speed benefits seem dubious. Let's make it non-parallel to see if the tests become more stable.

informs #116740
informs #117103
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
@erikgrinaker
Contributor

We've added all of the planned assertions and test coverage to try to track this down, but so far haven't found any problem. I'm going to close this out as not actionable for now.

(Deprecated) Storage automation moved this from In Progress (this milestone) to Done Jan 11, 2024