storage,kv: investigate overlap memtable during snapshot application #99273

Open
jbowens opened this issue Mar 22, 2023 · 24 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-storage Relating to our storage engine (Pebble) on-disk storage. C-investigation Further steps needed to qualify. C-label will change. O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster T-storage Storage Team

Comments

@jbowens
Collaborator

jbowens commented Mar 22, 2023

In the 23.1 test cluster, we're observing higher-than-expected exercising of the new flushable ingest code path. This code path is triggered when one or more of the sstables in an ingest overlap one or more of the memtables. In 22.2 and earlier, an ingest would flush the memtable, wait for the flush to complete, and then proceed. In 23.1 the ingest writes a record committing the ingest to the WAL and layers the ingested sstables onto the 'flushable' queue alongside the memtables and large batches.

The flushable ingest numbers we're observing indicate that many snapshot applications overlap a memtable, forcing memtable rotation and flushes. This is unexpected, because we thought the ingested sstables should have narrow keyspans local to the associated KV range. We expected that the memtable would not contain any keys that fall within the new replica's six keyspans.
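To make the mechanism concrete, here is a minimal, self-contained Go sketch of the decision described above. It is not Pebble's actual code: the overlap check is approximated by scanning an in-memory key list rather than seeking within a real memtable, and the WAL record is elided.

package main

import "fmt"

// sstable is a simplified ingested table described by its user-key bounds.
type sstable struct{ smallest, largest string }

// flushable is the interface shared by memtables and flushable ingests in the queue.
type flushable interface{ kind() string }

type memtable struct{ keys []string }

func (m *memtable) kind() string { return "memtable" }

type ingestedSSTs struct{ tables []sstable }

func (i *ingestedSSTs) kind() string { return "ingested-ssts" }

type db struct {
	mem            *memtable
	flushableQueue []flushable
}

// overlapsMemtable stands in for the real check, which seeks inside the
// memtable for any key within a table's bounds.
func (d *db) overlapsMemtable(ssts []sstable) bool {
	for _, t := range ssts {
		for _, k := range d.mem.keys {
			if k >= t.smallest && k < t.largest { // end bound treated as exclusive
				return true
			}
		}
	}
	return false
}

// ingest contrasts the two paths: no overlap ingests directly into the LSM;
// overlap takes the 23.1 flushable path instead of the 22.2 flush-and-wait path.
func (d *db) ingest(ssts []sstable) {
	if !d.overlapsMemtable(ssts) {
		fmt.Println("no overlap: ingest directly into the LSM")
		return
	}
	d.flushableQueue = append(d.flushableQueue, &ingestedSSTs{tables: ssts})
	fmt.Println("overlap: queued as a flushable ingest; memtable is rotated and flushed later")
}

func main() {
	d := &db{mem: &memtable{keys: []string{"/Local/RangeID/191853/u/RaftHardState"}}}
	d.ingest([]sstable{{smallest: "/Local/RangeID/191853/u", largest: "/Local/RangeID/191853/v"}})
}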

Here are a few example version edits pulled from the test cluster.

61234027/158636
  log-num:       2073703
  next-file-num: 2073724
  last-seq-num:  15888760204
  added:         L0 2073696:1535<#15888757362-#15888757362>[/Local/RangeID/191853/r""/0,0#15888757362,RANGEKEYDEL-/Local/RangeID/191853/s""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073701:1180<#15888757363-#15888757363>[/Local/RangeID/191853/u""/0,0#15888757363,RANGEDEL-/Local/RangeID/191853/v""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073697:2387<#15888757364-#15888757364>[/Local/Range/Table/1199/1/2073185522769554992/0,0#15888757364,RANGEKEYDEL-/Local/Range/Table/1199/1/2110042154285458192/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073698:1248<#15888757365-#15888757365>[/Local/Lock/Intent/Local/Range/Table/1199/1/2073185522769554992/0,0#15888757365,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/1199/1/2110042154285458192/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073699:1224<#15888757366-#15888757366>[/Local/Lock/Intent/Table/1199/1/2073185522769554992/0,0#15888757366,RANGEKEYDEL-/Local/Lock/Intent/Table/1199/1/2110042154285458192/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L6 2073700:1200<#15888757367-#15888757367>[/Table/1199/1/2073185522769554992/0,0#15888757367,RANGEKEYDEL-/Table/1199/1/2110042154285458192/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  
61234765/158637
  next-file-num: 2073734
  last-seq-num:  15888760945
  added:         L0 2073717:1540<#15888760939-#15888760939>[/Local/RangeID/191841/r""/0,0#15888760939,RANGEKEYDEL-/Local/RangeID/191841/s""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073722:1179<#15888760940-#15888760940>[/Local/RangeID/191841/u""/0,0#15888760940,RANGEDEL-/Local/RangeID/191841/v""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073718:2938<#15888760941-#15888760941>[/Local/Range/Table/1199/1/2202183733075216192/0,0#15888760941,RANGEKEYDEL-/Local/Range/Table/1199/1/2220612048833167792/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073719:1248<#15888760942-#15888760942>[/Local/Lock/Intent/Local/Range/Table/1199/1/2202183733075216192/0,0#15888760942,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/1199/1/2220612048833167792/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073720:1224<#15888760943-#15888760943>[/Local/Lock/Intent/Table/1199/1/2202183733075216192/0,0#15888760943,RANGEKEYDEL-/Local/Lock/Intent/Table/1199/1/2220612048833167792/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L6 2073721:1200<#15888760944-#15888760944>[/Table/1199/1/2202183733075216192/0,0#15888760944,RANGEKEYDEL-/Table/1199/1/2220612048833167792/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  
61235499/158638
  log-num:       2073723
  next-file-num: 2073734
  last-seq-num:  15888761334
  added:         L0 2073725:159667<#15888757368-#15888760198>[/Local/RangeID/2/u/RaftHardState/0,0#15888760084,SET-/Table/13/1/2023-03-22T17:47:10.587605Z/850122606474362886/6/1/1679507230.587605315,0#15888759341,SET] (2023-03-22T17:47:10Z)
  
61235607/158639
  log-num:       2073724
  next-file-num: 2073734
  last-seq-num:  15888761346
  added:         L0 2073711:1539<#15888760199-#15888760199>[/Local/RangeID/191857/r""/0,0#15888760199,RANGEKEYDEL-/Local/RangeID/191857/s""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073716:1179<#15888760200-#15888760200>[/Local/RangeID/191857/u""/0,0#15888760200,RANGEDEL-/Local/RangeID/191857/v""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073712:1828<#15888760201-#15888760201>[/Local/Range/Table/1199/1/-3381595941584118608/0,0#15888760201,RANGEKEYDEL-/Local/Range/Table/1199/1/-3363167625826167008/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073713:1248<#15888760202-#15888760202>[/Local/Lock/Intent/Local/Range/Table/1199/1/-3381595941584118608/0,0#15888760202,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/1199/1/-3363167625826167008/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073714:1224<#15888760203-#15888760203>[/Local/Lock/Intent/Table/1199/1/-3381595941584118608/0,0#15888760203,RANGEKEYDEL-/Local/Lock/Intent/Table/1199/1/-3363167625826167008/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L6 2073715:1200<#15888760204-#15888760204>[/Table/1199/1/-3381595941584118608/0,0#15888760204,RANGEKEYDEL-/Table/1199/1/-3363167625826167008/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)

There's no specific recording of which version edits correspond to flushable ingests that overlap with the memtable, but I believe all the ones that ratchet log-num are, since only flushes should be ratcheting log-num.
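For reference, here is a rough helper for spotting those: it assumes the textual manifest-dump format shown above (read on stdin) and prints only the version edits that ratchet log-num.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 1<<20), 1<<20)
	var edit []string
	// emit prints the accumulated version edit if it ratchets log-num.
	emit := func() {
		for _, l := range edit {
			if strings.Contains(l, "log-num:") {
				fmt.Println(strings.Join(edit, "\n"))
				break
			}
		}
		edit = edit[:0]
	}
	for sc.Scan() {
		line := sc.Text()
		// A new version edit starts at an unindented "offset/index" header line.
		if line != "" && !strings.HasPrefix(line, " ") && strings.Contains(line, "/") {
			emit()
		}
		edit = append(edit, line)
	}
	emit()
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}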

Jira issue: CRDB-25798

Epic CRDB-27235

@jbowens jbowens added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Mar 22, 2023
@blathers-crl blathers-crl bot added this to Incoming in Storage Mar 22, 2023
@jbowens jbowens removed the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 22, 2023
@jbowens
Collaborator Author

jbowens commented Mar 22, 2023

One immediate observation: we write RANGEKEYDELs for every keyspace other than the local unreplicated range ID keyspace: /Local/RangeID/191857/u""/. Is that necessary? It would be mildly more efficient to omit RANGEKEYDELs for any keyspaces that can never contain RANGEKEYSETs.

@jbowens
Collaborator Author

jbowens commented Mar 22, 2023

Another observation: the end boundaries are all exclusive range deletion or range key sentinels. I wonder if we're handling them appropriately when calculating overlap?

@bananabrick
Contributor

bananabrick commented Mar 22, 2023

the end boundaries are all exclusive range deletion or range key sentinels. I wonder if we're handling them appropriately when calculating overlap?

https://github.com/cockroachdb/pebble/blob/master/ingest.go#L413. I gave this and the sstableKeyCompare function a look. I think the behaviour is as expected.
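For anyone following along, here is a self-contained sketch of the boundary handling in question. It paraphrases (rather than quotes) the comparison logic to show why a point key exactly at a "#inf,RANGEDEL" end boundary does not count as overlapping: when user keys are equal, the exclusive sentinel sorts first.

package main

import (
	"fmt"
	"strings"
)

type internalKey struct {
	userKey           string
	exclusiveSentinel bool // e.g. a RANGEDEL/RANGEKEYDEL end bound with seqnum "inf"
}

// compareBound orders keys by user key; on a tie, an exclusive sentinel sorts
// before a real key at the same user key.
func compareBound(a, b internalKey) int {
	if c := strings.Compare(a.userKey, b.userKey); c != 0 {
		return c
	}
	switch {
	case a.exclusiveSentinel && !b.exclusiveSentinel:
		return -1
	case !a.exclusiveSentinel && b.exclusiveSentinel:
		return +1
	}
	return 0
}

// overlaps reports whether a point key falls within [smallest, largest],
// where largest may be an exclusive sentinel.
func overlaps(smallest, largest internalKey, point string) bool {
	p := internalKey{userKey: point}
	return compareBound(smallest, p) <= 0 && compareBound(p, largest) <= 0
}

func main() {
	smallest := internalKey{userKey: "/Local/RangeID/191853/u"}
	largest := internalKey{userKey: "/Local/RangeID/191853/v", exclusiveSentinel: true}
	fmt.Println(overlaps(smallest, largest, "/Local/RangeID/191853/u/RaftHardState")) // true
	fmt.Println(overlaps(smallest, largest, "/Local/RangeID/191853/v"))               // false: end is exclusive
}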

@erikgrinaker
Contributor

erikgrinaker commented Mar 22, 2023

we thought the ingested sstables should have narrow keyspans local to the associated KV range. We expected that the memtable would not contain any keys that fall within the new replica's six keyspans.

What about snapshots that are sent to catch up followers that have fallen behind on the log? These follower replicas could certainly have applied writes that are still present in the memtable when the catchup snapshot arrives.

During splits, we don't really expect snapshots, at least in the common case. If a follower is slow to apply the split trigger, this can cause a spurious Raft message from the split-off leader to instantiate a new Raft replica, which will then request a snapshot, but we reject these since we'd rather wait for the split trigger. It's possible for that replica to apply the split trigger so late that it'll immediately need a snapshot to catch up though. I suppose there can be further race conditions here too -- I'll ask around for details.

One immediate observation: we write RANGEKEYDELs for every keyspace other than the local unreplicated range ID keyspace: /Local/RangeID/191857/u""/. Is that necessary? It would be mildly more efficient to omit RANGEKEYDELs for any keyspaces that can never contain RANGEKEYSETs.

This was only done to guard against anyone writing range keys here in the future and forgetting to handle it here. But we now explicitly reject writing them across the local keyspan, because we don't handle them correctly elsewhere either (e.g. in SysBytes MVCC stats). It's thus safe to omit these RANGEKEYDELs for the local keyspans as long as that restriction is in place, and we should add a comment saying as much. I'll submit a PR when I'm back from PTO on Monday, unless anyone beats me to it.

cockroach/pkg/storage/mvcc.go

Lines 3382 to 3387 in c2460f1

// We currently don't allow MVCC range tombstones across the local keyspace,
// to be safe. This wouldn't handle MVCC stats (SysBytes) correctly either.
if startKey.Compare(keys.LocalMax) < 0 {
	return errors.AssertionFailedf("can't write MVCC range tombstone across local keyspan %s",
		rangeKey)
}

if err := msstw.currSST.ClearRawRange(
	msstw.keySpans[msstw.currSpan].Key, msstw.keySpans[msstw.currSpan].EndKey,
	true /* pointKeys */, true, /* rangeKeys */
); err != nil {
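A minimal sketch of the guard this suggests, built on the real keys package helpers but with a hypothetical needsRangeKeyClear helper rather than the shape of the eventual PR: since MVCC range tombstones are rejected below keys.LocalMax (per the mvcc.go check quoted above), the snapshot SST writer could pass rangeKeys=false to ClearRawRange for any keyspan that ends at or below that boundary.

package main

import (
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// needsRangeKeyClear reports whether a snapshot keyspan could contain MVCC
// range keys and therefore needs rangeKeys=true in ClearRawRange. Range
// tombstones are rejected across the local keyspace, so spans that end at or
// below keys.LocalMax can skip the RANGEKEYDEL.
func needsRangeKeyClear(sp roachpb.Span) bool {
	return sp.EndKey.Compare(keys.LocalMax) > 0
}

func main() {
	rangeIDLocal := roachpb.Span{
		Key:    keys.MakeRangeIDReplicatedPrefix(191853),
		EndKey: keys.MakeRangeIDReplicatedPrefix(191853).PrefixEnd(),
	}
	userData := roachpb.Span{
		Key:    keys.SystemSQLCodec.TablePrefix(1199),
		EndKey: keys.SystemSQLCodec.TablePrefix(1200),
	}
	fmt.Println(needsRangeKeyClear(rangeIDLocal), needsRangeKeyClear(userData)) // false true
}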

@jbowens
Collaborator Author

jbowens commented Mar 23, 2023

What about snapshots that are sent to catch up followers that have fallen behind on the log? These follower replicas could certainly have applied writes that are still present in the memtable when the catchup snapshot arrives.

Yeah, snapshots for fallen-behind followers make sense as an explanation. Still, we're seeing many ingests, and ~70% of them are ingested as flushable, which seems too high to be explained by catch-up snapshots alone.

I gave this and the sstableKeyCompare function a look. I think the behaviour is as expected.

Agreed, it looks correct to me.

This was only done to guard against anyone writing range keys here in the future and forgetting to handle it here. But we now explicitly reject writing them across the local keyspan, because we don't handle them correctly elsewhere either (e.g. in SysBytes MVCC stats). It's thus safe to omit these RANGEKEYDELs for the local keyspans as long as that restriction is in place, and we should add a comment saying as much. I'll submit a PR when I'm back from PTO on Monday, unless anyone beats me to it.

Got it, nice. This should become moot soon enough with the virtual sstables work, which can eliminate RANGEDELs and RANGEKEYDELs with a new ExciseAndIngest operation that virtualizes any overlapping sstables at ingest time to remove existing data.

@jbowens jbowens moved this from Incoming to To Do (investigations) in Storage Mar 27, 2023
@tbg
Member

tbg commented Mar 30, 2023

What does "overlap" mean in that context? Do you mean that there are keys in the memtable within the bounds of the SSTs to be ingested? That should in fact be rare, at least on a "seasoned" cluster that sees regular memtable rotations.

As Erik pointed out, certain pathological patterns could have snapshots hit the memtable. For example, if a replica got removed and immediately re-added, we probably still have some of its bits in the memtable (from log application, or even just the raft log, which we also clear in the SSTs). Also, any snapshot that is sent as a result of log truncation will hit an existing replica, so very likely overlaps the memtable as well. These patterns may not be as rare as we'd like. We'd really have to know what snapshots were going on as these flushable ingestions happened.

We do have stats returned on each snapshot ingestion [1], so possibly we could print which snapshots are ingested with memtable overlap. Then we can pick a few and spot-check the logs for their history. This might give us a better intuition on when we see these flushable ingests.

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/4849055d168764e46a0fae650d35d27d772f5de8/pkg/kv/kvserver/replica_raftstorage.go#L557

jbowens added a commit to jbowens/pebble that referenced this issue Mar 30, 2023
Add a new field to IngestOperationStats indicating whether or not any of the
ingested sstables overlapped any memtables.

Informs cockroachdb/cockroach#99273.
@jbowens
Collaborator Author

jbowens commented Mar 30, 2023

Do you mean that there are keys in the memtable within the bounds of the SSTs to be ingested?

Exactly.

We do have stats returned on each snapshot ingestion

Good idea, put up cockroachdb/pebble#2422.

I think we can also pull the WALs and manifest from a node and dump them to find the conflicting writes.
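A small sketch of that instrumentation, using Pebble's IngestWithStats. Treat the exact call site, the ingestSnapshot wrapper, and the field name MemtableOverlappingFiles (the field added for this purpose in cockroachdb/pebble#2422) as assumptions rather than the final CockroachDB wiring.

package kvsnapshot

import (
	"fmt"

	"github.com/cockroachdb/pebble"
)

// ingestSnapshot ingests the snapshot's sstables and logs when any of them
// overlapped a memtable, so the snapshot's history can be spot-checked.
func ingestSnapshot(db *pebble.DB, paths []string) error {
	stats, err := db.IngestWithStats(paths)
	if err != nil {
		return err
	}
	if stats.MemtableOverlappingFiles > 0 {
		fmt.Printf("snapshot ingest overlapped memtable: %d/%d files, %d bytes\n",
			stats.MemtableOverlappingFiles, len(paths), stats.Bytes)
	}
	return nil
}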

@nicktrav nicktrav added the O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster label Apr 25, 2023
@nicktrav
Collaborator

Action for follow-up here: look at the test cluster for the proportion of flushable ingests to total ingests.

@jbowens jbowens self-assigned this May 22, 2023
@jbowens
Collaborator Author

jbowens commented May 22, 2023

Seems like we don't actually have a timeseries for count of total ingestions. (filed #103744)

Looking at the flushable ingestions, there is a periodicity that suggests a correlation with workload restarts.

[Screenshots: flushable ingestion timeseries from the test cluster, 2023-05-22]

Now that the telemetry cluster has v23.1.x, we can also look there for more evidence.

[Screenshots: ingest-as-flushable and snapshot timeseries from the telemetry cluster, 2023-05-22]

The ingest-as-flushable graph appears to mirror the graph of snapshot receptions, suggesting that applying a snapshot almost always results in memtable overlap. I don't understand why. I don't think we can chalk this up to an artifact of the telemetry cluster's artificial workload and split+scatters. If we're able to prevent this memtable overlap, we can likely significantly reduce w-amp (and r-amp) during snapshot reception.

Although arguably this will all be moot with the planned ingest+excise operation. Edit: This isn't true. Ingest+excise won't be able to cheaply "excise" the memtable, at least as designed. This will force all of these ingests into the non-flushable code path, forcing a "write stall" to wait for the flush to complete. I think ingest+excise actually increases the importance of avoiding these writes before ingesting the snapshot.

@jbowens
Collaborator Author

jbowens commented May 22, 2023

I grabbed the WALs and MANIFESTs from a test cluster node.

Here's an example flushable ingest in a WAL:

4879581.log
0(48) seq=28463980876 count=6
    INGESTSST(4879575)
    INGESTSST(4879580)
    INGESTSST(4879576)
    INGESTSST(4879577)
    INGESTSST(4879578)
    INGESTSST(4879579)
EOF

and the corresponding version edit:

126468172/322036
  log-num:       4879582
  next-file-num: 4879588
  last-seq-num:  28463981903
  added:         L4 4879575:1526<#28463980876-#28463980876>[/Local/RangeID/318347/r""/0,0#28463980876,RANGEKEYDEL-/Local/RangeID/318347/s""/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L0 4879580:1181<#28463980877-#28463980877>[/Local/RangeID/318347/u""/0,0#28463980877,RANGEDEL-/Local/RangeID/318347/v""/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L5 4879576:2311<#28463980878-#28463980878>[/Local/Range/Table/16060/1/-248782262732346608/0,0#28463980878,RANGEKEYDEL-/Local/Range/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L5 4879577:1248<#28463980879-#28463980879>[/Local/Lock/Intent/Local/Range/Table/16060/1/-248782262732346608/0,0#28463980879,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L5 4879578:1224<#28463980880-#28463980880>[/Local/Lock/Intent/Table/16060/1/-248782262732346608/0,0#28463980880,RANGEKEYDEL-/Local/Lock/Intent/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L3 4879579:263676<#28463980881-#28463980881>[/Table/16060/1/-248782262732346608/0,0#28463980881,RANGEKEYSET-/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
126468919/322037

Judging by the fact that 4879580 is the only sstable to ingest into L0, it seems that's where the memtable overlap was:

  added:         L0 4879580:1181<#28463980877-#28463980877>[/Local/RangeID/318347/u""/0,0#28463980877,RANGEDEL-/Local/RangeID/318347/v""/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)

In the final 1% of the previous WAL 4879549.log, there are two batches containing overlapping keys:

12083692(48) seq=28463980624 count=1
    SET(/Local/RangeID/318347/u"rftr"/0,0,Term:0 Index:0 Type:6 : EMPTY
)
12084163(52) seq=28463980626 count=1
    SET(/Local/RangeID/318347/u/RaftHardState/0,0,Term:0 Index:0 Type:7 : EMPTY
)

Are these keys expected to be written before the replica's snapshot is ingested? Is there some way we could avoid writing them until the snapshot has been ingested?

@erikgrinaker
Contributor

@pavelkalinnikov Can you have a look at the above?

@pav-kv pav-kv added the A-kv-replication Relating to Raft, consensus, and coordination. label May 23, 2023
@blathers-crl blathers-crl bot added the T-kv-replication KV Replication Team label May 23, 2023
@blathers-crl

blathers-crl bot commented May 23, 2023

cc @cockroachdb/replication

@pav-kv
Collaborator

pav-kv commented May 23, 2023

12083692(48) seq=28463980624 count=1
    SET(/Local/RangeID/318347/u"rftr"/0,0,Term:0 Index:0 Type:6 : EMPTY
)

This key is always written by getOrCreateReplica when a replica is first "encountered" on a store (either by means of a split, or snapshot application). This includes the case when a replica is initialized from a snapshot below the HandleSnapshot call.

Only then do we handle raft ready:

stats, err := r.handleRaftReadyRaftMuLocked(ctx, inSnap)

which eventually ingests the snapshot.

So, one possible optimization is to avoid this write, since the snapshot ingestion writes this key anyway. This doesn't seem trivial at a quick glance though, because of the way getOrCreateReplica is coupled with other paths; I'm still looking. Alternatively, perhaps we could avoid writing the range deletes in applySnapshot if we know the keys are not present.

12084163(52) seq=28463980626 count=1
    SET(/Local/RangeID/318347/u/RaftHardState/0,0,Term:0 Index:0 Type:7 : EMPTY
)

I'm not sure why this key is already written. The getOrCreateReplica path doesn't write it, so it must have been something else.

@sumeerbhola
Collaborator

sumeerbhola commented May 23, 2023

RaftHardStateKey and RaftReplicaIDKey are both written for an uninitialized replica (see CreateUninitializedReplica)

@pav-kv
Collaborator

pav-kv commented May 23, 2023

@sumeerbhola I can only see the RaftReplicaIDKey write there. I think the RaftHardStateKey is written by each of the two initialization flows, but elsewhere (hence we see them in separate batches in @jbowens's comment?).

Before any of these flows, it's possible that only RaftReplicaIDKey is written (e.g. we've got some requests for this replica before it got initialized)?

@sumeerbhola
Collaborator

I think the RaftHardStateKey is written by each of the two initialization flows, but elsewhere

The HardState must be getting written somewhere for an uninitialized replica too, since I think an uninitialized replica can vote (so needs to remember it). This is also the reason for the dance in

// To apply the split, we need to "throw away" the data that would belong to
// the RHS, i.e. we clear the user data the RHS would have inherited from
// the LHS due to the split and additionally clear all of the range ID local
// state that the split trigger writes into the RHS. At the time of writing,
// unfortunately that means that we'll also delete any data that might
// already be present in the RHS: the HardState and RaftReplicaID. It is
// important to preserve the HardState because we might however have already
// voted at a higher term. In general this shouldn't happen because we add
// learners and then promote them only after they apply a snapshot but we're
// going to be extra careful in case future versions of cockroach somehow
// promote replicas without ensuring that a snapshot has been received. So
// we write it back (and the RaftReplicaID too, since it's an invariant that
// it's always present).

@pav-kv
Collaborator

pav-kv commented May 23, 2023

Another possible scenario:

  1. getOrCreateReplica creates an uninitialized replica and writes the RaftReplicaIDKey
  2. Then this replica votes (it's almost the only thing it can do) and stores the HardState
  3. (alternatively) I believe it can store a HardState even without voting if it discovers a non-zero Term or Commit from other nodes (I vaguely remember seeing this happen)

@pav-kv
Collaborator

pav-kv commented May 23, 2023

@sumeerbhola Yeah, I think you're right. In the scenario above both keys can be written, and in separate batches.

@pav-kv
Collaborator

pav-kv commented May 23, 2023

We've previously discussed getting rid of uninitialized replicas (or at least making them not do any IO). It could be good to revisit this. No uninit replicas / IO => no RaftReplicaIDKey / HardState keys written before ingestion => more efficient ingestion.

@sumeerbhola
Collaborator

sumeerbhola commented May 23, 2023

I am wondering how much of the work in writeUnreplicatedSST is optional for the case of a new uninitialized range followed by initialization:

// Clearing the unreplicated state.
//
// NB: We do not expect to see range keys in the unreplicated state, so
// we don't drop a range tombstone across the range key space.
unreplicatedPrefixKey := keys.MakeRangeIDUnreplicatedPrefix(id.RangeID)
unreplicatedStart := unreplicatedPrefixKey
unreplicatedEnd := unreplicatedPrefixKey.PrefixEnd()
if err := unreplicatedSST.ClearRawRange(
	unreplicatedStart, unreplicatedEnd, true /* pointKeys */, false, /* rangeKeys */
); err != nil {
	return nil, false, errors.Wrapf(err, "error clearing range of unreplicated SST writer")
}

There will be no raft log, RaftTruncatedStateKey, or RangeLastReplicaGCTimestampKey. We can read the RaftReplicaIDKey to check that the replica ID hasn't changed, and if the HardState has not changed we could avoid doing any of that work (fast path). But perhaps HardState.Commit will change, since IIRC that value must be 0 for an uninitialized replica.
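A rough sketch of that fast path, with hypothetical types standing in for the real state loader and HardState; the type, field, and function names here are illustrative only.

package main

import "fmt"

// hardState mirrors raftpb.HardState's fields for illustration only.
type hardState struct {
	Term, Vote, Commit uint64
}

// unreplicatedState is a hypothetical summary of what is already on disk in
// the replica's unreplicated range-ID keyspace.
type unreplicatedState struct {
	replicaID    uint64
	hardState    hardState
	hasRaftLog   bool // raft log entries, RaftTruncatedState, last replica GC timestamp, ...
	hasOtherKeys bool
}

// canSkipUnreplicatedClear reports whether writeUnreplicatedSST could take a
// fast path: nothing on disk beyond keys the snapshot would rewrite
// identically, with HardState.Commit being the only value that ratchets
// (from 0 to the snapshot index).
func canSkipUnreplicatedClear(cur unreplicatedState, newReplicaID uint64, newHS hardState) bool {
	if cur.hasRaftLog || cur.hasOtherKeys {
		return false
	}
	if cur.replicaID != newReplicaID {
		return false
	}
	return cur.hardState.Term == newHS.Term && cur.hardState.Vote == newHS.Vote
}

func main() {
	cur := unreplicatedState{replicaID: 2, hardState: hardState{Term: 6}}
	fmt.Println(canSkipUnreplicatedClear(cur, 2, hardState{Term: 6, Commit: 153})) // true
}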

btw, are we accidentally clearing the RangeTombstoneKey, which would be a bug?

@exalate-issue-sync exalate-issue-sync bot removed the T-kv-replication KV Replication Team label May 23, 2023
@jbowens jbowens removed this from To Do (investigations) in Storage May 24, 2023
@nvanbenschoten
Member

To verify the theories from above, I added the following instrumentation to snapshot ingestion and spun up a multi-node local cluster:

--- a/pkg/kv/kvserver/replica_raftstorage.go
+++ b/pkg/kv/kvserver/replica_raftstorage.go
@@ -514,6 +514,33 @@ func (r *Replica) applySnapshot(
                log.Infof(ctx, "applied %s (%s)", inSnap, logDetails)
        }(timeutil.Now())

+       // Look at what keys are already written into the unreplicated key space.
+       {
+               unreplicatedPrefixKey := keys.MakeRangeIDUnreplicatedPrefix(r.ID().RangeID)
+               unreplicatedStart := unreplicatedPrefixKey
+               unreplicatedEnd := unreplicatedPrefixKey.PrefixEnd()
+               it := r.store.TODOEngine().NewEngineIterator(storage.IterOptions{UpperBound: unreplicatedEnd})
+               defer it.Close()
+               var ok bool
+               var err error
+               for ok, err = it.SeekEngineKeyGE(storage.EngineKey{Key: unreplicatedStart}); ok; ok, err = it.NextEngineKey() {
+                       key, err := it.UnsafeEngineKey()
+                       if err != nil {
+                               panic(err)
+                       }
+                       log.Infof(ctx, "found unreplicated key: %s", key.Key)
+               }
+               if err != nil {
+                       panic(err)
+               }
+
+               prevHS, err := r.raftMu.stateLoader.StateLoader.LoadHardState(ctx, r.store.TODOEngine())
+               if err != nil {
+                       panic(err)
+               }
+               log.Infof(ctx, "found prev HardState: %+v", prevHS)
+       }
+
        unreplicatedSSTFile, nonempty, err := writeUnreplicatedSST(

Each INITIAL snapshot (during upreplication) looked the same. Each had a RaftHardStateKey and a RaftReplicaIDKey. The HardState's vote was 0 (the common case, but not always true), its commit index was 0 (I believe this is always the case), and its term was 6.

I230608 04:22:04.625626 1331 kv/kvserver/replica_raftstorage.go:531 ⋮ [T1,n2,s2,r13/2:{-}] 128  found unreplicated key: /Local/RangeID‹/13›/‹u›/‹RaftHardState›
I230608 04:22:04.625641 1331 kv/kvserver/replica_raftstorage.go:531 ⋮ [T1,n2,s2,r13/2:{-}] 129  found unreplicated key: /Local/RangeID‹/13›/‹u›‹"rftr"›
I230608 04:22:04.625674 1331 kv/kvserver/replica_raftstorage.go:541 ⋮ [T1,n2,s2,r13/2:{-}] 130  found prev HardState: {Term:6 Vote:0 Commit:0}
I230608 04:22:04.684671 1331 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n2,s2,r13/2:‹/Table/1{1-2}›] 131  applied INITIAL snapshot 35998b43 from (n1,s1):1 at applied index 153 (total=59ms data=1.8 KiB ingestion=6@49ms)

The RaftReplicaIDKey is written during replica creation. It shouldn't be changed by the snapshot, so in theory, if we avoided deleting it by splitting the range deletion, we could also avoid re-writing it.

It is less clear what we should do with the RaftHardStateKey. Snapshots are always accompanied by an updated HardState whose commit index is set to the index of the snapshot. We currently write this updated HardState into the snapshot's ssts, so that the HardState is updated atomically with the snapshot ingestion. Is this necessary for correctness? Typically the commit index in the HardState does not require durability and can be reconstructed after a crash. Does that provide us any flexibility to write the new HardState in a separate batch, either before or after the SST ingestion in order to avoid the key collision with the memtable during ingestion?

@jbowens
Collaborator Author

jbowens commented Jun 8, 2023

It shouldn't be changed by the snapshot, so in theory, if we avoided deleting it by splitting the range deletion, we could also avoid re-writing it.

We're hoping to remove the range deletion entirely, replacing it with an atomic ingest+excise operation that virtualizes overlapping sstables to hide existing data. This will allow ingested snapshot sstables to unconditionally ingest into L6. While we could separately ingest+excise on either side of this key, it'd be preferable to avoid the tiny files.

@sumeerbhola
Collaborator

Typically the commit index in the HardState does not require durability and can be reconstructed after a crash. Does that provide us any flexibility to write the new HardState in a separate batch, either before or after the SST ingestion in order to avoid the key collision with the memtable during ingestion?

The ReplicasStorage (separated raft log) design included an init step at crash recovery time that would see if HardState.Commit < RangeAppliedState.RaftAppliedIndex and if so update it to make it equal to RangeAppliedState.RaftAppliedIndex. I don't know if we already do something like this. If we do (or can), we could check that the only thing that needs updating is HardState.Commit (as you outlined above) and then write it after the ingest (the separated raft log design calls for the invariant that HardState.Commit is 0 for an uninitialized replica, so I don't think we should write it before the ingest).
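A self-contained sketch of that crash-recovery step, with stand-in types for raftpb.HardState and the range applied state; it only illustrates the ratcheting described above, not how the real init path would load and persist these values.

package main

import "fmt"

// hardState and rangeAppliedState are stand-ins for raftpb.HardState and the
// replica's RangeAppliedState.
type hardState struct {
	Term, Vote, Commit uint64
}

type rangeAppliedState struct {
	RaftAppliedIndex uint64
}

// reconcileHardState is the init-time step: if the durable Commit lags the
// applied index after a crash, ratchet it forward before handing the
// HardState to raft.
func reconcileHardState(hs hardState, as rangeAppliedState) hardState {
	if hs.Commit < as.RaftAppliedIndex {
		hs.Commit = as.RaftAppliedIndex
	}
	return hs
}

func main() {
	// E.g. the post-ingest Commit update never made it to disk before a crash.
	hs := hardState{Term: 6, Vote: 1, Commit: 0}
	as := rangeAppliedState{RaftAppliedIndex: 153}
	fmt.Printf("%+v\n", reconcileHardState(hs, as)) // {Term:6 Vote:1 Commit:153}
}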

@tbg
Member

tbg commented Jul 18, 2023

Typically the commit index in the HardState does not require durability and can be reconstructed after a crash.

I would tread carefully here due to apply-time conf changes. There is a secret invariant that you can't have more than one "committed but not known committed" confchange in the log. I don't know how much this can be relaxed, but either way I am not terribly motivated to relax anything in that area, since it is generally poorly understood and problems would not be discovered until it is potentially too late.

I was actually thinking the other day that we should just get out of the business of apply-time conf changes. I filed an issue [1] to that effect. With that, the Committed index would be entirely discretionary, I think, modulo, perhaps, additional meaning we've attached to it with replica quiescence, which one day might be on its way out [2], too.

Footnotes

  1. https://github.com/cockroachdb/cockroach/issues/107083

  2. https://github.com/cockroachdb/cockroach/issues/94592
