storage,kv: investigate overlap memtable during snapshot application #99273

Open
jbowens opened this issue Mar 22, 2023 · 24 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-storage Relating to our storage engine (Pebble) on-disk storage. C-investigation Further steps needed to qualify. C-label will change. O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster T-storage Storage Team

Comments

@jbowens
Collaborator

jbowens commented Mar 22, 2023

In the 23.1 test cluster, we're observing higher-than-expected exercising of the new flushable ingest code path. This code path is triggered when one or more of the sstables in an ingest overlap one or more of the memtables. In 22.2 and earlier, an ingest would flush the memtable, wait for the flush to complete, and then proceed. In 23.1 the ingest writes a record committing the ingest to the WAL and layers the ingested sstables onto the 'flushable' queue alongside the memtables and large batches.

The flushable ingest numbers we're observing indicate that many snapshot applications overlap a memtable, forcing memtable rotation and flushes. This is unexpected, because we thought the ingested sstables should have narrow keyspans local to the associated KV range. We expected that the memtable would not contain any keys that fall within the new replica's six keyspans.
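To make the mechanism concrete, here is a minimal, self-contained Go sketch of the decision described above. It is not Pebble's actual code: the overlap check is approximated by scanning an in-memory key list rather than seeking within a real memtable, and the WAL record is elided.

package main

import "fmt"

// sstable is a simplified ingested table described by its user-key bounds.
type sstable struct{ smallest, largest string }

// flushable is the interface shared by memtables and flushable ingests in the queue.
type flushable interface{ kind() string }

type memtable struct{ keys []string }

func (m *memtable) kind() string { return "memtable" }

type ingestedSSTs struct{ tables []sstable }

func (i *ingestedSSTs) kind() string { return "ingested-ssts" }

type db struct {
	mem            *memtable
	flushableQueue []flushable
}

// overlapsMemtable stands in for the real check, which seeks inside the
// memtable for any key within a table's bounds.
func (d *db) overlapsMemtable(ssts []sstable) bool {
	for _, t := range ssts {
		for _, k := range d.mem.keys {
			if k >= t.smallest && k < t.largest { // end bound treated as exclusive
				return true
			}
		}
	}
	return false
}

// ingest contrasts the two paths: no overlap ingests directly into the LSM;
// overlap takes the 23.1 flushable path instead of the 22.2 flush-and-wait path.
func (d *db) ingest(ssts []sstable) {
	if !d.overlapsMemtable(ssts) {
		fmt.Println("no overlap: ingest directly into the LSM")
		return
	}
	d.flushableQueue = append(d.flushableQueue, &ingestedSSTs{tables: ssts})
	fmt.Println("overlap: queued as a flushable ingest; memtable is rotated and flushed later")
}

func main() {
	d := &db{mem: &memtable{keys: []string{"/Local/RangeID/191853/u/RaftHardState"}}}
	d.ingest([]sstable{{smallest: "/Local/RangeID/191853/u", largest: "/Local/RangeID/191853/v"}})
}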

Here are a few example version edits pulled from the test cluster.

61234027/158636
  log-num:       2073703
  next-file-num: 2073724
  last-seq-num:  15888760204
  added:         L0 2073696:1535<#15888757362-#15888757362>[/Local/RangeID/191853/r""/0,0#15888757362,RANGEKEYDEL-/Local/RangeID/191853/s""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073701:1180<#15888757363-#15888757363>[/Local/RangeID/191853/u""/0,0#15888757363,RANGEDEL-/Local/RangeID/191853/v""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073697:2387<#15888757364-#15888757364>[/Local/Range/Table/1199/1/2073185522769554992/0,0#15888757364,RANGEKEYDEL-/Local/Range/Table/1199/1/2110042154285458192/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073698:1248<#15888757365-#15888757365>[/Local/Lock/Intent/Local/Range/Table/1199/1/2073185522769554992/0,0#15888757365,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/1199/1/2110042154285458192/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073699:1224<#15888757366-#15888757366>[/Local/Lock/Intent/Table/1199/1/2073185522769554992/0,0#15888757366,RANGEKEYDEL-/Local/Lock/Intent/Table/1199/1/2110042154285458192/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L6 2073700:1200<#15888757367-#15888757367>[/Table/1199/1/2073185522769554992/0,0#15888757367,RANGEKEYDEL-/Table/1199/1/2110042154285458192/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  
61234765/158637
  next-file-num: 2073734
  last-seq-num:  15888760945
  added:         L0 2073717:1540<#15888760939-#15888760939>[/Local/RangeID/191841/r""/0,0#15888760939,RANGEKEYDEL-/Local/RangeID/191841/s""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073722:1179<#15888760940-#15888760940>[/Local/RangeID/191841/u""/0,0#15888760940,RANGEDEL-/Local/RangeID/191841/v""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073718:2938<#15888760941-#15888760941>[/Local/Range/Table/1199/1/2202183733075216192/0,0#15888760941,RANGEKEYDEL-/Local/Range/Table/1199/1/2220612048833167792/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073719:1248<#15888760942-#15888760942>[/Local/Lock/Intent/Local/Range/Table/1199/1/2202183733075216192/0,0#15888760942,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/1199/1/2220612048833167792/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073720:1224<#15888760943-#15888760943>[/Local/Lock/Intent/Table/1199/1/2202183733075216192/0,0#15888760943,RANGEKEYDEL-/Local/Lock/Intent/Table/1199/1/2220612048833167792/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L6 2073721:1200<#15888760944-#15888760944>[/Table/1199/1/2202183733075216192/0,0#15888760944,RANGEKEYDEL-/Table/1199/1/2220612048833167792/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  
61235499/158638
  log-num:       2073723
  next-file-num: 2073734
  last-seq-num:  15888761334
  added:         L0 2073725:159667<#15888757368-#15888760198>[/Local/RangeID/2/u/RaftHardState/0,0#15888760084,SET-/Table/13/1/2023-03-22T17:47:10.587605Z/850122606474362886/6/1/1679507230.587605315,0#15888759341,SET] (2023-03-22T17:47:10Z)
  
61235607/158639
  log-num:       2073724
  next-file-num: 2073734
  last-seq-num:  15888761346
  added:         L0 2073711:1539<#15888760199-#15888760199>[/Local/RangeID/191857/r""/0,0#15888760199,RANGEKEYDEL-/Local/RangeID/191857/s""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073716:1179<#15888760200-#15888760200>[/Local/RangeID/191857/u""/0,0#15888760200,RANGEDEL-/Local/RangeID/191857/v""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073712:1828<#15888760201-#15888760201>[/Local/Range/Table/1199/1/-3381595941584118608/0,0#15888760201,RANGEKEYDEL-/Local/Range/Table/1199/1/-3363167625826167008/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073713:1248<#15888760202-#15888760202>[/Local/Lock/Intent/Local/Range/Table/1199/1/-3381595941584118608/0,0#15888760202,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/1199/1/-3363167625826167008/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073714:1224<#15888760203-#15888760203>[/Local/Lock/Intent/Table/1199/1/-3381595941584118608/0,0#15888760203,RANGEKEYDEL-/Local/Lock/Intent/Table/1199/1/-3363167625826167008/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L6 2073715:1200<#15888760204-#15888760204>[/Table/1199/1/-3381595941584118608/0,0#15888760204,RANGEKEYDEL-/Table/1199/1/-3363167625826167008/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)

There's no specific recording of which version edits correspond to flushable ingests that overlap with the memtable, but I believe all the ones that ratchet log-num are, since only flushes should be ratcheting log-num.
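For reference, here is a rough helper for spotting those: it assumes the textual manifest-dump format shown above (read on stdin) and prints only the version edits that ratchet log-num.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 1<<20), 1<<20)
	var edit []string
	// emit prints the accumulated version edit if it ratchets log-num.
	emit := func() {
		for _, l := range edit {
			if strings.Contains(l, "log-num:") {
				fmt.Println(strings.Join(edit, "\n"))
				break
			}
		}
		edit = edit[:0]
	}
	for sc.Scan() {
		line := sc.Text()
		// A new version edit starts at an unindented "offset/index" header line.
		if line != "" && !strings.HasPrefix(line, " ") && strings.Contains(line, "/") {
			emit()
		}
		edit = append(edit, line)
	}
	emit()
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}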

Jira issue: CRDB-25798

Epic CRDB-27235

@jbowens jbowens added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Mar 22, 2023
@blathers-crl blathers-crl bot added this to Incoming in Storage Mar 22, 2023
@jbowens jbowens removed the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 22, 2023
@jbowens
Collaborator Author

jbowens commented Mar 22, 2023

One immediate observation: we write RANGEKEYDELs for every keyspace other than the local unreplicated range ID keyspace: /Local/RangeID/191857/u""/. Is that necessary? It would be mildly more efficient to omit RANGEKEYDELs for any keyspaces that can never contain RANGEKEYSETs.

@jbowens
Collaborator Author

jbowens commented Mar 22, 2023

Another observation: the end boundaries are all exclusive range deletion or range key sentinels. I wonder if we're handling them appropriately when calculating overlap?

@bananabrick
Contributor

bananabrick commented Mar 22, 2023

the end boundaries are all exclusive range deletion or range key sentinels. I wonder if we're handling them appropriately when calculating overlap?

https://github.com/cockroachdb/pebble/blob/master/ingest.go#L413. I gave this and the sstableKeyCompare function a look. I think the behaviour is as expected.
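For anyone following along, here is a self-contained sketch of the boundary handling in question. It paraphrases (rather than quotes) the comparison logic to show why a point key exactly at a "#inf,RANGEDEL" end boundary does not count as overlapping: when user keys are equal, the exclusive sentinel sorts first.

package main

import (
	"fmt"
	"strings"
)

type internalKey struct {
	userKey           string
	exclusiveSentinel bool // e.g. a RANGEDEL/RANGEKEYDEL end bound with seqnum "inf"
}

// compareBound orders keys by user key; on a tie, an exclusive sentinel sorts
// before a real key at the same user key.
func compareBound(a, b internalKey) int {
	if c := strings.Compare(a.userKey, b.userKey); c != 0 {
		return c
	}
	switch {
	case a.exclusiveSentinel && !b.exclusiveSentinel:
		return -1
	case !a.exclusiveSentinel && b.exclusiveSentinel:
		return +1
	}
	return 0
}

// overlaps reports whether a point key falls within [smallest, largest],
// where largest may be an exclusive sentinel.
func overlaps(smallest, largest internalKey, point string) bool {
	p := internalKey{userKey: point}
	return compareBound(smallest, p) <= 0 && compareBound(p, largest) <= 0
}

func main() {
	smallest := internalKey{userKey: "/Local/RangeID/191853/u"}
	largest := internalKey{userKey: "/Local/RangeID/191853/v", exclusiveSentinel: true}
	fmt.Println(overlaps(smallest, largest, "/Local/RangeID/191853/u/RaftHardState")) // true
	fmt.Println(overlaps(smallest, largest, "/Local/RangeID/191853/v"))               // false: end is exclusive
}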

@erikgrinaker
Contributor

erikgrinaker commented Mar 22, 2023

we thought the ingested sstables should have narrow keyspans local to the associated KV range. We expected that the memtable would not contain any keys that fall within the new replica's six keyspans.

What about snapshots that are sent to catch up followers that have fallen behind on the log? These follower replicas could certainly have applied writes that are still present in the memtable when the catchup snapshot arrives.

During splits, we don't really expect snapshots, at least in the common case. If a follower is slow to apply the split trigger, this can cause a spurious Raft message from the split-off leader to instantiate a new Raft replica, which will then request a snapshot, but we reject these since we'd rather wait for the split trigger. It's possible for that replica to apply the split trigger so late that it'll immediately need a snapshot to catch up though. I suppose there can be further race conditions here too -- I'll ask around for details.

One immediate observation: we write RANGEKEYDELs for every keyspace other than the local unreplicated range ID keyspace: /Local/RangeID/191857/u""/. Is that necessary? It would be mildly more efficient to omit RANGEKEYDELs for any keyspaces that can never contain RANGEKEYSETs.

This was only done to guard against anyone writing range keys here in the future and forgetting to handle it here. But we now explicitly reject writing them across the local keyspan, because we don't handle them correctly elsewhere either (e.g. in SysBytes MVCC stats). It's thus safe to omit these RANGEKEYDELs for the local keyspans as long as that restriction is in place, and we should add a comment saying as much. I'll submit a PR when I'm back from PTO on Monday, unless anyone beats me to it.

cockroach/pkg/storage/mvcc.go

Lines 3382 to 3387 in c2460f1

// We currently don't allow MVCC range tombstones across the local keyspace,
// to be safe. This wouldn't handle MVCC stats (SysBytes) correctly either.
if startKey.Compare(keys.LocalMax) < 0 {
	return errors.AssertionFailedf("can't write MVCC range tombstone across local keyspan %s",
		rangeKey)
}

if err := msstw.currSST.ClearRawRange(
	msstw.keySpans[msstw.currSpan].Key, msstw.keySpans[msstw.currSpan].EndKey,
	true /* pointKeys */, true, /* rangeKeys */
); err != nil {
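A minimal sketch of the guard this suggests, built on the real keys package helpers but with a hypothetical needsRangeKeyClear helper rather than the shape of the eventual PR: since MVCC range tombstones are rejected below keys.LocalMax (per the mvcc.go check quoted above), the snapshot SST writer could pass rangeKeys=false to ClearRawRange for any keyspan that ends at or below that boundary.

package main

import (
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// needsRangeKeyClear reports whether a snapshot keyspan could contain MVCC
// range keys and therefore needs rangeKeys=true in ClearRawRange. Range
// tombstones are rejected across the local keyspace, so spans that end at or
// below keys.LocalMax can skip the RANGEKEYDEL.
func needsRangeKeyClear(sp roachpb.Span) bool {
	return sp.EndKey.Compare(keys.LocalMax) > 0
}

func main() {
	rangeIDLocal := roachpb.Span{
		Key:    keys.MakeRangeIDReplicatedPrefix(191853),
		EndKey: keys.MakeRangeIDReplicatedPrefix(191853).PrefixEnd(),
	}
	userData := roachpb.Span{
		Key:    keys.SystemSQLCodec.TablePrefix(1199),
		EndKey: keys.SystemSQLCodec.TablePrefix(1200),
	}
	fmt.Println(needsRangeKeyClear(rangeIDLocal), needsRangeKeyClear(userData)) // false true
}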

@jbowens
Collaborator Author

jbowens commented Mar 23, 2023

What about snapshots that are sent to catch up followers that have fallen behind on the log? These follower replicas could certainly have applied writes that are still present in the memtable when the catchup snapshot arrives.

Yeah, snapshots for fallen-behind followers make sense as an explanation. Still, we're seeing many ingests, and ~70% of them are ingested as flushable, which seems too high to be explained by catch-up snapshots alone.

I gave this and the sstableKeyCompare function a look. I think the behaviour is as expected.

Agreed, it looks correct to me.

This was only done to guard against anyone writing range keys here in the future and forgetting to handle it here. But we now explicitly reject writing them across the local keyspan, because we don't handle them correctly elsewhere either (e.g. in SysBytes MVCC stats). It's thus safe to omit these RANGEKEYDELs for the local keyspans as long as that restriction is in place, and we should add a comment saying as much. I'll submit a PR when I'm back from PTO on Monday, unless anyone beats me to it.

Got it, nice. This should become moot soon enough with the virtual sstables work, which can eliminate RANGEDELs and RANGEKEYDELs with a new ExciseAndIngest operation that virtualizes any overlapping sstables at ingest time to remove existing data.

@jbowens jbowens moved this from Incoming to To Do (investigations) in Storage Mar 27, 2023
@tbg
Member

tbg commented Mar 30, 2023

What does "overlap" mean in that context? Do you mean that there are keys in the memtable within the bounds of the SSTs to be ingested? That should in fact be rare, at least on a "seasoned" cluster that sees regular memtable rotations.

As Erik pointed out, certain pathological patterns could have snapshots hit the memtable. For example, if a replica got removed and immediately re-added, we probably still have some of its bits in the memtable (from log application, or even just the raft log, which we also clear in the SSTs). Also, any snapshot that is sent as a result of log truncation will hit an existing replica, so very likely overlaps the memtable as well. These patterns may not be as rare as we'd like. We'd really have to know what snapshots were going on as these flushable ingestions happened.

We do have stats returned on each snapshot ingestion [1], so possibly we could print which snapshots are ingested with memtable overlap. Then we can pick a few and spot-check the logs for their history. This might give us a better intuition on when we see these flushable ingests.

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/4849055d168764e46a0fae650d35d27d772f5de8/pkg/kv/kvserver/replica_raftstorage.go#L557

jbowens added a commit to jbowens/pebble that referenced this issue Mar 30, 2023
Add a new field to IngestOperationStats indicating whether or not any of the
ingested sstables overlapped any memtables.

Informs cockroachdb/cockroach#99273.
@jbowens
Collaborator Author

jbowens commented Mar 30, 2023

Do you mean that there are keys in the memtable within the bounds of the SSTs to be ingested?

Exactly.

We do have stats returned on each snapshot ingestion

Good idea, put up cockroachdb/pebble#2422.

I think we can also pull the WALs and manifest from a node and dump them to find the conflicting writes.
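A small sketch of that instrumentation, using Pebble's IngestWithStats. Treat the exact call site, the ingestSnapshot wrapper, and the field name MemtableOverlappingFiles (the field added for this purpose in cockroachdb/pebble#2422) as assumptions rather than the final CockroachDB wiring.

package kvsnapshot

import (
	"fmt"

	"github.com/cockroachdb/pebble"
)

// ingestSnapshot ingests the snapshot's sstables and logs when any of them
// overlapped a memtable, so the snapshot's history can be spot-checked.
func ingestSnapshot(db *pebble.DB, paths []string) error {
	stats, err := db.IngestWithStats(paths)
	if err != nil {
		return err
	}
	if stats.MemtableOverlappingFiles > 0 {
		fmt.Printf("snapshot ingest overlapped memtable: %d/%d files, %d bytes\n",
			stats.MemtableOverlappingFiles, len(paths), stats.Bytes)
	}
	return nil
}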

@nicktrav nicktrav added the O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster label Apr 25, 2023
@nicktrav
Collaborator

Action for follow-up here: look at the test cluster for the proportion of flushable ingests to total ingests.

@jbowens jbowens self-assigned this May 22, 2023
@jbowens
Collaborator Author

jbowens commented May 22, 2023

Seems like we don't actually have a timeseries for count of total ingestions. (filed #103744)

Looking at the flushable ingestions, there is a periodicity that suggests a correlation with workload restarts.

[Screenshots: flushable ingestion timeseries from the test cluster, 2023-05-22]

Now that the telemetry cluster has v23.1.x, we can also look there for more evidence.

[Screenshots: ingest-as-flushable and snapshot timeseries from the telemetry cluster, 2023-05-22]

The ingest-as-flushable graph appears to mirror the graph of snapshot receptions, suggesting that applying a snapshot almost always results in memtable overlap. I don't understand why. I don't think we can chalk this up to an artifact of the telemetry cluster's artificial workload and split+scatters. If we're able to prevent this memtable overlap, we can likely significantly reduce w-amp (and r-amp) during snapshot reception.

Although arguably this will all be moot with the planned ingest+excise operation. Edit: This isn't true. Ingest+excise won't be able to cheaply "excise" the memtable, at least as designed. This will force all of these ingests into the non-flushable code path, forcing a "write stall" to wait for the flush to complete. I think ingest+excise actually increases the importance of avoiding these writes before ingesting the snapshot.

@jbowens
Collaborator Author

jbowens commented May 22, 2023

I grabbed the WALs and MANIFESTs from a test cluster node.

Here's an example flushable ingest in a WAL:

4879581.log
0(48) seq=28463980876 count=6
    INGESTSST(4879575)
    INGESTSST(4879580)
    INGESTSST(4879576)
    INGESTSST(4879577)
    INGESTSST(4879578)
    INGESTSST(4879579)
EOF

and the corresponding version edit:

126468172/322036
  log-num:       4879582
  next-file-num: 4879588
  last-seq-num:  28463981903
  added:         L4 4879575:1526<#28463980876-#28463980876>[/Local/RangeID/318347/r""/0,0#28463980876,RANGEKEYDEL-/Local/RangeID/318347/s""/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L0 4879580:1181<#28463980877-#28463980877>[/Local/RangeID/318347/u""/0,0#28463980877,RANGEDEL-/Local/RangeID/318347/v""/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L5 4879576:2311<#28463980878-#28463980878>[/Local/Range/Table/16060/1/-248782262732346608/0,0#28463980878,RANGEKEYDEL-/Local/Range/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L5 4879577:1248<#28463980879-#28463980879>[/Local/Lock/Intent/Local/Range/Table/16060/1/-248782262732346608/0,0#28463980879,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L5 4879578:1224<#28463980880-#28463980880>[/Local/Lock/Intent/Table/16060/1/-248782262732346608/0,0#28463980880,RANGEKEYDEL-/Local/Lock/Intent/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L3 4879579:263676<#28463980881-#28463980881>[/Table/16060/1/-248782262732346608/0,0#28463980881,RANGEKEYSET-/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
126468919/322037

Judging by the fact that 4879580 is the only sstable to ingest into L0, it seems that's where the memtable overlap was:

  added:         L0 4879580:1181<#28463980877-#28463980877>[/Local/RangeID/318347/u""/0,0#28463980877,RANGEDEL-/Local/RangeID/318347/v""/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)

In the final 1% of the previous WAL 4879549.log, there are two batches containing overlapping keys:

12083692(48) seq=28463980624 count=1
    SET(/Local/RangeID/318347/u"rftr"/0,0,Term:0 Index:0 Type:6 : EMPTY
)
12084163(52) seq=28463980626 count=1
    SET(/Local/RangeID/318347/u/RaftHardState/0,0,Term:0 Index:0 Type:7 : EMPTY
)

Are these keys expected to be written before the replica's snapshot is ingested? Is there some way we could avoid writing them until the snapshot has been ingested?

@erikgrinaker
Contributor

@pavelkalinnikov Can you have a look at the above?

@pav-kv pav-kv added the A-kv-replication Relating to Raft, consensus, and coordination. label May 23, 2023
@blathers-crl blathers-crl bot added the T-kv-replication KV Replication Team label May 23, 2023
@blathers-crl

blathers-crl bot commented May 23, 2023

cc @cockroachdb/replication

@pav-kv
Collaborator

pav-kv commented May 23, 2023

12083692(48) seq=28463980624 count=1
    SET(/Local/RangeID/318347/u"rftr"/0,0,Term:0 Index:0 Type:6 : EMPTY
)

This key is always written by getOrCreateReplica when a replica is first "encountered" on a store (either by means of a split, or snapshot application). This includes the case when a replica is initialized from a snapshot below the HandleSnapshot call.

Only then do we handle raft ready:

stats, err := r.handleRaftReadyRaftMuLocked(ctx, inSnap)

which eventually ingests the snapshot.

So, one possible optimization is to avoid this write, since the snapshot ingestion writes this key anyway. This doesn't seem trivial at a quick glance though, because of the way getOrCreateReplica is coupled with other paths; I'm still looking. Alternatively, perhaps we could avoid writing the range deletes in applySnapshot if we know the keys are not present.

12084163(52) seq=28463980626 count=1
    SET(/Local/RangeID/318347/u/RaftHardState/0,0,Term:0 Index:0 Type:7 : EMPTY
)

I'm not sure why this key is already written. The getOrCreateReplica path doesn't write it, so it must have been something else.

@sumeerbhola
Collaborator

sumeerbhola commented May 23, 2023

RaftHardStateKey and RaftReplicaIDKey are both written for an uninitialized replica (see CreateUninitializedReplica)

@pav-kv
Collaborator

pav-kv commented May 23, 2023

@sumeerbhola I can only see the RaftReplicaIDKey write there. I think the RaftHardStateKey is written by each of the two initialization flows, but elsewhere (hence we see them in separate batches in @jbowens's comment?).

Before any of these flows, it's possible that only RaftReplicaIDKey is written (e.g. we've got some requests for this replica before it got initialized)?

@sumeerbhola
Collaborator

I think the RaftHardStateKey is written by each of the two initialization flows, but elsewhere

The HardState must be getting written somewhere for an uninitialized replica too, since I think an uninitialized replica can vote (so needs to remember it). This is also the reason for the dance in

// To apply the split, we need to "throw away" the data that would belong to
// the RHS, i.e. we clear the user data the RHS would have inherited from
// the LHS due to the split and additionally clear all of the range ID local
// state that the split trigger writes into the RHS. At the time of writing,
// unfortunately that means that we'll also delete any data that might
// already be present in the RHS: the HardState and RaftReplicaID. It is
// important to preserve the HardState because we might however have already
// voted at a higher term. In general this shouldn't happen because we add
// learners and then promote them only after they apply a snapshot but we're
// going to be extra careful in case future versions of cockroach somehow
// promote replicas without ensuring that a snapshot has been received. So
// we write it back (and the RaftReplicaID too, since it's an invariant that
// it's always present).

@pav-kv
Collaborator

pav-kv commented May 23, 2023

Another possible scenario:

  1. getOrCreateReplica creates an uninitialized replica and writes the RaftReplicaIDKey
  2. Then this replica votes (it's almost the only thing it can do) and stores the HardState
  3. (alternatively) I believe it can store a HardState even without voting if it discovers a non-zero Term or Commit from other nodes (I vaguely remember seeing this happen)

@pav-kv
Collaborator

pav-kv commented May 23, 2023

@sumeerbhola Yeah, I think you're right. In the scenario above both keys can be written, and in separate batches.

@pav-kv
Collaborator

pav-kv commented May 23, 2023

We've previously discussed getting rid of uninitialized replicas (or at least making them not do any IO). It could be good to revisit this. No uninit replicas / IO => no RaftReplicaIDKey / HardState keys written before ingestion => more efficient ingestion.

@sumeerbhola
Collaborator

sumeerbhola commented May 23, 2023

I am wondering how much of the work in writeUnreplicatedSST is optional for the case of a new uninitialized range followed by initialization:

// Clearing the unreplicated state.
//
// NB: We do not expect to see range keys in the unreplicated state, so
// we don't drop a range tombstone across the range key space.
unreplicatedPrefixKey := keys.MakeRangeIDUnreplicatedPrefix(id.RangeID)
unreplicatedStart := unreplicatedPrefixKey
unreplicatedEnd := unreplicatedPrefixKey.PrefixEnd()
if err := unreplicatedSST.ClearRawRange(
	unreplicatedStart, unreplicatedEnd, true /* pointKeys */, false, /* rangeKeys */
); err != nil {
	return nil, false, errors.Wrapf(err, "error clearing range of unreplicated SST writer")
}

There will be no raft log, RaftTruncatedStateKey, or RangeLastReplicaGCTimestampKey. We can read the RaftReplicaIDKey to check that the replica ID hasn't changed, and if the HardState has not changed we could avoid doing any of that work (fast path). But perhaps HardState.Commit will change, since IIRC that value must be 0 for an uninitialized replica.
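A rough sketch of that fast path, with hypothetical types standing in for the real state loader and HardState; the type, field, and function names here are illustrative only.

package main

import "fmt"

// hardState mirrors raftpb.HardState's fields for illustration only.
type hardState struct {
	Term, Vote, Commit uint64
}

// unreplicatedState is a hypothetical summary of what is already on disk in
// the replica's unreplicated range-ID keyspace.
type unreplicatedState struct {
	replicaID    uint64
	hardState    hardState
	hasRaftLog   bool // raft log entries, RaftTruncatedState, last replica GC timestamp, ...
	hasOtherKeys bool
}

// canSkipUnreplicatedClear reports whether writeUnreplicatedSST could take a
// fast path: nothing on disk beyond keys the snapshot would rewrite
// identically, with HardState.Commit being the only value that ratchets
// (from 0 to the snapshot index).
func canSkipUnreplicatedClear(cur unreplicatedState, newReplicaID uint64, newHS hardState) bool {
	if cur.hasRaftLog || cur.hasOtherKeys {
		return false
	}
	if cur.replicaID != newReplicaID {
		return false
	}
	return cur.hardState.Term == newHS.Term && cur.hardState.Vote == newHS.Vote
}

func main() {
	cur := unreplicatedState{replicaID: 2, hardState: hardState{Term: 6}}
	fmt.Println(canSkipUnreplicatedClear(cur, 2, hardState{Term: 6, Commit: 153})) // true
}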

btw, are we accidentally clearing the RangeTombstoneKey, which would be a bug?

@exalate-issue-sync exalate-issue-sync bot removed the T-kv-replication KV Replication Team label May 23, 2023
@jbowens jbowens removed this from To Do (investigations) in Storage May 24, 2023
@nvanbenschoten
Member

To verify the theories from above, I added the following instrumentation to snapshot ingestion and spun up a multi-node local cluster:

--- a/pkg/kv/kvserver/replica_raftstorage.go
+++ b/pkg/kv/kvserver/replica_raftstorage.go
@@ -514,6 +514,33 @@ func (r *Replica) applySnapshot(
                log.Infof(ctx, "applied %s (%s)", inSnap, logDetails)
        }(timeutil.Now())

+       // Look at what keys are already written into the unreplicated key space.
+       {
+               unreplicatedPrefixKey := keys.MakeRangeIDUnreplicatedPrefix(r.ID().RangeID)
+               unreplicatedStart := unreplicatedPrefixKey
+               unreplicatedEnd := unreplicatedPrefixKey.PrefixEnd()
+               it := r.store.TODOEngine().NewEngineIterator(storage.IterOptions{UpperBound: unreplicatedEnd})
+               defer it.Close()
+               var ok bool
+               var err error
+               for ok, err = it.SeekEngineKeyGE(storage.EngineKey{Key: unreplicatedStart}); ok; ok, err = it.NextEngineKey() {
+                       key, err := it.UnsafeEngineKey()
+                       if err != nil {
+                               panic(err)
+                       }
+                       log.Infof(ctx, "found unreplicated key: %s", key.Key)
+               }
+               if err != nil {
+                       panic(err)
+               }
+
+               prevHS, err := r.raftMu.stateLoader.StateLoader.LoadHardState(ctx, r.store.TODOEngine())
+               if err != nil {
+                       panic(err)
+               }
+               log.Infof(ctx, "found prev HardState: %+v", prevHS)
+       }
+
        unreplicatedSSTFile, nonempty, err := writeUnreplicatedSST(

Each INITIAL snapshot (during upreplication) looked the same. Each had a RaftHardStateKey and a RaftReplicaIDKey. The HardState's vote was 0 (the common case, but not always true), its commit index was 0 (I believe this is always the case), and its term was 6.

I230608 04:22:04.625626 1331 kv/kvserver/replica_raftstorage.go:531 ⋮ [T1,n2,s2,r13/2:{-}] 128  found unreplicated key: /Local/RangeID‹/13›/‹u›/‹RaftHardState›
I230608 04:22:04.625641 1331 kv/kvserver/replica_raftstorage.go:531 ⋮ [T1,n2,s2,r13/2:{-}] 129  found unreplicated key: /Local/RangeID‹/13›/‹u›‹"rftr"›
I230608 04:22:04.625674 1331 kv/kvserver/replica_raftstorage.go:541 ⋮ [T1,n2,s2,r13/2:{-}] 130  found prev HardState: {Term:6 Vote:0 Commit:0}
I230608 04:22:04.684671 1331 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n2,s2,r13/2:‹/Table/1{1-2}›] 131  applied INITIAL snapshot 35998b43 from (n1,s1):1 at applied index 153 (total=59ms data=1.8 KiB ingestion=6@49ms)

The RaftReplicaIDKey is written during replica creation. It shouldn't be changed by the snapshot, so in theory, if we avoided deleting it by splitting the range deletion, we could also avoid re-writing it.

It is less clear what we should do with the RaftHardStateKey. Snapshots are always accompanied by an updated HardState whose commit index is set to the index of the snapshot. We currently write this updated HardState into the snapshot's ssts, so that the HardState is updated atomically with the snapshot ingestion. Is this necessary for correctness? Typically the commit index in the HardState does not require durability and can be reconstructed after a crash. Does that provide us any flexibility to write the new HardState in a separate batch, either before or after the SST ingestion in order to avoid the key collision with the memtable during ingestion?

@jbowens
Collaborator Author

jbowens commented Jun 8, 2023

It shouldn't be changed by the snapshot, so in theory, if we avoided deleting it by splitting the range deletion, we could also avoid re-writing it.

We're hoping to remove the range deletion entirely, replacing it with an atomic ingest+excise operation that virtualizes overlapping sstables to hide existing data. This will allow ingested snapshot sstables to unconditionally ingest into L6. While we could separately ingest+excise on either side of this key, it'd be preferable to avoid the tiny files.

@sumeerbhola
Collaborator

Typically the commit index in the HardState does not require durability and can be reconstructed after a crash. Does that provide us any flexibility to write the new HardState in a separate batch, either before or after the SST ingestion in order to avoid the key collision with the memtable during ingestion?

The ReplicasStorage (separated raft log) design included an init step at crash recovery time that would see if HardState.Commit < RangeAppliedState.RaftAppliedIndex and if so update it to make it equal to RangeAppliedState.RaftAppliedIndex. I don't know if we already do something like this. If we do (or can), we could check that the only thing that needs updating is HardState.Commit (as you outlined above) and then write it after the ingest (the separated raft log design calls for the invariant that HardState.Commit is 0 for an uninitialized replica, so I don't think we should write it before the ingest).
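A self-contained sketch of that crash-recovery step, with stand-in types for raftpb.HardState and the range applied state; it only illustrates the ratcheting described above, not how the real init path would load and persist these values.

package main

import "fmt"

// hardState and rangeAppliedState are stand-ins for raftpb.HardState and the
// replica's RangeAppliedState.
type hardState struct {
	Term, Vote, Commit uint64
}

type rangeAppliedState struct {
	RaftAppliedIndex uint64
}

// reconcileHardState is the init-time step: if the durable Commit lags the
// applied index after a crash, ratchet it forward before handing the
// HardState to raft.
func reconcileHardState(hs hardState, as rangeAppliedState) hardState {
	if hs.Commit < as.RaftAppliedIndex {
		hs.Commit = as.RaftAppliedIndex
	}
	return hs
}

func main() {
	// E.g. the post-ingest Commit update never made it to disk before a crash.
	hs := hardState{Term: 6, Vote: 1, Commit: 0}
	as := rangeAppliedState{RaftAppliedIndex: 153}
	fmt.Printf("%+v\n", reconcileHardState(hs, as)) // {Term:6 Vote:1 Commit:153}
}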

@tbg
Member

tbg commented Jul 18, 2023

Typically the commit index in the HardState does not require durability and can be reconstructed after a crash.

I would tread carefully here due to apply-time conf changes. There is a secret invariant that you can't have more than one "committed but not known committed" confchange in the log. I don't know how much this can be relaxed, but either way I am not terribly motivated to relax anything in that area, since it is generally poorly understood and problems would not be discovered until it is potentially too late.

I was actually thinking the other day that we should just get out of the business of apply-time conf changes. I filed an issue [1] to that effect. With that, the Committed index would be entirely discretionary, I think, modulo, perhaps, additional meaning we've attached to it with replica quiescence, which one day might be on its way out [2], too.

Footnotes

  1. https://github.com/cockroachdb/cockroach/issues/107083

  2. https://github.com/cockroachdb/cockroach/issues/94592
