storage,kv: investigate memtable overlap during snapshot application #99273
One immediate observation: we write …
Another observation: the end boundaries are all exclusive range-deletion or range-key sentinels. I wonder if we're handling them appropriately when calculating overlap?
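To make the boundary question concrete, here's a toy Go sketch (not pebble's actual overlap code; the names and types are illustrative) of how an exclusive end bound from a range-deletion or range-key sentinel changes the check:

```go
package main

import "fmt"

// bound is an sstable end boundary. exclusive is true when the boundary comes
// from a range deletion or range key sentinel, so the boundary key itself is
// not contained in the sstable.
type bound struct {
	key       string
	exclusive bool
}

// keyBelowEnd reports whether a memtable point key falls below the sstable's
// end boundary, honoring exclusivity.
func keyBelowEnd(key string, end bound) bool {
	if end.exclusive {
		return key < end.key
	}
	return key <= end.key
}

func main() {
	end := bound{key: "/Local/RangeID/53/r/zzz", exclusive: true}
	// A memtable key exactly at an exclusive end bound should not count as overlap.
	fmt.Println(keyBelowEnd("/Local/RangeID/53/r/zzz", end)) // false
	fmt.Println(keyBelowEnd("/Local/RangeID/53/r/abc", end)) // true
}
```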
https://github.com/cockroachdb/pebble/blob/master/ingest.go#L413. Gave this a look and the …
What about snapshots that are sent to catch up followers that have fallen behind on the log? These follower replicas could certainly have applied writes that are still present in the memtable when the catchup snapshot arrives. During splits, we don't really expect snapshots, at least in the common case. If a follower is slow to apply the split trigger, this can cause a spurious Raft message from the split-off leader to instantiate a new Raft replica, which will then request a snapshot, but we reject these since we'd rather wait for the split trigger. It's possible for that replica to apply the split trigger so late that it'll immediately need a snapshot to catch up though. I suppose there can be further race conditions here too -- I'll ask around for details.
This was only done to guard against anyone writing range keys here in the future and forgetting to handle them here. But we now explicitly reject writing them across the local keyspan, because we don't handle them correctly elsewhere either (e.g. lines 3382 to 3387 in c2460f1, and cockroach/pkg/kv/kvserver/store_snapshot.go lines 203 to 206 in c2460f1).
Yeah, snapshots for fallen-behind followers make sense as an explanation. But we're seeing many ingests, with ~70% of them ingested as flushable, which seems high if they're all catch-up snapshots.
Agreed, it looks correct to me.
Got it, nice. This should become moot soon enough with the virtual sstables work, which can eliminate …
What does "overlap" mean in that context? Do you mean that there are keys in the memtable within the bounds of the SSTs to be ingested? That should in fact be rare, at least on a "seasoned" cluster that sees regular memtable rotations.

As Erik pointed out, certain pathological patterns could have snapshots hit the memtable. For example, if a replica got removed and immediately re-added, we probably still have some of its bits in the memtable (from log application, or even just the raft log, which we also clear in the SSTs). Also, any snapshot that is sent as a result of log truncation will hit an existing replica, so very likely overlaps the memtable as well. These patterns may not be as rare as we'd like.

We'd really have to know what snapshots were going on as these flushable ingestions happened. We do have stats returned on each snapshot ingestion [1], so possibly we could print which snapshots are ingested with memtable overlap. Then we can pick a few and spot-check the logs for their history. This might give us a better intuition on when we see these flushable ingests.
Add a new field to IngestOperationStats indicating whether or not any of the ingested sstables overlapped any memtables. Informs cockroachdb/cockroach#99273.
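For illustration, here is a minimal caller-side sketch of how such a field could be used to log overlapping snapshot ingests. It assumes pebble's IngestWithStats entry point and a count-valued field named MemtableOverlappingFiles, as proposed in the PR; this is not the actual CockroachDB call site.

```go
package ingestlog

import (
	"log"

	"github.com/cockroachdb/pebble"
)

// ingestAndLogOverlap is a hypothetical helper, not the actual CockroachDB
// call site: it ingests the given sstables and logs whenever any of them
// overlapped a memtable, so the log line can later be correlated with the
// snapshot that caused it.
func ingestAndLogOverlap(db *pebble.DB, rangeID int64, paths []string) error {
	stats, err := db.IngestWithStats(paths)
	if err != nil {
		return err
	}
	// MemtableOverlappingFiles is the field proposed in cockroachdb/pebble#2422.
	if stats.MemtableOverlappingFiles > 0 {
		log.Printf("r%d: %d/%d ingested sstables overlapped a memtable (%d bytes, ~%d bytes into L0)",
			rangeID, stats.MemtableOverlappingFiles, len(paths), stats.Bytes, stats.ApproxIngestedIntoL0Bytes)
	}
	return nil
}
```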
Exactly.
Good idea, put up cockroachdb/pebble#2422. I think we can also pull the WALs and manifest from a node and dump them to find the conflicting writes.
Actions for follow-up here: look at the test cluster for the proportion of flushable ingests to total ingests.
Seems like we don't actually have a timeseries for the count of total ingestions (filed #103744). Looking at the flushable ingestions, there is a periodicity to them that suggests a correlation with workload restarts. Now that the telemetry cluster has v23.1.x, we can also look there for more evidence.

The ingest-as-flushable graph appears to mirror the graph of snapshot receptions, suggesting that applying a snapshot almost always results in memtable overlap. I don't understand why, and I don't think we can chalk this up to an artifact of the telemetry cluster's artificial workload and split+scatters. If we're able to prevent this memtable overlap, we can likely significantly reduce w-amp (and r-amp) during snapshot reception.
I grabbed the WALs and MANIFESTs from a test cluster node. Here's an example flushable ingest in a WAL:
and the corresponding version edit:
Judging by the fact that 4879580 is the only sstable to ingest into L0, it seems that's where the memtable overlap was:
In the final 1% of the previous WAL 4879549.log, there are two batches containing overlapping keys:
Are these keys expected to be written before the replica's snapshot is ingested? Is there some way we could avoid writing them until the snapshot has been ingested?
@pavelkalinnikov Can you have a look at the above?
cc @cockroachdb/replication
This key is always written by …

Only then do we handle raft ready (cockroach/pkg/kv/kvserver/store_raft.go, line 465 in d6ca139), which eventually ingests the snapshot. So, one possible way to optimize is to avoid this write, because the snapshot ingestion writes this key anyway. This doesn't seem trivial at a quick glance though, because of the way …

This key, I'm not sure why it's already written. The …
RaftHardStateKey and RaftReplicaIDKey are both written for an uninitialized replica (see …).

@sumeerbhola I can only see the …

Before any of these flows, it's possible that only …
The HardState must be getting written somewhere for an uninitialized replica too, since I think an uninitialized replica can vote (so it needs to remember it). This is also the reason for the dance in cockroach/pkg/kv/kvserver/store_split.go, lines 69 to 81 in f397cf9.
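To make the "uninitialized replica votes" point concrete, here is a rough sketch (not CockroachDB's actual code; persistHardState is a made-up stand-in for the state loader write) of why a HardState key already exists on disk before any snapshot is ingested:

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3/raftpb"
)

// persistHardState is a stand-in for writing RaftHardStateKey through the
// store's state loader; here it just prints what would be written.
func persistHardState(rangeID uint64, hs raftpb.HardState) {
	fmt.Printf("r%d: write RaftHardStateKey = %+v\n", rangeID, hs)
}

func main() {
	// An uninitialized replica (no snapshot applied yet, no data) receives a
	// vote request for term 7 and grants it. The grant must be made durable
	// before responding, so a HardState already exists on disk by the time
	// the INITIAL snapshot arrives.
	hs := raftpb.HardState{Term: 7, Vote: 3}
	persistHardState(53, hs)
}
```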
Another possible scenario: …
@sumeerbhola Yeah, I think you're right. In the scenario above, both keys can be written, and in separate batches.
We've discussed getting rid of uninitialized replicas (or at least making them not do any IO) previously. Could be good to revisit this. No uninit replicas / IO => no …
I am wondering how much of the work in cockroach/pkg/kv/kvserver/replica_raftstorage.go, lines 750 to 761 in f397cf9, … There will be no raft log, RaftTruncatedStateKey, or RangeLastReplicaGCTimestampKey. We can read …

btw, are we accidentally clearing the …
To verify the theories from above, I added the following instrumentation to snapshot ingestion and spun up a multi-node local cluster:

```diff
--- a/pkg/kv/kvserver/replica_raftstorage.go
+++ b/pkg/kv/kvserver/replica_raftstorage.go
@@ -514,6 +514,33 @@ func (r *Replica) applySnapshot(
 		log.Infof(ctx, "applied %s (%s)", inSnap, logDetails)
 	}(timeutil.Now())
+	// Look at what keys are already written into the unreplicated key space.
+	{
+		unreplicatedPrefixKey := keys.MakeRangeIDUnreplicatedPrefix(r.ID().RangeID)
+		unreplicatedStart := unreplicatedPrefixKey
+		unreplicatedEnd := unreplicatedPrefixKey.PrefixEnd()
+		it := r.store.TODOEngine().NewEngineIterator(storage.IterOptions{UpperBound: unreplicatedEnd})
+		defer it.Close()
+		var ok bool
+		var err error
+		for ok, err = it.SeekEngineKeyGE(storage.EngineKey{Key: unreplicatedStart}); ok; ok, err = it.NextEngineKey() {
+			key, err := it.UnsafeEngineKey()
+			if err != nil {
+				panic(err)
+			}
+			log.Infof(ctx, "found unreplicated key: %s", key.Key)
+		}
+		if err != nil {
+			panic(err)
+		}
+
+		prevHS, err := r.raftMu.stateLoader.StateLoader.LoadHardState(ctx, r.store.TODOEngine())
+		if err != nil {
+			panic(err)
+		}
+		log.Infof(ctx, "found prev HardState: %+v", prevHS)
+	}
+
 	unreplicatedSSTFile, nonempty, err := writeUnreplicatedSST(
```

Each INITIAL snapshot (during upreplication) looked the same. Each had a …
The … It is less clear what we should do with the …
We're hoping to remove the range deletion, replacing it with an atomic ingest+excise operation that virtualizes overlapping sstables to hide existing data. This will allow ingested snapshot sstables to unconditionally ingest into L6. While we could separately ingest+excise on either side of this key, it'd be preferable to avoid the tiny files.
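For intuition, here's a toy sketch of the excise idea (purely conceptual; this is not pebble's API, and the types are made up): the portion of an existing sstable that falls inside the excised span is hidden by replacing the file with up to two virtual sstables that expose only the non-excised bounds, while the ingested snapshot sstable takes over the excised span in L6.

```go
package main

import "fmt"

// span is a half-open [start, end) key span.
type span struct{ start, end string }

// excise assumes the excised span overlaps the file's span, and returns the
// virtual sstable bounds that remain visible after hiding the excised portion.
func excise(file, excised span) []span {
	var remaining []span
	if file.start < excised.start {
		remaining = append(remaining, span{file.start, excised.start})
	}
	if excised.end < file.end {
		remaining = append(remaining, span{excised.end, file.end})
	}
	return remaining
}

func main() {
	existing := span{"/Table/106/1/0", "/Table/106/1/999"}
	snapshot := span{"/Table/106/1/200", "/Table/106/1/400"}
	// The snapshot sstable is ingested into L6 covering [200, 400); the existing
	// file is virtualized into the two pieces on either side.
	fmt.Println(excise(existing, snapshot))
}
```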
The ReplicasStorage (separated raft log) design included an init step at crash recovery time that would see if …
I would tread carefully here due to apply-time conf changes. There is a secret invariant that you can't have more than one "committed but not known committed" confchange in the log. I don't know how much this can be relaxed, but either way I am not terribly motivated to relax anything in that area, since it is generally poorly understood and problems would not be discovered until it is potentially too late.

I was actually thinking the other day that we should just get out of the business of apply-time conf changes. I filed an issue [1] to that effect. With that, the …
In the 23.1 test cluster, we're observing higher-than-expected exercising of the new flushable-ingest code path. This code path is triggered when one or more of the sstables in an ingest overlap one or more of the memtables. In 22.2 and earlier, such an ingest would flush the memtable, wait for the flush to complete, and then proceed. In 23.1, the ingest writes a record committing to the ingest to the WAL and layers the ingested sstables onto the 'flushable' queue containing the memtables and large batches.
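As a simplified illustration of the behavior difference described above (this is not pebble's actual implementation; the overlap check is boiled down to key-span intersection against a single memtable span):

```go
package main

import "fmt"

// span is a half-open [start, end) key span.
type span struct{ start, end string }

func (a span) overlaps(b span) bool { return a.start < b.end && b.start < a.end }

// ingestPath sketches the decision: if any ingested sstable overlaps the keys
// in the memtable, 22.2 would flush the memtable and wait before ingesting,
// while 23.1 commits the ingest to the WAL and queues the sstables as a
// flushable behind the memtable.
func ingestPath(memtable span, ssts []span, v231 bool) string {
	for _, s := range ssts {
		if s.overlaps(memtable) {
			if v231 {
				return "queue sstables as a flushable (ingest-as-flushable)"
			}
			return "flush memtable, wait for the flush, then ingest"
		}
	}
	return "ingest directly into the LSM"
}

func main() {
	// The memtable already holds writes for range 53's unreplicated keys.
	mem := span{"/Local/RangeID/53/u/", "/Local/RangeID/53/u/zzz"}
	// The incoming snapshot's sstables cover the same range-ID keyspan.
	snapshotSSTs := []span{{"/Local/RangeID/53/u/", "/Local/RangeID/53/u/zzz"}}
	fmt.Println(ingestPath(mem, snapshotSSTs, true))
}
```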
The flushable ingest numbers we're observing indicate that many snapshot applications overlap a memtable, forcing memtable rotation and flushes. This is unexpected, because we thought the ingested sstables should have narrow keyspans local to the associated KV range. We expected that the memtable would not contain any keys that fall within the new replica's six keyspans.
Here are a few example version edits pulled from the test cluster.
There's no specific recording of which version edits correspond to flushable ingests that overlap with the memtable, but I believe all the ones that ratchet log-num are, since only flushes should be ratcheting log-num.

Jira issue: CRDB-25798
Epic CRDB-27235