Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage, kv: do best-effort GC of old versions during storage compactions #57260

Open
sumeerbhola opened this issue Nov 30, 2020 · 14 comments
Open
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team

Comments

@sumeerbhola
Copy link
Collaborator

sumeerbhola commented Nov 30, 2020

GC of old versions requires reading each of the versions and creating point deletes to the storage layer to individually delete them (we may start using range tombstones in #51230). Since this is costly, we do GC periodically, when the "score", which includes the fraction of the bytes in a replica that are dead, is high enough. The lack of incremental GC means that individual keys can accumulate a lot of garbage, which can affect read performance.
It would be better if most of the GC work happened as a side-effect of storage level compactions which take as input a subset of sstables in the LSM. Periodic GC would still be used to cleanup versions that did not get cleaned up incrementally in compactions.

In #42514, we attempted to prototype GC in compactions.

There are some challenges to achieving this in the higher-level context of CockroachDB:

  • MVCCStats maintenance: MVCCStats include information about non-live bytes and GC in compactions will make these inaccurate. We could do one of the following:

    • completely overhaul the MVCCStats machinery
    • keep the periodic scan that computes new MVCCStats, and a compaction-GC-timestamp-threshold, and replicate both through Raft (suggested by @ajwerner). The timestamp threshold allows compactions to GC all non-live versions that have timestamp lower than the threshold. Since this is a non-incremental scheme we are still subject to too much garbage accumulation for individual keys, so this may be ok as a first step, but not necessarily as the final solution.
  • We expect all replicas to have identical replicated state in the store: Replica consistency checking is important for discovering bugs, and sometimes mitigating the impact of bugs that may accidentally creep into production deployments. GC in compactions will make the replicas diverge. The previous compaction-GC-timestamp-threshold approach fixes this, in that the read for consistency checking can hide all non-live data below the timestamp threshold. Alternatively, consistency checking could hide all non-live versions altogether.

  • Correctly identifying which values (SET) in the compaction have not been deleted (DEL) in sstables not involved in the compaction. See the discussion thread at [WIP] storage/engine: attempt to do best-effort GC in compactions (fa… #42514 (review) for details. There are two (not fleshed out) approaches:

    • make different MVCC versions be overlapping in the file key space. This ensures that if one sees, for example, a@30#100.SET, a@20#90.SET, a@10#80.SET in the compaction input, there isn't a a@20#95.DEL lurking in another sstable that is not part of this compaction. The cons of this approach are (a) the possibility of higher write amplification, and (b) very big sstables (if there are too many versions of a key -- though we should be able to GC most of them, even if there is an old protected timestamp, so this may not be an issue).
    • pass down a promise (structured as a range) that no such a key with timestamp <= 20 exists in a higher sstable (suggested by @nvanbenschoten). This idea needs more refinement, since propagation of the promise would require the key range of the promise to be included in the sstable key range (like range tombstones), which may itself cause unnecessarily wide compactions and higher write amplification. And if the file key range caused by the promise is considered in the read path, we may unnecessarily read an sstable where there is no relevant data, which would increase read amplification.

Jira issue: CRDB-2843

Epic CRDB-2566

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. labels Nov 30, 2020
@sumeerbhola sumeerbhola self-assigned this Nov 30, 2020
@sumeerbhola
Copy link
Collaborator Author

Somewhat related is the issue of efficiently supporting row-level TTLs, discussed here #20239 (comment)
Compactions doing discard of data that is expired due to TTL is easier from a correctness perspective, since the state of the row (key-value pair) is sufficient to make that expiry decision -- it does not need knowledge of the state of other rows.

@sumeerbhola
Copy link
Collaborator Author

In the context of the design for cockroachdb/pebble#1339, @nvanbenschoten @erikgrinaker brought up:

We could use these range keys in CockroachDB to store information about GC actions over an MVCC key span, and then expose them in Pebble compaction filters, such that we could push much of the GC mechanics down to Pebble compactions.

My interpretation of this is:

  • Use the support we are building for range keys to introduce an "MVCC-aware RANGEDEL". An MVCC-RANGEDEL [k1,k2)@v would be telling Pebble that all points and ranges with prefix overlapping [k1,k2) and version < v can be discarded, and must immediately be hidden from all iteration.
  • Unlike regular range keys, where Pebble only provides masking iteration, and otherwise does not interpet the range key, this is a case where Pebble knows the semantics of the operation. Also, there would not be any Unset operation on MVCC-RANGEDEL, and it would be elided when falling down to L6.
  • For range keys we have not decided yet whether to require the invariant that two overlapping key prefixes with @v1#s1 and @v2#s2 satisfy v1 < v2 => s1 <= s2. That is, versions would not violate the LSM invariant. For MVCC-RANGEDEL we would need to require this.
  • Having this operation would not avoid the need to scan the range and recompute MVCC stats, so GC would continue to be periodic at a low frequency. However, it does allow for efficient deletion since regular RANGEDELs can only be used over a swathe of keys only if there aren't more recent versions.

@erikgrinaker
Copy link
Contributor

My interpretation, without much familiarity with Pebble internals, was that instead of introducing an "MVCC-aware RANGEDEL" that Pebble has to know the semantics of, we instead expose a mechanism where users (CRDB) can hook into the compaction process via compaction filters and provide a custom GC policy.

We could then drop a user-defined range key across the span we want to GC (to Pebble this would just be an arbitrary range key), and the compaction filter would reference this range key during compactions to make versions below it eligible for compaction-time GC.

As for stats, I'd imagine that CRDB would update its MVCC stats when it drops this custom range key. From CRDB's point of view, the logical data is removed, and it's just pending compaction in Pebble. This is no different than other write amplification in Pebble, which will be reduced on compaction.

I may be misrepresenting @nvanbenschoten's idea here, but that was my understanding at least.

@nvanbenschoten
Copy link
Member

@erikgrinaker's comment is closer to what we had discussed. Pebble does not need to guarantee that GC-able versions are immediately hidden from all iteration after the ranged key has been written because those versions are not visible by higher levels anyway. By definition, for a KV version to be GC-able, it must be shadowed by another version that itself is below the GC threshold. And because we don't allow MVCC scans at timestamps below the GC threshold, the GC-able version will never be visible to an MVCC scan.

So the idea was to treat this ranged key as a form of metadata that is exposed to a user-defined compaction filter but otherwise ignored. It was essentially the same idea that you mentioned above:

pass down a promise (structured as a range) that no such a key with timestamp <= 20 exists in a higher sstable

@sumeerbhola
Copy link
Collaborator Author

@tbg: @jbowens and I were discussing compaction-time GC and MVCCStats in light of the recent discussions on https://github.com/cockroachlabs/support/issues/1882 and the old issue #38251 (comment), specifically the comment by @andreimatei:

And we need to summarize the function somehow. One simple engineering thing to do is keep a series of buckets representing time intervals and each maintaining the amount of bytes that died during that interval (say, 10 of them each representing TTL/10)

With the resolved-LSM plan, the main difficulty with compaction-time GC, of not knowing whether a version is actually committed or not goes away, and the remaining issue is only related to MVCCStats computation. When doing a compaction, say it sees point(T1) and point(T2) for the same key where T1 < T2 and both are <= gc-threshold:

  • point(T1) can be GC'd
  • If point(T2) is a tombstone and there is no point in the LSM older than T1, then point(T2) can also be GC'd. This would be possible in a L6 compaction if the compaction bounds were such that all versions of this key are known to be included in this compaction.

Compaction-time GC does not need to know if there are points newer than T2.

This compaction-time GC would need to compute a delta for MVCCStats, where there are some difficulties:

  • MVCCStats.KeyBytes: My understanding is that other than the latest version, the other versions only contribute MVCCVersionTimestampSize each. So if we also wanted to GC point(T2) we would need to know whether there is a newer version or not.
  • Stats divergence across replicas: This is where Andrei's suggestion is relevant. If stats about dead versions were bucketed by time, then stats comparison for replica divergence would ignore buckets that were below the gc-threshold, since replicas will diverge in their compaction-time GC of these.

We of course have the option to do a periodic scan to compute new MVCCStats, that hide garbage below the gc-threshold, and only then give Pebble permission to GC (this is what was discussed earlier in the issue), but it is worse than having the compaction produce a delta since:

  • The scan is periodic and has overhead.
  • MVCCStats should truly represent what is in the replica at that store. Hiding garbage that has not yet been deleted makes MVCCStats less useful. And knowing the accurate stats can also tell us that some of the old dead buckets still have data at this replica (since compaction-time GC is not guaranteed to clean everything) and drive a scan-based cleanup.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Nov 9, 2022

I'm uneasy about the replica divergence introduced by purely compaction-driven GC. The low-level storage APIs expose all keys regardless of the GC threshold, so different replicas will have a different view of the range data depending on their compaction schedules and LSM structure. Nathan says:

Pebble does not need to guarantee that GC-able versions are immediately hidden from all iteration after the ranged key has been written because those versions are not visible by higher levels anyway.

While this is true at the KV level, it strikes me as the sort of thing that could very easily lead to subtle bugs in storage-level code. It sort of breaks the idea of a deterministic Raft state machine. I think we should consider alternative designs that avoid this, and I don't feel like avoiding a background scan is a strong enough reason by itself.

@jbowens
Copy link
Collaborator

jbowens commented Nov 9, 2022

I'm uneasy about the replica divergence introduced by purely compaction-driven GC.

Doesn't this replica divergence exist with any form of compaction-time GC? I guess the alternative is to force manual compactions of the range's sstables across replicas, which may require compacting much more than just the range's boundaries since range boundaries ≠ sstable boundaries. The write amplification from these forced compactions runs counter to the original goal of avoiding the write amplification of writing explicit tombstones.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Nov 9, 2022

I think it's more like we'd write some predicate to Pebble (or some other lower-level component) that hides the keys until they're compacted. But that predicate could end up looking an awful lot like Pebble tombstones, which sort of defeats the purpose.

We could maybe get away with writing a garbage marker on MVCC versions whenever we replace them, which makes the garbage determination local to individual keys, but that comes with write-amplification too (on the main txn write path, no less). Haven't given this much thought.

@jbowens
Copy link
Collaborator

jbowens commented Nov 9, 2022

We could maybe get away with writing a garbage marker on MVCC versions whenever we replace them, which makes the garbage determination local to individual keys,

Running with this thought, when two keys meet in a compaction with a "resolved-only" LSM, Pebble knows what's garbage. It could write garbage keys annotated with the timestamp of when they became garbage. As a part of a GC, all replicas could scan their local replica data, writing point tombstones for any garbage keys not annotated with the garbage timestamp. The remaining garbage would be required to be filtered by Pebble at read-time by comparing the garbage timestamp to the GC threshold, and eventually be dropped by a Pebble compaction that observes the key with a garbage timestamp less than the GC threshold.

That would give us consistent replicas and compaction-time GC of (most?) garbage, but at the cost of
a) complicating the key encoding
b) a scan on every replica when performing a GC (the scan on the replicas could be elided and replaced with point deletes if the leaseholder scan found little garbage)
c) a timestamp comparison whenever Pebble returns a garbage key


I think Pebble compaction-driven GC will provide the most benefits, but we would need to be cautious and safeguard against bugs from this shift to allowing divergence at the Pebble API boundary.

@erikgrinaker
Copy link
Contributor

It could write garbage keys annotated with the timestamp of when they became garbage. As a part of a GC, all replicas could scan their local replica data, writing point tombstones for any garbage keys not annotated with the garbage timestamp.

Good idea. Wouldn't this still have inconsistent state, because the set of keys having garbage annotations would differ between replicas? To be clear, I think this is a lesser problem (especially if we have specialized APIs to expose this state and mark it with "here be dragons").

a scan on every replica when performing a GC (the scan on the replicas could be elided and replaced with point deletes if the leaseholder scan found little garbage)

I wonder how this would work with very short GC TTLs. We're moving to 4 hours in 23.1, serverless is considering dropping to 2 hours, and I know a lot of people really want "as short as possible" (i.e. as low as 1 minute). This means that we can often run GC before Pebble has done much compacting at all. But I suppose in those cases the amount of garbage to collect is expected to be small anyway, except for bulk updates.

a timestamp comparison whenever Pebble returns a garbage key

This may be unavoidable in any compaction-GC variant that's Raft-consistent.

@tbg
Copy link
Member

tbg commented Nov 21, 2022

Rephrasing Jackson’s suggestion to make sure I understand it - we make pebble compactions mark keys that are “shadowed” (= there exists newer MVCC version for the same key), and whenever a shadowed key shows up under the GCThreshold we can drop it in compactions? Doesn’t this still have the problems with updating stats? Still interested in understanding the implications of this approach.

Taking a step back, what are our goals? I think they are, loosely in priority order:

  • avoid write-amp when deleting keys (user low on disk → deletes stuff → even lower on disk)
  • avoid periodic scans as part of GC as those are expensive

If we can get a solution in which compactions really do the job entirely vs having to have two ways in which old versions can get GC’ed that would seem preferrable. I wonder if we can do it:

  • make MVCCStats unreplicated. Yes, this is a bigger change, but I wonder if it isn’t an attractive one in isolation anyway. We would keep distributing MVCCStats updates via commands just the same, and they’d get tallied up on each replica. The difference is that replicas could also introduce their own deltas on top of these, and the world would keep turning. This would actually make a few things easier. We would have to fully embrace ContainsEstimates because it would de facto “always” be true. It would be less of a big deal to introduce estimates. We could rectify the deltas during consistency checks, snapshots sending, splits, etc because fixing a delta is now easy - you scan through the data, slap the delta on, done. No need to worry about a race where another write slips in; the small discrepancy is now acceptable.
  • teach the MVCC layer (and consistency checks) about the GCThreshold; under the GCThreshold, shadowed versions (and tombstones) are treated as nonexistent.
  • add a compaction filter that has a view of the current GCThreshold. Whenever two keys of the same version meet in a compaction, and the older is under the GCThreshold, it gets dropped. CRDB gets to see the kvs that are dropped and can synthesize (local, unreplicated) stats deltas from that which can be put into local unreplicated MVCCStats. Seeing the “dropped” and “covering” key should be enough to generate the updates. In particular, GCBytesAge is accrued from the covering key’s timestamp on, but we see that here. (If there’s a problem we can also change the semantics of GCBytesAge to accrue from its own timestamp on).
  • SSTs in L6 keep metadata about the number of shadowed keys in them and the max shadowed timestamp (versions of a key can never be split across SSTs, right?). A mechanism makes sure that an L6 SST that has shadowed keys eventually becomes recompacted as the GCThreshold (complication: on any overlapping replica) rises. Maybe something similar makes sense for higher levels as well. It might also make sense to have “sublevels” in L6, where one SST has the non-shadowed versions and the other SST the shadowed versions. That way, we can hope to drop the shadowed SST wholesale at some point and beside, we don’t need to check it for current reads because we know it’s all shadowed.

I also sort of wonder, doesn’t this splitting in 4) sort of look like a useful approach for a generational LSM? If we split each SST into two parts while it’s being created, namely

  • the live MVCC keys (keys that are not shadowed and aren’t tombstones)
  • non-live MVCC keys (shadowed keys, tombstones)

and record into the second SST’s metadata the max timestamp we’ve seen (say T), then can’t we do a “time bound iterator” type thing here? If an MVCC reader comes in with a timestamp S, and S>T (as will be the case for ~all reads, since T lags current time), then we know by construction that it doesn’t need to see the second SST. In particular, queue-like workloads would work ~fine, regardless of the TTL. The split between the two SSTs could be adjusted - the timestamp could be picked in advance (now()-10s instead of whatever the largest write is, for example). It’s always allowed to put something that should be in the second SST in the first, and this will naturally happen - an older write is, when it gets shadowed, still in the “live” SST in some lower level.

Wonder if we've considered such approaches before and what the problems are.

@jbowens
Copy link
Collaborator

jbowens commented Nov 22, 2022

Doesn’t this still have the problems with updating stats? Still interested in understanding the implications of this approach.

Sort of. I think it's somewhat agnostic to the options for dealing with the stats problem. The goal of that approach was to preserve more replica synchronization at the lower, internal levels of the mvcc iterator stack. That approach preserves identical replica state within much of the mvcc iterator implementation, since shadowing of marked garbage can be done locally without knowledge of what key proceeds it.

If we split each SST into two parts while it’s being created, namely

  • the live MVCC keys (keys that are not shadowed and aren’t tombstones)
  • non-live MVCC keys (shadowed keys, tombstones)

Yeah, we'd like to get here eventually. The tracking Pebble issue is cockroachdb/pebble#1170. The biggest obstacle (both to compaction-time GC and separating live and garbage keys) right now is that although we can query the resolved timestamp(s) of a key range, a compaction's local view is not necessarily resolved. There may a point tombstone from a transaction abort higher in the LSM that deletes a key that locally appears to shadow another key.

In the recent disaggregated storage designs, we've been trying to ensure that there's some part of storage—either a LSM, or lower N levels of a LSM—that is known to be fully resolved.

make MVCCStats unreplicated

On the continuum between where we are today and completely unreplicated stats, I think there's an attractive spot in between. Garbage within MVCCStats can be divided into buckets according to when they became garbage. When garbage ages out of the final bucket, it becomes un-replicated and simultaneously eligible for compaction-time GC.

@erikgrinaker
Copy link
Contributor

we make pebble compactions mark keys that are “shadowed” (= there exists newer MVCC version for the same key), and whenever a shadowed key shows up under the GCThreshold we can drop it in compactions

This isn't sufficient. For a key to be GCable, the key must be shadowed, but the key shadowing it must also be below the GC threshold. Otherwise, we're breaking reads between the GC threshold and the shadowing key.

I don't think that changes much about everything else you said, except that to determine whether a shadowed key can be GCed we will have to be able to read the live key (because in the case where a live key shadows another version, we can't GC the older version until the live key drops below the GC threshold).

@jbowens
Copy link
Collaborator

jbowens commented Mar 10, 2023

I think there's enough complexity and complications to make this intractable, but I want to document it in case it sparks ideas for someone else:

We could use a manual compaction to drive GC. We expect most (~99%) of a range's bytes to be in L5+L6. We could expose a new manual compaction mechanism from Pebble, allowing garbage collection to request a L5->L6 compaction of the range's bounds and pass in a compaction filter. This manual compaction would be setup with a compaction iterator that includes all levels (including memtables) that contain keys within the range. This compaction iterator would have a full view of the LSM within this range.

This compaction would operate normally, except while within the bounds of the manually compacted KV range.

  • If a key is within the the range bounds:
    • If a key k is not elided due to shadowing:
      • If a key k originates from L4 or higher: the compaction iterator invokes CompactionFilter.KeyAboveCompaction(k InternalKey, lv LazyValue) and then discards the key.
      • If a key k originates from L5 or L6: the compaction iterator invokes CompactionFilter.KeyWithinCompaction(k InternalKey, lv LazyValue) (elide bool). If KeyWithinCompaction returns true, the compaction iterator discards the key.

This would allow most garbage to be dropped as a part of useful compactions. The remaining garbage can be dropped explicitly through point tombstones, built in a batch at the same time.

There are some complications that probably preclude this approach:

  • The batch of higher-level garbage needs to somehow be committed atomically with compaction completion.
  • The above scheme describes a compaction as wide as the KV range. It's difficult to make this granular, because file boundaries vary from store to store.
  • This scheme only saves write amp if there's sufficient garbage. Compaction picking heuristics are designed to reduce write amp, so picking compactions by another criteria can increase write amp unless the savings from avoiding writing tombstones (and any eventual compactions they would trigger) offset that.
  • The L5 and L6 files containing the KV range may contain data from other adjacent ranges that must be compacted at the same time. If L5 and L6 file sizes are 64M and 128M respectively, we'd expect the range's L5 and L6 sstables to contain an average of 64M + 128M = 192M of bytes from adjacent ranges.
  • If there are any open LSM snapshots, dropping garbage violates the snapshot's consistent view of the LSM. We have considered eliminating use of LSM snapshots, or constraining LSM snapshots effective range (perf: snapshot keyspan bounds pebble#1810).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team
Projects
Status: Backlog
Development

No branches or pull requests

5 participants