storage, kv: do best-effort GC of old versions during storage compactions #57260
Comments
Somewhat related is the issue of efficiently supporting row-level TTLs, discussed here: #20239 (comment)
In the context of the design for cockroachdb/pebble#1339, @nvanbenschoten and @erikgrinaker brought up:
My interpretation of this is:
My interpretation, without much familiarity with Pebble internals, was that instead of introducing an "MVCC-aware RANGEDEL" that Pebble has to know the semantics of, we instead expose a mechanism where users (CRDB) can hook into the compaction process via compaction filters and provide a custom GC policy. We could then drop a user-defined range key across the span we want to GC (to Pebble this would just be an arbitrary range key), and the compaction filter would reference this range key during compactions to make versions below it eligible for compaction-time GC. As for stats, I'd imagine that CRDB would update its MVCC stats when it drops this custom range key. From CRDB's point of view, the logical data is removed, and it's just pending compaction in Pebble. This is no different than other write amplification in Pebble, which will be reduced on compaction. I may be misrepresenting @nvanbenschoten's idea here, but that was my understanding at least.
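To make this concrete, here is a minimal sketch in Go of what such a filter could look like. Pebble has no compaction-filter API today; `GCHint`, `gcFilter`, and the assumed iteration order are all invented for illustration:

```go
package gcsketch

import "bytes"

// GCHint is the payload CRDB would attach to its custom range key:
// point versions in [Start, End) at or below Threshold are GC candidates.
type GCHint struct {
	Start, End []byte
	Threshold  uint64
}

// gcFilter sketches the user-defined compaction filter. A compaction
// hands it point keys in order, newest version first within each user
// key, so one key of lookback suffices to know whether a version is
// shadowed. Tombstones and unresolved intents are ignored for brevity.
type gcFilter struct {
	hint        GCHint
	prevUserKey []byte
	prevCovered bool // the previous (newer) version was <= Threshold
}

// Drop reports whether this version may be elided from the compaction
// output. A version is droppable only if it is (a) inside the hint's
// span, (b) at or below the threshold, and (c) shadowed by a newer
// version that is itself at or below the threshold -- preserving every
// read at a timestamp above the GC threshold.
func (f *gcFilter) Drop(userKey []byte, mvccTS uint64) bool {
	covered := bytes.Compare(userKey, f.hint.Start) >= 0 &&
		bytes.Compare(userKey, f.hint.End) < 0 &&
		mvccTS <= f.hint.Threshold
	shadowed := bytes.Equal(userKey, f.prevUserKey) && f.prevCovered

	f.prevUserKey = append(f.prevUserKey[:0], userKey...)
	f.prevCovered = covered
	return covered && shadowed
}
```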
@erikgrinaker's comment is closer to what we had discussed. Pebble does not need to guarantee that GC-able versions are immediately hidden from all iteration after the range key has been written, because those versions are not visible to higher levels anyway. By definition, for a KV version to be GC-able, it must be shadowed by another version that itself is below the GC threshold. And because we don't allow MVCC scans at timestamps below the GC threshold, the GC-able version will never be visible to an MVCC scan. So the idea was to treat this range key as a form of metadata that is exposed to a user-defined compaction filter but otherwise ignored. It was essentially the same idea that you mentioned above:
@tbg: @jbowens and I were discussing compaction-time GC and MVCCStats in light of the recent discussions on https://github.com/cockroachlabs/support/issues/1882 and the old issue #38251 (comment), specifically the comment by @andreimatei:
With the resolved-LSM plan, the main difficulty with compaction-time GC (not knowing whether a version is actually committed) goes away, and the remaining issue relates only to MVCCStats. Compaction-time GC does not need to know if there are points newer than T2, but it would need to compute a delta for MVCCStats, where there are some difficulties:
We of course have the option to do a periodic scan to compute new MVCCStats that hide garbage below the GC threshold, and only then give Pebble permission to GC (this is what was discussed earlier in the issue), but that is worse than having the compaction produce a delta, since:
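For contrast, a minimal sketch of the delta approach, assuming a hypothetical `statsDelta` hook invoked by the compaction filter (the real `enginepb.MVCCStats` are far richer than the two fields here):

```go
package gcsketch

// statsDelta sketches the MVCCStats adjustment a compaction would
// accumulate as it drops garbage. The real enginepb.MVCCStats track
// many more fields (GCBytesAge, intents, system keys, ...), which is
// part of the difficulty alluded to above.
type statsDelta struct {
	KeyBytes int64
	ValBytes int64
}

// onDropVersion would be called by the compaction filter for each elided
// garbage version. Applying the resulting delta to the range's
// replicated stats is the hard part: each replica's compactions run at
// different times, so the deltas materialize independently per replica.
func (d *statsDelta) onDropVersion(keyLen, valLen int) {
	d.KeyBytes -= int64(keyLen)
	d.ValBytes -= int64(valLen)
}
```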
I'm uneasy about the replica divergence introduced by purely compaction-driven GC. The low-level storage APIs expose all keys regardless of the GC threshold, so different replicas will have a different view of the range data depending on their compaction schedules and LSM structure. Nathan says:
While this is true at the KV level, it strikes me as the sort of thing that could very easily lead to subtle bugs in storage-level code. It sort of breaks the idea of a deterministic Raft state machine. I think we should consider alternative designs that avoid this, and I don't feel like avoiding a background scan is a strong enough reason by itself.
Doesn't this replica divergence exist with any form of compaction-time GC? I guess the alternative is to force manual compactions of the range's sstables across replicas, which may require compacting much more than just the range's span, since range boundaries ≠ sstable boundaries. The write amplification from these forced compactions runs counter to the original goal of avoiding the write amplification of writing explicit tombstones.
I think it's more like we'd write some predicate to Pebble (or some other lower-level component) that hides the keys until they're compacted. But that predicate could end up looking an awful lot like Pebble tombstones, which sort of defeats the purpose. We could maybe get away with writing a garbage marker on MVCC versions whenever we replace them, which makes the garbage determination local to individual keys, but that comes with write amplification too (on the main txn write path, no less). I haven't given this much thought.
Running with this thought: when two keys meet in a compaction with a "resolved-only" LSM, Pebble knows what's garbage. It could write garbage keys annotated with the timestamp at which they became garbage. As part of GC, all replicas could scan their local replica data, writing point tombstones for any garbage keys not annotated with a garbage timestamp. The remaining garbage would be filtered by Pebble at read time, by comparing the garbage timestamp to the GC threshold, and eventually be dropped by a Pebble compaction that observes the key with a garbage timestamp less than the GC threshold. That would give us consistent replicas and compaction-time GC of (most?) garbage, but at the cost of this read-time filtering. I think Pebble compaction-driven GC will provide the most benefits, but we would need to be cautious and safeguard against bugs from this shift to allowing divergence at the Pebble API boundary.
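A sketch of the read-time filtering this scheme would need, assuming a hypothetical per-version garbage annotation (nothing like `annotatedKey` exists in Pebble today):

```go
package gcsketch

// annotatedKey is one point version plus the annotation a compaction
// would write when it observes the version become shadowed.
type annotatedKey struct {
	UserKey   []byte
	MVCCTS    uint64
	GarbageTS uint64 // 0 if not (yet) marked as garbage
}

// visible reports whether iteration should surface this version: keys
// marked garbage at or below the GC threshold are filtered at read time
// until a compaction physically drops them.
func visible(k annotatedKey, gcThreshold uint64) bool {
	return k.GarbageTS == 0 || k.GarbageTS > gcThreshold
}
```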
Good idea. Wouldn't this still have inconsistent state, because the set of keys having garbage annotations would differ between replicas? To be clear, I think this is a lesser problem (especially if we have specialized APIs to expose this state and mark it with "here be dragons").
I wonder how this would work with very short GC TTLs. We're moving to 4 hours in 23.1, serverless is considering dropping to 2 hours, and I know a lot of people really want "as short as possible" (i.e. as low as 1 minute). This means that we can often run GC before Pebble has done much compacting at all. But I suppose in those cases the amount of garbage to collect is expected to be small anyway, except for bulk updates.
This may be unavoidable in any compaction-GC variant that's Raft-consistent.
Rephrasing Jackson's suggestion to make sure I understand it: we make Pebble compactions mark keys that are "shadowed" (i.e. a newer MVCC version exists for the same key), and whenever a shadowed key shows up under the GCThreshold we can drop it in compactions? Doesn't this still have the problems with updating stats? I'm still interested in understanding the implications of this approach. Taking a step back, what are our goals? I think they are, loosely in priority order:
If we can get a solution in which compactions really do the job entirely, versus having two ways in which old versions can get GC'ed, that would seem preferable. I wonder if we can do it:
I also sort of wonder: doesn't this splitting in 4) look like a useful approach for a generational LSM? If we split each SST into two parts while it's being created, namely
and record into the second SST's metadata the max timestamp we've seen (say T), then can't we do a "time-bound iterator" type thing here? If an MVCC reader comes in with a timestamp S, and S > T (as will be the case for ~all reads, since T lags current time), then we know by construction that it doesn't need to see the second SST. In particular, queue-like workloads would work ~fine, regardless of the TTL. The split between the two SSTs could be adjusted: the timestamp could be picked in advance (now()-10s instead of whatever the largest write is, for example). It's always allowed to put something that should be in the second SST in the first, and this will naturally happen: an older write is, when it gets shadowed, still in the "live" SST in some lower level. I wonder if we've considered such approaches before and what the problems are.
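A sketch of the read-path check this split would enable, assuming hypothetical per-sstable metadata (similar in spirit to the table properties CRDB's existing time-bound iterators consult):

```go
package gcsketch

// sstMeta is the metadata the split would maintain per sstable: for the
// "old versions" half, MaxMVCCTS records the largest timestamp written
// into it (T in the comment above).
type sstMeta struct {
	OldVersions bool   // true for the shadowed-versions half of the split
	MaxMVCCTS   uint64 // T: max MVCC timestamp contained in the sstable
}

// mustRead reports whether a scan at timestamp readTS (S above) needs to
// open this sstable. For S > T -- the common case, since T lags current
// time -- the old-versions sstable is skipped by construction: the
// newest version of every key in it lives in the "live" sstable. The
// skip is only sound in that direction; a live sstable can never be
// skipped, since it is always allowed to contain old versions too.
func mustRead(m sstMeta, readTS uint64) bool {
	return !(m.OldVersions && readTS > m.MaxMVCCTS)
}
```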
Sort of. I think it's somewhat agnostic to the options for dealing with the stats problem. The goal of that approach was to preserve more replica synchronization at the lower, internal levels of the MVCC iterator stack. That approach preserves identical replica state within much of the MVCC iterator implementation, since shadowing of marked garbage can be determined locally, without knowledge of which key precedes it.
Yeah, we'd like to get here eventually. The tracking Pebble issue is cockroachdb/pebble#1170. The biggest obstacle (both to compaction-time GC and to separating live and garbage keys) right now is that although we can query the resolved timestamp(s) of a key range, a compaction's local view is not necessarily resolved. There may be a point tombstone from a transaction abort higher in the LSM that deletes a key that locally appears to shadow another key. In the recent disaggregated storage designs, we've been trying to ensure that there's some part of storage—either an LSM, or the lower N levels of an LSM—that is known to be fully resolved.
On the continuum between where we are today and completely unreplicated stats, I think there's an attractive spot in between. Garbage within MVCCStats can be divided into buckets according to when it became garbage. When garbage ages out of the final bucket, it becomes unreplicated and simultaneously eligible for compaction-time GC.
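A sketch of what such bucketed stats could look like; the bucket boundaries and names are made up for illustration:

```go
package gcsketch

// gcBuckets divides garbage bytes by the age at which versions became
// garbage. Younger buckets remain part of the replicated,
// consistency-checked MVCCStats; once garbage ages out of the last
// replicated bucket, it becomes unreplicated and simultaneously
// eligible for compaction-time GC.
type gcBuckets struct {
	// Replicated holds garbage newer than the GC threshold, bucketed by
	// age, e.g. [0, 1h), [1h, 4h), [4h, threshold). Index 0 is youngest.
	Replicated []int64
	// Unreplicated holds garbage older than the threshold; each
	// replica's value may differ depending on what its compactions have
	// already dropped.
	Unreplicated int64
}

// ageOut rotates the buckets: the oldest replicated bucket moves into
// the unreplicated tally, granting compactions permission to drop it.
func (b *gcBuckets) ageOut() {
	last := len(b.Replicated) - 1
	if last < 0 {
		return
	}
	b.Unreplicated += b.Replicated[last]
	copy(b.Replicated[1:], b.Replicated[:last]) // shift buckets older
	b.Replicated[0] = 0
}
```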
This isn't sufficient. For a key to be GCable, the key must be shadowed, but the key shadowing it must also be below the GC threshold. Otherwise, we're breaking reads between the GC threshold and the shadowing key. (For example, with a GC threshold of 15, a@10 shadowed by a@20 cannot be GCed, since a read at timestamp 17 must still see a@10.) I don't think that changes much about everything else you said, except that to determine whether a shadowed key can be GCed, we will have to be able to read the live key (because in the case where a live key shadows another version, we can't GC the older version until the live key drops below the GC threshold).
I think there's enough complexity and complications to make this intractable, but I want to document it in case it sparks ideas for someone else: We could use a manual compaction to drive GC. We expect most (~99%) of a range's bytes to be in L5+L6. We could expose a new manual compaction mechanism from Pebble, allowing garbage collection to request an L5->L6 compaction of the range's bounds and pass in a compaction filter. This manual compaction would be set up with a compaction iterator that includes all levels (including memtables) that contain keys within the range. This compaction iterator would have a full view of the LSM within this range. The compaction would operate normally, except that the filter would be applied while within the bounds of the manually compacted KV range.
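A sketch of the Pebble API extension this would need; `CompactForGC` and `GCCompactionFilter` are hypothetical, and Pebble's real manual compaction takes only key bounds:

```go
package gcsketch

// DB stands in for *pebble.DB; the API below is hypothetical.
type DB struct{}

// GCCompactionFilter decides, per point version, whether it may be
// elided from the compaction output.
type GCCompactionFilter interface {
	Drop(userKey []byte, mvccTS uint64) bool
}

// CompactForGC sketches the mechanism described above: the KV layer's
// GC queue would call it instead of writing point tombstones for most
// garbage.
func CompactForGC(db *DB, start, end []byte, f GCCompactionFilter) error {
	// 1. Pick the L5 and L6 sstables overlapping [start, end).
	// 2. Build a compaction iterator over those inputs that additionally
	//    reads every higher level (and the memtables) overlapping the
	//    bounds, so the filter sees a full view of the LSM in the range.
	// 3. Write outputs to L6 as usual, but while positioned inside
	//    [start, end), elide any version for which f.Drop returns true.
	return nil
}
```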
This would allow most garbage to be dropped as part of useful compactions. The remaining garbage could be dropped explicitly through point tombstones, built in a batch at the same time. There are some complications that probably preclude this approach:
GC of old versions requires reading each of the versions and issuing point deletes to the storage layer to delete them individually (we may start using range tombstones; see #51230). Since this is costly, we do GC periodically, when the "score" (which includes the fraction of the bytes in a replica that are dead) is high enough. The lack of incremental GC means that individual keys can accumulate a lot of garbage, which can affect read performance.
It would be better if most of the GC work happened as a side effect of storage-level compactions, which take as input a subset of sstables in the LSM. Periodic GC would still be used to clean up versions that did not get cleaned up incrementally in compactions.
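For concreteness, a simplified sketch of the score-based gating described above (the threshold is a placeholder, and the real score also accounts for the TTL and how long bytes have been dead):

```go
package gcsketch

// shouldQueueForGC sketches the periodic gating: run GC only when the
// dead-byte fraction is high enough. CRDB's actual GC queue scoring is
// richer (it weighs GCBytesAge and the TTL); the threshold here is a
// placeholder, not CRDB's real tuning.
func shouldQueueForGC(liveBytes, totalBytes int64) bool {
	const minDeadFraction = 0.5 // placeholder threshold
	if totalBytes == 0 {
		return false
	}
	dead := float64(totalBytes-liveBytes) / float64(totalBytes)
	return dead >= minDeadFraction
}
```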
In #42514, we attempted to prototype GC in compactions.
There are some challenges to achieving this in the higher-level context of CockroachDB:
MVCCStats maintenance: MVCCStats include information about non-live bytes, and GC in compactions will make these inaccurate. We could do one of the following:
We expect all replicas to have identical replicated state in the store: Replica consistency checking is important for discovering bugs, and sometimes mitigating the impact of bugs that may accidentally creep into production deployments. GC in compactions will make the replicas diverge. The previous compaction-GC-timestamp-threshold approach fixes this, in that the read for consistency checking can hide all non-live data below the timestamp threshold. Alternatively, consistency checking could hide all non-live versions altogether.
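A sketch of that last alternative, assuming a hypothetical helper over MVCC iteration (CRDB's actual consistency checker hashes the replicated state as stored):

```go
package gcsketch

import "hash"

// versionedKV is one MVCC version as surfaced by iteration.
type versionedKV struct {
	Key, Val []byte
	Live     bool // newest version of its user key, and not a deletion
}

// checksumLiveOnly folds only live versions into the consistency
// checksum, so replicas whose compactions have dropped different
// subsets of garbage still agree. The other option in the text --
// hiding only non-live data below the GC threshold -- would
// additionally hash garbage above the threshold.
func checksumLiveOnly(h hash.Hash, kvs []versionedKV) {
	for _, kv := range kvs {
		if !kv.Live {
			continue
		}
		h.Write(kv.Key)
		h.Write(kv.Val)
	}
}
```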
Correctly identifying which values (SET) in the compaction have not been deleted (DEL) in sstables not involved in the compaction. See the discussion thread at [WIP] storage/engine: attempt to do best-effort GC in compactions (fa… #42514 (review) for details. There are two (not fleshed out) approaches:
- Keep all versions of a key in the same sstable, so that when we see `a@30#100.SET, a@20#90.SET, a@10#80.SET` in the compaction input, there isn't an `a@20#95.DEL` lurking in another sstable that is not part of this compaction. The cons of this approach are (a) the possibility of higher write amplification, and (b) very big sstables (if there are too many versions of a key -- though we should be able to GC most of them, even if there is an old protected timestamp, so this may not be an issue).
- Write "promises" into sstables, e.g. a promise that no `a` key with timestamp <= 20 exists in a higher sstable (suggested by @nvanbenschoten). This idea needs more refinement, since propagation of the promise would require the key range of the promise to be included in the sstable key range (like range tombstones), which may itself cause unnecessarily wide compactions and higher write amplification. And if the file key range caused by the promise is considered in the read path, we may unnecessarily read an sstable where there is no relevant data, which would increase read amplification.

Jira issue: CRDB-2843
Epic CRDB-2566