
storage: range data is not being GCed in a timely manner #38251

Closed
andreimatei opened this issue Jun 17, 2019 · 15 comments · Fixed by #92118
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@andreimatei
Contributor

andreimatei commented Jun 17, 2019

A customer has a cluster with a humongous range (range 6 - the "system config" range) that seems to constantly have around 3 days' worth of old values, instead of 25h worth. And those extra two days amount to gigabytes of dead values.
The GC queue does not seem interested in processing the range. When enqueued manually, it says low score:
[screenshot: range6_enqueue]

It also says "likely last GC: 43h" ago, which I think is wrong (the real figure is more like 72h, which is also confirmed by the gcThreshold from the range stats).

Why the low score, I don't know yet. Perhaps something's busted with the scoring heuristics?

I've got the range data, all 7GB of it but highly compressed, and the range stats (internal only).
https://drive.google.com/drive/folders/1zgM9h5m3-voqV_Usm6fHxnuM8p4dK0LK?usp=sharing
There you can see, for example, all the versions of key /Table/3/1/4191/2/1, which range from June 14, 2019 11:17:25 AM to June 17, 2019 7:57:27 AM.

This is a spin-off of #38088.

gz#4843

Jira issue: CRDB-5642

@andreimatei andreimatei added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting A-kv Anything in KV that doesn't belong in a more specific category. labels Jun 17, 2019
@andreimatei andreimatei added this to Incoming in KV via automation Jun 17, 2019
@andreimatei
Contributor Author

One possibly interesting fact is that the range has only 7 keys with tons and tons of old versions. The other keys are well behaved.

@tbg
Member

tbg commented Jun 18, 2019

The shouldQueue method uses the GCBytesAge to determine whether there are enough old keys to justify a GC run (i.e. a full scan of the range). Unfortunately, GCBytesAge is only loosely related to actually being able to GC something: depending on how the old versions are distributed, GCBytesAge can be relatively high even though nothing can be deleted yet. The math involved in working out when we're guaranteed to be able to delete a significant fraction of the data implies that we won't be deleting values eagerly when they go out of scope. See

// makeGCQueueScoreImpl is used to compute when to trigger the GC Queue.
for details, but basically the bottom line is that setting some TTL doesn't tell you that things will be deleted soon thereafter, just that they'll stick around until then.
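
To make that concrete, here's a minimal sketch of the shape of that score (simplified, with illustrative names; makeGCQueueScoreImpl is the real thing):

package gc

import "time"

// gcValueScore approximates the GC queue's "value score": the accumulated age
// of non-live data (GCBytesAge, in byte-seconds), normalized by the age that
// the same amount of garbage would accumulate over one TTL. A score of 1 means
// that, on average, the dead bytes have been dead for a full TTL; it does not
// guarantee that any particular version is already collectible.
func gcValueScore(gcBytesAge, gcBytes int64, ttl time.Duration) float64 {
	if gcBytes == 0 {
		return 0
	}
	return float64(gcBytesAge) / (ttl.Seconds() * float64(gcBytes))
}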

On the plus side, the screenshot above indicates that the range is very close to being GC'ed (it needs a score of 2).

I agree that none of this seems great, especially since it affects the system config range.

As a quick fix, I wonder if we can lower the TTL, which should linearly affect the score (i.e. halve the TTL, double the score) and thus the frequency at which GC runs.

@tbg
Member

tbg commented Jun 18, 2019

PS: even if we fixed the GC queue to be super aggressive, it seems that this range would still be too big. You say there are "gigabytes" for three days, so there will still be vastly more than 64MB for one day, i.e. even with optimal GC the range seems to end up in the 1GB range because data is written so quickly.

@andreimatei
Contributor Author

I now understand how the thing works; thanks for the pointers.

In my opinion, it's not a good thing that we only trigger GC based on a score normalized to the range's live data size. For example, imagine you've got two ranges: one configured to be 1GB in size, the other 64MB. Now imagine the 1GB range has 500MB of collectible garbage. We'd like to collect it even though it's relatively smaller than the live data, wouldn't we?

It's not clear to me how to improve the algorithm, though. How about if we maintained more granular statistics than a single GCBytesAge? For example, we could maintain stats for every hour - how many bytes died during that hour.

@andreimatei andreimatei added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Jun 18, 2019
@andreimatei
Contributor Author

Also a question - given that gcKeyScoreThreshold is 2, do I understand correctly that when you drop a table it'll actually take twice the TTL until that range is collected? I don't quite follow how DeadFraction factors into the computation.

@tbg
Member

tbg commented Jun 19, 2019

When you drop a table, hopefully it gets collected via ClearRange, which sidesteps GC completely. If it doesn't (perhaps there's some interleaving going on), then yes, it'll take roughly 2x the TTL. Say you have 1GB of data, all written at t=0, and it gets dropped at t=24h (with a 24h TTL). 24h after the drop, the dead data has accumulated a GCBytesAge of 24 GB*h, so Age/(Junk*TTL) = 1. Another 24h later and you'll hit 2.
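
A quick back-of-the-envelope check of that timeline (a standalone snippet with the same numbers, just to illustrate the arithmetic; this is not the actual scoring code):

package main

import (
	"fmt"
	"time"
)

func main() {
	const gb = float64(1 << 30)
	ttl := 24 * time.Hour

	// 1GB dropped all at once: GCBytesAge then accrues at 1GB per hour.
	gcBytes := 1 * gb
	for _, sinceDrop := range []time.Duration{24 * time.Hour, 48 * time.Hour} {
		gcBytesAge := gcBytes * sinceDrop.Seconds() // byte-seconds of dead data
		score := gcBytesAge / (ttl.Seconds() * gcBytes)
		fmt.Printf("%v after the drop: score = %.0f\n", sinceDrop, score)
	}
	// Output:
	// 24h0m0s after the drop: score = 1
	// 48h0m0s after the drop: score = 2
}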

I think you know this, but just to make sure: the reason we don't set up GC so that it runs right at 1 (or 1.001) is that in the worst case of data distribution, we only get to delete a tiny fraction of the rows, but have to scan all of the data.

We'd like to collect it even though it's relatively smaller than the live data, wouldn't we?

The example is missing a comparison. Yeah, we'd want to collect this at some point, and we will. But we also have to scan 1GB to do it.

In any alternative strategy that you come up with, you have to handle the common case in which there is very little garbage. Say a 1GB range and 1MB of garbage. You'll have to collect that eventually; how do you determine when? You want to make absolutely sure you're not scanning 1GB for nothing (perhaps repeatedly), so you need some way to confirm that you're going to collect a significant fraction of that 1MB when you invest your CPU time. That's the main contribution of the current mechanism, and in turn we get the shortcoming that even though there may be something worth collecting, it might take ~an extra TTL until we do.

Agree that it's worth thinking about. At the moment, we are scanning each range once per day anyway (consistency checker), so there's an opportunity there to compute for sure what we can delete (and perhaps even do it right there).

@andreimatei
Contributor Author

When you drop a table, hopefully it gets collected via ClearRange which sidesteps GC completely

Aha, I forgot about ClearRange.

The example is missing a comparison. Yeah, we'd want to collect this at some point, and we will. But we also have to scan 1GB to do it.

The point is that we'd like to delete sooner, not later. So the example is a 1GB range with 500MB worth of garbage. As it stands, it will take 4 days until this garbage is collected (2 days until GCBytesAge = Junk*TTL and another two days to reach the factor-of-2 threshold). I understand the cost of scanning, but still, if you conducted a poll I'm sure everybody would say collect the sucker sooner.

What really bothers me is that it takes an arbitrary amount of time to collect some garbage. If I delete a tiny row and nothing else, it'll be collected much later. And I don't have any knobs for it. Besides the resource usage questions, sometimes I really want my data wiped because of contracts or what not - think PCI compliance saying that you need to delete credit card numbers within some time window.
And I'm also unhappy about the reverse :) - I don't have a knob to say that if I can only collect one byte out of a 1GB range, I never want to collect that.

Basically, I find it hard to say when the thing that we currently do works well and when it doesn't. The only guarantee the current mechanism gives us is that when we collect, we'll collect something. And it has the reasonable behavior that, if there's more garbage relative to live data, it'll collect sooner than otherwise. But it's just vague because the statistic - the average age of a dead byte - is so coarse. It doesn't provide us with an estimate of how much garbage we can collect.

So, thinking fundamentally about the problem: the amount of bytes that have died in a range is a monotonic function of time, and what we want is the prefix of it up to now minus the TTL (i.e. how much is collectible right now). And we need to summarize that function somehow.
One simple engineering thing to do is to keep a series of buckets representing time intervals, each maintaining the amount of bytes that died during that interval (say, 10 of them, each representing TTL/10). The left-most bucket is special, representing all the dead bytes older than the TTL. As time flows, we'd shift buckets towards the left and create new ones at the right.
With this data structure, we'd have an approximation of the garbage that's currently collectible, and we can make policies based on it.
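
Roughly, such a data structure could look like this (purely a sketch to illustrate the idea; none of these types or names exist in the codebase):

package gcstats

import "time"

// deadBytesHistogram is a hypothetical, coarse histogram of when bytes in a
// range became non-live. Bucket 0 holds everything older than the TTL (i.e.
// collectible now); the remaining buckets each cover an equal slice of the
// most recent TTL's worth of history.
type deadBytesHistogram struct {
	ttl     time.Duration
	buckets [10]int64 // bytes that became non-live in each interval
}

// recordDeath accounts for size bytes that became non-live `age` ago.
func (h *deadBytesHistogram) recordDeath(size int64, age time.Duration) {
	idx := 0 // older than the TTL: already collectible
	if age < h.ttl {
		width := h.ttl / time.Duration(len(h.buckets)-1)
		idx = len(h.buckets) - 1 - int(age/width) // newest data goes right
	}
	h.buckets[idx] += size
}

// collectibleBytes estimates how much garbage a GC run could remove right now.
func (h *deadBytesHistogram) collectibleBytes() int64 {
	return h.buckets[0]
}

// advance shifts the buckets to the left as time passes, folding intervals
// that have aged past the TTL into the "collectible" bucket and opening a
// fresh bucket on the right.
func (h *deadBytesHistogram) advance(intervals int) {
	for i := 0; i < intervals; i++ {
		h.buckets[0] += h.buckets[1]
		copy(h.buckets[1:], h.buckets[2:])
		h.buckets[len(h.buckets)-1] = 0
	}
}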

At the moment, we are scanning each range once per day anyway (consistency checker), there's an opportunity there to compute for sure what we can delete (and perhaps even doing it right there).

Yeah, I guess that would also shut me up, although it's not perfectly satisfying.

tbg added a commit to tbg/cockroach that referenced this issue Jun 26, 2019
This is a WIP because it doesn't actually implement anything it claims
yet. Instead, I wrote a big comment explaining how everything worked and
fit together, and left some TODOs in the code detailing what we'd want
to improve. Let's discuss and polish this on this PR first, and I'll
then amend it to actually implement the suggestions, updating/adding
tests in the process.

----

We noticed some unfortunate GC queue behavior in [38251] and I went back
to the code and found that there are several good improvements we can
make without completely switching up the way things work.

The point of contention was essentially that configuring a given TTL
does not imply that keys gc'able by that TTL will be GC'ed right when
they become GC'able. This is sort of intrinsic in the information we
have available when deciding when to GC, but we were leaving a factor
of two on the table. As a result, users should reasonably be able to
expect that data is deleted approximately after the TTL for large
deletions and approximately after twice the TTL for "spread out"
deletions (such as repeatedly updating the same key). Previously
these were 2x and 4x, respectively.

Additionally (though this did not matter for the present issue), the
impact of the fraction of dead bytes vs. live bytes on the score (and
by extension the scheduling of GC) wasn't very principled and has been
reworked; the goal is to delay GC until the expected amount of freed
bytes represents a certain fraction of the live data left in the range.
This is to avoid repeatedly scanning large amounts of live data only
to remove very small amounts of gc'able data.

[38251]: cockroachdb#38251

Release note: None
@lunevalex lunevalex moved this from Incoming to Scale in KV Jul 14, 2020
@BramGruneir
Member

@tbg, any update on this issue?

@tbg
Member

tbg commented Dec 8, 2020

No. There is some work-in-progress here but I am not working on it: #38432
Somewhat independently, @sumeerbhola has been exploring improving when disk space is actually reclaimed.

@lunevalex lunevalex moved this from Scale to Cold storage in KV Apr 23, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@tbg tbg unassigned andreimatei and tbg May 31, 2022
@daniel-crlabs
Contributor

I accidentally found this GH issue and noticed I have a ticket from a customer about exactly this behavior. They have a table with 360GB+ of data, and this table has over 2300 ranges; half of the data in all these ranges (as per the range json files) is considered to be garbage, yet it is not removed, which is causing the cluster to get dangerously close to running out of disk.

GH issue here for visibility, and hoping this somehow gets improved, as it appears (based on the explanations above) that this is expected behavior (it should probably be documented more clearly so customers are aware of it):

Cluster storage remains high #1882

@tbg
Member

tbg commented Nov 7, 2022

I think we can set this value here to 1:

mvccGCKeyScoreThreshold = 2

Since this issue has been open for a while, a small refresher on the problem: we want to avoid triggering GC before there's work to do, so that the queue doesn't spin.

So we need an invariant: if the GC score indicates a GC, carrying out that GC must reduce the score sufficiently to not trigger another GC. We compute such a score (valScore here). The invariant associated with that score is that GC will always bring it to <1. So if we queue only when it's >= 1, we have what we want (no tight loop of GC). But we actually only trigger when the score hits 2!

This was done out of an abundance of caution when I introduced this score, and in hindsight it has done more harm than good. The "worst case" for this score is a large range where everything was recently deleted (i.e. not gc'able), but there is one outlier KV pair that was deleted very, very long ago. This would drive the score up above the threshold of 1, and yet, when you GC, all you get to delete is that one pair. But, doesn't matter, the resulting score will be << 1. The hope was that a factor of 2 would give a better balance, where we delay GC until there's more work to do (GC has to scan the entire range, i.e. it's expensive).

However, at this point we have a GC cooldown period (per replica) of 1h, so tight loops aren't as much of an issue. The original reasoning for picking 2 over the optimal 1 just isn't very convincing at all. There is every indication that 1 makes a fine cut-off.

With a cut-off of 1, if at time t=0, a customer deletes the entire contents of a range, then once the TTL passes, it will become GC'able (and will be GC'ed subject to the queue coming around to take a look). This is what customers expect and we should give it to them.

Another interesting example to think through is a range in which there's only one row that gets overwritten once per second. At a 1h TTL and assuming GC as soon as the score hits 1.0, how much history will we keep?

Let's assume a GC run happened at t=0 and it's t=T now. The range will have 1h+T worth of writes, timestamped t=T, ..., t=0, t=-1, t=-2, ..., t=-3600, contributing GCBytesAge of 1, 2, ..., 3600+T. The size of the writes cancels out in the score GCBytesAge/(ttl*GCBytes), which comes out as roughly 0.5+T/2h. In other words, the next GC will happen around the T=1h mark again (when the score hits 1), but of course it will only get to remove half of the values (even though all are deleted). With the previous factor of two, we'd make the GC more worth our while, but of course we can't ever hope to remove all old versions since at most an hour's worth of them is GC'able.
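
For what it's worth, a throwaway simulation of that scenario (illustrative only, not the real scoring implementation) confirms the approximation:

package main

import "fmt"

func main() {
	const ttl = 3600      // 1h TTL, in seconds
	const writeSize = 100 // bytes per version; cancels out of the score

	// One write per second; the GC run at t=0 left one TTL's worth of history.
	// Compute the score at various times T after that GC run.
	for _, T := range []int{0, 1800, 3600, 7200} {
		var gcBytesAge, gcBytes float64
		for age := 1; age <= ttl+T; age++ { // one shadowed version per age
			gcBytesAge += writeSize * float64(age)
			gcBytes += writeSize
		}
		score := gcBytesAge / (ttl * gcBytes)
		fmt.Printf("T=%5ds: score = %.2f (approximation 0.5+T/2h = %.2f)\n",
			T, score, 0.5+float64(T)/7200)
	}
}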

Long story short, I think we can change the above factor from 2 to 1.

@erikgrinaker
Contributor

at this point we have a GC cooldown period (per replica) of 1h, so tight loops aren't as much of an issue

If you're referring to the cooldown I added a while back, that was only for the intent score, not the overall score:

// mvccGCQueueIntentCooldownDuration is the duration to wait between MVCC GC
// attempts of the same range when triggered solely by intents. This is to
// prevent continually spinning on intents that belong to active transactions,
// which can't be cleaned up.
mvccGCQueueIntentCooldownDuration = 2 * time.Hour

But generalizing it should be straightforward.
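
For illustration, a generalized cooldown check could look roughly like this (hypothetical sketch with made-up names and numbers, not actual cockroachdb code):

package gcqueue

import "time"

// mvccGCQueueCooldownDuration is a hypothetical cooldown between MVCC GC
// attempts of the same replica, regardless of what triggered the last attempt.
const mvccGCQueueCooldownDuration = 1 * time.Hour

// shouldSkipForCooldown returns true if the replica was GC'ed recently and the
// current score isn't high enough to justify breaking the cooldown.
func shouldSkipForCooldown(lastGC, now time.Time, score float64) bool {
	if now.Sub(lastGC) >= mvccGCQueueCooldownDuration {
		return false // cooldown has elapsed; fine to GC again
	}
	return score < 2 // within the cooldown, only let clearly urgent work through
}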

@aliher1911
Contributor

Long story short, I think we can change the above factor from 2 to 1.

Wouldn't that cause large deletions to always trigger two GC runs? If any data was deleted before the large deletion, GC will be triggered just before we reach the threshold, and then we'll have another go after the cooldown. Maybe having it at 1.1 would be better?

@erikgrinaker
Contributor

Something slightly larger than 1 makes sense. Although there is also a random fuzz factor of up to 5%, so 1.05 might be sufficient. Let's run a few experiments to verify.

@blathers-crl

blathers-crl bot commented Nov 8, 2022

cc @cockroachdb/replication
