storage: forward GC threshold when clearing range #31504
Conversation
eriktrinh requested a review from benesch Oct 16, 2018
eriktrinh requested a review from cockroachdb/core-prs as a code owner Oct 16, 2018
benesch reviewed Oct 16, 2018
Thanks, @eriktrinh! Looking good.
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/client_replica_test.go, line 1726 at r1 (raw file):
t.Fatalf("replica for key %s not found", rKey) } if thresh := repl.GetGCThreshold().WallTime; thresh != ns {
The logical component is important here, I believe. This should compare the GC threshold against an exact hlc.Timestamp.
pkg/storage/client_replica_test.go, line 1751 at r1 (raw file):
}
verifyKeysWithPrefix(sm, []roachpb.Key{sm1, sm2, sm3})
verifyKeysWithPrefix(lg, []roachpb.Key{lg1, lg2, lg3})
Add a check to verify that the initial GC threshold here is less than 123. I think this might suffice:
verifyGCThreshold(sm, hlc.Timestamp{})
verifyGCThreshold(lg, hlc.Timestamp{})
But I'm not positive about what the GC threshold is actually initialized to.
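A minimal sketch of the kind of helper this suggests, assuming `t *testing.T` and the `hlc` package are in scope and that the test has already looked up the `*storage.Replica` for each key prefix (the lookup is omitted; `verifyGCThreshold` and the replica variables are placeholder names, and the zero-timestamp initial value is an assumption, per the caveat above):

```go
// Hypothetical helper: compare the replica's GC threshold against an exact
// hlc.Timestamp, including the logical component, not just the wall time.
verifyGCThreshold := func(repl *storage.Replica, exp hlc.Timestamp) {
	t.Helper()
	if thresh := repl.GetGCThreshold(); thresh != exp {
		t.Fatalf("expected GC threshold %s, got %s", exp, thresh)
	}
}

// Before the ClearRange, both ranges should still be at their initial
// threshold, assumed here to be the zero timestamp.
verifyGCThreshold(smRepl, hlc.Timestamp{})
verifyGCThreshold(lgRepl, hlc.Timestamp{})
```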
pkg/storage/batcheval/cmd_clear_range.go, line 94 at r1 (raw file):
@eriktrinh pointed out that this code was present originally but removed in #20607. I think we should figure out why this was the case before merging this. Paging @bdarnell @tschottdorf @spencerkimball.
Ben had this to say in #20607:
If the span is less than the entire range, we must not update the GCThreshold to the current time (and as a consequence, should we also fall back to MVCC-safe DeleteRange, or use ClearRange without any safeguards against later reads?)
I think, really, we should be checking whether GC threshold is less than now - GC_TTL. If so, it's safe to bump the whole GC threshold forward.
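A sketch of the guard this suggests, assuming the ClearRange evaluation can see the applicable GC TTL (`gcTTL` is a placeholder here, and this is not the actual cmd_clear_range.go code):

```go
// Only bump the range-wide GC threshold when the existing threshold already
// trails now - TTL, so that bumping it cannot hide versions that are still
// inside their GC window.
now := cArgs.EvalCtx.Clock().Now()
impliedThreshold := now.Add(-gcTTL.Nanoseconds(), 0)
if cArgs.EvalCtx.GetGCThreshold().Less(impliedThreshold) {
	// Per this proposal, it is safe to forward the threshold for the whole
	// range to the ClearRange timestamp.
}
```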
bdarnell reviewed Oct 16, 2018
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/batcheval/cmd_clear_range.go, line 94 at r1 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
@eriktrinh pointed out that this code was present originally but removed in #20607. I think we should figure out why this was the case before merging this. Paging @bdarnell @tschottdorf @spencerkimball.
Ben had this to say in #20607:
If the span is less than the entire range, we must not update the GCThreshold to the current time (and as a consequence, should we also fall back to MVCC-safe DeleteRange, or use ClearRange without any safeguards against later reads?)
I think, really, we should be checking whether GC threshold is less than now - GC_TTL. If so, it's safe to bump the whole GC threshold forward.
If the GC threshold is less than now - GC_TTL, then it is safe to bump the whole GC threshold forward (provided the whole range has the same GC TTL), but it's also unnecessary (at least in theory). This is what we were relying on in #20607. ClearRange is (was?) only used at timestamps old enough to be GC'd, and we should be using max(now - GC_TTL, actual GCThreshold) as the minimum timestamp.
If we did that, GCThreshold would be irrelevant in most cases. It's just a fallback for edge cases like when the GC TTL is increased and the current effective threshold is uncertain.
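A self-contained sketch of the "use max(now - GC_TTL, actual GCThreshold) as the minimum timestamp" idea; the package and function names are hypothetical and none of the surrounding plumbing is shown:

```go
package gcdemo

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// effectiveGCThreshold returns the later of the persisted GC threshold and
// the threshold implied by the TTL: reads must be above this timestamp even
// if the GC queue has not actually processed the range yet.
func effectiveGCThreshold(now, persisted hlc.Timestamp, ttl time.Duration) hlc.Timestamp {
	implied := now.Add(-ttl.Nanoseconds(), 0)
	if persisted.Less(implied) {
		return implied
	}
	return persisted
}
```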
benesch
Oct 16, 2018
Member
I'm afraid I don't follow. If the GC TTL on a table is 1h, but the GC queue is bogged down and doesn't process any of the ranges on that table, it may be possible to execute an AOST query on that table two hours in the past, even though you only guaranteed one hour of past version storage. Bumping the GC threshold to now() - 1h is what causes queries more than one hour in the past to start failing, giving you license to clean up those old versions of the data. But cleaning those old versions up without bumping the GC threshold first is incorrect. Am I crazy?
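For context, an illustrative sketch of the mechanism being argued about (a hypothetical helper, not the actual requestCanProceed code; assumes the hlc package and github.com/pkg/errors): a read is rejected once its timestamp is at or below the GC threshold, which is why forwarding the threshold is what makes too-old AOST reads start failing.

```go
// checkGCThreshold rejects reads at or below the GC threshold.
func checkGCThreshold(ts, threshold hlc.Timestamp) error {
	if !threshold.Less(ts) {
		return errors.Errorf(
			"read timestamp %s must be after the GC threshold %s", ts, threshold)
	}
	return nil
}
```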
bdarnell
Oct 16, 2018
Member
I'm afraid I don't follow. If the GC TTL on a table is 1h, but the GC queue is bogged down and doesn't process any of the ranges on that table, it may be possible to execute an AOST query on that table two hours in the past, even though you only guaranteed one hour of past version storage.
It's possible, but we should probably disallow it. We only guaranteed one hour of storage, and we shouldn't expose the implementation detail that our GC is not instantaneous. The GCThreshold doesn't serve much purpose today (it might in the future if we have alternative GC policies such as something based on the size of deleted data or the number of versions, but even then I'm not sure a range-wide threshold would make sense).
But cleaning those old versions up without bumping the GC threshold first is incorrect.
Yeah, as long as we have a GCThreshold, that's true. The big concern (and the reason this was removed in the first place) is that if the ClearRange only covers part of the range, setting the GCThreshold would also hide parts of the data that was not covered by the ClearRange. This is OK if and only if the ClearRange is performed at a timestamp that is behind the GC TTL for the entire range (normally the range will all have the same TTL, but this could race against config changes and the split queue). It's easier to think like a C compiler writer and rely on the fact that anything behind the GC TTL is undefined than to figure out what to do if we manage to hit a case where it's unsafe to set the GC threshold because the range spans multiple zone configs.
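A sketch of the safety condition described here, as a hypothetical helper (zone config lookup and all plumbing omitted; this is not existing code): forwarding a range-wide GCThreshold as part of a partial ClearRange is only safe if the ClearRange timestamp is behind now - TTL for every zone config overlapping the range.

```go
// safeToForwardThreshold reports whether the ClearRange timestamp is behind
// the GC window implied by every TTL that applies to some part of the range.
func safeToForwardThreshold(clearTS, now hlc.Timestamp, ttls []time.Duration) bool {
	for _, ttl := range ttls {
		if !clearTS.Less(now.Add(-ttl.Nanoseconds(), 0)) {
			return false
		}
	}
	return true
}
```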
benesch
Oct 16, 2018
Member
I'm afraid I don't follow. If the GC TTL on a table is 1h, but the GC queue is bogged down and doesn't process any of the ranges on that table, it may be possible to execute an AOST query on that table two hours in the past, even though you only guaranteed one hour of past version storage.
It's possible, but we should probably disallow it. We only guaranteed one hour of storage, and we shouldn't expose the implementation detail that our GC is not instantaneous. The GCThreshold doesn't serve much purpose today (it might in the future if we have alternative GC policies such as something based on the size of deleted data or the number of versions, but even then I'm not sure a range-wide threshold would make sense).
Ok, so should we change Replica.requestCanProceed to check both? And then I guess we don't need to forward the GC threshold in ClearRange?
But cleaning those old versions up without bumping the GC threshold first is incorrect.
Yeah, as long as we have a GCThreshold, that's true. The big concern (and the reason this was removed in the first place) is that if the ClearRange only covers part of the range, setting the GCThreshold would also hide parts of the data that was not covered by the ClearRange. This is OK if and only if the ClearRange is performed at a timestamp that is behind the GC TTL for the entire range (normally the range will all have the same TTL, but this could race against config changes and the split queue). It's easier to think like a C compiler writer and rely on the fact that anything behind the GC TTL is undefined than to figure out what to do if we manage to hit a case where it's unsafe to set the GC threshold because the range spans multiple zone configs.
Incorrectly forwarding the GC threshold means that some transactions will be forced to retry unnecessarily. And we break our promise of not GC'ing data before the TTL expires. But it wouldn't be a consistency violation. OTOH, our current approach of not bumping the threshold is a consistency violation. I think a consistency violation is worse than data getting GC'd too soon—it sounds like you have the opposite opinion? Especially considering that the circumstances in which we'd forward the GC threshold on a too-large range would be quite rare in practice. You'd need the split queue to get wedged for the entirety of your GC window.
Figuring out which is worse is immaterial, I suppose, since we can come up with a solution that neither violates consistency nor GCs data too soon.
benesch
Oct 16, 2018
Member
Ok, so should we change Replica.requestCanProceed to check both? And then I guess we don't need to forward the GC threshold in ClearRange?
@eriktrinh points out this would have the same race, because the check would only consider the replica's current zone config. If things got delayed in gossip or the split queue, you'd use the wrong TTL.
bdarnell
Oct 17, 2018
Member
Ok, so should we change Replica.requestCanProceed to check both? And then I guess we don't need to forward the GC threshold in ClearRange?
The tricky thing about this is what time do we use for "now"? I haven't thought through all of this yet. But if we check both times I think we may be able to get rid of the GC threshold completely.
The one place where GCThreshold matters is when the zone config is changed to increase the TTL. For this, maybe we need to store a timestamp at which the new TTL takes effect.
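A tiny sketch of the "store a timestamp at which the new TTL takes effect" idea, using hypothetical types (nothing like this exists in the code today):

```go
// gcPolicy pairs a TTL with the time it started applying, so that increasing
// the TTL via a zone config change does not retroactively promise history
// that may already have been garbage collected under the old, shorter TTL.
type gcPolicy struct {
	TTL         time.Duration
	EffectiveAt hlc.Timestamp
}
```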
eriktrinh
Oct 17, 2018
Contributor
The one place where GCThreshold matters is when the zone config is changed to increase the TTL. For this, maybe we need to store a timestamp at which the new TTL takes effect.
We could maybe forward the GC threshold (or some other threshold timestamp that we check against) to now - prev TTL when the zone config changes, and forward it to now - current TTL before we do the check in requestCanProceed.
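A sketch of that proposal with illustrative names (none of this is existing code): forward the stored threshold to now - previous TTL when the zone config changes, and forward it again to now - current TTL just before the check in requestCanProceed.

```go
// onTTLChange is called when the zone config's TTL changes; under the old
// policy, anything older than now - prevTTL may already have been GC'd.
func onTTLChange(threshold *hlc.Timestamp, now hlc.Timestamp, prevTTL time.Duration) {
	threshold.Forward(now.Add(-prevTTL.Nanoseconds(), 0))
}

// thresholdForCheck is the value a read timestamp would be compared against:
// the stored threshold, additionally forwarded by the current TTL.
func thresholdForCheck(threshold, now hlc.Timestamp, curTTL time.Duration) hlc.Timestamp {
	threshold.Forward(now.Add(-curTTL.Nanoseconds(), 0))
	return threshold
}
```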
eriktrinh commented Oct 16, 2018
Currently the GC threshold for a range is not forwarded when keys are deleted using ClearRange. This causes issues where reads with transaction timestamps set before the data is removed, but which occur after the removal, return empty results (this happens when reading with ... AS OF SYSTEM TIME ...).
This change is a rollback of the removal of the GC threshold forwarding.
Related to #20696.
Release note: None
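A minimal sketch of the behavior the rollback reinstates, assuming the change lives in the ClearRange evaluation (pkg/storage/batcheval/cmd_clear_range.go); the exact plumbing of the replicated eval result is omitted:

```go
// Forward the range's GC threshold to the ClearRange batch timestamp so that
// reads at or below that timestamp fail with a threshold error rather than
// silently seeing the cleared keyspace as empty.
gcThreshold := cArgs.EvalCtx.GetGCThreshold()
gcThreshold.Forward(cArgs.Header.Timestamp)
// The forwarded threshold is then carried in the command's replicated eval
// result so that every replica applies the same new threshold.
```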