storage: forward GC threshold when clearing range #31504

Open

wants to merge 1 commit into base: master

4 participants
Contributor

eriktrinh commented Oct 16, 2018

Currently the GC threshold for a range is not forwarded when keys are
deleted using `ClearRange`. This causes an issue where a read whose
transaction timestamp is set before the data is removed, but which executes
after the removal, returns empty results (this happens when reading with
`... AS OF SYSTEM TIME ...`).

This change is a rollback of the removal of the GC threshold
forwarding.

Related to #20696.

Release note: None


eriktrinh requested a review from benesch Oct 16, 2018

eriktrinh requested a review from cockroachdb/core-prs as a code owner Oct 16, 2018

Member

cockroach-teamcity commented Oct 16, 2018

This change is Reviewable


@benesch

Thanks, @eriktrinh! Looking good.

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/client_replica_test.go, line 1726 at r1 (raw file):

			t.Fatalf("replica for key %s not found", rKey)
		}
		if thresh := repl.GetGCThreshold().WallTime; thresh != ns {

The logical component is important here, I believe. This should compare the GC threshold against an exact hlc.Timestamp.
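For illustration only, the comparison against the full timestamp could look something like the sketch below; the expected value shown is a placeholder for whatever timestamp the test actually set earlier.

```go
// Sketch only: compare the complete hlc.Timestamp (wall time and logical
// component), not just the wall-clock nanoseconds.
expThreshold := hlc.Timestamp{WallTime: ns} // placeholder expected value
if thresh := repl.GetGCThreshold(); thresh != expThreshold {
	t.Fatalf("expected GC threshold %s, got %s", expThreshold, thresh)
}
```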


pkg/storage/client_replica_test.go, line 1751 at r1 (raw file):

	}
	verifyKeysWithPrefix(sm, []roachpb.Key{sm1, sm2, sm3})
	verifyKeysWithPrefix(lg, []roachpb.Key{lg1, lg2, lg3})

Add a check to verify that the initial GC threshold here is less than 123. I think this might suffice:

verifyGCThreshold(sm, hlc.Timestamp{})
verifyGCThreshold(lg, hlc.Timestamp{})

But I'm not positive about what the GC threshold is actually initialized to.
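Purely as a sketch, such a helper could take the shape below; `verifyGCThreshold` is the name from the suggestion above, while `replicaForKey` is a hypothetical stand-in for however the test already resolves a key to its replica.

```go
// Sketch only: assert that the range containing key currently has the
// expected GC threshold. replicaForKey is hypothetical, standing in for
// the test's existing replica lookup.
verifyGCThreshold := func(key roachpb.Key, exp hlc.Timestamp) {
	t.Helper()
	repl := replicaForKey(key)
	if thresh := repl.GetGCThreshold(); thresh != exp {
		t.Fatalf("GC threshold for %s: expected %s, got %s", key, exp, thresh)
	}
}
```

It would then be invoked as in the suggestion above, e.g. `verifyGCThreshold(sm, hlc.Timestamp{})`.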


pkg/storage/batcheval/cmd_clear_range.go, line 94 at r1 (raw file):
@eriktrinh pointed out that this code was present originally but removed in #20607. I think we should figure out why this was the case before merging this. Paging @bdarnell @tschottdorf @spencerkimball.

Ben had this to say in #20607:

> If the span is less than the entire range, we must not update the GCThreshold to the current time (and as a consequence, should we also fall back to MVCC-safe DeleteRange, or use ClearRange without any safeguards against later reads?)

I think, really, we should be checking whether GC threshold is less than now - GC_TTL. If so, it's safe to bump the whole GC threshold forward.

@bdarnell

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/batcheval/cmd_clear_range.go, line 94 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

> @eriktrinh pointed out that this code was present originally but removed in #20607. I think we should figure out why this was the case before merging this. Paging @bdarnell @tschottdorf @spencerkimball.
>
> Ben had this to say in #20607:
>
> > If the span is less than the entire range, we must not update the GCThreshold to the current time (and as a consequence, should we also fall back to MVCC-safe DeleteRange, or use ClearRange without any safeguards against later reads?)
>
> I think, really, we should be checking whether GC threshold is less than now - GC_TTL. If so, it's safe to bump the whole GC threshold forward.

If the GC threshold is less than now - GC_TTL, then it is safe to bump the whole GC threshold forward (provided the whole range has the same GC TTL), but it's also unnecessary (at least in theory). This is what we were relying on in #20607. ClearRange is (was?) only used at timestamps old enough to be GC'd, and we should be using max(now - GC_TTL, actual GCThreshold) as the minimum timestamp.

If we did that, GCThreshold would be irrelevant in most cases. It's just a fallback for edge cases like when the GC TTL is increased and the current effective threshold is uncertain.
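Expressed as a sketch (`now` and `gcTTL` are placeholders, not real field or function names), the combined bound would be:

```go
// Sketch only: the effective minimum timestamp is the later of now - GC_TTL
// and the persisted GCThreshold, so the threshold only matters when it is
// ahead of now - TTL (e.g. shortly after the TTL has been increased).
minTS := now.Add(-gcTTL.Nanoseconds(), 0) // now minus the zone's GC TTL
minTS.Forward(gcThreshold)                // max(now - GC_TTL, GCThreshold)
```

A read (or a decision to clear) below `minTS` would then be treated as outside the retention window.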


Member

bdarnell commented Oct 16, 2018

> I’m afraid I don’t follow. If the GC threshold on a table is 1h, but the GC queue is bogged down and doesn’t process any of the range on that table, it may be possible to execute an AOST query on that table two hours in the past, even though you only guaranteed one hour of past version storage.

It's possible, but we should probably disallow it. We only guaranteed one hour of storage, and we shouldn't expose the implementation detail that our GC is not instantaneous. The GCThreshold doesn't serve much purpose today (it might in the future if we have alternative GC policies such as something based on the size of deleted data or the number of versions, but even then I'm not sure a range-wide threshold would make sense).

> But cleaning those old versions up without bumping the GC threshold first is incorrect.

Yeah, as long as we have a GCThreshold, that's true. The big concern (and the reason this was removed in the first place) is that if the ClearRange only covers part of the range, setting the GCThreshold would also hide parts of the data that was not covered by the ClearRange. This is OK if and only if the ClearRange is performed at a timestamp that is behind the GC TTL for the entire range (normally the range will all have the same TTL, but this could race against config changes and the split queue). It's easier to think like a C compiler writer and rely on the fact that anything behind the GC TTL is undefined than to figure out what to do if we manage to hit a case where it's unsafe to set the GC threshold because the range spans multiple zone configs.
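That safety condition could be sketched like so (placeholder names; not actual CockroachDB code):

```go
// Sketch only: forwarding the range-wide GC threshold on a ClearRange is
// safe only if the timestamp it would be forwarded to is already behind
// the GC TTL for the entire range.
cutoff := now.Add(-gcTTL.Nanoseconds(), 0)  // now minus the range's GC TTL
safeToForward := !cutoff.Less(newThreshold) // i.e. newThreshold <= now - TTL
```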


Member

benesch commented Oct 16, 2018

> > I’m afraid I don’t follow. If the GC threshold on a table is 1h, but the GC queue is bogged down and doesn’t process any of the range on that table, it may be possible to execute an AOST query on that table two hours in the past, even though you only guaranteed one hour of past version storage.
>
> It's possible, but we should probably disallow it. We only guaranteed one hour of storage, and we shouldn't expose the implementation detail that our GC is not instantaneous. The GCThreshold doesn't serve much purpose today (it might in the future if we have alternative GC policies such as something based on the size of deleted data or the number of versions, but even then I'm not sure a range-wide threshold would make sense).

Ok, so should we change Replica.requestCanProceed to check both? And then I guess we don't need to forward the GC threshold in ClearRange?

> > But cleaning those old versions up without bumping the GC threshold first is incorrect.
>
> Yeah, as long as we have a GCThreshold, that's true. The big concern (and the reason this was removed in the first place) is that if the ClearRange only covers part of the range, setting the GCThreshold would also hide parts of the data that was not covered by the ClearRange. This is OK if and only if the ClearRange is performed at a timestamp that is behind the GC TTL for the entire range (normally the range will all have the same TTL, but this could race against config changes and the split queue). It's easier to think like a C compiler writer and rely on the fact that anything behind the GC TTL is undefined than to figure out what to do if we manage to hit a case where it's unsafe to set the GC threshold because the range spans multiple zone configs.

Incorrectly forwarding the GC threshold means that some transactions will be forced to retry unnecessarily. And we break our promise of not GC'ing data before the TTL expires. But it wouldn't be a consistency violation. OTOH, our current approach of not bumping the threshold is a consistency violation. I think a consistency violation is worse than data getting GC'd too soon—it sounds like you have the opposite opinion? Especially considering that the circumstances in which we'd forward the GC threshold on a too-large range would be quite rare in practice. You'd need the split queue to get wedged for the entirety of your GC window.

Figuring out which is worse is immaterial, I suppose, since we can come up with a solution that neither violates consistency nor GCs data too soon.


Member

benesch commented Oct 16, 2018

> Ok, so should we change Replica.requestCanProceed to check both? And then I guess we don't need to forward the GC threshold in ClearRange?

@eriktrinh points out this would have the same race, because the check would only consider the replica's current zone config. If things got delayed in gossip or the split queue, you'd use the wrong TTL.


Member

bdarnell commented Oct 17, 2018

> Ok, so should we change Replica.requestCanProceed to check both? And then I guess we don't need to forward the GC threshold in ClearRange?

The tricky thing about this is what time do we use for "now"? I haven't thought through all of this yet. But if we check both times I think we may be able to get rid of the GC threshold completely.

The one place where GCThreshold matters is when the zone config is changed to increase the TTL. For this, maybe we need to store a timestamp at which the new TTL takes effect.


Contributor

eriktrinh commented Oct 17, 2018

> The one place where GCThreshold matters is when the zone config is changed to increase the TTL. For this, maybe we need to store a timestamp at which the new TTL takes effect.

We could maybe forward the GC threshold (or some other threshold timestamp that we check against) to now - previous TTL when the zone config changes, and forward it to now - current TTL before we do the check in requestCanProceed.
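A minimal sketch of that combination, using hlc.Timestamp for illustration (all function and parameter names here are hypothetical, not existing CockroachDB code):

```go
// Sketch only: maintain a lower bound on readable timestamps.

// When the TTL is increased, data older than now - previousTTL may already
// have been GC'd under the old policy, so pin the bound there.
func onTTLIncrease(bound *hlc.Timestamp, now hlc.Timestamp, prevTTL time.Duration) {
	bound.Forward(now.Add(-prevTTL.Nanoseconds(), 0))
}

// At request time, additionally apply the current TTL before the check in
// requestCanProceed; the read must land strictly above the resulting floor.
func canReadAt(bound, now, readTS hlc.Timestamp, curTTL time.Duration) bool {
	bound.Forward(now.Add(-curTTL.Nanoseconds(), 0)) // forwards a local copy only
	return bound.Less(readTS)
}
```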
