Fix 166091 gate range tombstone gc #168275

Open
thecitymouse wants to merge 2 commits into cockroachdb:master from
thecitymouse:fix-166091-gate-range-tombstone-gc

@thecitymouse thecitymouse commented Apr 13, 2026

Fixes #166091.

gc.Run does point-key GC then range-tombstone GC. If point-key GC
hits a clear-range error (concurrent write above threshold lands after
the planner snapshot), the error gets swallowed at gc.go:978 and
ClearRangeSpanFailures bumps. Range-tombstone GC runs anyway, assumes
point keys got cleaned up, and blows up with the "hiding key" assertion.

Fix: check ClearRangeSpanFailures > 0 before running range-tombstone GC.
Skip it for this cycle, let the next pass retry with a fresh snapshot.
Txn record and abort span GC still run.
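
A minimal sketch of the gate described above, not the actual patch. `Info` here carries only the one counter the fix reads, and `runGC`, `phase`, `errAboveThreshold`, and the `RangeTombstonesSkipped` field are hypothetical names used for illustration; the real `gc.Run` has a different signature and more phases.

```go
package main

import "errors"

// Info mirrors only the counter this PR gates on. RangeTombstonesSkipped is a
// hypothetical field added here so the skip is observable in the sketch.
type Info struct {
	ClearRangeSpanFailures int
	RangeTombstonesSkipped bool
}

// phase is a hypothetical stand-in for one GC phase or one clear-range request.
type phase func() error

// errAboveThreshold is an illustrative error for a failed clear-range request.
var errAboveThreshold = errors.New("clear range: key above GC threshold")

// runGC is a simplified model of the phase ordering described in this PR:
// point-key GC, then range-tombstone GC, then txn record and abort span GC.
func runGC(clearRanges []phase, rangeTombstoneGC, txnAndAbortSpanGC phase) (Info, error) {
	var info Info

	// Point-key GC: a clear-range error is counted rather than returned,
	// which is the "swallowed" behavior the description refers to.
	for _, clearRange := range clearRanges {
		if err := clearRange(); err != nil {
			info.ClearRangeSpanFailures++
		}
	}

	// The gate: range-tombstone GC assumes point-key GC fully succeeded,
	// so skip it for this cycle if any clear-range request failed.
	if info.ClearRangeSpanFailures > 0 {
		info.RangeTombstonesSkipped = true
	} else if err := rangeTombstoneGC(); err != nil {
		return info, err
	}

	// Txn record and abort span GC do not depend on that invariant and still run.
	if err := txnAndAbortSpanGC(); err != nil {
		return info, err
	}
	return info, nil
}
```

In this model a single failed clear-range request leaves `ClearRangeSpanFailures` at 1, range-tombstone GC is never invoked, and the final phase still runs; the next cycle starts from a fresh snapshot with the counter reset.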

This doesn't fix the underlying snapshot/live-engine divergence in the
clear-range path; it only prevents the cascade. Happy to take a different
direction if the team prefers a deeper fix here.

Testing

  • Unit test reproducing the above-threshold error from the
    snapshot/live-engine race.
  • TestKVNemesisMVCCGCRepro from #166091 (gc: potential logical bug in
    MVCC GC): 3/23 hiding-key failures on master, 0/40 with the fix
    (gate fired once, race triggered, no hiding-key error).
    Note: the unit test covers only the clear-range failure path. An
    end-to-end test through gc.Run asserting the skip would be stronger;
    happy to add one if reviewers want it.
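
The race the unit test targets can be illustrated with a toy model. This is a sketch under stated assumptions, not the test in this PR: `engine`, `version`, `planClearRange`, and `clearRange` are hypothetical names, and the rule that a clear-range request fails when the live span holds a version above the threshold is assumed from the description above rather than taken from the real MVCCGarbageCollectPointsWithClearRange code.

```go
package main

import "fmt"

// version is one MVCC version of a key in this toy model.
type version struct {
	key string
	ts  int64
}

// engine is a hypothetical in-memory stand-in for the storage engine.
type engine struct{ versions []version }

// snapshot returns a point-in-time copy, standing in for the planner's snapshot.
func (e *engine) snapshot() *engine {
	return &engine{versions: append([]version(nil), e.versions...)}
}

// inSpan reports whether key lies in the half-open span [start, end).
func inSpan(key, start, end string) bool { return key >= start && key < end }

// planClearRange reports whether every version the snapshot holds in
// [start, end) is at or below the threshold, so the span looks collectable.
func planClearRange(snap *engine, start, end string, threshold int64) bool {
	for _, v := range snap.versions {
		if inSpan(v.key, start, end) && v.ts > threshold {
			return false
		}
	}
	return true
}

// clearRange re-checks the live engine and refuses to clear the span if it
// now holds a version above the threshold; otherwise it removes the span.
func clearRange(live *engine, start, end string, threshold int64) error {
	for _, v := range live.versions {
		if inSpan(v.key, start, end) && v.ts > threshold {
			return fmt.Errorf("clear range [%s, %s): %s@%d is above threshold %d",
				start, end, v.key, v.ts, threshold)
		}
	}
	kept := live.versions[:0]
	for _, v := range live.versions {
		if !inSpan(v.key, start, end) {
			kept = append(kept, v)
		}
	}
	live.versions = kept
	return nil
}
```

Planning against the snapshot, appending a version above the threshold to the live engine, and then calling `clearRange` reproduces the divergence: the plan says the span is collectable, the live check returns an error, and the span is left untouched.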

… failures

processReplicatedRangeTombstones assumes point-key GC fully succeeded.
When a clear-range request fails (error swallowed at gc.go:978), that
invariant is violated and the range tombstone phase trips the 'hiding
key' assertion.

Skip range tombstone GC when info.ClearRangeSpanFailures > 0. Next GC
cycle retries with a fresh snapshot.

Fixes cockroachdb#166091.

Release note (bug fix): Fixed an assertion failure in the MVCC GC
queue under concurrent writes.

Signed-off-by: Kevin Leung <leungke@oregonstate.edu>
Reproduces the snapshot-vs-live-engine race from cockroachdb#166091 where a
concurrent above-threshold write causes MVCCGarbageCollectPointsWithClearRange
to fail.

Signed-off-by: Kevin Leung <leungke@oregonstate.edu>
@thecitymouse thecitymouse requested a review from a team as a code owner April 13, 2026 18:21

trunk-io Bot commented Apr 13, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

blathers-crl Bot commented Apr 13, 2026

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try to break logical changes into separate commits, but commits such as "fix typo" or "address review comments" should be squashed into one commit and pushed with --force.
  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl Bot added the O-community Originated from the community label Apr 13, 2026

cockroachlabs-cla-agent Bot commented Apr 13, 2026

CLA assistant check
All committers have signed the CLA.

@cockroach-teamcity

This change is Reviewable

Labels

O-community Originated from the community

Development

Successfully merging this pull request may close these issues.

gc: potential logical bug in MVCC GC

2 participants