fix: break corrupt mirror snapshot cycle #190

Merged
worstell merged 1 commit into main from fix-post-restore-fetch-corruption
Mar 16, 2026
@worstell
Contributor

Problem

When a corrupt or empty mirror snapshot exists in S3, pods enter a poison cycle:

  1. A pod restores the corrupt snapshot, producing an 80KB empty mirror (zero refs, no pack files).
  2. The post-restore git fetch fails: lowSpeedLimit (1KB/s for 60s) trips during server-side pack computation for the large delta.
  3. The code logs a warning but proceeds, scheduling snapshot jobs that immediately re-upload the empty mirror to S3.
  4. The next pod restart restores the same empty snapshot, even after a manual S3 cleanup, and the cycle repeats.

This cycle survived the fixes in #188 and #189 because those PRs prevented the creation of corruption (concurrent restores and concurrent fetch-during-tar) but not the propagation of already-corrupt snapshots.
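
The threshold in step 2 corresponds to git's `http.lowSpeedLimit` and `http.lowSpeedTime` settings. As a rough illustration, the sketch below shows how a fetch command line might be assembled with and without that check. The helper `fetchArgs`, the constant names, the use of `-c` flags, the `--prune origin` arguments, and the reading of 1KB as 1024 bytes are all assumptions for illustration, not the repository's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// Values from the PR description (1KB/s for 60s).
// Assumption: 1KB is taken as 1024 bytes; the constant names are hypothetical.
const (
	lowSpeedLimitBytes = 1024
	lowSpeedTimeSecs   = 60
)

// fetchArgs is a hypothetical helper that builds the arguments for a git fetch.
// When lenient is false it sets http.lowSpeedLimit and http.lowSpeedTime, so git
// aborts a transfer that stays below the limit for the given number of seconds.
// When lenient is true those settings are omitted and a stalled transfer is tolerated.
func fetchArgs(lenient bool) []string {
	var args []string
	if !lenient {
		args = append(args,
			"-c", fmt.Sprintf("http.lowSpeedLimit=%d", lowSpeedLimitBytes),
			"-c", fmt.Sprintf("http.lowSpeedTime=%d", lowSpeedTimeSecs),
		)
	}
	return append(args, "fetch", "--prune", "origin")
}

func main() {
	fmt.Println("strict:  git", strings.Join(fetchArgs(false), " "))
	fmt.Println("lenient: git", strings.Join(fetchArgs(true), " "))
}
```

While the server is still computing a large pack, the transfer rate can sit near zero for longer than the 60-second window, which is why the strict form aborts in that situation.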

Fix

Three changes break the cycle:

  1. FetchLenient: Post-restore and startup fetches omit the lowSpeedLimit check, matching executeClone's behavior. A large delta after a snapshot restore triggers GitHub's server-side pack computation, which stalls at a near-zero transfer rate for minutes and trips the 1KB/s threshold.

  2. ResetToEmpty + fallback to clone: When the post-restore fetch fails, the corrupt mirror directory is removed and the repo state is reset to Empty. The code then falls through to a fresh git clone --mirror instead of serving and re-uploading stale data.

  3. Skip snapshot scheduling on failed fetch: Snapshot and repack jobs are only scheduled after a successful fetch, both in the startup path (DiscoverExisting) and the post-restore path (startClone).

Testing

All existing tests pass. The fix was validated against staging logs showing the poison cycle in action.

@worstell worstell requested a review from a team as a code owner March 13, 2026 06:19
@worstell worstell requested review from jrobotham-square and removed request for a team March 13, 2026 06:19
@worstell worstell changed the title from "fix: break corrupt mirror snapshot poison cycle" to "fix: break corrupt mirror snapshot cycle" Mar 13, 2026
When a corrupt or empty mirror snapshot exists in S3, pods restore it,
the post-restore fetch fails, and then snapshot jobs immediately
re-upload the corrupt mirror — perpetuating the cycle even after manual
S3 cleanup.

Three changes break this cycle:

1. FetchLenient: post-restore and startup fetches now omit the
   lowSpeedLimit check (same as executeClone), since large deltas after
   snapshot restore trigger server-side pack computation that stalls at
   near-zero transfer for minutes, tripping the 1KB/s threshold.

2. ResetToEmpty + fallback to clone: when the post-restore fetch fails,
   the corrupt mirror is removed and the repo state is reset to Empty so
   the code falls through to a fresh git clone --mirror instead of
   serving and re-uploading stale data.

3. Skip snapshot scheduling on failed fetch: snapshot and repack jobs
   are only scheduled after a successful fetch, both in the startup
   path and the post-restore path.

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019ce5d1-9385-7026-8e69-78903ce99c47
@worstell worstell force-pushed the fix-post-restore-fetch-corruption branch from 8255718 to b113e07 March 13, 2026 06:27
@worstell worstell merged commit 1339f42 into main Mar 16, 2026
9 of 10 checks passed
@worstell worstell deleted the fix-post-restore-fetch-corruption branch March 16, 2026 20:09