Skip to content

fix: prevent git fetch from corrupting mirror snapshots#189

Merged
worstell merged 1 commit intomainfrom
fix-mirror-snapshot-fetch-race
Mar 13, 2026
Merged

fix: prevent git fetch from corrupting mirror snapshots#189
worstell merged 1 commit intomainfrom
fix-mirror-snapshot-fetch-race

Conversation

@worstell
Copy link
Copy Markdown
Contributor

Problem

generateAndUploadMirrorSnapshot tars the live bare mirror directory using WithReadLock, but git fetch doesn't hold a write lock during the actual fetch — it only briefly locks to update lastFetch. Since fetch runs on a separate scheduler queue (upstream+"/fetch"), it can execute concurrently with the tar.

When git fetch replaces packed-refs (via temp file + rename) while tar is reading it, the archive captures a truncated file. Restoring this snapshot produces a mirror with an unterminated packed-refs line:

fatal: unterminated line in ./packed-refs: f334c42ff... refs/pull/90892/head

This breaks all subsequent git operations (fetch, for-each-ref, clone) on the restored mirror.

Fix

Add WithFetchExclusion() to Repository — acquires the fetch semaphore to prevent concurrent git fetch while the caller runs. Used in generateAndUploadMirrorSnapshot to ensure tar captures a consistent view of the mirror directory.

Workstation snapshots (generateAndUploadSnapshot) are not affected because they git clone the mirror into a separate temp directory before tar-ing.

@worstell worstell requested a review from a team as a code owner March 13, 2026 05:32
@worstell worstell requested review from js-murph and removed request for a team March 13, 2026 05:32
@worstell worstell enabled auto-merge (squash) March 13, 2026 05:43
@worstell worstell disabled auto-merge March 13, 2026 05:43
generateAndUploadMirrorSnapshot tars the live bare mirror directory
while git fetch can run concurrently on a separate scheduler queue.
When git fetch replaces packed-refs (via temp file + rename) mid-tar,
the archive captures a truncated file. Restoring this snapshot produces
a mirror with an unterminated packed-refs line, breaking all subsequent
git operations.

Add WithFetchExclusion() which holds the repo's fetch semaphore,
preventing concurrent fetches while the tar runs. This ensures the
snapshot captures a consistent view of the mirror directory.

Amp-Thread-ID: https://ampcode.com/threads/T-019ce597-b4ff-72ad-b096-d2851a7058ff
Co-authored-by: Amp <amp@ampcode.com>
@worstell worstell force-pushed the fix-mirror-snapshot-fetch-race branch from ab9d98a to b1131ce Compare March 13, 2026 05:50
@worstell worstell merged commit 6ecee8d into main Mar 13, 2026
6 checks passed
@worstell worstell deleted the fix-mirror-snapshot-fetch-race branch March 13, 2026 05:52
worstell added a commit that referenced this pull request Mar 16, 2026
## Problem

When a corrupt or empty mirror snapshot exists in S3, pods enter a
poison cycle:

1. Pod restores corrupt snapshot → 80KB empty mirror (zero refs, no pack
files)
2. Post-restore `git fetch` fails — `lowSpeedLimit` (1KB/s for 60s)
trips during server-side pack computation for the large delta
3. Code logs a warning but proceeds: schedules snapshot jobs that
immediately re-upload the empty mirror to S3
4. Next pod restart (or S3 cleanup) restores the same empty snapshot →
repeat

This cycle survived the fixes in #188 and #189 because those PRs
prevented the *creation* of corruption (concurrent restores and
concurrent fetch-during-tar) but not the *propagation* of
already-corrupt snapshots.

## Fix

Three changes break the cycle:

1. **`FetchLenient`**: Post-restore and startup fetches omit the
`lowSpeedLimit` check, matching `executeClone`'s behavior. Large deltas
after snapshot restore trigger GitHub's server-side pack computation
which stalls at near-zero transfer rate for minutes, tripping the 1KB/s
threshold.

2. **`ResetToEmpty` + fallback to clone**: When the post-restore fetch
fails, the corrupt mirror directory is removed and the repo state is
reset to Empty. The code then falls through to a fresh `git clone
--mirror` instead of serving and re-uploading stale data.

3. **Skip snapshot scheduling on failed fetch**: Snapshot and repack
jobs are only scheduled after a successful fetch, both in the startup
path (`DiscoverExisting`) and the post-restore path (`startClone`).

## Testing

All existing tests pass. The fix was validated against staging logs
showing the poison cycle in action.

Co-authored-by: Amp <amp@ampcode.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants