
fix: prevent concurrent snapshot restores from corrupting packed-refs #188

Merged
worstell merged 1 commit into main from fix-concurrent-snapshot-restore
Mar 13, 2026
Conversation

@worstell
Contributor

Problem

ensureCloneReady (used by the snapshot HTTP endpoint) calls startClone synchronously on the request goroutine. When multiple concurrent snapshot requests arrive for the same repo, each goroutine enters startClone and extracts the mirror snapshot tarball into the same directory simultaneously. This corrupts packed-refs — one goroutine's tar truncates the file while another is running git fetch on it:

fatal: unterminated line in ./packed-refs: f334c42ff... refs/pull/90892/head

Once packed-refs is corrupt, git fetch, git for-each-ref, and git clone (for workstation snapshot generation) all fail, causing empty or error responses to clients.

Fix

  • Add TryStartCloning() to Repository — atomically transitions StateEmpty → StateCloning, returning true only for the winning goroutine
  • Gate startClone with TryStartCloning() so only one goroutine performs the restore/clone; others return immediately and wait in ensureCloneReady's poll loop
  • Update MarkRestored and Clone to accept StateCloning (since TryStartCloning already transitioned), preserving the fallback path from failed restore to fresh clone
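The gate-and-poll pattern described above can be sketched as follows. This is an illustrative sketch only: the `State` values, the `Repository` layout, and the poll interval are assumptions for demonstration, not the repository's real API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Hypothetical repository states mirroring the PR description.
type State int32

const (
	StateEmpty State = iota
	StateCloning
	StateReady
)

type Repository struct {
	state int32 // accessed atomically
}

// TryStartCloning atomically transitions StateEmpty -> StateCloning.
// Exactly one concurrent caller observes true; all others see false.
func (r *Repository) TryStartCloning() bool {
	return atomic.CompareAndSwapInt32(&r.state, int32(StateEmpty), int32(StateCloning))
}

func (r *Repository) State() State { return State(atomic.LoadInt32(&r.state)) }

// ensureCloneReady gates startClone: the winning goroutine performs the
// restore/clone; the losers return immediately and wait in a poll loop.
func (r *Repository) ensureCloneReady(startClone func()) {
	if r.TryStartCloning() {
		startClone()
		atomic.StoreInt32(&r.state, int32(StateReady))
		return
	}
	for r.State() != StateReady {
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	repo := &Repository{}
	var restores int64
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ { // eight concurrent snapshot requests
		wg.Add(1)
		go func() {
			defer wg.Done()
			repo.ensureCloneReady(func() {
				// Tarball extraction runs here; the gate guarantees it
				// happens exactly once, so packed-refs is never truncated
				// under a concurrent git fetch.
				atomic.AddInt64(&restores, 1)
				time.Sleep(20 * time.Millisecond)
			})
		}()
	}
	wg.Wait()
	fmt.Println("restores:", restores)
}
```

With the gate in place, all eight goroutines return from `ensureCloneReady` with the repo ready, but the restore body runs once.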

ensureCloneReady allowed multiple concurrent goroutines to enter
startClone simultaneously. Each goroutine extracted the mirror snapshot
tarball into the same directory, corrupting packed-refs when one
goroutine's tar overwrote the file while another was running git fetch.

Add TryStartCloning() to atomically gate the StateEmpty→StateCloning
transition so only the winning goroutine performs the restore or clone.
Update MarkRestored and Clone to accept StateCloning (since the caller
already transitioned) so the fallback from failed restore to fresh
clone still works.

Amp-Thread-ID: https://ampcode.com/threads/T-019ce597-b4ff-72ad-b096-d2851a7058ff
Co-authored-by: Amp <amp@ampcode.com>
@worstell worstell requested a review from a team as a code owner March 13, 2026 05:28
@worstell worstell requested review from inez and removed request for a team March 13, 2026 05:28
@worstell worstell merged commit 71e91a0 into main Mar 13, 2026
7 checks passed
@worstell worstell deleted the fix-concurrent-snapshot-restore branch March 13, 2026 05:43
worstell added a commit that referenced this pull request Mar 16, 2026
## Problem

When a corrupt or empty mirror snapshot exists in S3, pods enter a
poison cycle:

1. Pod restores corrupt snapshot → 80KB empty mirror (zero refs, no pack
files)
2. Post-restore `git fetch` fails — `lowSpeedLimit` (1KB/s for 60s)
trips during server-side pack computation for the large delta
3. Code logs a warning but proceeds: schedules snapshot jobs that
immediately re-upload the empty mirror to S3
4. Next pod restart (or S3 cleanup) restores the same empty snapshot →
repeat

This cycle survived the fixes in #188 and #189 because those PRs
prevented the *creation* of corruption (concurrent restores and
concurrent fetch-during-tar) but not the *propagation* of
already-corrupt snapshots.

## Fix

Three changes break the cycle:

1. **`FetchLenient`**: Post-restore and startup fetches omit the
`lowSpeedLimit` check, matching `executeClone`'s behavior. Large deltas
after snapshot restore trigger GitHub's server-side pack computation
which stalls at near-zero transfer rate for minutes, tripping the 1KB/s
threshold.

2. **`ResetToEmpty` + fallback to clone**: When the post-restore fetch
fails, the corrupt mirror directory is removed and the repo state is
reset to Empty. The code then falls through to a fresh `git clone
--mirror` instead of serving and re-uploading stale data.

3. **Skip snapshot scheduling on failed fetch**: Snapshot and repack
jobs are only scheduled after a successful fetch, both in the startup
path (`DiscoverExisting`) and the post-restore path (`startClone`).

## Testing

All existing tests pass. The fix was validated against staging logs
showing the poison cycle in action.

Co-authored-by: Amp <amp@ampcode.com>
