Skip to content

Prevent pack file race condition during checkpoint sync#1024

Closed
pjbgf wants to merge 12 commits into
mainfrom
pjbgf/resync
Closed

Prevent pack file race condition during checkpoint sync#1024
pjbgf wants to merge 12 commits into
mainfrom
pjbgf/resync

Conversation

@pjbgf
Copy link
Copy Markdown
Member

@pjbgf pjbgf commented Apr 24, 2026

https://entire.io/gh/entireio/cli/trails/6148e08144d9

Summary

  • Disable auto-gc during all fetch operations (--no-auto-gc) to prevent a race condition where git gc repacks/deletes pack files while go-git is reading from them, causing ENOENT errors during checkpoint sync
  • Remove shallow checkpoint fetches (--depth=1) — shallow repos break commands that need checkpoint ancestry/history, and the bandwidth savings are now handled by filtered fetches instead
  • Auto-unshallow legacy shallow repos — when filtered_fetches is enabled and the repo is shallow (from a previous --depth=1 fetch), the next filtered fetch adds --unshallow to migrate away from shallow state
  • Gate tree-only fetch paths on filtered_fetchesFetchMetadataTreeOnly and FetchV2MainTreeOnly only make sense when filtered fetches are enabled (they skip blobs); without the setting, callers fall back to full fetches that include blob content

Context

After enabling filtered_fetches in the project settings, users hit this error during git push:

[entire] Warning: couldn't sync entire/checkpoints/v1: failed to rebase local
commits onto remote: failed to flatten tip tree: failed to get subtree
36/db97269a69: open .git/objects/pack/pack-.pack: no such file or directory

Each filtered fetch creates a promisor pack file. Accumulating promisor packs reaches the gc.autoPackLimit threshold, triggering gc --auto which repacks and deletes old pack files.
go-git snapshots the pack file list when the repository is opened, then fails when it tries to read from a pack that was deleted by gc between the snapshot and the read.


Note

Medium Risk
Touches checkpoint sync/fetch behavior and Git plumbing flags, so regressions could impact resume/explain flows or fetch performance across different repo states (shallow vs full). Changes are scoped and covered by an added unshallow migration test.

Overview
Prevents packfile races during checkpoint sync by adding --no-auto-gc to all checkpoint-related git fetch invocations (including blob-by-hash and v2 /full/* fetches), avoiding background repacks while go-git is reading.

Removes legacy shallow checkpoint fetches and, when filtered_fetches is enabled, automatically migrates existing shallow repos by detecting shallow state and adding --unshallow on subsequent filtered fetches.

Adjusts resume/explain metadata fetch fallbacks so the “tree-only” fast path is only used when filtered_fetches is enabled; otherwise callers fall back to full, blob-inclusive fetches. Adds a test ensuring filtered fetches unshallow a legacy shallow clone.

Reviewed by Cursor Bugbot for commit 8cb8c82. Configure here.

Copilot AI review requested due to automatic review settings April 24, 2026 14:07
@pjbgf pjbgf requested a review from a team as a code owner April 24, 2026 14:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses checkpoint sync failures caused by git gc --auto repacking/deleting pack files while go-git is reading them, by preventing auto-GC during fetches and by improving fetch behavior (removing shallow fetches, migrating legacy shallow repos, and gating tree-only fetch fast paths on filtered_fetches).

Changes:

  • Add --no-auto-gc to git fetch operations to prevent packfile races during checkpoint sync.
  • Remove shallow checkpoint fetch behavior and add an auto---unshallow migration for legacy shallow repos when filtered_fetches is enabled.
  • Only use “tree-only” fetch paths when filtered_fetches is enabled; otherwise fall back to full fetches.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
cmd/entire/cli/strategy/common.go Removes outdated comment about --depth=1 compatibility with ancestry checks.
cmd/entire/cli/resume.go Gates tree-only metadata fetch path on filtered_fetches; adjusts v2 tree-only fetch injection.
cmd/entire/cli/git_operations.go Removes shallow fetch option plumbing; updates metadata/v2 fetch helpers accordingly.
cmd/entire/cli/explain.go Uses tree-only vs full fetch depending on filtered_fetches setting.
cmd/entire/cli/checkpoint/v2_read.go Adds --no-auto-gc to v2 full-ref fetches.
cmd/entire/cli/checkpoint/remote/git.go Adds --no-auto-gc to fetches; adds shallow-repo detection and --unshallow migration logic.
cmd/entire/cli/checkpoint/remote/git_test.go Updates tests for new --no-auto-gc argument; adds a test for unshallowing legacy shallow repos.

Comment thread cmd/entire/cli/checkpoint/remote/git.go Outdated
Comment thread cmd/entire/cli/checkpoint/remote/git_test.go
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 8cb8c82. Configure here.

Comment thread cmd/entire/cli/checkpoint/remote/git_test.go
pjbgf and others added 11 commits May 27, 2026 10:04
Entire-Checkpoint: 8436c49de6b9
Entire-Checkpoint: 2aaa0ca9a3ae
Entire-Checkpoint: 5dec9ad3f03b
The previous gate (filtered_fetches AND shallow) left a class of users
exposed: anyone whose repo became shallow for a non-Entire reason (a
hand-run git clone --depth=N, an older CLI version that pre-dated the
filtered-fetches setting, an unrelated tool) still saw the reconcile
path walk parent chains past .git/shallow via go-git and propose to
cherry-pick hundreds of stale commits onto the remote.

Drop the gate. Any shallow checkpoint repo unshallows on the next
fetch, regardless of how the shallow state was introduced. Test now
covers both filtered_fetches on and off.

Entire-Checkpoint: 5aacb2ba139e
collectCommitChain and the cherry-pick path used go-git to walk
commit.ParentHashes. go-git does not consult .git/shallow, so on a
shallow checkpoint repo the walk strolled past the boundary into stale
pack objects — producing a phantom chain that the reconcile path then
proposed to cherry-pick onto the remote tip, duplicating commits that
already lived on the remote under their original SHAs.

Read the shallow set once at the top of each reconcile/rebase entry
point and pass it down. collectCommitChain stops at shallow commits;
treeChangesForCherryPick treats shallow-boundary commits as roots and
diffs them against an empty tree, so the cherry-pick contributes the
commit's full content rather than a delta against a stale parent.

This is independent of the auto-unshallow fix: a future shallow file
introduced by any other tool (manual git clone --depth, an unrelated
plumbing helper) would otherwise still trigger the same bug.

Entire-Checkpoint: 49c40909c8f5
…File

- fetchWorkingDir: wrap filepath.Abs error, document why os.Getwd
  fallback is intentional (caller passed empty Dir to mean "inherit
  cwd"; tests rely on this), nolint forbidigo with reason.
- loadShallowHashes: nolint gosec G304 — path is derived from git's
  own --git-common-dir output, not user input.

No behavior change.

Entire-Checkpoint: b5bbff32a63d
… is off

The previous gating in getMetadataTree / getV2MetadataTree skipped the
tree-only fetch entirely when filtered_fetches was disabled. With no
fetch step, the local ref lookup fell through to whatever the worktree
last saw, which could be stale (a collaborator pushed a newer
checkpoint, or our last fetch was before the producer published). The
"full fetch" fallback then never ran because the stale local lookup
"succeeded."

Replace the gate with a first-attempt fetch that is always issued:
tree-only when filtered_fetches is enabled (cheaper), full fetch
otherwise. For v2 the secondary fallback is set to nil when the
first-attempt is already the full fetch, to avoid a duplicate network
round-trip. Restores the pre-PR property that a fetch always precedes
the local lookup.

Regression covered by TestRunExplainExport_JSONFetchesRemoteV2WhenLocalV2RefSelectsDualMode.

Entire-Checkpoint: 60d8c085123a
…tor test setup

- Drop fetchWorkingDir/isShallowRepositoryInDir indirection in
  isShallowRepository: cmd.Dir = "" already inherits parent cwd, so we
  don't need to resolve it ourselves. Removes the os.Getwd nolint and
  the filepath.Abs wrap.
- Trim multi-paragraph doc on loadShallowHashes and the inline shallow
  rationale in treeChangesForCherryPick (cherryPickOnto's doc already
  explains the WHY).
- Resume.go: collapse paired firstFetch/firstFetchLabel assignments
  into one line each; trim over-explanatory "First-attempt fetch
  ensures..." block in getV2MetadataTree.
- Factor TestFetch_UnshallowsShallowRepository's setup into
  setupShallowClone + runIsolatedGit helpers so the table body is now
  the actual assertion rather than 70 lines of duplicated scaffolding.

Entire-Checkpoint: cd78c8c0f7bd
Two issues raised in review:

1. Generic Fetch auto-unshallowed any shallow repo, including
   FetchAndCheckoutRemoteBranch's user-branch fetches. A deliberately
   shallow user clone would silently convert to full history just by
   running resume/checkout — far outside the metadata-repair scope
   this PR intended.

2. FetchMetadataTreeOnly / FetchV2MainTreeOnly were no longer "tree-
   only" after --depth=1 was dropped. They now downloaded the full
   commit chain of the metadata ref (minus blobs when filtered fetches
   are enabled). On long-lived checkpoint branches every resume/explain
   paid the full-history cost.

Fix: replace the implicit auto-unshallow with two explicit flags on
FetchOptions:

  - Shallow (--depth=1): set on tip-only probes. Cheap, creates
    .git/shallow state. Safe now because the reconcile path's commit
    walk respects .git/shallow (earlier commit in this PR).
  - Unshallow (--unshallow when shallow): set on metadata-repair
    paths that need full ancestry (FetchMetadataBranch,
    FetchV2MainRef, fetchAndRebaseSessionsCommon, doctor v2
    disconnection check). Migrates legacy shallow state on demand.

Generic FetchAndCheckoutRemoteBranch sets neither — user branches
stay as the user configured them.

resume.go's getMetadataTree / getV2MetadataTree can revert to
always-call-tree-only since tree-only is genuinely cheap again;
drops the filtered_fetches-conditional firstFetch switcheroo and
its accompanying logging. Test renamed to TestFetch_Unshallow with
two cases (Unshallow=true deepens, Unshallow=false leaves alone)
plus TestFetch_Shallow asserting --depth=1 actually shallows a
clone.

Entire-Checkpoint: 9fa8691d4b8a
git's --unshallow removes ALL shallow boundaries in .git/shallow, not
just the one for the ref being fetched. On a shallow clone of a large
repo it pulls the entire repository history (we observed 687 MB on
entirehq/entire). Used from the push hook, this hangs every push on
shallow repos until the full clone arrives.

The downstream reconcile/rebase paths already respect .git/shallow
(collectCommitChain stops at shallow boundaries; collectCommitsSince
uses git rev-list which respects them natively). When the local view
is shallow-truncated, the merge-base check may report "disconnected"
where the actual chains share ancestry below the boundary — but the
cherry-pick path it routes through still produces a correct result,
because it picks up exactly the commits that aren't yet on the remote.

Remove Unshallow from:
  - fetchAndRebaseSessionsCommon (push hot path)
  - FetchMetadataBranch (resume/explain fallback — only needs blobs)
  - FetchV2MainRef (same)

Keep Unshallow on metadata_reconcile's doctor v2 disconnection check
since it's only reached from `entire doctor`, an explicit
user-invoked diagnostic where the cost is acceptable.

Verified by re-running the full unit suite (5304 tests) and the
fetch-flag regression coverage from TestFetch_Unshallow /
TestFetch_Shallow.

Entire-Checkpoint: 8184f5f2a970
@pjbgf pjbgf force-pushed the pjbgf/resync branch 2 times, most recently from 10f42e6 to a12d142 Compare May 27, 2026 12:10
Signed-off-by: Paulo Gomes <paulo@entire.io>
Entire-Checkpoint: 1fe71f299748
@Soph Soph closed this May 27, 2026
auto-merge was automatically disabled May 27, 2026 13:07

Pull request was closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants