Skip to content

Prevent pack file race condition during checkpoint sync#1276

Merged
pjbgf merged 12 commits into
mainfrom
pjbgf/resync
May 27, 2026
Merged

Prevent pack file race condition during checkpoint sync#1276
pjbgf merged 12 commits into
mainfrom
pjbgf/resync

Conversation

@pjbgf
Copy link
Copy Markdown
Member

@pjbgf pjbgf commented May 27, 2026

https://entire.io/gh/entireio/cli/trails/218

Summary

  • Disable auto-gc during all fetch operations (--no-auto-gc) to prevent a race condition where git gc repacks/deletes pack files while go-git is reading from them, causing ENOENT errors during checkpoint sync
  • Remove shallow checkpoint fetches (--depth=1) — shallow repos break commands that need checkpoint ancestry/history, and the bandwidth savings are now handled by filtered fetches instead
  • Auto-unshallow legacy shallow repos — when filtered_fetches is enabled and the repo is shallow (from a previous --depth=1 fetch), the next filtered fetch adds --unshallow to migrate away from shallow state
  • Gate tree-only fetch paths on filtered_fetchesFetchMetadataTreeOnly and FetchV2MainTreeOnly only make sense when filtered fetches are enabled (they skip blobs); without the setting, callers fall back to full fetches that include blob content

Context

After enabling filtered_fetches in the project settings, users hit this error during git push:

[entire] Warning: couldn't sync entire/checkpoints/v1: failed to rebase local
commits onto remote: failed to flatten tip tree: failed to get subtree
36/db97269a69: open .git/objects/pack/pack-.pack: no such file or directory

Each filtered fetch creates a promisor pack file. Accumulating promisor packs reaches the gc.autoPackLimit threshold, triggering gc --auto which repacks and deletes old pack files.
go-git snapshots the pack file list when the repository is opened, then fails when it tries to read from a pack that was deleted by gc between the snapshot and the read.

New version of #1024.


Note

Medium Risk
Changes core git fetch, ref promotion, and metadata reconciliation paths used during push, resume, and doctor; mistakes could leave stale checkpoint refs or over-fetch history, but behavior is covered by new integration tests.

Overview
Checkpoint sync now runs git fetch with --no-auto-gc so automatic GC cannot repack or delete packfiles while go-git still reads them (fixing intermittent ENOENT on .git/objects/pack during rebase/sync).

Fetch behavior is split explicitly: FetchOptions gains documented Shallow and Unshallow flags (with shallow detection via rev-parse --is-shallow-repository) instead of implicitly adding --unshallow on every full metadata fetch. FetchV2MainTreeOnly no longer uses --depth=1, because shallow boundaries made remote tips look unrelated and SafelyAdvanceLocalRef could leave stale local v2 refs; a regression test asserts the repo stays non-shallow. Generic push/rebase fetches also stop using --unshallow, with comments that unshallow is clone-global and can pull an entire monorepo; repair paths that need full ancestry pass Unshallow: true (e.g. doctor v2 temp fetch).

Reconciliation respects shallow history: commit walks and cherry-picks load .git/shallow and treat boundary commits as roots so replay does not walk into stale parents. V2 disconnected-ref detection and repair (IsV2MainDisconnected, ReconcileDisconnectedV2Ref) mirror v1 via ls-remote, temp fetch, and cherry-pick onto the remote tip.

Reviewed by Cursor Bugbot for commit 6af9e93. Configure here.

pjbgf and others added 12 commits May 27, 2026 13:37
Entire-Checkpoint: 8436c49de6b9
Entire-Checkpoint: 2aaa0ca9a3ae
Entire-Checkpoint: 5dec9ad3f03b
The previous gate (filtered_fetches AND shallow) left a class of users
exposed: anyone whose repo became shallow for a non-Entire reason (a
hand-run git clone --depth=N, an older CLI version that pre-dated the
filtered-fetches setting, an unrelated tool) still saw the reconcile
path walk parent chains past .git/shallow via go-git and propose to
cherry-pick hundreds of stale commits onto the remote.

Drop the gate. Any shallow checkpoint repo unshallows on the next
fetch, regardless of how the shallow state was introduced. Test now
covers both filtered_fetches on and off.

Entire-Checkpoint: 5aacb2ba139e
collectCommitChain and the cherry-pick path used go-git to walk
commit.ParentHashes. go-git does not consult .git/shallow, so on a
shallow checkpoint repo the walk strolled past the boundary into stale
pack objects — producing a phantom chain that the reconcile path then
proposed to cherry-pick onto the remote tip, duplicating commits that
already lived on the remote under their original SHAs.

Read the shallow set once at the top of each reconcile/rebase entry
point and pass it down. collectCommitChain stops at shallow commits;
treeChangesForCherryPick treats shallow-boundary commits as roots and
diffs them against an empty tree, so the cherry-pick contributes the
commit's full content rather than a delta against a stale parent.

This is independent of the auto-unshallow fix: a future shallow file
introduced by any other tool (manual git clone --depth, an unrelated
plumbing helper) would otherwise still trigger the same bug.

Entire-Checkpoint: 49c40909c8f5
…File

- fetchWorkingDir: wrap filepath.Abs error, document why os.Getwd
  fallback is intentional (caller passed empty Dir to mean "inherit
  cwd"; tests rely on this), nolint forbidigo with reason.
- loadShallowHashes: nolint gosec G304 — path is derived from git's
  own --git-common-dir output, not user input.

No behavior change.

Entire-Checkpoint: b5bbff32a63d
… is off

The previous gating in getMetadataTree / getV2MetadataTree skipped the
tree-only fetch entirely when filtered_fetches was disabled. With no
fetch step, the local ref lookup fell through to whatever the worktree
last saw, which could be stale (a collaborator pushed a newer
checkpoint, or our last fetch was before the producer published). The
"full fetch" fallback then never ran because the stale local lookup
"succeeded."

Replace the gate with a first-attempt fetch that is always issued:
tree-only when filtered_fetches is enabled (cheaper), full fetch
otherwise. For v2 the secondary fallback is set to nil when the
first-attempt is already the full fetch, to avoid a duplicate network
round-trip. Restores the pre-PR property that a fetch always precedes
the local lookup.

Regression covered by TestRunExplainExport_JSONFetchesRemoteV2WhenLocalV2RefSelectsDualMode.

Entire-Checkpoint: 60d8c085123a
…tor test setup

- Drop fetchWorkingDir/isShallowRepositoryInDir indirection in
  isShallowRepository: cmd.Dir = "" already inherits parent cwd, so we
  don't need to resolve it ourselves. Removes the os.Getwd nolint and
  the filepath.Abs wrap.
- Trim multi-paragraph doc on loadShallowHashes and the inline shallow
  rationale in treeChangesForCherryPick (cherryPickOnto's doc already
  explains the WHY).
- Resume.go: collapse paired firstFetch/firstFetchLabel assignments
  into one line each; trim over-explanatory "First-attempt fetch
  ensures..." block in getV2MetadataTree.
- Factor TestFetch_UnshallowsShallowRepository's setup into
  setupShallowClone + runIsolatedGit helpers so the table body is now
  the actual assertion rather than 70 lines of duplicated scaffolding.

Entire-Checkpoint: cd78c8c0f7bd
Two issues raised in review:

1. Generic Fetch auto-unshallowed any shallow repo, including
   FetchAndCheckoutRemoteBranch's user-branch fetches. A deliberately
   shallow user clone would silently convert to full history just by
   running resume/checkout — far outside the metadata-repair scope
   this PR intended.

2. FetchMetadataTreeOnly / FetchV2MainTreeOnly were no longer "tree-
   only" after --depth=1 was dropped. They now downloaded the full
   commit chain of the metadata ref (minus blobs when filtered fetches
   are enabled). On long-lived checkpoint branches every resume/explain
   paid the full-history cost.

Fix: replace the implicit auto-unshallow with two explicit flags on
FetchOptions:

  - Shallow (--depth=1): set on tip-only probes. Cheap, creates
    .git/shallow state. Safe now because the reconcile path's commit
    walk respects .git/shallow (earlier commit in this PR).
  - Unshallow (--unshallow when shallow): set on metadata-repair
    paths that need full ancestry (FetchMetadataBranch,
    FetchV2MainRef, fetchAndRebaseSessionsCommon, doctor v2
    disconnection check). Migrates legacy shallow state on demand.

Generic FetchAndCheckoutRemoteBranch sets neither — user branches
stay as the user configured them.

resume.go's getMetadataTree / getV2MetadataTree can revert to
always-call-tree-only since tree-only is genuinely cheap again;
drops the filtered_fetches-conditional firstFetch switcheroo and
its accompanying logging. Test renamed to TestFetch_Unshallow with
two cases (Unshallow=true deepens, Unshallow=false leaves alone)
plus TestFetch_Shallow asserting --depth=1 actually shallows a
clone.

Entire-Checkpoint: 9fa8691d4b8a
git's --unshallow removes ALL shallow boundaries in .git/shallow, not
just the one for the ref being fetched. On a shallow clone of a large
repo it pulls the entire repository history (we observed 687 MB on
entirehq/entire). Used from the push hook, this hangs every push on
shallow repos until the full clone arrives.

The downstream reconcile/rebase paths already respect .git/shallow
(collectCommitChain stops at shallow boundaries; collectCommitsSince
uses git rev-list which respects them natively). When the local view
is shallow-truncated, the merge-base check may report "disconnected"
where the actual chains share ancestry below the boundary — but the
cherry-pick path it routes through still produces a correct result,
because it picks up exactly the commits that aren't yet on the remote.

Remove Unshallow from:
  - fetchAndRebaseSessionsCommon (push hot path)
  - FetchMetadataBranch (resume/explain fallback — only needs blobs)
  - FetchV2MainRef (same)

Keep Unshallow on metadata_reconcile's doctor v2 disconnection check
since it's only reached from `entire doctor`, an explicit
user-invoked diagnostic where the cost is acceptable.

Verified by re-running the full unit suite (5304 tests) and the
fetch-flag regression coverage from TestFetch_Unshallow /
TestFetch_Shallow.

Entire-Checkpoint: 8184f5f2a970
Signed-off-by: Paulo Gomes <paulo@entire.io>
Entire-Checkpoint: 1fe71f299748
Copilot AI review requested due to automatic review settings May 27, 2026 13:17
@pjbgf pjbgf requested a review from a team as a code owner May 27, 2026 13:17
@pjbgf pjbgf enabled auto-merge May 27, 2026 13:18
@pjbgf pjbgf merged commit 20e56f8 into main May 27, 2026
10 of 11 checks passed
@pjbgf pjbgf deleted the pjbgf/resync branch May 27, 2026 13:20
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 6af9e93. Configure here.

// FetchMetadataBranch will undo that when full ancestry is later needed.
func FetchMetadataTreeOnly(ctx context.Context) error {
return fetchMetadataFromOrigin(ctx, true /* shallow */, false /* noFilter */)
return fetchMetadataFromOrigin(ctx, fetchMetadataOpts{Shallow: true})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FetchMetadataBranch no longer unshallows, breaking SafelyAdvanceLocalRef fallback

High Severity

The new comment on FetchMetadataTreeOnly claims "FetchMetadataBranch will undo that when full ancestry is later needed," but FetchMetadataBranch now explicitly does NOT --unshallow (Unshallow defaults to false). The old fullMetadataFetchArgs helper that added --unshallow was deleted without preserving this behavior. In the resume.go fallback flow, FetchMetadataTreeOnly creates shallow state via --depth=1, then FetchMetadataBranch is called as a fallback but no longer removes the shallow boundary. SafelyAdvanceLocalRef inside FetchMetadataBranch then fails with "no merge base… run 'entire doctor'" whenever the local metadata branch differs from remote, because git merge-base cannot traverse past the shallow boundary. Previously this path succeeded automatically.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 6af9e93. Configure here.

@pjbgf pjbgf review requested due to automatic review settings May 27, 2026 13:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants