Skip to content

Make entire explain work with partial-clone#1069

Merged
gtrrz-victor merged 7 commits intomainfrom
soph/explain-fetch
Apr 30, 2026
Merged

Make entire explain work with partial-clone#1069
gtrrz-victor merged 7 commits intomainfrom
soph/explain-fetch

Conversation

@Soph
Copy link
Copy Markdown
Collaborator

@Soph Soph commented Apr 28, 2026

Problem

entire explain <checkpoint-id> failed with "no checkpoint or commit found" against a checkpoint_remote-backed setup with filtered_fetches: true. The checkpoint was on the remote and discoverable in the local tree, but the metadata blob was absent locally — and go-git's Tree.File() returns ErrFileNotFound for missing blobs, which the v1 / v2 read paths treated as "checkpoint doesn't exist." Even when it eventually returned correct data, it could take 10+ seconds on a cold cache with zero progress feedback — easy to mistake for a hang.

Fix — correctness

Three coordinated changes:

  1. Configure blob fetcher on both stores. v1Store had SetBlobFetcher but explain wasn't calling it. Added the same to V2GitStore and wired both.
  2. Wrap v2 reads in FetchingTree. v1's read path already used FetchingTree (with cat-file fallback for partial-clone-filtered blobs). Mirrored that for v2's ReadCommitted / ReadSession* methods.
  3. Use git fetch-pack for blob SHAs. Porcelain git fetch <url> <hash> enforces partial-clone integrity checks that reject blob-only responses; plumbing skips those checks. Switched FetchBlobs accordingly.

Plus: FetchingTree.File now tries git cat-file BEFORE the network fetch — saves multi-second round-trips when a blob is on disk but invisible to go-git's storer (typical partial-clone state).

Fix — progress feedback

Wrapped the entire pre-output pipeline with two sequential spinners on stderr:

  • ⣾ Loading checkpoints during newExplainCheckpointLookup (the slow ListCommitted walk that reads every checkpoint's metadata.json via git cat-file)
  • ⣾ Loading checkpoint <id> during prefetchCheckpointBlobs (including its cat-file -e analysis loop), the actual fetch, summary + session reads, and getAssociatedCommits git log walk

Same Braille frames as the activity TUI for visual consistency. 250ms initial delay so warm/fast runs stay silent — no flicker. Stops strictly before any write to stdout so spinner frames and output never interleave. Suppressed entirely when stderr isn't a terminal (CI, redirected output, agent subprocesses).

Tests

Two new integration tests reproduce the partial-clone state in a fresh TempDir clone (file:// URL + uploadpack.allowFilter=true on the bare so --filter=blob:none is actually honored — the local-path transport ignores filters) and verify explain succeeds:

  • TestExplain_CheckpointSucceedsAfterTreelessFetch (v1)
  • TestExplain_CheckpointV2SucceedsAfterTreelessFetch (v2)

Both assert at least one metadata blob is genuinely missing locally before running explain — without that pre-condition the tests could silently pass against a fully-cloned repo. Verified for regression: stashing the v2 fix makes the v2 test fail with the exact original symptom (failed to read checkpoint: checkpoint not found).

Performance

  • Cold run (blobs missing locally): ~1-2s — one batched fetch-pack call covers all blobs in the checkpoint subtree, no per-blob round-trips
  • Warm run (blobs on disk): sub-second — cat-file reads local blobs directly, no network

Soph and others added 2 commits April 28, 2026 20:00
`entire explain <checkpoint-id>` failed with "no checkpoint or commit
found" when the metadata branch was on a configured `checkpoint_remote`
but blobs hadn't been fetched yet. Three coordinated fixes:

1. v1Store now has a blob fetcher configured (the piece resume already
   had via SetBlobFetcher). Without it, go-git's Tree.File() returns
   ErrFileNotFound on a missing blob, which ReadCommitted translates to
   ErrCheckpointNotFound — making it look like the checkpoint doesn't
   exist when really only the blob is absent.

2. FetchBlobs now uses `git fetch-pack` instead of porcelain
   `git fetch`. Porcelain enforces partial-clone integrity checks that
   reject blob-only responses with "did not send all necessary objects",
   so `git fetch <url> <blob-sha>` always failed against GitHub for
   repos with filtered_fetches enabled. Plumbing skips those checks and
   just downloads the requested objects.

3. FetchingTree.File now tries `git cat-file` BEFORE the network fetch.
   In partial-clone repos a blob is commonly on disk but invisible to
   go-git's storer (filtered out, or in a packfile not in the cached
   index). Without the short-circuit, every File() burned a network
   round-trip on already-local data — which compounded into per-File
   25s SSH-handshake-and-rejection delays before this fix.

The matching empty-list fetch chain for explain now mirrors
resume.getMetadataTree (checkpoint_remote → treeless origin → full
origin), and a per-checkpoint PreFetch batches all subtree blobs in one
fetch-pack call. Cold runs are ~1-2s; warm runs are sub-second.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 74b3190b7dfb
The v1 fix in 645fb12 left v2 vulnerable to the same bug: when a
metadata blob is on the remote but absent locally, V2GitStore's read
path used raw `*object.Tree.File()` calls which return ErrFileNotFound
on missing blobs — and ReadCommitted treats that as "checkpoint not
found". This blocks `entire explain` against partial-clone repos as
soon as checkpoints_v2 is enabled.

v2 fix mirrors v1:
- V2GitStore gains `blobFetcher` + `SetBlobFetcher` (parallel to GitStore).
- ReadCommitted, ReadSessionCompactTranscript, ReadSessionMetadataAndPrompts,
  and ReadSessionContent now wrap their cp/session subtrees in
  FetchingTree before calling File(). FetchingTree's cat-file fallback
  + on-demand fetcher recover blobs that go-git's storer can't see.
- explain configures FetchBlobsByHash on both v1 and v2 stores.
- prefetchCheckpointBlobs walks both v1 and v2 cp subtrees so a single
  fetch-pack round-trip primes both stores.

Regression tests in cmd/entire/cli/integration_test/explain_test.go:
- TestExplain_CheckpointSucceedsAfterTreelessFetch (v1)
- TestExplain_CheckpointV2SucceedsAfterTreelessFetch (v2)

Both reproduce the bug-triggering state by:
1. Pushing a checkpoint to a bare remote.
2. Cloning into a fresh TempDir with `--filter=blob:none --depth=1` over
   `file://` URL (uploadpack.allowFilter set on the bare). This gives
   the clone tree entries but no blobs — the actual partial-clone state.
3. Asserting at least one metadata blob is genuinely missing (otherwise
   the test would silently pass without exercising the fix).
4. Running explain in the clone and verifying the prompt text is in the
   output.

Verified: stashing the v2 fix and running TestExplain_CheckpointV2SucceedsAfterTreelessFetch
fails with `failed to read checkpoint: checkpoint not found` — the
exact original bug — confirming the test catches regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 7febd2b70c1f
Copilot AI review requested due to automatic review settings April 28, 2026 19:03
@Soph Soph requested a review from a team as a code owner April 28, 2026 19:03
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit ae4b4e5. Configure here.

Comment thread cmd/entire/cli/explain.go Outdated
Comment thread cmd/entire/cli/integration_test/explain_test.go
Soph and others added 4 commits April 28, 2026 21:14
Entire-Checkpoint: 3ebcceee3024
`entire explain <id>` against a treeless checkpoint metadata branch can
take 1-2 seconds for the fetch-pack round-trip. Previously the command
was silent during that window — easy to mistake for a hang.

Adds two visual indicators on stderr:

1. "Fetching checkpoint metadata from remote" — shown when the empty-
   matches path triggers `getMetadataTree` / `getV2MetadataTree`.

2. "Fetching N checkpoint blob(s) from remote" — shown when the
   pre-resolve prefetch detects N missing blobs in the cp subtree(s).

The spinner uses the same Braille frames as the activity TUI's
`spinner.Dot` so the visual matches across commands. When stderr is
not a terminal (CI, redirected output, agent subprocess), the spinner
is suppressed and the message is printed once with a trailing "..." —
non-interactive callers still see what's happening without ANSI noise.

`prefetchCheckpointBlobs` was restructured to count missing blobs
upfront (via the new exported `FetchingTree.CollectMissingBlobs`) so
the spinner only shows when network work is genuinely needed; warm runs
where everything is already local stay silent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: c3c1ad356bad
…arm runs stay silent

Spinner from the previous commit only wrapped the explicit fetch paths,
but the slow part of `entire explain` against a treeless metadata branch
is actually `newExplainCheckpointLookup` → `ListCommitted`, which reads
metadata.json via `git cat-file` for every checkpoint in the branch
(~10ms × N). 7-10s before any output, no visible feedback.

Wrap the lookup creation in `runExplainAuto` with the spinner. Add a
250ms initial delay to startSpinner so warm/fast runs (sub-second) don't
flicker — the spinner only appears once the operation has actually been
running long enough that the user might wonder if it's hung.

Also drop the suppression-mode "msg..." print for non-TTY: redirected
output and CI stay completely clean (matches the existing convention
for progress dots).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 7ccf5948e128
The "Loading checkpoints" spinner only wraps `newExplainCheckpointLookup`
(ListCommitted). After that returns, several more network/IO steps run
silently before output prints:
  - ResolveCommittedReaderForCheckpoint reads metadata.json via cat-file
  - readLatestSessionContentForExplain reads session metadata, prompt,
    transcript blobs (more cat-file calls)
  - getAssociatedCommits walks git log for matching trailers

On a deep repo or one with many missing-blob reads, this second phase
itself can take seconds. The user previously saw the lookup spinner,
then silence for "twice as long", then output.

Add a second spinner with message "Loading checkpoint <id>" covering the
post-lookup pipeline. Stop strictly before any write to w (stdout) so
the stderr spinner frames never interleave with stdout output. The
existing 250ms initial delay applies, so warm runs that complete the
data-load in under a quarter second still produce no spinner output.

For --generate, the spinner stops before generation prints its own
progress, then a "Reloading checkpoint" spinner resumes for the
post-generation read.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 2f90eccad241
pjbgf
pjbgf previously approved these changes Apr 28, 2026
The data-load spinner started AFTER prefetchCheckpointBlobs, but the
prefetch's missing-blob analysis is itself slow on a deep checkpoint
subtree: it spawns one `git cat-file -e` subprocess per blob entry. For
a checkpoint with many sessions × tasks × files, that's hundreds of
subprocess calls — silent seconds between the lookup spinner ending
and the data-load spinner starting.

Move the spinner start to wrap prefetchCheckpointBlobs too, and drop
the inner "Fetching N blobs from remote" spinner that the prefetch
function used to spawn. Two spinners on the same line would conflict;
one continuous spinner is the right UX. The fetch-count detail moves
into a debug log.

Extract the prefetch + summary read + content read sequence into a
loadCheckpointForExplain helper so runExplainCheckpointWithLookup stays
under maintidx limits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: c48334ab4ab6
@gtrrz-victor gtrrz-victor merged commit 28aa01f into main Apr 30, 2026
9 checks passed
@gtrrz-victor gtrrz-victor deleted the soph/explain-fetch branch April 30, 2026 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants