Checkpoints v2: Fix migration rerun logic (speedups, avoid duplicated migration efforts)#1114

Merged
computermode merged 13 commits into main from more-migration-speedups
May 5, 2026

@computermode computermode commented May 5, 2026

https://entire.io/gh/entireio/cli/trails/292

Summary

Fixes a few issues that arose when rerunning `entire migrate --checkpoints v2`:

  • Rerunning the migration could duplicate /full/<n> archives on what should have been no-ops (one of our repos grew from 46 to 86 generations).
  • Interrupting a migration could leave a partially completed state that a rerun might not fix properly.
  • Every rerun walked the transcripts across all trees unconditionally, even when there was nothing else to do.

For repos with thousands of checkpoints, the issues above meant that migrations would take 20+ minutes to run even if there was nothing to be done.

Changes

  • Rerun behavior per checkpoint: skip when every session has /full artifacts; pack only the missing sessions when some are absent; fresh-migrate when the checkpoint isn't in /main yet; pass --force to redo from scratch.
  • Generation-metadata repair runs only when the loop packed something, and just-packed archives are excluded because the packer already writes a correct generation.json from in-memory data. A new BuildFullSessionArtifactsIndex turns per-session presence checks into O(1) map lookups, and ReadSessionMetadata reads only metadata.json.
  • No-op migration reruns now take seconds.
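
For readers skimming the rerun rules, here is a minimal Go sketch of the per-checkpoint decision described in the first bullet. The `Action` type, the `decide` function, and its parameters are hypothetical names made up for illustration and are not taken from the PR; the sketch also ignores the compact-transcript recovery path that a later commit in this PR restores.

```go
package main

import "fmt"

// Action is a hypothetical label for what a rerun does with one checkpoint.
type Action string

const (
	ActionSkip         Action = "skip"
	ActionPackMissing  Action = "pack-missing"
	ActionFreshMigrate Action = "fresh-migrate"
	ActionForceRedo    Action = "force-redo"
)

// decide picks the rerun action for one checkpoint and returns the
// sessions that still need work.
//   force:    the user passed --force
//   inMain:   the checkpoint has already been written to /main
//   sessions: every session key in the checkpoint (assumed representation)
//   hasFull:  set of session keys that already have /full artifacts
func decide(force, inMain bool, sessions []string, hasFull map[string]struct{}) (Action, []string) {
	if force {
		return ActionForceRedo, sessions
	}
	if !inMain {
		return ActionFreshMigrate, sessions
	}
	var missing []string
	for _, id := range sessions {
		if _, ok := hasFull[id]; !ok {
			missing = append(missing, id)
		}
	}
	if len(missing) == 0 {
		return ActionSkip, nil
	}
	return ActionPackMissing, missing
}

func main() {
	// One session is already packed, one is not: only the second is queued.
	hasFull := map[string]struct{}{"s1": {}}
	action, missing := decide(false, true, []string{"s1", "s2"}, hasFull)
	fmt.Println(action, missing)
}
```

The ordering of the checks matters in this sketch: --force wins over everything, and the /full presence check only applies once the checkpoint is known to be in /main.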

Note

Medium Risk
Touches checkpoint migration and generation repair flows, changing when archives are created and when metadata repair runs; mistakes could skip needed packing or leave inconsistent /full/* state. Scope is contained to migration/repair tooling, but impacts large repos and historical data layout.

Overview
Improves entire migrate --checkpoints v2 rerun behavior so already-migrated checkpoints are skipped when /full/* artifacts are present, while partially migrated checkpoints (written to /main but missing /full/*) resume by packing only missing sessions instead of re-migrating everything.

Speeds up the migration hot path by prebuilding a BuildFullSessionArtifactsIndex for O(1) per-session /full/* presence checks, adding a lightweight ReadSessionMetadata (metadata-only) read, and computing generation.json timestamps from already-loaded transcripts via AggregateTranscriptTimestamps.

Reduces post-migration overhead by running RepairV2GenerationMetadata only when new /full/<n> refs were written, and adds ExcludeRefs support so freshly packed generations are not re-walked during repair. Removes legacy v1 transcript cleanup and compact-transcript backfill/fast-path logic from the migration rerun flow, with tests updated to lock in the new idempotent/resume semantics.

Reviewed by Cursor Bugbot for commit 4b333b0.

computermode and others added 7 commits May 4, 2026 12:30
Entire-Checkpoint: ff49fc0893a5
Detect checkpoints with /main written but /full artifacts missing —
the state a Ctrl+C between the metadata write and the generation
packer flush leaves behind — and queue the missing sessions for the
same packer used by fresh migrations. Result:

  * New v1 checkpoints are migrated on rerun.
  * Interrupted runs resume without --force.
  * Fully-migrated checkpoints stay skipped.
  * --force still prunes and re-migrates from scratch.

Also recognize the pre-rename filenames (full.jsonl, full.jsonl.NNN,
content_hash.txt — see a3cd771) as valid /full-session artifacts.
Without this, archived /full/<n> generations written by an older CLI
looked "missing" on rerun and were repacked into duplicate
generations (a real repo went 46 → 86 archived gens from one rerun).
Both the new (raw_transcript / raw_transcript_hash.txt) and legacy
names are now accepted.

CleanupV1TranscriptFiles is removed — --force prunes the entire
checkpoint subtree (legacy files included), so targeted /full/current
cleanup is redundant under the new semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 7391783beb62
Walk every /full/* ref (current plus all archived /full/<n>) at the
start of each migrate invocation and rewrite session subtrees so any
pre-rename entry names — full.jsonl, full.jsonl.NNN, content_hash.txt
— are renamed to their current equivalents (raw_transcript,
raw_transcript.NNN, raw_transcript_hash.txt). Blob hashes are
preserved; only tree entry names change. A repo already on the current
naming is a no-op (no commits created).

Without this, a repo migrated under a pre-a3cd77122 CLI ended up with
archived generations carrying legacy names that the read paths could
no longer find. The previous commit's recognize-both-names fix
prevented duplicate generations on rerun, but left the legacy entries
sitting on disk as dark matter. Now they get renamed in place on the
next migrate run, so v2 ends up on a single naming convention
regardless of which CLI version originally populated it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: d0f97513ade1
…e run"

This reverts commit 3be8eed.

Entire-Checkpoint: da7d3f277ea6
The pre-rename filenames (full.jsonl / content_hash.txt — see
a3cd771) don't appear in any real repo we've inspected: every v2
archived /full/<n> was written under the current naming
(raw_transcript / raw_transcript_hash.txt). Recognizing the legacy
names was speculative defense for a path no real data follows.

Removes the extra switch arms in inspectFullSessionArtifacts plus the
test case + helper that were exercising them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 610ce5c951cf
Three changes that together turn the loop and post-loop phases of
entire migrate from "hours" to "seconds + the irreducible cost of
reading the v1 raw transcripts that need packing" on repos with
thousands of partially-migrated checkpoints.

* Build a /full/* presence index once per migration run.
  HasFullSessionArtifacts lists every git ref in the repo and re-walks
  every /full/* tree on each call. The migration loop calls it once
  per session — for every v1 checkpoint — so on a repo with N refs and
  K archived /full/<n> generations the cost is O(checkpoints x sessions
  x N x K). The new BuildFullSessionArtifactsIndex walks each /full/*
  tree once at the top of migrateCheckpointsV2 and records sessions
  with both raw_transcript and raw_transcript_hash.txt. The loop's
  presence checks become O(1) map lookups.

* Read only metadata.json when collecting missing-/full sessions.
  collectMissingFullSessionsForPacking previously called
  ReadSessionMetadataAndPrompts purely to extract Metadata.SessionID,
  but that function also reads prompt.txt and transcript.jsonl. The
  third lookup spammed FetchingTree.File "entry not found" debug logs
  at thousands of cps/sec on partial-state repos because compact
  transcripts are absent from /main for the same sessions whose /full
  is missing. New ReadSessionMetadata reads only metadata.json.

* Skip just-packed archives in RepairV2GenerationMetadata.
  The migration packer writes each fresh /full/<n> with generation.json
  computed in-memory from the just-packed transcripts (via
  AggregateTranscriptTimestamps), so re-deriving timestamps from the
  archived blobs in the repair pass is wasted work. The packer now
  records every ref it writes, runMigrateCheckpointsV2 passes that
  list to RepairV2GenerationMetadata via the new ExcludeRefs option,
  and the repair pass filters those candidates out before doing any
  per-archive computation.
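
The first change above (build the presence index once, then do map lookups) could look roughly like the sketch below. It is only an illustration under stated assumptions: `sessionTree` and `buildIndex` are hypothetical stand-ins, each /full/* tree is modeled as a plain map from a session key to its entry names, and the real BuildFullSessionArtifactsIndex reads git trees rather than in-memory maps.

```go
package main

import "fmt"

// sessionTree is a hypothetical in-memory model of one /full/* tree:
// session key -> entry names found in that session's subtree.
type sessionTree map[string][]string

// buildIndex walks every tree exactly once and records the sessions that
// have both raw_transcript and raw_transcript_hash.txt. Callers can then
// answer "does this session have /full artifacts?" with a map lookup
// instead of re-walking the trees on every call.
func buildIndex(trees []sessionTree) map[string]struct{} {
	index := make(map[string]struct{})
	for _, tree := range trees {
		for sessionKey, names := range tree {
			var hasTranscript, hasHash bool
			for _, name := range names {
				switch name {
				case "raw_transcript":
					hasTranscript = true
				case "raw_transcript_hash.txt":
					hasHash = true
				}
			}
			if hasTranscript && hasHash {
				index[sessionKey] = struct{}{}
			}
		}
	}
	return index
}

func main() {
	trees := []sessionTree{
		{"session-a": {"raw_transcript", "raw_transcript_hash.txt"}},
		{"session-b": {"raw_transcript"}}, // hash missing, so not indexed
	}
	index := buildIndex(trees)
	_, okA := index["session-a"]
	_, okB := index["session-b"]
	fmt.Println(okA, okB)
}
```

The set is a `map[string]struct{}` because only membership matters here; a later commit in this PR makes the same choice for FullSessionArtifactsIndex.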

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 3983fe320ba8
RepairV2GenerationMetadata does a `git ls-remote` plus a transcript-
blob walk for every archived /full/<n> ref. On a fully-migrated repo
those archives don't change between runs, so paying that cost on
every `entire migrate --checkpoints v2` invocation is wrong — it's
the dominant cause of "the progress bar finishes and then it hangs"
on big repos.

Gate the call on `len(freshlyPackedRefs) > 0`. Repair runs when the
migration actually wrote new state (fresh checkpoints or a resumed
partial migration); a no-op rerun exits as soon as the loop completes.
The previously-existing path for fixing malformed archives still
works whenever migration does any packing — the malformed archive
isn't in ExcludeRefs and gets repaired. The path that no longer
fires automatically is "no migration work, just verify everything
that already existed", which only matters on a malformed-but-stable
repo and can be triggered explicitly via --force.
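
A rough Go sketch of the gating described above, combined with the exclude-refs filter from the previous commit. `filterCandidates` and `repairTargets` are hypothetical helper names, and the ref strings are placeholders rather than the real ref layout; the actual logic lives in runMigrateCheckpointsV2 and RepairV2GenerationMetadata.

```go
package main

import "fmt"

// filterCandidates returns the candidate refs that are not in exclude,
// preserving the original order.
func filterCandidates(candidates, exclude []string) []string {
	skip := make(map[string]struct{}, len(exclude))
	for _, ref := range exclude {
		skip[ref] = struct{}{}
	}
	var out []string
	for _, ref := range candidates {
		if _, excluded := skip[ref]; !excluded {
			out = append(out, ref)
		}
	}
	return out
}

// repairTargets returns the archived refs a repair pass would visit.
// It returns nil when nothing was packed (the no-op rerun case);
// otherwise it returns every archived ref except the freshly packed ones.
func repairTargets(archived, freshlyPacked []string) []string {
	if len(freshlyPacked) == 0 {
		return nil
	}
	return filterCandidates(archived, freshlyPacked)
}

func main() {
	archived := []string{"full/1", "full/2", "full/3"} // placeholder ref names
	fmt.Println(repairTargets(archived, nil))
	fmt.Println(repairTargets(archived, []string{"full/3"}))
}
```

In this model a malformed older archive is still visited whenever the run packs anything, since it is never part of the freshly packed list.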

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: e03e093352f3
Copilot AI review requested due to automatic review settings May 5, 2026 00:21
@computermode computermode requested a review from a team as a code owner May 5, 2026 00:21

Copilot AI left a comment

Pull request overview

Refines the hidden entire migrate --checkpoints v2 flow so reruns avoid redundant work in the checkpoints-v2 migration pipeline. The PR mainly speeds up reruns by indexing existing /full/* artifacts, repairing only missing packed state, and reducing post-migration generation-metadata work.

Changes:

  • Adds a /full/* presence index plus lighter metadata reads to make rerun checks cheaper.
  • Reworks rerun behavior to resume interrupted /main/full packing and skip already-packed checkpoints.
  • Avoids recomputing generation metadata for freshly packed refs and limits when the archived-generation repair pass runs.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| cmd/entire/cli/strategy/generation_repair.go | Adds repair options so callers can exclude freshly written archived refs. |
| cmd/entire/cli/strategy/generation_repair_test.go | Tests the new exclude-refs repair behavior. |
| cmd/entire/cli/migrate.go | Reworks migration rerun logic, adds /full indexing, and changes when repair runs. |
| cmd/entire/cli/migrate_test.go | Updates migration tests around interrupted reruns, no-op reruns, and new checkpoint pickup. |
| cmd/entire/cli/checkpoint/v2_store_test.go | Replaces legacy cleanup tests with index-behavior tests. |
| cmd/entire/cli/checkpoint/v2_read.go | Adds a metadata-only v2 read path for hot loops. |
| cmd/entire/cli/checkpoint/v2_generation.go | Adds in-memory transcript timestamp aggregation for packed generations. |
| cmd/entire/cli/checkpoint/v2_committed.go | Implements the /full artifact index used by migration reruns. |

Comment thread cmd/entire/cli/migrate.go
Comment thread cmd/entire/cli/checkpoint/v2_read.go Outdated
Comment thread cmd/entire/cli/migrate.go Outdated
computermode and others added 4 commits May 5, 2026 10:34
ReadSessionMetadata bypassed wrapWithFetcher, unlike the other v2
readers in the same file. On treeless or partial clones (or with a
stale go-git object cache), that would surface as ErrCheckpointNotFound
during migration reruns even when metadata.json was reachable through
the configured blob fetcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: b302c4eebaf6
After the rerun fast path was switched to a /full/* presence check,
older v2 checkpoints with archived raw transcripts but blank
transcript.jsonl on /main could only be repaired via --force.
Restore the lighter recovery path: when every session already has
/full artifacts but some compact transcripts are missing, rebuild
them from v1 and write via UpdateCommitted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 2c4b1bc42be7
Drops the single-field RepairV2GenerationMetadataOptions struct in favor
of a plain excludeRefs slice, switches FullSessionArtifactsIndex to
map[string]struct{} (the ref-name value was unused), and tightens the
comment landscape so the gating-rationale story is told in one place
instead of three. Adds a regression test for the resume-via-packing
path that ensures root-level v1 task metadata stays attached to the
latest v2 session rather than the older session being repacked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 2da5d3818b9b
Comment thread cmd/entire/cli/checkpoint/v2_committed.go Outdated
Comment thread cmd/entire/cli/checkpoint/v2_committed.go
Comment thread cmd/entire/cli/migrate.go Outdated
pfleidi previously approved these changes May 5, 2026
@computermode computermode merged commit 4fae3ed into main May 5, 2026
9 checks passed
@computermode computermode deleted the more-migration-speedups branch May 5, 2026 21:37
3 participants