Improve support for sub-agents on same branch#1212
Conversation
Multiple sessions sharing a shadow branch (same base commit + worktree) raced on the final SetReference at the end of WriteTemporary, and the race was silently lost to last-writer-wins. A stress test driving 8 concurrent sessions × 4 checkpoints each saw 75-94% of checkpoints orphaned per run. Wrap WriteTemporary and WriteTemporaryTask in a per-shadow-branch flock under <git-common-dir>/entire-shadow-locks/ and replace the bare SetReference with `git update-ref <ref> <new> <old>` (CAS) plus a small retry budget. The flock serializes our own writers across goroutines and processes; the CAS catches external `git update-ref` callers as defense in depth. After: 30/30 race iterations land all 32 checkpoints with no losses. Assisted-by: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Paulo Gomes <paulo@entire.io>
There was a problem hiding this comment.
Pull request overview
Fixes a checkpoint-loss race where multiple AI-agent sessions sharing the same shadow branch (same base commit + worktree) raced on the final SetReference at the end of WriteTemporary/WriteTemporaryTask. The fix combines a per-shadow-branch cross-process flock with a CAS-style git update-ref <ref> <new> <old> and a small randomized-backoff retry loop, plus a stress test that catches regressions.
Changes:
- Introduce
withShadowBranchFlock(file-based exclusive lock under<git-common-dir>/entire-shadow-locks/) andcasUpdateShadowBranchRef(shell-out CAS viagit update-ref), with retry budget and jittered backoff. - Refactor
WriteTemporaryandWriteTemporaryTaskto hold the flock around a retry loop that re-reads parent, re-builds the tree, creates the commit, and CASes the ref. - Add a concurrent stress test (
TestSaveStep_ConcurrentSessionsSameShadowBranch) driving 8 sessions × 4 checkpoints sharing a shadow branch and validating shadow-branch internal consistency and commit count.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| cmd/entire/cli/checkpoint/shadow_ref.go | New flock-path helper, withShadowBranchFlock, and casUpdateShadowBranchRef with jittered backoff. |
| cmd/entire/cli/checkpoint/temporary.go | Wrap shadow-branch writes in flock + CAS retry loop; restructure WriteTemporary/WriteTemporaryTask. |
| cmd/entire/cli/checkpoint/flock_unix.go | New Unix flock primitive (syscall.Flock) for the shadow lock. |
| cmd/entire/cli/checkpoint/flock_windows.go | New Windows flock primitive (LockFileEx) for the shadow lock. |
| cmd/entire/cli/strategy/manual_commit_concurrent_test.go | New concurrent stress test that reproduces the race and asserts shadow-branch consistency. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 7350c3f. Configure here.
- Resolve repoRoot+commonDir from s.repo at the top of WriteTemporary and WriteTemporaryTask, then pass them to the helpers. The git update-ref invocation now sets cmd.Dir = repoRoot so it targets s.repo regardless of process cwd, and the flock path resolves via the git common dir from the repo handle rather than session.GetGitCommonDir(cwd). - Force LC_ALL=C/LANG=C on git update-ref so the CAS-conflict pattern match against "cannot lock ref" / "but expected" isn't defeated by a translated stderr in non-C locales. - Extract the cross-process flock primitive to cmd/entire/cli/internal/ flock; remove the duplicated copies in checkpoint and strategy and switch both packages to the shared package. - Use newHash.HexSize() to size the all-zeros <oldvalue>, fixing the SHA-256 integration test (TestSHA256Repository_EnableAndFirstCheckpoint) that broke because 40 zeros are too short for SHA-256 refs. - Fix shadowRefBackoff: compute the doubling on the upper-bound base before sampling, and add a 1ms floor so the jitter is always meaningful and the "doubled after attempt > 4" branch actually scales the distribution. - Delete the dangling commit object on each CAS-losing attempt so we don't accumulate unreachable loose objects per retry. - Log a warning when the CAS retry budget is exhausted, so a stuck shadow branch surfaces in .entire/logs/. - Test: assert each session's StepCount equals checkpointsPerWorker after the race, so a future dedup-induced skip would surface as the precise invariant violation instead of the misleading "commits were lost" message. Renamed the goroutine-safe writeFile helper to writeFileForRaceTest to prevent accidental shadowing of a more general helper added to the strategy package later. Test-level comment now spells out the unique-content invariant the dedup behavior relies on. Assisted-by: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Paulo Gomes <paulo@entire.io> Entire-Checkpoint: b313504a803e

https://entire.io/gh/entireio/cli/trails/376
Multiple sessions sharing a shadow branch (same base commit + worktree) raced on the final SetReference at the end of WriteTemporary, and the race was silently lost to last-writer-wins. A stress test driving 8 concurrent sessions × 4 checkpoints each saw 75-94% of checkpoints orphaned per run.
Wrap WriteTemporary and WriteTemporaryTask in a per-shadow-branch flock under /entire-shadow-locks/ and replace the bare SetReference with
git update-ref <ref> <new> <old>(CAS) plus a small retry budget. The flock serializes our own writers across goroutines and processes; the CAS catches externalgit update-refcallers as defense in depth.After: 30/30 race iterations land all 32 checkpoints with no losses.
Before vs after
Note
Medium Risk
Changes shadow-branch ref update semantics and introduces cross-process locking plus a CAS retry loop, which can affect checkpoint correctness and potentially block/hang if locking/backoff misbehaves. Scope is limited to checkpoint write paths and is covered by a new concurrency regression test.
Overview
Prevents concurrent checkpoint writers from clobbering shadow branches. Temporary checkpoint writes (
WriteTemporaryandWriteTemporaryTask) are now wrapped in a per-shadow-branch filesystem lock and update the branch ref usinggit update-refcompare-and-swap with bounded retries and jittered backoff.Adds platform-specific flock implementations (Unix/Windows), stores lockfiles under
<git-common-dir>/entire-shadow-locks/, and introduces a stress test that runs many concurrentSaveStepcalls against the same shadow branch and validates the resulting commit chain/tree consistency.Reviewed by Cursor Bugbot for commit 7350c3f. Configure here.