Skip to content

Improve support for sub-agents on same branch#1212

Merged
pjbgf merged 2 commits into
mainfrom
tree-investigation
May 14, 2026
Merged

Improve support for sub-agents on same branch#1212
pjbgf merged 2 commits into
mainfrom
tree-investigation

Conversation

@pjbgf
Copy link
Copy Markdown
Member

@pjbgf pjbgf commented May 14, 2026

https://entire.io/gh/entireio/cli/trails/376

Multiple sessions sharing a shadow branch (same base commit + worktree) raced on the final SetReference at the end of WriteTemporary, and the race was silently lost to last-writer-wins. A stress test driving 8 concurrent sessions × 4 checkpoints each saw 75-94% of checkpoints orphaned per run.

Wrap WriteTemporary and WriteTemporaryTask in a per-shadow-branch flock under /entire-shadow-locks/ and replace the bare SetReference with git update-ref <ref> <new> <old> (CAS) plus a small retry budget. The flock serializes our own writers across goroutines and processes; the CAS catches external git update-ref callers as defense in depth.

After: 30/30 race iterations land all 32 checkpoints with no losses.

Before vs after

Test scenario Before After
20 race iterations, 8 sessions × 4 checkpoints 75–94% commits lost 0 commits lost
30 race iterations (final) n/a 0 commits lost

Note

Medium Risk
Changes shadow-branch ref update semantics and introduces cross-process locking plus a CAS retry loop, which can affect checkpoint correctness and potentially block/hang if locking/backoff misbehaves. Scope is limited to checkpoint write paths and is covered by a new concurrency regression test.

Overview
Prevents concurrent checkpoint writers from clobbering shadow branches. Temporary checkpoint writes (WriteTemporary and WriteTemporaryTask) are now wrapped in a per-shadow-branch filesystem lock and update the branch ref using git update-ref compare-and-swap with bounded retries and jittered backoff.

Adds platform-specific flock implementations (Unix/Windows), stores lockfiles under <git-common-dir>/entire-shadow-locks/, and introduces a stress test that runs many concurrent SaveStep calls against the same shadow branch and validates the resulting commit chain/tree consistency.

Reviewed by Cursor Bugbot for commit 7350c3f. Configure here.

Multiple sessions sharing a shadow branch (same base commit + worktree)
raced on the final SetReference at the end of WriteTemporary, and the
race was silently lost to last-writer-wins. A stress test driving 8
concurrent sessions × 4 checkpoints each saw 75-94% of checkpoints
orphaned per run.

Wrap WriteTemporary and WriteTemporaryTask in a per-shadow-branch flock
under <git-common-dir>/entire-shadow-locks/ and replace the bare
SetReference with `git update-ref <ref> <new> <old>` (CAS) plus a small
retry budget. The flock serializes our own writers across goroutines
and processes; the CAS catches external `git update-ref` callers as
defense in depth.

After: 30/30 race iterations land all 32 checkpoints with no losses.

Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Paulo Gomes <paulo@entire.io>
Copilot AI review requested due to automatic review settings May 14, 2026 14:34
@pjbgf pjbgf requested a review from a team as a code owner May 14, 2026 14:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a checkpoint-loss race where multiple AI-agent sessions sharing the same shadow branch (same base commit + worktree) raced on the final SetReference at the end of WriteTemporary/WriteTemporaryTask. The fix combines a per-shadow-branch cross-process flock with a CAS-style git update-ref <ref> <new> <old> and a small randomized-backoff retry loop, plus a stress test that catches regressions.

Changes:

  • Introduce withShadowBranchFlock (file-based exclusive lock under <git-common-dir>/entire-shadow-locks/) and casUpdateShadowBranchRef (shell-out CAS via git update-ref), with retry budget and jittered backoff.
  • Refactor WriteTemporary and WriteTemporaryTask to hold the flock around a retry loop that re-reads parent, re-builds the tree, creates the commit, and CASes the ref.
  • Add a concurrent stress test (TestSaveStep_ConcurrentSessionsSameShadowBranch) driving 8 sessions × 4 checkpoints sharing a shadow branch and validating shadow-branch internal consistency and commit count.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
cmd/entire/cli/checkpoint/shadow_ref.go New flock-path helper, withShadowBranchFlock, and casUpdateShadowBranchRef with jittered backoff.
cmd/entire/cli/checkpoint/temporary.go Wrap shadow-branch writes in flock + CAS retry loop; restructure WriteTemporary/WriteTemporaryTask.
cmd/entire/cli/checkpoint/flock_unix.go New Unix flock primitive (syscall.Flock) for the shadow lock.
cmd/entire/cli/checkpoint/flock_windows.go New Windows flock primitive (LockFileEx) for the shadow lock.
cmd/entire/cli/strategy/manual_commit_concurrent_test.go New concurrent stress test that reproduces the race and asserts shadow-branch consistency.

Comment thread cmd/entire/cli/checkpoint/shadow_ref.go Outdated
Comment thread cmd/entire/cli/checkpoint/shadow_ref.go
Comment thread cmd/entire/cli/checkpoint/flock_unix.go Outdated
Comment thread cmd/entire/cli/checkpoint/shadow_ref.go
Comment thread cmd/entire/cli/checkpoint/temporary.go Outdated
Comment thread cmd/entire/cli/strategy/manual_commit_concurrent_test.go Outdated
Comment thread cmd/entire/cli/strategy/manual_commit_concurrent_test.go Outdated
Comment thread cmd/entire/cli/checkpoint/temporary.go
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 7350c3f. Configure here.

Comment thread cmd/entire/cli/checkpoint/shadow_ref.go
@pjbgf pjbgf changed the title Serialize shadow-branch writes with flock + CAS Improve support for sub-agents on same branch May 14, 2026
- Resolve repoRoot+commonDir from s.repo at the top of WriteTemporary and
  WriteTemporaryTask, then pass them to the helpers. The git update-ref
  invocation now sets cmd.Dir = repoRoot so it targets s.repo regardless
  of process cwd, and the flock path resolves via the git common dir from
  the repo handle rather than session.GetGitCommonDir(cwd).
- Force LC_ALL=C/LANG=C on git update-ref so the CAS-conflict pattern
  match against "cannot lock ref" / "but expected" isn't defeated by a
  translated stderr in non-C locales.
- Extract the cross-process flock primitive to cmd/entire/cli/internal/
  flock; remove the duplicated copies in checkpoint and strategy and
  switch both packages to the shared package.
- Use newHash.HexSize() to size the all-zeros <oldvalue>, fixing the
  SHA-256 integration test (TestSHA256Repository_EnableAndFirstCheckpoint)
  that broke because 40 zeros are too short for SHA-256 refs.
- Fix shadowRefBackoff: compute the doubling on the upper-bound base
  before sampling, and add a 1ms floor so the jitter is always meaningful
  and the "doubled after attempt > 4" branch actually scales the
  distribution.
- Delete the dangling commit object on each CAS-losing attempt so we
  don't accumulate unreachable loose objects per retry.
- Log a warning when the CAS retry budget is exhausted, so a stuck
  shadow branch surfaces in .entire/logs/.
- Test: assert each session's StepCount equals checkpointsPerWorker
  after the race, so a future dedup-induced skip would surface as the
  precise invariant violation instead of the misleading "commits were
  lost" message. Renamed the goroutine-safe writeFile helper to
  writeFileForRaceTest to prevent accidental shadowing of a more general
  helper added to the strategy package later. Test-level comment now
  spells out the unique-content invariant the dedup behavior relies on.

Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Paulo Gomes <paulo@entire.io>
Entire-Checkpoint: b313504a803e
@pjbgf pjbgf enabled auto-merge May 14, 2026 15:19
@pjbgf pjbgf merged commit e69fdc3 into main May 14, 2026
9 checks passed
@pjbgf pjbgf deleted the tree-investigation branch May 14, 2026 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants