Skip to content

Bound checkpoint git subprocesses to their context deadline#1282

Merged
computermode merged 11 commits into
mainfrom
harden-checkpoint-git-timeout
May 28, 2026
Merged

Bound checkpoint git subprocesses to their context deadline#1282
computermode merged 11 commits into
mainfrom
harden-checkpoint-git-timeout

Conversation

@computermode
Copy link
Copy Markdown
Contributor

@computermode computermode commented May 27, 2026

https://entire.io/gh/entireio/cli/trails/440

Problem

A git push/git fetch over a custom transport (e.g. entire://) can hang the pre-push hook for ~an hour despite a 2-minute timeout. A doctor trace of a stuck checkpoint push:

push_checkpoints_branch        3,502,176ms
  ├─ fetch_and_rebase             13,695ms
  │  └─ git_fetch                 12,612ms
  ├─ git_push                      1,163ms  x
  └─ git_push~1                3,487,316ms  x   ← ~58 min

The context deadline SIGKILLs the git process, but the spawned remote-helper grandchild (e.g. git-remote-entire) keeps the output pipe open, so cmd.CombinedOutput() blocks until that helper finally exits.

Fix

Harden every git subprocess built by newCommand (push / fetch / fetch-pack / ls-remote / cat-file):

  • WaitDelay backstop (all platforms): after the process is killed, Go force-closes the pipes instead of waiting on a pipe-holding descendant.
  • Process-group kill (unix): the child runs in its own process group and the whole group is SIGKILLed on cancellation, so the remote helper is terminated too rather than orphaned. Windows relies on the WaitDelay backstop (reliable process-tree kill there needs a Job Object, out of scope).

Turns "stuck forever" into a bounded failure: checkpoints stay saved locally and the user's own push proceeds.

Testing

  • Unit tests assert newCommand sets WaitDelay + a group-kill Cancel (unix Setpgid).
  • A behavioral test reproduces the failure shape — a backgrounded grandchild holding the pipe — and confirms the group-kill terminates it promptly.
  • Existing remote, strategy push/fetch/rebase, and integration remote tests pass.

Notes

  • The transport itself still hangs on the retry upload — that's the entiredb / git-remote-entire side; this PR just stops it from hanging the hook.
  • The per-phase perf spans that produced the trace above are in a separate branch (hooks-slow-push-db-mirror).

🤖 Generated with Claude Code


Note

Medium Risk
Changes how all checkpoint remote git invocations terminate on timeout/cancel; incorrect group kill could affect child processes, though scope is limited to the checkpoint remote package and is well-tested on Unix.

Overview
Checkpoint git subprocesses built by newCommand (push, fetch, fetch-pack, ls-remote, cat-file) are hardened so context timeouts are not defeated by orphaned transport helpers.

Every such command gets a 10s WaitDelay so CombinedOutput/Wait cannot block indefinitely on pipe-holding descendants after the parent is killed. On Unix, children run in their own process group with a custom Cancel that SIGKILLs the whole group, so helpers like git-remote-entire are torn down instead of keeping stdout open for tens of minutes. Windows only gets the WaitDelay backstop (process-tree kill deferred).

Unit and behavioral tests cover WaitDelay, group kill on Unix, and the pipe-holding grandchild scenario.

Reviewed by Cursor Bugbot for commit 36085bb. Configure here.

Copilot AI review requested due to automatic review settings May 27, 2026 20:33
@computermode computermode requested a review from a team as a code owner May 27, 2026 20:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens checkpoint-remote git subprocess execution so context timeouts/cancellation reliably terminate hung git push/git fetch operations (especially when custom transports spawn helper descendants that keep stdout/stderr pipes open).

Changes:

  • Apply subprocess hardening in newCommand for all checkpoint-remote git invocations.
  • Introduce cross-platform WaitDelay backstop plus Unix-only process-group kill on cancellation.
  • Add unit and behavioral tests validating WaitDelay and process-group termination behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
cmd/entire/cli/checkpoint/remote/git.go Ensures all checkpoint-remote git commands created via newCommand are hardened against hangs.
cmd/entire/cli/checkpoint/remote/command_harden.go Adds shared hardening logic (WaitDelay + platform hook).
cmd/entire/cli/checkpoint/remote/command_harden_windows.go Implements Windows no-op for process-group kill (relies on WaitDelay + default cancellation).
cmd/entire/cli/checkpoint/remote/command_harden_unix.go Implements Unix process-group isolation (Setpgid) and group-wide SIGKILL on cancel.
cmd/entire/cli/checkpoint/remote/command_harden_unix_test.go Adds Unix tests validating process-group kill behavior and pipe-holding descendant scenario.
cmd/entire/cli/checkpoint/remote/command_harden_test.go Adds cross-platform tests validating newCommand/hardenSubprocess set WaitDelay and cancellation behavior.

Comment thread cmd/entire/cli/checkpoint/remote/command_harden_test.go Outdated
Comment thread cmd/entire/cli/checkpoint/remote/command_harden_unix_test.go Outdated
Comment thread cmd/entire/cli/checkpoint/remote/command_harden_unix_test.go Outdated
Comment thread cmd/entire/cli/checkpoint/remote/command_harden_unix_test.go Outdated
A `git push`/`git fetch` over a custom transport (e.g. entire://) could
hang the pre-push hook for ~an hour despite a 2-minute timeout. The
deadline SIGKILLs the `git` process, but the spawned remote-helper
grandchild (e.g. git-remote-entire) keeps the output pipe open, so
cmd.CombinedOutput() blocks until that helper finally exits (~58 min
observed for a stuck checkpoint push).

Harden every git subprocess built by newCommand:
  - WaitDelay backstop (all platforms) so Wait force-closes the pipes
    shortly after the process is killed instead of waiting on a
    pipe-holding descendant.
  - On unix, run the child in its own process group and SIGKILL the whole
    group on cancellation so the remote helper is terminated too, not
    orphaned. Windows relies on the WaitDelay backstop (reliable
    process-tree kill there needs a Job Object, out of scope).

Turns "stuck forever" into a bounded failure: checkpoints stay saved
locally and the user's own push proceeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 1d96e0531da7
@computermode computermode force-pushed the harden-checkpoint-git-timeout branch from 36085bb to 4dfd744 Compare May 27, 2026 20:42
computermode and others added 5 commits May 27, 2026 13:57
The group-kill regression test asserted a hardcoded 5s deadline. That
deadline must stay strictly below killWaitDelay (10s): the WaitDelay
backstop would itself force the pipe closed once it elapses, so a
deadline >= killWaitDelay would let the test pass even with group-kill
removed, silently turning it into a no-op. Derive the deadline from
killWaitDelay so the safety margin can't drift out of sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 29915cf2d150
Address Copilot review on the checkpoint cancel tests:

- Make TestNewCommand_TerminatesOnCancel and
  TestKillProcessGroupOnCancel_SetsSetpgidAndCancel hermetic by clearing
  ENTIRE_CHECKPOINT_TOKEN (and dropping t.Parallel for t.Setenv). With a
  token set, newCommand resolves "origin" via GetRemoteURL and would spawn
  a git subprocess against the ambient repo.
- Rewrite TestTerminateOnCancel_KillsProcessGroup to cancel only after a
  stdout "ready" marker confirms the backgrounded grandchild exists,
  instead of a 300ms timeout that could fire before process start and
  silently no-op the test.
- Bound cmd.Wait() in TestKillProcessGroupOnCancel_TerminatesLeader so a
  group-kill regression fails fast instead of hanging ~60s (this cmd sets
  no WaitDelay backstop).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 545d8ff9b01d
A stuck transport made the pre-push hook block ~4 minutes: each push
attempt minted its own 2-minute timeout and the post-sync retry got a
fresh one, so the deadline-killed initial push and retry ran back to back.

Replace the per-attempt timeouts in tryPushSessionsCommon and
fetchAndRebaseSessionsCommon with a single 60s budget threaded from
doPushBranch. The initial push, fetch+rebase recovery, and retry now
share one deadline, so a stuck push that consumes the budget makes the
recovery and retry fail fast instead of each waiting a full timeout. The
retry still works for fast non-fast-forward rejections.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: ca420d837db1
Soph
Soph previously approved these changes May 28, 2026
Comment thread cmd/entire/cli/strategy/push_common.go Outdated
Entire-Checkpoint: 5c5cd90917d0
computermode and others added 4 commits May 28, 2026 09:56
Collapse the multi-paragraph docstrings and inline explanations added in this
branch down to one or two dense lines each, keeping the non-obvious WHY
(transport-helper grandchildren holding the pipe; per-attempt timeouts
stacking; the deadline < killWaitDelay invariant) and dropping narration of
what the code already says.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: f2c1abb5a175
@computermode computermode merged commit f186c58 into main May 28, 2026
9 checks passed
@computermode computermode deleted the harden-checkpoint-git-timeout branch May 28, 2026 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

4 participants