Skip to content

fix(detached): use next-exit wait in executeDetached#6

Merged
enixCode merged 1 commit intomainfrom
fix/detached-next-exit-wait
May 5, 2026
Merged

fix(detached): use next-exit wait in executeDetached#6
enixCode merged 1 commit intomainfrom
fix/detached-next-exit-wait

Conversation

@enixCode
Copy link
Copy Markdown
Owner

@enixCode enixCode commented May 4, 2026

Summary

  • executeDetached was the last call-site still using await container.wait() without condition: 'next-exit'. The default 'not-running' is satisfied by the 'created' state too, so the wait could resolve immediately if armed before the daemon flipped the container to 'running' (race window after await start() returns).
  • Effect for consumers (light-run, light-process): execution.result resolved in milliseconds for a long-running detached run, downstream cancel() arrived too late, and the run reported success: true while the container was still alive.
  • Fix mirrors the existing one in attach() (commit 89f0c70): pass { condition: 'next-exit' }.

Test plan

  • npm run build (clean compile, no TS errors)
  • npm run test:unit — 30/30 pass
  • npm run test:e2e (new detached-wait test): pass in ~5.7s, sentinel wins the race and elapsed >= 4500ms for a sleep 5 detached run
  • No regression on existing detached/attach/runner suites
  • CI green on this PR

- Mirror the existing fix in attach() (commit 89f0c70). The default
  'not-running' wait condition is satisfied by the 'created' state,
  so wait() could resolve immediately when armed in the brief window
  after start() returned but before the daemon flipped the container
  to 'running'.
- Symptom: execution.result resolved in milliseconds for a long-running
  detached run, breaking downstream cancel() and reporting a false
  succeeded status before the container had actually exited.
- One-line fix: pass { condition: 'next-exit' } to container.wait().
- Add test/e2e/detached-wait.test.ts: races result against a 2s
  sentinel for a sleep 5 container, asserts the sentinel wins and
  elapsed >= 4500ms once result eventually resolves.

build with cc
@enixCode enixCode merged commit 31b389b into main May 5, 2026
3 of 4 checks passed
enixCode added a commit that referenced this pull request May 5, 2026
## Summary

PR #6 fixed the wrong race. The actual root cause is that **both**
`container.wait()` variants are unsafe for `executeDetached`:

| variant | failure mode |
|---|---|
| default `'not-running'` | matches `'created'` state, so wait armed
before daemon transitions out of `'created'` resolves immediately with
`StatusCode 0` (false success). This is what #6 attempted to fix. |
| `'next-exit'` (#6's fix) | requires the container to still be running
at call time, hangs forever for fast-exit commands or post-cancel
containers whose exit event already fired. |

Locally, **5 of 6 detached e2e tests reproduced the #6 hang**, all
timing out at 60s on `echo`-style commands and on the cancel scenario.
The CI run on this branch's predecessor burned ~6 h before GitHub
auto-cancelled it.

## Fix

Replace `container.wait()` with a polling loop on `container.inspect()`
(new `waitForDetachedExit` helper). Race-free because AutoRemove is
disabled for detached runs, so `'exited'` state stays observable until
`executeDetached`'s finally tears the container down. Detection latency
is at most one 100 ms tick.

## CI guardrail (second commit)

Add `timeout-minutes: 10` to the build/unit job and `15` to e2e.
Prevents a future hang from burning another six hours of runner time.
The default per-job ceiling is 360 min — these are well above normal
completion time but cap any pathological case.

## Test plan

- [x] `npm run test:build` clean
- [x] `npm run test:unit` 30/30 pass
- [x] `npm run test:e2e --test-timeout=30000` for `detached.test.js` +
`detached-wait.test.js` → 7/7 pass in ~12 s (including the sleep-5
regression test from #6)
- [x] Direct repro on `echo hello` detached → resolves in < 10 s with
`exitCode 0`, no hang
- [ ] CI green on this PR (the timeout-minutes cap protects against
regression)

## Note on attach()

`attach()` still uses `wait({condition: 'next-exit'})` — same
fundamental risk if the state file claims `'running'` but the container
has exited. The early-return guard `if (state.status !== 'running')`
catches the common case but the residual race exists. Out of scope for
this PR; tracking as follow-up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant