Graceful Ctrl-C shutdown for mise dev / dev-all#4855

Merged: habdelra merged 6 commits into main from graceful-dev-stack-shutdown on May 18, 2026

@habdelra habdelra commented May 18, 2026

Summary

Ctrl-C through the mise run dev / mise run dev-all stack now lands as a clean WIFEXITED(0) across mise, pnpm, run-p, and every service, with no leaked file watchers, no orphaned vite/worker processes, and no spurious error lines from intermediate wrapper layers.

Orchestrators (mise tasks)

  • mise-tasks/dev and mise-tasks/dev-all: signal trap does a fast_shutdown_kick — SIGTERM every recorded pgroup and return immediately. The slow TERM→KILL escalation + sweep_orphaned_services runs in the background under the cleanup guardian after this bash exits, so mise records WIFEXITED(0) instead of WIFSIGNALED when its per-task grace runs out mid-cleanup.
  • The trailing wait "$SAT_PID"'s 128+signal return is normalized to 0. With set -m, Ctrl-C is delivered to SAT's pgroup — not this bash — so the INT/TERM/HUP trap doesn't fire; the EXIT trap still kicks the guardian and the script exits 0.
  • Per-service mise-tasks/services/* scripts trap INT/TERM and exit 0 so the 143-on-Ctrl-C from the ts-node … | dev-log-tee.sh pipeline isn't reported by mise / run-p as a task failure. The two services with existing icon-server EXIT cleanup (realm-server, test-realms) preserve that path; INT/TERM additionally call cleanup once and exit 0.
  • mise-tasks/lib/dev-common.sh sweep regexes:
    • VITE_SERVE_RE drops the ${REPO_ROOT_RE}/… anchor — pnpm invokes the wrapper as node scripts/vite-serve.js (relative argv), so the absolute pattern never matched and the sweep silently skipped it.
    • VITE_BIN_RE drops the trailing --port 4200 — in local-HTTPS dev mode the wrapper puts vite on a dynamic internal port, so pinning to 4200 missed every vite process.
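
The anchoring problem behind the VITE_SERVE_RE change can be reproduced in isolation. The sketch below uses JavaScript rather than the shell the sweep is written in, and both patterns, the repo root, and the argv strings are hypothetical stand-ins, not the actual values from dev-common.sh:

```javascript
// Hypothetical repo root; the real sweep derives REPO_ROOT_RE in dev-common.sh.
const REPO_ROOT = '/home/dev/boxel';

// Escape regex metacharacters so a path can be embedded in a pattern.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Anchored pattern (illustrative): only matches an absolute script path.
const anchored = new RegExp(
  `node ${escapeRegExp(REPO_ROOT)}/packages/host/scripts/vite-serve\\.js`
);

// Relaxed pattern (illustrative): any path prefix, or none at all.
const relaxed = /node (\S*\/)?scripts\/vite-serve\.js/;

// pnpm runs the package script with a relative argv.
const relativeArgv = 'node scripts/vite-serve.js';
const absoluteArgv = `node ${REPO_ROOT}/packages/host/scripts/vite-serve.js`;

console.log(anchored.test(relativeArgv)); // false: the sweep would skip it
console.log(relaxed.test(relativeArgv));  // true
console.log(relaxed.test(absoluteArgv));  // true
```

A relaxed pattern trades precision for coverage, so it could also match an unrelated process with the same script name; whether that matters depends on what else runs on the machine.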

Realm-server

  • packages/realm-server/main.ts wires SIGTERM/SIGINT to the same stopRealmServer path used by the IPC stop message. Previously the process had no signal handler, so process-group sweeps from mise dev had to escalate to SIGKILL.
  • stopRealmServer iterates the mounted realms and calls realm.unsubscribe() on each, closing the underlying sane → fs.watch FSWatcher handles. Without this, each realm pinned a watcher and the process couldn't exit naturally; wtfnode dumps showed hundreds of FSWatcher handles after the shutdown signal.
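
The two realm-server changes fit together roughly as in the sketch below. It is a minimal illustration, not the code in main.ts: makeStopper, wireShutdown, the closeServer callback, and the 'stop' message shape are all hypothetical names chosen for the example. Only realm.unsubscribe() and the SIGTERM/SIGINT/IPC triggers come from the description above.

```javascript
// Build a stop function that runs at most once, however it is triggered.
// `realms` is any iterable of objects exposing unsubscribe() (assumed shape).
function makeStopper(realms, closeServer) {
  let stopping = false;
  return async function stop() {
    if (stopping) return false; // a second signal or message is a no-op
    stopping = true;
    for (const realm of realms) {
      realm.unsubscribe(); // releases the realm's file watcher
    }
    await closeServer();
    return true;
  };
}

// Route signals and the IPC stop message to the same stop function.
// `proc` is normally the global `process`; it is a parameter so the
// wiring can be exercised with a plain EventEmitter.
function wireShutdown(proc, stop) {
  for (const signal of ['SIGTERM', 'SIGINT']) {
    proc.on(signal, () => {
      stop();
    });
  }
  proc.on('message', (message) => {
    if (message === 'stop') stop(); // message shape is an assumption
  });
}
```

Funnelling every trigger through one guarded function means a SIGTERM that arrives while an IPC stop is already in progress does not unsubscribe the realms twice.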

Host vite wrapper

packages/host/scripts/vite-with-traefik.js:

  • Spawns vite with stdio: ['ignore', 'inherit', 'inherit'] so process.stdin.isTTY is false inside vite, suppressing the bindCLIShortcuts readline that produced the read EIO stack trace when the parent TTY tore down.
  • Spawns without shell: true so child is the npx process, not an intermediate sh -c that would absorb our forwarded signal alone.
  • Forwards SIGTERM/SIGINT/SIGHUP to vite, then exits 0 in the same tick. Waiting for the child to acknowledge made the orchestrator's ~2s SIGKILL grace expire mid-wait, surfacing as [ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL] from pnpm.
  • Non-INT/TERM signal exits from the child translate to 128 + signum instead of silently exiting 0, so SIGKILL/SIGSEGV/SIGABRT on vite aren't masked as a clean shutdown.
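
The wrapper behaviour listed above can be sketched as follows. This is an illustration under assumptions, not the contents of vite-with-traefik.js: exitCodeFor and launch are hypothetical names, and the fallback of 1 for an unrecognized signal name is a choice made for the example.

```javascript
const { spawn } = require('node:child_process');
const os = require('node:os');

// Signals that count as a requested, clean shutdown of the child.
const CLEAN_SIGNALS = new Set(['SIGINT', 'SIGTERM']);

// Map a child's (code, signal) exit pair to the wrapper's exit code.
function exitCodeFor(code, signal) {
  if (signal == null) return code ?? 0;      // normal exit: pass the code through
  if (CLEAN_SIGNALS.has(signal)) return 0;   // requested shutdown
  const signum = os.constants.signals[signal];
  return signum === undefined ? 1 : 128 + signum; // crash: shell-style 128 + signum
}

// Start the child directly (no shell) with stdin detached, forward
// shutdown signals, and leave without waiting for the child to finish.
function launch(command, args) {
  const child = spawn(command, args, {
    stdio: ['ignore', 'inherit', 'inherit'], // child sees a non-TTY stdin
  });
  for (const signal of ['SIGTERM', 'SIGINT', 'SIGHUP']) {
    process.on(signal, () => {
      child.kill(signal);
      process.exit(0); // same tick: do not wait for the child
    });
  }
  child.on('exit', (code, signal) => process.exit(exitCodeFor(code, signal)));
  return child;
}
```

Keeping the code mapping in a pure function makes the crash cases easy to check without spawning anything, for example that a SIGKILL'd child yields 128 plus the platform's SIGKILL number rather than 0.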

packages/host/scripts/vite-serve.js runs ensure-boxel-ui inline via execFileSync, and packages/host/package.json start collapses to a single node scripts/vite-serve.js. The previous pnpm ensure-boxel-ui && node scripts/vite-serve.js chain forced pnpm to invoke the script through sh -c, and the shell — having no SIGTERM handler — died via signal on Ctrl-C even though Node exited 0, surfacing as [ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL] and Command failed with signal "SIGTERM". Removing the && keeps Node as pnpm's direct child.
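
Running a prerequisite step inline, rather than chaining it with && in the package script, might look like the sketch below. The runPrerequisite helper is a hypothetical name, and the actual command vite-serve.js passes to execFileSync for ensure-boxel-ui is not shown in this PR description, so the usage note uses a placeholder.

```javascript
const { execFileSync } = require('node:child_process');

// Run a prerequisite command to completion before continuing.
// execFileSync does not go through a shell, and it throws if the
// command exits non-zero, so a failed prerequisite still stops startup.
function runPrerequisite(command, args) {
  execFileSync(command, args, { stdio: 'inherit' });
}
```

A caller would invoke something like runPrerequisite(someCommand, someArgs) at the top of the script and then start the server, where someCommand and someArgs are placeholders for whatever ensure-boxel-ui actually requires.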

Observability for future hang investigations

New BOXEL_WTFNODE=1 opt-in helper (packages/realm-server/lib/wtfnode-on-signal.ts and packages/host/scripts/wtfnode-on-signal.js) dumps the active handles on SIGINT/SIGTERM and again 5 seconds later. Wired into every node entry point (realm-server main, worker-manager, worker children, prerender-server, prerender manager-server, vite wrapper) so future shutdown-hang investigations have evidence without ad-hoc edits. Disabled by default; the runtime cost is one process.on listener.
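
An opt-in helper of this kind could be structured as in the sketch below. It is not the code in wtfnode-on-signal.ts or wtfnode-on-signal.js: installHandleDump and its options are hypothetical, the default dump assumes the wtfnode package is installed and uses its dump() function, and unref'ing the timer is a choice made here so the delayed dump does not itself keep the process alive.

```javascript
// Install handle dumps on SIGINT/SIGTERM when BOXEL_WTFNODE=1.
// Returns true if listeners were installed, false if the helper is disabled.
function installHandleDump(options = {}) {
  const {
    env = process.env,
    proc = process,
    delayMs = 5000,
    // Loaded lazily so the dependency is only needed when a dump happens.
    dump = () => require('wtfnode').dump(),
  } = options;

  if (env.BOXEL_WTFNODE !== '1') return false; // disabled by default

  for (const signal of ['SIGINT', 'SIGTERM']) {
    proc.on(signal, () => {
      dump(); // handles still open at the moment the signal arrives
      const timer = setTimeout(dump, delayMs); // handles still open later
      timer.unref(); // assumption: the dump timer should not block exit
    });
  }
  return true;
}
```

Note that adding a SIGINT or SIGTERM listener to the real process replaces Node's default terminate-on-signal behaviour, so a helper like this only makes sense in an entry point that also has its own shutdown handler.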

Real failures still surface

Only INT/TERM/HUP are translated to exit 0. Any non-signal child failure (ts-node faulting on its own, an indexer crash, vite OOM, etc.) still flows through pipefail and surfaces as a real error.

Test plan

  • mise run dev, wait for the stack to come up, hit a card preview URL to kick off indexing, then Ctrl-C. Expect:
    • None of these strings appear in the output: exited with 143, ERROR task failed, read EIO, [ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL], Command failed with signal "SIGTERM", no exit status.
    • Shell prompt returns in well under 5 seconds; background cleanup continues silently after.
  • mise run dev-all — same expectations.
  • After shutdown, ports 4200/4201/4202/4210/4211/4221/4222 should be free (lsof -i:4200,4201,4202,4210,4211,4221,4222).
  • With BOXEL_WTFNODE=1 mise run dev, Ctrl-C and confirm the 5s-later dump shows no FSWatcher / unexpected timer handles for realm-server, worker-manager, etc.
  • Sanity: induce a fake crash (e.g. kill -KILL a ts-node child mid-run) and confirm the orchestrator still surfaces it as a real failure rather than swallowing.

🤖 Generated with Claude Code

Treat signal-driven shutdown (SIGINT/SIGTERM) as a normal exit across the
dev orchestrators and per-service mise tasks so Ctrl-C no longer prints
exit 143 / "task failed" / ELIFECYCLE noise. Also detach vite's stdin
from the parent TTY in vite-with-traefik.js so vite's readline-based
shortcut handler can't emit the "read EIO" stack trace as the terminal
tears down.

Real crashes still propagate — only INT/TERM are translated to exit 0,
and existing EXIT-trap cleanup is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 91a24422bc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/host/scripts/vite-with-traefik.js

github-actions Bot commented May 18, 2026

Preview deployments

Host Test Results

    1 files  ±0      1 suites  ±0   1h 44m 58s ⏱️ - 1m 12s
2 661 tests +2  2 646 ✅ +2  15 💤 ±0  0 ❌ ±0 
2 680 runs  +2  2 665 ✅ +2  15 💤 ±0  0 ❌ ±0 

Results for commit e324125. ± Comparison against earlier commit d9f833e.

Realm Server Test Results

    1 files  ±0      1 suites  ±0   8m 13s ⏱️ -33s
1 386 tests ±0  1 386 ✅ +1  0 💤 ±0  0 ❌  - 1 
1 467 runs  ±0  1 467 ✅ +1  0 💤 ±0  0 ❌  - 1 

Results for commit e324125. ± Comparison against earlier commit d9f833e.

With `shell: true`, the spawned `child` is the intermediate `sh -c`
process, and `child.kill(signal)` only signals the shell — leaving the
vite grandchild orphaned and still bound to port 4200 if a parent
process manager signals just this wrapper instead of sweeping the whole
process group. Spawning npx directly makes `child` the npx process,
which forwards signals to its vite child.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR improves the developer experience when stopping mise run dev / mise run dev-all (and related service tasks) by making Ctrl-C / signal-driven shutdowns clean and non-erroring, while still allowing genuine failures to surface.

Changes:

  • Add SIGINT/SIGTERM traps to service-level mise tasks so signal-driven exits don’t show up as task failures (143/130) in mise/run-p.
  • Update mise-tasks/dev and mise-tasks/dev-all to run cleanup on INT/TERM/HUP and then exit 0, avoiding propagation of signal exit codes from the underlying wait.
  • Adjust the host’s Vite launcher to ignore stdin (avoids Vite readline read EIO noise) and forward shutdown signals to the Vite child, translating signal exits to 0 to prevent pnpm recursive-run failures during shutdown.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

File | Description
--- | ---
packages/host/scripts/vite-with-traefik.js | Detaches stdin for Vite and forwards shutdown signals; maps signal exits to 0 for clean pnpm start shutdown.
mise-tasks/services/worker-test | Trap INT/TERM to exit 0 to avoid 143/130 shutdown noise.
mise-tasks/services/worker-base | Trap INT/TERM to exit 0 to avoid 143/130 shutdown noise.
mise-tasks/services/worker | Trap INT/TERM to exit 0 for pipefail pipeline shutdown behavior under the orchestrators.
mise-tasks/services/test-realms | Split EXIT cleanup from INT/TERM handling to ensure cleanup runs once while exiting 0 on shutdown signals.
mise-tasks/services/realm-server-base | Trap INT/TERM to exit 0 to avoid 143/130 shutdown noise.
mise-tasks/services/realm-server | Split EXIT cleanup from INT/TERM handling to ensure cleanup runs once while exiting 0 on shutdown signals.
mise-tasks/services/prerender-mgr | Trap INT/TERM to exit 0 to avoid 143/130 shutdown noise.
mise-tasks/services/prerender | Trap INT/TERM to exit 0 to avoid 143/130 shutdown noise.
mise-tasks/dev-all | Split EXIT cleanup from INT/TERM/HUP handling; cleanup + exit 0 on signal-driven shutdown.
mise-tasks/dev | Split EXIT cleanup from INT/TERM/HUP handling; cleanup + exit 0 on signal-driven shutdown.

Comment thread packages/host/scripts/vite-with-traefik.js
habdelra and others added 4 commits May 18, 2026 09:26
`child.on('exit')` previously fell through to `process.exit(code || 0)`
for any signal that wasn't SIGINT/SIGTERM. Since `code` is null when a
process exits via signal, that masked SIGKILL/SIGSEGV/SIGABRT crashes
as clean shutdowns. Translate those into 128+signum so the orchestrator
sees the crash instead of treating it as success.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Several reinforcing fixes so `mise run dev` / `dev-all` Ctrl-C completes
quickly with clean exit codes and without leaking file watchers:

- realm-server (`packages/realm-server/main.ts`): wire SIGTERM/SIGINT to
  the same shutdown path as IPC `stop`, and iterate the mounted realms
  calling `realm.unsubscribe()` so each NodeAdapter's sane watcher (and
  the underlying FSWatcher handles) actually closes. Without this the
  process pinned hundreds of FSWatchers until the orchestrator SIGKILL'd
  it.
- mise-tasks/dev{,-all}: replace the synchronous cleanup-then-exit
  shutdown handler with a fire-and-forget `fast_shutdown_kick` that
  SIGTERMs every recorded pgroup and returns. The cleanup guardian
  spawned at script start polls the bash PID and finishes the KILL
  escalation + `sweep_orphaned_services` after we exit, so mise sees
  WIFEXITED(0) instead of WIFSIGNAL'ing us mid-cleanup. Also normalize
  `wait $SAT_PID`'s 128+signal return to 0 — under `set -m` Ctrl-C is
  delivered to SAT's pgroup, not this bash, so the INT/TERM/HUP trap
  never fires and the script would otherwise fall through `exit $?` with
  a signal-induced code.
- mise-tasks/lib/dev-common.sh sweep regexes:
  - `VITE_SERVE_RE`: drop the absolute-path anchor — pnpm invokes the
    wrapper as `node scripts/vite-serve.js` (relative argv), so the old
    `${REPO_ROOT_RE}/...` pattern never matched and the sweep silently
    skipped the wrapper.
  - `VITE_BIN_RE`: drop the trailing `--port 4200`. In local-HTTPS dev
    mode the wrapper puts vite on a dynamic internal port and the
    dispatcher owns 4200, so the port-pinned pattern missed every vite
    process in that mode.
- packages/host/scripts/vite-serve.js: inline `ensure-boxel-ui` via
  `execFileSync` so the `start` package script can be a single `node …`
  command instead of `pnpm ensure-boxel-ui && node …`. With `&&`, pnpm
  ran the script through `sh -c`, and the shell — having no SIGTERM
  handler — died via signal on Ctrl-C even though Node exited 0,
  surfacing as `[ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL]` and `Command failed
  with signal "SIGTERM"`. Removing the `&&` keeps Node as pnpm's direct
  child.
- packages/host/scripts/vite-with-traefik.js: exit the wrapper with code
  0 immediately on SIGTERM/SIGINT/SIGHUP rather than waiting for the
  child to acknowledge. The dev orchestrator gives the process group
  ~2s of grace before SIGKILL'ing stragglers, so waiting longer than
  that for the child gets us SIGKILL'd mid-wait and pnpm reports
  `Command failed with signal "SIGTERM"`. The orchestrator's
  `sweep_orphaned_services` is the safety net for the abandoned vite
  grandchild.
- wtfnode handle dumps: add an opt-in `BOXEL_WTFNODE=1` helper in both
  packages (`packages/realm-server/lib/wtfnode-on-signal.ts`,
  `packages/host/scripts/wtfnode-on-signal.js`) and wire it into every
  node entry point (realm-server main, worker-manager, worker children,
  prerender-server, prerender manager-server, vite wrapper). Dumps the
  active handles on SIGINT/SIGTERM and again 5s later, so future
  shutdown-hang investigations have evidence without ad-hoc edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous `pnpm ensure-boxel-ui && node scripts/vite-serve.js` ran
through `sh -c`, and the shell layer — having no SIGTERM handler — died
via signal on Ctrl-C even though Node exited 0. pnpm reported
`[ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL]` and `Command failed with signal
"SIGTERM"`. vite-serve.js now invokes ensure-boxel-ui inline via
execFileSync, so the start script collapses to a single `node …` and
Node is pnpm's direct child.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested a review from a team May 18, 2026 16:05
@habdelra habdelra merged commit 7aba0d7 into main May 18, 2026
79 of 80 checks passed
3 participants