mise/dev: kill reparented service processes on cleanup #4727
mise's task supervisor reparents long-running task scripts (the `mise-tasks/services/*` scripts and the worker's `ts-node --transpileOnly worker` child) to init, so a PPID-based `kill_tree` alone can't reach them when the trap fires on Ctrl-C / EXIT. Result: services keep running after `mise run dev` (or `pnpm start:all`) exits, worker-manager keeps port 4210 bound, and the next dev run fails to start with EADDRINUSE. `dev-all` got this fix in #4704; `dev` was missed. This ports the matching `set -m` + `kill_tree` + `cleanup` trap (plus the regex-escaped `pkill -f` sweep scoped to `$REPO_ROOT`) into `dev`, adapted for the simpler structure (no separate HOST_PID — host runs inside SAT phase 1 here). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
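As a rough illustration of the mechanism described above, here is a minimal sketch of a PPID-walking `kill_tree` plus a `cleanup` trap. Only the names `kill_tree`, `cleanup`, and `set -m` come from the commit message; the function bodies, `CHILD_PID`, and `run_supervised` are assumptions made for this example, not the code in `mise-tasks/dev`.

```shell
#!/usr/bin/env bash
# Sketch only: names kill_tree/cleanup come from the PR text; bodies are assumptions.
set -m  # job control: each background job gets its own process group

# Signal a process and every descendant still linked to it by PPID.
# Anything already reparented to init is invisible to this walk, which
# is why the PR pairs it with a pkill -f sweep.
kill_tree() {
  local pid=$1 sig=${2:-TERM} child
  for child in $(pgrep -P "$pid" 2>/dev/null); do
    kill_tree "$child" "$sig"
  done
  kill -s "$sig" "$pid" 2>/dev/null || true
}

# Hypothetical trap body: CHILD_PID stands in for whatever the task backgrounds.
cleanup() {
  trap - INT TERM EXIT   # do not re-enter on the EXIT that follows a signal
  if [ -n "${CHILD_PID:-}" ]; then
    kill_tree "$CHILD_PID" TERM
  fi
  # A pkill -f sweep for reparented processes would follow here.
  return 0
}

# Hypothetical wrapper: run a command in the background so the shell can
# handle signals while it waits.
run_supervised() {
  trap cleanup INT TERM EXIT
  "$@" &
  CHILD_PID=$!
  wait "$CHILD_PID"
}
```

Running the command in the background and then calling `wait` matters here: bash defers trap handlers while a foreground child is running, but `wait` returns as soon as a trapped signal arrives.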
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1287df1dd0
… children on cleanup The `--transpileOnly worker` regex only catches worker / worker-manager invocations; the realm-server (`--transpileOnly main`), test-realms (also `main`), prerender (`prerender/prerender-server`), and prerender-manager (`prerender/manager-server`) ts-node children were missed. None of the wrappers `exec` ts-node, so killing the bash wrapper alone leaves the ts-node grandchild reparented to init with its port still bound. That is the same EADDRINUSE-on-next-run failure mode this cleanup is meant to prevent, just on 4201 / 4202 / 4221 / 4222 instead of 4210. Broaden the second pkill pattern to `--transpileOnly (worker|main|prerender)` so all five service entrypoints are signalled. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
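To see what the broadened pattern changes, the sketch below runs both regexes over sample command lines with `grep -E`, which takes extended regular expressions just as `pkill -f` does. The command lines are made-up stand-ins built from the entrypoint names in the commit message. The real argv of each service may differ, and the real patterns are described as also being scoped to `$REPO_ROOT`.

```shell
#!/usr/bin/env bash
# Illustrative command lines only; real service argv may differ.
samples='ts-node --transpileOnly worker-manager
ts-node --transpileOnly worker
ts-node --transpileOnly main
ts-node --transpileOnly prerender/prerender-server
ts-node --transpileOnly prerender/manager-server
ts-node --transpileOnly unrelated-script'

narrow='--transpileOnly worker'
broad='--transpileOnly (worker|main|prerender)'

# Count sample lines matching an extended regex.
# -e keeps a pattern that starts with "--" from being parsed as an option.
count_matches() {
  printf '%s\n' "$samples" | grep -Ec -e "$1" || true
}

echo "narrow pattern matches: $(count_matches "$narrow")"   # 2
echo "broad pattern matches:  $(count_matches "$broad")"    # 5
```

The same leading-dashes caveat applies to `pkill`: a pattern that begins with `--` needs an option terminator or a path prefix in front of it.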
Maybe we should extract parts of this so both tasks can use it? And other tasks that may be added in the future.
Pull request overview
Ports the mise dev-all cleanup strategy into mise run dev to prevent leaked, reparented service processes (notably worker-manager on :4210) from surviving Ctrl-C/EXIT and breaking subsequent dev runs.
Changes:
- Switch `mise-tasks/dev` to bash and enable job control (`set -m`) to improve signal handling with `run-p`/`npm run` children.
- Add `kill_tree` + `trap`-driven cleanup that sweeps for reparented processes via `$REPO_ROOT`-scoped `pkill -f` (TERM → grace → KILL).
- Run `start-server-and-test` in the background so the wrapper script can reliably trap signals and clean up before exiting.
Workers can come up before the realm-server has bound its port (by design in dev — the worker stack starts in parallel with realm-server and any from-scratch-index jobs queued from a previous run get picked up immediately on startup). Each ECONNREFUSED currently logs a full TypeError + undici stack + nested cause, drowning the boot logs in noise that's expected and self-resolving. Detect ECONNREFUSED on the err or its cause and log a single line naming the job, type, and unreachable target. All other errors keep their existing full-stack treatment, and Sentry still captures the exception in deployed environments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cleanup infrastructure (REPO_ROOT computation, kill_tree, the absolute-path pkill sweep, and the explanatory comments around them) was duplicated verbatim across `dev` and `dev-all`. Move the parts that don't vary between tasks — `kill_tree()` and a new `sweep_orphaned_services()` helper — into the file both tasks already source. Each task is left with only the bits that genuinely differ (SAT_PID handling in `dev`; HOST_PID + readiness wait in `dev-all`). Also prepend the absolute repo and realm-server `node_modules/.bin` paths to PATH inside `dev-common.sh`. The previous relative `./node_modules/.bin` entry meant binaries resolved through PATH could carry relative argv[0] in spawned children, weakening the absolute-path anchor in the cleanup sweep. `dev-all` already had this prepend inline; centralizing it makes the assumption explicit and applies it to `dev` too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
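A possible shape for the shared pieces, as a sketch under stated assumptions: `sweep_orphaned_services` and the PATH prepend are named in the commit message, but the body below, the `SWEEP_GRACE_SECONDS` knob, the `REALM_SERVER_DIR` variable and its default location, and the way `REPO_ROOT` is derived are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical shape of the shared helper file; the real dev-common.sh may differ.

# Assumption: the real file derives REPO_ROOT differently.
REPO_ROOT=${REPO_ROOT:-$(pwd -P)}
# Assumption: the realm-server package location is a guess for illustration.
REALM_SERVER_DIR=${REALM_SERVER_DIR:-$REPO_ROOT/packages/realm-server}

# Put absolute bin directories first so binaries resolved through PATH
# carry an absolute argv[0] that the sweep patterns can anchor on.
PATH="$REPO_ROOT/node_modules/.bin:$REALM_SERVER_DIR/node_modules/.bin:$PATH"
export PATH

# TERM every process whose command line matches one of the given
# pkill -f patterns, wait a grace period, then KILL the survivors.
sweep_orphaned_services() {
  local grace=${SWEEP_GRACE_SECONDS:-2} pattern
  for pattern in "$@"; do
    pkill -TERM -f -- "$pattern" 2>/dev/null || true
  done
  sleep "$grace"
  for pattern in "$@"; do
    pkill -KILL -f -- "$pattern" 2>/dev/null || true
  done
}
```

Taking the patterns as arguments keeps the helper free of task-specific knowledge, so each task can pass only the patterns it owns.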
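[Claude Code] @backspace Done in c1c20fb. Hoisted `kill_tree()` and `sweep_orphaned_services()` into `dev-common.sh`. The same change also addresses Copilot's adjacent point on the relative-PATH risk: moved the absolute repo + realm-server `node_modules/.bin` PATH prepend into `dev-common.sh`.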
Host Test Results: 1 files, 1 suites, 1h 47m 48s ⏱️. Results for commit c1c20fb.
Realm Server Test Results: 1 files ±0, 1 suites +1, 15m 57s ⏱️ +15m 57s. Results for commit c1c20fb. ± Comparison against earlier commit d3e3526.
backspace left a comment
I didn’t realise dev had this problem too but I confirmed that on a separate branch, then ran this in environment mode and confirmed that the Node process didn’t linger after Ctrl-C. Port conflicts weren’t happening for me because of environment mode, but runaway processes were another sign.
Summary
- `mise run dev` (and `pnpm start:all`, which calls it) leaks reparented service children on Ctrl-C / EXIT, most visibly the worker-manager listening on 4210, so the next dev run fails with EADDRINUSE on bind.
- `dev-all` got the fix for this in #4704 ("mise/dev-all: kill reparented service processes on cleanup"): `set -m` + `kill_tree` + a `pkill -f` sweep scoped to `$REPO_ROOT` with a regex-escaped path prefix. `dev` was missed even though it's the more common entry point.
- This ports the same fix into `mise-tasks/dev`, adapted for its simpler structure (no separate `HOST_PID`: host runs inside `start-server-and-test` phase 1 here, not as a separately-launched background job).

Test plan
- On `main`: `mise run dev`, wait until phase 1 readiness, Ctrl-C, then `lsof -i :4210`. Expect a leaked `worker-manager` (and friends under `mise-tasks/services/`) still bound. Re-running `mise run dev` should EADDRINUSE.
- On this branch: `lsof -i :4210` should be empty within ~2s; no `mise-tasks/services/*` or `--transpileOnly worker` processes should remain (`pgrep -af "mise-tasks/services/"` returns nothing).
- Re-run `mise run dev` immediately: should bind 4210 cleanly with no port conflict.
- With a second checkout (e.g. `boxel.worktrees/...`) running its own `mise run dev` in parallel, Ctrl-C in checkout A must not kill checkout B's services. (The regex-escaped `$REPO_ROOT` in the `pkill` patterns is what makes this safe.)

🤖 Generated with Claude Code
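The last test-plan item depends on the repo path being matched literally. The sketch below shows why escaping matters, using `grep -E` as a stand-in for `pkill -f` matching. The checkout paths, the `services/worker` script name, and the `escape_regex` helper are hypothetical and chosen so that an unescaped `.` would cross-match a sibling directory.

```shell
#!/usr/bin/env bash
# Escape ERE metacharacters so a path matches literally inside a pattern.
escape_regex() {
  printf '%s' "$1" | sed 's/[][\.^$*+?(){}|]/\\&/g'
}

# Hypothetical checkouts: the two paths differ only where A has a literal dot.
root_a='/src/boxel.worktrees/fix'
cmd_a='bash /src/boxel.worktrees/fix/mise-tasks/services/worker'
cmd_b='bash /src/boxel-worktrees/fix/mise-tasks/services/worker'

# Succeeds when command line $2 matches extended regex $1.
matches() {
  printf '%s\n' "$2" | grep -Eq -e "$1"
}

unescaped="$root_a/mise-tasks/services/"
escaped="$(escape_regex "$root_a")/mise-tasks/services/"

matches "$unescaped" "$cmd_b" && echo "unescaped pattern also hits checkout B"
matches "$escaped" "$cmd_a" && echo "escaped pattern hits checkout A"
matches "$escaped" "$cmd_b" || echo "escaped pattern leaves checkout B alone"
```

All three messages print: the unescaped `.` behaves as a wildcard and matches the `-` in checkout B's path, while the escaped form matches only checkout A.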