Skip to content

mise/dev: kill reparented service processes on cleanup#4727

Merged
habdelra merged 4 commits intomainfrom
worktree-agent-a354f63b0e12954c6
May 8, 2026
Merged

mise/dev: kill reparented service processes on cleanup#4727
habdelra merged 4 commits intomainfrom
worktree-agent-a354f63b0e12954c6

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

@habdelra habdelra commented May 8, 2026

Summary

  • mise run dev (and pnpm start:all, which calls it) leaks reparented service children on Ctrl-C / EXIT — most visibly the worker-manager listening on 4210 — so the next dev run fails with EADDRINUSE on bind.
  • dev-all got the fix for this in mise/dev-all: kill reparented service processes on cleanup #4704 (set -m + kill_tree + a pkill -f sweep scoped to $REPO_ROOT with a regex-escaped path prefix). dev was missed even though it's the more common entry point.
  • This ports the same cleanup pattern into mise-tasks/dev, adapted for its simpler structure (no separate HOST_PID — host runs inside start-server-and-test phase 1 here, not as a separately-launched background job).

Test plan

  • Repro the bug on main: mise run dev, wait until phase 1 readiness, Ctrl-C, then lsof -i :4210 — expect a leaked worker-manager (and friends under mise-tasks/services/) still bound. Re-running mise run dev should EADDRINUSE.
  • Switch to this branch and repeat: after Ctrl-C, lsof -i :4210 should be empty within ~2s; no mise-tasks/services/* or --transpileOnly worker processes should remain (pgrep -af "mise-tasks/services/" returns nothing).
  • Re-run mise run dev immediately — should bind 4210 cleanly with no port conflict.
  • Confirm worktree scoping: from a second checkout (e.g. boxel.worktrees/...) running its own mise run dev in parallel, Ctrl-C in checkout A must not kill checkout B's services. (The regex-escaped $REPO_ROOT in the pkill patterns is what makes this safe.)

🤖 Generated with Claude Code

mise's task supervisor reparents long-running task scripts (the
`mise-tasks/services/*` scripts and the worker's `ts-node --transpileOnly
worker` child) to init, so a PPID-based `kill_tree` alone can't reach
them when the trap fires on Ctrl-C / EXIT. Result: services keep running
after `mise run dev` (or `pnpm start:all`) exits, worker-manager keeps
port 4210 bound, and the next dev run fails to start with EADDRINUSE.

`dev-all` got this fix in #4704; `dev` was missed. This ports the
matching `set -m` + `kill_tree` + `cleanup` trap (plus the regex-escaped
`pkill -f` sweep scoped to `$REPO_ROOT`) into `dev`, adapted for the
simpler structure (no separate HOST_PID — host runs inside SAT phase 1
here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested review from a team and Copilot May 8, 2026 13:35
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1287df1dd0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread mise-tasks/dev Outdated
… children on cleanup

The `--transpileOnly worker` regex only catches worker / worker-manager
invocations; the realm-server (`--transpileOnly main`), test-realms
(also `main`), prerender (`prerender/prerender-server`), and prerender-
manager (`prerender/manager-server`) ts-node children were missed.
None of the wrappers `exec` ts-node, so killing the bash wrapper alone
leaves the ts-node grandchild reparented to init with its port still
bound — the same EADDRINUSE-on-next-run failure mode this cleanup is
meant to prevent, just on 4201 / 4202 / 4221 / 4222 instead of 4210.

Broaden the second pkill pattern to `--transpileOnly (worker|main|
prerender)` so all five service entrypoints are signalled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@backspace
Copy link
Copy Markdown
Contributor

maybe we should extract parts of this so both tasks can use it? And other tasks that may be added in the future

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Ports the mise dev-all cleanup strategy into mise run dev to prevent leaked, reparented service processes (notably worker-manager on :4210) from surviving Ctrl-C/EXIT and breaking subsequent dev runs.

Changes:

  • Switch mise-tasks/dev to bash and enable job control (set -m) to improve signal handling with run-p/npm run children.
  • Add kill_tree + trap-driven cleanup that sweeps for reparented processes via $REPO_ROOT-scoped pkill -f (TERM → grace → KILL).
  • Run start-server-and-test in the background so the wrapper script can reliably trap signals and clean up before exiting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mise-tasks/dev Outdated
habdelra and others added 2 commits May 8, 2026 09:51
Workers can come up before the realm-server has bound its port (by
design in dev — the worker stack starts in parallel with realm-server
and any from-scratch-index jobs queued from a previous run get picked
up immediately on startup). Each ECONNREFUSED currently logs a full
TypeError + undici stack + nested cause, drowning the boot logs in
noise that's expected and self-resolving.

Detect ECONNREFUSED on the err or its cause and log a single line
naming the job, type, and unreachable target. All other errors keep
their existing full-stack treatment, and Sentry still captures the
exception in deployed environments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cleanup infrastructure (REPO_ROOT computation, kill_tree, the
absolute-path pkill sweep, and the explanatory comments around them)
was duplicated verbatim across `dev` and `dev-all`. Move the parts
that don't vary between tasks — `kill_tree()` and a new
`sweep_orphaned_services()` helper — into the file both tasks already
source. Each task is left with only the bits that genuinely differ
(SAT_PID handling in `dev`; HOST_PID + readiness wait in `dev-all`).

Also prepend the absolute repo and realm-server `node_modules/.bin`
paths to PATH inside `dev-common.sh`. The previous relative
`./node_modules/.bin` entry meant binaries resolved through PATH
could carry relative argv[0] in spawned children, weakening the
absolute-path anchor in the cleanup sweep. `dev-all` already had this
prepend inline; centralizing it makes the assumption explicit and
applies it to `dev` too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra
Copy link
Copy Markdown
Contributor Author

habdelra commented May 8, 2026

[Claude Code]

@backspace Done in c1c20fb. Hoisted kill_tree() and a new sweep_orphaned_services() helper (plus the explanatory comments around them) into mise-tasks/lib/dev-common.sh, which both dev and dev-all already source. Each task is left with just the parts that actually vary (SAT_PID handling in dev; HOST_PID + readiness wait in dev-all). Net is −83 / +48; future tasks that want the same orphan-sweep semantics can just source dev-common.sh and call the two functions from a trap.

Same change also addresses Copilot's adjacent point on the relative-PATH risk: moved the absolute repo + realm-server node_modules/.bin prepend into dev-common.sh (it was previously inline only in dev-all), so spawned children carry absolute argv[0] regardless of which task is running. Replied separately on that thread.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Host Test Results

    1 files      1 suites   1h 47m 48s ⏱️
2 643 tests 2 628 ✅ 15 💤 0 ❌
2 662 runs  2 647 ✅ 15 💤 0 ❌

Results for commit c1c20fb.

Realm Server Test Results

    1 files  ±    0      1 suites  +1   15m 57s ⏱️ + 15m 57s
1 297 tests +1 297  1 297 ✅ +1 297  0 💤 ±0  0 ❌ ±0 
1 376 runs  +1 376  1 376 ✅ +1 376  0 💤 ±0  0 ❌ ±0 

Results for commit c1c20fb. ± Comparison against earlier commit d3e3526.

Copy link
Copy Markdown
Contributor

@backspace backspace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t realise dev had this problem too but I confirmed that on a separate branch, then ran this in environment mode and confirmed that the Node process didn’t linger after Ctrl-C. Port conflicts weren’t happening for me because of environment mode, but runaway processes were another sign.

@habdelra habdelra merged commit ac4e4fd into main May 8, 2026
97 of 98 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants