codex grandchild orphan: #456 escalation misses npm wrapper's child #487

## Symptom

When a per-user codex backend dies (timeout, /stop, /new, recycle, or any `_kill` / `shutdown` path), the inner Rust codex binary is left running as the target os_user. Reproduced on 2026-05-15 after a `Codex timed out` event:

```
$ ps | grep 'codex app-server'
45609   ??  S      0:02.59 .../codex-darwin-arm64/vendor/.../codex/codex app-server
```

PID 45608 (the `node` wrapper) is gone, but PID 45609 (the Rust binary `node` fork-exec'd) survived. The Rust binary held a session in `~daniel/.codex/sessions/...` open and continued consuming whatever resources codex retains between turns. The orphan does not exit on its own and accumulates across recycles.

## Root cause

The #456 escalation ported to `codex.py` in PR #484 walks ONE level deeper than the sudo wrapper. For codex's actual process tree:

```
sudo (kai)
  └─ node /Users/daniel/.npm-global/bin/codex app-server  (daniel)
       └─ /Users/daniel/.npm-global/lib/.../codex/codex app-server  (daniel)
```

`pgrep -P <sudo_pid>` returns the `node` PID. `_send_signal` then runs `sudo -n -u daniel /bin/kill -SIGKILL <node_pid>`. The node wrapper exits, and its child (the Rust binary) is reparented to PID 1 (launchd on macOS) and stays alive.

The claude.py escalation does not hit this case because the claude CLI does not have a node-wrapper layer — `pgrep -P <sudo_pid>` returns the actual claude PID directly. Codex's npm-global install introduces an extra hop the escalation logic was never written for.

## Proposed fix

`_lookup_inner_codex_pid` and `_async_lookup_inner_codex_pid` in `codex.py` should walk one level deeper for codex: after finding the first child (`node`), do a second `pgrep -P` against that to find the Rust binary. If the second pgrep returns nothing, fall back to the first PID (single-binary install without the node wrapper).
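A minimal sketch of the two-level lookup with fallback, under stated assumptions: `pgrep_children` and `lookup_inner_pid` are hypothetical names, not the existing `codex.py` helpers, and the `children_of` parameter exists only so the logic can be exercised without spawning real processes.

```python
import subprocess
from typing import Callable, List, Optional


def pgrep_children(pid: int) -> List[int]:
    """Return the direct children of `pid` using `pgrep -P`.

    pgrep exits non-zero when nothing matches, so check=False is used
    and an empty stdout simply yields an empty list.
    """
    result = subprocess.run(
        ["pgrep", "-P", str(pid)],
        capture_output=True,
        text=True,
        check=False,
    )
    return [int(tok) for tok in result.stdout.split() if tok.isdigit()]


def lookup_inner_pid(
    sudo_pid: int,
    children_of: Callable[[int], List[int]] = pgrep_children,
) -> Optional[int]:
    """Resolve the PID to signal for a codex backend (hypothetical helper).

    Level 1: child of sudo (the node wrapper, or the binary itself).
    Level 2: child of that child (the Rust binary under the node wrapper).
    Falls back to the level-1 PID when there is no level-2 child.
    """
    first_level = children_of(sudo_pid)
    if not first_level:
        return None  # nothing running under sudo
    wrapper_pid = first_level[0]
    second_level = children_of(wrapper_pid)
    return second_level[0] if second_level else wrapper_pid
```

With the process tree from the root-cause section, `lookup_inner_pid` would resolve to the Rust binary's PID; on a single-binary install it would return the only child of `sudo`.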

Alternative shapes (combine or pick):

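- Walk the entire descendant tree (`pgrep -d ',' -P <pid>`, repeated on each result until no children remain) and kill them all. More defensive against future codex packaging changes.
- Use `pkill -TERM -P <sudo_pid>` to signal children by parent in one call. Note that `pkill -P` matches only direct children, so reaching the Rust binary would still require applying it at each level; it is also less precise about what we're killing.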

The kill semantics also need to flip from killing the wrapper-PID to killing the leaf-PID, with a fallback kill of any intermediate wrapper PIDs to be safe.
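The full-tree alternative combined with leaf-first kill ordering could look roughly like the sketch below. The function names are hypothetical, `children_of` stands in for a `pgrep -P` call, and the commands are only built as argv lists, not executed.

```python
from typing import Callable, List, Set


def descendants_leaf_first(
    root_pid: int,
    children_of: Callable[[int], List[int]],
) -> List[int]:
    """Return all descendants of `root_pid`, deepest first (post-order).

    `root_pid` itself (the sudo wrapper) is not included.
    """
    ordered: List[int] = []
    seen: Set[int] = {root_pid}  # guards against PID reuse loops

    def visit(pid: int) -> None:
        for child in children_of(pid):
            if child in seen:
                continue
            seen.add(child)
            visit(child)           # descend before recording the child
            ordered.append(child)  # so leaves land ahead of their parents

    visit(root_pid)
    return ordered


def kill_commands(
    root_pid: int,
    os_user: str,
    children_of: Callable[[int], List[int]],
    signal_name: str = "SIGKILL",
) -> List[List[str]]:
    """Build one `sudo kill` argv per descendant, leaf first."""
    return [
        ["sudo", "-n", "-u", os_user, "/bin/kill", f"-{signal_name}", str(pid)]
        for pid in descendants_leaf_first(root_pid, children_of)
    ]
```

The tree has to be snapshotted before the first signal is sent: once the node wrapper exits, the Rust binary is reparented and can no longer be found by walking down from the sudo PID.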

The sudoers rule `(daniel) NOPASSWD: /bin/kill` already allows the bot to signal any daniel-owned process, so no install-time changes are needed.

## Tests

- `tests/test_codex.py`: add a `TestCodexGrandchildEscalation` that mocks `pgrep` to return two levels of children (node-PID then Rust-PID) and asserts that `_send_signal` issues `sudo kill` against the Rust-PID (not the node-PID).
- Regression: existing tests assume a single-level child and need updating to reflect the two-level walk.
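One possible shape for the new test class, as a sketch only: `send_kill` is a stand-in for the real escalation path (the actual `_send_signal` signature is not shown in this issue), and `fake_run_factory` is a hypothetical helper that fakes `subprocess.run` for both `pgrep` and `sudo kill`.

```python
import subprocess
import unittest
from unittest import mock


def _children(pid):
    """Stand-in child lookup: parse `pgrep -P <pid>` output."""
    out = subprocess.run(
        ["pgrep", "-P", str(pid)], capture_output=True, text=True, check=False
    ).stdout
    return [int(tok) for tok in out.split()]


def send_kill(sudo_pid, os_user):
    """Stand-in escalation: resolve the leaf PID, then `sudo kill` it."""
    first = _children(sudo_pid)
    if not first:
        return None
    second = _children(first[0])
    target = second[0] if second else first[0]
    subprocess.run(
        ["sudo", "-n", "-u", os_user, "/bin/kill", "-SIGKILL", str(target)],
        check=False,
    )
    return target


def fake_run_factory(tree):
    """Build a fake subprocess.run that answers pgrep from `tree`."""
    calls = []

    def fake_run(argv, **kwargs):
        calls.append(list(argv))
        if argv[0] == "pgrep":
            kids = tree.get(int(argv[-1]), [])
            stdout = "".join(f"{kid}\n" for kid in kids)
            return subprocess.CompletedProcess(argv, 0 if kids else 1, stdout, "")
        return subprocess.CompletedProcess(argv, 0, "", "")

    return fake_run, calls


class TestCodexGrandchildEscalation(unittest.TestCase):
    def _kill_targets(self, tree):
        fake_run, calls = fake_run_factory(tree)
        with mock.patch("subprocess.run", side_effect=fake_run):
            send_kill(100, "daniel")
        return [call[-1] for call in calls if "/bin/kill" in call]

    def test_kills_rust_pid_not_node_pid(self):
        # sudo(100) -> node(45608) -> Rust binary(45609)
        targets = self._kill_targets({100: [45608], 45608: [45609]})
        self.assertEqual(targets, ["45609"])

    def test_single_binary_falls_back_to_first_child(self):
        # sudo(100) -> codex binary(200), no node wrapper
        targets = self._kill_targets({100: [200]})
        self.assertEqual(targets, ["200"])
```

In the real suite the stand-ins would be replaced by imports from `codex.py`, and the patch target would follow however that module calls `pgrep`.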

## Acceptance

- [ ] After a `_kill` on a codex backend, `pgrep -u daniel codex` returns no results.
- [ ] `shutdown` and `restart` paths leave no daniel-owned codex processes behind.
- [ ] The fix also handles the single-binary case (defensive; future codex releases may drop the node wrapper).

## Workaround until merged

Operators with an orphaned codex Rust binary can kill it manually:

```
sudo -u <os_user> pkill -f 'codex/codex app-server'
```

This kills every codex Rust binary owned by `<os_user>` (daniel in the example above), regardless of parent. The persistent backend will spawn a fresh one on the next message.

## Related

- PR #484 (the #456 escalation port) — assumed a single-level child, correct for claude but not for codex's npm-global packaging.
- PR #485 (codex app-server protocol rewrite) — surfaced this during the live smoke test.
- Issue #486 (wizard hardening) — separate; this is a runtime-side issue, not wizard.

