codex grandchild orphan: #456 escalation misses npm wrapper's child #487

@dcellison

Symptom

When a per-user codex backend dies (timeout, /stop, /new, recycle, or any _kill / shutdown path), the inner Rust codex binary is left running as the target os_user. Reproduced on 2026-05-15 after a "Codex timed out" event:

$ ps ax | grep 'codex app-server'
45609   ??  S      0:02.59 .../codex-darwin-arm64/vendor/.../codex/codex app-server

PID 45608 (the node wrapper) is gone, but PID 45609 (the Rust binary that node fork-exec'd) survived. The Rust binary kept a session under ~daniel/.codex/sessions/... open and continued consuming whatever resources codex retains between turns. The orphan does not exit on its own, so orphans accumulate across recycles.

Root cause

The #456 escalation, ported to codex.py in PR #484, walks only ONE level below the sudo wrapper. Codex's actual process tree is two levels deep:

sudo (kai)
  └─ node /Users/daniel/.npm-global/bin/codex app-server  (daniel)
       └─ /Users/daniel/.npm-global/lib/.../codex/codex app-server  (daniel)

pgrep -P <sudo_pid> returns the node PID. _send_signal then runs sudo -n -u daniel /bin/kill -SIGKILL <node_pid>. The node wrapper exits, and its child (the Rust binary) reparents to PID 1 (launchd on macOS) and stays alive.

The claude.py escalation does not hit this case because the claude CLI has no node-wrapper layer, so pgrep -P <sudo_pid> returns the actual claude PID directly. Codex's npm-global install introduces an extra hop that the escalation logic was never written for.

Proposed fix

_lookup_inner_codex_pid and _async_lookup_inner_codex_pid in codex.py should walk one level deeper for codex: after finding the first child (node), do a second pgrep -P against that to find the Rust binary. If the second pgrep returns nothing, fall back to the first PID (single-binary install without the node wrapper).
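
The two-level walk with fallback could look roughly like the sketch below. This is an illustration only: `pgrep_children`, `lookup_inner_codex_pid`, and the injectable `children_of` parameter are hypothetical names, and the real `_lookup_inner_codex_pid` in codex.py may have a different signature and error handling.

```python
import subprocess
from typing import Callable, List, Optional


def pgrep_children(parent_pid: int) -> List[int]:
    """Return the direct children of parent_pid via `pgrep -P`."""
    result = subprocess.run(
        ["pgrep", "-P", str(parent_pid)],
        capture_output=True,
        text=True,
        check=False,  # pgrep exits 1 when nothing matches; that is not an error here
    )
    return [int(tok) for tok in result.stdout.split() if tok.isdigit()]


def lookup_inner_codex_pid(
    sudo_pid: int,
    children_of: Callable[[int], List[int]] = pgrep_children,
) -> Optional[int]:
    """Find the PID to signal: the grandchild if present, else the child."""
    first_level = children_of(sudo_pid)
    if not first_level:
        return None  # the sudo wrapper has no child; nothing to signal
    wrapper_pid = first_level[0]  # node wrapper in the npm-global install
    second_level = children_of(wrapper_pid)
    # Fall back to the first PID for a single-binary install (no node wrapper).
    return second_level[0] if second_level else wrapper_pid
```

Passing `children_of` as a parameter is a design choice made for this sketch so the walk can be exercised without real processes; the same effect can be had by mocking `pgrep` in tests.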

Alternative shapes (combine or pick):

  • Walk the entire descendant tree (pgrep -d ',' -P <pid>, repeated on each result until no children remain) and kill them all. More defensive against future codex packaging changes.
  • Use pkill -TERM -P <pid> to signal children by parent instead of by individual PID. Note that -P matches direct children only, so it has to be applied at each level to reach the Rust binary. Less precise on what we're killing.
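
The full-tree alternative can be sketched as a breadth-first walk that returns descendants deepest first, so leaves can be signalled before their wrappers. The function name `collect_descendants` and the `children_of` callback are assumptions made for illustration; in practice the callback would wrap `pgrep -P <pid>`.

```python
from typing import Callable, List


def collect_descendants(
    root_pid: int,
    children_of: Callable[[int], List[int]],
) -> List[int]:
    """Return every descendant of root_pid, deepest level first."""
    ordered: List[int] = []
    seen = {root_pid}  # guards against loops if a PID is reported twice
    frontier = [root_pid]
    while frontier:
        next_frontier: List[int] = []
        for pid in frontier:
            for child in children_of(pid):
                if child not in seen:
                    seen.add(child)
                    ordered.append(child)
                    next_frontier.append(child)
        frontier = next_frontier
    # Breadth-first order lists shallow PIDs first; reverse so leaves come first.
    ordered.reverse()
    return ordered
```

The root PID itself (the sudo wrapper) is deliberately excluded from the result, since the issue concerns only the processes beneath it.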

The kill semantics also need to flip from killing the wrapper-PID to killing the leaf-PID, with a fallback kill of any intermediate wrapper PIDs to be safe.
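
One way to express the leaf-first ordering is to build the kill commands as a list, leaf PID first and wrapper PIDs after. The command shape mirrors the `sudo -n -u daniel /bin/kill -SIGKILL <pid>` invocation quoted in the root cause; the helper name `build_kill_commands` and its parameters are hypothetical.

```python
from typing import List, Sequence


def build_kill_commands(
    os_user: str,
    leaf_pid: int,
    wrapper_pids: Sequence[int],
    signal_name: str = "SIGKILL",
) -> List[List[str]]:
    """Build sudo kill argv lists: the leaf first, then any wrappers."""
    # Skip a wrapper PID equal to the leaf (single-binary fallback case).
    targets = [leaf_pid] + [pid for pid in wrapper_pids if pid != leaf_pid]
    return [
        ["sudo", "-n", "-u", os_user, "/bin/kill", f"-{signal_name}", str(pid)]
        for pid in targets
    ]
```

For the process tree in this issue, `build_kill_commands("daniel", 45609, [45608])` yields a command for the Rust binary followed by one for the node wrapper.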

The sudoers rule (daniel) NOPASSWD: /bin/kill already allows the bot to signal any daniel-owned process, so no install-time changes are needed.

Tests

  • tests/test_codex.py: add a TestCodexGrandchildEscalation class that mocks pgrep to return two levels of children (node-PID then Rust-PID) and asserts that _send_signal issues sudo kill against the Rust-PID (not the node-PID).
  • Regression: existing tests assume a single-level child; they need updating to reflect the two-level walk.
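
A rough shape for the new test class is sketched below. Because codex.py is not reproduced here, `send_kill_to_inner` is a stand-in for the code under test, not the real `_send_signal`; the fake `subprocess.run`, the sudo PID 45600, and the helper names are likewise assumptions. PIDs 45608 and 45609 are the ones from the symptom above.

```python
import subprocess
import unittest
from unittest import mock


def _children(pid):
    out = subprocess.run(["pgrep", "-P", str(pid)],
                         capture_output=True, text=True, check=False)
    return [int(tok) for tok in out.stdout.split()]


def send_kill_to_inner(sudo_pid, os_user):
    """Stand-in for the code under test (hypothetical, not codex.py)."""
    first = _children(sudo_pid)
    if not first:
        return None
    second = _children(first[0])
    target = second[0] if second else first[0]
    subprocess.run(["sudo", "-n", "-u", os_user, "/bin/kill",
                    "-SIGKILL", str(target)], check=False)
    return target


def make_fake_run(tree):
    """Return (fake subprocess.run, recorded argv list) for a PID tree."""
    calls = []

    def fake_run(argv, **kwargs):
        calls.append(list(argv))
        kids = tree.get(int(argv[-1]), []) if argv[0] == "pgrep" else []
        code = 1 if argv[0] == "pgrep" and not kids else 0
        stdout = "".join(f"{k}\n" for k in kids)
        return subprocess.CompletedProcess(argv, code, stdout=stdout, stderr="")

    return fake_run, calls


class TestCodexGrandchildEscalation(unittest.TestCase):
    def _kills(self, tree):
        fake_run, calls = make_fake_run(tree)
        with mock.patch("subprocess.run", side_effect=fake_run):
            send_kill_to_inner(45600, "daniel")
        return [argv[-1] for argv in calls if "/bin/kill" in argv]

    def test_kill_targets_rust_pid_not_node_pid(self):
        self.assertEqual(self._kills({45600: [45608], 45608: [45609]}), ["45609"])

    def test_single_binary_falls_back_to_first_child(self):
        self.assertEqual(self._kills({45600: [45609]}), ["45609"])
```

In the real suite the stand-in would be replaced by the codex.py entry point, and the patch target would be whatever that module uses to invoke pgrep and kill.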

Acceptance

  • After a _kill on a codex backend, pgrep -u daniel codex returns no results.
  • shutdown and restart paths leave no daniel-owned codex processes behind.
  • The fix also handles the single-binary case (defensive; future codex releases may drop the node wrapper).
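
The first acceptance check could be automated along these lines. It relies on pgrep's documented exit status (0 when at least one process matched, 1 when none did); the function name `codex_pids_for_user` and the injectable `runner` are hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence


def default_runner(argv: Sequence[str]) -> subprocess.CompletedProcess:
    return subprocess.run(list(argv), capture_output=True, text=True, check=False)


def codex_pids_for_user(
    os_user: str,
    runner: Callable[[Sequence[str]], subprocess.CompletedProcess] = default_runner,
) -> List[int]:
    """Return PIDs reported by `pgrep -u <os_user> codex` (empty if none)."""
    result = runner(["pgrep", "-u", os_user, "codex"])
    if result.returncode == 1:
        return []  # pgrep matched nothing: the acceptance condition holds
    if result.returncode != 0:
        raise RuntimeError(f"pgrep failed with status {result.returncode}")
    return [int(tok) for tok in result.stdout.split()]
```

An empty list after `_kill`, shutdown, or restart corresponds to the "no results" condition above.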

Workaround until merged

Operators with an orphaned codex Rust binary can kill it manually:

sudo -u <os_user> pkill -f 'codex/codex app-server'

This kills any codex Rust binary owned by <os_user> (daniel in the example above), regardless of parent. The persistent backend will spawn a fresh one on the next message.
