fix(codex): rewrite to real app-server thread/turn/item protocol#485

Merged: dcellison merged 3 commits into main from fix/codex-app-server-protocol on May 15, 2026

@dcellison dcellison commented May 15, 2026

Summary

The codex backend (#482) and codex triage branch (#483) were modeled on goose's ACP session/* + agent_message_chunk vocabulary. The smoke test that started with PR #484 surfaced that codex app-server actually speaks a different protocol (thread/* + turn/* + item/*), and that codex exec --json emits a third schema ({"type":"item.completed","item":{"type":"agent_message","text":"..."}}). The persistent backend's handshake failed immediately with unknown variant session/new (the server enumerated 50+ valid method names), and the triage NDJSON parser looked for fields the real schema never emits. The smoke test also revealed that sudo from the kai service user can't resolve bare codex when the binary lives in a per-os_user npm-global home that is not on the service PATH.

This PR rewrites both surfaces against the actual schema documented in codex-rs/app-server/README.md and codex-rs/exec/src/exec_events.rs in the openai/codex repo, and adds a CODEX_BIN lever so the bot invokes codex by absolute path on multi-user installs.

Commits

  • 9dac019 Protocol rewrite. codex.py handshake is now initialize -> initialized -> thread/start (no protocolVersion, opt-out list suppresses noisy notifications). _send_locked uses turn/start with input: [{type:"text",text}]. Event parser handles item/started, item/agentMessage/delta (concatenate delta per itemId), item/completed (authoritative), turn/completed (terminal), error (terminal). triage.py _extract_codex_text walks events keyed on top-level type discriminator and returns item.text for agent_message items.
  • c4ca1f9 Runtime CODEX_BIN. codex.py and triage.py argv[0] reads from CODEX_BIN env var, falling back to bare codex for single-user installs where it's on PATH. install.py writes CODEX_BIN to /etc/kai/env when set during the wizard.
  • 177d813 Apply-time CODEX_BIN. Operators running sudo CODEX_BIN=... kai install apply bypass the wizard; the apply path now reads the env var directly and injects it into the env dict before _apply_secrets writes /etc/kai/env.
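
To make the new handshake order concrete, here is a minimal sketch of the messages involved. It is not the code in codex.py: the envelope (`id`/`method`/`params`), the `threadId` parameter name, and the helper names are assumptions for illustration. Only the method names, the `optOutNotificationMethods` list, and the `input` block shape come from this PR.

```python
import json
from typing import Optional

# Notification methods this PR opts out of during initialize.
OPT_OUT = [
    "remoteControl/status/changed",
    "mcpServer/startupStatus/updated",
    "thread/started",
    "thread/tokenUsage/updated",
]


def request(request_id: int, method: str, params: dict) -> dict:
    """Build a request that expects a response (envelope shape is assumed)."""
    return {"id": request_id, "method": method, "params": params}


def notification(method: str) -> dict:
    """Build a fire-and-forget notification: no id, so no response."""
    return {"method": method}


def handshake_messages() -> list:
    """initialize -> initialized -> thread/start, with no protocolVersion."""
    return [
        request(1, "initialize",
                {"capabilities": {"optOutNotificationMethods": OPT_OUT}}),
        notification("initialized"),
        request(2, "thread/start", {}),
    ]


def turn_start(request_id: int, thread_id: str, text: str,
               model: Optional[str] = None) -> dict:
    """Build turn/start; 'threadId' is a hypothetical parameter name."""
    params = {"threadId": thread_id, "input": [{"type": "text", "text": text}]}
    if model is not None:
        params["model"] = model  # optional per-turn model override
    return request(request_id, "turn/start", params)


def encode(message: dict) -> str:
    """Serialize one message as a single newline-terminated JSON line."""
    return json.dumps(message) + "\n"
```

A real client would wait for the initialize response before sending initialized, and would read the thread id out of the thread/start response rather than hard-coding it.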

Tests

  • tests/test_codex.py: helpers rebuilt against the real schema; new assertions on the optOutNotificationMethods list; new test locks CODEX_BIN precedence in the argv.
  • tests/test_triage.py: TestExtractCodexText rewritten against the real type-discriminated event schema; new test locks CODEX_BIN precedence in the codex exec argv.
  • tests/test_install.py: existing sudoers tests still cover the codex SETENV rule.
  • Full suite: 3086 passed, 1 skipped.

Smoke test (validated on Mac mini, 2026-05-15)

  • pytest green
  • make check clean
  • Cross-user sudo wrap from kai to daniel reaches codex
  • Handshake completes through initialize -> initialized -> thread/start
  • Bot answers a PING from Telegram with a coherent reply via the codex backend
  • ps confirms codex app-server is daniel-owned at runtime
  • Log shows "Starting persistent Codex app-server process (model=gpt-5.4-mini, user=daniel)"

Closes the protocol-mismatch finding from PR #484's smoke test. Refs #480. Follow-up issue tracks wizard-side hardening that surfaced during the smoke test.

zigguratt added 3 commits May 15, 2026 12:22
The codex backend and triage path were modeled on goose's ACP
session/* + agent_message_chunk vocabulary, but the actual codex
app-server protocol uses a distinct thread/turn/item vocabulary.
The persistent backend's handshake failed with `unknown variant
session/new` (codex enumerated 50+ valid method names instead),
and the triage path's NDJSON parser looked for fields the codex
exec --json schema never emits. Per the operator's go-ahead, this
PR rewrites both surfaces against the schema in
codex-rs/app-server/README.md and codex-rs/exec/src/exec_events.rs
in the openai/codex repo.

App-server (codex.py):
- Handshake is now initialize -> initialized notification -> thread/start.
  protocolVersion field dropped from initialize (not in the schema).
  initialize.capabilities.optOutNotificationMethods suppresses
  remoteControl/status/changed, mcpServer/startupStatus/updated,
  thread/started, thread/tokenUsage/updated.
- _session_id holds the codex thread.id for ABC consistency.
- send loop uses turn/start with input array of {type:"text",text}
  blocks, optional model override per-turn.
- Event parser: item/started, item/agentMessage/delta (concatenate
  delta per itemId), item/completed (authoritative for agentMessage),
  turn/completed (terminal), error notification (terminal).
- _write_notification helper for the initialized step.
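
The event-parser rules above can be sketched as a small fold over one turn's notifications. This is an illustration, not the parser in codex.py: the function name and the `params` field names (`itemId`, `delta`, `item.id`, `item.type`, `item.text`) are assumptions, as is returning None on an error notification.

```python
from collections import defaultdict
from typing import Iterable, Optional


def collect_agent_message(notifications: Iterable[dict]) -> Optional[str]:
    """Fold one turn's notifications into the agent's reply text.

    Returns None if an error notification ends the turn.
    """
    deltas = defaultdict(list)  # itemId -> streamed chunks, in arrival order
    completed = {}              # itemId -> authoritative text from item/completed
    order = []                  # itemIds in first-seen order

    def remember(item_id: str) -> None:
        if item_id not in order:
            order.append(item_id)

    for note in notifications:
        method = note.get("method")
        params = note.get("params", {})
        if method == "item/agentMessage/delta":
            remember(params["itemId"])
            deltas[params["itemId"]].append(params["delta"])
        elif method == "item/completed":
            item = params.get("item", {})
            if item.get("type") == "agentMessage":
                remember(item["id"])
                completed[item["id"]] = item.get("text", "")
        elif method == "turn/completed":
            break  # terminal: the turn is over
        elif method == "error":
            return None  # terminal: the turn failed
        # item/started and anything else carries no text and is ignored here

    # Prefer the completed text; fall back to the concatenated deltas.
    return "".join(completed.get(i, "".join(deltas[i])) for i in order)
```

Treating item/completed as authoritative means a dropped or reordered delta cannot corrupt the final text; the deltas matter only when no completed event arrives before the turn ends.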

codex exec --json (triage.py):
- _extract_codex_text now walks events keyed on top-level type
  discriminator (thread.started / turn.started / item.started /
  item.updated / item.completed / turn.completed / turn.failed /
  error). For item.completed of an agent_message item, returns
  item.text. Falls back to latest item.updated if no completed
  arrives. turn.failed short-circuits to empty string.
- _recover_terminal_text / _recover_chunk_text removed; the new
  _recover_agent_message_text replaces them with the right shape.
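
A rough sketch of the extraction rule described above, assuming one JSON object per line of codex exec --json output. The function name is hypothetical, and two behaviors are assumptions rather than facts about triage.py: malformed lines are skipped, and the first completed agent_message wins.

```python
import json


def extract_agent_text(ndjson: str) -> str:
    """Return the agent_message text from NDJSON events, or '' if none."""
    latest_update = ""
    for line in ndjson.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: tolerate non-JSON noise on stdout
        if not isinstance(event, dict):
            continue

        kind = event.get("type")  # top-level discriminator
        if kind == "turn.failed":
            return ""  # short-circuit: a failed turn yields no text

        item = event.get("item")
        if not isinstance(item, dict) or item.get("type") != "agent_message":
            continue
        if kind == "item.completed":
            return item.get("text", "")
        if kind == "item.updated":
            latest_update = item.get("text", "")  # fallback candidate

    # No item.completed arrived: fall back to the latest item.updated text.
    return latest_update
```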

Tests updated to match the real schemas; full suite 3084 pass.

Smoke-test discovery: sudo from the kai service user uses kai's PATH
to resolve bare command names, and on multi-user installs codex
typically lives in a per-os_user home (e.g. /Users/daniel/.npm-global/bin)
that is NOT on the service user's PATH. The bot's bare `codex` spawn
then fails with "a password is required" because sudo cannot find a
binary to match the sudoers rule against.

- src/kai/codex.py: argv[0] now reads from CODEX_BIN env var, falling
  back to bare "codex" for single-user installs where it's on PATH.
- src/kai/triage.py: same lever for the codex exec --json triage path.
- src/kai/install.py: persist CODEX_BIN to /etc/kai/env when set at
  install time, so the running bot picks up the same absolute path
  the sudoers rule names.
- Tests lock both branches: bare "codex" when env unset, full path
  when set.
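
The two branches the tests lock can be sketched as follows. The helper name is hypothetical, and treating an empty CODEX_BIN as unset is an assumption; the commit only states the set and unset cases.

```python
import os
from typing import Mapping, Optional, Sequence


def codex_argv(args: Sequence[str],
               environ: Optional[Mapping[str, str]] = None) -> list:
    """Prefix args with the codex binary: CODEX_BIN if set, else bare 'codex'."""
    env = os.environ if environ is None else environ
    binary = env.get("CODEX_BIN") or "codex"  # assumption: empty means unset
    return [binary, *args]
```

For example, `codex_argv(["exec", "--json"], {"CODEX_BIN": "/Users/daniel/.npm-global/bin/codex"})` gives sudo an absolute path to match against the sudoers rule, while an empty environment yields the bare `codex` that single-user installs resolve via PATH.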

Previous commit wrote CODEX_BIN inside _cmd_config (the wizard),
so operators running 'sudo CODEX_BIN=... kai install apply' bypassed
it - the wizard had never seen the var. Apply now reads CODEX_BIN
from the environment after loading install.conf and injects it into
the env dict before _apply_secrets writes /etc/kai/env. Apply-time
env wins over any stale install.conf value so the operator's
explicit override is honored.
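
The precedence described here amounts to a small merge step. The sketch below uses a hypothetical function name and plain dicts standing in for the install.conf values and the process environment; it is not the code in install.py.

```python
from typing import Mapping


def merge_codex_bin(conf_env: Mapping[str, str],
                    process_env: Mapping[str, str]) -> dict:
    """Return the env dict to persist, letting apply-time CODEX_BIN win."""
    merged = dict(conf_env)  # start from what install.conf recorded
    override = process_env.get("CODEX_BIN")
    if override:
        merged["CODEX_BIN"] = override  # explicit operator override wins
    return merged
```

With a stale `CODEX_BIN` in install.conf and a new one passed via `sudo CODEX_BIN=... kai install apply`, the merged dict carries the new value; with no apply-time value, whatever install.conf held is kept unchanged.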
@dcellison dcellison merged commit 436873f into main May 15, 2026
1 check passed
@dcellison dcellison deleted the fix/codex-app-server-protocol branch May 15, 2026 17:03