Skip to content

fix(codex): re-land approval, sandbox, multi-agentMessage fixes#491

Merged
dcellison merged 3 commits into
mainfrom
fix/codex-sandbox-and-approval
May 16, 2026
Merged

fix(codex): re-land approval, sandbox, multi-agentMessage fixes#491
dcellison merged 3 commits into
mainfrom
fix/codex-sandbox-and-approval

Conversation

@dcellison
Copy link
Copy Markdown
Owner

Supersedes #488. That PR was filed against an older main and never merged; the codex wizard hardening PR landed in the interim and touched the __init__ block #488 was based on, so this PR re-applies the three fixes on a fresh branch off current main rather than rebasing the original. The original branch is left intact for history.

What this does

Three fixes from the original smoke-test session, folded into one commit:

  1. approvalPolicy: "never" at thread/start. The codex default is on-request, which makes the server emit approval-request notifications and wait for the client to respond before any tool call. The bot has no in-loop approver, so on-request gates silently: codex waits forever, the readline ceiling fires, the operator sees "Codex timed out". This is the user-visible symptom that triggered this whole investigation.
  2. sandbox: "danger-full-access" at thread/start. read-only blocks file writes; workspace-write disables network egress by default (the gh access is blocked by the sandbox symptom). danger-full-access matches the claude backend's posture: claude --print runs unsandboxed with the bot's os_user's permissions, and codex now has the same authority profile. Sandbox variants are kebab-case; the server rejects camelCase at thread/start with "unknown variant".
  3. Per-item agentMessage tracking in the streaming parser. A single turn can emit multiple agentMessage items (preamble, tool call, post-tool summary). The previous parser concatenated deltas across items without a separator and let item N's item/completed.text override the entire accumulated string, erasing prior committed items. Now: committed_text holds completed items joined with \n\n; current_item_text holds the in-flight item's delta sum; current_item_id correlates which item the deltas belong to. item/completed authoritatively sets the CURRENT item's text only, commits it to the prefix, and resets.

Test plan

  • make test (3136 passed, 1 skipped on this branch; 2 new tests in TestStreamParsing lock the multi-item join with a blank-line separator and the per-item override scope; all 63 existing codex tests still pass without modification)
  • make check (ruff check + ruff format)
  • Pre-push grep gate clean
  • End-to-end smoke on production after merge: reinstall, send a Telegram message that requires gh/curl (currently blocked by the sandbox default), confirm the response arrives without "Codex timed out" or GitHub access is blocked by the sandbox.

What happens to #488

After this PR merges and the production smoke passes, close #488 with a comment pointing at this PR. Until then it stays open as the historical record.

zigguratt added 3 commits May 15, 2026 18:12
…-agentMessage tracking

Restores the three changes from the unmerged fix/codex-approval-policy
branch, freshly applied on top of current main so the workspace_config
block (touched by the codex wizard hardening PR that merged after the
old branch was filed) keeps its current shape.

Three fixes folded into one commit:

1. approvalPolicy=never. Codex's default approvalPolicy is
   "on-request", which makes the server emit approval-request
   notifications and wait for the client to respond before any tool
   call. The bot has no human in the loop to approve from Telegram,
   so on-request gates silently: codex waits forever, the bot's
   readline ceiling fires, the operator sees "Codex timed out".

2. sandbox=danger-full-access. The default readOnly sandbox
   prevents codex from writing files. workspace-write disables
   network egress by default, breaking gh, curl, pip, and anything
   else codex needs to reach outside the local filesystem. The
   claude backend runs unsandboxed under the bot's os_user; codex
   should have the same authority profile, no more, no less.
   Sandbox variants are kebab-case; the server rejects camelCase
   spellings at thread/start with "unknown variant".

3. Multi-agentMessage tracking. A single turn can emit MORE than
   one agentMessage item (e.g. preamble, tool call, post-tool
   summary). The previous parser concatenated deltas across items
   without a separator AND let item N's item/completed text field
   override the entire accumulated string, erasing prior committed
   items. The Telegram message displayed only the last item's
   text. Rewrite tracks per-item: committed_text holds completed
   items joined with blank-line separators; current_item_text
   holds the in-flight item's delta sum; current_item_id correlates
   which item the deltas belong to. item/completed authoritatively
   sets the CURRENT item's text only, commits it to the prefix,
   and resets. Defensive schema-drift handling: a delta arriving
   without a prior item/started treats the new itemId as opening
   a new item; turn/completed flushes any uncommitted
   current_item_text into the prefix before returning.

Two new tests lock the multi-item join with a blank-line separator
and the per-item override scope.
Folded into the sandbox+approval+parser PR because it's the same
"codex backend works under real load" surface and the same restart
cycle.

A single codex event can carry the full text of a tool call's output
inline (e.g. a `gh pr diff` result on a reasoning item, or an
item/completed for a long agentMessage). The asyncio.StreamReader
default limit is 64KB; we previously bumped it to 1MB. Real codex PR-
review turns observed in production still exceed 1MB and surface
"Separator is not found, and chunk exceed the limit" from readline.

Bumping to 16MB covers any plausible single-event payload while still
bounding memory if codex ever produces a runaway line. A proper
streaming reader (chunked + reassemble on newline) would remove the
ceiling entirely; deferred until the 16MB ceiling is itself observed
in practice.

Regression test asserts the spawn-time limit is at least 16MB so a
future shrinkback gets caught at test time, not on a live operator
turn.
Addresses the non-blocking review warning on PR #491: the parser
fix and 16MB stdout limit both have regression tests, but the two
fields that unblock production - approvalPolicy=never and
sandbox=danger-full-access - were only asserted by the operator
manually re-installing and re-pinging. Lock them in the existing
handshake test so a future regression that drops or misspells
either field gets caught at test time, not on the next live turn.
@dcellison dcellison merged commit cab7ba9 into main May 16, 2026
1 check passed
@dcellison dcellison deleted the fix/codex-sandbox-and-approval branch May 16, 2026 11:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants