🤖 bench: add Terminal-Bench strict goal mode by ThomasK33 · Pull Request #3348 · coder/mux

ThomasK33 · 2026-05-21T08:26:03Z

Summary

Adds opt-in strict Goal Run support to the Terminal-Bench adapter, including MUX_RUN_AS_GOAL plumbing, strict nonzero handling for incomplete goals, workflow controls, upload metadata, and tbench skill docs. Also documents mux run --goal semantics and adds google/gemini-3.5-flash to the nightly Terminal-Bench all-model matrix.

Background

Terminal-Bench tasks can now exercise the CLI goal loop by passing the task instruction as both stdin and --goal, while keeping scheduled nightly runs unchanged until a manual goal-mode A/B run is healthy.

Implementation

Adds MUX_RUN_AS_GOAL validation/forwarding in the Harbor adapter and shell runner.
Makes the runner preserve goal-mode exit code 3 after extracting token metadata.
Exposes mux_run_as_goal in the reusable and nightly workflows, defaulting scheduled nightly runs to false.
Adds goal-mode BigQuery row metadata and docs/ADR coverage.
Adds Gemini 3.5 Flash to the nightly all model matrix.

Validation

PATH="$HOME/.local/bin:$PATH" make static-check
go run github.com/rhysd/actionlint/cmd/actionlint@v1.7.7 .github/workflows/terminal-bench.yml .github/workflows/nightly-terminal-bench.yml
PATH="$HOME/.local/bin:$PATH" UV_PYTHON=3.12 uvx --from "harbor[daytona]==0.6.4" --with pytest pytest benchmarks/terminal_bench/mux_agent_test.py
bun test src/cli/run.test.ts src/node/services/workspaceGoalService.test.ts
Local mux-run dogfood with a fake bun shim verified --goal injection, stdin preservation, duplicate --goal rejection, token extraction, and exit 3 preservation.

Risks

Strict goal mode intentionally treats incomplete goals as agent execution failures, so it is opt-in for manual nightly dispatch before changing scheduled defaults.

📋 Implementation Plan

Implementation Plan: Add optional `mux run --goal` mode to Terminal-Bench

Objective

Add an opt-in strict Terminal-Bench goal mode that runs each benchmark task as a Mux CLI Goal Run:

The Terminal-Bench task instruction remains the initial stdin kickoff message.
The same task instruction is also passed as mux run --goal "<instruction>".
mux run must exit successfully only when the persisted goal status becomes complete.
If the goal stops incomplete and mux run exits 3, the benchmark adapter should treat that as an agent execution failure (Option A, agreed with the user), not as a normal verifier-only outcome.

Recommended approach net product-code LoC estimate: 0 product LoC. This is benchmark/CI adapter work only. Expected non-product LoC: ~140–230 across shell, Python adapter/tests, workflow inputs, upload metadata, and skill/docs references.

Current verified setup

benchmarks/terminal_bench/mux_agent.py stages a local Mux archive plus mux-run.sh into the Harbor sandbox, forwards provider/config env vars, and executes bash /installed-agent/mux-run.sh <instruction>.
benchmarks/terminal_bench/mux-run.sh builds:
```
cmd=(bun src/cli/run.ts
  --dir "${project_path}"
  --model "${MUX_MODEL}"
  --keep-background-processes
  --json)
```
then appends MUX_RUN_ARGS via intentional word-splitting and pipes the benchmark instruction to stdin.
.github/workflows/terminal-bench.yml exposes mux_run_args and forwards it as MUX_RUN_ARGS.
.github/workflows/nightly-terminal-bench.yml passes mux_run_args for smoke tests and nightly model matrix, currently without goal mode.
Makefile target benchmark-terminal runs Harbor and exports MUX_TIMEOUT_MS; it does not need per-flag plumbing for mux run flags.
The newly implemented mux run --goal semantics make incomplete goals exit 3; mux-run.sh intends to treat any nonzero mux run exit as fatal. MuxAgent.run() should also explicitly raise on nonzero sandbox command returns after preserving logs/token metadata, because it currently manages execution manually instead of relying on Harbor's base runner behavior.

Design decision

Add `MUX_RUN_AS_GOAL`

Introduce a benchmark adapter env var:

MUX_RUN_AS_GOAL=1

When enabled, mux-run.sh appends the current Terminal-Bench instruction as the CLI goal objective:

cmd+=(--goal "${instruction}")

Keep piping the instruction to stdin unchanged:

printf '%s' "${instruction}" | "${cmd[@]}"

This intentionally exercises the handoff-defined behavior where a message/stdin kickoff can be distinct from the active objective, even though Terminal-Bench uses the same text for both.

Goal limits stay in `MUX_RUN_ARGS`

Do not add bespoke env vars for --goal-budget or --goal-turns in the first pass. Users can supply them through the existing escape hatch:

MUX_RUN_AS_GOAL=1 \
MUX_RUN_ARGS="--thinking high --goal-turns 30 --goal-budget 10.00" \
make benchmark-terminal TB_TASK_NAMES="chess-best-move"

Rationale: this keeps the integration minimal, avoids duplicating CLI flag schema in the adapter, and preserves MUX_RUN_ARGS as the canonical path for arbitrary mux run options.

Strict failure semantics (Option A)

Do not special-case exit 3 in mux-run.sh.

If mux run --goal exits 0, Harbor proceeds to verification as usual.
If it exits 3, mux-run.sh prints an error and exits with the original 3; MuxAgent.run() raises after preserving logs/token metadata, and Harbor records an agent execution error.
This makes goal completion a benchmarked capability, not only a hidden telemetry field, while preserving the diagnostically important mux run exit code.

Duplicate `--goal` guard

When MUX_RUN_AS_GOAL is enabled, reject MUX_RUN_ARGS containing an explicit --goal or --goal=... token. The adapter owns the dynamic per-task objective in this mode; callers should use MUX_RUN_ARGS only for --goal-budget, --goal-turns, and other non-objective flags.

Observability marker

Goal-mode runs should be easy to separate from normal Terminal-Bench runs:

mux-run.sh should log whether strict goal mode is enabled.
Workflow upload steps should pass MUX_RUN_AS_GOAL through to result upload scripts.
scripts/upload-tbench-results.py and scripts/upload-harbor-results.py should add a safe mux_run_as_goal row field derived from env. The upload helpers drop fields that are not present in the BigQuery schema, so this can land before the table column exists.

Implementation steps

Phase 0 — Preflight existing benchmark tests

Before editing benchmark files, run the existing adapter test file once:

uvx --from "harbor[daytona]==0.6.4" --with pytest pytest benchmarks/terminal_bench/mux_agent_test.py

If it fails on stale expectations unrelated to goal mode (for example old MUX_THINKING_LEVEL / MUX_MODE assumptions), fix those stale expectations as a separate minimal cleanup before adding new assertions. This keeps goal-mode failures distinct from pre-existing test drift.

Phase 1 — Adapter env plumbing

Update benchmarks/terminal_bench/mux_agent.py:
- Add "MUX_RUN_AS_GOAL" to _CONFIG_ENV_KEYS so it is forwarded into the sandbox.
- Normalize/validate values in _env:
  - Accept unset/empty, 0, 1, true, false case-insensitively.
  - Normalize truthy values to "1".
  - Omit false/unset values from the forwarded env (or normalize to "0" if keeping an explicit marker proves simpler for tests); keep the shell tolerant of both.
  - Raise ValueError for ambiguous values like yes, goal, enabled.
- Keep default disabled.
- In MuxAgent.run(), preserve command stdout/stderr/return-code files, download /tmp/mux-tokens.json, populate context, and then raise RuntimeError for any nonzero command return in all modes. This aligns Harbor results with mux-run.sh's fatal intent and enforces strict Option A for goal exit 3; the current manual runner should not silently treat a failed agent command as success.
Update benchmarks/terminal_bench/mux-run.sh:
- Define MUX_RUN_AS_GOAL="${MUX_RUN_AS_GOAL:-}" near the other MUX_* env reads.
- Compute a normalized local boolean (for example mux_run_as_goal_normalized="${MUX_RUN_AS_GOAL,,}") and accept 1/true as enabled, 0/false/empty as disabled. Fail fast for anything else when the script is called directly outside MuxAgent.
- After the base cmd=(...) is created, before MUX_RUN_ARGS is appended, add:
```
if [[ "${mux_run_as_goal_normalized}" == "1" || "${mux_run_as_goal_normalized}" == "true" ]]; then
  cmd+=(--goal "${instruction}")
fi
```
- Before appending MUX_RUN_ARGS, split them into an array once and reject exact --goal or --goal=... tokens when goal mode is enabled. Do not reject --goal-budget or --goal-turns.
- Keep --goal "$instruction" before MUX_RUN_ARGS so callers can still add --goal-budget, --goal-turns, --thinking, --budget, etc. afterward.
- Restructure the pipeline so token extraction still runs after a nonzero mux run exit while preserving the original exit code:
  - temporarily disable errexit around the pipeline (set +e),
  - run printf ... | "${cmd[@]}" ... | tee ...,
  - capture PIPESTATUS immediately and use the actual mux command status (the middle pipeline element) as mux_status,
  - restore set -e,
  - always attempt token extraction from MUX_OUTPUT_FILE,
  - if mux_status is nonzero, print a clear [mux-run] ERROR: mux agent session failed message and exit "${mux_status}" instead of calling fatal, because fatal would collapse exit 3 to 1.
- Do not add any exit-code override; a stored exit 3 should remain fatal under Option A and remain visible as 3.

Quality gate after Phase 1:

Run shellcheck target through the repo’s normal checks later; ensure no shellcheck issue from lowercase expansion or array append.
Run targeted Python tests after Phase 2.

Phase 2 — Unit coverage

Update benchmarks/terminal_bench/mux_agent_test.py with focused adapter tests:

test_goal_mode_env_is_forwarded:
- Set MUX_RUN_AS_GOAL=1.
- Assert agent._env["MUX_RUN_AS_GOAL"] == "1".
test_goal_mode_defaults_to_disabled:
- Assert "MUX_RUN_AS_GOAL" not in agent._env when unset (or equals "0" if implementation chooses explicit false forwarding consistently).
test_goal_mode_rejects_invalid_values:
- Set MUX_RUN_AS_GOAL=yes.
- Assert _env raises ValueError.
test_run_raises_after_preserving_logs_for_nonzero_exit:
- Use a small fake BaseEnvironment object whose exec() returns a nonzero return code and whose download_file() either writes a token file or raises as needed.
- Assert MuxAgent.run() writes return-code.txt/stdout/stderr before raising.
- If practical, assert token download/population was attempted before the raise.
If there is a lightweight way to assert command construction without running Harbor, add a test around create_run_agent_commands only for preserving the instruction argument. Do not assert shell internals tautologically; the meaningful behavior is env forwarding and generated command remains a single runner invocation with the instruction quoted.

Potential pre-existing test issue to verify before editing:

Current mux_agent_test.py appears to assert MUX_THINKING_LEVEL and MUX_MODE, while current MuxAgent._env evidence did not show those env keys. Run the targeted pytest before changes; if it already fails, fix or update stale tests as a separate minimal cleanup in the same patch, documenting that the goal-mode tests exposed stale expectations.

Quality gate after Phase 2:

uvx --from "harbor[daytona]==0.6.4" --with pytest pytest benchmarks/terminal_bench/mux_agent_test.py

If the repo’s preferred Python test invocation differs, use the smallest available equivalent.

Phase 3 — CI workflow input

Update .github/workflows/terminal-bench.yml:

Add a workflow_call.inputs.mux_run_as_goal boolean input:

mux_run_as_goal:
  description: "Run each task instruction as a mux CLI Goal Run"
  required: false
  type: boolean
  default: false

Add a matching workflow_dispatch.inputs.mux_run_as_goal boolean input.

In the Run Terminal-Bench step env, add:

MUX_RUN_AS_GOAL: ${{ inputs.mux_run_as_goal && '1' || '' }}

In both BigQuery upload steps, pass the same env marker so upload scripts can annotate rows:
```
MUX_RUN_AS_GOAL: ${{ inputs.mux_run_as_goal && '1' || '' }}
```
Update scripts/upload-tbench-results.py and scripts/upload-harbor-results.py:
- Read MUX_RUN_AS_GOAL from env.
- Normalize it to a boolean row field such as mux_run_as_goal.
- Rely on each script's table-schema filtering to drop the field until BigQuery schema catches up.

Update the mux_run_args descriptions to mention goal limits when goal mode is enabled:

Additional CLI flags passed to mux run (e.g., --thinking high --use-1m; with goal mode, add --goal-turns/--goal-budget)

Quality gate after Phase 3:

Validate workflow syntax through make static-check later; it includes formatting/linting relevant to repo checks.

Phase 4 — Nightly workflow opt-in only

Update .github/workflows/nightly-terminal-bench.yml:

Add a manual workflow_dispatch boolean input:

mux_run_as_goal:
  description: "Run nightly smoke/matrix tasks as strict mux CLI Goal Runs"
  required: false
  default: false
  type: boolean

Pass it through to every reusable terminal-bench.yml job with an expression that is explicitly false for scheduled events:
```
mux_run_as_goal: ${{ github.event_name == 'workflow_dispatch' && inputs.mux_run_as_goal }}
```
Keep scheduled runs disabled by default. Scheduled events should continue to behave exactly as today; do not rely on undefined inputs behavior for scheduled runs.

Rationale: this allows a full manual A/B nightly-style run without changing scheduled leaderboard-tracking behavior.

Quality gate after Phase 4:

Confirm scheduled path still omits/false-passes goal mode by reading the workflow expressions.

Phase 5 — Skill/docs updates

Update the project tbench skill documentation (.mux/skills/tbench/SKILL.md or the source skill location if applicable) only if project convention expects skill docs to track benchmark invocation options.

Add examples:

# Run a single task as a strict CLI Goal Run
MUX_RUN_AS_GOAL=1 \
MUX_RUN_ARGS="--thinking high --goal-turns 30 --goal-budget 10.00" \
make benchmark-terminal TB_TASK_NAMES="chess-best-move"

# CI dispatch
gh workflow run terminal-bench.yml \
  -f model_name=anthropic/claude-sonnet-4-5 \
  -f task_names=chess-best-move \
  -f mux_run_as_goal=true \
  -f mux_run_args="--thinking high --goal-turns 30 --goal-budget 10.00"

If make fmt regenerates src/node/builtinSkills/mux-docs.md or generated skill content, include those generated changes only when the repo’s tooling requires them.

Quality gate after Phase 5:

make fmt

Dogfooding plan

Dogfooding must capture both a terminal recording and a final screenshot artifact.

Local smoke dogfood

Run a very small Terminal-Bench task in strict goal mode:

MUX_RUN_AS_GOAL=1 \
MUX_RUN_ARGS="--thinking high --goal-turns 20 --goal-budget 5.00" \
make benchmark-terminal \
  TB_TASK_NAMES="chess-best-move" \
  TB_CONCURRENCY=1 \
  TB_TIMEOUT=1800

Record it:

TERM=xterm-256color COLUMNS=120 LINES=36 \
asciinema rec --overwrite \
  --command 'MUX_RUN_AS_GOAL=1 MUX_RUN_ARGS="--thinking high --goal-turns 20 --goal-budget 5.00" make benchmark-terminal TB_TASK_NAMES="chess-best-move" TB_CONCURRENCY=1 TB_TIMEOUT=1800' \
  /tmp/mux-tbench-goal.cast

Generate a terminal screenshot SVG from the asciicast and attach it for review. If the full benchmark is too expensive locally, run a workflow-dispatch smoke instead and record the local gh workflow run + gh run watch session, then download and inspect the artifact logs.

Evidence to collect

/logs/agent/command-0/stdout.txt or artifact equivalent contains JSONL with:
- goal-started
- either goal-completed + run-complete.goal.status == "complete", or a strict failure if the agent does not complete the goal
/logs/agent/command-0/return-code.txt or the downloaded command log records the runner exit code.
Token extraction still produces /tmp/mux-tokens.json / mux-tokens.json when enough JSONL was emitted before a nonzero exit.
Harbor result exists under jobs/<timestamp>/result.json.
If the run fails, verify whether it failed due to mux run exit 3 (expected strict Option A) versus setup/infrastructure.

Automated validation plan

Run in this order:

Python adapter tests:

uvx --from "harbor[daytona]==0.6.4" --with pytest pytest benchmarks/terminal_bench/mux_agent_test.py

Shell/workflow/static validation:
```
make static-check
```
If make static-check asks for generated docs/formatting:
```
make fmt
make static-check
```
Optional targeted dry-run command construction check:
```
MUX_RUN_AS_GOAL=1 MUX_RUN_ARGS="--thinking high --goal-turns 2" make benchmark-terminal TB_TASK_NAMES="hello-world" TB_CONCURRENCY=1 TB_TIMEOUT=600
```
Only run this if provider credentials and environment are available; otherwise rely on CI workflow dispatch.

Acceptance criteria

MUX_RUN_AS_GOAL is forwarded by the Terminal-Bench agent adapter into the sandbox.
When MUX_RUN_AS_GOAL=1, mux-run.sh invokes mux run --goal "$instruction" while still piping the instruction on stdin.
MUX_RUN_ARGS remains the only generic pass-through for --goal-budget, --goal-turns, and other mux run flags.
MUX_RUN_AS_GOAL=1 rejects duplicate --goal / --goal=... supplied through MUX_RUN_ARGS.
Exit code 3 from incomplete goals remains fatal in the adapter (strict Option A), and MuxAgent.run() explicitly raises on the nonzero command after preserving logs/token metadata.
Reusable Terminal-Bench workflow exposes mux_run_as_goal for workflow_call and workflow_dispatch.
Nightly workflow can opt into strict goal mode manually, but scheduled runs remain unchanged by default.
Targeted adapter tests pass.
make static-check passes.
Dogfood evidence includes terminal recording, screenshot, and artifact/log proof that goal mode was active.

Risks and mitigations

Risk: goal mode increases agent execution errors. Mitigation: opt-in only; scheduled nightly runs stay unchanged until A/B data supports enabling it.
Risk: shell quoting breaks goal objectives with spaces/newlines. Mitigation: append --goal "$instruction" as a Bash array element, not through MUX_RUN_ARGS word-splitting.
Risk: workflows accidentally enable goal mode on scheduled runs. Mitigation: boolean input default false; review expressions for scheduled event behavior.
Risk: invalid env values create surprising behavior. Mitigation: validate MUX_RUN_AS_GOAL in MuxAgent._env and fail fast.
Risk: strict exit 3 obscures verifier pass/fail. Mitigation: this is intended for Option A; preserve JSONL/token artifacts before raising, annotate rows/logs with mux_run_as_goal, and only consider Option B later if strict failures are too noisy.

Future follow-up, not in this patch

If strict mode proves too noisy, add a separate opt-in compatibility mode such as MUX_GOAL_ALLOW_INCOMPLETE=1 that lets exit 3 proceed to verification while recording goal status. Do not include that in the first implementation because the user explicitly selected Option A.

Generated with mux • Model: openai:gpt-5.5 • Thinking: xhigh • Cost: 3134897{MUX_COSTS_USD:-0}

mintlify · 2026-05-21T08:26:08Z

Preview deployment for your docs. Learn more about Mintlify Previews.

Project	Status	Preview	Updated (UTC)
Mux	🟢 Ready	View Preview	May 21, 2026, 8:26 AM

💡 Tip: Enable Workflows to automatically generate PRs for you.

ThomasK33 · 2026-05-21T08:26:51Z

/coder-agents-review

ThomasK33 · 2026-05-21T08:26:53Z

@codex review

chatgpt-codex-connector · 2026-05-21T08:35:44Z

Codex Review: Didn't find any major issues. What shall we delve into next?

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

coder-agents-review

First-pass mechanical review (Netero). This is not the full panel review.

The adapter plumbing, shell integration, workflow wiring, and upload metadata look solid. The env normalization, duplicate --goal guard, exit code preservation chain (PIPESTATUS -> mux_status -> MuxAgent raise), and the service-level tests are well-constructed. CI is green (21 passed, 3 skipped).

One coverage gap worth addressing before the panel reviews: the core goal continuation loop (driveGoalUntilTerminal) has complex control flow (while-true state machine, safety assert, timeout with diagnostic, budget interaction) and zero automated test coverage. The existing tests cover argument parsing and service-level options but not the CLI integration loop itself.

1 P2 (test coverage), 1 Note (minor duplication).

This is a first-pass review only: these are mechanical findings from Netero. The full review panel has not yet reviewed this PR and will review after these findings are addressed.

"Total new production lines in run.ts: ~245. Total new test lines covering run.ts: ~33, all argument parsing." (Netero)

🤖 This review was automatically generated with Coder Agents.

ThomasK33 · 2026-05-21T09:08:16Z

/coder-agents-review

ThomasK33 · 2026-05-21T09:08:17Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 08142cae78

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ThomasK33 · 2026-05-21T09:16:40Z

/coder-agents-review

ThomasK33 · 2026-05-21T09:16:41Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b753dbc13e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ThomasK33 · 2026-05-21T09:27:27Z

/coder-agents-review

ThomasK33 · 2026-05-21T09:27:30Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 82624dbb12

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ThomasK33 · 2026-05-21T09:38:02Z

/coder-agents-review

ThomasK33 · 2026-05-21T09:38:03Z

@codex review

coder-agents-review

Panel review (14 reviewers). The DEREM-1 fix is well-done: extracting driveCliGoalUntilTerminal into a standalone module with dependency injection is the right pattern, and the tests cover the important behavioral branches. The WorkspaceGoalServiceOptions parameterization is surgical and clean. The shell exit code preservation via PIPESTATUS[1] is a solid improvement. The ADR is clear and the claim verification pass against the code holds up.

Two structural issues surfaced from convergent findings across multiple reviewers:

When driveCliGoalUntilTerminal throws (stream-start timeout or safety limit), the error bypasses run-complete, goal-incomplete, and all cost/usage data. Exit code is 1 instead of the documented 3. Four reviewers flagged this independently.
Goal-incomplete exit (code 3) in non-JSON mode writes nothing to the operator. The success path writes [goal] completed: done; the failure path is silent.

Severity breakdown: 2 P2, 7 P3, 6 Nit.

Process note: the implementation plan required dogfood recordings and screenshots; the PR delivered a fake-bun-shim smoke test. The automated test suite compensates, but the gap between stated and actual validation is worth noting.

"The five-second timeout is configurable via streamStartTimeoutMs but only from code, not from the CLI user." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

chatgpt-codex-connector · 2026-05-21T09:45:41Z

Codex Review: Didn't find any major issues. Keep them coming!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

coder-agents-review

Review blocked. All 16 open findings from Round 2 are unaddressed: no code changes, no author replies.

The author pushed commits since R2 (08142ca → 81af40e) addressing findings from the Codex reviewer (pipeline tee failures, plan-mode budget_limited, 5s stream-start timeout), but none of the DEREM findings were responded to.

Open findings requiring response before the next panel review:

P2 (2):

DEREM-4: driveCliGoalUntilTerminal throw bypasses run-complete, goal-incomplete, cost data; exits 1 instead of documented 3
DEREM-5: Goal-incomplete exit (code 3) in non-JSON mode produces no human-readable output

P3 (7):

DEREM-6: run-complete.goal.stopReason null when session budget stops goal before driver
DEREM-7: describeCliGoalStop has 6 return paths, 1 tested
DEREM-8: Null-goal and paused-goal driver exit paths untested
DEREM-9: MuxAgent.run() success-path test missing
DEREM-10: --goal "" without message shows misleading error
DEREM-11: Magic numbers without named constants
DEREM-12: Exported functions lack doc comments

Nit (6): DEREM-3, DEREM-14, DEREM-15, DEREM-16, DEREM-17, DEREM-18

The panel will review once the author responds to or pushes fixes for the open findings. At minimum, the two P2 findings need a response (fix, acknowledgment, or contestation).

🤖 This review was automatically generated with Coder Agents.

ThomasK33 · 2026-05-21T09:47:57Z

/coder-agents-review

ThomasK33 · 2026-05-21T09:47:59Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3db7c90da

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ThomasK33 · 2026-05-21T09:56:56Z

@codex review

Addressed the stream-start hang finding by passing a bounded 5-minute startup timeout into CLI goal continuations, while keeping the timeout long enough to avoid the earlier 5-second CI false-positive issue. Added targeted driver coverage for forwarding the configured timeout.

Validation:

bun test src/cli/goalRunDriver.test.ts src/cli/run.test.ts
make static-check

chatgpt-codex-connector · 2026-05-21T10:03:50Z

Codex Review: Didn't find any major issues. Delightful!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

coder-agents-review

Panel re-review (5 reviewers). All 15 R2 findings verified fixed.

DEREM-4 (driver throw bypassing structured events) is properly caught now: the dedicated try/catch stores the error, sets goalStopReason, and lets run-complete/goal-incomplete emit before returning exit 3. DEREM-5 (silent goal-incomplete in non-JSON mode) now writes [goal] stopped: <reason>. The rest of the fixes (test coverage, doc comments, type accuracy, naming, validation ordering) are all confirmed by the reviewers who originally raised them.

The goalRunDriver extraction, WorkspaceGoalServiceOptions parameterization, shell PIPESTATUS preservation, and adapter error handling are solid. The test suite (451 lines covering 476 production lines) exercises all meaningful branches. The ADR clearly articulates the CLI/interactive distinction.

3 P3, 1 Nit new. None blocking.

"I could not identify a simpler alternative that would solve the same problem. The CLI genuinely lacked multi-continuation capability. This is additive work, not a workaround." (Pariston)

🤖 This review was automatically generated with Coder Agents.

Add optional strict mux goal mode to Terminal-Bench runs, wire workflow controls and upload metadata, and include Gemini 3.5 Flash in the nightly model matrix. --- _Generated with `mux` • Model: `openai:gpt-5.5` • Thinking: `xhigh` • Cost: `3125725{MUX_COSTS_USD:-0}`_

ThomasK33 · 2026-05-21T10:23:36Z

Responses to coder-agents-review findings:

DEREM-19: Fixed. The goal driver now re-reads the goal after a session-budgeted continuation; if that turn completed the goal, it records complete and emits the normal goal-completed event instead of reporting session budget exceeded. The final run.ts fallback now also skips the session-budget stop reason when the final goal is complete.
DEREM-20: Fixed. Added coverage for the continuation safety guard by forcing continuationSafetyLimit: 1 and asserting the guard throws.
DEREM-21: Fixed. run.ts now passes the shared CLI_GOAL_STREAM_START_TIMEOUT_MS (5 minutes) into CLI goal continuation waits, preventing indefinite direct mux run --goal hangs without reintroducing the earlier 5-second false-positive risk.
DEREM-22: Fixed. Converted the terminal-stop-shape assertions to named test.each cases so failures identify the scenario.

Validation:

bun test src/cli/goalRunDriver.test.ts src/cli/run.test.ts
make static-check

ThomasK33 · 2026-05-21T10:23:40Z

/coder-agents-review
@codex review

Please review the latest fixes.

chatgpt-codex-connector · 2026-05-21T10:28:35Z

Codex Review: Didn't find any major issues. More of your lovely PRs please.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ThomasK33 · 2026-05-21T10:44:45Z

/coder-agents-review

coder-agents-review

Panel re-review (4 reviewers, R5). All R4 findings verified fixed. No new findings.

DEREM-19 (stopReason override on completed goal): driver now re-reads the goal and checks for completion before defaulting to "session budget exceeded." Test confirms. DEREM-20 (safety limit untested): tested with continuationSafetyLimit: 1. DEREM-21 (indefinite hang): CLI_GOAL_STREAM_START_TIMEOUT_MS (5 minutes) now wired from run.ts. DEREM-22 (unnamed test cases): converted to named test.each.

Across 5 rounds, 22 findings were raised. 19 fixed, 1 accepted (DEREM-2, intentional script independence), 1 dropped (DEREM-13, unrelated drive-by), 1 closed (DEREM-1, superseded by extraction). The PR is thorough and responsive to review feedback.

"I tried to build a case against this change and could not. The problem is correctly understood, the solution is proportional, and the fix is at the right level of the causal chain." (Pariston)

🤖 This review was automatically generated with Coder Agents.

The two new `parseGoalBudgetFlag` and `parseGoalTurnsFlag` helpers in src/cli/run.ts (added in #3348) built their error strings with `'...' + value + '...'` concatenation, which is the only `throw new Error('...')` pair in the file — every other error (including the adjacent `parseMode` helper) uses backtick template literals. Convert these two messages to template literals for consistency. Pure stylistic alignment; the thrown Error has the same message text.

mintlify Bot deployed to staging - docs May 21, 2026 08:26 View deployment

coder-agents-review Bot reviewed May 21, 2026

View reviewed changes

Comment thread src/cli/run.ts

Comment thread scripts/upload-tbench-results.py

ThomasK33 force-pushed the handoff-docs-xtm4 branch from a83a3a6 to 08142ca Compare May 21, 2026 09:07

mintlify Bot deployed to staging - docs May 21, 2026 09:08 View deployment

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Comment thread benchmarks/terminal_bench/mux-run.sh

ThomasK33 force-pushed the handoff-docs-xtm4 branch from 08142ca to b753dbc Compare May 21, 2026 09:16

mintlify Bot deployed to staging - docs May 21, 2026 09:17 View deployment

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Comment thread src/cli/run.ts Outdated

ThomasK33 force-pushed the handoff-docs-xtm4 branch from b753dbc to 82624db Compare May 21, 2026 09:27

mintlify Bot deployed to staging - docs May 21, 2026 09:27 View deployment

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Comment thread src/cli/goalRunDriver.ts

ThomasK33 force-pushed the handoff-docs-xtm4 branch from 82624db to 81af40e Compare May 21, 2026 09:37

mintlify Bot deployed to staging - docs May 21, 2026 09:38 View deployment

coder-agents-review Bot suggested changes May 21, 2026

View reviewed changes

coder-agents-review Bot reviewed May 21, 2026

View reviewed changes

ThomasK33 force-pushed the handoff-docs-xtm4 branch from 81af40e to e3db7c9 Compare May 21, 2026 09:46

mintlify Bot deployed to staging - docs May 21, 2026 09:47 View deployment

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Comment thread src/cli/run.ts

ThomasK33 force-pushed the handoff-docs-xtm4 branch from e3db7c9 to 95f1e7a Compare May 21, 2026 09:56

mintlify Bot deployed to staging - docs May 21, 2026 09:57 View deployment

coder-agents-review Bot approved these changes May 21, 2026

View reviewed changes

Comment thread src/cli/goalRunDriver.ts

Comment thread src/cli/goalRunDriver.test.ts

Comment thread src/cli/goalRunDriver.ts

Comment thread src/cli/goalRunDriver.test.ts

ThomasK33 force-pushed the handoff-docs-xtm4 branch from 95f1e7a to 45a5f03 Compare May 21, 2026 10:23

mintlify Bot deployed to staging - docs May 21, 2026 10:24 View deployment

coder-agents-review Bot approved these changes May 21, 2026

View reviewed changes

ThomasK33 added this pull request to the merge queue May 21, 2026

Merged via the queue into main with commit 03b4e8e May 21, 2026
24 checks passed

ThomasK33 deleted the handoff-docs-xtm4 branch May 21, 2026 11:42

mux-bot Bot mentioned this pull request May 21, 2026

🤖 refactor: auto-cleanup #3291

Open

Conversation

ThomasK33 commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Background

Implementation

Validation

Risks

Implementation Plan: Add optional mux run --goal mode to Terminal-Bench

Objective

Current verified setup

Design decision

Add MUX_RUN_AS_GOAL

Goal limits stay in MUX_RUN_ARGS

Strict failure semantics (Option A)

Duplicate --goal guard

Observability marker

Implementation steps

Phase 0 — Preflight existing benchmark tests

Phase 1 — Adapter env plumbing

Phase 2 — Unit coverage

Phase 3 — CI workflow input

Phase 4 — Nightly workflow opt-in only

Phase 5 — Skill/docs updates

Dogfooding plan

Local smoke dogfood

Evidence to collect

Automated validation plan

Acceptance criteria

Risks and mitigations

Future follow-up, not in this patch

Uh oh!

mintlify Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ThomasK33 commented May 21, 2026

Uh oh!

ThomasK33 commented May 21, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 21, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ThomasK33 commented May 21, 2026

Uh oh!

ThomasK33 commented May 21, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

ThomasK33 commented May 21, 2026

Uh oh!

ThomasK33 commented May 21, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

ThomasK33 commented May 21, 2026

Uh oh!

ThomasK33 commented May 21, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

ThomasK33 commented May 21, 2026

Uh oh!

ThomasK33 commented May 21, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ThomasK33 commented May 21, 2026 •

edited

Loading

Implementation Plan: Add optional `mux run --goal` mode to Terminal-Bench

Add `MUX_RUN_AS_GOAL`

Goal limits stay in `MUX_RUN_ARGS`

Duplicate `--goal` guard

mintlify Bot commented May 21, 2026 •

edited

Loading