🤖 bench: add Terminal-Bench strict goal mode #3348
/coder-agents-review
@codex review
Codex Review: Didn't find any major issues. What shall we delve into next?
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
First-pass mechanical review (Netero). This is not the full panel review.
The adapter plumbing, shell integration, workflow wiring, and upload metadata look solid. The env normalization, duplicate --goal guard, exit code preservation chain (PIPESTATUS -> mux_status -> MuxAgent raise), and the service-level tests are well-constructed. CI is green (21 passed, 3 skipped).
One coverage gap worth addressing before the panel reviews: the core goal continuation loop (driveGoalUntilTerminal) has complex control flow (while-true state machine, safety assert, timeout with diagnostic, budget interaction) and zero automated test coverage. The existing tests cover argument parsing and service-level options but not the CLI integration loop itself.
1 P2 (test coverage), 1 Note (minor duplication).
This is a first-pass review only: these are mechanical findings from Netero. The full review panel has not yet reviewed this PR and will review after these findings are addressed.
"Total new production lines in run.ts: ~245. Total new test lines covering run.ts: ~33, all argument parsing." (Netero)
🤖 This review was automatically generated with Coder Agents.
a83a3a6 to 08142ca
/coder-agents-review
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 08142cae78
08142ca to b753dbc
/coder-agents-review
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b753dbc13e
b753dbc to 82624db
/coder-agents-review
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 82624dbb12
82624db to 81af40e
/coder-agents-review
@codex review
Panel review (14 reviewers). The DEREM-1 fix is well-done: extracting driveCliGoalUntilTerminal into a standalone module with dependency injection is the right pattern, and the tests cover the important behavioral branches. The WorkspaceGoalServiceOptions parameterization is surgical and clean. The shell exit code preservation via PIPESTATUS[1] is a solid improvement. The ADR is clear and the claim verification pass against the code holds up.
Two structural issues surfaced from convergent findings across multiple reviewers:
- When `driveCliGoalUntilTerminal` throws (stream-start timeout or safety limit), the error bypasses `run-complete`, `goal-incomplete`, and all cost/usage data. Exit code is 1 instead of the documented 3. Four reviewers flagged this independently.
- Goal-incomplete exit (code 3) in non-JSON mode writes nothing to the operator. The success path writes `[goal] completed: done`; the failure path is silent.
Severity breakdown: 2 P2, 7 P3, 6 Nit.
Process note: the implementation plan required dogfood recordings and screenshots; the PR delivered a fake-bun-shim smoke test. The automated test suite compensates, but the gap between stated and actual validation is worth noting.
"The five-second timeout is configurable via streamStartTimeoutMs but only from code, not from the CLI user." (Hisoka)
🤖 This review was automatically generated with Coder Agents.
Codex Review: Didn't find any major issues. Keep them coming!
Review blocked. All 15 open findings from Round 2 are unaddressed: no code changes, no author replies.
The author pushed commits since R2 (08142ca → 81af40e) addressing findings from the Codex reviewer (pipeline tee failures, plan-mode budget_limited, 5s stream-start timeout), but none of the DEREM findings were responded to.
Open findings requiring response before the next panel review:
P2 (2):
- DEREM-4: `driveCliGoalUntilTerminal` throw bypasses `run-complete`, `goal-incomplete`, cost data; exits 1 instead of documented 3
- DEREM-5: Goal-incomplete exit (code 3) in non-JSON mode produces no human-readable output
P3 (7):
- DEREM-6: `run-complete.goal.stopReason` null when session budget stops goal before driver
- DEREM-7: `describeCliGoalStop` has 6 return paths, 1 tested
- DEREM-8: Null-goal and paused-goal driver exit paths untested
- DEREM-9: `MuxAgent.run()` success-path test missing
- DEREM-10: `--goal ""` without message shows misleading error
- DEREM-11: Magic numbers without named constants
- DEREM-12: Exported functions lack doc comments
Nit (6): DEREM-3, DEREM-14, DEREM-15, DEREM-16, DEREM-17, DEREM-18
The panel will review once the author responds to or pushes fixes for the open findings. At minimum, the two P2 findings need a response (fix, acknowledgment, or contestation).
🤖 This review was automatically generated with Coder Agents.
81af40e to e3db7c9
/coder-agents-review
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e3db7c90da
e3db7c9 to 95f1e7a
@codex review
Addressed the stream-start hang finding by passing a bounded 5-minute startup timeout into CLI goal continuations, while keeping the timeout long enough to avoid the earlier 5-second CI false-positive issue. Added targeted driver coverage for forwarding the configured timeout.
Validation:
Codex Review: Didn't find any major issues. Delightful!
Panel re-review (5 reviewers). All 15 R2 findings verified fixed.
DEREM-4 (driver throw bypassing structured events) is properly caught now: the dedicated try/catch stores the error, sets goalStopReason, and lets run-complete/goal-incomplete emit before returning exit 3. DEREM-5 (silent goal-incomplete in non-JSON mode) now writes [goal] stopped: <reason>. The rest of the fixes (test coverage, doc comments, type accuracy, naming, validation ordering) are all confirmed by the reviewers who originally raised them.
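The DEREM-4 fix described above follows a familiar pattern: catch the driver failure, record it as the stop reason, and still emit the structured events before choosing the exit code. The sketch below illustrates that control flow in Python rather than the PR's TypeScript. `run_goal`, `drive`, and `emit` are hypothetical names, and the event payloads are simplified assumptions, not the CLI's actual schema.

```python
from typing import Callable, Optional

EXIT_OK = 0
EXIT_GOAL_INCOMPLETE = 3  # exit code the review cites as documented for incomplete goals


def run_goal(drive: Callable[[], str], emit: Callable[[str, Optional[str]], None]) -> int:
    """Drive a goal to a terminal state and always emit the closing events.

    `drive` (hypothetical) returns the final goal status, or raises on
    conditions such as a stream-start timeout or a safety limit.
    """
    status: Optional[str] = None
    stop_reason: Optional[str] = None
    try:
        status = drive()
    except Exception as err:  # keep the failure instead of letting it skip the events
        stop_reason = str(err)

    # Emitted on every path, so cost/usage reporting is never bypassed.
    emit("run-complete", stop_reason)
    if status == "complete":
        return EXIT_OK
    emit("goal-incomplete", stop_reason or f"goal status: {status}")
    return EXIT_GOAL_INCOMPLETE
```

With this shape, a thrown driver error and a goal that simply stops short both reach the same reporting path and the same exit code.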
The goalRunDriver extraction, WorkspaceGoalServiceOptions parameterization, shell PIPESTATUS preservation, and adapter error handling are solid. The test suite (451 lines covering 476 production lines) exercises all meaningful branches. The ADR clearly articulates the CLI/interactive distinction.
3 P3, 1 Nit new. None blocking.
"I could not identify a simpler alternative that would solve the same problem. The CLI genuinely lacked multi-continuation capability. This is additive work, not a workaround." (Pariston)
🤖 This review was automatically generated with Coder Agents.
Add optional strict mux goal mode to Terminal-Bench runs, wire workflow controls and upload metadata, and include Gemini 3.5 Flash in the nightly model matrix.
---
_Generated with `mux` • Model: `openai:gpt-5.5` • Thinking: `xhigh` • Cost: `$91.31`_
<!-- mux-attribution: model=openai:gpt-5.5 thinking=xhigh costs=91.31 -->
95f1e7a to 45a5f03
Responses to coder-agents-review findings:
Validation:
/coder-agents-review Please review the latest fixes.
Codex Review: Didn't find any major issues. More of your lovely PRs please.
/coder-agents-review
Panel re-review (4 reviewers, R5). All R4 findings verified fixed. No new findings.
DEREM-19 (stopReason override on completed goal): driver now re-reads the goal and checks for completion before defaulting to "session budget exceeded." Test confirms. DEREM-20 (safety limit untested): tested with continuationSafetyLimit: 1. DEREM-21 (indefinite hang): CLI_GOAL_STREAM_START_TIMEOUT_MS (5 minutes) now wired from run.ts. DEREM-22 (unnamed test cases): converted to named test.each.
Across 5 rounds, 22 findings were raised. 19 fixed, 1 accepted (DEREM-2, intentional script independence), 1 dropped (DEREM-13, unrelated drive-by), 1 closed (DEREM-1, superseded by extraction). The PR is thorough and responsive to review feedback.
"I tried to build a case against this change and could not. The problem is correctly understood, the solution is proportional, and the fix is at the right level of the causal chain." (Pariston)
🤖 This review was automatically generated with Coder Agents.
The two new `parseGoalBudgetFlag` and `parseGoalTurnsFlag` helpers in src/cli/run.ts (added in #3348) built their error strings with `'...' + value + '...'` concatenation, which is the only `throw new Error('...')` pair in the file — every other error (including the adjacent `parseMode` helper) uses backtick template literals. Convert these two messages to template literals for consistency. Pure stylistic alignment; the thrown Error has the same message text.
Summary
Adds opt-in strict Goal Run support to the Terminal-Bench adapter, including `MUX_RUN_AS_GOAL` plumbing, strict nonzero handling for incomplete goals, workflow controls, upload metadata, and tbench skill docs. Also documents `mux run --goal` semantics and adds `google/gemini-3.5-flash` to the nightly Terminal-Bench all-model matrix.
Background
Terminal-Bench tasks can now exercise the CLI goal loop by passing the task instruction as both stdin and `--goal`, while keeping scheduled nightly runs unchanged until a manual goal-mode A/B run is healthy.
Implementation
- `MUX_RUN_AS_GOAL` validation/forwarding in the Harbor adapter and shell runner.
- `3` after extracting token metadata.
- `mux_run_as_goal` in the reusable and nightly workflows, defaulting scheduled nightly runs to false.
- `all` model matrix.
Validation
- `PATH="$HOME/.local/bin:$PATH" make static-check`
- `go run github.com/rhysd/actionlint/cmd/actionlint@v1.7.7 .github/workflows/terminal-bench.yml .github/workflows/nightly-terminal-bench.yml`
- `PATH="$HOME/.local/bin:$PATH" UV_PYTHON=3.12 uvx --from "harbor[daytona]==0.6.4" --with pytest pytest benchmarks/terminal_bench/mux_agent_test.py`
- `bun test src/cli/run.test.ts src/node/services/workspaceGoalService.test.ts`
- `bun` shim verified `--goal` injection, stdin preservation, duplicate `--goal` rejection, token extraction, and exit `3` preservation.
Risks
Strict goal mode intentionally treats incomplete goals as agent execution failures, so it is opt-in for manual nightly dispatch before changing scheduled defaults.
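The strict failure handling can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the adapter's actual code: `CommandResult`, `finish_agent_run`, and the exact file names under `log_dir` are hypothetical, modeled on the PR's description of preserving stdout, stderr, and return-code files before raising.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CommandResult:
    """Hypothetical stand-in for the result of a sandbox command."""
    return_code: int
    stdout: str = ""
    stderr: str = ""


def finish_agent_run(result: CommandResult, log_dir: Path) -> None:
    """Persist the command logs first, then fail on any nonzero exit."""
    log_dir.mkdir(parents=True, exist_ok=True)
    (log_dir / "stdout.txt").write_text(result.stdout)
    (log_dir / "stderr.txt").write_text(result.stderr)
    (log_dir / "return-code.txt").write_text(str(result.return_code))

    # Exit 3 (incomplete goal) is deliberately not special-cased: in strict
    # mode it fails the run like any other nonzero exit.
    if result.return_code != 0:
        raise RuntimeError(
            f"mux agent command failed with exit code {result.return_code}"
        )
```

Writing the logs before raising is the important ordering: a failed goal run still leaves its artifacts behind for inspection.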
📋 Implementation Plan
Implementation Plan: Add optional `mux run --goal` mode to Terminal-Bench
Objective
Add an opt-in strict Terminal-Bench goal mode that runs each benchmark task as a Mux CLI Goal Run:
- `mux run --goal "<instruction>"`.
- `mux run` must exit successfully only when the persisted goal status becomes `complete`.
- If `mux run` exits `3`, the benchmark adapter should treat that as an agent execution failure (Option A, agreed with the user), not as a normal verifier-only outcome.
Recommended approach net product-code LoC estimate: 0 product LoC. This is benchmark/CI adapter work only. Expected non-product LoC: ~140–230 across shell, Python adapter/tests, workflow inputs, upload metadata, and skill/docs references.
Current verified setup
- `benchmarks/terminal_bench/mux_agent.py` stages a local Mux archive plus `mux-run.sh` into the Harbor sandbox, forwards provider/config env vars, and executes `bash /installed-agent/mux-run.sh <instruction>`.
- `benchmarks/terminal_bench/mux-run.sh` builds the base `cmd=(...)` array, then appends `MUX_RUN_ARGS` via intentional word-splitting and pipes the benchmark instruction to stdin.
- `.github/workflows/terminal-bench.yml` exposes `mux_run_args` and forwards it as `MUX_RUN_ARGS`.
- `.github/workflows/nightly-terminal-bench.yml` passes `mux_run_args` for smoke tests and nightly model matrix, currently without goal mode.
- `Makefile` target `benchmark-terminal` runs Harbor and exports `MUX_TIMEOUT_MS`; it does not need per-flag plumbing for `mux run` flags.
The newly implemented `mux run --goal` semantics make incomplete goals exit `3`; `mux-run.sh` intends to treat any nonzero `mux run` exit as fatal. `MuxAgent.run()` should also explicitly raise on nonzero sandbox command returns after preserving logs/token metadata, because it currently manages execution manually instead of relying on Harbor's base runner behavior.
Design decision
Add `MUX_RUN_AS_GOAL`
Introduce a benchmark adapter env var, `MUX_RUN_AS_GOAL`.
When enabled, `mux-run.sh` appends the current Terminal-Bench instruction as the CLI goal objective:
`cmd+=(--goal "${instruction}")`
Keep piping the instruction to stdin unchanged.
This intentionally exercises the handoff-defined behavior where a message/stdin kickoff can be distinct from the active objective, even though Terminal-Bench uses the same text for both.
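A rough Python sketch of the goal-mode command assembly in this plan, covering the accepted `MUX_RUN_AS_GOAL` values, the `--goal` element, and the duplicate `--goal` guard described later in the plan. The real runner is a Bash script; `parse_goal_flag` and `build_command` are hypothetical names, and `shlex.split` is only an approximation of the script's word-splitting of `MUX_RUN_ARGS`.

```python
import shlex
from typing import List, Mapping, Optional


def parse_goal_flag(raw: Optional[str]) -> bool:
    """Accept 1/true as enabled and 0/false/empty as disabled, case-insensitively."""
    value = (raw or "").strip().lower()
    if value in ("1", "true"):
        return True
    if value in ("", "0", "false"):
        return False
    raise ValueError(f"invalid MUX_RUN_AS_GOAL value: {raw!r}")


def build_command(base: List[str], instruction: str, env: Mapping[str, str]) -> List[str]:
    """Return the argv list; the caller still pipes the instruction on stdin."""
    cmd = list(base)
    extra = shlex.split(env.get("MUX_RUN_ARGS", ""))
    if parse_goal_flag(env.get("MUX_RUN_AS_GOAL")):
        # The adapter owns the per-task objective, so reject a caller-supplied one.
        # --goal-budget and --goal-turns do not match either test and stay allowed.
        if any(tok == "--goal" or tok.startswith("--goal=") for tok in extra):
            raise ValueError("MUX_RUN_ARGS must not contain --goal in goal mode")
        # A single argv element, so the instruction is never word-split.
        cmd += ["--goal", instruction]
    return cmd + extra
```

For example, `build_command(["mux", "run"], "fix the bug", {"MUX_RUN_AS_GOAL": "1", "MUX_RUN_ARGS": "--goal-turns 20"})` places `--goal` and the instruction ahead of the pass-through flags, matching the ordering the plan asks for.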
Goal limits stay in `MUX_RUN_ARGS`
Do not add bespoke env vars for `--goal-budget` or `--goal-turns` in the first pass. Users can supply them through the existing `MUX_RUN_ARGS` escape hatch.
Rationale: this keeps the integration minimal, avoids duplicating CLI flag schema in the adapter, and preserves `MUX_RUN_ARGS` as the canonical path for arbitrary `mux run` options.
Strict failure semantics (Option A)
Do not special-case exit `3` in `mux-run.sh`.
- If `mux run --goal` exits `0`, Harbor proceeds to verification as usual.
- If it exits `3`, `mux-run.sh` prints an error and exits with the original `3`; `MuxAgent.run()` raises after preserving logs/token metadata, and Harbor records an agent execution error.
- `mux run` exit code.
Duplicate `--goal` guard
When `MUX_RUN_AS_GOAL` is enabled, reject `MUX_RUN_ARGS` containing an explicit `--goal` or `--goal=...` token. The adapter owns the dynamic per-task objective in this mode; callers should use `MUX_RUN_ARGS` only for `--goal-budget`, `--goal-turns`, and other non-objective flags.
Observability marker
Goal-mode runs should be easy to separate from normal Terminal-Bench runs:
- `mux-run.sh` should log whether strict goal mode is enabled.
- Pass `MUX_RUN_AS_GOAL` through to result upload scripts.
- `scripts/upload-tbench-results.py` and `scripts/upload-harbor-results.py` should add a safe `mux_run_as_goal` row field derived from env. The upload helpers drop fields that are not present in the BigQuery schema, so this can land before the table column exists.
Implementation steps
Phase 0 — Preflight existing benchmark tests
Before editing benchmark files, run the existing adapter test file once:
`uvx --from "harbor[daytona]==0.6.4" --with pytest pytest benchmarks/terminal_bench/mux_agent_test.py`
If it fails on stale expectations unrelated to goal mode (for example old `MUX_THINKING_LEVEL`/`MUX_MODE` assumptions), fix those stale expectations as a separate minimal cleanup before adding new assertions. This keeps goal-mode failures distinct from pre-existing test drift.
Phase 1 — Adapter env plumbing
Update `benchmarks/terminal_bench/mux_agent.py`:
- Add `"MUX_RUN_AS_GOAL"` to `_CONFIG_ENV_KEYS` so it is forwarded into the sandbox.
- In `_env`:
  - Accept `0`, `1`, `true`, `false` case-insensitively.
  - Forward enabled values as `"1"`.
  - Omit disabled values (or forward `"0"` if keeping an explicit marker proves simpler for tests); keep the shell tolerant of both.
  - Raise `ValueError` for ambiguous values like `yes`, `goal`, `enabled`.
- In `MuxAgent.run()`, preserve command stdout/stderr/return-code files, download `/tmp/mux-tokens.json`, populate context, and then raise `RuntimeError` for any nonzero command return in all modes. This aligns Harbor results with `mux-run.sh`'s fatal intent and enforces strict Option A for goal exit `3`; the current manual runner should not silently treat a failed agent command as success.
Update `benchmarks/terminal_bench/mux-run.sh`:
- Define `MUX_RUN_AS_GOAL="${MUX_RUN_AS_GOAL:-}"` near the other `MUX_*` env reads.
- Compute a normalized local boolean (for example `mux_run_as_goal_normalized="${MUX_RUN_AS_GOAL,,}"`) and accept `1`/`true` as enabled, `0`/`false`/empty as disabled. Fail fast for anything else when the script is called directly outside `MuxAgent`.
- After the base `cmd=(...)` is created, before `MUX_RUN_ARGS` is appended, add `--goal "$instruction"` when goal mode is enabled.
- Before appending `MUX_RUN_ARGS`, split them into an array once and reject exact `--goal` or `--goal=...` tokens when goal mode is enabled. Do not reject `--goal-budget` or `--goal-turns`.
- Keep `--goal "$instruction"` before `MUX_RUN_ARGS` so callers can still add `--goal-budget`, `--goal-turns`, `--thinking`, `--budget`, etc. afterward.
- Restructure the pipeline so token extraction still runs after a nonzero `mux run` exit while preserving the original exit code:
  - disable `errexit` around the pipeline (`set +e`),
  - run `printf ... | "${cmd[@]}" ... | tee ...`,
  - capture `PIPESTATUS` immediately and use the actual mux command status (the middle pipeline element) as `mux_status`,
  - restore `set -e`,
  - extract tokens from `MUX_OUTPUT_FILE`,
  - if `mux_status` is nonzero, print a clear `[mux-run] ERROR: mux agent session failed` message and `exit "${mux_status}"` instead of calling `fatal`, because `fatal` would collapse exit `3` to `1`.
- Do not add any exit-code override; a stored exit `3` should remain fatal under Option A and remain visible as `3`.
Quality gate after Phase 1:
Phase 2 — Unit coverage
Update `benchmarks/terminal_bench/mux_agent_test.py` with focused adapter tests:
- `test_goal_mode_env_is_forwarded`:
  - Set `MUX_RUN_AS_GOAL=1`.
  - Assert `agent._env["MUX_RUN_AS_GOAL"] == "1"`.
- `test_goal_mode_defaults_to_disabled`:
  - Assert `"MUX_RUN_AS_GOAL" not in agent._env` when unset (or equals `"0"` if implementation chooses explicit false forwarding consistently).
- `test_goal_mode_rejects_invalid_values`:
  - Set `MUX_RUN_AS_GOAL=yes`.
  - Assert `_env` raises `ValueError`.
- `test_run_raises_after_preserving_logs_for_nonzero_exit`:
  - Use a `BaseEnvironment` object whose `exec()` returns a nonzero return code and whose `download_file()` either writes a token file or raises as needed.
  - Assert `MuxAgent.run()` writes `return-code.txt`/stdout/stderr before raising.
If there is a lightweight way to assert command construction without running Harbor, add a test around `create_run_agent_commands` only for preserving the instruction argument. Do not assert shell internals tautologically; the meaningful behavior is env forwarding and that the generated command remains a single runner invocation with the instruction quoted.
Potential pre-existing test issue to verify before editing: `mux_agent_test.py` appears to assert `MUX_THINKING_LEVEL` and `MUX_MODE`, while current `MuxAgent._env` evidence did not show those env keys. Run the targeted pytest before changes; if it already fails, fix or update stale tests as a separate minimal cleanup in the same patch, documenting that the goal-mode tests exposed stale expectations.
Quality gate after Phase 2:
`uvx --from "harbor[daytona]==0.6.4" --with pytest pytest benchmarks/terminal_bench/mux_agent_test.py`
If the repo's preferred Python test invocation differs, use the smallest available equivalent.
Phase 3 — CI workflow input
Update `.github/workflows/terminal-bench.yml`:
- Add a `workflow_call.inputs.mux_run_as_goal` boolean input.
- Add a matching `workflow_dispatch.inputs.mux_run_as_goal` boolean input.
- In the `Run Terminal-Bench` step env, add `MUX_RUN_AS_GOAL`.
- In both BigQuery upload steps, pass the same env marker so upload scripts can annotate rows.
Update `scripts/upload-tbench-results.py` and `scripts/upload-harbor-results.py`:
- Read `MUX_RUN_AS_GOAL` from env.
- Add a `mux_run_as_goal` row field.
Update the `mux_run_args` descriptions to mention goal limits when goal mode is enabled.
Quality gate after Phase 3:
- Run `make static-check` later; it includes formatting/linting relevant to repo checks.
Phase 4 — Nightly workflow opt-in only
Update `.github/workflows/nightly-terminal-bench.yml`:
- Add a manual `workflow_dispatch` boolean input, `mux_run_as_goal`.
- Pass it through to every reusable `terminal-bench.yml` job with an expression that is explicitly false for scheduled events.
- Keep scheduled runs disabled by default. Scheduled events should continue to behave exactly as today; do not rely on undefined `inputs` behavior for scheduled runs.
Rationale: this allows a full manual A/B nightly-style run without changing scheduled leaderboard-tracking behavior.
Quality gate after Phase 4:
Phase 5 — Skill/docs updates
Update the project `tbench` skill documentation (`.mux/skills/tbench/SKILL.md` or the source skill location if applicable) only if project convention expects skill docs to track benchmark invocation options.
Add examples:
If `make fmt` regenerates `src/node/builtinSkills/mux-docs.md` or generated skill content, include those generated changes only when the repo's tooling requires them.
Quality gate after Phase 5:
Dogfooding plan
Dogfooding must capture both a terminal recording and a final screenshot artifact.
Local smoke dogfood
Run a very small Terminal-Bench task in strict goal mode:
Record it:
`TERM=xterm-256color COLUMNS=120 LINES=36 asciinema rec --overwrite --command 'MUX_RUN_AS_GOAL=1 MUX_RUN_ARGS="--thinking high --goal-turns 20 --goal-budget 5.00" make benchmark-terminal TB_TASK_NAMES="chess-best-move" TB_CONCURRENCY=1 TB_TIMEOUT=1800' /tmp/mux-tbench-goal.cast`
Generate a terminal screenshot SVG from the asciicast and attach it for review. If the full benchmark is too expensive locally, run a workflow-dispatch smoke instead and record the local `gh workflow run` + `gh run watch` session, then download and inspect the artifact logs.
Evidence to collect
- `/logs/agent/command-0/stdout.txt` or artifact equivalent contains JSONL with:
  - `goal-started`
  - `goal-completed` + `run-complete.goal.status == "complete"`, or a strict failure if the agent does not complete the goal
- `/logs/agent/command-0/return-code.txt` or the downloaded command log records the runner exit code.
- `/tmp/mux-tokens.json`/`mux-tokens.json` when enough JSONL was emitted before a nonzero exit.
- `jobs/<timestamp>/result.json`.
- `mux run` exit `3` (expected strict Option A) versus setup/infrastructure.
Automated validation plan
Run in this order:
1. Python adapter tests: `uvx --from "harbor[daytona]==0.6.4" --with pytest pytest benchmarks/terminal_bench/mux_agent_test.py`
2. Shell/workflow/static validation:
3. If `make static-check` asks for generated docs/formatting:
4. Optional targeted dry-run command construction check:
Only run this if provider credentials and environment are available; otherwise rely on CI workflow dispatch.
Acceptance criteria
- `MUX_RUN_AS_GOAL` is forwarded by the Terminal-Bench agent adapter into the sandbox.
- With `MUX_RUN_AS_GOAL=1`, `mux-run.sh` invokes `mux run --goal "$instruction"` while still piping the instruction on stdin.
- `MUX_RUN_ARGS` remains the only generic pass-through for `--goal-budget`, `--goal-turns`, and other `mux run` flags.
- `MUX_RUN_AS_GOAL=1` rejects duplicate `--goal`/`--goal=...` supplied through `MUX_RUN_ARGS`.
- Exit `3` from incomplete goals remains fatal in the adapter (strict Option A), and `MuxAgent.run()` explicitly raises on the nonzero command after preserving logs/token metadata.
- `mux_run_as_goal` for `workflow_call` and `workflow_dispatch`.
- `make static-check` passes.
Risks and mitigations
- `--goal "$instruction"` as a Bash array element, not through `MUX_RUN_ARGS` word-splitting.
- `MUX_RUN_AS_GOAL` in `MuxAgent._env` and fail fast.
- Exit `3` obscures verifier pass/fail. Mitigation: this is intended for Option A; preserve JSONL/token artifacts before raising, annotate rows/logs with `mux_run_as_goal`, and only consider Option B later if strict failures are too noisy.
Future follow-up, not in this patch
If strict mode proves too noisy, add a separate opt-in compatibility mode such as `MUX_GOAL_ALLOW_INCOMPLETE=1` that lets exit `3` proceed to verification while recording goal status. Do not include that in the first implementation because the user explicitly selected Option A.
Generated with `mux` • Model: `openai:gpt-5.5` • Thinking: `xhigh` • Cost: `3134897{MUX_COSTS_USD:-0}`