-
Notifications
You must be signed in to change notification settings - Fork 13
🤖 fix: build dist/ before running terminal-bench #513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
PR #507 added `dist/` to the terminal-bench archive include paths to fix worker crashes. However, the workflow wasn't building `dist/` before running the benchmark, causing all tasks to fail immediately with: ``` Error running agent for task <name>: Required file /home/runner/work/cmux/cmux/dist missing ``` Now runs `make build` before `make benchmark-terminal` to ensure dist/ exists and contains the compiled worker files. Verified with workflow run #19140594821 which successfully completed the modernize-fortran-build task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Icons aren't needed for terminal-bench, and building them requires ImageMagick. Build only the essential JavaScript bundles needed for the benchmark.
The 'should not hang on commands that read stdin' test was flaky in CI: - Local: took 5073ms when expecting <5000ms (73ms over) - SSH: took 8645ms when expecting <8000ms (645ms over) Increased timeouts to provide headroom for CI runner variability: - Local: 5000ms → 6000ms (+20%) - SSH: 8000ms → 10000ms (+25%) These timeouts verify the command completes quickly (not hanging until the bash tool's 180s timeout), while accounting for CI slowness.
CI continues to show high variability: - runtimeExecuteBash local: 5073ms → 7079ms (trending up) - runtimeExecuteBash SSH: 8645ms (within new limits) - initWorkspace SSH: 12127ms when expecting <10000ms Increased timeouts to be more generous: - Local runtime: 6000ms → 10000ms (+67%) - SSH runtime: 10000ms → 15000ms (+50%) - Init queue check: 10000ms → 15000ms (+50%) These tests verify operations complete quickly (not hanging until the bash tool's 180s timeout). The large headroom accounts for CI slowness while still catching actual hangs.
## Problem Terminal-bench was failing with all tasks showing 0 input/output tokens because the agent was exiting immediately after receiving the user message, without making any API calls. **Symptoms:** - Latest nightly runs (Nov 5-7): All 80 tasks failed with `agent_timeout` - Agent ran for only ~45 seconds then exited - `total_input_tokens: 0`, `total_output_tokens: 0` - Stream started with `caught-up` and `user` message, but no `stream-delta` or `stream-end` events **Root cause:** The `agentSessionCli.ts` reads the user message from stdin via a pipe: ```bash printf '%s' "$instruction" | bun src/debug/agentSessionCli.ts ... ``` Once stdin reaches EOF and is consumed, Bun detects no other active handles keeping the event loop alive and exits the process, **even though async work (API streaming) is still pending**. ## Solution Add an explicit keepalive interval that ensures the process stays alive until `main()` completes. The interval runs far into the future (1000 seconds) but gets cleared in the finally block once the agent session finishes. ## Testing **Before fix:** - Run #19173435224: 1 task, 0 tokens, ~2 min total (agent ran 45s) - Agent exited immediately after user message **After fix:** - Run #19173548174: 1 task, **resolved: true**, ~7 min total (agent ran 3m17s) - 22 tool calls made - Stream-delta events present - Agent completed successfully ## Related - Fixes nightly terminal-bench failures from Nov 5-7 - Related to PR #507 (dist/ in archive) and PR #513 (build step in workflow) _Generated with `cmux`_
## Problem Terminal-bench was failing with all tasks showing 0 input/output tokens because the agent was exiting immediately after receiving the user message, without making any API calls. **Symptoms:** - Latest nightly runs (Nov 5-7): All 80 tasks failed with `agent_timeout` - Agent ran for only ~45 seconds then exited - `total_input_tokens: 0`, `total_output_tokens: 0` - Stream started with `caught-up` and `user` message, but no `stream-delta` or `stream-end` events **Root cause:** The `agentSessionCli.ts` reads the user message from stdin via a pipe: ```bash printf '%s' "$instruction" | bun src/debug/agentSessionCli.ts ... ``` Once stdin reaches EOF and is consumed, Bun detects no other active handles keeping the event loop alive and exits the process, **even though async work (API streaming) is still pending**. ## Solution Add an explicit keepalive interval that ensures the process stays alive until `main()` completes. The interval runs far into the future (1000 seconds) but gets cleared in the finally block once the agent session finishes. ## Testing **Before fix:** - Run #19173435224: 1 task, 0 tokens, ~2 min total (agent ran 45s) - Agent exited immediately after user message **After fix:** - Run #19173548174: 1 task, **resolved: true**, ~7 min total (agent ran 3m17s) - 22 tool calls made - Stream-delta events present - Agent completed successfully ## Related - Fixes nightly terminal-bench failures from Nov 5-7 - Related to PR #507 (dist/ in archive) and PR #513 (build step in workflow) _Generated with `cmux`_
Problem
PR #507 added
dist/to the terminal-bench archive include paths to fix worker crashes. However, the workflow wasn't buildingdist/before running the benchmark, causing all tasks to fail immediately with:Solution
Add
make buildstep beforemake benchmark-terminalin the workflow. This ensures:dist/directory existsTesting
Verified with workflow run #19140594821 which successfully completed the modernize-fortran-build task:
Generated with
cmux