Skip to content

Conversation

@ammar-agent
Copy link
Collaborator

Problem

Terminal-bench was failing with all tasks showing 0 input/output tokens because the agent was exiting immediately after receiving the user message, without making any API calls.

Symptoms:

  • Latest nightly runs: All 80 tasks failed with agent_timeout
  • Agent ran for only ~45 seconds then exited
  • total_input_tokens: 0, total_output_tokens: 0
  • Stream started with caught-up and user message, but no stream-delta or stream-end events

Root cause:
The agentSessionCli.ts reads the user message from stdin via a pipe:

printf '%s' "$instruction" | bun src/debug/agentSessionCli.ts ...

Once stdin reaches EOF and is consumed, Bun detects no other active handles keeping the event loop alive and exits the process, even though async work (API streaming) is still pending.

Solution

Add an explicit keepalive interval that ensures the process stays alive until main() completes. The interval runs far into the future (1000 seconds) but gets cleared in the finally block once the agent session finishes.

Testing

Before fix:

  • Run 19173435224: 1 task, 0 tokens, ~2 min total (agent ran 45s)
  • Agent exited immediately after user message

After fix:

  • Run 19173548174: 1 task, resolved: true, ~7 min total (agent ran 3m17s)
  • 22 tool calls made
  • Stream-delta events present
  • Agent completed successfully

Related

Generated with cmux

PR #507 added `dist/` to the terminal-bench archive include paths to fix worker crashes. However, the workflow wasn't building `dist/` before running the benchmark, causing all tasks to fail immediately with:

```
Error running agent for task <name>: Required file /home/runner/work/cmux/cmux/dist missing
```

Now runs `make build` before `make benchmark-terminal` to ensure dist/ exists and contains the compiled worker files.

Verified with workflow run #19140594821 which successfully completed the modernize-fortran-build task.
Icons aren't needed for terminal-bench, and building them requires ImageMagick.
Build only the essential JavaScript bundles needed for the benchmark.
The 'should not hang on commands that read stdin' test was flaky in CI:
- Local: took 5073ms when expecting <5000ms (73ms over)
- SSH: took 8645ms when expecting <8000ms (645ms over)

Increased timeouts to provide headroom for CI runner variability:
- Local: 5000ms → 6000ms (+20%)
- SSH: 8000ms → 10000ms (+25%)

These timeouts verify the command completes quickly (not hanging until
the bash tool's 180s timeout), while accounting for CI slowness.
CI continues to show high variability:
- runtimeExecuteBash local: 5073ms → 7079ms (trending up)
- runtimeExecuteBash SSH: 8645ms (within new limits)
- initWorkspace SSH: 12127ms when expecting <10000ms

Increased timeouts to be more generous:
- Local runtime: 6000ms → 10000ms (+67%)
- SSH runtime: 10000ms → 15000ms (+50%)
- Init queue check: 10000ms → 15000ms (+50%)

These tests verify operations complete quickly (not hanging until the
bash tool's 180s timeout). The large headroom accounts for CI
slowness while still catching actual hangs.
Terminal-bench was failing with all tasks showing 0 input/output tokens
because Bun was exiting immediately after stdin closed, even though async
work (API calls) was still pending.

The agentSessionCli reads the user message from stdin, then waits for
stream completion. However, once stdin reaches EOF and is consumed, Bun
may exit if it detects no other active handles keeping the event loop alive.

Add an explicit keepalive interval that ensures the process stays alive
until main() completes. This interval runs far into the future but gets
cleared in the finally block once the agent session finishes.

This fixes the nightly terminal-bench failures where all 80 tasks were
failing with agent_timeout and 0 tokens.
@ammar-agent
Copy link
Collaborator Author

Closing in favor of cleaner PR from main without merge conflicts.

@ammar-agent ammar-agent closed this Nov 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant