| 1 | +# Terminal-Bench Integration |
| 2 | + |
| 3 | +This directory contains the cmux agent adapter for [Terminal-Bench](https://github.com/benediktstroebl/terminal-bench), a benchmarking framework for evaluating agentic CLI/terminal capabilities. |
| 4 | + |
| 5 | +## Quick Start |
| 6 | + |
| 7 | +```bash |
| 8 | +# Run full benchmark suite (80 tasks, ~2.5 hours) |
| 9 | +make benchmark-terminal |
| 10 | + |
| 11 | +# Run with sample of 5 tasks |
| 12 | +TB_SAMPLE_SIZE=5 make benchmark-terminal |
| 13 | + |
| 14 | +# Run specific tasks |
| 15 | +make benchmark-terminal TB_ARGS="--task-id hello-world --task-id chess-best-move" |
| 16 | + |
| 17 | +# Run with specific model |
| 18 | +make benchmark-terminal TB_ARGS="--agent-kwarg model_name=anthropic:claude-opus-4" |
| 19 | +``` |
| 20 | + |
| 21 | +## Configuration |
| 22 | + |
| 23 | +### Environment Variables |
| 24 | + |
| 25 | +- `TB_DATASET`: Dataset to use (default: `terminal-bench-core==0.1.1`) |
| 26 | +- `TB_SAMPLE_SIZE`: Number of randomly sampled tasks to run (default: all 80 tasks) |
| 27 | +- `TB_CONCURRENCY`: Number of tasks to run concurrently (default: 4) |
| 28 | +- `TB_LIVESTREAM`: Set to `1` to enable livestream mode |
| 29 | +- `TB_TIMEOUT`: Global timeout in seconds (default: 1800 = 30 minutes) |
| 30 | +- `TB_ARGS`: Additional arguments passed to terminal-bench |
| 31 | + |
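The defaults listed above can be summarized in a small resolver. This is only a sketch for illustration: `resolve_settings` and `DEFAULTS` are hypothetical names, the real defaults are applied by the Makefile, and the empty default for `TB_ARGS` and the "off unless `1`" handling of `TB_LIVESTREAM` are assumptions.

```python
import os

# Defaults copied from the list above. None means "no documented default".
# The empty string for TB_ARGS is an assumption, not taken from the Makefile.
DEFAULTS = {
    "TB_DATASET": "terminal-bench-core==0.1.1",
    "TB_SAMPLE_SIZE": None,  # unset means run every task
    "TB_CONCURRENCY": "4",
    "TB_LIVESTREAM": None,
    "TB_TIMEOUT": "1800",
    "TB_ARGS": "",
}


def resolve_settings(environ=None):
    """Return typed settings, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    raw = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        "dataset": raw["TB_DATASET"],
        "sample_size": int(raw["TB_SAMPLE_SIZE"]) if raw["TB_SAMPLE_SIZE"] else None,
        "concurrency": int(raw["TB_CONCURRENCY"]),
        "livestream": raw["TB_LIVESTREAM"] == "1",
        "timeout_seconds": int(raw["TB_TIMEOUT"]),
        "extra_args": raw["TB_ARGS"],
    }
```

Passing a plain dict instead of `os.environ` makes it easy to check what a given combination of variables would resolve to before starting a long run.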
| 32 | +### Timeout Handling |
| 33 | + |
| 34 | +The benchmark uses a **global timeout** applied to all tasks. The default is **30 minutes (1800 seconds)**, which provides sufficient time for most tasks while catching genuinely stuck agents. |
| 35 | + |
| 36 | +**Design Rationale:** |
| 37 | + |
| 38 | +Based on analysis of the Oct 30, 2025 nightly runs: |
| 39 | +- Longest successful task: `blind-maze-explorer-algorithm.hard` at 20 minutes |
| 40 | +- 95th percentile: ~15 minutes |
| 41 | +- Mean duration: ~6-7 minutes (356s Anthropic / 438s OpenAI) |
| 42 | + |
| 43 | +The 30-minute default provides comfortable headroom for complex tasks without excessive wait times for failed attempts. |
| 44 | + |
| 45 | +**Override timeout:** |
| 46 | + |
| 47 | +```bash |
| 48 | +# Run with 60 minute timeout for very complex tasks |
| 49 | +TB_TIMEOUT=3600 make benchmark-terminal |
| 50 | + |
| 51 | +# Run with shorter 10 minute timeout for quick iteration |
| 52 | +TB_TIMEOUT=600 make benchmark-terminal TB_SAMPLE_SIZE=5 |
| 53 | +``` |
| 54 | + |
| 55 | +**Note:** We prefer a single global timeout over per-task configuration to limit complexity and maintenance burden. If tasks consistently time out, increase `TB_TIMEOUT` rather than adding per-task overrides. |
| 56 | + |
| 57 | +## Agent Configuration |
| 58 | + |
| 59 | +The cmux agent supports the following kwargs (passed via `--agent-kwarg`): |
| 60 | + |
| 61 | +- `model_name`: Model to use (e.g., `anthropic:claude-sonnet-4-5`, `openai:gpt-5-codex`) |
| 62 | +- `thinking_level`: Thinking level (`off`, `low`, `medium`, `high`) |
| 63 | +- `mode`: Agent mode (`plan`, `exec`) |
| 64 | + |
| 65 | +**Example:** |
| 66 | + |
| 67 | +```bash |
| 68 | +make benchmark-terminal TB_ARGS="--agent-kwarg model_name=openai:gpt-5-codex --agent-kwarg thinking_level=high" |
| 69 | +``` |
| 70 | + |
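When scripting several runs, the `TB_ARGS` string can be assembled from a dict rather than typed by hand. A minimal sketch, assuming only the `--task-id` and `--agent-kwarg` flags shown in this README; `build_tb_args` is a hypothetical helper, not part of the adapter.

```python
import shlex


def build_tb_args(agent_kwargs, task_ids=()):
    """Build a TB_ARGS value from agent kwargs and optional task ids."""
    parts = []
    for task_id in task_ids:
        parts += ["--task-id", task_id]
    for key, value in agent_kwargs.items():
        parts += ["--agent-kwarg", f"{key}={value}"]
    # shlex.join quotes any part that contains shell-special characters.
    return shlex.join(parts)


tb_args = build_tb_args(
    {"model_name": "openai:gpt-5-codex", "thinking_level": "high"}
)
print(f'make benchmark-terminal TB_ARGS="{tb_args}"')
```

For the kwargs above this reproduces the example command, since none of the values need quoting.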
| 71 | +## Results |
| 72 | + |
| 73 | +Results are saved to `runs/YYYY-MM-DD__HH-MM-SS/`: |
| 74 | + |
| 75 | +- `results.json`: Aggregate results with pass/fail rates |
| 76 | +- `run_metadata.json`: Run configuration and metadata |
| 77 | +- `<task-id>/`: Per-task directories containing: |
| 78 | + - `sessions/agent.log`: Full agent execution log |
| 79 | + - `sessions/agent.cast`: Asciinema recording of agent session |
| 80 | + - `sessions/tests.log`: Test execution output |
| 81 | + - `results.json`: Per-trial results |
| 82 | + |
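The layout above can be checked mechanically after a run, for example to spot tasks that crashed before writing their logs. The sketch below relies only on the file names listed in this section and makes no assumption about the JSON schema; `missing_artifacts` and `EXPECTED_TASK_FILES` are hypothetical names.

```python
from pathlib import Path

# Per-task files named in the layout above, relative to each <task-id>/ directory.
EXPECTED_TASK_FILES = (
    "sessions/agent.log",
    "sessions/agent.cast",
    "sessions/tests.log",
    "results.json",
)


def missing_artifacts(run_dir):
    """Map each task directory name to the expected files it lacks."""
    report = {}
    task_dirs = sorted(p for p in Path(run_dir).iterdir() if p.is_dir())
    for task_dir in task_dirs:
        missing = [
            rel for rel in EXPECTED_TASK_FILES if not (task_dir / rel).is_file()
        ]
        if missing:
            report[task_dir.name] = missing
    return report
```

Calling `missing_artifacts("runs/<timestamp>")` on a run directory returns an empty dict when every task directory is complete.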
| 83 | +## CI/CD Integration |
| 84 | + |
| 85 | +See `.github/workflows/terminal-bench.yml` and `.github/workflows/nightly-terminal-bench.yml` for GitHub Actions integration. |
| 86 | + |
| 87 | +The **nightly workflow** runs both Claude and GPT models on the full 80-task suite and uploads the results as artifacts. |
| 88 | + |
| 89 | +## Timeout Analysis (2025-10-30 Nightly Run) |
| 90 | + |
| 91 | +Based on analysis of the Oct 30 nightly run (15-minute timeout): |
| 92 | + |
| 93 | +- **27-35% of tasks hit the timeout**, indicating the 15-minute limit was too aggressive |
| 94 | +- **5-6 tasks passed their tests but were still flagged as timed out** (false negatives) |
| 95 | +- **Mean duration**: 356s (Anthropic) / 438s (OpenAI) |
| 96 | +- **Median duration**: 272s (Anthropic) / 299s (OpenAI) |
| 97 | +- **Longest successful**: 1200s (20 minutes) for `blind-maze-explorer-algorithm.hard` |
| 98 | + |
| 99 | +**Impact of 30-minute timeout**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%). |
| 100 | + |
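The arithmetic behind these figures is easy to reproduce. The sketch below uses only numbers quoted in this README; the function names are hypothetical and the projection is a simple addition of percentage points, not a model of the benchmark.

```python
def projected_pass_rate(baseline_pct, gain_low_pp, gain_high_pp):
    """Add a percentage-point gain range to a baseline pass rate."""
    return baseline_pct + gain_low_pp, baseline_pct + gain_high_pp


def headroom_ratio(timeout_s, longest_success_s):
    """How many times longer the timeout is than the longest successful task."""
    return timeout_s / longest_success_s


def to_minutes(seconds):
    """Convert seconds to minutes, rounded to one decimal place."""
    return round(seconds / 60, 1)


print(projected_pass_rate(42, 10, 15))     # (52, 57)
print(headroom_ratio(1800, 1200))          # 1.5
print(to_minutes(356), to_minutes(438))    # 5.9 7.3
```

The 1.5x ratio is the headroom the 30-minute default leaves over the longest successful task observed in that run.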
| 101 | +## Files |
| 102 | + |
| 103 | +- `cmux_agent.py`: Main agent adapter implementing Terminal-Bench's agent interface |
| 104 | +- `cmux-run.sh`: Shell script that sets up environment and invokes cmux CLI |
| 105 | +- `cmux_payload.py`: Helper to package cmux app for containerized execution |
| 106 | +- `cmux_setup.sh.j2`: Jinja2 template for agent installation script |
| 107 | +- `sample_tasks.py`: Utility to randomly sample tasks from dataset |