Remove per-task timeout complexity in favor of single global default.
- Remove task_timeouts.py, calculate_timeout.py, and tests
- Set 30-minute (1800s) global timeout default in Makefile
- Update README to document preference for global defaults
- Based on Oct 30 nightly run analysis: longest successful task was 20 min
Rationale: Avoid maintenance burden of per-task configuration.
If tasks time out, users can increase TB_TIMEOUT globally.
benchmarks/terminal_bench/README.md
+20 -36 (20 additions & 36 deletions)
@@ -26,48 +26,33 @@ make benchmark-terminal TB_ARGS="--agent-kwarg model_name=anthropic:claude-opus-
 - `TB_SAMPLE_SIZE`: Number of random tasks to run (default: all 80 tasks)
 - `TB_CONCURRENCY`: Number of concurrent tasks (default: 4)
 - `TB_LIVESTREAM`: Enable livestream mode (set to `1` to enable)
-- `TB_TIMEOUT`: Override timeout in seconds (default: intelligent per-task timeout)
+- `TB_TIMEOUT`: Global timeout in seconds (default: 1800 = 30 minutes)
 - `TB_ARGS`: Additional arguments passed to terminal-bench

-### Intelligent Timeout Handling
+### Timeout Handling

-The Makefile automatically calculates optimal timeouts based on task complexity:
+The benchmark uses a **global timeout** applied to all tasks. The default is **30 minutes (1800 seconds)**, which provides sufficient time for most tasks while catching genuinely stuck agents.

-- **FAST tasks** (5 min): Simple operations like `hello-world`, `fix-permissions`
-- **NORMAL tasks** (15 min): Default for most tasks
-- **SLOW tasks** (30 min): Data processing, ML training, complex analysis
-- **VERY_SLOW tasks** (60 min): Kernel compilation, large builds
+**Design Rationale:**

-**How it works:**
+Based on analysis of Oct 30, 2025 nightly runs:
+- Longest successful task: `blind-maze-explorer-algorithm.hard` at 20 minutes
+- 95th percentile: ~15 minutes
+- Mean duration: ~6 minutes

-1. If `TB_TIMEOUT` is set, uses that value explicitly
-2. If specific tasks are selected (via `TB_SAMPLE_SIZE` or `--task-id`), calculates the maximum timeout needed for those tasks
-3. For full suite runs, uses 60 minutes (conservative default)
+The 30-minute default provides comfortable headroom for complex tasks without excessive wait times for failed attempts.

-**Examples:**
+**Override timeout:**

 ```bash
-# Fast tasks get 5 minute timeout automatically
-make benchmark-terminal TB_ARGS="--task-id hello-world --task-id simple-web-scraper"
+# Run with 60 minute timeout for very complex tasks
+TB_TIMEOUT=3600 make benchmark-terminal

-# Slow tasks get 60 minute timeout automatically
-make benchmark-terminal TB_ARGS="--task-id build-linux-kernel-qemu"
-
-# Override timeout manually (in seconds)
-TB_TIMEOUT=1200 make benchmark-terminal TB_ARGS="--task-id chess-best-move"
+# Run with shorter 10 minute timeout for quick iteration
+TB_TIMEOUT=600 make benchmark-terminal TB_SAMPLE_SIZE=5
 ```

-### Task Timeout Configuration
-
-Task timeouts are configured in `task_timeouts.py` based on empirical data from nightly runs. To add or modify timeouts:
-
-```python
-# In task_timeouts.py
-TASK_TIMEOUTS = {
-    "my-new-task": SLOW_TIMEOUT,  # 30 minutes
-    "my-fast-task": FAST_TIMEOUT,  # 5 minutes
-}
-```
+**Note:** We prefer global timeout defaults over per-task configuration to avoid complexity and maintenance burden. If you find tasks consistently timing out, increase `TB_TIMEOUT` rather than adding per-task configuration.

 ## Agent Configuration

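The override rule documented in the hunk above (use `TB_TIMEOUT` when it is set, otherwise fall back to the 1800-second global default) can be sketched in Python. This is only an illustration under stated assumptions: the commit sets the default in the Makefile, whose contents are not shown here, and `resolve_timeout` and `DEFAULT_TIMEOUT_SECONDS` are hypothetical names that do not come from the repository.

```python
import os

# Global default named in the README: 30 minutes.
DEFAULT_TIMEOUT_SECONDS = 1800


def resolve_timeout(env=None):
    """Return the timeout in seconds (hypothetical helper, not repository code).

    Uses TB_TIMEOUT when it is set to a positive integer, otherwise the
    global default. There is no per-task lookup, mirroring the README's
    stated preference for a single global value.
    """
    env = os.environ if env is None else env
    raw = env.get("TB_TIMEOUT", "").strip()
    if not raw:
        return DEFAULT_TIMEOUT_SECONDS
    timeout = int(raw)  # raises ValueError on non-numeric input
    if timeout <= 0:
        raise ValueError(f"TB_TIMEOUT must be a positive number of seconds, got {raw!r}")
    return timeout


if __name__ == "__main__":
    print(resolve_timeout({}))                      # default applies
    print(resolve_timeout({"TB_TIMEOUT": "3600"}))  # explicit override
```

Passing the environment as a plain mapping keeps the sketch easy to exercise without touching the real process environment.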
@@ -103,21 +88,20 @@ See `.github/workflows/terminal-bench.yml` and `.github/workflows/nightly-termin

 ## Timeout Analysis (2025-10-30 Nightly Run)

-Based on analysis of the Oct 30 nightly run:
+Based on analysis of the Oct 30 nightly run (15-minute timeout):

-- **27-35% of tasks hit timeout** with 15-minute default
-- **5-6 tasks passed tests but hit timeout** (would have succeeded with more time)
+- **27-35% of tasks hit timeout** (too aggressive)
+- **5-6 tasks passed tests but hit timeout flag** (false negatives)
 - **Longest successful**: 1200s (20 minutes) for `blind-maze-explorer-algorithm.hard`

-**Impact of intelligent timeouts**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%).
+**Impact of 30-minute timeout**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%).

 ## Files

 - `cmux_agent.py`: Main agent adapter implementing Terminal-Bench's agent interface
 - `cmux-run.sh`: Shell script that sets up environment and invokes cmux CLI
 - `cmux_payload.py`: Helper to package cmux app for containerized execution
 - `cmux_setup.sh.j2`: Jinja2 template for agent installation script
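The headroom claim in the README changes can be checked with simple arithmetic. The sketch below uses only the durations quoted in this diff (1200 s longest successful task, roughly 900 s at the 95th percentile, roughly 360 s mean); the `headroom` helper is a hypothetical name introduced for illustration and is not part of the repository.

```python
# Figures quoted in the README diff; the percentile and mean are approximate.
TIMEOUT_SECONDS = 1800
OBSERVED_SECONDS = {
    "longest successful task": 1200,
    "95th percentile (approx.)": 900,
    "mean (approx.)": 360,
}


def headroom(timeout_s, duration_s):
    """Ratio of the timeout to an observed duration (hypothetical helper)."""
    return timeout_s / duration_s


if __name__ == "__main__":
    for label, seconds in OBSERVED_SECONDS.items():
        ratio = headroom(TIMEOUT_SECONDS, seconds)
        print(f"{label}: {seconds // 60} min, headroom {ratio:.1f}x")
```

Under these figures the 30-minute default leaves 1.5x headroom over the longest successful task, whereas the earlier 15-minute limit (900 s) sat below that task's 1200 s runtime.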