Commit eaffa08

🤖 Simplify to global timeout approach

Remove per-task timeout complexity in favor of single global default.

- Remove task_timeouts.py, calculate_timeout.py, and tests
- Set 30-minute (1800s) global timeout default in Makefile
- Update README to document preference for global defaults
- Based on Oct 30 nightly run analysis: longest successful task was 20 min

Rationale: Avoid maintenance burden of per-task configuration. If tasks timeout, users can increase TB_TIMEOUT globally.
1 parent 8b8a62b commit eaffa08


5 files changed, +23 -269 lines changed


Makefile

Lines changed: 3 additions & 15 deletions
@@ -297,10 +297,10 @@ chromatic: node_modules/.installed ## Run Chromatic for visual regression testin
 ## Benchmarks
 benchmark-terminal: ## Run Terminal-Bench with the cmux agent (use TB_DATASET/TB_SAMPLE_SIZE/TB_TIMEOUT/TB_ARGS to customize)
 @TB_DATASET=$${TB_DATASET:-terminal-bench-core==0.1.1}; \
+TB_TIMEOUT=$${TB_TIMEOUT:-1800}; \
 CONCURRENCY_FLAG=$${TB_CONCURRENCY:+--n-concurrent $$TB_CONCURRENCY}; \
 LIVESTREAM_FLAG=$${TB_LIVESTREAM:+--livestream}; \
 TASK_ID_FLAGS=""; \
-TASK_IDS_LIST=""; \
 if [ -n "$$TB_SAMPLE_SIZE" ]; then \
 echo "Ensuring dataset $$TB_DATASET is downloaded..."; \
 uvx terminal-bench datasets download --dataset "$$TB_DATASET" 2>&1 | grep -v "already exists" || true; \
@@ -316,28 +316,16 @@ benchmark-terminal: ## Run Terminal-Bench with the cmux agent (use TB_DATASET/TB
 for task_id in $$TASK_IDS; do \
 TASK_ID_FLAGS="$$TASK_ID_FLAGS --task-id $$task_id"; \
 done; \
-TASK_IDS_LIST="$$TASK_IDS"; \
 echo "Selected task IDs: $$TASK_IDS"; \
 fi; \
-TIMEOUT_FLAG=""; \
-if [ -n "$$TB_TIMEOUT" ]; then \
-echo "Using explicit timeout: $$TB_TIMEOUT seconds"; \
-TIMEOUT_FLAG="--global-agent-timeout-sec $$TB_TIMEOUT"; \
-elif [ -n "$$TASK_IDS_LIST" ]; then \
-echo "Calculating optimal timeout for selected tasks..."; \
-TIMEOUT_FLAG=$$(python benchmarks/terminal_bench/calculate_timeout.py --task-ids $$TASK_IDS_LIST --format flag); \
-echo "Timeout: $$TIMEOUT_FLAG"; \
-else \
-echo "Using default timeout (60 minutes for full suite)"; \
-TIMEOUT_FLAG="--global-agent-timeout-sec 3600"; \
-fi; \
+echo "Using timeout: $$TB_TIMEOUT seconds"; \
 echo "Running Terminal-Bench with dataset $$TB_DATASET"; \
 uvx terminal-bench run \
 --dataset "$$TB_DATASET" \
 --agent-import-path benchmarks.terminal_bench.cmux_agent:CmuxAgent \
+--global-agent-timeout-sec $$TB_TIMEOUT \
 $$CONCURRENCY_FLAG \
 $$LIVESTREAM_FLAG \
-$$TIMEOUT_FLAG \
 $$TASK_ID_FLAGS \
 $${TB_ARGS}
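The recipe above leans on two POSIX parameter expansions. Inside a Makefile every `$` is doubled so that Make passes a literal `$` to the shell; in a plain shell script the same expansions look like the sketch below. The helper names `resolve_timeout` and `concurrency_flag` are hypothetical and exist only for illustration; they are not part of the repository.

```shell
#!/bin/sh
# Illustrative sketch of the expansions used in the benchmark-terminal recipe.
# resolve_timeout and concurrency_flag are hypothetical helper names.

# ${VAR:-default} yields the default when VAR is unset or empty.
resolve_timeout() {
    printf '%s\n' "${TB_TIMEOUT:-1800}"
}

# ${VAR:+text} yields text only when VAR is set and non-empty,
# so the flag disappears entirely when TB_CONCURRENCY is not given.
concurrency_flag() {
    printf '%s\n' "${TB_CONCURRENCY:+--n-concurrent $TB_CONCURRENCY}"
}

unset TB_TIMEOUT TB_CONCURRENCY
resolve_timeout      # prints 1800
TB_TIMEOUT=3600
resolve_timeout      # prints 3600
```

Because the default is resolved before the `uvx terminal-bench run` call, the `--global-agent-timeout-sec` flag is always passed, which is why the conditional `TIMEOUT_FLAG` logic could be dropped.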

benchmarks/terminal_bench/README.md

Lines changed: 20 additions & 36 deletions
@@ -26,48 +26,33 @@ make benchmark-terminal TB_ARGS="--agent-kwarg model_name=anthropic:claude-opus-
 - `TB_SAMPLE_SIZE`: Number of random tasks to run (default: all 80 tasks)
 - `TB_CONCURRENCY`: Number of concurrent tasks (default: 4)
 - `TB_LIVESTREAM`: Enable livestream mode (set to `1` to enable)
-- `TB_TIMEOUT`: Override timeout in seconds (default: intelligent per-task timeout)
+- `TB_TIMEOUT`: Global timeout in seconds (default: 1800 = 30 minutes)
 - `TB_ARGS`: Additional arguments passed to terminal-bench
 
-### Intelligent Timeout Handling
+### Timeout Handling
 
-The Makefile automatically calculates optimal timeouts based on task complexity:
+The benchmark uses a **global timeout** applied to all tasks. The default is **30 minutes (1800 seconds)**, which provides sufficient time for most tasks while catching genuinely stuck agents.
 
-- **FAST tasks** (5 min): Simple operations like `hello-world`, `fix-permissions`
-- **NORMAL tasks** (15 min): Default for most tasks
-- **SLOW tasks** (30 min): Data processing, ML training, complex analysis
-- **VERY_SLOW tasks** (60 min): Kernel compilation, large builds
+**Design Rationale:**
 
-**How it works:**
+Based on analysis of Oct 30, 2025 nightly runs:
+- Longest successful task: `blind-maze-explorer-algorithm.hard` at 20 minutes
+- 95th percentile: ~15 minutes
+- Mean duration: ~6 minutes
 
-1. If `TB_TIMEOUT` is set, uses that value explicitly
-2. If specific tasks are selected (via `TB_SAMPLE_SIZE` or `--task-id`), calculates the maximum timeout needed for those tasks
-3. For full suite runs, uses 60 minutes (conservative default)
+The 30-minute default provides comfortable headroom for complex tasks without excessive wait times for failed attempts.
 
-**Examples:**
+**Override timeout:**
 
 ```bash
-# Fast tasks get 5 minute timeout automatically
-make benchmark-terminal TB_ARGS="--task-id hello-world --task-id simple-web-scraper"
+# Run with 60 minute timeout for very complex tasks
+TB_TIMEOUT=3600 make benchmark-terminal
 
-# Slow tasks get 60 minute timeout automatically
-make benchmark-terminal TB_ARGS="--task-id build-linux-kernel-qemu"
-
-# Override timeout manually (in seconds)
-TB_TIMEOUT=1200 make benchmark-terminal TB_ARGS="--task-id chess-best-move"
+# Run with shorter 10 minute timeout for quick iteration
+TB_TIMEOUT=600 make benchmark-terminal TB_SAMPLE_SIZE=5
 ```
 
-### Task Timeout Configuration
-
-Task timeouts are configured in `task_timeouts.py` based on empirical data from nightly runs. To add or modify timeouts:
-
-```python
-# In task_timeouts.py
-TASK_TIMEOUTS = {
-"my-new-task": SLOW_TIMEOUT, # 30 minutes
-"my-fast-task": FAST_TIMEOUT, # 5 minutes
-}
-```
+**Note:** We prefer global timeout defaults over per-task configuration to avoid complexity and maintenance burden. If you find tasks consistently timing out, increase `TB_TIMEOUT` rather than adding per-task configuration.
 
 ## Agent Configuration
 
@@ -103,21 +88,20 @@ See `.github/workflows/terminal-bench.yml` and `.github/workflows/nightly-termin
 
 ## Timeout Analysis (2025-10-30 Nightly Run)
 
-Based on analysis of the Oct 30 nightly run:
+Based on analysis of the Oct 30 nightly run (15-minute timeout):
 
-- **27-35% of tasks hit timeout** with 15-minute default
-- **5-6 tasks passed tests but hit timeout** (would have succeeded with more time)
+- **27-35% of tasks hit timeout** (too aggressive)
+- **5-6 tasks passed tests but hit timeout flag** (false negatives)
 - **Mean duration**: 356s (Anthropic) / 438s (OpenAI)
 - **Median duration**: 272s (Anthropic) / 299s (OpenAI)
+- **Longest successful**: 1200s (20 minutes) for `blind-maze-explorer-algorithm.hard`
 
-**Impact of intelligent timeouts**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%).
+**Impact of 30-minute timeout**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%).
 
 ## Files
 
 - `cmux_agent.py`: Main agent adapter implementing Terminal-Bench's agent interface
 - `cmux-run.sh`: Shell script that sets up environment and invokes cmux CLI
 - `cmux_payload.py`: Helper to package cmux app for containerized execution
 - `cmux_setup.sh.j2`: Jinja2 template for agent installation script
-- `task_timeouts.py`: Task-specific timeout configuration
-- `calculate_timeout.py`: Helper script to calculate optimal timeouts
 - `sample_tasks.py`: Utility to randomly sample tasks from dataset
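
The headroom implied by the figures in the README can be checked with integer shell arithmetic. This is a minimal sketch, not part of the commit: `headroom_pct` is a hypothetical helper, and the inputs are the 1800 s default, the earlier 900 s (15-minute) timeout, and the 1200 s longest successful task quoted above.

```shell
#!/bin/sh
# headroom_pct TIMEOUT DURATION
# Prints the percentage by which TIMEOUT exceeds DURATION (integer division).
# A negative result means the task would have been cut off.
# headroom_pct is a hypothetical helper used only for illustration.
headroom_pct() {
    echo $(( ($1 - $2) * 100 / $2 ))
}

headroom_pct 1800 1200   # prints 50: new default vs. longest successful task
headroom_pct 900 1200    # prints -25: old 15-minute timeout vs. the same task
```

Under these numbers the 30-minute default leaves 50% slack over the slowest successful run, while the previous 15-minute limit fell 25% short of it.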

benchmarks/terminal_bench/calculate_timeout.py

Lines changed: 0 additions & 57 deletions
This file was deleted.

benchmarks/terminal_bench/task_timeouts.py

Lines changed: 0 additions & 95 deletions
This file was deleted.

benchmarks/terminal_bench/task_timeouts_test.py

Lines changed: 0 additions & 66 deletions
This file was deleted.
