diff --git a/Makefile b/Makefile
Lines changed: 3 additions & 15 deletions
diff --git a/benchmarks/terminal_bench/README.md b/benchmarks/terminal_bench/README.md
Lines changed: 20 additions & 36 deletions
diff --git a/benchmarks/terminal_bench/calculate_timeout.py b/benchmarks/terminal_bench/calculate_timeout.py
Lines changed: 0 additions & 57 deletions
diff --git a/benchmarks/terminal_bench/task_timeouts.py b/benchmarks/terminal_bench/task_timeouts.py
Lines changed: 0 additions & 95 deletions
diff --git a/benchmarks/terminal_bench/task_timeouts_test.py b/benchmarks/terminal_bench/task_timeouts_test.py
Lines changed: 0 additions & 66 deletions
@@ -297,10 +297,10 @@ chromatic: node_modules/.installed ## Run Chromatic for visual regression testin
 ## Benchmarks
 benchmark-terminal: ## Run Terminal-Bench with the cmux agent (use TB_DATASET/TB_SAMPLE_SIZE/TB_TIMEOUT/TB_ARGS to customize)
 	@TB_DATASET=$${TB_DATASET:-terminal-bench-core==0.1.1}; \
+	TB_TIMEOUT=$${TB_TIMEOUT:-1800}; \
 	CONCURRENCY_FLAG=$${TB_CONCURRENCY:+--n-concurrent $$TB_CONCURRENCY}; \
 	LIVESTREAM_FLAG=$${TB_LIVESTREAM:+--livestream}; \
 	TASK_ID_FLAGS=""; \
-	TASK_IDS_LIST=""; \
 	if [ -n "$$TB_SAMPLE_SIZE" ]; then \
 		echo "Ensuring dataset $$TB_DATASET is downloaded..."; \
 		uvx terminal-bench datasets download --dataset "$$TB_DATASET" 2>&1 | grep -v "already exists" || true; \
@@ -316,28 +316,16 @@ benchmark-terminal: ## Run Terminal-Bench with the cmux agent (use TB_DATASET/TB
 		for task_id in $$TASK_IDS; do \
 			TASK_ID_FLAGS="$$TASK_ID_FLAGS --task-id $$task_id"; \
 		done; \
-		TASK_IDS_LIST="$$TASK_IDS"; \
 		echo "Selected task IDs: $$TASK_IDS"; \
 	fi; \
-	TIMEOUT_FLAG=""; \
-	if [ -n "$$TB_TIMEOUT" ]; then \
-		echo "Using explicit timeout: $$TB_TIMEOUT seconds"; \
-		TIMEOUT_FLAG="--global-agent-timeout-sec $$TB_TIMEOUT"; \
-	elif [ -n "$$TASK_IDS_LIST" ]; then \
-		echo "Calculating optimal timeout for selected tasks..."; \
-		TIMEOUT_FLAG=$$(python benchmarks/terminal_bench/calculate_timeout.py --task-ids $$TASK_IDS_LIST --format flag); \
-		echo "Timeout: $$TIMEOUT_FLAG"; \
-	else \
-		echo "Using default timeout (60 minutes for full suite)"; \
-		TIMEOUT_FLAG="--global-agent-timeout-sec 3600"; \
-	fi; \
+	echo "Using timeout: $$TB_TIMEOUT seconds"; \
 	echo "Running Terminal-Bench with dataset $$TB_DATASET"; \
 	uvx terminal-bench run \
 		--dataset "$$TB_DATASET" \
 		--agent-import-path benchmarks.terminal_bench.cmux_agent:CmuxAgent \
+		--global-agent-timeout-sec $$TB_TIMEOUT \
 		$$CONCURRENCY_FLAG \
 		$$LIVESTREAM_FLAG \
-		$$TIMEOUT_FLAG \
 		$$TASK_ID_FLAGS \
 		$${TB_ARGS}
 
 
@@ -26,48 +26,33 @@ make benchmark-terminal TB_ARGS="--agent-kwarg model_name=anthropic:claude-opus-
 - `TB_SAMPLE_SIZE`: Number of random tasks to run (default: all 80 tasks)
 - `TB_CONCURRENCY`: Number of concurrent tasks (default: 4)
 - `TB_LIVESTREAM`: Enable livestream mode (set to `1` to enable)
-- `TB_TIMEOUT`: Override timeout in seconds (default: intelligent per-task timeout)
+- `TB_TIMEOUT`: Global timeout in seconds (default: 1800 = 30 minutes)
 - `TB_ARGS`: Additional arguments passed to terminal-bench
 
-### Intelligent Timeout Handling
+### Timeout Handling
 
-The Makefile automatically calculates optimal timeouts based on task complexity:
+The benchmark uses a **global timeout** applied to all tasks. The default is **30 minutes (1800 seconds)**, which provides sufficient time for most tasks while catching genuinely stuck agents.
 
-- **FAST tasks** (5 min): Simple operations like `hello-world`, `fix-permissions`
-- **NORMAL tasks** (15 min): Default for most tasks
-- **SLOW tasks** (30 min): Data processing, ML training, complex analysis
-- **VERY_SLOW tasks** (60 min): Kernel compilation, large builds
+**Design Rationale:**
 
-**How it works:**
+Based on analysis of Oct 30, 2025 nightly runs:
+- Longest successful task: `blind-maze-explorer-algorithm.hard` at 20 minutes
+- 95th percentile: ~15 minutes
+- Mean duration: ~6 minutes
 
-1. If `TB_TIMEOUT` is set, uses that value explicitly
-2. If specific tasks are selected (via `TB_SAMPLE_SIZE` or `--task-id`), calculates the maximum timeout needed for those tasks
-3. For full suite runs, uses 60 minutes (conservative default)
+The 30-minute default provides comfortable headroom for complex tasks without excessive wait times for failed attempts.
 
-**Examples:**
+**Override timeout:**
 
 ```bash
-# Fast tasks get 5 minute timeout automatically
-make benchmark-terminal TB_ARGS="--task-id hello-world --task-id simple-web-scraper"
+# Run with 60 minute timeout for very complex tasks
+TB_TIMEOUT=3600 make benchmark-terminal
 
-# Slow tasks get 60 minute timeout automatically
-make benchmark-terminal TB_ARGS="--task-id build-linux-kernel-qemu"
-
-# Override timeout manually (in seconds)
-TB_TIMEOUT=1200 make benchmark-terminal TB_ARGS="--task-id chess-best-move"
+# Run with shorter 10 minute timeout for quick iteration
+TB_TIMEOUT=600 make benchmark-terminal TB_SAMPLE_SIZE=5
 ```
 
-### Task Timeout Configuration
-
-Task timeouts are configured in `task_timeouts.py` based on empirical data from nightly runs. To add or modify timeouts:
-
-```python
-# In task_timeouts.py
-TASK_TIMEOUTS = {
-    "my-new-task": SLOW_TIMEOUT,  # 30 minutes
-    "my-fast-task": FAST_TIMEOUT,  # 5 minutes
-}
-```
+**Note:** We prefer global timeout defaults over per-task configuration to avoid complexity and maintenance burden. If you find tasks consistently timing out, increase `TB_TIMEOUT` rather than adding per-task configuration.
 
 ## Agent Configuration
 
@@ -103,21 +88,20 @@ See `.github/workflows/terminal-bench.yml` and `.github/workflows/nightly-termin
 
 ## Timeout Analysis (2025-10-30 Nightly Run)
 
-Based on analysis of the Oct 30 nightly run:
+Based on analysis of the Oct 30 nightly run (15-minute timeout):
 
-- **27-35% of tasks hit timeout** with 15-minute default
-- **5-6 tasks passed tests but hit timeout** (would have succeeded with more time)
+- **27-35% of tasks hit timeout** (too aggressive)
+- **5-6 tasks passed their tests but were still flagged as timed out** (false negatives)
 - **Mean duration**: 356s (Anthropic) / 438s (OpenAI)
 - **Median duration**: 272s (Anthropic) / 299s (OpenAI)
+- **Longest successful**: 1200s (20 minutes) for `blind-maze-explorer-algorithm.hard`
 
-**Impact of intelligent timeouts**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%).
+**Impact of 30-minute timeout**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%).
 
 ## Files
 
 - `cmux_agent.py`: Main agent adapter implementing Terminal-Bench's agent interface
 - `cmux-run.sh`: Shell script that sets up environment and invokes cmux CLI
 - `cmux_payload.py`: Helper to package cmux app for containerized execution
 - `cmux_setup.sh.j2`: Jinja2 template for agent installation script
-- `task_timeouts.py`: Task-specific timeout configuration
-- `calculate_timeout.py`: Helper script to calculate optimal timeouts
 - `sample_tasks.py`: Utility to randomly sample tasks from dataset
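
For reference, the new recipe's timeout handling comes down to one shell default expansion (`$${TB_TIMEOUT:-1800}` in the Makefile, which the shell sees as `${TB_TIMEOUT:-1800}`). The following is a minimal sketch of that behavior outside of make, assuming a POSIX-compatible shell. `resolve_timeout` and `build_timeout_flag` are hypothetical helper names used only for illustration and are not part of this change.

```bash
#!/bin/sh
# Hypothetical helper (not in the PR): mirrors the recipe's fallback.
# ${1:-1800} expands to 1800 when the argument is unset OR empty.
resolve_timeout() {
    echo "${1:-1800}"
}

# Hypothetical helper: assembles the flag the recipe passes to terminal-bench.
build_timeout_flag() {
    echo "--global-agent-timeout-sec $(resolve_timeout "$1")"
}

build_timeout_flag ""     # prints: --global-agent-timeout-sec 1800
build_timeout_flag 600    # prints: --global-agent-timeout-sec 600
```

Because `:-` treats an empty value the same as an unset one, `TB_TIMEOUT= make benchmark-terminal` would still fall back to 1800 seconds rather than passing an empty flag value.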