🤖 bench: simplify terminal-bench timeout handling #533

ammar-agent · 2025-11-08T16:30:03Z

Problem

The nightly terminal-bench run hit the 3-hour workflow timeout. Root cause: the agent set max_timeout_sec=float('inf'), which bypassed terminal-bench's timeout enforcement.

Solution

Remove the max_timeout_sec=float('inf') override so that terminal-bench's global timeout is respected. This also simplifies the timeout handling overall.
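To illustrate why an infinite value defeats a timeout check, here is a minimal sketch. It is not terminal-bench's actual enforcement code; `has_timed_out` is a hypothetical helper, and the only values taken from this PR are the 3-hour workflow limit and the 1800s default.

```python
def has_timed_out(elapsed_sec: float, max_timeout_sec: float) -> bool:
    """Hypothetical check: has the elapsed time reached the limit?"""
    return elapsed_sec >= max_timeout_sec


three_hours = 3 * 60 * 60  # 10800 s, the workflow limit the nightly run hit

# No finite elapsed time ever reaches float('inf'), so the check never fires.
never_fires = has_timed_out(three_hours, float("inf"))

# With a finite limit (1800 s here), the same elapsed time is flagged.
fires = has_timed_out(three_hours, 1800.0)

print(never_fires, fires)
```

Any comparison-based deadline behaves this way, which is why the override is removed rather than tuned.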

Changes:

Don't override max_timeout_sec in cmux_agent.py
Remove redundant shell-level timeout logic
Simplify workflow results output
Raise workflow timeout 180→240 min to leave headroom for API slowdowns
Nightly livestream default: true→false

Net: -2 LoC

Testing

Ran TB workflow dispatch with 3 tasks:

✅ 1/3 passed (tmux-advanced-workflow)
Timeout correctly set to 1800s (30 min)
No hung tasks

Generated with cmux

…andling

**Root Cause:** Agent was setting max_timeout_sec=float('inf') which bypassed terminal-bench's timeout enforcement, causing tasks to hang indefinitely. Nov 8 nightly run hit the 3-hour workflow timeout when tasks hung.

**Changes:**

1. **Critical Fix (cmux_agent.py):**
   - Change max_timeout_sec from float('inf') to None
   - Allows terminal-bench to properly enforce timeouts
   - Prevents infinite task execution
2. **Defense-in-Depth Timeout Layers:**
   - Terminal-bench: --global-agent-timeout-sec (configurable)
   - Cmux agent: --timeout via CMUX_TIMEOUT_MS (Makefile exports)
   - Shell: timeout command with 60s buffer (cmux-run.sh)
   - Workflow: 240 min total timeout (up from 180 min)
3. **Nightly Configuration:**
   - Aggressive 15-min per-task timeout (down from 30 min default)
   - Faster detection of hung tasks
   - Added task_timeout parameter to workflows
4. **Improved Monitoring:**
   - Results summary shows pass rate percentage
   - Detects and reports timeout-related failures
   - Better error messages when results missing

**Testing:**

- Syntax validated (Python, Bash, YAML)
- Typecheck passes
- Formatting applied

Successful runs typically complete in ~60-90 minutes. The 4-hour workflow timeout provides headroom for API slowdowns while preventing infinite hangs.

_Generated with `cmux`_

TerminalCommand requires max_timeout_sec to be a float, not None. By not setting it at all, we use the Pydantic default (180s) while still allowing terminal-bench's --global-agent-timeout-sec to take precedence.
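The behaviour described here (omit the field and get the default; pass None and get rejected) can be sketched with a stand-in class. This is an assumption-laden illustration: `CommandSketch` is hypothetical and uses a stdlib dataclass with a manual check, not terminal-bench's real Pydantic-based TerminalCommand. Only the field name and the 180s default come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class CommandSketch:
    """Hypothetical stand-in; NOT terminal-bench's real TerminalCommand."""

    command: str
    max_timeout_sec: float = 180.0  # default stated in the commit message

    def __post_init__(self) -> None:
        # Mimic a validating model: reject None and other non-numeric values.
        value = self.max_timeout_sec
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("max_timeout_sec must be a float")
        self.max_timeout_sec = float(value)


# Omitting the field entirely falls back to the default.
default_cmd = CommandSketch(command="echo hi")
print(default_cmd.max_timeout_sec)

# Passing None is rejected, which is why the override was dropped
# instead of being replaced with None.
try:
    CommandSketch(command="echo hi", max_timeout_sec=None)
    none_rejected = False
except TypeError:
    none_rejected = True
print(none_rejected)
```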

Terminal-bench format uses .results[] not .trials[], and is_resolved not .resolved. Also improved output format to show failure_mode and avoid bc dependency.
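As a rough sketch of the corrected parsing, the snippet below reads a results document using the field names mentioned in this commit (`results`, `is_resolved`, `failure_mode`). It is not the workflow's actual summary step; the `summarize` helper and the sample payload, including its failure_mode strings, are invented for illustration.

```python
import json
from collections import Counter


def summarize(results_json: str) -> dict:
    """Hypothetical helper: count resolved tasks and tally failure modes."""
    data = json.loads(results_json)
    results = data.get("results", [])  # top-level key is "results", not "trials"
    total = len(results)
    resolved = sum(1 for r in results if r.get("is_resolved"))
    failure_modes = Counter(
        r.get("failure_mode", "unknown") for r in results if not r.get("is_resolved")
    )
    # Plain float arithmetic here, so no external calculator such as bc is needed.
    pass_rate = round(100.0 * resolved / total, 1) if total else 0.0
    return {
        "total": total,
        "resolved": resolved,
        "pass_rate": pass_rate,
        "failure_modes": dict(failure_modes),
    }


# Invented sample payload; only the key names come from the commit message.
sample = json.dumps(
    {
        "results": [
            {"is_resolved": True, "failure_mode": "none"},
            {"is_resolved": False, "failure_mode": "example_failure_a"},
            {"is_resolved": False, "failure_mode": "example_failure_b"},
        ]
    }
)
summary = summarize(sample)
print(summary)
```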

Agent logs are saved as artifacts, so no need to spam console output.

Changes:
- Disable livestream by default (was true, now false)
- Filter output to show only key info: task selection, timeouts, results
- Save full benchmark.log as artifact for debugging
- Show last 50 lines on error for quick diagnosis

Console now shows:
✓ Configuration (dataset, timeout, sample size)
✓ Selected task IDs
✓ Results summary (resolved/unresolved/accuracy)
✗ Not shown: verbose agent logs, docker output, intermediate steps

Full logs still available in benchmark.log artifact.
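The "last 50 lines on error" idea can be sketched as below. This is an illustrative Python equivalent, not the workflow's actual step; `tail_lines` is a hypothetical helper and the log lines are generated in place of a real benchmark.log.

```python
from collections import deque
from typing import Iterable, List


def tail_lines(lines: Iterable[str], count: int = 50) -> List[str]:
    """Hypothetical helper: keep only the last `count` lines of a log."""
    # A bounded deque discards older entries as newer ones arrive,
    # so the whole log never needs to be held in memory.
    return list(deque(lines, maxlen=count))


# Stand-in for the contents of benchmark.log.
fake_log = [f"line {i}" for i in range(120)]
last_lines = tail_lines(fake_log)
print(len(last_lines), last_lines[0], last_lines[-1])
```

With a real file, the same helper can be fed an open file handle, since file objects iterate line by line.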

- Remove task_timeout input parameter (use TB default: 1800s)
- Remove redundant shell-level timeout in cmux-run.sh
- Simplify workflow output (full logs saved in artifacts)
- Simplify results summary display
- Change nightly livestream default: true -> false

Net result: -2 LoC while maintaining functionality

ammar-agent added 6 commits November 8, 2025 16:29

fix: use python3 instead of python in benchmark-terminal target

9b645b9

fix: remove max_timeout_sec override to use terminal-bench defaults

d3df797

fix: correct results.json parsing in workflow summary

3e639dc

ammario approved these changes Nov 8, 2025

ammar-agent changed the title ~~🤖 fix: prevent terminal-bench timeout bypass and improve robustness~~ 🤖 bench: simplify terminal-bench timeout handling Nov 8, 2025

docs: add bench: prefix for benchmarking PRs

077828b

ammario enabled auto-merge November 8, 2025 20:38

ammario added this pull request to the merge queue Nov 8, 2025

Merged via the queue into main with commit 5438e2b Nov 8, 2025
15 checks passed

ammario deleted the tb-fix branch November 8, 2025 20:54
