@ammar-agent ammar-agent commented Nov 8, 2025

Problem

The nightly terminal-bench run hit the 3-hour workflow timeout. Root cause: the agent set max_timeout_sec=float('inf'), which bypassed terminal-bench's timeout enforcement.

Solution

Remove max_timeout_sec=float('inf') so that terminal-bench's global timeout is respected. Also simplify timeout handling to reduce complexity.

Changes:

  • Don't override max_timeout_sec in cmux_agent.py
  • Remove redundant shell-level timeout logic
  • Simplify workflow results output
  • Raise workflow timeout 180→240 min to leave headroom for API slowdowns
  • Nightly livestream default: true→false

Net: -2 LoC

Testing

Ran TB workflow dispatch with 3 tasks:

  • ✅ 1/3 passed (tmux-advanced-workflow)
  • Timeout correctly set to 1800s (30 min)
  • No hung tasks

Generated with cmux

…andling

**Root Cause:**
Agent was setting max_timeout_sec=float('inf') which bypassed terminal-bench's
timeout enforcement, causing tasks to hang indefinitely. Nov 8 nightly run hit
the 3-hour workflow timeout when tasks hung.

**Changes:**

1. **Critical Fix (cmux_agent.py):**
   - Change max_timeout_sec from float('inf') to None
   - Allows terminal-bench to properly enforce timeouts
   - Prevents infinite task execution

2. **Defense-in-Depth Timeout Layers:**
   - Terminal-bench: --global-agent-timeout-sec (configurable)
   - Cmux agent: --timeout via CMUX_TIMEOUT_MS (Makefile exports)
   - Shell: timeout command with 60s buffer (cmux-run.sh)
   - Workflow: 240 min total timeout (up from 180 min)

3. **Nightly Configuration:**
   - Aggressive 15-min per-task timeout (down from 30 min default)
   - Faster detection of hung tasks
   - Added task_timeout parameter to workflows

4. **Improved Monitoring:**
   - Results summary shows pass rate percentage
   - Detects and reports timeout-related failures
   - Better error messages when results missing

**Testing:**
- Syntax validated (Python, Bash, YAML)
- Typecheck passes
- Formatting applied

Successful runs typically complete in ~60-90 minutes. The 4-hour workflow
timeout provides headroom for API slowdowns while preventing infinite hangs.

_Generated with `cmux`_

TerminalCommand requires max_timeout_sec to be a float, not None.
By not setting it at all, we use the Pydantic default (180s) while
still allowing terminal-bench's --global-agent-timeout-sec to take
precedence.
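
The "omit the field instead of passing None" behavior described above can be illustrated with a small sketch. `CommandSketch` and `build_command` are hypothetical stand-ins written with a stdlib dataclass; they are not terminal-bench's actual `TerminalCommand` (which the commit message says is a Pydantic model), and only the 180s default is taken from the text above.

```python
from dataclasses import dataclass


@dataclass
class CommandSketch:
    """Hypothetical stand-in for a command model; not terminal-bench's real class."""

    command: str
    max_timeout_sec: float = 180.0  # default value as stated in the commit message

    def __post_init__(self) -> None:
        # Mimic a validating model: reject None and other non-numeric values.
        value = self.max_timeout_sec
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("max_timeout_sec must be a float")


def build_command(text: str) -> CommandSketch:
    # Leave max_timeout_sec unset so the model default applies.
    return CommandSketch(command=text)
```

In this sketch, `build_command("ls")` ends up with the 180s default, while passing `None` explicitly raises. Note that `float('inf')` would pass a type check of this kind, so a check like this alone would not flag it.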

Terminal-bench's results format uses .results[] not .trials[], and .is_resolved not .resolved.
Also improved output format to show failure_mode and avoid bc dependency.
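
As a rough illustration of the summary logic described above, here is a Python sketch that reads a results document and reports resolved/unresolved counts, accuracy, and failure modes. The `results`, `is_resolved`, and `failure_mode` keys come from the commit message; the `task_id` key and the `summarize` function are assumptions made for this example, and the actual workflow does this in shell rather than Python.

```python
import json


def summarize(results_json: str) -> str:
    """Build a short summary from a terminal-bench-style results document."""
    data = json.loads(results_json)
    results = data.get("results", [])
    total = len(results)
    resolved = sum(1 for r in results if r.get("is_resolved"))
    # Plain float arithmetic, so no external calculator is needed;
    # an empty run reports 0.0 instead of dividing by zero.
    accuracy = (100.0 * resolved / total) if total else 0.0
    lines = [
        f"resolved={resolved} unresolved={total - resolved} accuracy={accuracy:.1f}%"
    ]
    for r in results:
        if not r.get("is_resolved"):
            # task_id is a hypothetical key used only for this sketch.
            lines.append(f"  {r.get('task_id', '?')}: {r.get('failure_mode', 'unknown')}")
    return "\n".join(lines)
```

Reading missing keys with `.get` keeps the summary from crashing when a result entry is incomplete, which matters when a run was cut short.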

Agent logs are saved as artifacts, so there is no need to spam the console output.

Changes:
- Disable livestream by default (was true, now false)
- Filter output to show only key info: task selection, timeouts, results
- Save full benchmark.log as artifact for debugging
- Show last 50 lines on error for quick diagnosis
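
The filtering approach in the list above could look roughly like the following sketch: print only lines that match a few key terms, and on failure append the last 50 lines of the full log. The `filter_log` function and the `KEYWORDS` tuple are hypothetical and chosen for illustration; the actual workflow's filter terms are not given in this conversation.

```python
from collections import deque
from typing import Iterable

# Hypothetical filter terms; the real workflow's patterns are not shown here.
KEYWORDS = ("task", "timeout", "result")


def filter_log(lines: Iterable[str], failed: bool, tail: int = 50) -> list[str]:
    """Return the console-worthy lines, plus a tail of the full log on failure."""
    last = deque(maxlen=tail)  # keeps only the most recent `tail` lines
    shown = []
    for line in lines:
        last.append(line)
        if any(k in line.lower() for k in KEYWORDS):
            shown.append(line)
    if failed:
        shown.append(f"--- last {len(last)} lines ---")
        shown.extend(last)
    return shown
```

A bounded `deque` keeps memory use constant regardless of log size, since the full log is assumed to be saved separately as an artifact.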

Console now shows:
✓ Configuration (dataset, timeout, sample size)
✓ Selected task IDs
✓ Results summary (resolved/unresolved/accuracy)
✗ Not shown: verbose agent logs, docker output, intermediate steps

Full logs are still available in the benchmark.log artifact.

- Remove task_timeout input parameter (use TB default: 1800s)
- Remove redundant shell-level timeout in cmux-run.sh
- Simplify workflow output (full logs saved in artifacts)
- Simplify results summary display
- Change nightly livestream default: true -> false

Net result: -2 LoC while maintaining functionality

@ammar-agent ammar-agent changed the title from "🤖 fix: prevent terminal-bench timeout bypass and improve robustness" to "🤖 bench: simplify terminal-bench timeout handling" Nov 8, 2025
@ammario ammario enabled auto-merge November 8, 2025 20:38
@ammario ammario added this pull request to the merge queue Nov 8, 2025
Merged via the queue into main with commit 5438e2b Nov 8, 2025
15 checks passed
@ammario ammario deleted the tb-fix branch November 8, 2025 20:54