🤖 ci: increase terminal-bench global timeout to 30 minutes #495
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
Fixed 15-minute timeout caused 27-35% of tasks to timeout in nightly runs. Analysis of the Oct 30 nightly run revealed:
blind-maze-explorer-algorithm.hardat 1200s (20 minutes)Additionally, agent output was human-readable text making it difficult to analyze programmatically.
Solution
Two improvements:
1. Global Timeout Increase
Set global timeout to 30 minutes (1800 seconds) for all tasks.
Design Rationale:
TB_TIMEOUTenv var if needed2. JSON Lines Output
Enable
--json-streamingflag in agent CLI to output structured JSON lines instead of human-readable text.Benefits:
Makefile Changes
TB_TIMEOUTto 1800 seconds (30 minutes)TB_TIMEOUTenv var overrideUsage
Expected Impact
Documentation
Updated
benchmarks/terminal_bench/README.mdto document:TB_TIMEOUTenv varEvidence
Tasks from 2025-10-30 nightly run that motivated this change:
Tasks that passed but hit 15-minute timeout:
blind-maze-explorer-algorithm.hard: ✓ passed at 1200s (20 min)qemu-startup: ✓ passed at 838s (14 min)count-dataset-tokens: Anthropic timed out at 808s (13.5 min)path-tracing: ✓ passed at 660s (11 min)pytorch-model-cli: ✓ passed at 541s (9 min)95th percentile task duration: ~15 minutes
With 30-minute timeout, all these tasks would have succeeded.
Backward Compatibility
✅ Existing workflows continue to work unchanged
✅
TB_TIMEOUTenv var provides manual override✅ Default behavior provides better coverage than 15-minute timeout
✅ JSON output doesn't break existing analysis tools (they just see more structured data)
Generated with
cmux