🤖 ci: replace colons in TB artifact names with hyphens #488
## Problem

The nightly Terminal-Bench workflow runs successfully but fails at the artifact upload step because artifact names contain colons (from model names like `anthropic:claude-sonnet-4-5`). GitHub Actions artifact names cannot contain colons due to filesystem restrictions (NTFS compatibility).
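The required sanitization is a single character substitution; for example, in shell:

```shell
# Replace every colon with a hyphen so the name is a valid artifact name.
echo "anthropic:claude-sonnet-4-5" | tr ':' '-'
# → anthropic-claude-sonnet-4-5
```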
## Solution

### 1. Fixed Artifact Names

Use the `replace()` function in the artifact name template to convert colons to hyphens:

- `anthropic:claude-sonnet-4-5` → `anthropic-claude-sonnet-4-5`
- `openai:gpt-5-codex` → `openai-gpt-5-codex`

### 2. Added Results Logging

**Important:** Since the previous run had 720 files that failed to upload, added a new step to print `results.json` to the workflow logs before artifact upload. This ensures task-level results are preserved in the logs even if the artifact upload fails.
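Steps 1 and 2 might look roughly like the workflow fragment below. This is a hedged sketch, not the actual workflow: the step names, the `matrix.model` variable, and the `results.json` path are assumptions, and the shell substitution `${MODEL//:/-}` stands in for whatever `replace()` templating the workflow actually uses.

```yaml
# Hypothetical sketch of the fixed upload steps (names and paths assumed).
- name: Sanitize model name for artifact upload
  id: artifact
  env:
    MODEL: ${{ matrix.model }}
  run: echo "name=terminal-bench-results-${MODEL//:/-}-${{ github.run_id }}" >> "$GITHUB_OUTPUT"

- name: Print results.json to workflow logs
  if: always()  # preserve task-level results even if the upload fails
  run: cat results.json

- name: Upload results
  uses: actions/upload-artifact@v4
  with:
    name: ${{ steps.artifact.outputs.name }}
    path: results.json
```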
### 3. Added Model Verification Logging

Added logging in `agentSessionCli.ts` to confirm which model name is in use.

## Investigation: Identical 42.50% Accuracy
During the manual run (#18913267878), both models achieved exactly 42.50% accuracy (34/80 tasks). Investigation revealed:
**Facts:**

- The two models wrote to separate run directories (`runs/2025-10-29__15-29-47` vs `15-29-29`)

**Code Path Verification:**
Traced the model parameter through the entire chain:

- The workflow sets `TB_ARGS: --agent-kwarg model_name=<model>`
- The workflow passes `$TB_ARGS` to terminal-bench
- The agent reads `model_name` and sets the `CMUX_MODEL` env var
- The agent passes `--model "${CMUX_MODEL}"` to the CLI
- The CLI reads the `--model` flag and uses it

The code is correct. The identical scores are statistically unlikely but possible with offsetting timeout patterns.
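The tail of that chain can be sketched in TypeScript. This is a hypothetical reconstruction, not the actual `agentSessionCli.ts` code: it shows a `--model` flag taking precedence over the `CMUX_MODEL` env var, plus the verification log line added in step 3.

```typescript
// Hypothetical sketch of the model-resolution path described above;
// the real agentSessionCli.ts may differ in structure and naming.
function resolveModel(
  argv: string[],
  env: Record<string, string | undefined>,
): string | undefined {
  // An explicit --model flag wins...
  const i = argv.indexOf("--model");
  const fromFlag = i >= 0 ? argv[i + 1] : undefined;
  // ...otherwise fall back to CMUX_MODEL set by the TB agent wrapper.
  const model = fromFlag ?? env["CMUX_MODEL"];
  if (model !== undefined) {
    // The model verification log line.
    console.log(`[cmux-cli] Using model: ${model}`);
  }
  return model;
}

// Example: the flag takes precedence over the environment.
resolveModel(["--model", "openai:gpt-5-codex"], {
  CMUX_MODEL: "anthropic:claude-sonnet-4-5",
});
// prints "[cmux-cli] Using model: openai:gpt-5-codex"
```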
## Next Steps

With the new results logging, the next benchmark run will show the per-task results for each model in the workflow logs. This allows full verification that the models produce different task-level results.
## Testing

The next nightly run (tonight at 00:00 UTC) will:

- Upload artifacts named `terminal-bench-results-anthropic-claude-sonnet-4-5-<run_id>` and `terminal-bench-results-openai-gpt-5-codex-<run_id>`
- Log `[cmux-cli] Using model: <model_name>`

Generated with cmux