🤖 ci: replace colons in TB artifact names with hyphens #488
## Problem

The nightly Terminal-Bench workflow runs successfully but fails at the artifact upload step because artifact names contain colons (from model names like `anthropic:claude-sonnet-4-5`). GitHub Actions artifact names cannot contain colons due to filesystem restrictions (NTFS compatibility).
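The required sanitization is a single character substitution; for example, in shell:

```shell
# Replace every colon with a hyphen so the name is a valid artifact name.
echo "anthropic:claude-sonnet-4-5" | tr ':' '-'
# → anthropic-claude-sonnet-4-5
```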
## Solution

### 1. Fixed Artifact Names

Use the `replace()` function in the artifact name template to convert colons to hyphens:

- `anthropic:claude-sonnet-4-5` → `anthropic-claude-sonnet-4-5`
- `openai:gpt-5-codex` → `openai-gpt-5-codex`

### 2. Added Results Logging

**Important:** Since the previous run had 720 files that failed to upload, added a new step to print `results.json` to the workflow logs before artifact upload. This ensures task-level results are preserved in the logs even if the artifact upload fails.
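Steps 1 and 2 might look roughly like the workflow fragment below. This is a hedged sketch, not the actual workflow: the step names, the `matrix.model` variable, and the `results.json` path are assumptions, and the shell substitution `${MODEL//:/-}` stands in for whatever `replace()` templating the workflow actually uses.

```yaml
# Hypothetical sketch of the fixed upload steps (names and paths assumed).
- name: Sanitize model name for artifact upload
  id: artifact
  env:
    MODEL: ${{ matrix.model }}
  run: echo "name=terminal-bench-results-${MODEL//:/-}-${{ github.run_id }}" >> "$GITHUB_OUTPUT"

- name: Print results.json to workflow logs
  if: always()  # preserve task-level results even if the upload fails
  run: cat results.json

- name: Upload results
  uses: actions/upload-artifact@v4
  with:
    name: ${{ steps.artifact.outputs.name }}
    path: results.json
```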
### 3. Added Model Verification Logging

Added logging in `agentSessionCli.ts` to confirm which model name is in use.

## Investigation: Identical 42.50% Accuracy
During the manual run (#18913267878), both models achieved exactly 42.50% accuracy (34/80 tasks). Investigation revealed:
**Facts:**

- The two models wrote to separate run directories (`runs/2025-10-29__15-29-47` vs `15-29-29`)

**Code Path Verification:**
Traced the model parameter through the entire chain:

- The workflow sets `TB_ARGS: --agent-kwarg model_name=<model>`
- The workflow passes `$TB_ARGS` to terminal-bench
- The agent reads `model_name` and sets the `CMUX_MODEL` env var
- The agent passes `--model "${CMUX_MODEL}"` to the CLI
- The CLI reads the `--model` flag and uses it

The code is correct. The identical scores are statistically unlikely but possible with offsetting timeout patterns.
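The tail of that chain can be sketched in TypeScript. This is a hypothetical reconstruction, not the actual `agentSessionCli.ts` code: it shows a `--model` flag taking precedence over the `CMUX_MODEL` env var, plus the verification log line added in step 3.

```typescript
// Hypothetical sketch of the model-resolution path described above;
// the real agentSessionCli.ts may differ in structure and naming.
function resolveModel(
  argv: string[],
  env: Record<string, string | undefined>,
): string | undefined {
  // An explicit --model flag wins...
  const i = argv.indexOf("--model");
  const fromFlag = i >= 0 ? argv[i + 1] : undefined;
  // ...otherwise fall back to CMUX_MODEL set by the TB agent wrapper.
  const model = fromFlag ?? env["CMUX_MODEL"];
  if (model !== undefined) {
    // The model verification log line.
    console.log(`[cmux-cli] Using model: ${model}`);
  }
  return model;
}

// Example: the flag takes precedence over the environment.
resolveModel(["--model", "openai:gpt-5-codex"], {
  CMUX_MODEL: "anthropic:claude-sonnet-4-5",
});
// prints "[cmux-cli] Using model: openai:gpt-5-codex"
```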
## Next Steps

With the new results logging, the next benchmark run will show the per-task results for each model in the workflow logs. This allows full verification that the models produce different task-level results.
## Testing

The next nightly run (tonight at 00:00 UTC) will:

- Upload artifacts named `terminal-bench-results-anthropic-claude-sonnet-4-5-<run_id>` and `terminal-bench-results-openai-gpt-5-codex-<run_id>`
- Log `[cmux-cli] Using model: <model_name>`

Generated with cmux