@ammar-agent ammar-agent commented Oct 29, 2025

Problem

The nightly Terminal-Bench workflow runs successfully but fails at the artifact upload step because artifact names contain colons (from model names like anthropic:claude-sonnet-4-5).

GitHub Actions artifact names cannot contain colons due to filesystem restrictions (NTFS compatibility).

Solution

1. Fixed Artifact Names

Use the replace() function in the artifact name template to convert colons to hyphens:

  • anthropic:claude-sonnet-4-5 → anthropic-claude-sonnet-4-5
  • openai:gpt-5-codex → openai-gpt-5-codex
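
As an illustration of the intended mapping, here is a minimal TypeScript sketch. The helpers `sanitizeModelName` and `artifactName` are hypothetical names introduced for this example; the PR itself performs the substitution in the workflow file, not in TypeScript. The `terminal-bench-results-` prefix and the `<model>-<run_id>` ordering follow the artifact names listed under Testing.

```typescript
// Hypothetical helpers for illustration only; not code from this PR.

// Replace every colon in a model name with a hyphen.
function sanitizeModelName(modelName: string): string {
  return modelName.replace(/:/g, "-");
}

// Build an artifact name; the model name is optional, as in the workflow input.
function artifactName(runId: string, modelName?: string): string {
  const prefix = "terminal-bench-results";
  if (!modelName) {
    return `${prefix}-${runId}`;
  }
  return `${prefix}-${sanitizeModelName(modelName)}-${runId}`;
}

console.log(artifactName("18913267878", "anthropic:claude-sonnet-4-5"));
// prints "terminal-bench-results-anthropic-claude-sonnet-4-5-18913267878"
```

The global regular expression matters here: a plain string pattern in `String.prototype.replace` would only replace the first colon.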

2. Added Results Logging

Important: The previous run produced 720 files that failed to upload, so a new step now prints results.json to the workflow logs before the artifact upload:

- name: Print results summary
  if: always()
  run: |
    # Outputs full results.json
    # Plus per-task summary: task_id: ✓ PASS / ✗ FAIL

This ensures task-level results are preserved in logs even if artifact upload fails.
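
The per-task summary format can be sketched as follows. This is a TypeScript illustration rather than the workflow step itself, and it assumes a results.json shape (a `results` array whose entries carry `task_id` and `is_resolved`) that is not shown in this PR; the field names and the sample task IDs are assumptions.

```typescript
// Assumed shape of results.json; the real schema is not shown in this PR.
interface TaskResult {
  task_id: string;
  is_resolved: boolean;
}

interface ResultsFile {
  results: TaskResult[];
}

// Produce one "task_id: ✓ PASS" or "task_id: ✗ FAIL" line per task.
function summarizeResults(file: ResultsFile): string[] {
  return file.results.map(
    (r) => `${r.task_id}: ${r.is_resolved ? "✓ PASS" : "✗ FAIL"}`
  );
}

// Made-up sample data, for demonstration only.
const sampleResults: ResultsFile = {
  results: [
    { task_id: "example-task-a", is_resolved: true },
    { task_id: "example-task-b", is_resolved: false },
  ],
};

for (const line of summarizeResults(sampleResults)) {
  console.log(line);
}
```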

3. Added Model Verification Logging

Added logging in agentSessionCli.ts to confirm model names:

console.error(`[cmux-cli] Using model: ${model}`);
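
For context, a minimal sketch of how a `--model` flag might be read and reported. The functions `parseModelFlag` and `logModel` are hypothetical and do not reproduce the actual parsing in agentSessionCli.ts, which is not shown here. Only the log line format comes from this PR.

```typescript
// Hypothetical sketch; the real CLI parsing in agentSessionCli.ts may differ.

// Return the value following "--model", or undefined if the flag or its value is missing.
function parseModelFlag(argv: string[]): string | undefined {
  const index = argv.indexOf("--model");
  if (index === -1 || index + 1 >= argv.length) {
    return undefined;
  }
  return argv[index + 1];
}

// Log to stderr so that stdout stays clean for result parsing.
function logModel(model: string): string {
  const line = `[cmux-cli] Using model: ${model}`;
  console.error(line);
  return line;
}

const exampleModel = parseModelFlag(["--model", "anthropic:claude-sonnet-4-5"]);
if (exampleModel !== undefined) {
  logModel(exampleModel);
}
```

Writing the message with `console.error` keeps it on stderr, which matches the stated goal of not interfering with stdout-based result parsing.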

Investigation: Identical 42.50% Accuracy

During the manual run (#18913267878), both models achieved exactly 42.50% accuracy (34/80 tasks). Investigation revealed:

Facts:

  • Both models had 24 timeouts (360s limit)
  • 50% overlap: only 12 tasks timed out for both models
  • Each model finished 56 tasks without timing out and passed 34 of them (60.7% pass rate)
  • Results stored in separate timestamped directories (runs/2025-10-29__15-29-47 vs 15-29-29)
  • 720 files were ready to upload but artifact upload failed
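
The percentages above follow directly from the reported counts. A small TypeScript check of the arithmetic, using only the numbers stated in this section (the `percent` helper is a hypothetical name introduced for the example):

```typescript
// Format part/whole as a percentage string with a fixed number of decimals.
function percent(part: number, whole: number, digits: number): string {
  return ((100 * part) / whole).toFixed(digits);
}

// Counts reported for each model in the manual run.
const totalTasks = 80;
const timedOut = 24;
const passedTasks = 34;
const sharedTimeouts = 12;

// Tasks that finished without hitting the timeout: 80 - 24 = 56.
const finishedTasks = totalTasks - timedOut;

console.log(`accuracy: ${percent(passedTasks, totalTasks, 2)}%`); // 42.50%
console.log(`pass rate on finished tasks: ${percent(passedTasks, finishedTasks, 1)}%`); // 60.7%
console.log(`timeout overlap: ${percent(sharedTimeouts, timedOut, 0)}%`); // 50%
```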

Code Path Verification:

Traced model parameter through the entire chain:

  1. ✅ Workflow → TB_ARGS: --agent-kwarg model_name=<model>
  2. ✅ Makefile → Passes $TB_ARGS to terminal-bench
  3. ✅ cmux_agent.py → Constructor accepts model_name, sets CMUX_MODEL env var
  4. ✅ cmux-run.sh → Passes --model "${CMUX_MODEL}" to CLI
  5. ✅ agentSessionCli.ts → Parses --model flag and uses it

The code is correct. The identical scores are statistically unlikely but possible with offsetting timeout patterns.

Next Steps:

With the new results logging, the next benchmark run will show:

  • ✅ Model name used (in stderr logs)
  • ✅ Full results.json (in workflow logs)
  • ✅ Per-task pass/fail breakdown (in workflow logs)
  • ✅ Artifacts uploaded successfully (with fixed names)

This allows full verification that models produce different task-level results.

Testing

The next nightly run (tonight at 00:00 UTC) will:

  • Successfully upload artifacts with names like:
    • terminal-bench-results-anthropic-claude-sonnet-4-5-<run_id>
    • terminal-bench-results-openai-gpt-5-codex-<run_id>
  • Show task-level results in workflow logs (survives even if upload fails)
  • Confirm each model in logs: [cmux-cli] Using model: <model_name>

Generated with cmux

Artifact names cannot contain colons due to filesystem restrictions.
Convert model names like 'anthropic:claude-sonnet-4-5' to
'anthropic-claude-sonnet-4-5' in artifact names.

This fixes the nightly benchmark workflow failures where artifacts
were generated successfully but failed to upload.

Add console.error logging to confirm which model is being used
when running terminal-bench. This helps verify that model_name
is correctly passed through the workflow -> Makefile -> agent -> CLI chain.

Logging to stderr to avoid interfering with stdout-based result parsing.

Add a step to output the full results.json and per-task summary
before uploading artifacts. This ensures we can see task-level
results even if artifact upload fails.

Output includes:
- Full results.json (with jq formatting if available)
- Per-task summary: task_id: ✓ PASS / ✗ FAIL

This will help verify that different models are producing
different results, not just different timeout patterns.
@ammario ammario changed the title from "🤖 fix: replace colons in TB artifact names with hyphens" to "🤖 ci: replace colons in TB artifact names with hyphens" Oct 29, 2025
@ammario ammario enabled auto-merge October 29, 2025 19:28
@ammario ammario added this pull request to the merge queue Oct 29, 2025
Merged via the queue into main with commit cd58afa Oct 29, 2025
13 checks passed
@ammario ammario deleted the tb-workflow branch October 29, 2025 19:41
github-merge-queue bot pushed a commit that referenced this pull request Oct 29, 2025
## Problem

PR #488 introduced a syntax error by using the `replace()` function in a
GitHub Actions workflow expression:

```yaml
name: terminal-bench-results-${{ inputs.model_name && replace(format('{0}-{1}', inputs.model_name, github.run_id), ':', '-') || format('{0}', github.run_id) }}
```

**GitHub Actions does not support `replace()` as a function in workflow
expressions.**

This caused the nightly workflow to fail with a syntax error.

## Solution

Use a shell script step to compute the artifact name using the `tr`
command:

```yaml
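- name: Set artifact name
  id: artifact-name
  env:
    # Pass inputs through env vars rather than interpolating them into the script.
    MODEL_NAME: ${{ inputs.model_name }}
    RUN_ID: ${{ github.run_id }}
  run: |
    if [ -n "$MODEL_NAME" ]; then
      # Artifact names cannot contain colons, so map ':' to '-'.
      ARTIFACT_NAME="terminal-bench-results-$(echo "${MODEL_NAME}-${RUN_ID}" | tr ':' '-')"
    else
      ARTIFACT_NAME="terminal-bench-results-${RUN_ID}"
    fi
    echo "name=$ARTIFACT_NAME" >> "$GITHUB_OUTPUT"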

- name: Upload benchmark results
  uses: actions/upload-artifact@v4
  with:
    name: ${{ steps.artifact-name.outputs.name }}
```

This uses standard shell string manipulation which is fully supported.

## Testing

The workflow syntax now validates correctly. The next nightly run
should:
- ✅ Parse successfully without syntax errors
- ✅ Replace colons with hyphens: `anthropic-claude-sonnet-4-5`
- ✅ Upload artifacts with valid names

---

_Generated with `cmux`_