🤖 bench: use GPT-5.5 for tbench #3193

Merged
ibetitsmike merged 2 commits into main from mike/tbench-eq2r on Apr 26, 2026

@ibetitsmike (Contributor) commented Apr 25, 2026

Mux working on behalf of Mike.

Summary

Updates nightly Terminal-Bench defaults to run Opus 4.7 at xhigh thinking and GPT-5.5 at high thinking while dropping the older GPT Codex model from the default matrix. Adds leaderboard metadata for Opus 4.7 and GPT-5.5, and refreshes TBench workflow and skill examples.

Background

GPT-5.5 xhigh runs were timing out in TBench, so the nightly workflow keeps GPT-5.5 at high while preserving xhigh for Opus 4.7.
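
For readers who want the resulting defaults at a glance, here is a minimal Python sketch of the nightly matrix described above. The `MatrixEntry` type, the `NIGHTLY_DEFAULTS` name, and the `thinking_for` helper are hypothetical and only illustrate the pairing of model and thinking level; the actual workflow files may express this differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatrixEntry:
    """One nightly run: a model paired with a thinking level (hypothetical type)."""

    model: str     # display name as used in this PR description
    thinking: str  # thinking level requested for the run


# Hypothetical mirror of the nightly defaults after this PR.
NIGHTLY_DEFAULTS = [
    MatrixEntry(model="Opus 4.7", thinking="xhigh"),
    # GPT-5.5 stays at high because xhigh runs were timing out in TBench.
    MatrixEntry(model="GPT-5.5", thinking="high"),
]


def thinking_for(model: str) -> str:
    """Return the default thinking level for a model in the nightly matrix."""
    for entry in NIGHTLY_DEFAULTS:
        if entry.model == model:
            return entry.thinking
    raise KeyError(f"{model!r} is not in the nightly defaults")
```

Models outside the default matrix, such as the dropped GPT Codex model, raise `KeyError` in this sketch rather than falling back silently.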

Validation

  • make static-check
  • python3 -m py_compile benchmarks/terminal_bench/prepare_leaderboard_submission.py
  • go run github.com/rhysd/actionlint/cmd/actionlint@v1.7.7 .github/workflows/nightly-terminal-bench.yml .github/workflows/terminal-bench.yml
  • /home/coder/.local/bin/uvx ruff format --check benchmarks/terminal_bench/prepare_leaderboard_submission.py
  • git diff --check
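
To re-run the checks above in one go, a small wrapper along these lines could work. This is only a sketch: the `first_failure` helper and the `VALIDATION_COMMANDS` name are hypothetical, it assumes `uvx` is on `PATH` instead of the absolute path used above, and it assumes it is invoked from the repository root.

```python
import shlex
import subprocess

# Commands copied from the validation list; `uvx` on PATH is an assumption.
VALIDATION_COMMANDS = [
    "make static-check",
    "python3 -m py_compile benchmarks/terminal_bench/prepare_leaderboard_submission.py",
    "go run github.com/rhysd/actionlint/cmd/actionlint@v1.7.7 "
    ".github/workflows/nightly-terminal-bench.yml .github/workflows/terminal-bench.yml",
    "uvx ruff format --check benchmarks/terminal_bench/prepare_leaderboard_submission.py",
    "git diff --check",
]


def first_failure(commands, runner=subprocess.run):
    """Run each command in order; return the first failing one, or None if all pass."""
    for command in commands:
        result = runner(shlex.split(command))
        if result.returncode != 0:
            return command
    return None


if __name__ == "__main__":
    failed = first_failure(VALIDATION_COMMANDS)
    if failed is not None:
        raise SystemExit(f"validation failed: {failed}")
```

The `runner` parameter exists so the loop can be exercised without invoking the real tools.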

Generated with mux • Model: openai:gpt-5.5 • Thinking: xhigh • Cost: $16.42

Switch nightly Terminal-Bench defaults to GPT-5.5 and Opus 4.7 with xhigh thinking. Add leaderboard metadata for both models and update tbench examples.

---

_Generated with `mux` • Model: `openai:gpt-5.5` • Thinking: `xhigh` • Cost: `$9.21`_

<!-- mux-attribution: model=openai:gpt-5.5 thinking=xhigh costs=9.21 -->
@ibetitsmike
Contributor Author

@codex review

Mux working on behalf of Mike.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. What shall we delve into next?

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Use high thinking for GPT-5.5 Terminal-Bench runs because xhigh was timing out. Keep Opus 4.7 on xhigh.

---

_Generated with `mux` • Model: `openai:gpt-5.5` • Thinking: `xhigh` • Cost: `$16.42`_

<!-- mux-attribution: model=openai:gpt-5.5 thinking=xhigh costs=16.42 -->
@ibetitsmike
Contributor Author

@codex review

Mux working on behalf of Mike.

@ibetitsmike enabled auto-merge April 26, 2026 01:34

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Swish!

@ibetitsmike added this pull request to the merge queue Apr 26, 2026
Merged via the queue into main with commit e1bf54b on Apr 26, 2026
24 checks passed
@ibetitsmike deleted the mike/tbench-eq2r branch April 26, 2026 01:52