@ammar-agent
Collaborator

Generated with cmux

Adds a manually triggerable GitHub Actions workflow that runs make benchmark-terminal.

Features:

  • Workflow can be triggered from GitHub Actions UI with custom parameters
  • 3 hour timeout to accommodate long-running benchmarks
  • Installs uv for uvx terminal-bench execution
  • Configurable inputs:
    • Dataset (default: terminal-bench-core==0.1.1)
    • Concurrency (default: 4)
    • Livestream (default: true for progress visibility)
    • Extra args for custom options
  • Uploads benchmark results as artifacts (even on failure)
  • Uses ANTHROPIC_API_KEY and OPENAI_API_KEY from secrets
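
For readers who want a picture of how the features above could fit together, here is a minimal sketch of such a workflow. It is not the file merged in this PR. The input names, the TB_* environment variable names, the results path, and the pinned action versions are all assumptions made for illustration; only the defaults, the 3 hour timeout, the secrets, and the make target come from the description above.

```yaml
# Illustrative sketch only, not the workflow from this PR.
# Assumed (hypothetical): input names, TB_* variable names, results path, action versions.
name: Terminal-Bench

on:
  workflow_dispatch:            # manual trigger from the Actions UI
    inputs:
      dataset:
        description: Terminal-Bench dataset
        type: string
        default: terminal-bench-core==0.1.1
      concurrency:
        description: Number of concurrent tasks
        type: string
        default: "4"
      livestream:
        description: Stream progress to the job log
        type: boolean
        default: true
      extra_args:
        description: Extra arguments for custom options
        type: string
        default: ""

jobs:
  benchmark:
    runs-on: ubuntu-latest
    timeout-minutes: 180        # 3 hour timeout
    steps:
      - uses: actions/checkout@v4

      # Provides uv and uvx on the runner
      - uses: astral-sh/setup-uv@v5

      - name: Run benchmark
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          # Hypothetical variable names; the real Makefile may read different ones.
          TB_DATASET: ${{ inputs.dataset }}
          TB_CONCURRENCY: ${{ inputs.concurrency }}
          TB_LIVESTREAM: ${{ inputs.livestream }}
          TB_ARGS: ${{ inputs.extra_args }}
        run: make benchmark-terminal

      - name: Upload benchmark results
        if: always()            # upload even when the benchmark step fails
        uses: actions/upload-artifact@v4
        with:
          name: terminal-bench-results
          path: runs/           # hypothetical output directory
```

Passing the inputs through env rather than interpolating them into the run line keeps user-supplied text out of the shell command, and if: always() is what makes the artifact upload run after a failed benchmark.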

- Manually triggerable workflow for running terminal-bench
- 3 hour timeout for long-running benchmarks
- Configurable dataset, concurrency (default: 4), and livestream (default: true)
- Installs uv for uvx terminal-bench command
- Uploads benchmark results as artifacts

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR.

- Adds fmt-python and fmt-python-check targets
- Uses uvx ruff format for fast, Black-compatible formatting
- Automatically formats benchmarks/ directory
- Integrated into main fmt and fmt-check targets
@ammario ammario merged commit 79a124f into main Oct 17, 2025
8 checks passed
@ammario ammario deleted the term-bench-ci branch October 17, 2025 14:21