feat(ci): e2e benchmark bot triggered by a PR comment #2574
Merged
Comment "/e2e" on a pull request to run the committed task suite against the real provider and get an accuracy / cache-hit / token / cost report posted back. Gated to OWNER/MEMBER/COLLABORATOR because it checks out PR-head code and runs it with the provider API key.

- reasonix run --metrics writes a JSON token/cache/cost summary of a run.
- cmd/e2ebench copies each task's seed workdir to a temp dir, runs the agent there via the built binary, then grades with verify.sh (dropped in only after the run so the answer key isn't readable), and aggregates the per-task metrics into a markdown report.
- benchmarks/e2e seeds three tasks (fizzbuzz, fix-add-bug, palindrome), each graded by a verify.sh that exits non-zero on failure.
- .github/workflows/e2e-bot.yml wires the comment trigger, writes the provider config, builds, runs the suite under a token budget, and posts the report.

Needs a DEEPSEEK_API_KEY repo secret; model/base URL are overridable via repo variables.
added 2 commits
June 1, 2026 05:25
Comment "/e2e diff" on a PR to have the agent, on the PR branch, write tests covering what the PR changed and run them, graded by the repo's own tests, with the accuracy/cache/token/cost report posted back.

- e2ebench -mode diff: derives the changed Go source files and packages from base...HEAD, prompts the agent to add tests for them (not touch source), then grades: pass = the agent added >=1 new test function AND `go test` on the affected packages is green. A no-op can't pass since the suite was already green; non-test source edits by the agent are surfaced as a warning.
- run-metrics is reused for the cost/token/cache numbers.
- the workflow branches on "/e2e diff", computes the base via merge-base, and writes the provider config to the repo root too (project config wins over user config, and diff mode runs the agent in the repo root).

Plain "/e2e" still runs the fixed suite.
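One way to check the "added >=1 new test function" condition is to compare the top-level Test functions in a test file before and after the agent ran. The sketch below is an assumption about how that could be done with the standard go/parser package, not the grader's actual method; `isTestName`, `testFuncs`, and `addedTests` are hypothetical names.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
	"strings"
	"unicode"
	"unicode/utf8"
)

// isTestName reports whether name follows the go test convention:
// "Test" followed by nothing or by a rune that is not lowercase.
func isTestName(name string) bool {
	if !strings.HasPrefix(name, "Test") {
		return false
	}
	rest := name[len("Test"):]
	if rest == "" {
		return true
	}
	r, _ := utf8.DecodeRuneInString(rest)
	return !unicode.IsLower(r)
}

// testFuncs returns the set of top-level test function names in src.
// An empty src (file did not exist before) yields an empty set.
func testFuncs(src string) (map[string]bool, error) {
	names := map[string]bool{}
	if strings.TrimSpace(src) == "" {
		return names, nil
	}
	f, err := parser.ParseFile(token.NewFileSet(), "x_test.go", src, 0)
	if err != nil {
		return nil, err
	}
	for _, decl := range f.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if ok && fn.Recv == nil && isTestName(fn.Name.Name) {
			names[fn.Name.Name] = true
		}
	}
	return names, nil
}

// addedTests lists, in sorted order, test functions present in after
// but not in before.
func addedTests(before, after string) ([]string, error) {
	old, err := testFuncs(before)
	if err != nil {
		return nil, err
	}
	cur, err := testFuncs(after)
	if err != nil {
		return nil, err
	}
	var added []string
	for name := range cur {
		if !old[name] {
			added = append(added, name)
		}
	}
	sort.Strings(added)
	return added, nil
}

func main() {
	before := `package calc
import "testing"
func TestAdd(t *testing.T) {}
`
	after := `package calc
import "testing"
func TestAdd(t *testing.T) {}
func TestAddNegative(t *testing.T) {}
func helper() {}
`
	added, err := addedTests(before, after)
	fmt.Println(added, err) // prints "[TestAddNegative] <nil>"
}
```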
The diff grader now reverts the PR's source to its pre-change state and re-runs the generated tests: they must FAIL on the old code, proving they capture the change rather than asserting behavior that already held. Pass now requires added tests + green on HEAD + failing on base. Also point the workflow's default model/pricing at deepseek-v4-flash to match the project's own provider config.
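The three-part pass condition (added tests, green on HEAD, failing on base) reduces to a small predicate. The sketch below only illustrates that logic; the `diffOutcome` type and its field names are hypothetical and are not taken from the e2ebench source.

```go
package main

import "fmt"

// diffOutcome (hypothetical) collects what the grader observed for one run.
type diffOutcome struct {
	AddedTests  int      // new test functions written by the agent
	GreenOnHead bool     // go test passes on the PR's code
	FailsOnBase bool     // the same tests fail on the pre-change source
	SourceEdits []string // non-test files the agent touched (warning only)
}

// verdict applies the rule: added tests + green on HEAD + failing on base.
// It returns whether the run passes and a short reason.
func (o diffOutcome) verdict() (bool, string) {
	switch {
	case o.AddedTests < 1:
		return false, "no new test function added"
	case !o.GreenOnHead:
		return false, "go test is not green on HEAD"
	case !o.FailsOnBase:
		return false, "new tests also pass on the pre-change source"
	}
	return true, "pass"
}

func main() {
	o := diffOutcome{AddedTests: 2, GreenOnHead: true, FailsOnBase: false}
	ok, why := o.verdict()
	fmt.Println(ok, why) // prints "false new tests also pass on the pre-change source"
}
```

Source edits are kept out of the verdict here because the commit above describes them as a warning rather than a failure.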
What
Adds a comment-triggered e2e benchmark bot. Comment `/e2e` on a PR and it runs the committed task suite against the real provider and posts a report back.

Pieces
- `reasonix run --metrics <file>`: writes a JSON token/cache/cost summary of a run (no behaviour change otherwise).
- `cmd/e2ebench`: copies each task's seed `workdir/` into a temp dir, runs the agent there via the built binary, then grades with `verify.sh` (dropped in only after the run, so the answer key isn't readable), and aggregates per-task metrics into the markdown report. Token-budget capped (`-budget`, default 400k).
- `benchmarks/e2e/`: three seed tasks (`fizzbuzz`, `fix-add-bug`, `palindrome`), each graded by a `verify.sh` that exits non-zero on failure. To add a task, drop in a new dir with `task.toml`, an optional `workdir/`, and `verify.sh`.
- `.github/workflows/e2e-bot.yml`: `issue_comment` trigger gated to OWNER/MEMBER/COLLABORATOR (it runs PR-head code with the API key, so it must be trust-gated), writes the provider config, builds, runs the suite, posts the report.

Before it works: one manual step
- Add the repo secret `DEEPSEEK_API_KEY` (Settings → Secrets → Actions).
- Optional repo variables: `REASONIX_E2E_MODEL` (default `deepseek-chat`), `REASONIX_E2E_BASE_URL` (default `https://api.deepseek.com`). Pricing in the workflow is a placeholder; adjust to the live rate if you want exact cost.

Verification
Built + vet + `go test ./internal/cli` green. Harness validated end-to-end offline (a fake agent that writes the solution + metrics → 3/3, metrics aggregated, report rendered), and `reasonix run --metrics` validated against a local mock (cost math exact). The live provider path runs in CI once the secret is set.
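For readers unfamiliar with how a token/cache/cost summary is typically aggregated, here is a rough sketch. Everything in it is an assumption for illustration: the `usage` and `pricing` types, their field names, the split into cache-hit input, cache-miss input, and output tokens, and the per-million-token rates are hypothetical and do not reflect the actual `--metrics` JSON schema or the provider's live prices.

```go
package main

import "fmt"

// usage (hypothetical) holds token counts for one task run.
type usage struct {
	CacheHitTokens  int64 // input tokens served from the provider cache
	CacheMissTokens int64 // input tokens billed at the full input rate
	OutputTokens    int64
}

// pricing (hypothetical) is in USD per one million tokens.
type pricing struct {
	HitPerM, MissPerM, OutputPerM float64
}

// add returns the element-wise sum, used to aggregate per-task metrics.
func (u usage) add(v usage) usage {
	return usage{
		CacheHitTokens:  u.CacheHitTokens + v.CacheHitTokens,
		CacheMissTokens: u.CacheMissTokens + v.CacheMissTokens,
		OutputTokens:    u.OutputTokens + v.OutputTokens,
	}
}

// cacheHitRate is the share of input tokens served from cache (0 if none).
func (u usage) cacheHitRate() float64 {
	in := u.CacheHitTokens + u.CacheMissTokens
	if in == 0 {
		return 0
	}
	return float64(u.CacheHitTokens) / float64(in)
}

// cost prices each token class at its own rate.
func (u usage) cost(p pricing) float64 {
	total := float64(u.CacheHitTokens)*p.HitPerM +
		float64(u.CacheMissTokens)*p.MissPerM +
		float64(u.OutputTokens)*p.OutputPerM
	return total / 1e6
}

func main() {
	// Placeholder rates, not real provider prices.
	rates := pricing{HitPerM: 0.5, MissPerM: 1.0, OutputPerM: 2.0}
	tasks := []usage{
		{CacheHitTokens: 400_000, CacheMissTokens: 1_000_000, OutputTokens: 200_000},
		{CacheHitTokens: 600_000, CacheMissTokens: 1_000_000, OutputTokens: 300_000},
	}
	var total usage
	for _, t := range tasks {
		total = total.add(t)
	}
	fmt.Printf("hit rate %.2f, cost $%.2f\n", total.cacheHitRate(), total.cost(rates))
}
```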