feat(ci): e2e benchmark bot triggered by a PR comment by esengine · Pull Request #2574 · esengine/DeepSeek-Reasonix

esengine · 2026-06-01T12:10:55Z

What

Adds a comment-triggered e2e benchmark bot. Comment /e2e on a PR and it runs the committed task suite against the real provider and posts a report back:

🤖 Reasonix e2e benchmark

Accuracy: 3/3 (100%) · Cache hit: 83% · Tokens: 4,050 (prompt 3,600 / completion 450) · Cost: ¥ 0.0369

| Task | Result | Steps | Prompt | Completion | Cache hit | Cost |
|---|---|---|---|---|---|---|
| fizzbuzz | ✅ pass | 4 | 1,200 | 150 | 83% | ¥ 0.0123 |
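
As a rough illustration of how the headline line could be derived from per-task rows, here is a minimal Go sketch. The `TaskMetrics` and `Summary` types, their field names, and the cached-token count of 996 are assumptions made for the example; they are not the schema that `reasonix run --metrics` writes. The input repeats the fizzbuzz row three times, which happens to reproduce the headline totals, but the real rows for the other two tasks are not shown in this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// TaskMetrics is a hypothetical per-task record; the field names are
// assumptions, not the actual metrics schema.
type TaskMetrics struct {
	Name             string
	Passed           bool
	PromptTokens     int
	CachedTokens     int // prompt tokens served from the provider cache (assumed field)
	CompletionTokens int
	Cost             float64 // yuan
}

// Summary holds the totals shown in the report headline.
type Summary struct {
	Passed, Total              int
	Prompt, Cached, Completion int
	Cost                       float64
}

// Aggregate sums per-task metrics into suite-level totals.
func Aggregate(tasks []TaskMetrics) Summary {
	var s Summary
	for _, t := range tasks {
		s.Total++
		if t.Passed {
			s.Passed++
		}
		s.Prompt += t.PromptTokens
		s.Cached += t.CachedTokens
		s.Completion += t.CompletionTokens
		s.Cost += t.Cost
	}
	return s
}

// pct returns num/den as a percentage rounded to the nearest integer,
// or 0 when den is 0.
func pct(num, den int) int {
	if den == 0 {
		return 0
	}
	return (100*num + den/2) / den
}

// commas formats a non-negative integer with thousands separators.
func commas(n int) string {
	s := strconv.Itoa(n)
	for i := len(s) - 3; i > 0; i -= 3 {
		s = s[:i] + "," + s[i:]
	}
	return s
}

// Headline renders the one-line summary in the style of the report above.
func (s Summary) Headline() string {
	return fmt.Sprintf(
		"Accuracy: %d/%d (%d%%) · Cache hit: %d%% · Tokens: %s (prompt %s / completion %s) · Cost: ¥ %.4f",
		s.Passed, s.Total, pct(s.Passed, s.Total),
		pct(s.Cached, s.Prompt),
		commas(s.Prompt+s.Completion), commas(s.Prompt), commas(s.Completion),
		s.Cost,
	)
}

func main() {
	// Illustrative input: the fizzbuzz row repeated three times.
	row := TaskMetrics{Name: "fizzbuzz", Passed: true, PromptTokens: 1200,
		CachedTokens: 996, CompletionTokens: 150, Cost: 0.0123}
	fmt.Println(Aggregate([]TaskMetrics{row, row, row}).Headline())
}
```

Here the cache-hit rate is computed over summed tokens rather than by averaging per-task percentages, which weights larger prompts more heavily. Whether the real report does the same is not stated in the PR.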

Pieces

- reasonix run --metrics <file>: writes a JSON token/cache/cost summary of a run (no behaviour change otherwise).
- cmd/e2ebench: copies each task's seed workdir/ into a temp dir, runs the agent there via the built binary, then grades with verify.sh (dropped in only after the run, so the agent can't read the answer key), and aggregates per-task metrics into the markdown report. Token usage is capped by a budget (-budget, default 400k).
- benchmarks/e2e/: three seed tasks (fizzbuzz, fix-add-bug, palindrome), each graded by a verify.sh that exits non-zero on failure. To add a task, drop in a new dir with task.toml, an optional workdir/, and verify.sh.
- .github/workflows/e2e-bot.yml: issue_comment trigger gated to OWNER/MEMBER/COLLABORATOR (it runs PR-head code with the API key, so it must be trust-gated), writes the provider config, builds, runs the suite, posts the report.

Before it works — one manual step

Add a repo secret DEEPSEEK_API_KEY (Settings → Secrets → Actions).
Optional repo variables to override defaults: REASONIX_E2E_MODEL (default deepseek-chat), REASONIX_E2E_BASE_URL (default https://api.deepseek.com). Pricing in the workflow is a placeholder — adjust to the live rate if you want exact cost.

Verification

Built + vet + go test ./internal/cli green. Harness validated end-to-end offline (a fake agent that writes the solution + metrics → 3/3, metrics aggregated, report rendered) and reasonix run --metrics validated against a local mock (cost math exact). The live provider path runs in CI once the secret is set.

Comment "/e2e" on a pull request to run the committed task suite against the real provider and get an accuracy / cache-hit / token / cost report posted back. Gated to OWNER/MEMBER/COLLABORATOR because it checks out PR-head code and runs it with the provider API key.

- reasonix run --metrics writes a JSON token/cache/cost summary of a run.
- cmd/e2ebench copies each task's seed workdir to a temp dir, runs the agent there via the built binary, then grades with verify.sh (dropped in only after the run so the answer key isn't readable), and aggregates the per-task metrics into a markdown report.
- benchmarks/e2e seeds three tasks (fizzbuzz, fix-add-bug, palindrome), each graded by a verify.sh that exits non-zero on failure.
- .github/workflows/e2e-bot.yml wires the comment trigger, writes the provider config, builds, runs the suite under a token budget, and posts the report.

Needs a DEEPSEEK_API_KEY repo secret; model/base URL are overridable via repo variables.

Comment "/e2e diff" on a PR to have the agent, on the PR branch, write tests covering what the PR changed and run them, graded by the repo's own tests, with the accuracy/cache/token/cost report posted back.

- e2ebench -mode diff: derives the changed Go source files and packages from base...HEAD, prompts the agent to add tests for them (not touch source), then grades: pass = the agent added >=1 new test function AND `go test` on the affected packages is green. A no-op can't pass since the suite was already green; non-test source edits by the agent are surfaced as a warning.
- run-metrics is reused for the cost/token/cache numbers.
- the workflow branches on "/e2e diff", computes the base via merge-base, and writes the provider config to the repo root too (project config wins over user config, and diff mode runs the agent in the repo root).

Plain "/e2e" still runs the fixed suite.
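
The trust gate and the "/e2e" versus "/e2e diff" branch are implemented in the workflow, which is not shown here. Purely to illustrate the decision logic, here is a Go sketch. `ParseCommand`, the `Mode` type, and the choice to require the comment to consist of exactly the command are assumptions; the real workflow may match comments differently. The association strings are the GitHub `author_association` values named in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// Mode says which benchmark, if any, a comment should trigger.
type Mode int

const (
	ModeNone  Mode = iota // not a command, or commenter not trusted
	ModeSuite             // "/e2e": run the fixed task suite
	ModeDiff              // "/e2e diff": generate and grade tests for the PR's changes
)

// trusted reports whether the commenter's author_association is one of the
// three values the PR says the workflow accepts.
func trusted(assoc string) bool {
	switch assoc {
	case "OWNER", "MEMBER", "COLLABORATOR":
		return true
	}
	return false
}

// ParseCommand maps a comment body and author association to a Mode.
// Assumption: the comment must contain only the command itself.
func ParseCommand(body, assoc string) Mode {
	if !trusted(assoc) {
		return ModeNone
	}
	fields := strings.Fields(body)
	switch {
	case len(fields) == 1 && fields[0] == "/e2e":
		return ModeSuite
	case len(fields) == 2 && fields[0] == "/e2e" && fields[1] == "diff":
		return ModeDiff
	}
	return ModeNone
}

func main() {
	fmt.Println(ParseCommand("/e2e", "OWNER") == ModeSuite)
	fmt.Println(ParseCommand("/e2e diff", "MEMBER") == ModeDiff)
	fmt.Println(ParseCommand("/e2e", "CONTRIBUTOR") == ModeNone)
}
```

Checking trust before parsing means an untrusted comment never reaches any code path that would check out PR-head code or touch the API key.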

The diff grader now reverts the PR's source to its pre-change state and re-runs the generated tests: they must FAIL on the old code, proving they capture the change rather than asserting behavior that already held. Pass now requires added tests + green on HEAD + failing on base. Also point the workflow's default model/pricing at deepseek-v4-flash to match the project's own provider config.

github-actions Bot added the v2 Go rewrite (1.x) — main-v2 branch, active development label Jun 1, 2026

reasonix added 2 commits June 1, 2026 05:25

esengine merged commit f38ea2c into main-v2 Jun 1, 2026
2 checks passed

esengine deleted the feat/ci-e2e-bot branch June 1, 2026 12:38

esengine merged 3 commits into main-v2 from feat/ci-e2e-bot
