feat(ci): e2e benchmark bot triggered by a PR comment#2574

Merged: esengine merged 3 commits into main-v2 from feat/ci-e2e-bot on Jun 1, 2026.

Conversation

@esengine (Owner) commented Jun 1, 2026

What

Adds a comment-triggered e2e benchmark bot. Comment /e2e on a PR and it runs the committed task suite against the real provider and posts a report back:

🤖 Reasonix e2e benchmark

Accuracy: 3/3 (100%) · Cache hit: 83% · Tokens: 4,050 (prompt 3,600 / completion 450) · Cost: ¥ 0.0369

| Task | Result | Steps | Prompt | Completion | Cache hit | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| fizzbuzz | ✅ pass | 4 | 1,200 | 150 | 83% | ¥ 0.0123 |
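
The header line of the report can be derived from the per-task rows. A minimal Go sketch of that aggregation follows. The `TaskMetrics` and `Summary` types and their field names are hypothetical (not taken from the PR), "cache hit" is assumed to mean cached prompt tokens over total prompt tokens, and the demo simply repeats the fizzbuzz row three times, which happens to match the sample totals above. Thousands separators are omitted to keep it short.

```go
package main

import "fmt"

// TaskMetrics is a hypothetical per-task record; the real e2ebench types may differ.
type TaskMetrics struct {
	Name             string
	Passed           bool
	PromptTokens     int
	CachedTokens     int // prompt tokens served from cache (assumed definition of "cache hit")
	CompletionTokens int
	Cost             float64 // yuan
}

// Summary holds the totals shown in the report's header line.
type Summary struct {
	Passed, Total              int
	Prompt, Cached, Completion int
	Cost                       float64
}

// Aggregate sums per-task metrics into one Summary.
func Aggregate(tasks []TaskMetrics) Summary {
	var s Summary
	for _, t := range tasks {
		s.Total++
		if t.Passed {
			s.Passed++
		}
		s.Prompt += t.PromptTokens
		s.Cached += t.CachedTokens
		s.Completion += t.CompletionTokens
		s.Cost += t.Cost
	}
	return s
}

// pct returns part/whole as a whole percentage rounded to nearest, or 0 when whole is 0.
func pct(part, whole int) int {
	if whole == 0 {
		return 0
	}
	return (part*100 + whole/2) / whole
}

// Line renders the header line (without thousands separators).
func (s Summary) Line() string {
	return fmt.Sprintf(
		"Accuracy: %d/%d (%d%%) · Cache hit: %d%% · Tokens: %d (prompt %d / completion %d) · Cost: ¥ %.4f",
		s.Passed, s.Total, pct(s.Passed, s.Total), pct(s.Cached, s.Prompt),
		s.Prompt+s.Completion, s.Prompt, s.Completion, s.Cost)
}

func main() {
	// Illustrative input only: the fizzbuzz row repeated for three tasks.
	row := TaskMetrics{Name: "fizzbuzz", Passed: true, PromptTokens: 1200,
		CachedTokens: 996, CompletionTokens: 150, Cost: 0.0123}
	fmt.Println(Aggregate([]TaskMetrics{row, row, row}).Line())
}
```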

Pieces

  • reasonix run --metrics <file> — writes a JSON token/cache/cost summary of a run (no behaviour change otherwise).
  • cmd/e2ebench — copies each task's seed workdir/ into a temp dir, runs the agent there via the built binary, then grades with verify.sh (dropped in only after the run, so the answer key isn't readable), and aggregates per-task metrics into the markdown report. Token-budget capped (-budget, default 400k).
  • benchmarks/e2e/ — three seed tasks (fizzbuzz, fix-add-bug, palindrome), each graded by a verify.sh that exits non-zero on failure. Add a task = drop a new dir with task.toml + optional workdir/ + verify.sh.
  • .github/workflows/e2e-bot.yml: an issue_comment trigger gated to OWNER/MEMBER/COLLABORATOR (it runs PR-head code with the API key, so it must be trust-gated). It writes the provider config, builds, runs the suite, and posts the report.

Before it works — one manual step

  • Add a repo secret DEEPSEEK_API_KEY (Settings → Secrets → Actions).
  • Optional repo variables to override defaults: REASONIX_E2E_MODEL (default deepseek-chat), REASONIX_E2E_BASE_URL (default https://api.deepseek.com). Pricing in the workflow is a placeholder — adjust to the live rate if you want exact cost.

Verification

Build, go vet, and go test ./internal/cli are green. The harness was validated end-to-end offline with a fake agent that writes the solution and metrics (3/3 passed, metrics aggregated, report rendered), and reasonix run --metrics was validated against a local mock (cost math exact). The live provider path runs in CI once the secret is set.

Comment "/e2e" on a pull request to run the committed task suite against
the real provider and get an accuracy / cache-hit / token / cost report
posted back. Gated to OWNER/MEMBER/COLLABORATOR because it checks out
PR-head code and runs it with the provider API key.
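
The trust gate amounts to a small predicate over the comment event. The real gate lives in the workflow file, so the Go below is only a sketch of the rule: `CommentEvent` and `Trusted` are hypothetical names. The extra pull-request check is included because issue_comment events also fire for comments on plain issues.

```go
package main

import "fmt"

// CommentEvent carries the fields of an issue_comment payload the gate needs
// (hypothetical type).
type CommentEvent struct {
	AuthorAssociation string // e.g. OWNER, MEMBER, COLLABORATOR, CONTRIBUTOR, NONE
	OnPullRequest     bool   // false when the comment is on a plain issue
}

// Trusted reports whether the commenter may trigger a run that uses the API key.
func Trusted(e CommentEvent) bool {
	if !e.OnPullRequest {
		return false
	}
	switch e.AuthorAssociation {
	case "OWNER", "MEMBER", "COLLABORATOR":
		return true
	}
	return false
}

func main() {
	fmt.Println(Trusted(CommentEvent{AuthorAssociation: "MEMBER", OnPullRequest: true}))
	fmt.Println(Trusted(CommentEvent{AuthorAssociation: "CONTRIBUTOR", OnPullRequest: true}))
}
```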

- reasonix run --metrics writes a JSON token/cache/cost summary of a run.
- cmd/e2ebench copies each task's seed workdir to a temp dir, runs the
  agent there via the built binary, then grades with verify.sh (dropped
  in only after the run so the answer key isn't readable), and aggregates
  the per-task metrics into a markdown report.
- benchmarks/e2e seeds three tasks (fizzbuzz, fix-add-bug, palindrome),
  each graded by a verify.sh that exits non-zero on failure.
- .github/workflows/e2e-bot.yml wires the comment trigger, writes the
  provider config, builds, runs the suite under a token budget, and posts
  the report.

Needs a DEEPSEEK_API_KEY repo secret; model/base URL are overridable via
repo variables.
@github-actions github-actions Bot added the v2 Go rewrite (1.x) — main-v2 branch, active development label Jun 1, 2026
reasonix added 2 commits June 1, 2026 05:25
Comment "/e2e diff" on a PR to have the agent, on the PR branch, write
tests covering what the PR changed and run them — graded by the repo's
own tests, with the accuracy/cache/token/cost report posted back.

- e2ebench -mode diff: derives the changed Go source files and packages
  from base...HEAD, prompts the agent to add tests for them (not touch
  source), then grades: pass = the agent added >=1 new test function AND
  `go test` on the affected packages is green. A no-op can't pass since
  the suite was already green; non-test source edits by the agent are
  surfaced as a warning.
- run-metrics is reused for the cost/token/cache numbers.
- the workflow branches on "/e2e diff", computes the base via merge-base,
  and writes the provider config to the repo root too (project config
  wins over user config, and diff mode runs the agent in the repo root).

Plain "/e2e" still runs the fixed suite.
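
The two commands can be told apart with a small parser. This sketch assumes the whole comment must be exactly "/e2e" or "/e2e diff" (ignoring surrounding whitespace); the actual workflow may match more loosely. `Mode` and `ParseCommand` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Mode names which benchmark a comment asks for (hypothetical type).
type Mode string

const (
	ModeSuite Mode = "suite" // plain "/e2e": the fixed task suite
	ModeDiff  Mode = "diff"  // "/e2e diff": tests for what the PR changed
)

// ParseCommand maps a comment body to a Mode; ok is false for anything else.
func ParseCommand(body string) (mode Mode, ok bool) {
	f := strings.Fields(body)
	switch {
	case len(f) == 1 && f[0] == "/e2e":
		return ModeSuite, true
	case len(f) == 2 && f[0] == "/e2e" && f[1] == "diff":
		return ModeDiff, true
	}
	return "", false
}

func main() {
	for _, body := range []string{"/e2e", "/e2e diff", "looks good to me"} {
		mode, ok := ParseCommand(body)
		fmt.Printf("%q -> %q %v\n", body, mode, ok)
	}
}
```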
The diff grader now reverts the PR's source to its pre-change state and
re-runs the generated tests: they must FAIL on the old code, proving they
capture the change rather than asserting behavior that already held. Pass
now requires added tests + green on HEAD + failing on base.
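
The three-part pass rule can be written as a short predicate. In this sketch `DiffResult` and `NewTestFuncs` are hypothetical names, and counting new tests by matching top-level `func Test...(` declarations with a regular expression is an assumption about how "added a test function" might be detected, not the grader's confirmed method.

```go
package main

import (
	"fmt"
	"regexp"
)

// testFunc matches top-level Go test function declarations.
var testFunc = regexp.MustCompile(`(?m)^func (Test\w*)\(`)

// NewTestFuncs returns test function names present in after but not in before.
func NewTestFuncs(before, after string) []string {
	seen := map[string]bool{}
	for _, m := range testFunc.FindAllStringSubmatch(before, -1) {
		seen[m[1]] = true
	}
	var added []string
	for _, m := range testFunc.FindAllStringSubmatch(after, -1) {
		if !seen[m[1]] {
			seen[m[1]] = true
			added = append(added, m[1])
		}
	}
	return added
}

// DiffResult collects the grader's observations (hypothetical type).
type DiffResult struct {
	NewTests    int      // test functions the agent added
	GreenOnHead bool     // go test passes on the PR's code
	FailsOnBase bool     // the same tests fail on the pre-change source
	SourceEdits []string // non-test files the agent touched (warning only)
}

// Pass requires added tests, green on HEAD, and failing on base.
func (r DiffResult) Pass() bool {
	return r.NewTests >= 1 && r.GreenOnHead && r.FailsOnBase
}

func main() {
	before := "package x\n\nfunc TestA(t *testing.T) {}\n"
	after := before + "\nfunc TestB(t *testing.T) {}\n"
	added := NewTestFuncs(before, after)
	r := DiffResult{NewTests: len(added), GreenOnHead: true, FailsOnBase: true}
	fmt.Println(added, r.Pass())
}
```

The base check is what rules out tests that assert behavior the old code already had: such tests would be green on both sides and therefore fail the predicate.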

Also point the workflow's default model/pricing at deepseek-v4-flash to
match the project's own provider config.
@esengine esengine merged commit f38ea2c into main-v2 Jun 1, 2026
2 checks passed
@esengine esengine deleted the feat/ci-e2e-bot branch June 1, 2026 12:38