Benchmark how well coding agents plan against a real product spec. Feed it a PRD, get a plan, evaluate coverage, and generate a stakeholder-ready report.
There is a product requirements document (PRD) in docs/prd/. You point a coding agent at it and ask it to produce an implementation plan. Then you evaluate how well that plan covers the PRD and generate the stakeholder report from that evaluation.
The default collection flow is now two steps. You run each step in a fresh conversation with the best model available.
Open a fresh conversation with your coding agent. Tell it:
Read
1-START_HERE.mdand follow its instructions.
The agent will read all the PRD documents in docs/prd/ and produce an implementation plan. Step 1 is planning only; it must not start implementation.
Output: results/PLAN.md
Open a new conversation (fresh context). Tell the agent:
Read
2-EVALUATE_PLAN.mdand follow its instructions.
The agent will read both the PRD and the plan from Step 1, then audit the plan for coverage and alignment. It scores every requirement as full, partial, or missing, writes PLAN_EVAL.md, and then generates PLAN_EVAL_REPORT.html from that finished evaluation.
The denominator is frozen in evaluator/requirements_catalog_v1.md, so the evaluator scores against the same requirement list every run instead of re-deriving it from scratch.
If the evaluator/ folder is missing, run python3 tools/fetch_evaluator.py first.
Requires: results/PLAN.md from Step 1
Primary output: results/PLAN_EVAL.md
Secondary output: results/PLAN_EVAL_REPORT.html
Derived output: results/PLAN_EVAL_DATA.json (generated deterministically from the markdown evaluation by the control workflow)
For a finished benchmark submission, results/PLAN_EVAL_REPORT.html is still expected. It is now normally produced during Step 2.
If results/PLAN_EVAL.md already exists and you only need to re-render the HTML report, you can still use the standalone report prompt:
Read
3-PLAN_EVAL_REPORT.mdand follow its instructions.
Each step consumes significant context. Starting fresh ensures the agent has maximum context window available for the task at hand. It also isolates each step so you can re-run one without re-running the others.
1-START_HERE.md # Step 1 prompt — planning
2-EVALUATE_PLAN.md # Step 2 prompt — evaluation
3-PLAN_EVAL_REPORT.md # Optional fallback prompt — HTML report rerender only
docs/prd/ # The product spec (PRD + supporting docs)
evaluator/requirements_catalog_v1.md # Frozen Step 2 denominator hidden from Step 1
tools/fetch_evaluator.py # Downloads the public evaluator bundle into evaluator/
results/ # All outputs land here
CLAUDE.md # Auto-loaded instructions for Claude Code
GEMINI.md # Auto-loaded instructions for Gemini
AGENTS.md # Auto-loaded instructions for other agents
INSTRUCTIONS.md # Development guidelines (referenced by agent configs)
Each completed run should end with these files in results/:
PLAN.md— the implementation plan from Step 1PLAN_EVAL.md— the human-readable evaluation from Step 2PLAN_EVAL_DATA.json— the structured evaluation data derived fromPLAN_EVAL.mdby the control workflowrun_metadata.json— structured run metadata captured by the control workflow, including series/case identity, declared scenario settings, observed timing/tool metrics, and plan-save outcomePLAN_EVAL_REPORT.html— the stakeholder-facing report normally generated during Step 2