Planning Benchmark

Benchmark how well coding agents plan against a real product spec. Feed it a PRD, get a plan, evaluate coverage, and generate a stakeholder-ready report.

What This Repo Does

There is a product requirements document (PRD) in docs/prd/. You point a coding agent at it and ask it to produce an implementation plan. Then you evaluate how well that plan covers the PRD and generate the stakeholder report from that evaluation.

The default collection flow is now two steps. You run each step in a fresh conversation with the best model available.

The Two-Step Default Workflow

Step 1: Generate the Plan

Open a fresh conversation with your coding agent. Tell it:

Read 1-START_HERE.md and follow its instructions.

The agent will read all the PRD documents in docs/prd/ and produce an implementation plan. Step 1 is planning only; it must not start implementation.

Output: results/PLAN.md

Step 2: Evaluate the Plan

Open a new conversation (fresh context). Tell the agent:

Read 2-EVALUATE_PLAN.md and follow its instructions.

The agent will read both the PRD and the plan from Step 1, then audit the plan for coverage and alignment. It scores every requirement as full, partial, or missing, writes PLAN_EVAL.md, and then generates PLAN_EVAL_REPORT.html from that finished evaluation. The denominator is frozen in evaluator/requirements_catalog_v1.md, so the evaluator scores against the same requirement list every run instead of re-deriving it from scratch. If the evaluator/ folder is missing, run python3 tools/fetch_evaluator.py first.

Requires: results/PLAN.md from Step 1 Primary output: results/PLAN_EVAL.md Secondary output: results/PLAN_EVAL_REPORT.html Derived output: results/PLAN_EVAL_DATA.json (generated deterministically from the markdown evaluation by the control workflow)

For a finished benchmark submission, results/PLAN_EVAL_REPORT.html is still expected. It is now normally produced during Step 2.

Optional Fallback

If results/PLAN_EVAL.md already exists and you only need to re-render the HTML report, you can still use the standalone report prompt:

Read 3-PLAN_EVAL_REPORT.md and follow its instructions.

Why Fresh Conversations?

Each step consumes significant context. Starting fresh ensures the agent has maximum context window available for the task at hand. It also isolates each step so you can re-run one without re-running the others.

Repo Structure

1-START_HERE.md              # Step 1 prompt — planning
2-EVALUATE_PLAN.md           # Step 2 prompt — evaluation
3-PLAN_EVAL_REPORT.md        # Optional fallback prompt — HTML report rerender only
docs/prd/                    # The product spec (PRD + supporting docs)
evaluator/requirements_catalog_v1.md  # Frozen Step 2 denominator hidden from Step 1
tools/fetch_evaluator.py     # Downloads the public evaluator bundle into evaluator/
results/                     # All outputs land here
CLAUDE.md                    # Auto-loaded instructions for Claude Code
GEMINI.md                    # Auto-loaded instructions for Gemini
AGENTS.md                    # Auto-loaded instructions for other agents
INSTRUCTIONS.md              # Development guidelines (referenced by agent configs)

Result Artifacts

Each completed run should end with these files in results/:

PLAN.md — the implementation plan from Step 1
PLAN_EVAL.md — the human-readable evaluation from Step 2
PLAN_EVAL_DATA.json — the structured evaluation data derived from PLAN_EVAL.md by the control workflow
run_metadata.json — structured run metadata captured by the control workflow, including series/case identity, declared scenario settings, observed timing/tool metrics, and plan-save outcome
PLAN_EVAL_REPORT.html — the stakeholder-facing report normally generated during Step 2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Planning Benchmark

What This Repo Does

The Two-Step Default Workflow

Step 1: Generate the Plan

Step 2: Evaluate the Plan

Optional Fallback

Why Fresh Conversations?

Repo Structure

Result Artifacts

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.vscode		.vscode
docs		docs
tools		tools
.gitignore		.gitignore
1-START_HERE.md		1-START_HERE.md
2-EVALUATE_PLAN.md		2-EVALUATE_PLAN.md
3-PLAN_EVAL_REPORT.md		3-PLAN_EVAL_REPORT.md
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
GEMINI.md		GEMINI.md
INSTRUCTIONS.md		INSTRUCTIONS.md
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Planning Benchmark

What This Repo Does

The Two-Step Default Workflow

Step 1: Generate the Plan

Step 2: Evaluate the Plan

Optional Fallback

Why Fresh Conversations?

Repo Structure

Result Artifacts

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages