DevKit v3: redo around alignment + AFK pipeline (grill-me → to-prd → to-issues → Ralph) #1

@atarazevich

Description

Problem Statement

The current devkit (v2.2) is unused. It ships ceremony — .devkit/tasks/{pending,active,archive}/, structured log trailers (Task:, Decision:, Tried:), three custom agents, an install script — but the things that actually solve the user's pain are missing. Concretely:

  1. Misalignment. When given a vague problem, Claude commits to an implementation that doesn't match what the user wanted. There's no skill in current devkit that walks the design tree before code is written.
  2. No AFK throughput. The user is the bottleneck on every task. There's no outer iterator that picks the next task, runs developer → code-reviewer to completion, commits, and picks the next one without manual re-launch.
  3. Exploration → issues conversion is broken. The Explore phase produces a horizontal map (files, dependencies, integration points) that never converts cleanly into vertical-slice issues a Ralph-style loop can grab. The pipeline stalls before reaching the agent loop.

The user's actual daily setup is ~/.claude/rules/* plus developer and code-reviewer agents in a near-AFK manual loop. Current devkit's .devkit/tasks/, quick, wrap-up, brainstorm, plan, execute, bugfix, design, grepai skills, install.sh, and trailer system are unused.

Solution

Replace the contents of devkit with the alignment + slicing + AFK pipeline derived from Matt Pocock's published workflow (mattpocock/skills, mattpocock/sandcastle, mattpocock/ai-engineer-workshop-2026-project). The new pipeline:

idea (gh issue, label `idea`)
  → /grill-me <issue#>     (interview to shared design concept, one Q at a time with recommended answer)
  → /to-prd <issue#>        (synthesize grilling into PRD as issue body, label → `ready`,
                             includes ## QA Checklist for manual verification)
  → /to-issues <issue#>     (slice PRD into ## Slices checklist appended to issue body, ≤10 slices)
  → ./ralph/afk.sh N        (separate terminal: orchestrator + dev + reviewer per slice,
                             ticks checkboxes, commits Refs #parent, posts "Ready for QA" when done)
  → manual QA               (user ticks ## QA Checklist; files new issues via /qa for findings)
  → user closes parent       (after all QA ticks)

PRD lives as the GitHub issue body (no separate doc — avoids doc-rot). Slices live as a ## Slices checklist inside the parent issue body (no child issues), capped at 10 per PRD; if more would be needed, the PRD is split. Commits link via Refs #parent. The dev → reviewer flow uses subagents (Model 3) inside a claude --print orchestrator, fresh process per iteration to preserve smart-zone (~100k tokens). A bootstrap validation harness with a hermetic gh shim makes the system testable from a clean machine, with persisted scoring for model-drift detection.
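
Ticking a slice amounts to a one-line rewrite of the issue body. A minimal sketch, assuming the checklist line shape given under "Slice format" in this issue (`- [ ] **slice-001** ...`); the function name `tick_slice` is hypothetical and not part of devkit:

```python
import re


def tick_slice(body: str, slice_id: str) -> str:
    """Return the issue body with the given slice's checkbox flipped to [x].

    Raises ValueError when no unticked line exists for slice_id, so a caller
    cannot silently "complete" a slice that is missing or already done.
    """
    pattern = re.compile(
        r"^([ \t]*- \[) (\] \*\*" + re.escape(slice_id) + r"\*\*)",
        flags=re.MULTILINE,
    )
    # Keep everything around the single space inside the brackets.
    new_body, count = pattern.subn(r"\1x\2", body, count=1)
    if count == 0:
        raise ValueError("no unticked checkbox found for " + slice_id)
    return new_body
```

The returned text is what would then be written back to the issue, for example through `gh issue edit --body-file -` as the orchestrator prompt describes.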

One pipeline, two trigger modes: same agents, same skills, same rules, same output — only the trigger differs. Manual = the user, in a chat, advances slice-by-slice. AFK = ./ralph/afk.sh N automates the same loop in a separate terminal. There is no system-level distinction between modes. ~/.claude/rules/development-workflow.md mentions Ralph in one sentence; the orchestrator does not need internal Ralph knowledge.

QA gate: when all AFK slices in a PRD are ticked complete, Ralph does NOT auto-close the parent issue. It posts a Ready for QA comment for that parent and continues picking AFK slices from other open ready issues. The user opens the actual app, runs through the ## QA Checklist (3-5 manual-verification steps authored at PRD time by /to-prd), and ticks each box manually. QA findings (bugs, taste failures, UX issues) become new GitHub issues via /qa skill — they enter Ralph's queue next round. When all ## QA Checklist items are ticked, the user manually closes the parent issue. This separates auto-verifiable acceptance (slice level: tests pass, types check) from human-taste verification (QA level: looks right, feels right, ships).

Decision-recording model: per-feature decisions live inside the PRD-issue itself (the ## Implementation Decisions and ## Out of Scope sections capture chosen branches and rejected alternatives from the grilling session). System-wide invariants live in docs/ai-coding-principles.md (a documentation file, not an orchestrator rule — loaded on demand, not auto-pushed into every session). The existing ~/.claude/rules/development-workflow.md rule's "append to docs/decisions.md" already covers the system-wide architectural-decisions log; no separate docs/DECISIONS.md or ADR directory.

Domain-language model: UBIQUITOUS_LANGUAGE.md only. Matt's stack also has CONTEXT.md (per bounded context) and docs/adr/ (numbered ADR files); for solo single-context use those duplicate UBIQUITOUS_LANGUAGE.md and docs/decisions.md respectively. Dropped from v1 (Occam).

User Stories

  1. As a solo developer, I want to drop a vague brief into a GitHub issue with --label idea, so that I can park it for later without grilling immediately.
  2. As a solo developer, when I'm ready to align, I want to run /grill-me <issue#> in a chat, so that the agent walks the design tree and asks one question at a time with a recommended answer for each.
  3. As a solo developer, I want /to-prd <issue#> to synthesize the grilling conversation into a PRD as the issue body, so that I have a durable destination document without separate files that can rot.
  4. As a solo developer, in a fresh chat, I want /to-issues <issue#> to break the PRD into vertical-slice tracer-bullet tickets appended to the issue body as a ## Slices checklist, so that there's no fan-out into N child issues.
  5. As a solo developer, I want each slice tagged AFK or HITL with Acceptance, Blocked-by, and Covers user stories fields, so that the Ralph loop can deterministically pick the next workable slice.
  6. As a solo developer, I want /to-issues to enforce a soft cap of ≤10 slices per PRD (if more would be needed, the PRD is too big — /to-issues recommends splitting), so that issue bodies stay scannable and PRDs stay focused.
  7. As a solo developer, in a separate terminal, I want to run ./ralph/afk.sh 20 (or /ralph slash command), so that the loop iterates: pick next AFK slice → developer agent (preloaded /tdd) implements → code-reviewer agent reviews → tick checkbox → commit Refs #parent → repeat → exit on <promise>NO MORE TASKS</promise>.
  8. As a solo developer, I want to advance slices manually in a chat (without Ralph), so that I can do the same dev→reviewer cycle interactively when I want oversight. The system has only one pipeline; AFK and manual differ only in who triggers each iteration.
  9. As a solo developer, I want each Ralph iteration to start with a fresh claude --print process, so that the smart-zone stays clean across slices.
  10. As a solo developer, I want the inner claude --print to spawn developer and code-reviewer as subagents (Model 3), so that they reuse my existing agent registry and get isolated contexts.
  11. As a solo developer, I want Ralph to use --dangerously-skip-permissions, so that no permission prompt halts the AFK loop overnight.
  12. As a solo developer, I want the developer agent to be preloaded with the full Matt-bundle tdd skill (deep-modules, interface-design, mocking, refactoring, tests guidance), so that red-green-refactor and vertical-slice TDD happen by default without manual /tdd invocation.
  13. As a solo developer, I want the code-reviewer to receive coding standards + deep-module + interface-design rules + UBIQUITOUS_LANGUAGE.md (if present) pushed in via prompt (per Matt's "push for reviewer, pull for implementer"), so that review judgment is consistent.
  14. As a solo developer, I want the developer agent to fetch UBIQUITOUS_LANGUAGE.md if it exists in the project, so that domain language is respected.
  15. As a solo developer, I want /improve-codebase-architecture available, so that I can periodically deepen shallow modules using the depth/seam/deletion-test vocabulary. (The skill silently proceeds without CONTEXT.md/docs/adr/; v1 doesn't ship those.)
  16. As a solo developer, I want /qa available for interactive durable bug filing into GitHub issues, so that bug reports become Ralph-grabbable slices.
  17. As a solo developer, I want /ubiquitous-language available, so that the project's glossary stays current.
  18. As a solo developer, I want every PRD to include a ## QA Checklist section (3-5 concrete manual-verification steps), authored at PRD time by /to-prd, so that what to verify post-implementation is decided up front, not improvised at QA time.
  19. As a solo developer, I want Ralph to NOT auto-close the parent issue when all slices are ticked, so that manual QA is an explicit gate before close. Ralph instead posts a Ready for QA comment and stops working on that parent; it keeps picking AFK slices from other open ready issues.
  20. As a solo developer, when I run manual QA and find bugs or taste failures, I want to file them as new GitHub issues via /qa skill, so that they become next-round Ralph slices instead of getting lost in chat history.
  21. As a solo developer, I want to manually tick each ## QA Checklist item after running through it in the actual app, so that the issue cannot be closed until human-taste verification is complete. After all QA ticks, I close the issue manually.
  22. As a solo developer working IN devkit-the-project, I want docs/ai-coding-principles.md to capture the constitutional principles (smart zone, push/pull, vertical slices, deep modules, TDD discipline, anti-specs-to-code, grill-me posture, plain-text Q&A, AFK fresh-context, failure-mode awareness), so that the why-it-is-shaped-this-way is a referenceable document. Loaded on demand (not auto-pushed into every Claude Code session). Project-level CLAUDE.md directs agents working in this repo to load it.
  23. As a solo developer using devkit on other projects, I want docs/ai-coding-principles.md to NOT pollute every Claude Code session globally, so that projects with different conventions (e.g., non-TDD ecosystems) aren't constrained by devkit's opinions.
  24. As a solo developer, I want docs/SAFETY.md to explicitly name the v1 trade — --dangerously-skip-permissions without docker — so that the risk profile is documented (rather than being invisible debt) and future-me knows what was accepted.
  25. As a maintainer of devkit-the-project, I want canonical sources in devkit/{rules,agents,skills}/ symlinked into ~/.claude/ via install.sh, so that I edit once and everything stays in sync.
  26. As a maintainer of devkit-the-project, I want install.sh init-project <path> to copy templates/ralph/ into a project directory, so that each project has its own Ralph script with project-specific feedback loops.
  27. As a maintainer of devkit-the-project, I want make validate on a fresh machine to bootstrap a tmpdir CLAUDE_CONFIG_DIR, install devkit into it, create a fixture project, run all scenarios with a hermetic gh shim, and persist scored results, so that I can validate the system from zero without GitHub auth.
  28. As a maintainer of devkit-the-project, I want make validate-live to run the same scenarios against a real ephemeral GitHub repo (created and deleted within the run), so that the gh shim's faithfulness is verified pre-release.
  29. As a maintainer of devkit-the-project, I want validation results persisted as tests/validation/results/<timestamp>-<model-id>.json, so that python tests/validation/drift-report.py can plot pass-rate and LLM-judge scores over time and flag regressions across model versions.
  30. As a maintainer of devkit-the-project, I want bats tests for Ralph's slice picker, checkbox flipping, Blocked-by resolution, QA gate (does NOT close issue with unticked QA items), install idempotency, and frontmatter validity, so that deterministic regressions are caught on every commit.
  31. As a maintainer of devkit-the-project, I want pytest evals (marked @pytest.mark.eval) for every LLM-driven step (grill-me, to-prd, to-issues, developer, code-reviewer), so that drift in model behavior is observable.
  32. As a new user landing on the repo, I want README.md to be a navigation hub with a pipeline diagram (mermaid) and a scenario index ("new feature from vague idea", "improvement to existing feature", "bug report", "quick known fix", "refactor / shallow modules", "domain language drift", "background throughput") that routes me to the right skill or flow, so that I know which entry point to use without reading every doc.
  33. As a new user, I want README.md to include a single ATTRIBUTION section pointing at mattpocock/skills (MIT license) for the skills derived/copied verbatim from Matt Pocock's repo, so that provenance is credited without per-file footer ceremony.
  34. As a new user, I want a TUTORIAL.md that walks one canonical example end-to-end with copy-pastable commands and expected behavior at each step (including the QA workflow — Ralph posts Ready for QA, user runs the checklist, ticks boxes, closes), so that I know what working looks like.
  35. As a new user, I want ARCHITECTURE.md that diagrams the pipeline mechanism, explains smart-zone vs dumb-zone and push vs pull, and shows how skills/agents/rules/Ralph wire together, so that I understand how the system works structurally (separate from why, which is in docs/ai-coding-principles.md).
  36. As a long-term user, I want DRIFT.md that explains how to read validation results and what to do when scores drop, so that I have a playbook for model-update fallout.
  37. As a long-term user, I want TROUBLESHOOTING.md with common failure modes (Ralph picks no slice, developer agent skips TDD, to-issues produces horizontal slices, install.sh symlink conflicts, Ralph closed issue without QA), so that diagnosis doesn't require re-deriving the system.
  38. As a contributor or future-self working IN devkit-the-project, I want a project-level CLAUDE.md that points at docs/ai-coding-principles.md and the agent registry, so that any agent operating on this repo loads the same constraints without having to discover them.
  39. As a maintainer, I want bootstrap work to land on a v3 branch (not main), so that main keeps the v2.2 state until the redo passes its own QA Checklist — making mid-bootstrap rollback a one-line git checkout main operation.
  40. As a maintainer, I want each shipped version tagged with annotated semver tags (v3.0.0, v3.1.0, v3.0.1), so that git checkout v3.0.0 reproduces a known-good state and install.sh link-skills reports what's installed via git describe.
  41. As an operator running AFK Ralph, I want the developer agent to surface non-obvious judgment calls (ambiguous spec, defaulted choices) as a Judgment: line in the commit message, and the code-reviewer to mirror those as a non-blocking slice-NNN judgment call comment on the parent issue, so that decisions made without me are visible at QA time without halting the loop.
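
Stories 7 and 41 imply two machine-readable parts of a slice commit message: a final `Refs #<parent>` line and optional `Judgment:` lines. A minimal sketch of how a reviewer-side helper might read them, assuming one judgment per line; `parse_commit_message` and the returned dictionary shape are hypothetical:

```python
import re
from typing import Dict, List, Optional, Union


def parse_commit_message(message: str) -> Dict[str, Union[Optional[int], List[str]]]:
    """Extract the parent issue number and any Judgment: lines."""
    lines = [line.strip() for line in message.strip().splitlines()]
    judgments = [
        line[len("Judgment:"):].strip()
        for line in lines
        if line.startswith("Judgment:")
    ]
    # The contract says the commit message ends with "Refs #<parent>".
    refs = re.fullmatch(r"Refs #(\d+)", lines[-1]) if lines else None
    parent = int(refs.group(1)) if refs else None
    return {"parent": parent, "judgments": judgments}
```

A `parent` of `None` signals a commit that does not follow the contract, which a reviewer could report as a finding.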

Implementation Decisions

Distribution model: hybrid. Devkit-the-project is the canonical source. Skills, rules, and workflow agents are user-level (mirrored to ~/.claude/skills, ~/.claude/rules, ~/.claude/agents via symlinks). Only ralph/ is per-project (because feedback loops are language-specific, e.g., npm test vs pytest).

Modules to build / update in devkit-the-project:

  • rules/ — canonical source for ~/.claude/rules/.

    • development-workflow.md — rewritten: pipeline is /grill-me → /to-prd → /to-issues → Ralph → manual QA → close. Bypass section retained. Adds two sentences: (1) "After slicing, advance manually one slice at a time in chat, or run ./ralph/afk.sh N from a separate terminal to automate the same loop." (2) "When Ralph posts Ready for QA, run the parent issue's ## QA Checklist manually, tick each box, file follow-ups via /qa, and close the parent." Existing "Architectural decisions log" section (append to docs/decisions.md) kept.
    • github-issues.md — updated: idea/ready labels keep meaning; slice checklist convention with HITL/AFK as inline tags; Refs #N commit linkage; explicit "no child issues per PRD"; ≤10 slices per PRD soft cap; QA gate convention (issue stays open with unticked QA items).
    • coding-preferences.md, safety.md, user-interaction.md, principles.md, identity.md — kept; identity.md gets a minor edit to reference the new pipeline.
  • agents/ — canonical workflow agents.

    • developer.md — frontmatter: skills: [tdd, emil-design-engineering]. Body: per-slice contract — input is slice block + parent PRD body + parent issue number; output is commit ending Refs #<parent> + 2-3 line summary. Self-fetches UBIQUITOUS_LANGUAGE.md if present.
    • code-reviewer.md — body: receives commit hash + slice acceptance; pushes coding preferences, deep-module rules, interface-design rules, UBIQUITOUS_LANGUAGE.md (if present) into prompt; returns findings: [...] (file:line) or clean.
    • design-doc-writer.md — DELETED (replaced by /to-prd).
  • skills/ — canonical for ~/.claude/skills/. v1 ships 7:

    • grill-me/SKILL.md (Matt's prompt verbatim)
    • to-prd/SKILL.md (Matt's template + body verbatim; Process branched for fresh PRD vs maturing existing idea issue; template adds ## QA Checklist section with 3-5 manual-verification checkboxes authored at PRD time)
    • to-issues/SKILL.md (concept from Matt; output is ## Slices checklist via gh issue edit, no child issues; ≤10 slices cap; verifies ## QA Checklist exists in body before slicing — if missing, recommends user adds one)
    • tdd/ (full bundle: SKILL.md + deep-modules.md + interface-design.md + mocking.md + refactoring.md + tests.md — verbatim)
    • improve-codebase-architecture/ (full bundle: SKILL.md + DEEPENING.md + INTERFACE-DESIGN.md + LANGUAGE.md — verbatim; works without CONTEXT.md/docs/adr/)
    • qa/SKILL.md (verbatim)
    • ubiquitous-language/SKILL.md (verbatim)
    • domain-model/ — DROPPED from v1. Deferred to v1.1.
  • templates/ralph/ — copied per-project by install.sh init-project:

    • afk.sh — bash outer loop; per iteration runs claude --print --dangerously-skip-permissions "<orchestrator-prompt>"; exits on <promise>NO MORE TASKS</promise>.
    • once.sh — single-iteration variant for testing.
    • prompt.md — orchestrator prompt: parse open ready issues, find unchecked AFK slices in ## Slices, filter by Blocked-by, sort by priority (bug > infra > tracer-bullet > polish > refactor), spawn developer subagent with slice + PRD + parent #, on commit spawn code-reviewer with commit hash + acceptance, retry developer on findings (max 2), tick checkbox via gh issue edit --body-file -. When all AFK slices in a parent are ticked AND the parent has a ## QA Checklist section: post a Ready for QA — run the QA Checklist manually and close when done comment on the parent issue, then continue picking other parents' slices. Do NOT auto-close. Do NOT tick QA Checklist items. Emit <promise>NO MORE TASKS</promise> when no AFK slices remain across all open ready issues.
  • install.sh — subcommands link-skills (symlink devkit/{rules,agents,skills}/* into ~/.claude/, idempotent, --copy for detached install), init-project [path] (copy templates/ralph/, drop a /ralph slash-command file), unlink (reverse).

  • commands/ — minimal: a /ralph slash command shipped by init-project that bash-execs ralph/afk.sh "$@" from the project root (passes --once through).

  • fixtures/sample-project/ — tiny real git repo committed in-tree; synthetic codebase + sample issues; used by all validation scenarios.

  • tests/ (bats):

    • ralph_pick_next.bats, ralph_tick_checkbox.bats, ralph_no_more_tasks.bats
    • ralph_qa_gate.bats — given fixture issue body with all AFK slices ticked AND a ## QA Checklist with unticked items: Ralph posts Ready for QA comment AND does NOT close the issue. Given a body with all slices ticked AND no QA Checklist: Ralph posts the comment but still does not close (operator closes manually).
    • install_symlink_idempotent.bats, install_copy_mode.bats
    • frontmatter_valid.bats
  • evals/ (pytest, marked @pytest.mark.eval):

    • test_grill_me_opens_with_question_and_recommendation.py
    • test_to_prd_produces_six_sections_plus_qa_checklist.py (verifies all 6 sections + ## QA Checklist)
    • test_to_issues_produces_vertical_slices.py (LLM-as-judge)
    • test_to_issues_caps_at_ten_slices.py
    • test_to_issues_warns_if_no_qa_checklist.py
    • test_developer_uses_red_green_not_horizontal.py
    • test_code_reviewer_catches_internal_mock.py
    • conftest.py (fixtures: tmp project, claude invocation helper)
  • tests/validation/:

    • gh-shim/gh — Python executable on PATH during validation; implements gh issue create/list/view/edit/close/comment --label X --json Y; stores state in $TMPDIR/devkit-validation/issues/*.json.
    • gh-shim/state.py — read/write JSON state.
    • bootstrap.sh — sets CLAUDE_CONFIG_DIR=tmp, prepends gh-shim/ to PATH, runs install.sh link-skills --copy, creates fixture project, asserts symlinks/copies present.
    • run.py — orchestrates scenarios; --live flag toggles between shim and real gh (creates ephemeral devkit-validation-<ts>-<rand> repo under $DEVKIT_VALIDATION_OWNER, deletes on completion or via orphan-GC at next --live start).
    • scenarios/01-add-feature/, scenarios/02-fix-bug/, scenarios/03-shallow-modules/, scenarios/04-domain-language-conflict/, scenarios/05-blocked-by-resolution/, scenarios/06-qa-gate/ — each has brief.md, fixture overlay, expected.yaml. (Scenario 06 specifically asserts QA gate behavior.)
    • results/<timestamp>-<model-id>.json — persisted scores.
    • drift-report.py: python drift-report.py --since YYYY-MM-DD plots pass-rate and LLM-judge mean scores.
    • smoke-e2e.sh — manual full-pipeline smoke including QA gate.
  • README.md — navigation hub. Pipeline diagram (mermaid). Scenario index. 3-line install. Pointer to TUTORIAL.md. Links to ARCHITECTURE.md, DRIFT.md, TROUBLESHOOTING.md, docs/ai-coding-principles.md, docs/SAFETY.md. One ATTRIBUTION section at the bottom: mattpocock/skills (MIT license) credited for verbatim/derived skills. No per-file footers.

  • CLAUDE.md (project-level, in devkit/) — short. Says: "Before working in this repo, load docs/ai-coding-principles.md. Follow the agent registry. Defer to PRINCIPLES on conflicts."

  • docs/:

    • ai-coding-principles.md — constitutional principles (Appendix A content). Doc, not rule. Loaded on demand by user/agents working in devkit-the-project (via project-level CLAUDE.md). NOT auto-pushed into every Claude Code session.
    • SAFETY.md — names the v1 risk profile explicitly: --dangerously-skip-permissions without docker, single-developer trust, git as safety net, what changes in v2 (Sandcastle + docker).
    • TUTORIAL.md — copy-pastable canonical example end-to-end including QA workflow section.
    • ARCHITECTURE.md — pipeline diagram, smart/dumb zone, push vs pull, wiring.
    • DRIFT.md — running validation, reading results, response playbook.
    • TROUBLESHOOTING.md — common failure modes including QA-gate confusion.
    • skills/<name>.md — one-screen page per skill.
  • Makefile: targets validate, validate-live, test (bats only), eval (pytest evals), lint.

Killed from current devkit: commands/devkit/, current skills/ contents, agents/design-doc-writer.md, .devkit/tasks/, .devkit/knowledge/, trailer convention, hooks/, logs/, legacy docs/, old templates/. (hooks/ and grepai disposition pending explicit confirmation; default kill.)

Smart-zone protection: Ralph runs claude --print once per iteration → fresh process → fresh context. The inner Claude is the orchestrator and spawns developer and code-reviewer as subagents.
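
The outer loop's control flow is small enough to sketch. The PRD specifies `afk.sh` as bash; the Python below only mirrors the same logic for illustration. It assumes the orchestrator's output is available on stdout, and the names `afk_loop` and `run_claude` are hypothetical:

```python
import subprocess
from typing import Callable

SENTINEL = "<promise>NO MORE TASKS</promise>"


def run_claude(prompt: str) -> str:
    """Start one fresh orchestrator process and return its stdout."""
    result = subprocess.run(
        ["claude", "--print", "--dangerously-skip-permissions", prompt],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def afk_loop(prompt: str, max_iterations: int,
             run: Callable[[str], str] = run_claude) -> int:
    """Run up to max_iterations fresh processes; return how many were run."""
    for iteration in range(1, max_iterations + 1):
        output = run(prompt)  # new process, so no context carries over
        if SENTINEL in output:
            return iteration
    return max_iterations
```

Passing `run` as a parameter lets the loop be tested with a fake runner, without invoking `claude`.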

Permission posture v1: --dangerously-skip-permissions on the inner claude --print. No docker. Documented in docs/SAFETY.md. v2 with Sandcastle + docker is the upgrade path.

Quick-drop: gh issue create --label idea --title ... --body "<brief>".

Slice format inside parent issue body (≤10 slices):

## Slices

- [ ] **slice-001** — <title> — AFK
  - Acceptance: <verifiable behavior>
  - Blocked by: none
  - Covers user stories: 1, 2, 3
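
A minimal sketch of parsing this format into records, for illustration only. It assumes the header separator is an em dash (written as `\u2014` in the code), that field lines are indented, and that parenthetical notes in the Covers field should be ignored; `Slice` and `parse_slices` are hypothetical names:

```python
import re
from dataclasses import dataclass, field
from typing import List

EM_DASH = "\u2014"  # separator used in the slice header line

# Header line: checkbox, bold slice id, title, then AFK or HITL.
HEADER = re.compile(
    r"^- \[(?P<mark>[ xX])\] \*\*(?P<id>slice-\d+)\*\* "
    + EM_DASH + r" (?P<title>.+) " + EM_DASH
    + r" (?P<mode>AFK|HITL)\s*$"
)
# Indented field line, e.g. "  - Blocked by: none".
FIELD = re.compile(r"^\s+- (?P<key>[^:]+):\s*(?P<value>.*)$")


@dataclass
class Slice:
    id: str
    title: str
    mode: str  # "AFK" or "HITL"
    done: bool
    acceptance: str = ""
    blocked_by: List[str] = field(default_factory=list)
    stories: List[int] = field(default_factory=list)


def parse_slices(body: str) -> List[Slice]:
    """Parse the slice checklist section of an issue body."""
    slices: List[Slice] = []
    in_section = False
    for line in body.splitlines():
        if line.startswith("## "):
            in_section = line.strip() == "## Slices"
            continue
        if not in_section:
            continue
        header = HEADER.match(line)
        if header:
            slices.append(Slice(header["id"], header["title"],
                                header["mode"], header["mark"] != " "))
            continue
        fld = FIELD.match(line)
        if fld and slices:
            key, value = fld["key"].strip().lower(), fld["value"].strip()
            current = slices[-1]
            if key == "acceptance":
                current.acceptance = value
            elif key == "blocked by":
                current.blocked_by = re.findall(r"slice-\d+", value)
            elif key == "covers user stories":
                plain = re.sub(r"\([^)]*\)", "", value)  # drop notes
                current.stories = [int(n) for n in re.findall(r"\d+", plain)]
    return slices
```

"Blocked by: none" yields an empty list, which is the case a picker treats as immediately workable.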

QA Checklist format inside parent issue body (3-5 items, authored by /to-prd):

## QA Checklist

- [ ] <concrete user-flow verification>
- [ ] <edge case to manually exercise>
- [ ] <visual/UX taste check>
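
The gate logic over these two checklists can be sketched as a pure function. This is an illustration, not devkit's implementation: it follows the `ralph_qa_gate.bats` description (the comment is posted whether or not a QA Checklist exists), and the names `checkboxes`, `gate_action`, and `qa_complete` plus the returned strings are hypothetical:

```python
import re
from typing import List, Tuple

CHECKBOX = re.compile(r"^- \[(?P<mark>[ xX])\] (?P<text>.*)$")


def checkboxes(body: str, heading: str) -> List[Tuple[bool, str]]:
    """Return (ticked, text) for each top-level checkbox under '## <heading>'."""
    items = []
    in_section = False
    for line in body.splitlines():
        if line.startswith("## "):
            in_section = line.strip() == "## " + heading
            continue
        match = CHECKBOX.match(line) if in_section else None
        if match:
            items.append((match["mark"] != " ", match["text"]))
    return items


def gate_action(body: str) -> str:
    """Decide Ralph's next step for one parent. There is no 'close' outcome."""
    afk = [ticked for ticked, text in checkboxes(body, "Slices")
           if text.rstrip().endswith("AFK")]
    if not afk:
        return "nothing-to-do"
    if not all(afk):
        return "keep-working"
    return "post-ready-for-qa"


def qa_complete(body: str) -> bool:
    """True only when a QA Checklist exists and every item has been ticked."""
    qa = [ticked for ticked, _ in checkboxes(body, "QA Checklist")]
    return bool(qa) and all(qa)
```

`qa_complete` is a check the user (or a test) could run before closing; Ralph itself never ticks QA items or closes the parent.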

Ralph priority order: critical bug > infrastructure > tracer-bullet > polish/quick wins > refactor.
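
A sketch of how the picker might combine this order with the AFK and Blocked-by filters. The slice format above carries no category field, so the `kind` key is an assumption made for illustration, as are the dictionary shape and the name `pick_next`:

```python
from typing import Dict, List, Optional

# Short names as used in the orchestrator prompt description.
PRIORITY = ["bug", "infra", "tracer-bullet", "polish", "refactor"]


def rank(kind: str) -> int:
    """Lower rank is picked first; unknown kinds sort last."""
    return PRIORITY.index(kind) if kind in PRIORITY else len(PRIORITY)


def pick_next(slices: List[Dict]) -> Optional[Dict]:
    """Pick the next workable slice, or None when nothing is workable.

    Workable means: unticked, tagged AFK, and every blocker already ticked.
    """
    done = {s["id"] for s in slices if s["done"]}
    candidates = [
        s for s in slices
        if not s["done"]
        and s["mode"] == "AFK"
        and all(blocker in done for blocker in s["blocked_by"])
    ]
    if not candidates:
        return None
    # min() returns the first minimum, so ties keep checklist order.
    return min(candidates, key=lambda s: rank(s["kind"]))
```

A `None` result across all open ready issues corresponds to the point where the orchestrator emits `<promise>NO MORE TASKS</promise>`.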

Retry policy: max 2 retries of developer on code-reviewer findings before flagging the slice as HITL-needed.
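
The retry policy reads as one initial attempt plus at most two retries. A minimal sketch with the two agents injected as callables; `run_slice`, the callable signatures, and the returned strings are hypothetical:

```python
from typing import Callable, List

MAX_RETRIES = 2


def run_slice(develop: Callable[[List[str]], str],
              review: Callable[[str], List[str]]) -> str:
    """Drive developer and reviewer for one slice.

    develop(findings) returns a commit hash; review(commit) returns a list
    of findings, where an empty list means the commit is clean.
    """
    findings: List[str] = []
    for _attempt in range(1 + MAX_RETRIES):  # first attempt + 2 retries
        commit = develop(findings)  # earlier findings are fed back in
        findings = review(commit)
        if not findings:
            return "clean"
    return "hitl-needed"
```

On `"hitl-needed"` the slice would be flagged for a human instead of being ticked.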

Validation modes: hermetic (default) and --live (pre-release).
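
In hermetic mode the shim keeps issues as JSON files. A rough sketch of what a `state.py`-style store could look like, assuming one file per issue named `<number>.json` and the record fields shown; none of these names or fields are confirmed by the PRD:

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def _issue_path(state_dir: Path, number: int) -> Path:
    return state_dir / (str(number) + ".json")


def create_issue(state_dir: Path, title: str, body: str,
                 labels: List[str]) -> int:
    """Store a new open issue and return its sequential number."""
    state_dir.mkdir(parents=True, exist_ok=True)
    existing = [int(p.stem) for p in state_dir.glob("*.json")]
    number = max(existing, default=0) + 1
    issue = {"number": number, "title": title, "body": body,
             "labels": sorted(labels), "state": "open", "comments": []}
    _issue_path(state_dir, number).write_text(json.dumps(issue, indent=2))
    return number


def view_issue(state_dir: Path, number: int) -> Dict:
    return json.loads(_issue_path(state_dir, number).read_text())


def edit_issue(state_dir: Path, number: int, **changes) -> Dict:
    """Overwrite the given fields (e.g. body=..., labels=[...])."""
    issue = view_issue(state_dir, number)
    issue.update(changes)
    _issue_path(state_dir, number).write_text(json.dumps(issue, indent=2))
    return issue


def list_issues(state_dir: Path, label: Optional[str] = None) -> List[Dict]:
    """Return open issues in number order, optionally filtered by label."""
    paths = sorted(state_dir.glob("*.json"), key=lambda p: int(p.stem))
    issues = [json.loads(p.read_text()) for p in paths]
    return [i for i in issues
            if i["state"] == "open" and (label is None or label in i["labels"])]
```

The `gh` executable in the shim would translate command-line arguments into calls like these; that argument parsing is omitted here.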

Decision-recording model: per-feature in PRD-issue body; system-wide invariants in docs/ai-coding-principles.md (doc, on-demand); architectural-decisions log via existing docs/decisions.md rule.

Domain-language model: UBIQUITOUS_LANGUAGE.md only.

Attribution model: single ## Attribution section in README.md. LICENSE-UPSTREAM file at root contains Matt's MIT text.

One pipeline, two trigger modes: same agents/skills/rules. Manual = chat; AFK = ./ralph/afk.sh N.

QA gate: Ralph stops work on a parent at Ready for QA and moves on to other parents' slices. User ticks ## QA Checklist manually. User closes the parent issue manually. Findings → /qa → new issues.

Branch and version strategy:

  • Bootstrap: 10 slices land on a v3 feature branch (slice-001 creates it from main). Single commit per slice. main stays at v2.2 until slice-010 + v1 QA pass — restorable in one line.
  • Ship: after slice-010 + QA, git merge --no-ff v3 into main, tag annotated v3.0.0, push origin (branch + tag).
  • Future versions: semver with annotated git tags. Major = breaking skill-prompt or agent-input-contract change. Minor = new skill or new scenario (e.g., /prototype ships v3.1.0). Patch = fix.
  • Ralph runtime: commits to current branch. No per-slice branches in v1. v2 (Sandcastle) introduces worktree + temp-branch + merge.
  • Upgrade path: git pull && ./install.sh link-skills. Symlinks auto-reflect new content (they point at paths inside the repo).
  • Version observability: install.sh link-skills prints current devkit git describe (tag or SHA) on exit so user knows what's installed.
  • No CHANGELOG file. Git log + annotated tag messages ARE the history. Solo scale doesn't earn CHANGELOG ceremony.
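
The semver rule above maps a change type to a tag bump. A small sketch, assuming tags of the form `vMAJOR.MINOR.PATCH`; the function name and the change labels `"breaking"`, `"feature"`, and `"fix"` are hypothetical:

```python
def next_version(current: str, change: str) -> str:
    """Return the next tag for a change of the given type."""
    major, minor, patch = (int(part) for part in current.lstrip("v").split("."))
    if change == "breaking":  # skill-prompt or agent-input-contract change
        return "v%d.0.0" % (major + 1)
    if change == "feature":   # new skill or new scenario
        return "v%d.%d.0" % (major, minor + 1)
    if change == "fix":
        return "v%d.%d.%d" % (major, minor, patch + 1)
    raise ValueError("unknown change type: " + change)
```

For example, `next_version("v3.0.0", "feature")` gives `v3.1.0`, matching the /prototype case named above.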

Testing Decisions

What makes a good test: behavior through public interface. For LLM evals: regex on structural output OR LLM-as-judge with structured JSON for non-regex-able properties.
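
As an example of the regex-on-structure style, the /to-prd output can be checked without an LLM judge. A sketch, assuming every PRD section is a level-2 heading (the PRD confirms this only for some of them) and using the six section names from the QA Checklist below; the function names are hypothetical:

```python
import re
from typing import List

REQUIRED = [
    "Problem Statement", "Solution", "User Stories",
    "Implementation Decisions", "Testing Decisions", "Out of Scope",
    "QA Checklist",
]


def missing_sections(body: str) -> List[str]:
    """Return required section names that have no '## <name>' heading."""
    headings = {m.group(1).strip()
                for m in re.finditer(r"^##\s+(.+)$", body, re.MULTILINE)}
    return [name for name in REQUIRED if name not in headings]


def qa_item_count(body: str) -> int:
    """Count checkbox lines under the QA Checklist heading."""
    count = 0
    in_qa = False
    for line in body.splitlines():
        if line.startswith("## "):
            in_qa = line.strip() == "## QA Checklist"
        elif in_qa and re.match(r"- \[[ xX]\] ", line):
            count += 1
    return count


def prd_shape_ok(body: str) -> bool:
    """All sections present and 3 to 5 QA items."""
    return not missing_sections(body) and 3 <= qa_item_count(body) <= 5
```

Properties such as "slices are vertical" have no such structural signature, which is where the LLM-as-judge path applies.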

Modules tested:

Deterministic (bats, every commit):

  • Ralph slice picker, checkbox flip, NO MORE TASKS emission, QA gate (no auto-close, posts Ready for QA comment).
  • install.sh link-skills idempotency, --copy mode, init-project.
  • Frontmatter validity for every skill and agent.

LLM-driven (pytest evals, opt-in/nightly):

  • grill-me opens with question + recommendation.
  • to-prd produces all six sections + ## QA Checklist.
  • to-issues produces vertical slices (LLM-judge), caps at ≤10, warns when no QA Checklist exists.
  • developer uses red-green not horizontal.
  • code-reviewer catches seeded antipatterns.

End-to-end smoke: tests/validation/smoke-e2e.sh runs bootstrap → grill → prd → issues → ralph, then asserts that the Ready for QA comment was posted and the parent issue is still open. The user then ticks QA and closes manually.

Validation harness: make validate + make validate-live + drift-report.py.
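
The regression-flagging half of the drift report can be sketched without any plotting. The result-file schema is not specified in this PRD, so the `{"scenarios": [{"passed": ...}]}` shape, the 0.1 tolerance, and both function names are assumptions:

```python
from typing import Dict, List, Tuple


def pass_rate(result: Dict) -> float:
    """Fraction of scenarios that passed in one results file."""
    scenarios = result["scenarios"]
    if not scenarios:
        return 0.0
    return sum(1 for s in scenarios if s["passed"]) / len(scenarios)


def regressions(runs: List[Tuple[str, float]],
                tolerance: float = 0.1) -> List[str]:
    """Return labels of runs whose pass rate dropped by more than
    `tolerance` compared with the run immediately before them.

    `runs` is ordered oldest first; each entry is (label, pass_rate).
    """
    flagged = []
    for (_, previous), (label, rate) in zip(runs, runs[1:]):
        if previous - rate > tolerance:
            flagged.append(label)
    return flagged
```

Labels here would typically be the `<timestamp>-<model-id>` part of each results filename, so a flagged label points at the model version where scores fell.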

Prior art: Matt Pocock's mattpocock/course-video-manager uses evalite (TypeScript). We use pytest because Python is preferred.

Cadence: bats every commit; pytest evals manual + nightly; make validate pre-release.

Cost guardrails: evals use claude-haiku-4-5 for LLM-judge; full target model for grilling/PRD/issues/dev. Per-scenario ≤ $0.50; full suite ≤ $5/run.

Out of Scope

  • Sandcastle parallel orchestration — v2.
  • Docker sandboxing — v2.
  • domain-model skill — v1.1.
  • /prototype skill — Matt-inspired throwaway-prototype-route generator for frontend taste decisions. Useful for "what should this UI look like" branches; out of v1 because frontend work isn't core to the alignment+AFK gap. Defer to v1.1.
  • CONTEXT.md / CONTEXT-MAP.md / docs/adr/ — solo duplicates of UBIQUITOUS_LANGUAGE.md + docs/decisions.md.
  • GitLab / Gitea / non-GitHub trackers — v1 is GitHub-only.
  • Local-files-only mode — divergent code path, rejected.
  • Migration tooling for current devkit's .devkit/tasks/.
  • IDE plugin / web UI.
  • Multi-repo Ralph orchestration.
  • Auto-promotion of idea → ready: manual.
  • design-an-interface, request-refactor-plan, zoom-out skills — v1.1.
  • Auto-tick of QA Checklist items — explicitly rejected. QA is human taste; the system MUST require manual ticks.
  • Auto-close on all-slices-ticked — explicitly rejected. QA gate is the whole point.
  • Separate docs/PRINCIPLES.md or docs/DECISIONS.md — superseded by docs/ai-coding-principles.md (doc) and PRD-issue bodies / existing docs/decisions.md rule.
  • ai-coding-principles.md as auto-loaded rule — must be a doc, not pollute every session globally.
  • Per-file MIT footers — single ATTRIBUTION block in README.

QA Checklist

Manual verification after v1 ships (run on a fresh machine where possible):

  • Fresh-machine bootstrap: clone devkit on a clean machine without gh auth; run make validate; all scenarios pass green; results JSON written to tests/validation/results/.
  • install.sh link-skills idempotency: run twice against ~/.claude/; second run produces no errors; symlinks for skills/, agents/, rules/ all present and pointing at devkit/.
  • install.sh init-project in scratch project: ralph/{afk.sh, once.sh, prompt.md} copied; commands/ralph slash command exists; running /ralph --once from that project starts a Ralph iteration.
  • /grill-me first response shape: file a vague brief as a GitHub issue labelled idea; in a fresh chat, run /grill-me <issue#>; verify the first response is exactly ONE question with a recommended answer (not a plan, not multiple questions).
  • /to-prd synthesis: after a few grilling exchanges, run /to-prd <issue#>; verify body has all six sections (Problem Statement / Solution / User Stories / Implementation Decisions / Testing Decisions / Out of Scope) AND a populated ## QA Checklist (3-5 items); label flipped from idea to ready.
  • /to-issues slicing: in a fresh chat, run /to-issues <issue#>; verify body now has ## Slices with ≤10 items, each with Acceptance / Blocked-by / Covers user stories. If the test PRD is artificially long, verify slicer recommends splitting.
  • Ralph one-shot: run ./ralph/once.sh; verify it picks the first AFK slice, developer subagent runs with /tdd, code-reviewer runs after, slice ticks [ ]→[x], commit message ends with Refs #<parent>.
  • QA gate enforcement: complete all AFK slices in a test PRD via Ralph; run Ralph again; verify Ralph posts a Ready for QA comment AND does NOT close the issue. The ## QA Checklist items remain unticked.
  • Manual QA closes: tick ## QA Checklist items by hand; close issue manually; verify GitHub records the close event correctly.
  • /qa skill files follow-ups: report a manufactured bug via /qa in a chat; verify a new GitHub issue is created with proper format (What happened / Expected / Steps to reproduce); verify Ralph picks it up next round.
  • Judgment-call surfacing: review the parent issue for any slice-NNN judgment call comments; for each, decide whether the defaulted choice was correct. If wrong, file a corrective issue via /qa. Confirm code-reviewer surfaced ambiguities to the issue rather than silently letting them pass.
  • Drift report: run python tests/validation/drift-report.py --since 2026-01-01; verify it produces output without errors (chart or empty-state message).
  • TUTORIAL.md cold-read: read end-to-end as if first time; follow every step; verify everything described actually works on the live system.
  • docs/SAFETY.md exists and accurately names the v1 trade (no docker, --dangerously-skip-permissions, git as safety net).
  • docs/ai-coding-principles.md is NOT a rule: confirm it's NOT in ~/.claude/rules/; confirm a fresh Claude Code session in a non-devkit project does NOT load it; confirm a session in devkit-the-project DOES (via project-level CLAUDE.md).
  • v3.0.0 shipped: git tag -l 'v3.*' shows v3.0.0; git log --oneline main includes the merge commit at v3.0.0; git push origin main v3.0.0 succeeded; ./install.sh link-skills prints Linked devkit @ v3.0.0 (or equivalent git describe).
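
The /to-prd check in the list above (six sections plus a ## QA Checklist of 3-5 items) can be approximated mechanically before the manual pass. This is a minimal sketch, not the skill's actual validator: it assumes the issue body is Markdown where each section is a `## ` heading and each checklist item is a `- [ ]` or `- [x]` line, and the function name check_prd_body is hypothetical.

```python
import re

# Section names taken from the QA checklist item for /to-prd.
REQUIRED_SECTIONS = [
    "Problem Statement",
    "Solution",
    "User Stories",
    "Implementation Decisions",
    "Testing Decisions",
    "Out of Scope",
]


def check_prd_body(body: str) -> list[str]:
    """Return a list of problems; an empty list means the body passes."""
    problems = []
    # Assumption: every section is a level-2 Markdown heading.
    headings = re.findall(r"^## +(.+?)[ \t]*$", body, flags=re.MULTILINE)
    for name in REQUIRED_SECTIONS:
        if name not in headings:
            problems.append(f"missing section: {name}")
    if "QA Checklist" not in headings:
        problems.append("missing section: QA Checklist")
        return problems
    # Text between "## QA Checklist" and the next level-2 heading (or the end).
    section = re.search(
        r"^## +QA Checklist[ \t]*$(.*?)(?=^## |\Z)",
        body,
        flags=re.MULTILINE | re.DOTALL,
    ).group(1)
    items = re.findall(r"^[ \t]*- \[[ xX]\] ", section, flags=re.MULTILINE)
    if not 3 <= len(items) <= 5:
        problems.append(f"QA Checklist has {len(items)} items, expected 3-5")
    return problems
```

A check like this only covers structure; whether the checklist items are concrete enough to verify by hand still needs a human read.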

Slices

(Hand-populated for this bootstrap PRD because the /to-issues skill doesn't exist yet; it is built in slice-003. Future PRDs use the slicer skill once it exists.)

  • slice-001 — Cleanup + skeleton — AFK

    • Acceptance: First action: git checkout -b v3 from main. All subsequent bootstrap work (slices 001-010) lands on v3. Then: old content deleted (commands/devkit/, skills/{brainstorm,plan,execute,quick,design,bugfix,wrap-up,grepai}, agents/design-doc-writer.md, .devkit/tasks/, .devkit/knowledge/, hooks/, logs/, legacy docs/, old templates/). New empty directories created: skills/, agents/, rules/, templates/ralph/, docs/, tests/, evals/, fixtures/, tests/validation/, commands/. Kept: LICENSE, README.md (will be rewritten in slice-010), this PRD's commit history. Single git commit on v3 titled slice-001: cleanup + skeleton.
    • Blocked by: none
    • Covers user stories: (no stories — pure tree-shaping; symlinking is slice-006's job)
  • slice-002 — Verbatim skill imports + LICENSE-UPSTREAM — AFK

    • Acceptance: skills/grill-me/SKILL.md copied verbatim from mattpocock/skills. skills/tdd/{SKILL,deep-modules,interface-design,mocking,refactoring,tests}.md copied verbatim. skills/qa/SKILL.md copied verbatim. skills/ubiquitous-language/SKILL.md copied verbatim. skills/improve-codebase-architecture/{SKILL,DEEPENING,INTERFACE-DESIGN,LANGUAGE}.md copied verbatim. Total 13 files. LICENSE-UPSTREAM at repo root with Matt's MIT text. All frontmatter parses (run frontmatter_valid.bats if it already exists, else manual).
    • Blocked by: slice-001
    • Covers user stories: 2, 12, 13, 14, 15, 16, 17, 20 (the /qa skill itself, which enables QA findings → new issues), 33 (LICENSE-UPSTREAM file half; README ATTRIBUTION half is slice-010)
  • slice-003 — to-prd, to-issues, ralph templates — AFK

    • Acceptance: skills/to-prd/SKILL.md authored: Matt's body + Process branched for fresh-PRD vs maturing-idea-issue + template includes ## QA Checklist section. skills/to-issues/SKILL.md authored: ≤10 slice cap, ## Slices checklist via gh issue edit, no child issues, verifies ## QA Checklist exists in body before slicing. templates/ralph/{afk.sh,once.sh,prompt.md} authored: claude --print --dangerously-skip-permissions, <promise>NO MORE TASKS</promise> sentinel, QA gate (post Ready for QA comment, do NOT close). Frontmatter valid on both skills.
    • Blocked by: slice-001
    • Covers user stories: 3, 4, 5 (slices tagged AFK/HITL with required fields — /to-issues produces this format), 6, 7, 9, 10, 11, 18 (## QA Checklist template in /to-prd), 19 (Ralph no-auto-close behavior), 21 (Ralph does NOT tick QA boxes — leaves them for human ticking)
  • slice-004 — Workflow agents rewrite — AFK

    • Acceptance: agents/developer.md frontmatter has skills: [tdd, emil-design-engineering]; body specifies slice + parent PRD body + parent issue # input; self-fetches UBIQUITOUS_LANGUAGE.md if present; commits with Refs #<parent>; when developer makes a non-obvious judgment call (ambiguous spec, defaulted choice, assumption), commit message body includes a Judgment: <one-line rationale> line so downstream review can detect it. agents/code-reviewer.md body pushes coding preferences + deep-module rules + interface-design rules + UBIQUITOUS_LANGUAGE.md (if present) inline; accepts commit hash + slice acceptance; returns findings: [...] or clean; when commit message contains a Judgment: line, code-reviewer additionally posts a comment on the parent issue formatted as: **slice-NNN judgment call** — Ambiguity: <what>. Defaulted to: <choice>. Why: <rationale>. Review at QA — file new issue if wrong. This is non-blocking — loop continues; comment is for human-QA visibility. Both have valid frontmatter (manual YAML check or via existing tooling — frontmatter_valid.bats doesn't exist yet; will re-verify once slice-007 lands).
    • Blocked by: slice-002 (needs tdd skill referenced in frontmatter)
    • Covers user stories: 12, 13, 14, 41 (judgment-call surfacing)
  • slice-005 — Rules updates — AFK

    • Acceptance: rules/development-workflow.md rewritten: pipeline section is /grill-me → /to-prd → /to-issues → Ralph → manual QA → close; bypass section retained; QA gate sentence added; existing docs/decisions.md log rule kept. rules/github-issues.md updated: slice convention + ≤10 cap + no-child-issues + QA gate convention. rules/identity.md updated to reference new pipeline. Other rule files (coding-preferences.md, safety.md, user-interaction.md, principles.md) unchanged.
    • Blocked by: slice-003 (rules reference skills that must exist)
    • Covers user stories: 8
  • slice-006 — install.sh + Makefile + project CLAUDE.md — AFK

    • Acceptance: install.sh has subcommands link-skills (idempotent symlink, --copy mode), init-project [path] (copies templates/ralph/, drops commands/ralph slash command), unlink. link-skills prints current devkit git describe on exit (e.g., Linked devkit @ v3.0.0 or @ <sha>). Makefile has targets validate, validate-live, test, eval, lint. devkit/CLAUDE.md exists, short, points at docs/ai-coding-principles.md and the agent registry. After running ./install.sh link-skills, ~/.claude/skills/grill-me/SKILL.md resolves to devkit/skills/grill-me/SKILL.md.
    • Blocked by: slice-005
    • Covers user stories: 25, 26, 38
  • slice-007 — bats tests + frontmatter check + fixture project — AFK

    • Acceptance: tests/{ralph_pick_next,ralph_tick_checkbox,ralph_no_more_tasks,ralph_qa_gate,install_symlink_idempotent,install_copy_mode,frontmatter_valid}.bats authored and all pass. fixtures/sample-project/ exists as a tiny real git repo committed in-tree (synthetic codebase + sample issues for use by validation scenarios). make test passes.
    • Blocked by: slice-006
    • Covers user stories: 30
  • slice-008 — gh shim + bootstrap.sh + scenarios — AFK

    • Acceptance: tests/validation/gh-shim/{gh,state.py} faithfully implements gh issue create/list/view/edit/close/comment --label X --json Y, state in $TMPDIR/devkit-validation/issues/*.json. tests/validation/bootstrap.sh sets CLAUDE_CONFIG_DIR=tmp, prepends shim to PATH, runs install.sh link-skills --copy, creates a fresh fixture project. tests/validation/scenarios/{01-add-feature,02-fix-bug,03-shallow-modules,04-domain-language-conflict,05-blocked-by-resolution,06-qa-gate}/ each have brief.md + fixture overlay + expected.yaml (structural assertions + LLM-judge keys).
    • Blocked by: slice-007
    • Covers user stories: 27, 28
  • slice-009 — pytest evals + run.py + drift-report + smoke E2E — AFK

    • Acceptance: evals/{test_grill_me_opens_with_question_and_recommendation,test_to_prd_produces_six_sections_plus_qa_checklist,test_to_issues_produces_vertical_slices,test_to_issues_caps_at_ten_slices,test_to_issues_warns_if_no_qa_checklist,test_developer_uses_red_green_not_horizontal,test_code_reviewer_catches_internal_mock,conftest}.py authored, all marked @pytest.mark.eval. tests/validation/run.py orchestrates scenarios, --live flag toggles real-gh mode, ephemeral repo created/deleted (with orphan-GC). tests/validation/drift-report.py plots pass-rate and LLM-judge mean scores from results/. tests/smoke-e2e.sh exercises full pipeline including QA gate. make validate succeeds end-to-end against fixtures/sample-project/.
    • Blocked by: slice-008
    • Covers user stories: 27, 28, 29, 31
  • slice-010 — All docs + README + ATTRIBUTION — HITL

    • Acceptance: docs/ai-coding-principles.md authored from this PRD's Appendix A (verbatim transfer). docs/SAFETY.md names v1 risk profile explicitly. docs/ARCHITECTURE.md has pipeline diagram (mermaid), smart/dumb zone, push/pull, wiring, distinct from the why. docs/TUTORIAL.md walks one canonical example end-to-end including QA workflow section (Ralph posts Ready for QA → user ticks → user closes). docs/DRIFT.md explains validation reading + model-update playbook. docs/TROUBLESHOOTING.md covers common failure modes including QA-gate confusion. docs/skills/<name>.md has one-screen page per v1 skill, 7 files: grill-me, to-prd, to-issues, tdd, improve-codebase-architecture, qa, ubiquitous-language. (No domain-model page; it's v1.1. When added, ship its doc page alongside.) README.md is navigation hub: 3-sentence what/why, mermaid pipeline diagram, scenario index, 3-line install, links to all docs, single ATTRIBUTION section crediting mattpocock/skills (MIT). HITL: user reviews TUTORIAL by following it end-to-end on a real project. After v1 QA Checklist passes (manual ticks on this issue): git checkout main && git merge --no-ff v3 (commit message: v3.0.0 — alignment + AFK pipeline), then git tag -a v3.0.0 -m "DevKit v3.0.0 — see issue #1", then git push origin main v3.0.0. Verify git describe --tags on main returns v3.0.0. Close issue #1 manually after confirming all QA boxes ticked AND tag pushed.
    • Blocked by: slice-009
    • Covers user stories: 22, 23, 24, 32, 33, 34, 35, 36, 37, 38

Further Notes

  • Bootstrap caveat: this PRD's ## Slices section was hand-populated, not produced by /to-issues, because /to-issues itself is what slice-003 creates. Future PRDs run through the real slicer.
  • Bootstrap operational order: slices 001-006 are hand-driven (use existing ~/.claude/agents/{developer,code-reviewer} agents in their current state; they get rewritten in slice-004 mid-stream). After slice-006, run ./install.sh init-project . against devkit-the-project itself so the /ralph slash command and templates/ralph/ are wired locally. From there, AFK Ralph can drive slices 007-009 (./ralph/afk.sh against this same issue, #1). Slice-010 is HITL by design (TUTORIAL needs human eyes).
  • Source repos for skill content: mattpocock/skills and mattpocock/ai-engineer-workshop-2026-project, cloned to /tmp/mattpocock-skills/{repo,workshop}/ for reference. If those tmp paths are gone in a future session, git clone fresh.
  • Sandcastle (mattpocock/sandcastle) is the v2 parallelization upgrade target.
  • Grilling discipline (one question at a time, recommended answer per question, walk the design tree) preserved verbatim in grill-me skill — this PRD is the synthesis of a 12-question grill-me session plus a final QA-workflow alignment round.
  • Hooks/grepai disposition: pending explicit confirmation. Default action: kill in slice-001. Speak up before slice-001 ships if either should survive.
  • /ralph slash command supports --once flag passthrough to ralph/once.sh. Trivial.
  • Inspiration sources: "Software Fundamentals Matter More Than Ever" — Matt Pocock, "Essential Skills for AI Coding from Planning to Production" — Matt Pocock workshop.
  • Reuse posture summary:
    • Verbatim (drop-in copy): grill-me, all 6 tdd/ files, all 4 improve-codebase-architecture/ files, qa, ubiquitous-language = 13 files
    • Mix (Matt's body, our process): to-prd/SKILL.md = 1 file
    • New (concept from Matt, our output): to-issues/SKILL.md, ralph/prompt.md = 2 files
    • Adapted scripts: ralph/afk.sh, ralph/once.sh = 2 files
    • Rewrite existing: developer.md, code-reviewer.md, development-workflow.md = 3 files
    • Update existing: github-issues.md, identity.md = 2 files
    • Kept: coding-preferences.md, safety.md, user-interaction.md, principles.md = 4 files
    • New (devkit-original): docs/ai-coding-principles.md, docs/SAFETY.md, all tests/evals/validation/install/Makefile/README/CLAUDE.md/scenario fixtures, templates/ralph/ slash command, LICENSE-UPSTREAM
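
The "13 files" figure for verbatim imports can be cross-checked by expanding the brace lists from the slice-002 acceptance text. The snippet below is only an illustrative sketch: the VERBATIM table restates those file stems, and the helper names verbatim_paths and missing_files are hypothetical.

```python
from pathlib import Path

# File stems per skill, restated from the slice-002 acceptance criteria.
VERBATIM = {
    "grill-me": ["SKILL"],
    "tdd": ["SKILL", "deep-modules", "interface-design", "mocking", "refactoring", "tests"],
    "qa": ["SKILL"],
    "ubiquitous-language": ["SKILL"],
    "improve-codebase-architecture": ["SKILL", "DEEPENING", "INTERFACE-DESIGN", "LANGUAGE"],
}


def verbatim_paths() -> list:
    """Expand the table into repo-relative paths: 1 + 6 + 1 + 1 + 4 files."""
    return [
        f"skills/{skill}/{stem}.md"
        for skill, stems in VERBATIM.items()
        for stem in stems
    ]


def missing_files(repo_root: str) -> list:
    """Paths from the table that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in verbatim_paths() if not (root / p).is_file()]
```

Running missing_files(".") from the repo root after slice-002 lands would list any import that was skipped.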

Appendix A — Content for docs/ai-coding-principles.md

This appendix captures the constitutional principles that came out of the grill-me session for this PRD. Slice-010 authors docs/ai-coding-principles.md using this content.

Posture: project-scoped doc. Loaded on demand via project-level CLAUDE.md when an agent works in devkit-the-project. Not an auto-loaded rule.

Context discipline

  • Smart zone is ~100k tokens. Beyond that, attention degrades regardless of advertised context window. Size tasks to fit.
  • Clear over compact. Memento-style: every new task starts with a fresh context. Compaction preserves sediment that hurts later judgment.
  • Push for reviewer, pull for implementer. Implementer pulls skills on demand. Reviewer gets coding standards pushed inline.

Planning discipline

  • Specs-to-code is rejected. Don't ignore the code; don't just regenerate from a spec. Code is the battleground.
  • Grill-me before you plan. Reach a shared design concept first (Brooks). Walk the design tree, one question at a time, with a recommended answer per question.
  • Don't read the PRD after generation. It's a destination doc only.
  • Don't keep PRDs around long-term. Doc-rot risks future grilling sessions anchoring on stale text.

Slicing discipline

  • Vertical slices, never horizontal. One slice cuts schema → service → UI → test.
  • Issue body is the source of truth. PRD lives as issue body. Slices live as ## Slices checklist appended to body. No child issues.
  • ≤10 slices per PRD. Encoded in to-issues skill.
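
The slice cap and the "QA Checklist must exist before slicing" rule are both simple preconditions, so a slicer could check them before touching the issue body. The following is a rough sketch of such a preflight, assuming the PRD body is Markdown with a `## QA Checklist` heading; slicing_preflight and MAX_SLICES are hypothetical names, and the real /to-issues skill may enforce this differently (for example, purely through prompt instructions).

```python
import re

MAX_SLICES = 10  # cap stated in the slicing discipline


def slicing_preflight(prd_body: str, proposed_titles: list) -> list:
    """Return reasons not to append a ## Slices section; empty means go ahead."""
    errors = []
    # The slicer is meant to refuse a PRD that has no QA checklist yet.
    if not re.search(r"^## +QA Checklist[ \t]*$", prd_body, flags=re.MULTILINE):
        errors.append("PRD body has no ## QA Checklist; run /to-prd first")
    # Over the cap, the recommendation is to split the PRD, not to trim slices.
    if len(proposed_titles) > MAX_SLICES:
        errors.append(
            f"{len(proposed_titles)} slices exceeds the cap of {MAX_SLICES}; "
            "recommend splitting the PRD"
        )
    return errors
```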

Module discipline

  • Deep modules over shallow. Simple interface, complex implementation (Ousterhout).
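  • Deletion test. Imagine removing this module: if its complexity would reappear in every caller, the module is earning its keep (good); if the complexity would just move somewhere else unchanged, the module is a shallow pass-through (bad).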
  • Design the interface, delegate the implementation. Treat modules as gray boxes once the interface is locked.

TDD discipline

  • AI cheats at tests. Default: writes all impl, then writes tests against it. Counter: red-green-refactor with vertical slicing.
  • Test behavior, not implementation. Tests must survive internal refactor.
  • Feedback loops are the speed limit. Tests + types + browser MCP. Don't outrun your headlights.

Front-end discipline

  • Front-end is multimodal — AI can't see. Use throwaway prototype routes for taste decisions.

Interaction discipline

  • Plain text Q&A only. AskUserQuestion UI is rejected.
  • Wait for responses. When asking, stop and wait.

Domain discipline

  • Maintain UBIQUITOUS_LANGUAGE.md. Shared terminology between user, AI, and code.
  • Architectural decisions log only for system-wide choices. Per-feature decisions in PRD-issue bodies; system-wide invariants here. Append to docs/decisions.md only when the decision binds future architecture project-wide.

Operational discipline

  • One pipeline, two trigger modes. Manual = chat. AFK = ./ralph/afk.sh N. Same agents, same skills, same output.
  • AFK loop has fresh context per iteration. Ralph script restarts claude --print per slice.
  • --dangerously-skip-permissions is acceptable for solo trusted operator. Git is the safety net. Risk profile in docs/SAFETY.md.
  • AFK observability: terminal stream + git log + GitHub issue body checkboxes + close events. Orchestration chat and Ralph terminal are separate observables; no chat-feedback channel needed.
  • Editing external state through local buffers: when an external system stores authoritative state (GitHub issue body, remote config), the discipline is pull → surgical Edit → push, never full-file rewrite. For GitHub issues: gh issue view <N> --json body --jq .body > /tmp/<name>.md (pull fresh), targeted Edit calls (each shows an explicit old_string → new_string pair, so the diff is visible and the typo blast radius is bounded), gh issue edit <N> --body-file /tmp/<name>.md (push), then discard the local buffer (don't reuse it across rounds; pull fresh next time to avoid silently overwriting parallel edits). Heredoc-based full-body rewrites are tempting but lose diff visibility, burn tokens regenerating unchanged content, and can clobber concurrent changes.
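
The pull → surgical edit → push discipline can be sketched as a small helper. This is one possible shape under stated assumptions, not the pipeline's implementation: it assumes the gh CLI is installed and authenticated for the current repo, and the names surgical_replace and edit_issue_body are hypothetical. The two gh invocations mirror the commands quoted in the bullet above.

```python
import subprocess
import tempfile
from pathlib import Path


def surgical_replace(text: str, old: str, new: str) -> str:
    """Replace old with new only if old occurs exactly once."""
    count = text.count(old)
    if count != 1:
        # Zero matches means a stale buffer or a typo; several means the
        # edit is ambiguous. Either way, refuse instead of guessing.
        raise ValueError(f"expected exactly one match, found {count}")
    return text.replace(old, new, 1)


def edit_issue_body(issue: int, old: str, new: str) -> None:
    """Pull the issue body fresh, apply one targeted edit, push it back."""
    pulled = subprocess.run(
        ["gh", "issue", "view", str(issue), "--json", "body", "--jq", ".body"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # Assumption: gh prints the body followed by one newline; drop it so
    # repeated edits do not accumulate blank lines.
    body = pulled.removesuffix("\n")
    updated = surgical_replace(body, old, new)
    with tempfile.TemporaryDirectory() as tmp:
        buffer = Path(tmp) / "body.md"
        buffer.write_text(updated, encoding="utf-8")
        subprocess.run(
            ["gh", "issue", "edit", str(issue), "--body-file", str(buffer)],
            check=True,
        )
    # Leaving the with-block deletes the buffer, so it cannot be reused.
```

For example, ticking a slice would be edit_issue_body(1, "- [ ] slice-007", "- [x] slice-007"). The exactly-once rule is a design choice of this sketch; it trades convenience for a guarantee that an edit never lands somewhere unintended.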

QA discipline

  • QA is human taste. What makes a feature actually ship-worthy is not auto-verifiable — it's clicking through the user flow and noticing what's wrong.
  • ## QA Checklist lives in the PRD body. 3-5 concrete manual-verification steps authored at PRD time by /to-prd.
  • QA gate before close: when all AFK slices ticked, Ralph posts Ready for QA and stops. User runs ## QA Checklist manually, ticks each box, closes the parent issue manually.
  • QA findings become new issues via /qa skill — never ad-hoc fixes mid-Ralph-run.
  • Don't auto-tick QA boxes, ever. The whole point is human-in-the-loop verification.
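
The QA gate described above reduces to a small decision: keep working while any AFK slice is unticked, post Ready for QA once when none remain, and otherwise stop. The sketch below models that decision under assumptions. It assumes slices are checkbox lines ending in AFK or HITL, the names Action and next_action are hypothetical, and the "do not post the comment twice" behaviour is an assumption of this sketch, not something the PRD specifies.

```python
import re
from enum import Enum


class Action(Enum):
    RUN_SLICE = "run the next AFK slice"
    POST_READY_FOR_QA = "post a Ready for QA comment, then stop"
    STOP = "stop and wait for human QA"
    # Deliberately absent: any action that closes the issue or ticks a
    # QA Checklist box. Those stay manual.


# Assumed line shape for an unticked AFK slice.
OPEN_AFK_RE = re.compile(r"^- \[ \] slice-\d+\b.*\bAFK[ \t]*$", re.MULTILINE)


def next_action(issue_body: str, comments: list) -> Action:
    """Decide what the loop does next for one parent issue."""
    if OPEN_AFK_RE.search(issue_body):
        return Action.RUN_SLICE
    # Assumption: an existing comment means the gate was already announced.
    if not any("Ready for QA" in comment for comment in comments):
        return Action.POST_READY_FOR_QA
    return Action.STOP
```

Unticked HITL slices do not keep the loop running in this sketch, which matches the "all AFK slices ticked" wording but leaves HITL follow-up to the human.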

Failure-mode awareness

  • Misalignment (AI builds wrong thing) → grill harder.
  • Verbose output → ubiquitous language gap.
  • Doesn't work → feedback loop weakness.
  • Brain can't keep up → modules are shallow; deepen them.
  • Plan mode is too eager → reach the design concept first via grill-me.
  • Feature feels off after Ralph completes it → QA gate caught it; file new issues via /qa, don't bypass to a quick patch.

Metadata

Assignees

No one assigned

    Labels

    ready: Explored and scoped; agent can pick up and execute
