A battle-tested process for building software with AI coding agents.
Most teams using AI coding agents (Claude Code, Cursor, Copilot, etc.) have the same experience: the agent writes code fast, but the code doesn't follow your conventions, breaks existing patterns, introduces lint errors, and creates inconsistent commit histories. You spend as much time correcting the agent as you would have spent writing the code yourself.
This playbook solves that. It's a complete engineering process — extracted from a production application — that makes AI agents disciplined contributors rather than chaotic ones. It covers everything from how to configure the agent's instructions, to how quality gates prevent regressions, to how debugging knowledge compounds over time.
## The Playbook (`playbook.md`)
A comprehensive guide (1,400+ lines) covering 13 areas of engineering process, written for a junior engineer setting up a new project:
| Section | What It Covers |
|---|---|
| Philosophy & Principles | Quality ratchet, local CI = remote CI, compound learning, honest opposition |
| Agent.md as Project Constitution | The single file that makes every agent session productive |
| Agent Tooling Setup | Serena (MCP), agent settings, memory systems |
| Skills System | Reusable workflows agents activate on demand (commit, review, brainstorm) |
| Branching & Worktree Workflow | Parallel-safe isolation for humans and agents working simultaneously |
| Code Quality Gates | Two tiers: local CI (before push) and remote CI (safety net) — same checks, same source of truth |
| Code Health Metrics | Maintainability index, complexity, file size, duplication, quality ratchet |
| CI/CD Pipeline | Single source of truth for checks, dev-first promotion, zero-rebuild deploys |
| Architecture Patterns | Exception hierarchy, Unit of Work, Repository, task queue protocol |
| Testing Strategy | Pytest config, fixtures, async patterns, what NOT to test |
| Documentation Infrastructure | Architecture docs, lessons-learned, compound learning loop |
| Infrastructure & DevOps | Dev setup, expand/contract migrations, secrets, deployment |
| Adoption Roadmap | Phased rollout from Day 1 to Month 2+ |
## Copy-Pasteable Examples (`examples/`)

Every script, config file, and template you need — adapted for portability, with `ADAPT` comments marking project-specific values:
```
examples/
├── ci-checks.json                      # CI check definitions (single source of truth)
├── scripts/
│   ├── ci_check_local.py               # Local CI runner (reads ci-checks.json)
│   ├── check_code_health.py            # Code health metrics (MI, CC, SLOC, duplication)
│   ├── quality_delta.py                # PR quality regression gate (the ratchet)
│   └── check_migration_heads.py        # Alembic migration head validator
├── skills/
│   ├── commit/SKILL.md                 # Commit -> push -> PR -> cleanup workflow
│   ├── code-review/
│   │   ├── SKILL.md                    # Multi-persona review orchestration
│   │   ├── findings-schema.md          # Structured findings format
│   │   └── personas/                   # 5 specialized reviewer personas
│   │       ├── correctness.md          # Logic & edge case reviewer
│   │       ├── testing.md              # Test coverage reviewer
│   │       ├── project-standards.md    # Convention compliance (adapt this)
│   │       ├── security.md             # Security reviewer (conditional)
│   │       └── adversarial.md          # Failure scenario reviewer (conditional)
│   └── brainstorming/SKILL.md          # Scope-adaptive brainstorming (Light/Standard/Deep)
├── config/
│   ├── mcp.json                        # Serena MCP server config
│   └── claude-settings.json            # Agent permissions
└── docs/
    ├── Agent.md                        # Starter project constitution
    └── lessons-learned/
        ├── README.md                   # Lesson index template
        └── TEMPLATE.md                 # Individual lesson template
```
## The Problem

Without a structured process, AI-assisted development creates these problems:
- **Inconsistent quality.** One agent session follows your conventions, the next doesn't. You get random commit messages, missing type annotations, lint errors that slip through, and architectural violations that take days to unwind.
- **No institutional memory.** Every agent session starts from zero. It doesn't know your branching strategy, your testing conventions, your exception hierarchy, or the bug you spent 4 hours tracking down last week. You repeat the same corrections session after session.
- **CI whack-a-mole.** Push, wait 5 minutes for CI, discover a lint error, fix it, push again, wait another 5 minutes, discover a type error... This loop destroys flow and creates noisy commit histories.
- **Quality debt accumulates silently.** Without metrics that ratchet, every "just this once" exception becomes permanent. Type suppressions grow, lint suppressions grow, complexity grows — and nobody notices until the codebase is painful to work in.
- **Debugging knowledge evaporates.** You spend 2 hours discovering that SAQ's default timeout is 10 seconds. Next month, someone (or some agent) hits the same problem. The knowledge existed briefly in a conversation, then vanished.
| Problem | Solution | Playbook Section |
|---|---|---|
| Inconsistent quality | Agent.md — a project constitution every agent session reads | Section 2 |
| No institutional memory | Memory systems + compound learning loop | Section 3, Section 11 |
| CI whack-a-mole | Local CI = Remote CI — same checks, catch everything before push | Section 6 |
| Silent quality debt | Quality ratchet — metrics can only improve, never regress | Section 7 |
| Debugging knowledge lost | Lessons-learned docs that feed back into Agent.md rules | Section 11 |
## Quick Start

These first steps provide immediate value with minimal effort:
- **Copy `Agent.md` to your project root.** Edit it to match your conventions.

  ```bash
  cp examples/docs/Agent.md ./Agent.md
  ```

- **Set up local CI.** Copy the check definitions and runner script.

  ```bash
  cp examples/ci-checks.json ./ci-checks.json
  mkdir -p scripts
  cp examples/scripts/ci_check_local.py ./scripts/
  # Edit ci-checks.json to match your project's commands and paths
  ```

- **Run it before every commit.** Your commit skill should invoke this automatically:

  ```bash
  python scripts/ci_check_local.py --fix
  ```
The script auto-fixes formatting/linting and validates everything else (types, tests, code health, migrations). The same script runs in your CI workflow — so "local passes" means "CI will pass."
After that, phase in the rest:

- Copy `scripts/check_code_health.py` and add it to your `ci-checks.json`
- Create `.type-ignore-threshold` with your current count
- Mirror your `ci-checks.json` in your CI workflow (see the sketch after this list)
- Copy `skills/commit/` and `skills/code-review/` to your project
- Start `docs/lessons-learned/` after your first non-trivial debugging session
- Adapt `skills/code-review/personas/project-standards.md` to your conventions
- Add `quality_delta.py` for PR regression checks
- Set up Serena MCP for semantic code navigation
- Add the brainstorming skill for design workflows
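What "mirroring" looks like depends on your CI provider. As a hedged sketch (assuming GitHub Actions and a `requirements-dev.txt`; names and versions are placeholders, not the playbook's shipped pipeline), the workflow can simply re-run the same local script:

```yaml
# .github/workflows/ci.yml (illustrative sketch; adapt names and versions)
name: CI
on: [push, pull_request]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-dev.txt
      # Same entry point as local runs; validate only, no --fix in CI
      - run: python scripts/ci_check_local.py
```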
See the full Adoption Roadmap for details.
## Agent.md as Project Constitution

A markdown file at your project root that AI coding agents read at the start of every session. It defines rules, conventions, and architectural decisions. Without it, every session starts from zero. With it, the agent knows your branching strategy, testing conventions, exception hierarchy, and file size limits from the first message.

For Claude Code, this file is named `CLAUDE.md`. For other agents, adapt the name to whatever your tool reads (e.g., `.cursorrules`, `.github/copilot-instructions.md`). The content is the same — the playbook uses the generic name `Agent.md`.
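A starter file ships at `examples/docs/Agent.md`. The fragment below is only an illustrative sketch of the kind of rules such a file holds; the headings and rules are invented for this example, not copied from the starter:

```markdown
# Agent.md

## Branching
- Never commit directly to main. Use feature branches: feat/<slug>, fix/<slug>.

## Quality gates
- Run `python scripts/ci_check_local.py --fix` before every commit.
- Never add a `# type: ignore` without removing one elsewhere (the ratchet).

## Before starting work
- Search docs/lessons-learned/ for lessons related to the task at hand.
```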
## The Quality Ratchet

Metrics can only improve, never regress. Every PR is compared against its merge base:
- Added a `# type: ignore`? Remove one somewhere else.
- Introduced a complexity violation? Simplify it.
- New public function without type annotations? Add them.

This is enforced by `quality_delta.py`, which runs on PRs and fails if any metric regressed.
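The shipped script is `examples/scripts/quality_delta.py`; the sketch below only illustrates the core mechanic for a single metric. The function names, the `origin/main` base, and the choice of metric are assumptions for this example:

```python
# Illustrative ratchet check, not the shipped quality_delta.py.
import subprocess
import sys


def count_type_ignores(ref: str) -> int:
    """Count `# type: ignore` comments in Python files at a given git ref."""
    result = subprocess.run(
        ["git", "grep", "-c", "# type: ignore", ref, "--", "*.py"],
        capture_output=True,
        text=True,
    )
    # With -c and a rev, git grep prints "ref:path:count" per file
    # (exit code 1 with empty output simply means zero matches).
    return sum(int(line.rsplit(":", 1)[1]) for line in result.stdout.splitlines())


def main() -> None:
    merge_base = subprocess.run(
        ["git", "merge-base", "HEAD", "origin/main"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    before = count_type_ignores(merge_base)
    after = count_type_ignores("HEAD")
    if after > before:
        sys.exit(f"Ratchet violation: type-ignore count rose {before} -> {after}")
    print(f"OK: type-ignore count {before} -> {after}")


if __name__ == "__main__":
    main()
```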
## Local CI = Remote CI

A single `ci-checks.json` file defines every check. The local runner (`ci_check_local.py`) and the remote CI workflow both read from this file. Committing takes 30-60 seconds longer. But the alternative — push-wait-fail-fix, push-wait-fail-fix — wastes far more time and creates noisy histories.
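The real schema is whatever the example file defines; purely as an illustration of the shape such a file might take (the field names here are invented), consider:

```json
{
  "checks": [
    { "name": "format", "command": "ruff format --check .", "fix_command": "ruff format ." },
    { "name": "lint",   "command": "ruff check .",          "fix_command": "ruff check . --fix" },
    { "name": "types",  "command": "mypy ." },
    { "name": "tests",  "command": "pytest -q" },
    { "name": "health", "command": "python scripts/check_code_health.py" }
  ]
}
```

Under a schema like this, the local runner iterates the entries (preferring `fix_command` when `--fix` is passed) and the CI workflow executes the same definitions, which is why "local passes" implies "CI will pass."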
## Multi-Persona Code Review

Five specialized reviewers dispatched in parallel, each focused on what it's best at:

- **Correctness** — logic errors, edge cases, state bugs
- **Testing** — coverage gaps, weak assertions
- **Project Standards** — convention compliance
- **Security** (conditional) — injection, auth, data exposure
- **Adversarial** (conditional) — failure scenarios, cascade failures
Each reviewer has a persona document defining what to hunt for, what to ignore, and how to calibrate confidence.
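For a sense of what a persona document contains, here is an invented sketch (not one of the shipped personas under `skills/code-review/personas/`):

```markdown
# Persona: Correctness Reviewer

## Hunt for
- Off-by-one errors, unhandled None/empty inputs, state shared across async boundaries

## Ignore
- Style and formatting (the linter owns those), naming preferences

## Confidence calibration
- Report high confidence only when you can name a concrete failing input;
  otherwise flag the finding as "needs verification"
```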
## The Compound Learning Loop

A three-step loop that makes your codebase smarter over time:

1. **Consult** — before starting work, search `docs/lessons-learned/`
2. **Capture** — after resolving a non-trivial bug, write a lesson
3. **Promote** — when a lesson recurs, elevate it to a rule in `Agent.md`
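The repo ships `docs/lessons-learned/TEMPLATE.md`; a captured lesson can be as small as this invented example (the fields and incident details are illustrative, built from the SAQ timeout story above):

```markdown
# Lesson: SAQ jobs silently die after 10 seconds

**Date:** 2025-01-15 · **Cost:** ~2 hours of debugging
**Symptom:** Long-running jobs vanished with no error in application logs.
**Root cause:** SAQ's default per-job timeout is 10 seconds.
**Fix:** Set an explicit timeout when enqueueing long-running jobs.
**Promote?** Yes: add "always set explicit timeouts on queued jobs" to Agent.md.
```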
## Portability

This playbook was developed on a Python/FastAPI + React/TypeScript stack, but the principles and most of the tooling are adaptable:
| Component | Used Here | Alternatives |
|---|---|---|
| Python linter/formatter | ruff | black + flake8, pylint |
| Type checker | mypy | pyright, pytype |
| Code metrics | radon | wily, complexipy |
| Duplication | jscpd | CPD, Simian |
| Dead code | vulture | pylint unused-import |
| Dependency check | deptry | pip-extra-reqs |
| Task queue | SAQ | Celery, Dramatiq, Arq |
| Migrations | Alembic | Django migrations, Flyway |
| Code navigation | Serena (MCP) | — |
| Agent | Claude Code | Cursor, Copilot, Aider |
## Contributing

This playbook is open source. If you've adapted it for a different tech stack, found improvements, or have new skill definitions that would benefit others, PRs are welcome.

## License

MIT
Extracted from a production application by Chuck Conway. Built with Claude Code.