From c8b901c212d1dafa69a670cde3362aa50bf47f78 Mon Sep 17 00:00:00 2001 From: "const.koutsakis@aurecongroup.com" Date: Mon, 27 Apr 2026 18:33:16 +1000 Subject: [PATCH] docs: HARNESS, INVARIANTS, BOUNDARIES, DEVELOPMENT, EVAL_HARNESS, SECURITY, ARCHITECTURE + README, CONTRIBUTING, CLAUDE.md, CHANGELOG, TASKS.md (#25, #26) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes #25 + #26 in one PR (the README references docs/* paths and TASKS.md references everything else; landing them separately would mean a half-broken docs tree in between). docs/* (all written for the template, not Teller-flavoured): - HARNESS.md — umbrella table mapping every layer to its config file and to the meta-gate(s) that catch drift in it. - INVARIANTS.md — five portable rules with numbered slots 6+ for project additions. - BOUNDARIES.md — ASCII layer diagram + the import-linter contract spec + how to add a layer cleanly. - DEVELOPMENT.md — prereqs, first-time setup, dev stack, justfile recipes table, branching diagram, commit-prefix table, CI workflow inventory, agent-hook setup, branch-protection token setup. - EVAL_HARNESS.md — runner architecture, three tolerance modes, wiring your agent / LLM client, adding a case, opt-in for nightly schedule. - SECURITY.md — threat model table + defence-in-depth ASCII map + container hardening notes + explicit out-of-scope list (auth, WAF, rate-limit, secret manager). - ARCHITECTURE.md — scaffold component diagram, request lifecycle, frontend lifecycle, slots that fill in as the project grows. Top-level docs: - README.md — what ships / quickstart / why-a-harness / docs index / versions table / license. - CONTRIBUTING.md — branching diagram, commit-prefix table, PR template callouts, "adding a check" recipe. - CLAUDE.md — agent project instructions: read-first list, workflow, code conventions, what-not-to-do, skills inventory. 
- CHANGELOG.md — release-drafter seed; first Unreleased entry summarises the harness extraction. - docs/TASKS.md — full ticket table with phase grouping + status emoji, matches the GitHub Project board. Closes #25 Closes #26 Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 21 +++++++ CLAUDE.md | 63 ++++++++++++++++++++ CONTRIBUTING.md | 79 +++++++++++++++++++++++++ README.md | 71 ++++++++++++++++++++++- docs/ARCHITECTURE.md | 87 ++++++++++++++++++++++++++++ docs/BOUNDARIES.md | 60 +++++++++++++++++++ docs/DEVELOPMENT.md | 135 +++++++++++++++++++++++++++++++++++++++++++ docs/EVAL_HARNESS.md | 106 +++++++++++++++++++++++++++++++++ docs/HARNESS.md | 40 +++++++++++++ docs/INVARIANTS.md | 50 ++++++++++++++++ docs/SECURITY.md | 73 +++++++++++++++++++++++ docs/TASKS.md | 93 +++++++++++++++++++++++++++++ 12 files changed, 876 insertions(+), 2 deletions(-) create mode 100644 CHANGELOG.md create mode 100644 CLAUDE.md create mode 100644 CONTRIBUTING.md create mode 100644 docs/ARCHITECTURE.md create mode 100644 docs/BOUNDARIES.md create mode 100644 docs/DEVELOPMENT.md create mode 100644 docs/EVAL_HARNESS.md create mode 100644 docs/HARNESS.md create mode 100644 docs/INVARIANTS.md create mode 100644 docs/SECURITY.md create mode 100644 docs/TASKS.md diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..ea6f48b --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,21 @@ +# Changelog + +All notable changes to this project will be documented in this file. + +The format is loosely based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). + +Released versions are drafted automatically by [release-drafter](https://github.com/release-drafter/release-drafter); see `.github/release-drafter.yml` and `.github/workflows/release-drafter.yml`. Each entry on the GitHub Releases page corresponds to a tag of the form `vX.Y.Z`. 
+ +## Unreleased + +### Added + +- Initial harness scaffold (Python 3.14 + FastAPI + Pydantic v2 + OpenTelemetry; React 19.2 + Vite + TypeScript strict). +- 15 required CI status checks (lint, typecheck, tests, coverage ≥ 75 %, import-linter, pre-commit, frontend build/quality, security suite, two meta-gates, PR-title lint). +- Release pipeline: tag-triggered build, push to GHCR, CycloneDX SBOM, GitHub Release publish. +- Eval harness scaffold (provider-agnostic runner + LLM-judge Protocol + 1 example golden case + workflow_dispatch nightly). +- `.claude/` agent integration (3 hooks, 6 auto-activating skills, settings example). + +### Notes + +- This template was extracted from a financial-agent take-home (Teller) and generalised. The harness is the product; the scaffold exists so every gate has something to operate on. diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..c83c136 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,63 @@ +# CLAUDE.md — agent project instructions + +You are working in `harness-python-react`, a template repo whose harness IS the product. Code quality here is enforced mechanically — every gate fails CI, not just tests. Keep that bar as you work. + +## What this repo is + +A production-quality LLM-driven coding harness over a minimal FastAPI + React scaffold. The point isn't the features (one `/health`, one `/echo`, one hello page); the point is that every layer of the pipeline — lint, types, architecture, security, eval, agent hooks — catches a different failure class without anyone remembering to run it. + +## Read first + +- [`docs/HARNESS.md`](docs/HARNESS.md) — umbrella; the controls and where they live. +- [`docs/INVARIANTS.md`](docs/INVARIANTS.md) — the load-bearing rules. Every PR is checked against them. +- [`docs/BOUNDARIES.md`](docs/BOUNDARIES.md) — layered import-linter contract; reverse imports fail CI. +- [`docs/DEVELOPMENT.md`](docs/DEVELOPMENT.md) — branching, commit format, justfile, CI overview. 
+ +## Workflow + +- One issue per change. Branch name: `feat|fix|chore|docs|test|refactor/-`. +- One PR per branch, base `develop`. PR title = the conventional-commit subject. +- `develop → main` happens via a `release:` PR. +- The pre-push gate is `just check` (lint + typecheck + architecture + tests). Run it before pushing. +- For frontend changes, also run `just frontend-check`. + +## Code conventions + +- **Python:** 3.14, `uv run --frozen` everywhere, mypy `--strict`, ruff with the wide select set (`E W F I N UP B SIM TCH S RUF`). +- **Type hints:** every public function. `from __future__ import annotations` at module top. +- **Models:** anything crossing a module / process seam inherits from `StrictModel` (`src/models/_base.py`). `extra="forbid"`. Add `strict=True` to the class when you want strict type coercion (rejecting `"3.14"` → float). +- **API:** every route under `/api/v1/`. Typed Pydantic responses, not raw dicts. +- **Layer flow:** one-way. Reverse imports are a CI failure. See `docs/BOUNDARIES.md`. +- **Observability:** OTel `agent_span(...)` for any operation in the request path; semconv-defined attribute keys only (constants at the top of `src/observability/spans.py`). +- **Frontend:** React 19 + TS strict; functional components + hooks; never `dangerouslySetInnerHTML` on backend output; SSE consumers use the typed primitive at `frontend/src/lib/api/client.ts`. + +## What NOT to do + +- Don't bypass gates. `--no-verify` / `--no-hooks` / `--no-gpg-sign` are blocked by `pretooluse_bash.py` for a reason. If a hook is wrong, fix the hook. +- Don't introduce a new commit-type prefix without updating both `pyproject.toml`'s commitizen schema AND `pr-title.yml` (the `Commit-type sync` meta-gate will fail otherwise). +- Don't add a CI job without listing it in `.github/branch-protection/{develop,main}.json` (the `Branch-protection contexts sync` meta-gate will fail). 
+- Don't skip the architecture contract by accident — `lint-imports` runs in CI and locally via `just architecture`. +- Don't write code without tests. Coverage gate is 75% on `src/`. +- Don't hand-roll secrets into config. Use env / `.env` (gitignored) → `Settings` from `src/models/config.py`. +- Don't create files unless they're necessary. The scaffold has no dead modules. + +## Use the skills + +The agent-side skills in `.claude/skills/` auto-activate based on context: + +- `architect` — when designing module boundaries, API contracts, layer-flow decisions. +- `code-reviewer` — after writing/editing code; runs the 10-point review checklist. +- `devops` — when touching Docker, CI, pyproject.toml, observability config. +- `frontend` — when working in `frontend/` (React 19 + TS + Vite). +- `qa-engineer` — when writing tests or extending the eval harness. +- `technical-writer` — when updating docs / READMEs. + +Trust their guidance — they encode this project's conventions. + +## When in doubt + +- If the change touches a gate, update the meta-gate inputs (`branch-protection/*.json`, `pr-title.yml`, `check_required_contexts.py`'s exemption list). +- If the change touches an invariant, decide whether the invariant is wrong (update `docs/INVARIANTS.md` in the same PR) or the change is wrong (rework). +- If a CI job is failing for a reason that doesn't match the change, dig — don't reroll. Recent fix patterns: tag-vs-commit SHA in pinned action references, `if: hashFiles(...)` startup failures (see project-memory), pytest exit-5 on empty test suites. + +The harness exists to make sloppy work hard. Lean into it — when a gate trips, it's protecting the next person reading this codebase. diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..bd36cb3 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,79 @@ +# Contributing + +Thanks for taking a look. 
This template's harness is the product, so the contribution flow is opinionated — every change goes through the same gates as a feature. + +## Branching + +``` +main ◄── release PR ◄── develop ◄── feat/123-short-name + ◄── fix/124-bug-name + ◄── chore/125-config-change +``` + +- `main` is the release line. Protected: 15 required status checks, code-owner approval, no force pushes. +- `develop` is the integration branch. Same gates, less strict (PRs don't need rebases). +- Feature branches are short-lived and named `/-`. Open one issue per branch so the project board stays usable. + +## Commit messages + +Seven prefixes (enforced in three places — `[tool.commitizen]` in `pyproject.toml`, `pr-title.yml`, `check_commit_types.py`): + +| Prefix | When | +|---|---| +| `feat:` | New capability | +| `fix:` | Bug fix | +| `docs:` | Documentation only | +| `test:` | Tests / eval harness | +| `refactor:` | Internal change with no behaviour delta | +| `chore:` | Tooling, deps, infra | +| `release:` | `develop → main` release PRs only | + +The subject is **lowercase** after the colon. Title Case prose (`Add the thing`) is rejected; all-caps initialisms (`CI failure`, `SDK upgrade`) are fine. + +## Pull requests + +1. Open the issue first. Use a feature/bug template; fill every section. +2. Branch off `develop` with the matching name. +3. Land one logical change per PR. Stack PRs if the work is naturally split. +4. The PR template asks five things — answer each (`None` is valid where applicable): + - **What & why** (1–3 lines) + - **Test plan** (checkbox list; CI covers most of it) + - **Invariants affected** — cite numbered rules from `docs/INVARIANTS.md` + - **New deps / actions / external surface** (anchor for supply-chain review) + - **Screenshots** (UI changes only) +5. Wait for green CI + a code-owner review before merging. 
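The commit-message rules in the section above can be pictured with a minimal Python sketch. It is an illustration only, not the contents of `check_commit_types.py` or `pr-title.yml`; the function name `is_valid_title` and the regular expression are assumptions made for this example.

```python
import re

# The seven prefixes listed in the commit-prefix table.
ALLOWED_PREFIXES = ("feat", "fix", "docs", "test", "refactor", "chore", "release")

# Hypothetical pattern: "<prefix>: <subject>" with a non-empty subject.
_TITLE_RE = re.compile(r"^(?P<prefix>[a-z]+): (?P<subject>\S.*)$")


def is_valid_title(title: str) -> bool:
    """Return True when the title has an allowed prefix and a lowercase subject."""
    match = _TITLE_RE.match(title)
    if match is None or match.group("prefix") not in ALLOWED_PREFIXES:
        return False
    first_word = match.group("subject").split()[0]
    # All-caps initialisms such as "CI" or "SDK" are accepted.
    if first_word.isupper():
        return True
    # Otherwise the subject must not start with a capital letter.
    return not first_word[0].isupper()
```

Under these assumptions `feat: add the thing` and `fix: CI failure` are accepted, while `feat: Add the thing` and an unlisted prefix such as `perf:` are rejected.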
+ +## Local pre-push gate + +```sh +just check # ruff + mypy + import-linter + pytest +cd frontend && npm run lint && npm run format:check && npm run check && npm run test && npm run build +uv run pre-commit run --all-files +``` + +A green pre-push run is a high-confidence predictor of a green CI run. The `just check` gate is intentionally a subset of CI — fast feedback over coverage. + +## Adding a check + +When the harness grows a new gate: + +1. Add the workflow job in `.github/workflows/`. +2. If it's a required gate, add the job's display name to the `contexts` arrays in `.github/branch-protection/{develop,main}.json`. +3. If it's NOT required (scheduled / dispatch-only / push-to-main-only), add the workflow filename to `EXEMPT_WORKFLOWS` in `.github/scripts/check_required_contexts.py`. +4. Update `docs/HARNESS.md` and `docs/SECURITY.md` (if security-relevant). +5. Land in one PR — the meta-gate `Branch-protection contexts sync` will fail if you skip step 2 or 3. + +## Code of conduct + +Be kind. Disagree on substance, not on people. If review feedback gets sharp, take it offline and come back when both sides are ready. + +## Reporting security issues + +If you find a vulnerability that affects users of the template, **do not open a public issue**. Email the maintainer (see commit history for contact). Include: + +- Repro steps +- Affected version / commit SHA +- Severity assessment (informational / low / medium / high / critical) +- Suggested fix if you have one + +We'll acknowledge within 72 hours. diff --git a/README.md b/README.md index 2710d0d..cb234a3 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,72 @@ # harness-python-react -Production-quality coding harness for Python (FastAPI) backends and Vite + React + TypeScript frontends. Designed for LLM-driven development: every gate — lint, types, architecture, security, eval — is enforced mechanically so code quality stays consistent across many human and AI contributors. 
+> A production-quality coding harness for Python (FastAPI) + Vite/React/TypeScript projects. Designed for LLM-driven development: every gate — lint, types, architecture, security, eval — is enforced mechanically so code quality stays consistent across many human and AI contributors. -> **Status:** bootstrap. Full documentation, scaffolding, and the harness itself land across [issues #1–#28](https://github.com/constk/harness-python-react/issues). Track progress on the [project board](https://github.com/users/constk/projects/3). +## What ships + +- **Backend:** Python 3.14, FastAPI, Pydantic v2 (`StrictModel` base), `uv` deps, OpenTelemetry SDK + OTLP exporter, structured JSON logs, generic tool-registry pattern. +- **Frontend:** Node 24 LTS, React 19.2, Vite 8, TypeScript strict, ESLint 10 flat config, Prettier, Vitest + jsdom + Testing Library. +- **Eval harness:** provider-agnostic runner + LLM-judge `Protocol`, three tolerance modes (exact / numeric / semantic), one example golden case, nightly workflow (disabled by default). +- **CI:** 15 required status checks across `ci.yml` (lint/format, mypy strict, unit tests, coverage ≥75%, import-linter architecture, pre-commit, frontend build, frontend quality, branch-protection sync, commit-type sync) + `security.yml` (gitleaks, pip-audit, npm audit, trivy) + PR-title lint. +- **Release:** tag-triggered workflow that builds the image, pushes to `ghcr.io`, generates a CycloneDX SBOM, and publishes the GitHub Release. +- **Agent integration:** `.claude/hooks/` (forbidden-flag blocker, secret scan, formatter dispatch, SessionStart context) + six auto-activating skills (architect / code-reviewer / devops / frontend / qa-engineer / technical-writer). +- **Docker:** multi-stage Dockerfile (non-root, healthcheck), `docker compose up` boots app + frontend + Jaeger. 
+ +## Quickstart + +```sh +git clone https://github.com/constk/harness-python-react.git +cd harness-python-react + +uv sync --extra dev +uv run pre-commit install --hook-type pre-commit --hook-type commit-msg +(cd frontend && npm ci) + +docker compose up # backend :8000, frontend :5173, Jaeger :16686 +``` + +The pre-push gate is `just check` (= ruff + mypy + import-linter + pytest). For frontend changes add `just frontend-check`. + +## Why a harness + +The differentiator isn't the scaffold — it's that every layer of the pipeline catches a different failure class **without relying on the human or LLM coder remembering to run anything**. The same posture protects code regardless of who wrote it. + +See [`docs/HARNESS.md`](docs/HARNESS.md) for the full umbrella. Highlights: + +- **Pydantic `StrictModel` everywhere a contract crosses a seam** (rejects unknown keys at construction). +- **`import-linter` enforces one-way layer flow** (`api | eval → agent → tools → data → observability → models`). +- **Three independent secret scans** (PreToolUse hook → pre-commit gitleaks → CI gitleaks). +- **Two meta-gates** that catch *drift in the gates themselves*: `Branch-protection contexts sync` (workflow jobs vs branch-protection JSON) and `Commit-type sync` (commitizen schema vs PR-title allowlist). +- **CycloneDX SBOM attached to every release** for supply-chain attestation. 
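As a rough picture of what a drift meta-gate such as `Branch-protection contexts sync` checks, the sketch below compares two sets of names. It is not the repo's `check_required_contexts.py`; the function `find_drift` and the job names are hypothetical and chosen only for illustration.

```python
def find_drift(workflow_jobs: set[str], required_contexts: set[str]) -> dict[str, list[str]]:
    """Report names that appear on one side but not the other."""
    return {
        # Jobs that run in CI but are not listed as required contexts.
        "unprotected": sorted(workflow_jobs - required_contexts),
        # Required contexts that no workflow job produces any more.
        "stale": sorted(required_contexts - workflow_jobs),
    }


# Hypothetical job names, for illustration only.
jobs = {"Lint", "Typecheck", "Unit tests"}
contexts = {"Lint", "Typecheck", "Coverage"}
drift = find_drift(jobs, contexts)
gate_passes = not drift["unprotected"] and not drift["stale"]
```

A real gate would also need to account for exempt workflows; that part is left out of this sketch.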
+ +## Documentation + +| File | Purpose | +|---|---| +| [`docs/HARNESS.md`](docs/HARNESS.md) | Umbrella: every control + where it lives | +| [`docs/INVARIANTS.md`](docs/INVARIANTS.md) | The numbered load-bearing rules | +| [`docs/BOUNDARIES.md`](docs/BOUNDARIES.md) | Module layering + the import-linter contracts | +| [`docs/DEVELOPMENT.md`](docs/DEVELOPMENT.md) | Local setup, branching, justfile, CI | +| [`docs/EVAL_HARNESS.md`](docs/EVAL_HARNESS.md) | Eval flywheel + opt-in for the nightly workflow | +| [`docs/SECURITY.md`](docs/SECURITY.md) | Threat model + defence-in-depth map | +| [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) | Scaffold-level component view | +| [`CONTRIBUTING.md`](CONTRIBUTING.md) | Branching, commit format, PR flow | +| [`CLAUDE.md`](CLAUDE.md) | Agent-facing project instructions | + +## Versions + +Verified April 2026 (`endoflife.date`): + +| Layer | Version | Sunset | +|---|---|---| +| Python | 3.14.4 | active feature release | +| Node LTS | 24.15.0 | through 2028-04-30 | +| React | 19.2.5 | current stable | +| Vite | 8.x | current stable | +| TypeScript | 6.x | current stable | + +Bump together (Python in `pyproject.toml`, Node in `frontend/package.json`, both in `Dockerfile` + the CI matrix). Document the bump in `docs/DEVELOPMENT.md`. + +## License + +[MIT](LICENSE). diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 0000000..336244c --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,87 @@ +# Architecture (scaffold) + +The template's runtime architecture is intentionally small — one backend service, one frontend app, one tracing collector. The point of the scaffold is to exercise the harness, not to ship features. 
+ +## Components + +``` + ┌────────────────────┐ + │ Browser │ + │ (React 19.2) │ + └─────────┬──────────┘ + │ + │ http://localhost:5173 + │ (Vite dev server, HMR) + ▼ + ┌────────────────────┐ + │ frontend/ │ + │ Vite + React +TS │ + │ • App.tsx │ + │ • lib/api/ │ + │ client.ts (SSE)│ + └─────────┬──────────┘ + │ + │ /api/v1/* (proxied) + ▼ + ┌──────────────────────────────────────────────────────────┐ + │ src/api/ │ + │ ┌──────────┐ ┌─────────────┐ ┌──────────────────┐ │ + │ │ main.py │ ─►│ routes.py │ │ sessions.py │ │ + │ │ FastAPI │ │ /v1/health │ │ in-memory store │ │ + │ │ lifespan │ │ /v1/echo │ │ │ │ + │ └────┬─────┘ └─────────────┘ └──────────────────┘ │ + │ │ │ + │ │ setup_tracing → setup_logging → instrument_* │ + │ ▼ │ + │ ┌─────────────────────────────────────────────────────┐ │ + │ │ src/observability/ │ │ + │ │ tracing.py · logging.py · spans.py │ │ + │ └─────────────────────┬───────────────────────────────┘ │ + └────────────────────────┼──────────────────────────────────┘ + │ OTLP gRPC :4317 + ▼ + ┌─────────────────────┐ + │ Jaeger all-in-one │ :16686 (UI) + │ │ :4317 (OTLP gRPC) + │ │ :4318 (OTLP HTTP) + └─────────────────────┘ +``` + +## Slots that fill in as the project grows + +| Slot | Today | Eventual | +|---|---|---| +| `src/agent/` | Empty package | LLM tool-calling loop | +| `src/tools/` | `registry.py` + `echo_tool` | Real domain tools | +| `src/data/` | Empty package | DB / file / API clients | +| `src/api/sessions.py` | In-memory dict | Redis or DB-backed store | +| `eval/golden_qa.json` | One echo case | 15-50 cases by category | + +The layered import-linter contract lets each slot grow without coordination — adding `src/data/duckdb_client.py` doesn't trigger any change in `src/models/` or `src/api/`. + +## Request lifecycle (scaffold) + +1. Browser hits `GET /api/v1/health` (or `/echo?msg=...`). +2. FastAPI routes to `src/api/routes.py:health` / `routes.py:echo`. +3. 
The handler builds a typed response (`HealthResponse` / `EchoResponse` — both `StrictModel`). +4. OpenTelemetry's FastAPI instrumentation produces a span; OTLP exporter ships it to Jaeger. +5. The structured logger writes one JSON record per request, correlated by `trace_id` / `span_id`. +6. Response returns to the browser as JSON. + +## Frontend lifecycle (scaffold) + +1. Vite dev server serves `index.html` → `src/main.tsx` → `App.tsx`. +2. `App.tsx` runs `useEffect` once, fetching `/api/v1/health` via the browser's `fetch` (proxied through Vite to `:8000` in dev, served same-origin in a real deployment). +3. The component renders one of three states (`loading | ok | error`) using semantic ARIA roles + `data-testid` hooks for the Vitest suite. +4. CSS variables in `src/styles/palette.css` drive the visual; `[data-theme='dark']` flips palette tokens. + +The typed SSE client at `src/lib/api/client.ts` is unused by the scaffold's hello page but ships ready: any backend that returns `text/event-stream` from a POST endpoint can be consumed via `sendMessage(...)`. + +## Out of scope at the scaffold level + +- Persistence (real DB / queue / cache). +- Auth (OIDC, sessions, API keys). +- Rate limiting. +- Real production observability backend (Jaeger is the local dev choice; production typically uses a managed OTLP collector / vendor). + +Each of these earns a section in this document when the project decides on its concrete shape. diff --git a/docs/BOUNDARIES.md b/docs/BOUNDARIES.md new file mode 100644 index 0000000..5daf77c --- /dev/null +++ b/docs/BOUNDARIES.md @@ -0,0 +1,60 @@ +# Module boundaries + +The repo is layered — every Python module sits in exactly one layer, and layer flow is one-way. Reverse imports are a CI failure (`lint-imports` job). 
+ +## Layer diagram + +``` + ┌──────────────┐ ┌──────────────┐ + │ src.api │ │ src.eval │ request handlers + eval runner + │ /api/v1/* │ │ pytest eval │ + └──────┬───────┘ └──────┬───────┘ + │ │ + ▼ ▼ + ┌──────────────────────────────┐ + │ src.agent │ the LLM loop (tool-calling, CoT) + └──────────────┬───────────────┘ + │ + ▼ + ┌──────────────────────────────┐ + │ src.tools │ typed tool registry + implementations + └──────────────┬───────────────┘ + │ + ▼ + ┌──────────────────────────────┐ + │ src.data │ ingestion + queries (DB, files, …) + └──────────────┬───────────────┘ + │ + ▼ + ┌──────────────────────────────┐ + │ src.observability │ tracing / logging / spans + └──────────────┬───────────────┘ + │ + ▼ + ┌──────────────────────────────┐ + │ src.models │ Pydantic contracts (StrictModel) + │ (depends on nothing in src)│ + └──────────────────────────────┘ +``` + +## The contracts + +Defined in `pyproject.toml` `[tool.importlinter]`: + +1. **Layered contract** — `api | eval → agent → tools → data → observability → models`. + Modules at level *N* may import modules at level *N+1* and below. Anything else fails. + +2. **Forbidden contract** — `src.models` imports nothing from `src/`. + Models are leaf data; they neither know about the API surface nor reach back into observability. Keeping them isolated means schema bugs surface at construction, not via stack traces from deeper modules. + +## Adding a layer + +When the project grows a new layer (cache, queue, persistence-DTO mapper): + +1. Add the package under `src/`. +2. Add it to the `layers` list in `[tool.importlinter]` in the right position. +3. Add a `tests/test_.py` with at least the unhappy-path tests. +4. Update the diagram above. +5. Update `EXEMPT_WORKFLOWS` in `.github/scripts/check_required_contexts.py` only if the layer ships its own CI job that should NOT be required. 
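The layered contract described above can be sketched in plain Python. This is a toy model of the rule, not how `import-linter` is implemented; the helper names `layer_level` and `import_allowed` are hypothetical, and treating `src.api` and `src.eval` as unable to import each other is an assumption of this sketch.

```python
# Layer order from the diagram, highest first. Each inner tuple is one level.
LAYERS: tuple[tuple[str, ...], ...] = (
    ("src.api", "src.eval"),
    ("src.agent",),
    ("src.tools",),
    ("src.data",),
    ("src.observability",),
    ("src.models",),
)


def layer_level(module: str) -> int:
    """Return the level index of the layer that owns `module`."""
    for level, packages in enumerate(LAYERS):
        for package in packages:
            if module == package or module.startswith(package + "."):
                return level
    raise ValueError(f"{module} is not in a known layer")


def import_allowed(importer: str, imported: str) -> bool:
    """True when `importer` may import `imported` under the layered rule."""
    src, dst = layer_level(importer), layer_level(imported)
    if src == dst:
        # Same level: fine inside one package; sibling packages
        # (src.api vs src.eval) are kept independent in this sketch.
        return importer.split(".")[:2] == imported.split(".")[:2]
    # Otherwise only imports that point down the stack are allowed.
    return dst > src
```

Because `src.models` sits at the bottom, every import from it into another layer is rejected, which mirrors the forbidden contract.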
+ +The frontend (`frontend/`) is its own tree with its own quality gates (ESLint flat config + Prettier + tsc + Vitest); cross-tree imports are forbidden by build (Vite has no Python module resolver) and reviewed by hand. diff --git a/docs/DEVELOPMENT.md b/docs/DEVELOPMENT.md new file mode 100644 index 0000000..b42f235 --- /dev/null +++ b/docs/DEVELOPMENT.md @@ -0,0 +1,135 @@ +# Development + +## Prerequisites + +- Python **3.14** (see `pyproject.toml` `requires-python`) +- Node **24 LTS** (see `frontend/package.json` `engines.node`) +- [`uv`](https://docs.astral.sh/uv/) for Python deps + venv +- [`just`](https://github.com/casey/just) for the task runner (optional but recommended) +- Docker (for the local stack: app + frontend + Jaeger) + +## First-time setup + +```sh +git clone https://github.com/constk/harness-python-react.git +cd harness-python-react + +# Backend deps + venv +uv sync --extra dev + +# Pre-commit hooks (commit-msg + pre-commit stages) +uv run pre-commit install --hook-type pre-commit --hook-type commit-msg + +# Frontend deps +cd frontend && npm ci && cd .. +``` + +## Running the stack + +```sh +docker compose up +``` + +- Backend: +- Frontend (Vite dev server with HMR): +- Jaeger UI: + +For backend-only iteration without Docker: + +```sh +uv run uvicorn src.api.main:app --reload --port 8000 +``` + +For frontend-only iteration: + +```sh +cd frontend && npm run dev +``` + +## The justfile + +`just` (no args) lists every recipe. 
The most-used: + +| Recipe | What it runs | +|---|---| +| `just lint` | `ruff check .` + `ruff format --check .` | +| `just typecheck` | `mypy --strict src/ tests/` | +| `just test` | `pytest tests/ -m "not integration"` | +| `just architecture` | `lint-imports` | +| `just check` | `lint typecheck architecture test` (the pre-push gate) | +| `just frontend-check` | `npm run lint && format:check && check && test` | +| `just docker-build` | Builds `harness-python-react:dev` for sanity checks | + +Every recipe uses `uv run --frozen` — bare `uv run` silently re-resolves when `pyproject.toml` drifts from `uv.lock`; `--frozen` aborts loudly instead. + +## Branching + +``` + main ◄── release PR ◄── develop ◄── feat/123-short-name + ◄── fix/124-bug-name + ◄── chore/125-config-change +``` + +- `main` is protected: every required CI context must pass + 1 review + commit-type sync + branch-protection sync. +- `develop` is the integration branch; same gates as `main` minus a strictness flag (`strict: false` so PRs don't need rebases). +- Feature branches are short-lived and named `/-`. + +## Commit messages + +Seven allowed prefixes (enforced in three places — `[tool.commitizen]`, `pr-title.yml`, `check_commit_types.py`): + +- `feat:` — new capability +- `fix:` — bug fix +- `docs:` — documentation only +- `test:` — tests / eval-harness changes +- `refactor:` — internal change with no behaviour delta +- `chore:` — tooling, deps, infra +- `release:` — `develop → main` release PRs only + +Subject is **lowercase after the colon** (Title Case is rejected unless it's an all-caps initialism). + +## CI pipeline (`.github/workflows/`) + +| Workflow | Triggers | Required? 
| +|---|---|---| +| `ci.yml` | push/PR to develop+main | Yes — 8 backend + 2 frontend jobs | +| `security.yml` | push/PR + weekly schedule | Yes — 4 jobs (gitleaks, pip-audit, npm audit, trivy) | +| `pr-title.yml` | PR open/edit/sync | Yes — conventional-commit lint | +| `release.yml` | tag `v*.*.*` | No — tag-triggered | +| `release-drafter.yml` | push to main + PR label events | No | +| `branch-protection.yml` | weekly + push to .github/branch-protection/** | No | +| `artifact-cleanup.yml` | weekly | No | +| `eval-nightly.yml` | `workflow_dispatch` only by default | No | +| `codeql.yml` | `workflow_dispatch` only (placeholder) | No | + +## Local agent hook setup + +The `.claude/hooks/` scripts enforce the harness from the LLM-coder side: blocking `--no-verify`, scanning staged diffs for secrets, formatting after every Write/Edit. Opt in by copying the example settings file: + +```sh +cp .claude/settings.local.json.example .claude/settings.local.json +``` + +The example wires: + +- `PreToolUse:Bash` → `pretooluse_bash.py` (forbidden-flag blocker + secret scan + audit log) +- `PostToolUse:Write|Edit` → `posttooluse_writeedit.py` (ruff / prettier on touched files) +- `SessionStart:startup|resume` → `sessionstart.py` (injects current branch + `git status --short` as session context) + +`.claude/settings.local.json` and `.claude/bash-log.txt` are gitignored — your local config never ships. + +## Pre-commit setup + +`uv run pre-commit install --hook-type pre-commit --hook-type commit-msg` wires both stages. The hooks: + +1. Ruff (lint + format auto-fix) +2. Generic hygiene (YAML/TOML/JSON parse, merge conflicts, large files >500 KB, trailing whitespace, EOF, line endings) +3. Gitleaks (secret scan) +4. Commitizen (conventional-commit lint at `commit-msg`) +5. Local mypy `--strict` against the project's uv env + +`pre-commit run --all-files` runs the full suite against every file — the same job CI runs. 
+ +## Branch-protection sync setup + +The `branch-protection.yml` workflow needs a `BRANCH_PROTECTION_TOKEN` secret whose token has admin rights on this repo (for a fine-grained PAT, the repository `Administration` permission set to read and write). The default `GITHUB_TOKEN` cannot edit branch protection on the repo it runs in. Create a fine-grained PAT scoped to this repo only. diff --git a/docs/EVAL_HARNESS.md b/docs/EVAL_HARNESS.md new file mode 100644 index 0000000..ec115b1 --- /dev/null +++ b/docs/EVAL_HARNESS.md @@ -0,0 +1,106 @@ +# The eval harness + +LLM-driven systems regress in ways unit tests don't catch: the prompt drifts, the tool schema changes upstream, a model upgrade subtly changes behaviour. The eval harness is the regression net — golden cases that exercise the agent end-to-end and report accuracy by category and difficulty. + +## Layout + +``` +src/eval/ +├── models.py # EvalCase, EvalResult (Pydantic) +├── runner.py # EvalRunner — generic, takes a Callable[[str], str] +├── judge.py # LLMClient Protocol + semantic-similarity judge +├── report.py # Markdown report generator +└── __main__.py # python -m src.eval + +eval/ +├── golden_qa.json # The dataset (one trivial example case ships) +└── test_golden_qa.py # Parametrised pytest runner +``` + +## How it works + +1. The runner loads `eval/golden_qa.json` into a list of `EvalCase`s. +2. For each case, it calls the configured `answer_fn(question) -> str`. +3. It compares the actual answer to the expected one using one of three tolerance modes: + - **`exact_match`** — normalised string equality (lowercased, whitespace-collapsed). + - **`numeric_close`** — extracts numbers from both sides; passes if any extracted number is within 1 % of the expected. Filters year-like values (2020-2029) so a question about a year doesn't accidentally provide the comparison target. + - **`semantic_similar`** — calls an LLM judge (`src/eval/judge.py`) that scores 0.0–1.0; passes at ≥ 0.8. +4. It returns a list of `EvalResult`s; `src/eval/report.py` produces a markdown summary. 
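The two tolerance modes that need no LLM can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `src/eval/runner.py`: the helper names, the number-extraction regex, and the choice to compare every extracted pair are all hypothetical.

```python
import re

# A number that does not start in the middle of a word or another number,
# so "2020-2029" yields 2020 and 2029 rather than 2020 and -2029.
_NUMBER_RE = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?")


def normalise(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_match(actual: str, expected: str) -> bool:
    """Normalised string equality."""
    return normalise(actual) == normalise(expected)


def extract_numbers(text: str) -> list[float]:
    """Pull numbers out of free text, skipping year-like values 2020-2029."""
    values = [float(m) for m in _NUMBER_RE.findall(text)]
    return [v for v in values if not (v.is_integer() and 2020 <= v <= 2029)]


def numeric_close(actual: str, expected: str, rel_tol: float = 0.01) -> bool:
    """Pass if any number in `actual` is within `rel_tol` of one in `expected`."""
    targets = extract_numbers(expected)
    candidates = extract_numbers(actual)
    return any(
        abs(c - t) <= rel_tol * abs(t)
        for t in targets
        for c in candidates
    )
```

`semantic_similar` is left out because it depends on an LLM judge; in this sketch an answer with no usable numbers simply fails `numeric_close`.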
+ +## Wiring your agent + +The runner doesn't know about your agent loop. Pass any `Callable[[str], str]`: + +```python +from src.eval.runner import EvalRunner + +def my_agent(question: str) -> str: + # Hit your agent loop / LLM client here. + return ... + +runner = EvalRunner(answer_fn=my_agent) +results = runner.evaluate_all() +``` + +For the LLM judge (`semantic_similar` cases), implement the `LLMClient` Protocol from `src/eval/judge.py`: + +```python +class MyLLMAdapter: + def complete_json(self, *, model: str, prompt: str) -> str: + # Hit your provider, return raw JSON body. + ... + +runner = EvalRunner( + answer_fn=my_agent, + judge_client=MyLLMAdapter(), + judge_model="gpt-4o-mini", +) +``` + +If `judge_client=None` (default), `semantic_similar` cases pass with `score=None` and reason `"no LLM client configured"` — inconclusive, not a failure. That keeps the harness usable without LLM credentials. + +## Adding a case + +```json +{ + "id": "unique-kebab-id", + "question": "...", + "category": "category-name", + "expected_answer": "...", + "tolerance": "exact_match" | "numeric_close" | "semantic_similar", + "difficulty": "easy" | "medium" | "hard", + "notes": "Why this case earns a slot." +} +``` + +`category` and `difficulty` default to `"general"` and `"easy"`; explicit values are recommended once you have more than a handful of cases so the report breaks down meaningfully. + +## Running the harness + +Locally: + +```sh +uv run pytest eval/ # pytest runner with the marker +python -m src.eval # CLI runner — prints the markdown report +``` + +The pytest invocation is marked `@pytest.mark.eval`, so the default `pytest tests/` skips it. + +## Nightly opt-in + +`.github/workflows/eval-nightly.yml` ships `workflow_dispatch`-only by default to avoid accidental LLM API spend. To turn on a real nightly: + +1. Add the LLM secrets in repo settings: `LLM_API_KEY` (required), `LLM_PROVIDER`, `LLM_BASE_URL`, `LLM_MODEL` (optional, depending on adapter). + +2. 
Replace the workflow's `on:` block with: + + ```yaml + on: + schedule: + - cron: "0 6 * * *" # daily 06:00 UTC + workflow_dispatch: + ``` + +3. Confirm `eval-nightly.yml` is still in `EXEMPT_WORKFLOWS` in `.github/scripts/check_required_contexts.py` (it should be — scheduled runs never gate PRs). + +That's the full opt-in. Reverting is a one-line change back to `workflow_dispatch:` only. diff --git a/docs/HARNESS.md b/docs/HARNESS.md new file mode 100644 index 0000000..0a8ecc6 --- /dev/null +++ b/docs/HARNESS.md @@ -0,0 +1,40 @@ +# The harness + +The "harness" is the set of mechanical controls that make LLM-driven coding produce production-grade output regardless of which agent or contributor is at the keyboard. This document is the umbrella — every other doc in `docs/` is a layer of it. + +## What's in the harness + +| Layer | What it enforces | Where it lives | +|---|---|---| +| **Lint** | Style, simple bugs, security smells | `pyproject.toml` `[tool.ruff]` (E W F I N UP B SIM TCH S RUF), `.pre-commit-config.yaml` | +| **Format** | Consistent line shape | `ruff format`, `prettier` (frontend) | +| **Type check** | No untyped code | `pyproject.toml` `[tool.mypy]` `strict = true`; `tsc --noEmit` for the frontend | +| **Architecture** | One-way layer flow | `pyproject.toml` `[tool.importlinter]` + `docs/BOUNDARIES.md` | +| **Tests** | Behaviour | `pytest tests/`, `pytest eval/`, `vitest` | +| **Coverage** | ≥ 75% on `src/` | `pyproject.toml` `[tool.coverage.report]` | +| **Pre-commit** | Local-first defence | `.pre-commit-config.yaml` (ruff, gitleaks, commitizen, mypy, hygiene) | +| **CI** | Non-bypassable | `.github/workflows/ci.yml` (15 contexts) + `security.yml` + `pr-title.yml` + `release.yml` + `release-drafter.yml` | +| **Branch protection** | Declarative, drift-checked | `.github/branch-protection/{develop,main}.json` + `branch-protection.yml` apply workflow + `check_required_contexts.py` meta-gate | +| **Commit format** | Seven prefixes only | 
`[tool.commitizen]` schema + `pr-title.yml` allowlist + `check_commit_types.py` meta-gate | +| **Secret scan** | Three checkpoints | local hook → pre-commit → `security.yml` gitleaks | +| **Container scan** | HIGH/CRITICAL CVEs block | `security.yml` trivy-action | +| **Dep scan** | Pinned + audited | pip-audit, npm audit | +| **Release** | Reproducible artefacts | `release.yml` (image push to GHCR + CycloneDX SBOM) | +| **Eval** | LLM-output regressions | `src/eval/`, `eval/`, `eval-nightly.yml` (workflow_dispatch by default) | +| **Agent hooks** | LLM coder side enforcement | `.claude/hooks/{pretooluse_bash, posttooluse_writeedit, sessionstart}.py` + `settings.local.json.example` | +| **Skills** | Auto-activated agent guidance | `.claude/skills/{architect, code-reviewer, devops, frontend, qa-engineer, technical-writer}` | + +## Why "harness" + +Each layer above catches something specific. None catches everything. Stacked, they form a defence-in-depth where the cost of a mistake is bounded by the highest-fidelity layer it slips past. + +The harness is independent of the project's domain. The included `src/api/echo` and `eval/golden_qa.json` are scaffolding so every gate has something to operate on. Once you replace them with your real domain, the same harness keeps the new code under the same posture. + +## Reading order + +1. **`docs/INVARIANTS.md`** — the load-bearing rules. Every PR is checked against them. +2. **`docs/BOUNDARIES.md`** — module layering and the import-linter contracts. +3. **`docs/DEVELOPMENT.md`** — local setup, the `justfile`, the CI pipeline. +4. **`docs/EVAL_HARNESS.md`** — the eval flywheel; how to add a case, how to opt the nightly into running. +5. **`docs/SECURITY.md`** — threat model + the defence-in-depth map. +6. **`docs/ARCHITECTURE.md`** — scaffold-level diagram; expand as your domain lands. 
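
The meta-gates named in the table above (`check_required_contexts.py`, `check_commit_types.py`) each come down to one idea: two lists that must stay identical are compared, and any difference fails the build. The sketch below shows that comparison for the commit-type case. It is an illustration only, not the shipped script. The function names are hypothetical, the inputs are passed in directly rather than parsed from `pyproject.toml` and `pr-title.yml`, and the sample prefixes are just the ones visible in this patch, not the full set of seven.

```python
def allowlist_drift(commitizen_types: list[str], pr_title_types: list[str]) -> list[str]:
    """Describe every entry that appears in only one of the two allowlists."""
    cz, pr = set(commitizen_types), set(pr_title_types)
    problems = [f"only in commitizen: {name}" for name in sorted(cz - pr)]
    problems += [f"only in pr-title: {name}" for name in sorted(pr - cz)]
    return problems


def gate_exit_code(commitizen_types: list[str], pr_title_types: list[str]) -> int:
    """Return 0 when the lists agree and 1 otherwise, printing each difference."""
    problems = allowlist_drift(commitizen_types, pr_title_types)
    for line in problems:
        print(line)
    return 1 if problems else 0


if __name__ == "__main__":
    # Hypothetical inputs: "test" was added on one side but not the other.
    code = gate_exit_code(
        ["feat", "chore", "docs", "test"],
        ["feat", "chore", "docs"],
    )
    raise SystemExit(code)
```

Reporting both directions of the difference matters: a prefix that commitizen accepts but the PR-title lint rejects is as much drift as the reverse, and a one-sided check would miss half the cases.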
diff --git a/docs/INVARIANTS.md b/docs/INVARIANTS.md new file mode 100644 index 0000000..093305b --- /dev/null +++ b/docs/INVARIANTS.md @@ -0,0 +1,50 @@ +# Invariants + +The numbered rules below are load-bearing. Every PR is checked against them; CI fails the build when one is violated. Add project-specific invariants in slots 6+ as your domain accretes. + +## 1. Every contract crossing a module or process boundary is a `StrictModel` + +Pydantic with `extra="forbid"` raises on unknown keys at construction. That kills the silent-key class of bug at the seam instead of three calls deep. + +- **Where:** `src/models/_base.py` +- **Enforced by:** `tests/test_models.py` (asserts `extra="forbid"`); review. + +## 2. API endpoints live under `/api/v1/` and return typed responses + +A versioned prefix means future breaking changes ship at `/api/v2/` without coordinated client deploys. Typed responses mean an OpenAPI schema is correct by construction. + +- **Where:** `src/api/routes.py` +- **Enforced by:** route review; FastAPI's response model inference. + +## 3. Layer flow is one-way + +`api | eval` → `agent` → `tools` → `data` → `observability` → `models`. `src.models` imports nothing from `src/`. A reverse import collapses the layer story. + +- **Where:** `pyproject.toml` `[tool.importlinter]` +- **Enforced by:** `lint-imports` job in CI; `just architecture` locally. + +## 4. Coverage ≥ 75% on `src/` + +Below 75 % the test suite stops being a meaningful gate; above ~90 % every PR slows down on coverage paperwork. 75 % is the load-bearing floor. + +- **Where:** `pyproject.toml` `[tool.coverage.report] fail_under = 75` +- **Enforced by:** `Coverage` job in CI. + +## 5. No secret leaves the repo unscanned + +Three independent checkpoints (PreToolUse hook → pre-commit gitleaks → CI gitleaks) catch staged secrets before push, force-push, or merge. Once a secret is in the remote it is compromised forever. 
+ +- **Where:** `.claude/hooks/pretooluse_bash.py`, `.pre-commit-config.yaml`, `.github/workflows/security.yml` +- **Enforced by:** the three layers above. + +--- + +## 6+. Project-specific invariants + +Add invariants below as your domain stabilises. Each entry should describe: + +- The rule, in one sentence. +- *Where* it lives (module or config file path). +- *Enforced by:* test, review, or specific CI job. + +Examples of the kind of invariant that earns a slot here: a domain-specific data contract that must validate at ingestion, a security boundary that must not log PII, a tool-call protocol that the agent must follow before the LLM emits a final response. diff --git a/docs/SECURITY.md b/docs/SECURITY.md new file mode 100644 index 0000000..416d7f4 --- /dev/null +++ b/docs/SECURITY.md @@ -0,0 +1,73 @@ +# Security posture + +The template targets the OWASP top-10 surface for a small Python+React service plus the LLM-specific risks that come with hosting an agent. Each layer below is independent — defence in depth, not a single chokepoint. 
+ +## Threat model (scaffold) + +| Threat | Where it can land | Defence | +|---|---|---| +| Secret in repo (AWS key, OpenAI key, PEM) | Commit | (1) `.claude/hooks/pretooluse_bash.py` scans staged diff; (2) pre-commit gitleaks; (3) CI gitleaks | +| Vulnerable Python dep | `uv.lock` | `pip-audit --strict` (security.yml) — fails on any CVE; per-CVE ignore list at `.github/security/pip-audit-ignore.txt` with sunset notes | +| Vulnerable npm dep | `package-lock.json` | `npm audit --audit-level=high` (security.yml) | +| Vulnerable container CVE | Built image | Trivy scan in security.yml — blocks merge on fixable HIGH/CRITICAL | +| Agent prompt injection | `/api/v1/...` body | Output sanitisation: render LLM responses as plain text or pre-formatted blocks; never `dangerouslySetInnerHTML` | +| API contract drift | Pydantic models | `StrictModel` (`extra="forbid"`) raises at construction — typos and renamed fields fail at the seam | +| Required-check drift | `.github/branch-protection/*.json` | `Branch-protection contexts sync` meta-gate fails CI when JSON contexts disagree with workflow jobs on disk | +| Commit-type drift | Commitizen ↔ pr-title.yml | `Commit-type sync` meta-gate compares the two allowlists | +| Released image tampering | GHCR | `release.yml` ships a CycloneDX SBOM attached to the GitHub Release; image is built once per tag with reproducible deps via `uv sync --frozen --no-dev` | +| Force-push to main | Default access | Branch protection: `allow_force_pushes: false`, `allow_deletions: false`, `require_code_owner_reviews: true`, required status checks | + +## Defence-in-depth map + +``` +LLM coder edits ──► PreToolUse hook (forbidden flags + secret scan + audit log) + │ + ▼ +Local commit ──► pre-commit (ruff, gitleaks, commitizen, mypy, hygiene) + │ + ▼ +git push ──► CI: + • Lint & Format (ruff) + • Type Check (mypy --strict) + • Architecture (import-linter) + • Unit tests + Coverage ≥ 75 % + • Pre-commit (re-run, no-bypass) + • Frontend Build + Frontend 
Quality + • Branch-protection contexts sync + • Commit-type sync + • Lint PR title (conventional commits) + • Secret scan (gitleaks) + • Python deps (pip-audit --strict) + • Frontend deps (npm audit --audit-level=high) + • Container image scan (trivy) + │ + ▼ +PR review ──► Code owner approval (CODEOWNERS) + │ + ▼ +Merge to develop ──► develop branch protection (15 required contexts, strict: false) + │ + ▼ +Release PR ──► develop → main; main branch protection (15 required, strict: true) + │ + ▼ +Tag v*.*.* ──► release.yml: build image, push to ghcr.io, generate SBOM, publish Release +``` + +## Container hardening + +`Dockerfile` ships a multi-stage build: + +- **Builder** — runs `uv sync --frozen --no-dev`. Has uv, pip cache, build tools. +- **Runtime** — `python:3.14-slim`, copies only `.venv` + `src/` from the builder, runs as non-root user `app`. No uv, no pip cache, no build tools, no dev deps. + +Healthcheck uses stdlib `urllib.request` so curl isn't in the image. + +## What's intentionally out of scope (scaffold) + +- **WAF / DDoS** — deployment-environment concerns, not template concerns. +- **Authentication** — the scaffold ships no auth; the right layer (OIDC, mTLS, API keys, sessions) is project-specific. +- **Secret manager integration** — `Settings` reads from env / `.env`. A real deployment should fetch `LLM_API_KEY` from a vault and inject it as env, but the wiring is environment-specific. +- **Rate limiting** — same — depends on infrastructure. + +Each of these is a slot to fill once your domain is decided. The harness doesn't try to pretend any of them exist out of the box. diff --git a/docs/TASKS.md b/docs/TASKS.md new file mode 100644 index 0000000..6f69fe7 --- /dev/null +++ b/docs/TASKS.md @@ -0,0 +1,93 @@ +# TASKS + +Source of truth for the harness extraction. Cross-referenced with the [issues](https://github.com/constk/harness-python-react/issues) and the project board. 
+ +## Status legend + +- ✅ Merged +- 🔄 In progress +- 📋 Backlog + +## Phases + +### Phase 0 — Bootstrap + +| # | Title | Size | Priority | Status | +|---|---|---|---|---| +| 1 | `chore: bootstrap repo (pyproject, uv lockfile, Python 3.14, MIT license)` | M | Critical | ✅ | + +### Phase 1 — Configuration + +| # | Title | Size | Priority | Depends | Status | +|---|---|---|---|---|---| +| 2 | `chore: ruff + mypy strict + import-linter + commitizen config` | M | Critical | #1 | ✅ | +| 3 | `chore: pre-commit hook stack (ruff, gitleaks, commitizen, mypy, hygiene)` | S | Critical | #2 | ✅ | +| 4 | `chore: justfile recipes (lint, typecheck, test, architecture, check, frontend-check)` | S | High | #2 | ✅ | +| 5 | `chore: .gitignore, .editorconfig, .dockerignore` | S | High | #1 | ✅ | +| 6 | `chore: Dockerfile (multi-stage, Python 3.14, non-root, healthcheck)` | M | High | #1 | ✅ | +| 7 | `chore: docker-compose.yml (app + frontend + jaeger)` | S | High | #6 | ✅ | +| 8 | `chore: GitHub issue + PR templates + CODEOWNERS` | S | High | #1 | ✅ | + +### Phase 2 — Claude harness + +| # | Title | Size | Priority | Depends | Status | +|---|---|---|---|---|---| +| 15 | `chore: .claude hooks (pretooluse_bash, posttooluse_writeedit, sessionstart) + settings.local.json.example` | M | High | #1 | ✅ | +| 16 | `chore: port portable .claude/skills` | S | Medium | #15 | ✅ | + +### Phase 3 — CI + repo policy + +| # | Title | Size | Priority | Depends | Status | +|---|---|---|---|---|---| +| 9 | `chore: CI workflow (lint, typecheck, tests, coverage ≥75%, architecture)` | L | Critical | #2, #3 | ✅ | +| 10 | `chore: CI meta-gates (branch-protection contexts sync, commit-types sync)` | M | High | #9 | ✅ | +| 11 | `chore: security workflow (gitleaks, pip-audit, npm audit, trivy)` | M | High | #9 | ✅ | +| 12 | `chore: PR title lint + release-drafter` | S | High | #2 | ✅ | +| 13 | `chore: release workflow (tag-triggered, SBOM, GH Release publish)` | M | Medium | #6, #12 | ✅ | +| 14 | `chore: 
branch-protection JSON + apply workflow + artifact cleanup + CodeQL` | M | High | #9, #10 | ✅ | + +### Phase 4 — Backend scaffold + +| # | Title | Size | Priority | Depends | Status | +|---|---|---|---|---|---| +| 17 | `feat: backend scaffold (FastAPI app, /api/v1/health, /api/v1/echo, sessions)` | M | Critical | #2 | ✅ | +| 18 | `feat: Pydantic StrictModel base + example schemas (health, session, config)` | S | Critical | #17 | ✅ | +| 19 | `feat: observability setup (OTel SDK, OTLP exporter, structured logging, span helpers)` | M | High | #17 | ✅ | +| 20 | `feat: tool-registry pattern + example echo_tool (demonstrates layering)` | S | Medium | #18 | ✅ | + +### Phase 5 — Frontend scaffold + +| # | Title | Size | Priority | Depends | Status | +|---|---|---|---|---|---| +| 21 | `feat: frontend scaffold (Vite + React 19.2 + TS strict, eslint flat + prettier + vitest)` | M | Critical | #1 | ✅ | +| 22 | `feat: typed SSE client primitive (port from Teller, .ts)` | S | High | #21 | ✅ | +| 23 | `feat: hello page hitting /api/v1/health + CSS-variable palette + sample component test` | S | Medium | #21, #22 | ✅ | + +### Phase 6 — Eval scaffolding + +| # | Title | Size | Priority | Depends | Status | +|---|---|---|---|---|---| +| 24 | `feat: eval harness scaffold (runner, judge, report, models) + 1 example golden case + nightly workflow_dispatch` | M | High | #18, #19 | ✅ | + +### Phase 7 — Documentation + +| # | Title | Size | Priority | Depends | Status | +|---|---|---|---|---|---| +| 25 | `docs: HARNESS, INVARIANTS, BOUNDARIES, DEVELOPMENT, EVAL_HARNESS, SECURITY, ARCHITECTURE skeletons` | L | High | #2, #9, #15 | ✅ | +| 26 | `docs: README, CONTRIBUTING, CLAUDE.md, CHANGELOG seed, TASKS.md` | M | High | #25 | ✅ | + +### Phase 8 — Verification + +| # | Title | Size | Priority | Depends | Status | +|---|---|---|---|---|---| +| 27 | `test: end-to-end smoke (docker compose up, sample PR runs every gate green, eval workflow_dispatch)` | M | Critical | all above | 📋 | + +### 
Phase 9 — Publish + +| # | Title | Size | Priority | Depends | Status | +|---|---|---|---|---|---| +| 28 | `chore: publish — flip public, mark template, topics, portfolio polish (badges, screenshots)` | S | Medium | #27 | 📋 | + +## Critical-path summary + +`#1 → #2 → #9 → (#10 + #11 + #12 + #13 + #14)` clears CI; `#17–#20` clears backend; `#21–#23` clears frontend; `#24` adds eval; `#25–#26` adds docs; `#27` smokes the whole thing; `#28` publishes.