Rebuild the natural-language criterion front door around "generate big,
score tiny, verify everything":
- compile-criterion: criterion -> verified prompt dataset + preset, with
agent (two-phase generation-request flow), llamacpp, and heuristic
generators; NLI margin gate (pos>=0.7 / neg<=0.3 / >=8 per side,
recorded exclusions, balance trim) plus real assay validation.
- score-prompts: score any dataset against a criterion hypothesis with a
tiny zero-shot NLI cross-encoder (new [criteria] extra, ~70M, nothing
bundled); exact ScoredPrompt JSONL with per-row criterion_score_source.
- Honest degradation: hash-cosine fallback labeled weak everywhere,
margin gate downgrades to advisory with explicit warnings.
- 2 new MCP tools (21 total): compile_criterion, score_prompts.
- AGENTS.md opens with "operationalize the criterion first".
539 tests (up from 500), all heavy paths stubbed behind factory seams.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>