Skip to content

v3.1.0

@asystemoffields asystemoffields tagged this 10 Jun 18:37
Rebuild the natural-language criterion front door around "generate big,
score tiny, verify everything":

- compile-criterion: criterion -> verified prompt dataset + preset, with
  agent (two-phase generation-request flow), llamacpp, and heuristic
  generators; NLI margin gate (pos>=0.7 / neg<=0.3 / >=8 per side,
  recorded exclusions, balance trim) plus real assay validation.
- score-prompts: score any dataset against a criterion hypothesis with a
  tiny zero-shot NLI cross-encoder (new [criteria] extra, ~70M, nothing
  bundled); exact ScoredPrompt JSONL with per-row criterion_score_source.
- Honest degradation: hash-cosine fallback labeled weak everywhere,
  margin gate downgrades to advisory with explicit warnings.
- 2 new MCP tools (21 total): compile_criterion, score_prompts.
- AGENTS.md opens with "operationalize the criterion first".

539 tests (up from 500), all heavy paths stubbed behind factory seams.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Assets 2
Loading