Nine features close the investigation loop:
- plan-evidence: per-card evidence-gap diagnosis with Student-t power
analysis; ranks the cheapest grade-moving interventions
- Criterion dossiers: cumulative evidence per (model, criterion) across
runs; grade transitions, same-provenance contradiction tracking
- calibrate: plants synthetic ground truth (causal features, correlated
decoys, noise), runs the real pipeline blind, reports grade calibration
with Wilson CIs; current machinery: precision@k=1.0, decoy resistance=1.0
- quant-diff: reports which validated features quantization broke; preset
workflow + docs for FP16-vs-quant feature audits
- Steering artifacts: export-steering (provenance-gated) + apply-steering
- migrate-report: re-score pre-2.3 reports under current semantics
- GGUF bridge: export-gguf-records (llama.cpp final-layer embeddings) +
convert-hidden-dump (any-runtime multi-layer dump converter)
- Hypothesis invariant suite: association-only inputs can never produce
causal-labeled outputs, as a generative property across all surfaces
- MCP server: 19 tools (was 10) covering the whole loop; Claude Code
skill; AGENTS.md/README/COMMANDS updated
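
For readers unfamiliar with the Wilson CIs that calibrate reports, here is a
minimal sketch of the standard Wilson score interval for a binomial
proportion. It is an illustration only, not the project's code: the function
name wilson_interval and the default z=1.96 (roughly 95%) are assumptions.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z=1.96 corresponds to ~95% coverage.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")

    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    # Centre is pulled from p_hat toward 0.5, more strongly for small samples.
    centre = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

Unlike the normal-approximation interval, the Wilson interval does not
collapse to zero width at an observed proportion of 1.0: with 10 of 10
successes the lower bound is roughly 0.72, so a perfect score on few planted
features still carries visible uncertainty.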
Fixed (found by the invariant suite): contradicted/contradicted_effect now
require intervention provenance; association-only opposite pairs grade
needs_causal_evidence with reason opposite_associations_lack_intervention_provenance.
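
The fixed rule can be sketched as below. This is a hedged illustration, not
the shipped implementation: the function name, the provenance labels
"intervention" and "association", and the handling of mixed pairs (at least
one intervention side is treated as sufficient) are all assumptions. Only the
grade and reason strings come from this change.

```python
from typing import Optional

# Hypothetical provenance labels, chosen for illustration only.
INTERVENTION = "intervention"
ASSOCIATION = "association"


def grade_opposite_pair(provenance_a: str, provenance_b: str) -> tuple[str, Optional[str]]:
    """Grade two findings whose effects on one criterion point in opposite directions.

    Returns (grade, reason). Assumption: a pair counts as a contradiction as
    soon as either side carries intervention provenance.
    """
    if INTERVENTION in (provenance_a, provenance_b):
        return "contradicted", None
    # Association-only pairs cannot support a causal-labeled grade.
    return "needs_causal_evidence", "opposite_associations_lack_intervention_provenance"
```

Under these assumptions, grade_opposite_pair("association", "association")
yields the needs_causal_evidence grade with the reason above, and the
contradicted grade is reachable only when intervention provenance is present.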
Breaking: 2.3.0 legacy next-action keys removed from emitted payloads
(canonical {id, title, command?+argv?, instruction?, requires?} only).
Suite: 363 -> 500 tests.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>