
test(authoring): RAG ablation for SNAP — corpus drives criteria but not output #104

Merged
danielnaab merged 1 commit into main from experiment/rag-ablation-snap on Apr 20, 2026

Conversation

@danielnaab
Member

Controlled ablation answering your earlier question: is the RAG actually helping the SNAP authoring scenario?

Result

| Measurement     | With RAG (21 chunks) | Without RAG (empty corpus) | Delta   |
|-----------------|----------------------|----------------------------|---------|
| Criteria        | 22                   | 0                          | −22     |
| Pages           | 8                    | 8                          | 0       |
| Fields          | 87                   | 81                         | −6      |
| Field recall    | 7.1%                 | 7.1%                       | 0       |
| Field precision | 6.9%                 | 7.4%                       | +0.5 pp |
| Type accuracy   | 100%                 | 100%                       | 0       |

The corpus drives the criteria stage strongly (22 vs 0) but has no measurable effect on the final form. Precision is even marginally better without it.

Why

  1. buildStructurePrompt has the 8 SNAP page titles hard-coded into it (a rough sketch of that shape follows this list). Without corpus, the model still produces the prescribed structure.
  2. Sonnet's parametric knowledge covers what a standard SNAP form needs. The corpus tells it what it already knows.
  3. The 7.1% ceiling is a measurement artifact: deterministic field-matching penalises naming (`first_name` vs `firstName`) and grouping differences (model combines Earned + Unearned Income; ground truth splits them). The model is producing correct fields that get scored as misses.
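
To make point 1 concrete, here is a sketch of the shape of that hard-coding. This is illustrative only, not the actual `buildStructurePrompt` source; the only fact taken from the pipeline is that the page titles live in the prompt text, so an empty corpus changes nothing at this stage.

```ts
// Hypothetical sketch, not the real buildStructurePrompt — it only
// illustrates why an empty corpus doesn't change the structure stage:
// the SNAP page titles are part of the prompt text itself.
const SNAP_PAGE_TITLES: string[] = [
  "Household Composition", // one title from the ground truth; the rest are omitted here
  // ...the remaining hard-coded SNAP page titles
];

function buildStructurePromptSketch(corpusChunks: string[]): string {
  const pages = SNAP_PAGE_TITLES.map((title, i) => `${i + 1}. ${title}`).join("\n");
  const context =
    corpusChunks.length > 0
      ? `\n\nRelevant policy excerpts:\n${corpusChunks.join("\n---\n")}`
      : ""; // with an empty corpus the model still receives the full page list
  return `Design a SNAP application form with these pages:\n${pages}${context}`;
}
```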

What this does and doesn't say

  • Doesn't say: RAG is useless. The `sonnet-with-rag` extraction variant raised sensitivity-label accuracy from 27% to 53% across the extraction suite — that's a real measurable win in a different task.
  • Doesn't say: criteria are worthless. Zero criteria means no citations for auditability, which matters for the compliance-review UX even when it doesn't move a field-match metric.
  • Does say: for SNAP authoring output as currently wired, RAG is inert. The prompts carry the SNAP knowledge; the retrieval step is pedagogical scaffolding.

What would fix it

  • LLM-as-judge scorer (see the sketch after this list). The 7.1% cap is the scorer, not the pipeline. Both variants likely deserve 50–70% recall on a semantic judge.
  • Per-stage retrieval scoping. Section generation currently gets the full corpus; if it got only the chunks relevant to "Household Composition", the fine-grained regulatory detail could actually show through.
  • A harder fixture. A state waiver form or a rare supplemental benefit the model hasn't seen in pre-training would separate parametric knowledge from retrieved context more cleanly.
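
For the first bullet, a minimal sketch of what an LLM-as-judge field scorer could look like. The prompt, types, and injected `callModel` function are all assumptions, not code that exists in this repo.

```ts
// Hypothetical sketch of a semantic field scorer. Nothing here is part
// of this PR; it only shows the shape of an LLM-as-judge comparison.
interface FieldSpec {
  name: string;
  label: string;
  type: string;
}

interface FieldMatchVerdict {
  matches: boolean;
  reason: string;
}

async function judgeFieldMatch(
  generated: FieldSpec,
  groundTruth: FieldSpec,
  callModel: (prompt: string) => Promise<string>, // e.g. a thin model-client wrapper
): Promise<FieldMatchVerdict> {
  const prompt = [
    "Do these two form fields collect the same information?",
    "Ignore naming style (first_name vs firstName) and page/section grouping.",
    `Generated field: ${JSON.stringify(generated)}`,
    `Ground-truth field: ${JSON.stringify(groundTruth)}`,
    'Reply with JSON only: {"matches": true|false, "reason": "..."}',
  ].join("\n");
  return JSON.parse(await callModel(prompt)) as FieldMatchVerdict;
}
```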

Changes

  • src/entrypoints/cli/commands/evaluate-authoring.ts — adds an optional `useCorpus` flag on `VariantConfig` (defaults to `true`; sketched after this list). Registers `no-rag-sonnet` as the ablation variant.
  • catalog/experiments/authoring-pipeline/rag-ablation-snap.md — full writeup.
  • notes/snap-rag-ablation/{all-sonnet,no-rag-sonnet}.json — raw pipeline outputs for both runs.
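
The CLI wiring is roughly the following. Only the optional `useCorpus` flag, its `true` default, and the empty `PolicyChunk[]` behaviour are from this PR; the surrounding shape and names are illustrative.

```ts
// Sketch of the variant wiring. Only the optional useCorpus flag, its
// default of true, and the empty-corpus behaviour are from this PR;
// the surrounding shape and names are illustrative.
type PolicyChunk = { id: string; text: string }; // stand-in for the real type

interface VariantConfig {
  name: string;
  useCorpus?: boolean; // defaults to true
}

const variants: VariantConfig[] = [
  { name: "all-sonnet" },                      // existing behaviour, full corpus
  { name: "no-rag-sonnet", useCorpus: false }, // ablation variant added here
];

function corpusForVariant(variant: VariantConfig, corpus: PolicyChunk[]): PolicyChunk[] {
  const useCorpus = variant.useCorpus ?? true;
  console.log(`corpus ${useCorpus ? "enabled" : "disabled"} for variant ${variant.name}`);
  return useCorpus ? corpus : []; // empty PolicyChunk[] is passed to every stage
}
```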

Testing

  • `bun run check` — 1332 tests pass.
  • Both variants executed live against Bedrock today; artefacts committed.

test(authoring): RAG ablation for SNAP — corpus drives criteria but not output

Controlled test answering "is the RAG actually helping?" for the
SNAP Wisconsin authoring pipeline. Adds a `no-rag-sonnet` variant
that passes an empty PolicyChunk[] to every stage, and runs both it
and `all-sonnet` against the SNAP ground truth.

Result:

| Measurement      | With RAG | Without RAG |
|------------------|----------|-------------|
| Criteria         | 22       | 0           |
| Fields generated | 87       | 81          |
| Field recall     | 7.1%     | 7.1%        |
| Field precision  | 6.9%     | 7.4%        |
| Type accuracy    | 100%     | 100%        |

The corpus drives the criteria stage strongly (22 vs 0) but has no
measurable effect on the final form. Precision is marginally better
without the corpus. This is because:

1. buildStructurePrompt has SNAP-specific page titles baked in.
2. Sonnet's parametric knowledge of SNAP covers what an application
   form needs without external grounding.
3. The field-level scorer penalises naming and grouping differences
   rather than semantic fit — both variants are capped at 7.1%.

RAG still has a legitimate role here (citations in the criteria
review UI, auditability) and a demonstrated win on a different task
(sensitivity labelling in extraction, 27% to 53% on the suite). But
for SNAP authoring output quality, as currently wired, it's inert.

Writeup in catalog/experiments/authoring-pipeline/rag-ablation-snap.md.
Raw outputs in notes/snap-rag-ablation/.

CLI change is minimal — a `useCorpus?: boolean` flag on VariantConfig
that defaults to true, plus a one-line log showing which mode the run
is in.
@danielnaab danielnaab merged commit 6e97d5c into main Apr 20, 2026
4 checks passed
@danielnaab danielnaab deleted the experiment/rag-ablation-snap branch April 20, 2026 18:03