
test(authoring): RAG ablation for SNAP — corpus drives criteria but not output #104

Merged
danielnaab merged 1 commit into main from experiment/rag-ablation-snap on Apr 20, 2026

Conversation

@danielnaab
Member

Controlled ablation answering your earlier question: is the RAG actually helping the SNAP authoring scenario?

Result

| Measurement     | With RAG (21 chunks) | Without RAG (empty corpus) | Delta   |
|-----------------|----------------------|----------------------------|---------|
| Criteria        | 22                   | 0                          | −22     |
| Pages           | 8                    | 8                          | 0       |
| Fields          | 87                   | 81                         | −6      |
| Field recall    | 7.1%                 | 7.1%                       | 0       |
| Field precision | 6.9%                 | 7.4%                       | +0.5 pp |
| Type accuracy   | 100%                 | 100%                       | 0       |

The corpus drives the criteria stage strongly (22 vs 0) but has no measurable effect on the final form. Precision is even marginally better without it.

Why

  1. buildStructurePrompt has the 8 SNAP page titles hard-coded into it (a rough sketch of that shape follows this list). Without corpus, the model still produces the prescribed structure.
  2. Sonnet's parametric knowledge covers what a standard SNAP form needs. The corpus tells it what it already knows.
  3. The 7.1% ceiling is a measurement artifact: deterministic field-matching penalises naming (`first_name` vs `firstName`) and grouping differences (model combines Earned + Unearned Income; ground truth splits them). The model is producing correct fields that get scored as misses.
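
To make point 1 concrete, here is a sketch of the shape of that hard-coding. This is illustrative only, not the actual `buildStructurePrompt` source; the only fact taken from the pipeline is that the page titles live in the prompt text, so an empty corpus changes nothing at this stage.

```ts
// Hypothetical sketch, not the real buildStructurePrompt — it only
// illustrates why an empty corpus doesn't change the structure stage:
// the SNAP page titles are part of the prompt text itself.
const SNAP_PAGE_TITLES: string[] = [
  "Household Composition", // one title from the ground truth; the rest are omitted here
  // ...the remaining hard-coded SNAP page titles
];

function buildStructurePromptSketch(corpusChunks: string[]): string {
  const pages = SNAP_PAGE_TITLES.map((title, i) => `${i + 1}. ${title}`).join("\n");
  const context =
    corpusChunks.length > 0
      ? `\n\nRelevant policy excerpts:\n${corpusChunks.join("\n---\n")}`
      : ""; // with an empty corpus the model still receives the full page list
  return `Design a SNAP application form with these pages:\n${pages}${context}`;
}
```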

What this does and doesn't say

  • Doesn't say: RAG is useless. The `sonnet-with-rag` extraction variant raised sensitivity-label accuracy from 27% to 53% across the extraction suite — that's a real measurable win in a different task.
  • Doesn't say: criteria are worthless. Zero criteria means no citations for auditability, which matters for the compliance-review UX even when it doesn't move a field-match metric.
  • Does say: for SNAP authoring output as currently wired, RAG is inert. The prompts carry the SNAP knowledge; the retrieval step is pedagogical scaffolding.

What would fix it

  • LLM-as-judge scorer (see the sketch after this list). The 7.1% cap is the scorer, not the pipeline. Both variants likely deserve 50–70% recall on a semantic judge.
  • Per-stage retrieval scoping. Section generation currently gets the full corpus; if it got only the chunks relevant to "Household Composition", the fine-grained regulatory detail could actually show through.
  • A harder fixture. A state waiver form or a rare supplemental benefit the model hasn't seen in pre-training would separate parametric knowledge from retrieved context more cleanly.
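
For the first bullet, a minimal sketch of what an LLM-as-judge field scorer could look like. The prompt, types, and injected `callModel` function are all assumptions, not code that exists in this repo.

```ts
// Hypothetical sketch of a semantic field scorer. Nothing here is part
// of this PR; it only shows the shape of an LLM-as-judge comparison.
interface FieldSpec {
  name: string;
  label: string;
  type: string;
}

interface FieldMatchVerdict {
  matches: boolean;
  reason: string;
}

async function judgeFieldMatch(
  generated: FieldSpec,
  groundTruth: FieldSpec,
  callModel: (prompt: string) => Promise<string>, // e.g. a thin model-client wrapper
): Promise<FieldMatchVerdict> {
  const prompt = [
    "Do these two form fields collect the same information?",
    "Ignore naming style (first_name vs firstName) and page/section grouping.",
    `Generated field: ${JSON.stringify(generated)}`,
    `Ground-truth field: ${JSON.stringify(groundTruth)}`,
    'Reply with JSON only: {"matches": true|false, "reason": "..."}',
  ].join("\n");
  return JSON.parse(await callModel(prompt)) as FieldMatchVerdict;
}
```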

Changes

  • src/entrypoints/cli/commands/evaluate-authoring.ts — adds an optional `useCorpus` flag on `VariantConfig` (defaults to `true`; sketched after this list). Registers `no-rag-sonnet` as the ablation variant.
  • catalog/experiments/authoring-pipeline/rag-ablation-snap.md — full writeup.
  • notes/snap-rag-ablation/{all-sonnet,no-rag-sonnet}.json — raw pipeline outputs for both runs.
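
The CLI wiring is roughly the following. Only the optional `useCorpus` flag, its `true` default, and the empty `PolicyChunk[]` behaviour are from this PR; the surrounding shape and names are illustrative.

```ts
// Sketch of the variant wiring. Only the optional useCorpus flag, its
// default of true, and the empty-corpus behaviour are from this PR;
// the surrounding shape and names are illustrative.
type PolicyChunk = { id: string; text: string }; // stand-in for the real type

interface VariantConfig {
  name: string;
  useCorpus?: boolean; // defaults to true
}

const variants: VariantConfig[] = [
  { name: "all-sonnet" },                      // existing behaviour, full corpus
  { name: "no-rag-sonnet", useCorpus: false }, // ablation variant added here
];

function corpusForVariant(variant: VariantConfig, corpus: PolicyChunk[]): PolicyChunk[] {
  const useCorpus = variant.useCorpus ?? true;
  console.log(`corpus ${useCorpus ? "enabled" : "disabled"} for variant ${variant.name}`);
  return useCorpus ? corpus : []; // empty PolicyChunk[] is passed to every stage
}
```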

Testing

  • `bun run check` — 1332 tests pass.
  • Both variants executed live against Bedrock today; artefacts committed.

test(authoring): RAG ablation for SNAP — corpus drives criteria but not output

Controlled test answering "is the RAG actually helping?" for the
SNAP Wisconsin authoring pipeline. Adds a `no-rag-sonnet` variant
that passes an empty PolicyChunk[] to every stage, and runs both it
and `all-sonnet` against the SNAP ground truth.

Result:

| Measurement      | With RAG | Without RAG |
|------------------|----------|-------------|
| Criteria         | 22       | 0           |
| Fields generated | 87       | 81          |
| Field recall     | 7.1%     | 7.1%        |
| Field precision  | 6.9%     | 7.4%        |
| Type accuracy    | 100%     | 100%        |

The corpus drives the criteria stage strongly (22 vs 0) but has no
measurable effect on the final form. Precision is marginally better
without the corpus. This is because:

1. buildStructurePrompt has SNAP-specific page titles baked in.
2. Sonnet's parametric knowledge of SNAP covers what an application
   form needs without external grounding.
3. The field-level scorer penalises naming and grouping differences
   rather than semantic fit — both variants are capped at 7.1%.

RAG still has a legitimate role here (citations in the criteria
review UI, auditability) and a demonstrated win on a different task
(sensitivity labelling in extraction, 27% to 53% on the suite). But
for SNAP authoring output quality, as currently wired, it's inert.

Writeup in catalog/experiments/authoring-pipeline/rag-ablation-snap.md.
Raw outputs in notes/snap-rag-ablation/.

CLI change is minimal — a `useCorpus?: boolean` flag on VariantConfig
that defaults to true, plus a one-line log showing which mode the run
is in.
@danielnaab danielnaab merged commit 6e97d5c into main Apr 20, 2026
4 checks passed
@danielnaab danielnaab deleted the experiment/rag-ablation-snap branch April 20, 2026 18:03