test(authoring): RAG ablation for SNAP — corpus drives criteria but not output #104
Merged: danielnaab merged 1 commit into main on Apr 20, 2026
Conversation
Controlled test answering "is the RAG actually helping?" for the SNAP Wisconsin authoring pipeline. Adds a `no-rag-sonnet` variant that passes an empty `PolicyChunk[]` to every stage, and runs both it and `all-sonnet` against the SNAP ground truth.

Result:

| Measurement      | With RAG | Without RAG |
|------------------|----------|-------------|
| Criteria         | 22       | 0           |
| Fields generated | 87       | 81          |
| Field recall     | 7.1%     | 7.1%        |
| Field precision  | 6.9%     | 7.4%        |
| Type accuracy    | 100%     | 100%        |

The corpus drives the criteria stage strongly (22 vs 0) but has no measurable effect on the final form. Precision is marginally better without the corpus. This is because:

1. `buildStructurePrompt` has SNAP-specific page titles baked in.
2. Sonnet's parametric knowledge of SNAP covers what an application form needs without external grounding.
3. The field-level scorer penalises naming and grouping differences rather than semantic fit, so both variants are capped at 7.1%.

RAG still has a legitimate role here (citations in the criteria review UI, auditability) and a demonstrated win on a different task (sensitivity labelling in extraction, 27% to 53% on the suite). But for SNAP authoring output quality, as currently wired, it's inert.

Writeup in catalog/experiments/authoring-pipeline/rag-ablation-snap.md. Raw outputs in notes/snap-rag-ablation/. The CLI change is minimal: a `useCorpus?: boolean` flag on `VariantConfig` that defaults to true, plus a one-line log showing which mode the run is in.
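The variant wiring described above can be sketched as follows. `VariantConfig` and `PolicyChunk` are named in the PR, but the field shapes and the helper function here are illustrative assumptions, not the real interfaces:

```typescript
// Hypothetical shapes for illustration; only the names VariantConfig,
// PolicyChunk, and useCorpus come from the PR description.
interface PolicyChunk {
  id: string;
  text: string;
}

interface VariantConfig {
  name: string;
  model: string;
  useCorpus?: boolean; // omitted means true, preserving existing behaviour
}

// Resolve which chunks a pipeline stage should see for a given variant.
function chunksForVariant(
  variant: VariantConfig,
  retrieved: PolicyChunk[],
): PolicyChunk[] {
  // `!== false` keeps the default-true semantics of the optional flag.
  return variant.useCorpus !== false ? retrieved : [];
}

const corpus: PolicyChunk[] = [
  { id: "snap-1", text: "FoodShare eligibility excerpt…" },
];

const withRag: VariantConfig = { name: "all-sonnet", model: "sonnet" };
const noRag: VariantConfig = {
  name: "no-rag-sonnet",
  model: "sonnet",
  useCorpus: false,
};

console.log(chunksForVariant(withRag, corpus).length); // 1
console.log(chunksForVariant(noRag, corpus).length); // 0
```

Keeping the flag optional means every existing variant definition continues to run with the corpus, and only the ablation variant opts out.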
Controlled ablation answering your earlier question: is the RAG actually helping in the SNAP authoring scenario?
Result
The corpus drives the criteria stage strongly (22 vs 0) but has no measurable effect on the final form. Precision is even marginally better without it.
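The identical recall on both sides reflects the exact-match field scoring described in the PR body: a generated field gets credit only when its name matches ground truth, so semantically equivalent fields with different names score zero. A minimal sketch of that failure mode (field names and the scorer shape are illustrative assumptions, not the suite's actual code):

```typescript
// Illustrative exact-name-match scorer, not the real evaluation code.
interface Field {
  name: string;
  type: string;
}

function scoreFields(generated: Field[], truth: Field[]) {
  const truthNames = new Set(truth.map((f) => f.name));
  const hits = generated.filter((f) => truthNames.has(f.name)).length;
  return {
    recall: hits / truth.length,
    precision: hits / generated.length,
  };
}

const truth: Field[] = [
  { name: "household_size", type: "number" },
  { name: "monthly_income", type: "number" },
];
// "householdMembers" means the same thing as "household_size",
// but an exact-match scorer gives it no credit.
const generated: Field[] = [
  { name: "householdMembers", type: "number" },
  { name: "monthly_income", type: "number" },
];

const { recall, precision } = scoreFields(generated, truth);
console.log(recall, precision); // 0.5 0.5
```

Under this kind of metric, both variants hit the same ceiling regardless of whether the corpus improved semantic fit.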
Why
`buildStructurePrompt` has the 8 SNAP page titles hard-coded into it. Without the corpus, the model still produces the prescribed structure.

What this does and doesn't say
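Concretely, because the page titles are constants inside the prompt builder, the structure stage is fully specified even when the retrieved context is empty. A hedged sketch of that effect (the titles listed and the prompt shape are assumptions, not the real `buildStructurePrompt`):

```typescript
// Placeholder titles; the real pipeline hard-codes 8 SNAP-specific pages.
const SNAP_PAGE_TITLES = [
  "Applicant information",
  "Household income",
  "Signature",
];

interface PolicyChunk {
  id: string;
  text: string;
}

function buildStructurePrompt(chunks: PolicyChunk[]): string {
  const context = chunks.map((c) => c.text).join("\n");
  // The page list is a constant, so the prompt prescribes the same
  // structure whether or not any corpus chunks were retrieved.
  return [
    "Generate a form with these pages:",
    ...SNAP_PAGE_TITLES.map((t) => `- ${t}`),
    context ? `Policy context:\n${context}` : "",
  ].join("\n");
}

const withContext = buildStructurePrompt([{ id: "c1", text: "7 CFR 273 excerpt" }]);
const withoutContext = buildStructurePrompt([]);
// The prescribed page list appears in both prompts.
console.log(withContext.includes("- Signature")); // true
console.log(withoutContext.includes("- Signature")); // true
```

This is one mechanism by which the ablation can show zero structural effect: the information the corpus would supply is already baked into the prompt.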
What would fix it
Changes
- `src/entrypoints/cli/commands/evaluate-authoring.ts`: adds an optional `useCorpus` flag on `VariantConfig` (defaults to `true`). Registers `no-rag-sonnet` as the ablation variant.
- `catalog/experiments/authoring-pipeline/rag-ablation-snap.md`: full writeup.
- `notes/snap-rag-ablation/{all-sonnet,no-rag-sonnet}.json`: raw pipeline outputs for both runs.

Testing