fix(authoring): de-leak structure prompt; RAG now doubles SNAP recall #105

Merged: danielnaab merged 1 commit into main on Apr 20, 2026
Conversation
The first RAG ablation showed identical scores with and without the corpus (7.1% recall each). The cause was a leaky `buildStructurePrompt` that hard-coded the eight SNAP page titles verbatim — with that text baked in, the model reproduced the page list even when criteria and corpus were both empty. The prompt was impersonating the RAG.

This PR rewrites the structure prompt to derive topical structure from the approved criteria and the policy corpus, with an explicit instruction to call no tools when input is insufficient. The model must now actually use what it's retrieved.

Re-running the ablation with the honest prompt:

| Measurement   | With RAG | Without RAG |
|---------------|----------|-------------|
| Criteria      | 21       | 0           |
| Pages         | 14       | 1           |
| Fields        | 140      | 7           |
| Field recall  | 10.6%    | 4.7%        |
| Precision     | 6.4%     | 57.1%       |
| Type accuracy | 88.9%    | 100%        |

RAG more than doubles recall and enables 20× the field coverage. Without the corpus, the pipeline produces a single-page skeleton with the obvious identity fields and stops. With it, the model generates 14 pages that track the ground truth's topical structure — including sections (Citizenship, Student Status, Categorical Eligibility, Change Reporting, Earned/Unearned Income split) that exist only because of specific chunks in the expanded corpus.

The precision drop (57% to 6%) is a coverage trade-off: 7 conservative fields vs 140 candidate fields. For authoring, candidates you can review and prune are more valuable than an empty page.

The writeup is rewritten at catalog/experiments/authoring-pipeline/rag-ablation-snap.md, including explicit discussion of why the earlier result was wrong (prompt leakage) and how to audit for it. Fresh eval artifacts are in notes/snap-rag-ablation/.
danielnaab added a commit that referenced this pull request on Apr 20, 2026:
…status, validate links (#106)

Three related cleanups on the catalog surface.

1. Move the hybrid-v1 "headline finding" off the catalog landing page. It's a claim about the PDF extraction suite, not about the catalog overall — presenting it at the root misleads visitors into thinking it covers every experiment. The extraction suite's _suite.md already hosts the same finding as its "Summary of findings" section, which is the right home.

2. Refresh the authoring-pipeline suite page to reflect current state: corpus expanded from 13 to 21 chunks, structure prompt de-leaked (#105), RAG ablation run. Rename index.md to _suite.md so /catalog/experiments/authoring-pipeline resolves as a suite (matching the convention used by the other suites), and update the landing card href.

3. Add scripts/validate-catalog-links.ts and wire it into `bun run check`. It finds all /catalog/... URLs in TSX routes and markdown files, plus all relative markdown links within the catalog, and verifies that each resolves to an actual file, directory, or known code route. It caught one real broken link (catalog/decisions/design-system/component-scaffold.md used ../../notes/ when it should have been ../../../notes/) along with a pile of false positives that the validator now skips (dynamic template literals, the custom src: protocol, external URLs). 185 references checked across 94 markdown files and 8 TSX files — all resolve.
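The script itself isn't part of this excerpt; as a rough sketch of the approach it describes, with hypothetical helper names and a simplified resolution check:

```ts
// Rough sketch of the validation approach, not the actual
// scripts/validate-catalog-links.ts: collect /catalog/... references
// and check that each resolves on disk. File collection and helper
// shapes are assumptions.
import { existsSync } from "node:fs";
import { join } from "node:path";

const CATALOG_LINK = /\/catalog\/[\w.\/-]+/g;

/** Returns "file: url" for every /catalog/... reference that resolves
 *  to neither a file/directory nor a sibling .md file. */
function findBrokenCatalogLinks(
  contentsByFile: Map<string, string>,
  repoRoot: string,
): string[] {
  const broken: string[] = [];
  for (const [file, text] of contentsByFile) {
    for (const url of text.match(CATALOG_LINK) ?? []) {
      const target = join(repoRoot, url);
      if (!existsSync(target) && !existsSync(`${target}.md`)) {
        broken.push(`${file}: ${url}`);
      }
    }
  }
  return broken;
}
```

The real validator additionally whitelists dynamic template literals, the custom src: protocol, external URLs, and known code routes, which a naive filesystem check would flag as false positives.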
The first RAG ablation (#104) reported that the corpus had no measurable effect (7.1% / 7.1% field recall). That was wrong, and the reason is worth stating plainly: the structure prompt had the eight SNAP page titles hard-coded, so the model reproduced them even when criteria and corpus were both empty. The prompt was impersonating the RAG.
The change
Rewrite `buildStructurePrompt` to:

- derive topical structure from the approved criteria and the retrieved policy corpus, with no hard-coded page titles;
- explicitly instruct the model to call no tools when the input is insufficient.
No other prompts changed. `buildCriteriaPrompt` was already generic; `buildSectionPrompt` takes a group title that the caller resolves from the previous stage's output, so it inherits whatever the structure stage produced.
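A minimal sketch of the shape the de-leaked builder might take (the input shape and wording are illustrative, not the repo's actual code):

```ts
// Illustrative sketch only; input shape and wording are assumptions.
// The point is structural: every topical token in the prompt now comes
// from the inputs, and the empty-input case is handled explicitly
// instead of falling back to baked-in page titles.
interface StructurePromptInput {
  criteria: string[];      // approved criteria from the prior stage
  corpusChunks: string[];  // retrieved policy-corpus excerpts
}

function buildStructurePrompt({ criteria, corpusChunks }: StructurePromptInput): string {
  return [
    "Derive a page/section structure for this form from the approved",
    "criteria and policy excerpts below. Do not introduce topics that",
    "they do not support.",
    "",
    "If the criteria and excerpts are empty or insufficient, call no",
    "tools and return an empty structure.",
    "",
    "Approved criteria:",
    ...criteria.map((c) => `- ${c}`),
    "",
    "Policy excerpts:",
    ...corpusChunks.map((c) => `---\n${c}`),
  ].join("\n");
}
```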
Re-run results
| Measurement   | With RAG | Without RAG |
|---------------|----------|-------------|
| Criteria      | 21       | 0           |
| Pages         | 14       | 1           |
| Fields        | 140      | 7           |
| Field recall  | 10.6%    | 4.7%        |
| Precision     | 6.4%     | 57.1%       |
| Type accuracy | 88.9%    | 100%        |

RAG more than doubles recall and enables 20× the field coverage. Without the corpus, the pipeline produces a single page with the obvious identity fields and stops. With it, the model generates 14 pages that track ground truth's topical structure — including sections that exist only because of specific chunks in the expanded corpus: Citizenship, Student Status, Categorical Eligibility, Change Reporting, and the Earned/Unearned Income split.
On precision
The no-RAG run's 57% precision is 4 of 7 fields matching — it hits the obvious identity fields and nothing else. The with-RAG run hits 9 of 140 — lower per-field precision but meaningfully higher ground-truth coverage. For authoring, candidates you can review and prune are more valuable than an empty page.
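Concretely, under the standard definitions, the numbers reconcile like this (the ~85-field ground-truth count is inferred from the reported percentages, not stated):

```ts
// Precision = matched / generated; recall = matched / ground truth.
// The ~85-field ground-truth count is back-solved from the reported
// percentages (9 / 0.106 ≈ 4 / 0.047 ≈ 85), not stated in the writeup.
const score = (matched: number, generated: number, groundTruth = 85) => ({
  precision: matched / generated, // of the fields produced, how many match
  recall: matched / groundTruth,  // of the ground truth, how much is covered
});

score(4, 7);   // ≈ { precision: 0.571, recall: 0.047 } (no-RAG run)
score(9, 140); // ≈ { precision: 0.064, recall: 0.106 } (with-RAG run)
```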
On the lesson
The earlier negative result was a useful false negative — it forced us to audit the prompts for leaked domain knowledge. Generalisable takeaway: before concluding retrieval adds no value, check whether the prompt is doing the retrieval's job.
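One concrete audit: build the prompt with empty inputs and search it for the answers. A sketch, reusing the illustrative builder above:

```ts
// Leak audit sketch: with empty inputs, the prompt text should contain
// none of the answers the pipeline is graded on. The title list here is
// a stand-in for the real ground truth.
const GROUND_TRUTH_TITLES = ["Citizenship", "Student Status", "Categorical Eligibility"];

const emptyInputPrompt = buildStructurePrompt({ criteria: [], corpusChunks: [] });
const leaked = GROUND_TRUTH_TITLES.filter((t) => emptyInputPrompt.includes(t));
if (leaked.length > 0) {
  throw new Error(`structure prompt leaks ground truth: ${leaked.join(", ")}`);
}
```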
Testing

Re-ran the ablation with and without the corpus under the de-leaked prompt; fresh eval artifacts are in notes/snap-rag-ablation/.