fix(authoring): de-leak structure prompt; RAG now doubles SNAP recall #105

Merged: danielnaab merged 1 commit into main on Apr 20, 2026
Conversation
The first RAG ablation showed identical scores with and without the corpus (7.1% recall each). The cause was a leaky `buildStructurePrompt` that hard-coded the eight SNAP page titles verbatim — with that text baked in, the model reproduced the page list even when criteria and corpus were both empty. The prompt was impersonating the RAG.

This PR rewrites the structure prompt to derive topical structure from the approved criteria and the policy corpus, with an explicit instruction to call no tools when input is insufficient. The model must now actually use what it's retrieved.

Re-running the ablation with the honest prompt:

| Measurement   | With RAG | Without RAG |
|---------------|----------|-------------|
| Criteria      | 21       | 0           |
| Pages         | 14       | 1           |
| Fields        | 140      | 7           |
| Field recall  | 10.6%    | 4.7%        |
| Precision     | 6.4%     | 57.1%       |
| Type accuracy | 88.9%    | 100%        |

RAG more than doubles recall and enables 20× the field coverage. Without the corpus, the pipeline produces a single-page skeleton with the obvious identity fields and stops. With it, the model generates 14 pages that track the ground truth's topical structure — including sections (Citizenship, Student Status, Categorical Eligibility, Change Reporting, Earned/Unearned Income split) that exist only because of specific chunks in the expanded corpus.

The precision drop (57% to 6%) is a coverage trade-off: 7 conservative fields vs 140 candidate fields. For authoring, candidates you can review and prune are more valuable than an empty page.

The writeup is rewritten at catalog/experiments/authoring-pipeline/rag-ablation-snap.md, including explicit discussion of why the earlier result was wrong (prompt leakage) and how to audit for it. Fresh eval artifacts are in notes/snap-rag-ablation/.
danielnaab added a commit that referenced this pull request on Apr 20, 2026:
…status, validate links (#106)

Three related cleanups on the catalog surface.

1. Move the hybrid-v1 "headline finding" off the catalog landing page. It's a claim about the PDF extraction suite, not about the catalog overall — presenting it at the root misleads visitors into thinking it covers every experiment. The extraction suite's _suite.md already hosts the same finding as its "Summary of findings" section, which is the right home.

2. Refresh the authoring-pipeline suite page to reflect current state: corpus expanded from 13 to 21 chunks, structure prompt de-leaked (#105), RAG ablation run. Rename index.md to _suite.md so /catalog/experiments/authoring-pipeline resolves as a suite (matching the convention used by the other suites), and update the landing card href.

3. Add scripts/validate-catalog-links.ts and wire it into `bun run check`. It finds all /catalog/... URLs in TSX routes and markdown files, plus all relative markdown links within the catalog, and verifies that each resolves to an actual file, directory, or known code route. It caught one real broken link (catalog/decisions/design-system/component-scaffold.md used ../../notes/ when it should have been ../../../notes/) along with a pile of false positives that the validator now skips (dynamic template literals, the custom src: protocol, external URLs). 185 references checked across 94 markdown files and 8 TSX files — all resolve.
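The script itself isn't part of this excerpt; as a rough sketch of the approach it describes, with hypothetical helper names and a simplified resolution check:

```ts
// Rough sketch of the validation approach, not the actual
// scripts/validate-catalog-links.ts: collect /catalog/... references
// and check that each resolves on disk. File collection and helper
// shapes are assumptions.
import { existsSync } from "node:fs";
import { join } from "node:path";

const CATALOG_LINK = /\/catalog\/[\w.\/-]+/g;

/** Returns "file: url" for every /catalog/... reference that resolves
 *  to neither a file/directory nor a sibling .md file. */
function findBrokenCatalogLinks(
  contentsByFile: Map<string, string>,
  repoRoot: string,
): string[] {
  const broken: string[] = [];
  for (const [file, text] of contentsByFile) {
    for (const url of text.match(CATALOG_LINK) ?? []) {
      const target = join(repoRoot, url);
      if (!existsSync(target) && !existsSync(`${target}.md`)) {
        broken.push(`${file}: ${url}`);
      }
    }
  }
  return broken;
}
```

The real validator additionally whitelists dynamic template literals, the custom src: protocol, external URLs, and known code routes, which a naive filesystem check would flag as false positives.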
The first RAG ablation (#104) reported that the corpus had no measurable effect (7.1% / 7.1% field recall). That was wrong, and the reason is worth stating plainly: the structure prompt had the eight SNAP page titles hard-coded, so the model reproduced them even when criteria and corpus were both empty. The prompt was impersonating the RAG.
The change
Rewrite `buildStructurePrompt` to:

- derive topical structure from the approved criteria and the retrieved policy corpus, with no hard-coded page titles;
- explicitly instruct the model to call no tools when the input is insufficient.
No other prompts changed. `buildCriteriaPrompt` was already generic; `buildSectionPrompt` takes a group title that the caller resolves from the previous stage's output, so it inherits whatever the structure stage produced.
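A minimal sketch of the shape the de-leaked builder might take (the input shape and wording are illustrative, not the repo's actual code):

```ts
// Illustrative sketch only; input shape and wording are assumptions.
// The point is structural: every topical token in the prompt now comes
// from the inputs, and the empty-input case is handled explicitly
// instead of falling back to baked-in page titles.
interface StructurePromptInput {
  criteria: string[];      // approved criteria from the prior stage
  corpusChunks: string[];  // retrieved policy-corpus excerpts
}

function buildStructurePrompt({ criteria, corpusChunks }: StructurePromptInput): string {
  return [
    "Derive a page/section structure for this form from the approved",
    "criteria and policy excerpts below. Do not introduce topics that",
    "they do not support.",
    "",
    "If the criteria and excerpts are empty or insufficient, call no",
    "tools and return an empty structure.",
    "",
    "Approved criteria:",
    ...criteria.map((c) => `- ${c}`),
    "",
    "Policy excerpts:",
    ...corpusChunks.map((c) => `---\n${c}`),
  ].join("\n");
}
```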
Re-run results
| Measurement   | With RAG | Without RAG |
|---------------|----------|-------------|
| Criteria      | 21       | 0           |
| Pages         | 14       | 1           |
| Fields        | 140      | 7           |
| Field recall  | 10.6%    | 4.7%        |
| Precision     | 6.4%     | 57.1%       |
| Type accuracy | 88.9%    | 100%        |

RAG more than doubles recall and enables 20× the field coverage. Without the corpus, the pipeline produces a single page with the obvious identity fields and stops. With it, the model generates 14 pages that track ground truth's topical structure — including sections that exist only because of specific chunks in the expanded corpus: Citizenship, Student Status, Categorical Eligibility, Change Reporting, and the Earned/Unearned Income split.
On precision
The no-RAG run's 57% precision is 4 of 7 fields matching — it hits the obvious identity fields and nothing else. The with-RAG run hits 9 of 140 — lower per-field precision but meaningfully higher ground-truth coverage. For authoring, candidates you can review and prune are more valuable than an empty page.
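Concretely, under the standard definitions, the numbers reconcile like this (the ~85-field ground-truth count is inferred from the reported percentages, not stated):

```ts
// Precision = matched / generated; recall = matched / ground truth.
// The ~85-field ground-truth count is back-solved from the reported
// percentages (9 / 0.106 ≈ 4 / 0.047 ≈ 85), not stated in the writeup.
const score = (matched: number, generated: number, groundTruth = 85) => ({
  precision: matched / generated, // of the fields produced, how many match
  recall: matched / groundTruth,  // of the ground truth, how much is covered
});

score(4, 7);   // ≈ { precision: 0.571, recall: 0.047 } (no-RAG run)
score(9, 140); // ≈ { precision: 0.064, recall: 0.106 } (with-RAG run)
```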
On the lesson
The earlier negative result was a useful false negative — it forced us to audit the prompts for leaked domain knowledge. Generalisable takeaway: before concluding retrieval adds no value, check whether the prompt is doing the retrieval's job.
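One concrete audit: build the prompt with empty inputs and search it for the answers. A sketch, reusing the illustrative builder above:

```ts
// Leak audit sketch: with empty inputs, the prompt text should contain
// none of the answers the pipeline is graded on. The title list here is
// a stand-in for the real ground truth.
const GROUND_TRUTH_TITLES = ["Citizenship", "Student Status", "Categorical Eligibility"];

const emptyInputPrompt = buildStructurePrompt({ criteria: [], corpusChunks: [] });
const leaked = GROUND_TRUTH_TITLES.filter((t) => emptyInputPrompt.includes(t));
if (leaked.length > 0) {
  throw new Error(`structure prompt leaks ground truth: ${leaked.join(", ")}`);
}
```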
Testing

Re-ran the ablation with and without the corpus under the de-leaked prompt; fresh eval artifacts are in notes/snap-rag-ablation/.