
fix(authoring): de-leak structure prompt; RAG now doubles SNAP recall #105

Merged

danielnaab merged 1 commit into main from experiment/realistic-rag-prompt on Apr 20, 2026

Conversation

@danielnaab
Member

The first RAG ablation (#104) reported that the corpus had no measurable effect (7.1% / 7.1% recall). That was wrong, and the reason is worth stating plainly: the structure prompt had the 8 SNAP page titles hard-coded, so the model reproduced them even when criteria and corpus were both empty. The prompt was impersonating the RAG.

The change

Rewrite `buildStructurePrompt` to:

  • Derive pages and groups from the approved criteria and the policy corpus.
  • Explicitly tell the model to call no tools when input is empty.
  • Drop the hard-coded "A SNAP application typically needs 6-8 pages" line and the enumerated page list.

No other prompts changed. `buildCriteriaPrompt` was already generic; `buildSectionPrompt` takes a group title that the caller resolves from the previous stage's output, so it inherits whatever the structure stage produced.
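
For concreteness, here is a minimal sketch of the reshaped builder. The `Criterion` and `CorpusChunk` shapes and the exact wording are assumptions for illustration, not the real implementation:

```ts
// Hypothetical input shapes; the real types live in the authoring pipeline.
interface Criterion {
  id: string;
  text: string;
}

interface CorpusChunk {
  citation: string; // e.g. "7 CFR 273.4"
  text: string;
}

// De-leaked structure prompt: no hard-coded page titles and no page-count
// hint. Pages and groups must be derived from the inputs, and empty input
// short-circuits to an explicit "call no tools" instruction.
export function buildStructurePrompt(
  criteria: Criterion[],
  corpus: CorpusChunk[],
): string {
  if (criteria.length === 0 && corpus.length === 0) {
    return [
      'You have no approved criteria and no policy corpus.',
      'Do not call any tools; respond that there is insufficient',
      'input to propose a form structure.',
    ].join(' ');
  }

  const criteriaList = criteria.map((c) => `- ${c.text}`).join('\n');
  const corpusList = corpus
    .map((c) => `- [${c.citation}] ${c.text}`)
    .join('\n');

  return [
    'Propose a page and group structure for an application form.',
    'Derive every page and group from the approved criteria and the',
    'policy corpus below; do not invent topics neither source supports.',
    '',
    'Approved criteria:',
    criteriaList,
    '',
    'Policy corpus:',
    corpusList,
  ].join('\n');
}
```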

Re-run results

| Measurement        | With RAG | Without RAG | Delta       |
|--------------------|----------|-------------|-------------|
| Criteria extracted | 21       | 0           | −21         |
| Pages generated    | 14       | 1           | −13         |
| Fields generated   | 140      | 7           | −133        |
| Field recall       | 10.6%    | 4.7%        | +5.9 pp     |
| Precision          | 6.4%     | 57.1%       | (trade-off) |
| Type accuracy      | 88.9%    | 100%        | −11.1 pp    |

RAG more than doubles recall and enables 20× the field coverage. Without the corpus, the pipeline produces a single page with the obvious identity fields and stops. With it, the model generates 14 pages that track ground truth's topical structure — including sections that exist only because of specific chunks in the expanded corpus:

  • Citizenship (273.4)
  • Student Status (273.5)
  • Categorical Eligibility (273.2(j))
  • Change Reporting and Certification (273.12 + 273.10(f))
  • Shelter (273.9(d), previously merged into generic "Expenses")
  • Earned/Unearned Income split (273.9(b), previously merged)

On precision

The no-RAG run's 57% precision is 4 of 7 fields matching — it hits the obvious identity fields and nothing else. With-RAG hits 9 of 140 — lower per-field precision but meaningfully higher ground-truth coverage. For authoring, candidates you can review and prune are more valuable than an empty page.
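
The arithmetic behind those rates, as a sketch: the match counts come from the results above, while the ground-truth field count of roughly 85 is back-derived from 9 / 0.106 and is an assumption, not a number the harness reports.

```ts
// Recall/precision arithmetic for the ablation numbers above.
// groundTruthFields = 85 is back-derived (9 / 0.106 ≈ 85), an
// assumption for illustration rather than a figure from the harness.
const groundTruthFields = 85;

function rates(matched: number, generated: number) {
  return {
    precision: matched / generated,
    recall: matched / groundTruthFields,
  };
}

const withRag = rates(9, 140); // { precision: ~0.064, recall: ~0.106 }
const withoutRag = rates(4, 7); // { precision: ~0.571, recall: ~0.047 }

console.log(withRag, withoutRag);
```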

On the lesson

The earlier negative result was a useful false negative — it forced us to audit the prompts for leaked domain knowledge. Generalisable takeaway: before concluding retrieval adds no value, check whether the prompt is doing the retrieval's job.

Testing

  • `bun run check` — 1341 tests pass.
  • Both variants re-run live against Bedrock today; artifacts in `notes/snap-rag-ablation/`.
  • Updated writeup at `catalog/experiments/authoring-pipeline/rag-ablation-snap.md` with the methodology-audit lesson.

The first RAG ablation showed identical scores with and without the
corpus (7.1% recall each). The cause was a leaky `buildStructurePrompt`
that hard-coded the eight SNAP page titles verbatim — with that text
baked in, the model reproduced the page list even when criteria and
corpus were both empty. The prompt was impersonating the RAG.

Rewrite the structure prompt to derive topical structure from the
approved criteria and the policy corpus, with an explicit instruction
to call no tools when input is insufficient. The model must now
actually use what it's retrieved.

Re-running the ablation with the honest prompt:

| Measurement   | With RAG | Without RAG |
|---------------|----------|-------------|
| Criteria      | 21       | 0           |
| Pages         | 14       | 1           |
| Fields        | 140      | 7           |
| Field recall  | 10.6%    | 4.7%        |
| Precision     | 6.4%     | 57.1%       |
| Type accuracy | 88.9%    | 100%        |

RAG more than doubles recall and enables 20x the field coverage.
Without the corpus, the pipeline produces a single-page skeleton
with the obvious identity fields and stops. With it, the model
generates 14 pages that track the ground truth's topical structure —
including sections (Citizenship, Student Status, Categorical
Eligibility, Change Reporting, Earned/Unearned Income split) that
exist only because of specific chunks in the expanded corpus.

The precision drop (57% to 6%) is a coverage trade-off: 7 conservative
fields vs 140 candidate fields. For authoring, candidates you can
review and prune are more valuable than an empty page.

Writeup rewritten at
catalog/experiments/authoring-pipeline/rag-ablation-snap.md,
including explicit discussion of why the earlier result was wrong
(prompt leakage) and how to audit for it.

Fresh eval artifacts in notes/snap-rag-ablation/.
danielnaab merged commit af42edc into main Apr 20, 2026
4 checks passed
danielnaab deleted the experiment/realistic-rag-prompt branch April 20, 2026 18:18
danielnaab added a commit that referenced this pull request Apr 20, 2026
…status, validate links (#106)

Three related cleanups on the catalog surface.

1) Move the hybrid-v1 "headline finding" off the catalog landing
   page. It's a claim about the PDF extraction suite, not about the
   catalog overall — presenting it at the root misleads visitors
   into thinking it covers every experiment. The extraction suite's
   _suite.md already hosts the same finding as its "Summary of
   findings" section, which is the right home.

2) Refresh the authoring-pipeline suite page to reflect current
   state: corpus expanded from 13 to 21 chunks, structure prompt
   de-leaked (#105), RAG ablation run. Rename index.md to _suite.md
   so /catalog/experiments/authoring-pipeline resolves as a suite
   (matches the convention used by the other suites) and update the
   landing card href.

3) Add scripts/validate-catalog-links.ts and wire it into
   `bun run check`. Finds all /catalog/... URLs in TSX routes and
   markdown files and all relative markdown links within the
   catalog; verifies each resolves to an actual file, directory, or
   known code route. Caught one real broken link
   (catalog/decisions/design-system/component-scaffold.md was using
   ../../notes/ when it should have been ../../../notes/) along
   with a pile of false positives that the validator now skips
   (dynamic template literals, custom src: protocol, external URLs).

185 references checked across 94 markdown files and 8 TSX files —
all resolve.
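
A sketch of that validator's core check, under assumed helper names and an assumed repository layout (the real script's internals may differ):

```ts
// Sketch of the core check in scripts/validate-catalog-links.ts.
// Reference collection from TSX routes and markdown is elided; the
// names and the route list below are illustrative assumptions.
import { existsSync } from 'node:fs';
import { dirname, resolve } from 'node:path';

interface Reference {
  sourceFile: string; // file the link appears in
  target: string; // a /catalog/... URL or a relative markdown link
}

// Routes served by code rather than by files on disk (assumed).
const KNOWN_CODE_ROUTES = new Set(['/catalog', '/catalog/experiments']);

function resolvesOk(ref: Reference): boolean {
  if (ref.target.startsWith('/catalog')) {
    if (KNOWN_CODE_ROUTES.has(ref.target)) return true;
    // Assume the URL space mirrors the repository layout.
    return existsSync(resolve('.', ref.target.slice(1)));
  }
  // Relative markdown link: resolve against the linking file.
  return existsSync(resolve(dirname(ref.sourceFile), ref.target));
}

function validate(refs: Reference[]): void {
  const broken = refs.filter((r) => !resolvesOk(r));
  for (const r of broken) {
    console.error(`broken link in ${r.sourceFile}: ${r.target}`);
  }
  if (broken.length > 0) process.exit(1);
  console.log(`${refs.length} references checked, all resolve.`);
}
```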
