Insight extraction quality by smodee · Pull Request #31 · algorithmicgovernance/BioScanCast

smodee · 2026-06-03T11:23:29Z

Insight extraction quality

Split from the original feat/end-to-end-orchestrator branch (was #28).
Companion PRs: #29 (orchestrator) and #30 (search/filter quality).
All three are independent and can merge in any order.

Summary

Cheap-model output quality fixes + an adaptive retrieval knob, driven by live-
run inspection of insight records on q7 (Mpox) and q12 (Ebola). Each fix maps
to a specific failure mode observed in those runs' artifacts.

What's included

- suspected_deaths added to the controlled metric_name vocabulary. The
  model was collapsing "160 suspected deaths" into plain deaths (q12
  Record 3). The vocabulary now has confirmed_* / suspected_* variants for
  both cases and deaths, paired with explicit prompt guidance ("suspected
  deaths / probable deaths / deaths under investigation" all map to
  suspected_deaths).
- Extraction prompt requires the quote field to contain the figure
  (digits / number-word / relative reference). Closes a gap where a
  metric_value was attached to a quantitatively-empty quote (q12 Record 4:
  metric_value=82 with quote "the outbreak now poses a 'very high' risk
  for Congo", with no number anywhere in the quote). Done at the prompt
  level rather than as a deterministic post-hoc filter, to avoid
  false-positive rejection of word numbers and relative quantities.
- Hallucination guard gains a case-insensitive layer 4. It recovers real
  quotes where the model lowercased the leading letter, while still
  rejecting content-insertion fabrications. Regression-tested against the
  real q12 cases that previously triggered false-positive guard rejections.
- Adaptive retrieval_top_k (default 12 → 20 when the count of usable
  documents surviving filtering is at or below the low-survival threshold,
  default 5), so per-doc retrieval depth isn't the bottleneck on coverage
  when the filter passes through only a handful of documents. q7 reached
  insight with only 2 surviving documents in baseline runs; the adaptive
  lift gives the model more chunks per surviving doc. When the threshold is
  not hit, behavior is unchanged (normal top_k=12).

Verification

- python -m pytest bioscancast/tests/: 450 passed, 2 skipped (live).
- New tests cover the case-insensitive guard layer, suspected_*
  vocabulary, the quote-must-contain-figure prompt change, and the adaptive
  top_k path.
- Live runs confirmed q12 now correctly tags suspected_cases /
  suspected_deaths (e.g., a Guardian quote "904 suspected cases and 119
  suspected deaths" → two separate records with the correct metric_name,
  instead of one collapsed deaths record like the earlier 160 deaths
  against 61 confirmed cases). The same chunks that previously emitted q12
  Record 4 no longer attach a metric_value to a quantitatively-empty quote.

Issues this PR addresses

- #26 (strong-model refinement pass in the insight stage): partial. The
  suspected_deaths vocab addresses one of the failure modes flagged by
  that issue (cheap-model collapsing modifiers), but the refinement pass
  itself isn't built. Not closeable on this PR alone.

Reviewer checklist

- Confirm the prompt change to require quote-contains-figure doesn't
  over-restrict the model (it shouldn't; the phrasing explicitly accepts
  digits, number-words, and relative references).
- Sanity-check that the case-insensitive guard layer doesn't accept
  content-insertion fabrications (the regression tests cover the known
  failure modes).

q12 Record 3 reported metric_name=deaths, metric_value=160 from the source quote "160 suspected deaths out of 670 suspected cases". The prompt's canonical vocabulary already had `suspected_cases` (for the 670) but no `suspected_deaths` slot - so the model collapsed the "suspected" qualifier and emitted plain `deaths`. The result was an arithmetically scandalous record (160 deaths against 61 confirmed cases) that wouldn't survive any reasonable post-hoc sanity check.

Changes:

- Add `suspected_deaths` alongside the existing `suspected_cases` entry. Two-tier system per category (confirmed_* and suspected_*), matching the agreed shape of the vocab.
- Drop the now-redundant standalone `probable_cases` line; the `suspected_cases` description explicitly covers "suspected", "probable", and "possible" as the same tier. WHO's combined `confirmed_or_probable_cases` bucket is kept separately because it is a distinct reporting category.
- Add a deaths-family mapping rule paralleling the existing cases-family rule ("suspected deaths", "probable deaths", "deaths under investigation" all map to suspected_deaths). Mirror what the existing `confirmed_cases` line does for the cases family.
- Clean up the stale "possible all get their own variants below" parenthetical which referenced a `possible_cases` slot that has never existed.

After this, q12's "160 suspected deaths" should extract as suspected_deaths=160 rather than deaths=160 - same value, correct category, no longer competing with confirmed_cases for downstream forecasting weight.

Implements item 3 from the Tier 1 roadmap.
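The two-tier mapping described in this commit can be pictured as a small lookup table. This is only an illustrative sketch: in the PR the mapping lives in the extraction prompt as guidance to the model, not in code, and `PHRASE_TO_METRIC` and `metric_name_for` are hypothetical names introduced here. The metric names themselves are the ones the commit message and PR summary mention.

```python
from typing import Optional

# Hypothetical lookup table, for illustration only. In the PR this mapping is
# expressed as prompt guidance, not as a Python structure.
PHRASE_TO_METRIC = {
    # Cases family: suspected / probable / possible share one tier.
    "confirmed cases": "confirmed_cases",
    "suspected cases": "suspected_cases",
    "probable cases": "suspected_cases",
    "possible cases": "suspected_cases",
    # WHO's combined bucket stays a distinct reporting category.
    "confirmed or probable cases": "confirmed_or_probable_cases",
    # Deaths family: the rule added by this commit.
    "confirmed deaths": "confirmed_deaths",
    "suspected deaths": "suspected_deaths",
    "probable deaths": "suspected_deaths",
    "deaths under investigation": "suspected_deaths",
}


def metric_name_for(phrase: str) -> Optional[str]:
    """Return the canonical metric_name for a qualifier phrase, or None if unknown."""
    return PHRASE_TO_METRIC.get(phrase.strip().lower())


print(metric_name_for("Suspected deaths"))
print(metric_name_for("deaths under investigation"))
```

With a table like this, "160 suspected deaths" lands in `suspected_deaths` and no longer gets compared against `confirmed_cases` as if it were a confirmed count.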

q12 Record 4 reported metric_value=82 with the quote "the outbreak now poses a 'very high' risk for Congo - up from a previous categorization of 'high'" - no digit, no number-word, no relative reference.

The hallucination guard's verbatim-substring check passed because the quote string did appear in the source chunk, but nothing in the guard required the quote sentence to be the one actually carrying the figure. The metric_value of 82 came from elsewhere in the chunk; the quote was a "supporting context" sentence.

A deterministic post-hoc check (str(metric_value) in quote) would over-reject: word numbers ("a dozen"), relative quantities ("a quarter of the population"), and number-word forms ("ninety-nine thousand") would all be false-positive rejections. So the fix lives at the prompt level instead: tell the model the quote MUST be the sentence that carries the figure - digits, number-word, or a clear relative reference. A purely contextual sentence is not acceptable.

The verbatim-substring guard remains the safety net. This change tightens the model's understanding of what `quote` is supposed to do without committing to a brittle programmatic check that would lose real signal on legitimate paraphrases.

Implements item 4 from the Tier 2 roadmap.
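To make the over-rejection argument concrete, here is a minimal sketch of the deterministic check this commit decided against. The function name `naive_figure_check` is hypothetical, and apart from the q12 Record 4 quote the example sentences are made up for illustration.

```python
def naive_figure_check(metric_value: int, quote: str) -> bool:
    """The rejected approach: require the digits of metric_value to appear in the quote."""
    return str(metric_value) in quote


examples = [
    # q12 Record 4: correctly rejected, the quote carries no figure at all.
    (82, "the outbreak now poses a 'very high' risk for Congo"),
    # Hypothetical quotes that do carry the figure, but not as digits.
    (12, "a dozen new cases were reported"),
    (99000, "ninety-nine thousand doses were delivered"),
    # Hypothetical quote with the figure written in digits.
    (904, "there were 904 suspected cases"),
]

for value, quote in examples:
    verdict = "accept" if naive_figure_check(value, quote) else "reject"
    print(f"{value:>6}  {verdict}  {quote!r}")
```

Only the last example passes. The "a dozen" and "ninety-nine thousand" quotes are legitimate figure-carrying sentences that this check would throw away, which is the false-positive risk the prompt-level fix avoids.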

When the filtering stage passes only a handful of documents through to insight, per-document retrieval depth becomes the bottleneck on coverage. q7's live run reached insight with 2 usable documents and hit retrieval_top_k=12 on each - meaning at most ~24 chunk extractions for the whole question. Bumping per-doc retrieval depth costs little and gives the model more chances to find the relevant figure in each surviving document.

Adds two InsightConfig fields:

- low_survival_doc_threshold (default 5)
- low_survival_top_k (default 20)

When the count of usable documents (status != "failed" and non-empty chunks) is at or below the threshold, both retrieval_top_k and max_chunks_per_document effectively rise to low_survival_top_k for the run, and a note is appended to InsightRunResult.notes flagging that the adaptive path engaged. When the threshold is not hit, behavior is unchanged (normal top_k=12).

Tests that pin retrieval_top_k to small values to control fake-LLM call counts now also pin low_survival_top_k to the same value, opting out of the adaptive lift explicitly. 447 tests still passing.

Implements item 6 from the Tier 2 roadmap. Completes the planned bundle of items 1+2+3+4+5+6.
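The adaptive rule can be sketched as follows. The field names and the defaults 12, 5, and 20 come from the commit message; everything else is an assumption made for illustration: the `Document` stand-in, the `effective_depth` helper, the default of `max_chunks_per_document`, the wording of the note, and the use of `max()` so the lift never lowers the depth.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class InsightConfig:
    retrieval_top_k: int = 12
    max_chunks_per_document: int = 12  # assumed default; not stated in the PR
    low_survival_doc_threshold: int = 5
    low_survival_top_k: int = 20


@dataclass
class Document:
    """Hypothetical stand-in for a filtered document."""
    status: str
    chunks: List[str] = field(default_factory=list)


def effective_depth(
    config: InsightConfig, documents: List[Document]
) -> Tuple[int, int, Optional[str]]:
    """Return (top_k, max_chunks, note) for this run."""
    # Usable means: not failed, and at least one chunk.
    usable = [d for d in documents if d.status != "failed" and d.chunks]
    if len(usable) <= config.low_survival_doc_threshold:
        # Assumption: "rise to" is modelled with max(), so the lift never lowers depth.
        top_k = max(config.retrieval_top_k, config.low_survival_top_k)
        max_chunks = max(config.max_chunks_per_document, config.low_survival_top_k)
        note = f"adaptive retrieval engaged: {len(usable)} usable documents"
        return top_k, max_chunks, note
    return config.retrieval_top_k, config.max_chunks_per_document, None


docs = [Document("ok", ["chunk a"]), Document("ok", ["chunk b"]), Document("failed")]
print(effective_depth(InsightConfig(), docs))
```

Pinning `low_survival_top_k` to the same small value as `retrieval_top_k`, as the updated tests do, makes the lift a no-op in this sketch as well.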

Item 8 investigation: of the three quotes the guard dropped across the q12 live runs, two were real facts lost purely because the model lowercased the leading letter of a sentence it quoted from mid-paragraph:

  source: "There are now 750 suspected cases and 177 suspected deaths"
  model:  "there are now 750 suspected cases and 177 suspected deaths"

  source: "The Congolese Ministry of Communication, in a post to X ... said that there were 904 suspected cases and 119 ..."
  model:  "the Congolese Ministry of Communication, in a post to X ..."

The third rejection was a genuine content-insertion hallucination - the model bolted the real prefix "a total of 105 confirmed cases (including 10 deaths)" onto a fabricated continuation "...have been reported in Ituri, North Kivu, and South Kivu" (the source actually continues "...and 906 suspected cases").

Fix: a fourth, case-insensitive substring layer. It returns the chunk's own casing so the stored quote still reflects the source verbatim. The key safety property - verified against the real q12 fabrication and captured in a new regression test - is that case-folding does NOT recover content insertions: a fabricated continuation fails the substring test regardless of case.

Tests: new _LAYER4_CASE_INSENSITIVE_CASES (the two recovered q12 quotes) plus a hallucination case mirroring the q12 fabrication that must stay rejected. 450 passed (was 447; +3 guard cases).

Implements item 8 from the roadmap.
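A case-insensitive layer with the two properties described here (return the chunk's own casing, still reject inserted content) might look like the sketch below. It is not the guard's actual code: `match_case_insensitive` is a hypothetical name, the use of `re.IGNORECASE` is one possible way to do it, and the fabrication chunk is assembled from the fragments quoted in the commit message on the assumption that they join directly.

```python
import re
from typing import Optional


def match_case_insensitive(quote: str, chunk: str) -> Optional[str]:
    """Find quote in chunk ignoring case; return the chunk's own text, or None."""
    if not quote:
        return None
    # re.escape treats the quote as a literal, so parentheses and dots are safe.
    match = re.search(re.escape(quote), chunk, flags=re.IGNORECASE)
    # group(0) is sliced from the chunk, so the stored quote keeps source casing.
    return match.group(0) if match else None


# Recovered case: the model lowercased only the leading letter.
chunk = "There are now 750 suspected cases and 177 suspected deaths"
model_quote = "there are now 750 suspected cases and 177 suspected deaths"
print(match_case_insensitive(model_quote, chunk))

# Fabrication case: real prefix, invented continuation. Must stay rejected.
# Assumption: the source fragments from the commit message join directly.
source = "a total of 105 confirmed cases (including 10 deaths) and 906 suspected cases"
fabricated = (
    "a total of 105 confirmed cases (including 10 deaths) "
    "have been reported in Ituri, North Kivu, and South Kivu"
)
print(match_case_insensitive(fabricated, source))
```

Returning the matched slice of the chunk rather than the model's string is what keeps the stored quote verbatim, and because the match is still a full substring test, a fabricated continuation fails it whatever its casing.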

smodee added 4 commits June 3, 2026 14:19

This was referenced Jun 3, 2026

End-to-end pipeline orchestrator #29

Open

Search/filter dashboard chokepoint + relevance ranking #30

Open

End-to-end pipeline orchestrator + filter/extraction quality bundle #28

Closed

smodee marked this pull request as ready for review June 3, 2026 11:33

smodee requested a review from rapsoj June 3, 2026 11:38

smodee wants to merge 4 commits into main from feat/insight-extraction-quality
