Insight extraction quality #31
Open
smodee wants to merge 4 commits into
q12 Record 3 reported metric_name=deaths, metric_value=160 from the
source quote "160 suspected deaths out of 670 suspected cases". The
prompt's canonical vocabulary already had `suspected_cases` (for the
670) but no `suspected_deaths` slot - so the model collapsed the
"suspected" qualifier and emitted plain `deaths`. The result was an
arithmetically implausible record (160 deaths against 61 confirmed
cases) that would not survive any reasonable post-hoc sanity check.
Changes:
- Add `suspected_deaths` alongside the existing `suspected_cases`
entry. Two-tier system per category (confirmed_* and suspected_*),
matching the agreed shape of the vocab.
- Drop the now-redundant standalone `probable_cases` line; the
`suspected_cases` description explicitly covers "suspected",
"probable", and "possible" as the same tier. WHO's combined
`confirmed_or_probable_cases` bucket is kept separately because it
is a distinct reporting category.
- Add a deaths-family mapping rule paralleling the existing
cases-family rule ("suspected deaths", "probable deaths", "deaths
under investigation" all map to suspected_deaths). This mirrors what
the existing `confirmed_cases` line does for the cases family.
- Clean up the stale "possible all get their own variants below"
parenthetical, which referenced a `possible_cases` slot that has
never existed.
After this, q12's "160 suspected deaths" should extract as
suspected_deaths=160 rather than deaths=160 - same value, correct
category, no longer competing with confirmed_cases for downstream
forecasting weight.
Implements item 3 from the Tier 1 roadmap.
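The two-tier mapping described above lives in the prompt, not in code, but it can be sketched as a small function to make the intended behavior concrete. This is an illustrative sketch only: `canonical_metric_name` and the `_TIER_BY_QUALIFIER` table are hypothetical names, and returning the bare family name for unqualified phrases is an assumption.

```python
import re

# Hypothetical qualifier -> tier table. Per the commit message, "suspected",
# "probable", and "possible" all belong to the same (suspected) tier.
_TIER_BY_QUALIFIER = {
    "confirmed": "confirmed",
    "suspected": "suspected",
    "probable": "suspected",
    "possible": "suspected",
}


def canonical_metric_name(phrase: str) -> str:
    """Map a source phrase such as 'suspected deaths' to a canonical metric_name."""
    text = phrase.lower()
    if "death" in text:
        family = "deaths"
    elif "case" in text:
        family = "cases"
    else:
        raise ValueError(f"no known metric family in: {phrase!r}")

    # WHO's combined bucket is a distinct reporting category, so check it first.
    if family == "cases" and "confirmed" in text and "probable" in text:
        return "confirmed_or_probable_cases"
    if "under investigation" in text:
        return f"suspected_{family}"
    for qualifier, tier in _TIER_BY_QUALIFIER.items():
        if re.search(rf"\b{qualifier}\b", text):
            return f"{tier}_{family}"
    # Assumption: an unqualified phrase keeps the bare family name.
    return family


print(canonical_metric_name("160 suspected deaths"))  # suspected_deaths
```

With a `suspected_deaths` slot available, the q12 quote no longer has to fall back to plain `deaths`.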
q12 Record 4 reported metric_value=82 with the quote
"the outbreak now poses a 'very high' risk for Congo - up from a
previous categorization of 'high'" - no digit, no number-word, no
relative reference. The hallucination guard's verbatim-substring check
passed because the quote string did appear in the source chunk, but
nothing in the guard required the quote sentence to be the one
actually carrying the figure. The metric_value of 82 came from
elsewhere in the chunk; the quote was a "supporting context"
sentence.
A deterministic post-hoc check (str(metric_value) in quote) would
over-reject: word numbers ("a dozen"), relative quantities ("a
quarter of the population"), and number-word forms ("ninety-nine
thousand") would all be false-positive rejections. So the fix lives
at the prompt level instead: tell the model the quote MUST be the
sentence that carries the figure - digits, number-word, or a clear
relative reference. A purely contextual sentence is not acceptable.
The verbatim-substring guard remains the safety net. This change
tightens the model's understanding of what `quote` is supposed to do
without committing to a brittle programmatic check that would lose
real signal on legitimate paraphrases.
Implements item 4 from the Tier 2 roadmap.
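To illustrate why the deterministic check was rejected, here is a minimal sketch of it. `naive_value_in_quote` is a hypothetical helper, and the three number-word quotes are invented for illustration; only the first quote comes from q12 Record 4.

```python
def naive_value_in_quote(metric_value, quote: str) -> bool:
    """The rejected check: is the value's string form a substring of the quote?"""
    return str(metric_value) in quote


examples = [
    # q12 Record 4: a context-only quote, which should be rejected.
    (82, "the outbreak now poses a 'very high' risk for Congo"),
    # Hypothetical quotes that legitimately carry their figure without digits.
    (12, "a dozen new infections were reported"),
    (99000, "ninety-nine thousand doses were shipped"),
    (0.25, "a quarter of the population is affected"),
]

for value, quote in examples:
    verdict = "accept" if naive_value_in_quote(value, quote) else "reject"
    print(f"{verdict}: {value!r} in {quote!r}")
```

All four are rejected. The first rejection is the desired one; the other three are the over-rejection that motivated a prompt-level fix instead.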
When the filtering stage passes only a handful of documents through to
insight, per-document retrieval depth becomes the bottleneck on
coverage. q7's live run reached insight with 2 usable documents and hit
retrieval_top_k=12 on each - meaning at most ~24 chunk extractions for
the whole question. Bumping per-doc retrieval depth costs little and
gives the model more chances to find the relevant figure in each
surviving document.
Adds two InsightConfig fields:
- low_survival_doc_threshold (default 5)
- low_survival_top_k (default 20)
When the count of usable documents (status != "failed" and non-empty
chunks) is at or below the threshold, both retrieval_top_k and
max_chunks_per_document effectively rise to low_survival_top_k for the
run, and a note is appended to InsightRunResult.notes flagging that the
adaptive path engaged. Default behavior is unchanged when the threshold
is not hit (normal top_k=12).
Tests that pin retrieval_top_k to small values to control fake-LLM call
counts now also pin low_survival_top_k to the same value, opting out of
the adaptive lift explicitly. 447 tests still passing.
Implements item 6 from the Tier 2 roadmap. Completes the planned bundle
of items 1+2+3+4+5+6.
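The adaptive path can be sketched roughly as follows. This is not the PR's implementation: the `InsightConfig` below is a stand-in carrying only the fields named above, the default for `max_chunks_per_document` is assumed, the document shape (dicts with `status` and `chunks`) is hypothetical, and using `max()` so the depth never drops is an assumption about what "rise to" means.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InsightConfig:
    retrieval_top_k: int = 12
    max_chunks_per_document: int = 12  # assumed default, not stated in the PR
    low_survival_doc_threshold: int = 5
    low_survival_top_k: int = 20


def count_usable(docs: List[dict]) -> int:
    """Usable = status is not 'failed' and the document has at least one chunk."""
    return sum(1 for d in docs if d.get("status") != "failed" and d.get("chunks"))


def effective_depth(cfg: InsightConfig, usable_docs: int) -> Tuple[int, int, Optional[str]]:
    """Return (top_k, max_chunks, note); note is None when the lift did not engage."""
    if usable_docs > cfg.low_survival_doc_threshold:
        return cfg.retrieval_top_k, cfg.max_chunks_per_document, None
    # Assumption: "rise to" means the depth is lifted but never lowered.
    top_k = max(cfg.retrieval_top_k, cfg.low_survival_top_k)
    max_chunks = max(cfg.max_chunks_per_document, cfg.low_survival_top_k)
    note = f"adaptive retrieval engaged: {usable_docs} usable docs, top_k={top_k}"
    return top_k, max_chunks, note


print(effective_depth(InsightConfig(), usable_docs=2)[:2])  # (20, 20)
```

Pinning `low_survival_top_k` to the same small value as `retrieval_top_k`, as the updated tests do, makes the lift a no-op under this sketch.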
Item 8 investigation: of the three quotes the guard dropped across the
q12 live runs, two were real facts lost purely because the model
lowercased the leading letter of a sentence it quoted from
mid-paragraph:
source: "There are now 750 suspected cases and 177 suspected deaths"
model: "there are now 750 suspected cases and 177 suspected deaths"
source: "The Congolese Ministry of Communication, in a post to X ...
said that there were 904 suspected cases and 119 ..."
model: "the Congolese Ministry of Communication, in a post to X ..."
The third rejection was a genuine content-insertion hallucination -
the model bolted the real prefix "a total of 105 confirmed cases
(including 10 deaths)" onto a fabricated continuation "...have been
reported in Ituri, North Kivu, and South Kivu" (the source actually
continues "...and 906 suspected cases").
Fix: a fourth, case-insensitive substring layer. It returns the chunk's
own casing so the stored quote still reflects the source verbatim. The
key safety property - verified against the real q12 fabrication and
captured in a new regression test - is that case-folding does NOT
recover content insertions: a fabricated continuation fails the
substring test regardless of case.
Tests: new _LAYER4_CASE_INSENSITIVE_CASES (the two recovered q12
quotes) plus a hallucination case mirroring the q12 fabrication that
must stay rejected. 450 passed (was 447; +3 guard cases).
Implements item 8 from the roadmap.
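A minimal sketch of a case-insensitive layer with the safety property described above. `match_case_insensitive` is a hypothetical name, not the guard's real function, and the fabricated quote in the usage lines is invented for illustration. Using `re.search` with `re.IGNORECASE` returns the matched span from the chunk itself, so the stored quote keeps the source's casing.

```python
import re
from typing import Optional


def match_case_insensitive(quote: str, chunk: str) -> Optional[str]:
    """Return the chunk's own text matching `quote` ignoring case, else None."""
    if not quote:
        return None
    # re.escape makes the quote a literal pattern; only letter case is relaxed.
    found = re.search(re.escape(quote), chunk, flags=re.IGNORECASE)
    return found.group(0) if found else None


chunk = "There are now 750 suspected cases and 177 suspected deaths"

# The model lowercased the leading letter: recovered with the source's casing.
lowercased = "there are now 750 suspected cases and 177 suspected deaths"
print(match_case_insensitive(lowercased, chunk))

# A hypothetical content insertion: still rejected, whatever the casing.
fabricated = "there are now 750 suspected cases and 177 suspected deaths in Ituri"
print(match_case_insensitive(fabricated, chunk))  # None
```

Case-folding only relaxes letter case, so any inserted or altered wording still fails the substring match.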
This was referenced Jun 3, 2026
Insight extraction quality
Summary
Cheap-model output quality fixes + an adaptive retrieval knob, driven by live-
run inspection of insight records on q7 (Mpox) and q12 (Ebola). Each fix maps
to a specific failure mode observed in those runs' artifacts.
What's included
- `suspected_deaths` added to the controlled `metric_name` vocabulary. The cheap model was collapsing "160 suspected deaths" into plain `deaths` (q12 Record 3). The vocabulary now has `confirmed_*` / `suspected_*` variants for both cases and deaths, paired with explicit prompt guidance ("suspected deaths / probable deaths / deaths under investigation" all map to `suspected_deaths`).
- Prompt now requires the `quote` field to contain the figure (digits / number-word / relative reference). Closes a gap where a `metric_value` was attached to a quantitatively-empty quote (q12 Record 4: `metric_value=82` with quote "the outbreak now poses a 'very high' risk for Congo", with no number anywhere in the quote). Done at the prompt level rather than as a deterministic post-hoc filter, to avoid false-positive rejection of word numbers and relative quantities.
- Case-insensitive layer in the hallucination guard recovers quotes the model lowercased the leading letter of, while still rejecting content-insertion fabrications. Regression-tested against the real q12 cases that previously triggered false-positive guard rejections.
- Adaptive `retrieval_top_k` (default 12 → 20 when 5 or fewer usable documents survive filtering), so per-doc retrieval depth isn't the bottleneck on coverage when the filter passes through only a handful of documents. q7 reached insight with only 2 surviving documents in baseline runs; the adaptive lift gives the model more chunks per surviving doc. Default behavior is unchanged when the threshold is not hit (normal `top_k=12`).
Verification
- `python -m pytest bioscancast/tests/`: 450 passed, 2 skipped (live). New tests cover the case-insensitive guard layer, `suspected_*` vocabulary, the quote-must-contain-figure prompt change, and the adaptive `top_k` path.
- Extraction now distinguishes `suspected_cases` / `suspected_deaths` (e.g., a Guardian quote "904 suspected cases and 119 suspected deaths" → two separate records with correct `metric_name` instead of one collapsed `deaths` record like the implausible 160 deaths against 61 confirmed cases). Same chunks that previously emitted q12 Record 4 no longer attach a `metric_value` to a quantitatively-empty quote.
Issues this PR addresses
- The `suspected_deaths` vocab nibbles at one of the failure modes flagged by that issue (cheap-model collapsing modifiers), but the refinement pass itself isn't built. Not closeable on this PR alone.
Reviewer checklist
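- Confirm the quote-must-contain-figure prompt wording doesn't over-restrict the model (it shouldn't: the phrasing accepts digits, number-words, and relative references explicitly).
- Confirm the case-insensitive guard layer doesn't admit content-insertion fabrications (the regression tests cover the known failure modes).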