
Insight extraction quality#31

Open
smodee wants to merge 4 commits into
mainfrom
feat/insight-extraction-quality

smodee (Collaborator) commented Jun 3, 2026

Insight extraction quality

Split from the original feat/end-to-end-orchestrator branch (was #28).
Companion PRs: #29 (orchestrator) and #30 (search/filter quality).
All three are independent and can merge in any order.

Summary

Cheap-model output quality fixes + an adaptive retrieval knob, driven by live-
run inspection of insight records on q7 (Mpox) and q12 (Ebola). Each fix maps
to a specific failure mode observed in those runs' artifacts.

What's included

  • suspected_deaths added to the controlled metric_name vocabulary. The model
    was collapsing "160 suspected deaths" into plain deaths (q12 Record 3).
    The vocabulary now has confirmed_* / suspected_* variants for both
    cases and deaths, paired with explicit prompt guidance ("suspected deaths
    / probable deaths / deaths under investigation" all map to
    suspected_deaths).
  • Extraction prompt requires the quote field to contain the figure
    (digits / number-word / relative reference). Closes a gap where a
    metric_value was attached to a quantitatively-empty quote (q12 Record 4:
    metric_value=82 with quote "the outbreak now poses a 'very high' risk
    for Congo", with no number anywhere in the quote). Done at the prompt
    level rather than as a deterministic post-hoc filter, to avoid
    false-positive rejection of word numbers and relative quantities.
  • Hallucination guard gains a case-insensitive layer 4, which recovers real
    quotes whose leading letter the model lowercased, while still rejecting
    content-insertion fabrications. Regression-tested against the real q12
    cases that previously triggered false-positive guard rejections.
  • Adaptive retrieval_top_k (default 12 → 20 when 5 or fewer usable
    documents survive filtering), so per-doc retrieval depth isn't the
    bottleneck on coverage when the filter passes through only a handful of
    documents. q7 reached insight with only 2 surviving documents in baseline
    runs; the adaptive lift gives the model more chunks per surviving doc.
    When the document threshold is not hit, behavior is unchanged (normal
    top_k=12).

Verification

  • python -m pytest bioscancast/tests/ → 450 passed, 2 skipped (live).
    New tests cover the case-insensitive guard layer, suspected_*
    vocabulary, the quote-must-contain-figure prompt change, and the adaptive
    top_k path.
  • Live runs confirmed q12 now correctly tags suspected_cases /
    suspected_deaths (e.g., a Guardian quote "904 suspected cases and 119
    suspected deaths" → two separate records with the correct metric_name,
    instead of a collapsed plain deaths record like the earlier 160 deaths
    against 61 confirmed cases). The same chunks that previously emitted q12
    Record 4 no longer attach a metric_value to a quantitatively-empty quote.

Issues this PR addresses

  • #26 (Strong-model refinement pass in the insight stage): partial. The
    suspected_deaths vocab addresses one of the failure modes flagged by that
    issue (cheap-model collapsing modifiers), but the refinement pass itself
    isn't built. Not closeable on this PR alone.

Reviewer checklist

  • Confirm the prompt change to require quote-contains-figure doesn't
    over-restrict the model (it shouldn't — phrasing accepts digits,
    number-words, and relative references explicitly).
  • Sanity-check the case-insensitive guard layer doesn't accept
    content-insertion fabrications (the regression tests cover the known
    failure modes).

smodee added 4 commits June 3, 2026 14:19

q12 Record 3 reported metric_name=deaths, metric_value=160 from the
source quote "160 suspected deaths out of 670 suspected cases". The
prompt's canonical vocabulary already had `suspected_cases` (for the
670) but no `suspected_deaths` slot - so the model collapsed the
"suspected" qualifier and emitted plain `deaths`. The result was an
arithmetically scandalous record (160 deaths against 61 confirmed
cases) that wouldn't survive any reasonable post-hoc sanity check.
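
For illustration, a minimal sketch of the kind of post-hoc sanity check
alluded to above. It is not part of this change. The `metric_name` and
`metric_value` fields come from the records described here, but the dict
shape and the helper name `deaths_exceed_confirmed_cases` are assumptions.

```python
def deaths_exceed_confirmed_cases(records):
    """Return True when plain `deaths` is larger than `confirmed_cases`.

    `records` is assumed to be a list of dicts carrying `metric_name`
    and `metric_value`; the real record type may differ.
    """
    values = {r["metric_name"]: r["metric_value"] for r in records}
    deaths = values.get("deaths")
    confirmed = values.get("confirmed_cases")
    if deaths is None or confirmed is None:
        return False  # nothing to compare
    return deaths > confirmed


# q12 Record 3 as originally extracted: the qualifier was collapsed.
before = [
    {"metric_name": "deaths", "metric_value": 160},
    {"metric_name": "confirmed_cases", "metric_value": 61},
]
# Same value tagged with the new vocabulary slot.
after = [
    {"metric_name": "suspected_deaths", "metric_value": 160},
    {"metric_name": "confirmed_cases", "metric_value": 61},
]

print(deaths_exceed_confirmed_cases(before))  # True
print(deaths_exceed_confirmed_cases(after))   # False
```

With the correct category the two figures no longer sit in the same tier,
so a check like this has nothing to flag.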

Changes:

- Add `suspected_deaths` alongside the existing `suspected_cases`
  entry. Two-tier system per category (confirmed_* and suspected_*),
  matching the agreed shape of the vocab.
- Drop the now-redundant standalone `probable_cases` line; the
  `suspected_cases` description explicitly covers "suspected",
  "probable", and "possible" as the same tier. WHO's combined
  `confirmed_or_probable_cases` bucket is kept separately because it
  is a distinct reporting category.
- Add a deaths-family mapping rule paralleling the existing
  cases-family rule ("suspected deaths", "probable deaths", "deaths
  under investigation" all map to suspected_deaths). Mirror what the
  existing `confirmed_cases` line does for the cases family.
- Clean up the stale "possible all get their own variants below"
  parenthetical which referenced a `possible_cases` slot that has
  never existed.

After this, q12's "160 suspected deaths" should extract as
suspected_deaths=160 rather than deaths=160 - same value, correct
category, no longer competing with confirmed_cases for downstream
forecasting weight.

Implements item 3 from the Tier 1 roadmap.

q12 Record 4 reported metric_value=82 with the quote
"the outbreak now poses a 'very high' risk for Congo - up from a
previous categorization of 'high'" - no digit, no number-word, no
relative reference. The hallucination guard's verbatim-substring check
passed because the quote string did appear in the source chunk, but
nothing in the guard required the quote sentence to be the one
actually carrying the figure. The metric_value of 82 came from
elsewhere in the chunk; the quote was a "supporting context"
sentence.

A deterministic post-hoc check (str(metric_value) in quote) would
over-reject: word numbers ("a dozen"), relative quantities ("a
quarter of the population"), and number-word forms ("ninety-nine
thousand") would all be false-positive rejections. So the fix lives
at the prompt level instead: tell the model the quote MUST be the
sentence that carries the figure - digits, number-word, or a clear
relative reference. A purely contextual sentence is not acceptable.
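
A small sketch of why the deterministic check was not adopted. The helper
name `naive_quote_contains_value` is hypothetical, and the "dozen" and
"ninety-nine thousand" sentences are made-up examples, not quotes from the
q12 artifacts.

```python
def naive_quote_contains_value(metric_value, quote):
    """The rejected post-hoc check: str(metric_value) in quote."""
    return str(metric_value) in quote


examples = [
    # (metric_value, quote)
    (160, "160 suspected deaths out of 670 suspected cases"),  # digits
    (12, "a dozen new infections were reported"),              # word number
    (99000, "ninety-nine thousand doses were shipped"),        # number-word
    (82, "the outbreak now poses a 'very high' risk for Congo"),  # no figure
]

results = [naive_quote_contains_value(v, q) for v, q in examples]
print(results)  # [True, False, False, False]
```

The check rejects the quantitatively-empty q12 Record 4 quote as intended,
but it also rejects both word-number quotes, which do carry their figure.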

The verbatim-substring guard remains the safety net. This change
tightens the model's understanding of what `quote` is supposed to do
without committing to a brittle programmatic check that would lose
real signal on legitimate paraphrases.

Implements item 4 from the Tier 2 roadmap.

When the filtering stage passes only a handful of documents through to
insight, per-document retrieval depth becomes the bottleneck on
coverage. q7's live run reached insight with 2 usable documents and
hit retrieval_top_k=12 on each - meaning at most ~24 chunk
extractions for the whole question. Bumping per-doc retrieval depth
costs little and gives the model more chances to find the relevant
figure in each surviving document.

Adds two InsightConfig fields:
- low_survival_doc_threshold (default 5)
- low_survival_top_k (default 20)

When the count of usable documents (status != "failed" and
non-empty chunks) is at or below the threshold, both retrieval_top_k
and max_chunks_per_document effectively rise to low_survival_top_k for
the run, and a note is appended to InsightRunResult.notes flagging
that the adaptive path engaged. When the document threshold is not hit,
behavior is unchanged (normal top_k=12).

Tests that pin retrieval_top_k to small values to control fake-LLM
call counts now also pin low_survival_top_k to the same value, opting
out of the adaptive lift explicitly. 447 tests still passing.
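
A minimal sketch of the adaptive rule as described in this commit message.
The four field names are the ones named above; everything else is an
assumption for illustration: the stand-in class `InsightConfigSketch`, the
default of 12 for max_chunks_per_document, the dict-shaped documents, the
note text, and the use of max() to model "rise to".

```python
from dataclasses import dataclass


@dataclass
class InsightConfigSketch:
    """Stand-in for InsightConfig; only the fields relevant here."""
    retrieval_top_k: int = 12
    max_chunks_per_document: int = 12  # assumed default
    low_survival_doc_threshold: int = 5
    low_survival_top_k: int = 20


def count_usable(documents):
    """Usable = status != "failed" and non-empty chunks."""
    return sum(
        1 for d in documents
        if d.get("status") != "failed" and d.get("chunks")
    )


def effective_depth(config, documents):
    """Return (top_k, max_chunks, notes) for this run."""
    top_k = config.retrieval_top_k
    max_chunks = config.max_chunks_per_document
    notes = []
    usable = count_usable(documents)
    if usable <= config.low_survival_doc_threshold:  # at or below
        top_k = max(top_k, config.low_survival_top_k)
        max_chunks = max(max_chunks, config.low_survival_top_k)
        notes.append(f"adaptive retrieval engaged ({usable} usable documents)")
    return top_k, max_chunks, notes


ok = {"status": "ok", "chunks": ["some text"]}
failed = {"status": "failed", "chunks": []}

# q7-like run: 2 usable documents, adaptive path engages.
print(effective_depth(InsightConfigSketch(), [ok, ok, failed])[:2])  # (20, 20)
# 6 usable documents: threshold not hit, depth stays at 12.
print(effective_depth(InsightConfigSketch(), [ok] * 6)[:2])          # (12, 12)
# Test-style opt-out: pin low_survival_top_k to the same small value.
pinned = InsightConfigSketch(retrieval_top_k=3, max_chunks_per_document=3,
                             low_survival_top_k=3)
print(effective_depth(pinned, [ok])[:2])                             # (3, 3)
```

The last case mirrors the test opt-out: with low_survival_top_k pinned to
the same value as retrieval_top_k, the adaptive path still engages but
changes nothing.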

Implements item 6 from the Tier 2 roadmap. Completes the planned
bundle of items 1+2+3+4+5+6.

Item 8 investigation: of the three quotes the guard dropped across the
q12 live runs, two were real facts lost purely because the model
lowercased the leading letter of a sentence it quoted from
mid-paragraph:

  source: "There are now 750 suspected cases and 177 suspected deaths"
  model:  "there are now 750 suspected cases and 177 suspected deaths"

  source: "The Congolese Ministry of Communication, in a post to X ...
           said that there were 904 suspected cases and 119 ..."
  model:  "the Congolese Ministry of Communication, in a post to X ..."

The third rejection was a genuine content-insertion hallucination -
the model bolted the real prefix "a total of 105 confirmed cases
(including 10 deaths)" onto a fabricated continuation "...have been
reported in Ituri, North Kivu, and South Kivu" (the source actually
continues "...and 906 suspected cases").

Fix: a fourth, case-insensitive substring layer. It returns the chunk's
own casing so the stored quote still reflects the source verbatim. The
key safety property - verified against the real q12 fabrication and
captured in a new regression test - is that case-folding does NOT
recover content insertions: a fabricated continuation fails the
substring test regardless of case.
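
A sketch of what a case-insensitive substring layer with this property
could look like. The function name `case_insensitive_layer` is
hypothetical, the earlier three layers are not shown, and the real guard
may be implemented differently. The chunk strings reuse the q12 fragments
quoted above.

```python
import re


def case_insensitive_layer(quote, chunk):
    """Return the matching span in the chunk's own casing, or None.

    re.escape keeps parentheses and other punctuation literal, and the
    match is taken from `chunk`, so the returned text is source-verbatim.
    """
    match = re.search(re.escape(quote), chunk, flags=re.IGNORECASE)
    return match.group(0) if match else None


# Recovered case: the model lowercased only the leading letter.
chunk = "There are now 750 suspected cases and 177 suspected deaths"
quote = "there are now 750 suspected cases and 177 suspected deaths"
recovered = case_insensitive_layer(quote, chunk)
print(recovered)  # There are now 750 suspected cases and 177 suspected deaths

# Content insertion: real prefix, fabricated continuation. Stays rejected.
source = ("a total of 105 confirmed cases (including 10 deaths) "
          "and 906 suspected cases")
fabricated = ("a total of 105 confirmed cases (including 10 deaths) "
              "have been reported in Ituri, North Kivu, and South Kivu")
rejected = case_insensitive_layer(fabricated, source)
print(rejected)  # None
```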

Tests: new _LAYER4_CASE_INSENSITIVE_CASES (the two recovered q12
quotes) plus a hallucination case mirroring the q12 fabrication that
must stay rejected. 450 passed (was 447; +3 guard cases).

Implements item 8 from the roadmap.