Insight stage hardening: 11 improvements + 5 deferred-but-included items #27

Merged
rapsoj merged 21 commits into main from feat/insight-stage-hardening
May 28, 2026

smodee (Collaborator) commented May 26, 2026

Summary

Hardens the insight stage so the pipeline produces usable records on real biosecurity documents (WHO sitreps, CDC MMWR, ECDC CDTR, CIDRAP, ProMED, Africa CDC). The branch started with 8 items from a consolidated action list and grew to 17 commits as testing surfaced additional issues that were small, related, and clearly worth fixing in-scope.

The unifying theme: fix the things that prevent the insight pipeline from producing good records on real inputs. Some items touch extraction-stage code (HTML/PDF parsing) because insight-stage symptoms had extraction-stage causes.

Depends on #24

Rebased onto feat/as-of-date-replay to absorb the LLM-interface migration for contamination.py and the historical_roleplay parameter for query_decomposition.py. Without this rebase, the textual merge of #24+#27 produces broken imports (#24's contamination module uses the legacy bioscancast.llm.client that #27 deletes) and a conflict in query_decomposition.py (both PRs modified it independently). Merging this PR after #24 is now self-consistent; the orchestrator branch's cleanup commit becomes redundant on its next rebase.

What's included

| # | Commit | What it does |
|---|--------|--------------|
| 1 | 810327f | Loosen hallucination guard to accept NFKC / terminal-punctuation / wrapping-punctuation drift while still rejecting content insertions and synthesized quotes |
| 2 | 5c50f99 | New real-document integration test in CI: 23 assertions across 6 real biosec docs, runs in ~7s with fakes |
| 3 | 99f93dd | Drop or repair empty chunks at extraction time; recovers MMWR's borderless-table chunk and ProMED's 157-row outbreak table |
| 4 | a47be32 | Partial-date dedup: event_date_precision field + two-stage dedup so 2026-01-25 (day) merges with 2026-01 (month) |
| 5 | ecdb726 | Replace 30-country hardcoded map with pycountry: full ISO 3166-1 coverage, US states / regional rejection |
| 6 | d3205a4 | Reject filename-shaped PDF /Title metadata (e.g. ECDC's 2026-WCP-0020 Draft.docx no longer leaks through) |
| 7 | 49e374d | Migrate filtering stage to the shared LLMClient protocol |
| 8 | e29103d | Parallelise per-chunk extraction with ThreadPoolExecutor: roughly 2× faster per-doc wall-clock |
| 9 | efe62c3 | Controlled vocabulary for metric_name in the extraction prompt; collapses 17 phrasing variants to 5 canonical ones |
| 10 | 95e4738 | Value-aware dedup: refuse to merge when metric_value differs, catches LLM location-attribution errors |
| 11 | ce3ba11 | Complete the LLM migration: search stage moved to shared protocol, legacy bioscancast/llm/client.py deleted |
| 12 | 366aed8 | Remove redundant legacy-import regression test (the module it forbade no longer exists) |
| 13 | e324d19 | Use trafilatura's structural XML output instead of throwing it away: CIDRAP chunks 18 → 2, ProMED table preserved |
| 14 | a0b6d9c | Broaden HTML published_date extraction: JSON-LD, Dublin Core (dcterms:issued/DC.date.issued), sailthru/parsely, plus a modification-semantic fallback |
| 15 | 537fd93 | URL path date parsing: extracts year from /han/2024/... (CDC HAN) and /item/2024-DON530 (WHO DON) |
| 16 | 618484e | Remove use_strong_model_refinement no-op flag; design preserved in #26 |
| 17 | 2d77493 | Migrate contamination.py and the two as-of-date tests from bioscancast.llm.client to the shared bioscancast.llm.base.LLMClient interface; closes the cross-PR gap with #24 |

Headline numbers — final live verification on 6 real biosecurity docs

| Metric | Before this branch | After this branch |
|--------|--------------------|-------------------|
| Records extracted | 23 | 39 (+70%) |
| Per-run wall-clock | 101.9s | 37.8s (2.7×) |
| Unique metric_name strings | 9 (with phrasing drift) | 5 (all from canonical vocab) |
| Records with iso_country_code populated | 22% (5/23) | 74% (29/39) |
| Hallucination guard catches fabrications | yes | yes |
| Cross-doc dedup merges correctly | yes | yes |
| Africa CDC fails fast with requires_ocr | yes | yes |
| Estimated cost per run (gpt-4o-mini) | $0.0058 | $0.0075 |

The cost increase (~30%) is driven entirely by ProMED now contributing real content (the rendered outbreak table) and ECDC producing ~2× the records. Both increases buy real value, and the cost stays under $0.01 per run.

What's NOT in this PR

Deferred deliberately. Not blockers for forecasting; each warrants its own dedicated scope:

  • End-to-end orchestrator (item 2 from the original list) — needs a coherent design for chaining search → filter → extract → insight per ForecastQuestion. Bigger scope, separate branch.
  • Historical-roleplay clarification (item 13 from the original list) — benchmark-fairness question. Better addressed in a benchmark-prep PR with the full historical-replay context in view.
  • Strong-model refinement pass: scaffold removed in commit 618484e; design preserved in #26 (Strong-model refinement pass in the insight stage, labelled post-benchmark). Revisit after the first benchmark run identifies whether the cheap-model records need refinement.

Test plan

  • python -m pytest bioscancast/tests/: 447 passed, 2 skipped, 0 failed on the rebased branch (was 221 before; +226 new tests across this PR and #24, Add historical-replay mode for benchmark fairness)
  • New real-document integration test (bioscancast/tests/test_insight_real_docs_integration.py) confirms extraction + insight pipeline produces records on all 6 working biosecurity documents
  • Live verification on 6-doc fixture set + spot-checks on 5 live HTML sources (WHO DON, CDC HAN, CIDRAP homepage/article, Reuters healthcare, a 404 page) — see headline numbers above
  • Hallucination guard still rejects all fabricated quotes across all 6 docs
  • Cross-doc dedup still merges twin facts into one record with provenance from both docs
  • contamination.py and the two as-of-date tests from #24 (Add historical-replay mode for benchmark fairness) pass after migration to the shared LLMClient interface

Reviewer notes

  • All commits are individually meaningful — review per-commit is straightforward and matches the per-item structure
  • The first 4 commits in the log (before commit 1 of this PR) are from #24 (Add historical-replay mode for benchmark fairness) and are reviewed in that PR, not here
  • Investigation artifacts from the iterative testing (LLM-run outputs under data/investigations/, the manual eval script at scripts/eval_insight_on_real_docs.py) are deliberately not committed
  • No changes to schemas that would break existing consumers — only additive (event_date_precision is a new optional field on InsightRecord)

🤖 Generated with Claude Code

smodee and others added 4 commits May 20, 2026 11:54
Activated by setting ForecastQuestion.as_of_date; None (default)
preserves live behavior unchanged.

Core:
  * Schema fields published_date_source, cutoff_applied, fetch_strategy,
    snapshot_timestamp on SearchResult and Document for post-hoc audit
  * SearchStagePipeline filters post-cutoff results; recovers undated
    ones from URL slug or Wayback first-seen; drops the unrecoverable
  * lookup_dashboards rewrites URLs to closest Wayback snapshot at-or-
    before cutoff; suppresses dashboards with no pre-cutoff snapshot
  * ExtractionPipeline fetches via Wayback id_ snapshots when cutoff
    is set; falls back to live with strategy recorded on Document
  * SearchCache key incorporates as_of_date so replays don't collide
  * Optional historical_roleplay decomposition prompt
  * eval_stage/contamination.py adds filter_caught_contamination_rate
    (explicit lower bound, never silent) and retrieval_free baseline (E4)

Hardening surfaced by live testing:
  * Tavily end_date accepted on the Protocol but not forwarded
    (verified empirically not to filter)
  * Wayback CDX retries on 5xx / 429 / timeout with exponential backoff
  * Sub-queries get cutoff year appended in historical mode
  * Top-up round with bigger max_results when survivors < threshold
  * RFC 2822 date parsing for Tavily news topic responses

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tavily's news endpoint silently ignores `end_date` when passed alone but
honors the start_date+end_date pair, returning 20/20 native pre-cutoff
results across the resolved corpus (q1, q3, q7, q9). The pipeline now
synthesizes `start_date = as_of_date - 365d` (configurable via
`historical_lookback_days`) and forwards both bounds. This fully addresses
the live q1 failure, where 0 of 20 results were pre-cutoff; the
post-retrieval cutoff filter remains as defense-in-depth.

The TavilyBackend drops a lone `end_date` with a warning rather than
sending a request Tavily will misinterpret. Stale comments referring to
"Tavily ignores end_date" are updated across pipeline.py and
tavily_backend.py.
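
The date-bound synthesis described above can be sketched as follows. This is an illustration only: the function names `tavily_date_bounds` and `drop_lone_end_date` and the ISO-string parameter format are assumptions, not the actual pipeline or Tavily API.

```python
import logging
from datetime import date, timedelta
from typing import Dict, Optional

logger = logging.getLogger(__name__)


def tavily_date_bounds(
    as_of_date: Optional[date],
    historical_lookback_days: int = 365,
) -> Dict[str, str]:
    """Return both bounds in historical mode, or no bounds in live mode."""
    if as_of_date is None:
        return {}
    start = as_of_date - timedelta(days=historical_lookback_days)
    return {"start_date": start.isoformat(), "end_date": as_of_date.isoformat()}


def drop_lone_end_date(params: Dict[str, str]) -> Dict[str, str]:
    """Drop an end_date that has no matching start_date, with a warning."""
    if "end_date" in params and "start_date" not in params:
        logger.warning("Dropping lone end_date=%s", params["end_date"])
        return {k: v for k, v in params.items() if k != "end_date"}
    return params
```

With a cutoff of 2026-05-20 and the default lookback, the synthesized window starts 365 days earlier, on 2025-05-20.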

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Wayback CDX endpoint rate-limits at ~60 req/min server-side. The
existing reactive RETRY_BACKOFF_SECONDS = (0, 10, 30, 90, 240) ladder
only fires after the server has already started returning 429s, burning
~6 min per failure. Historical-replay benchmarks routinely hit dozens
of these failures, producing ~30 min wall-clock on q1.

This commit adds two complementary measures:

1. Proactive throttle in wayback.py: a module-level _throttle() gate
   sleeps before every urlopen to maintain a 2.0 s minimum interval
   (~30 req/min, half the server cap). Overridable via env var
   BIOSCANCAST_WAYBACK_MIN_INTERVAL_SECONDS. The retry ladder is
   unchanged and still handles genuine 503s / read timeouts.

2. Selective-recovery gate in pipeline._apply_cutoff_filter: skip the
   Wayback first-seen leg of recover_published_date() for aggregator
   domains (metaculus, manifold, kalshi, ...) and source_tier=="unknown".
   The URL-slug regex and Last-Modified strategies still run for gated
   results. New wayback_skipped counter in the cutoff-filter log line.

Live q1 smoke test: ~30 min -> 49 s (~37x).
Test suite: 252 -> 261 passed (9 new tests covering the throttle gate,
env override, retry interaction, gate decisions, and end-to-end
recovery routing).
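
A minimal sketch of the proactive throttle idea, written as a small class with an injectable clock rather than the module-level `_throttle()` gate the commit describes. The class name `MinIntervalThrottle` and its constructor parameters are hypothetical; only the env var name and the 2.0 s default come from the text above.

```python
import os
import threading
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(
        self,
        min_interval: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if min_interval is None:
            # Env override, falling back to the 2.0 s default (~30 req/min).
            min_interval = float(
                os.environ.get("BIOSCANCAST_WAYBACK_MIN_INTERVAL_SECONDS", "2.0")
            )
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Sleep just long enough to keep calls min_interval apart."""
        with self._lock:
            now = self._clock()
            if self._last is not None:
                remaining = self._min_interval - (now - self._last)
                if remaining > 0:
                    self._sleep(remaining)
                    now = self._clock()
            self._last = now
```

Calling `wait()` immediately before each request keeps the client under the server cap; the reactive retry ladder then only has to deal with genuine 5xx responses and timeouts.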

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
probe_tavily_topic.py grows from a one-query news/general comparison
into a corpus iterator with config caching, synthetic-backdated stress
queries, and per-knob result dumping. analyze_tavily_probe.py is new:
it reads the cached probe payloads and recomputes hit-rate tables
without re-paying the Tavily quota.

Both scripts default to writing/reading specs/probe-results/ (gitignored
by convention; create on first run). They were the workhorses behind
the start_date+end_date investigation that produced commit 211f6df.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@smodee smodee requested a review from rapsoj May 27, 2026 06:53
smodee and others added 17 commits May 27, 2026 15:09
The previous strict whitespace-normalised substring check rejected real
factual quotes whenever the LLM made any small punctuation or unicode
adjustment — and live tests showed it does so constantly on real
WHO/CDC/ECDC documents. The headline failure was CDC MMWR producing a
single record from a doc that says "99 cases" five times, because every
variant the model emitted had either a trailing period where the source
had a comma, parens dropped around "(NMDOH)", or a smart-quote vs
straight-quote mismatch.

The new layered guard `_quote_matches` accepts a quote that matches the
chunk under any of three increasingly permissive normalisations:

  1. NFKC + explicit typography fold (smart quotes, em/en dashes,
     ellipsis) + whitespace collapse.
  2. Strip terminal ".;,:!?" from the quote.
  3. Strip wrapping punctuation `()[]{}"'` from both sides.

The function returns the canonical chunk substring it matched against,
so ChunkReference.quote always stores a verbatim chunk excerpt rather
than the model's altered output. Tests show this lifts real-fact
capture from 23 to 34 records on the 6 real biosecurity test
documents (a 48% increase) while still rejecting fabricated quotes,
content-insertion hallucinations (extra words in a list), synthesised
prefix-onto-fragment paraphrases, and wrong-number alterations.

The 30-character-window match described in the plan turned out to be
unnecessary once the typography fold and paren-strip layers were in
place — and would have weakened the guard on real hallucinations
without buying additional real-quote coverage.

New parametrised tests cover each layer plus the rejection cases.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires the real ExtractionPipeline (fetcher monkey-patched to read on-disk
bytes from data/docling_eval/sources/) into the insight pipeline with
two deterministic fake LLMs:

  QuoteEchoingFakeLLM    — picks a number-bearing sentence from each
                           retrieved chunk and emits a synthetic fact
                           citing it verbatim. Exercises the happy path
                           of the hallucination guard.
  HallucinatingFakeLLM   — always emits a fabricated quote. The guard
                           must drop every fact.

Covers 23 assertions across:
  - extraction per source (Africa CDC fails with requires_ocr, the
    other 6 succeed with >= 5 chunks each)
  - WHO mpox PDF metadata yields a publication date
  - insight pipeline produces records for every text-extractable doc
  - the CIDRAP article's "602 cases" headline appears in at least one
    record's quote
  - failed-extraction docs are skipped without LLM calls
  - the hallucination guard rejects every fabricated quote on every doc
  - cross-document dedup merges twin facts into one record with sources
    from both docs

Uses >= thresholds so subsequent items in the insight-stage hardening
plan (empty-chunk filter, partial-date dedup, pycountry resolution)
can lift record counts without breaking this test. Runs in ~7 seconds
with no live LLM calls.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Empty-text chunks cost the insight stage twice: they take a top-k slot
in retrieval (BM25 indexes the heading even when the body is empty),
and then trigger an LLM call against blank chunk text. Tests on the
CDC MMWR borderless table showed this happening on the chunk that
ranked #1 by retrieval score, a wasted call every time.

A new helper, `_drop_or_repair_empty_chunks`, runs in
`ExtractionPipeline.extract_one` between `normalize_chunks` and chunk
renumbering. Two paths:

  - Table chunks with empty text but populated `table_data`: render the
    rows to a tab-separated text block so BM25 and the LLM can see the
    cell contents. `table_data` itself is preserved unchanged for any
    consumer that wants the structured form.
  - Other empty chunks: drop with a DEBUG log carrying chunk_id, type
    and heading. An empty prose chunk almost always indicates a half-
    broken upstream parser section (heading without body, footer
    artefact); nothing the insight stage can act on.

Tests confirm the existing MMWR table chunk (previously empty-text)
now carries 277 characters of rendered table content while the
underlying rows stay accessible via `table_data`. Empty prose chunks
disappear before reaching downstream stages.
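
A minimal sketch of the two paths, assuming a simplified `Chunk` dataclass. The field names other than `table_data` are illustrative, and the real `_drop_or_repair_empty_chunks` may differ.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

logger = logging.getLogger(__name__)


@dataclass
class Chunk:
    # Hypothetical stand-in for the pipeline's chunk type.
    chunk_id: str
    chunk_type: str
    heading: str = ""
    text: str = ""
    table_data: Optional[List[List[str]]] = None


def drop_or_repair_empty_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Render empty table chunks to text; drop other empty chunks."""
    kept = []
    for chunk in chunks:
        if chunk.text.strip():
            kept.append(chunk)
            continue
        if chunk.chunk_type == "table" and chunk.table_data:
            # One line per row, cells separated by tabs; table_data is untouched.
            rendered = "\n".join(
                "\t".join(str(cell) for cell in row) for row in chunk.table_data
            )
            if rendered.strip():
                chunk.text = rendered
                kept.append(chunk)
                continue
        logger.debug(
            "Dropping empty chunk id=%s type=%s heading=%r",
            chunk.chunk_id, chunk.chunk_type, chunk.heading,
        )
    return kept
```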

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous single-stage dedup key formatted every event_date as
YYYY-MM-DD, which prevented merging two reports of the same event
when one source gave a day and another gave only the month — tests
on the 6 real biosecurity documents showed WHO cholera repeatedly
producing two records for the DRC January-2026 6543-cases fact, one
from a prose sentence ("In January 2026 ... reported 6543 new cholera
cases") and one from a table cell ("Democratic Republic of the Congo
6 543") with subtly different parsed dates.

Changes:

  - New `event_date_precision` field on `InsightRecord` carries the
    granularity ("year"|"month"|"day") alongside the canonical
    `event_date` datetime (which is now the start of the period when
    only a partial date is known).
  - `_parse_event_date` accepts YYYY, YYYY-MM, YYYY-MM-DD, and the
    existing free-form day-precision variants, returning a
    (datetime, precision) tuple.
  - Extraction prompt instructs the model to use the most specific
    ISO date the chunk supports and NOT to invent a day-of-month
    when only a month is given.
  - Two-stage `_deduplicate_records`: first group by (event_type,
    metric_name, normalized_location), then within each group walk
    records in order and merge each into the first surviving entry
    whose date bucket overlaps at the coarser precision. The
    surviving record adopts the finer precision so downstream
    consumers don't lose information.
  - Confidence is taken as the max across merged records; provenance
    references from every merged source are preserved.

Tests cover the matrix of precision combinations (day/month overlap,
year/day overlap, equal months, different days, undated vs dated,
three-way mixed-precision merge, separate locations). Live runs show
records merging correctly when the model picks the same metric_name
across sources; cases where the model uses subtly different metric
names ("new cases" vs "Cases" vs "cholera cases") still stay
separate — that's a metric-name-normalisation problem separate from
this change.
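
The precision-aware overlap check can be illustrated with a short sketch. It covers only the ISO forms (YYYY, YYYY-MM, YYYY-MM-DD), not the free-form variants, and the function names are hypothetical rather than the actual `_parse_event_date` or `_deduplicate_records`.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

_ISO_PARTIAL = re.compile(r"(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?")
_RANK = {"year": 0, "month": 1, "day": 2}

DatedValue = Tuple[datetime, str]


def parse_event_date(raw: str) -> Optional[DatedValue]:
    """Parse a full or partial ISO date into (period start, precision)."""
    match = _ISO_PARTIAL.fullmatch(raw.strip())
    if match is None:
        return None
    year, month, day = match.groups()
    precision = "day" if day else "month" if month else "year"
    try:
        return datetime(int(year), int(month or 1), int(day or 1)), precision
    except ValueError:  # e.g. month 13
        return None


def _bucket(dt: datetime, precision: str) -> tuple:
    """Truncate a datetime to the given precision."""
    return (dt.year, dt.month, dt.day)[: _RANK[precision] + 1]


def dates_overlap(a: DatedValue, b: DatedValue) -> bool:
    """Compare two dates at the coarser of their two precisions."""
    coarser = min(a[1], b[1], key=_RANK.__getitem__)
    return _bucket(a[0], coarser) == _bucket(b[0], coarser)


def finer_of(a: DatedValue, b: DatedValue) -> DatedValue:
    """The merged record keeps the more specific date."""
    return max(a, b, key=lambda dated: _RANK[dated[1]])
```

Under this scheme `2026-01-25` and `2026-01` overlap at month precision and the merged record keeps the day-precision date, while `2026-01-25` and `2026-01-26` stay separate.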

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous COUNTRY_TO_ISO dict hand-listed ~30 countries — primarily
because nobody wanted to take on a dependency at the time. Live tests
show the cost: most extracted records on the 6 real biosecurity docs
had iso_country_code=None despite clear country names like "Austria",
"Bulgaria", "Comoros", and "Madagascar" that the map simply didn't
cover. With ~250 ISO 3166-1 entries, hand-maintenance was untenable.

pycountry covers all 249 ISO 3166-1 entries by canonical, common, and
official name plus alpha-2/alpha-3 codes. The new resolver in
`chunk_extractor.py` layers four steps:

  1. Typography fold (smart quotes → ASCII) so "Côte d'Ivoire" with
     either apostrophe variant resolves.
  2. Explicit not-a-country set ("Africa", "European Region",
     "EU/EEA", etc.) — these are multi-country roll-ups, not single
     ISO entries, and pycountry's `search_fuzzy` produces surprising
     false positives here. Returns None deliberately.
  3. Alias dict for forms pycountry won't match directly ("UK", "DRC",
     "Russia", "Burma", "Ivory Coast", "North Korea", and UK
     constituents like "England" / "Scotland" → "GB").
  4. US subnational set — all 50 states plus DC and US territories
     → "US". Common in biosecurity reporting (CDC MMWR routinely
     phrases location as "Lea County, New Mexico").

`pycountry.countries.search_fuzzy` is deliberately not used — it
would resolve "Eastern Mediterranean Region" to a single country
(false positive). Strict matching only.

Compound locations like "Mubende district, Uganda" still work via
the existing right-to-left comma-segment fallback.

Tests cover: bare country lookups across all 6 real docs' locations,
the alias set, US states, multi-country region rejection, compound
locations, alpha-2/alpha-3 codes, smart-quote variants.

pycountry pinned to >=24.0 in requirements.txt.
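
A rough sketch of the layered resolver follows. The sets are deliberately truncated and the names are hypothetical; `pycountry.countries.lookup` is the strict matcher, and the `lookup` parameter is an assumption added here so the layering can be exercised without the dependency.

```python
from typing import Callable, Optional

_TYPOGRAPHY = str.maketrans({"\u2018": "'", "\u2019": "'"})
# Truncated illustrative sets; the real lists are longer.
_NOT_A_COUNTRY = {"africa", "european region", "eu/eea"}
_ALIASES = {
    "uk": "GB", "england": "GB", "scotland": "GB", "drc": "CD",
    "russia": "RU", "burma": "MM", "ivory coast": "CI", "north korea": "KP",
}
_US_SUBNATIONAL = {"new mexico", "utah", "district of columbia", "puerto rico"}


def _pycountry_lookup(name: str) -> Optional[str]:
    """Strict lookup by name or alpha-2/alpha-3 code; no fuzzy search."""
    import pycountry  # third-party dependency, imported lazily

    try:
        return pycountry.countries.lookup(name).alpha_2
    except LookupError:
        return None


def resolve_iso(
    location: str,
    lookup: Callable[[str], Optional[str]] = _pycountry_lookup,
) -> Optional[str]:
    """Resolve a location string to an ISO 3166-1 alpha-2 code, or None."""
    segments = [s.strip() for s in location.translate(_TYPOGRAPHY).split(",")]
    # Right-to-left so "Mubende district, Uganda" tries "Uganda" first.
    for segment in reversed(segments):
        key = segment.casefold()
        if not key:
            continue
        if key in _NOT_A_COUNTRY:
            return None
        if key in _ALIASES:
            return _ALIASES[key]
        if key in _US_SUBNATIONAL:
            return "US"
        code = lookup(segment)
        if code:
            return code
    return None
```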

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PyMuPDF surfaces whatever the PDF's /Title metadata field says, with no
filtering. Real biosecurity PDFs often have stale conversion titles
leaked from the source Word document. Tests show ECDC's CDTR
returning "2026-WCP-0020 Draft.docx" as its title, which then
propagated verbatim through Document.title and into any downstream
consumer.

A new `_sanitize_title` helper on PdfParser drops:

  - Titles ending in a document-format extension (.pdf, .docx, .doc,
    .odt, .rtf, .txt, .pages, .xlsx, .pptx, .html, case-insensitive).
  - Empty / whitespace-only titles.
  - Implausibly short titles (< 5 chars).

When the sanitiser returns None, the existing
`parsed.title or filtered_doc.title` fallback chain in
`extraction/pipeline.py` picks up the search-side title instead —
which is the desired behaviour.

Tests confirm the ECDC stale title is dropped, every other
filename extension variant is rejected, and real document titles
(WHO sitreps, MMWR articles, ECDC CDTR) pass through unchanged.
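
The three rejection rules can be sketched as a standalone function. The actual `_sanitize_title` is a helper on PdfParser; the signature below is an assumption for illustration.

```python
import re
from typing import Optional

# Extensions listed in the commit message, matched case-insensitively.
_FILENAME_EXT = re.compile(
    r"\.(pdf|docx|doc|odt|rtf|txt|pages|xlsx|pptx|html)$", re.IGNORECASE
)


def sanitize_title(raw: Optional[str], min_length: int = 5) -> Optional[str]:
    """Return a cleaned title, or None when it looks like a filename or noise."""
    if raw is None:
        return None
    title = raw.strip()
    if len(title) < min_length:  # covers empty and whitespace-only titles
        return None
    if _FILENAME_EXT.search(title):
        return None
    return title
```

Returning None rather than an empty string lets a `parsed.title or filtered_doc.title` style fallback pick up the search-side title.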

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The filtering stage was using the older
`bioscancast/llm/client.py:LLMClient` (single positional prompt string,
returns plain dict) — a parallel protocol that's been carrying a long-
standing TODO in the insight README. The shared protocol at
`bioscancast/llm/base.py:LLMClient` uses keyword-only system/user/schema/
model/max_tokens and returns LLMResponse with structured token
accounting, matching what insight (and soon forecasting) already uses.

Changes:

  - `bioscancast/filtering/llm_filter.py`:
      * `build_filter_prompt` now returns a (system, user, schema) triple
        instead of a single concatenated JSON string. System carries the
        task instructions; user carries the question + candidates
        payload; schema is a real strict JSON Schema (not the previous
        example-dict).
      * `llm_filter_candidates` calls `llm_client.generate_json` with
        kwargs and reads `response.content["decisions"]` instead of
        the raw dict.
      * Adds default `model` and `max_tokens` parameters.

  - `bioscancast/filtering/pipeline.py`: switches `LLMClient` import to
    the shared base protocol.

  - `bioscancast/llm/client.py`: keeps the legacy single-positional
    protocol intact (search stage still uses it) but adds a top-of-file
    docstring warning new callers off.

  - `bioscancast/insight/README.md`: the filtering-migration TODO is
    closed; replaced with the remaining search-stage migration as a
    follow-up.

  - New `bioscancast/tests/test_filtering_llm.py` covers the new
    protocol path: prompt triple, strict JSON schema shape, the
    expected `generate_json(**kwargs)` call signature, missing-decision
    handling, empty-input shortcut, and a regression check that the
    filter module never re-imports the legacy client.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The insight pipeline's per-chunk LLM call loop was strictly sequential.
Tests on the 6 real biosecurity documents showed each doc spending
almost all its wall-clock in serial OpenAI request latency: WHO mpox
~30s, ECDC CDTR ~38s, MMWR ~11s for top-k=5 chunks. Since each
extract_facts_from_chunk call is independent and the OpenAI sync
client is thread-safe, this was leaving easy speedups on the table.

Changes:

  - `InsightPipeline.run` now dispatches per-chunk extractions to a
    `ThreadPoolExecutor` whose pool size is `min(chunk_workers,
    len(scored_chunks))`. With chunk_workers=1 or only one chunk, the
    code takes a sequential fallback path. Errors in one chunk are
    caught, logged, and don't abort the document.
  - Budget accounting still happens serially on the main thread after
    futures complete, so BudgetTracker stays simple (no locks needed).
  - New `chunk_workers: int = 6` field on `InsightConfig`. Six is a
    pragmatic default — matches the typical retrieval_top_k while
    staying well below OpenAI's per-minute rate limits for
    gpt-4o-mini. Setting to 1 reproduces the previous sequential
    behaviour.
  - `FakeLLMClient.generate_json` and `enqueue` now hold a
    `threading.Lock` around the response deque and counters so the
    test fakes stay deterministic under concurrent calls.

New tests:

  - test_pipeline_parallel_chunk_extraction_produces_all_records:
    content-keyed fake confirms every chunk in the top-k is processed
    by parallel workers and records survive provenance checks.
  - test_pipeline_sequential_and_parallel_produce_same_record_count:
    chunk_workers=1 and chunk_workers=4 yield identical record counts
    and identical input-token totals on the same input.
  - test_pipeline_parallel_isolates_chunk_failures: a fake that throws
    on the 2nd chunk doesn't kill the doc — the other three still
    complete and the doc is marked processed.

Live verification on the 6 real biosecurity documents: wall-clock per
doc drops ~2× across the board (WHO mpox 24s→13s, WHO cholera 11s→3s,
MMWR 11s→3s, ECDC CDTR 37s→15s). Record counts stable within LLM
stochasticity.
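
The dispatch pattern can be sketched as below, assuming a generic `extract_one` callable; the function name `extract_all` and the result shape are hypothetical, not the real `InsightPipeline.run`.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, TypeVar

logger = logging.getLogger(__name__)

C = TypeVar("C")
R = TypeVar("R")


def extract_all(
    chunks: Sequence[C],
    extract_one: Callable[[C], R],
    chunk_workers: int = 6,
) -> List[Optional[R]]:
    """Run extract_one over every chunk; a failed chunk yields None."""

    def safe(chunk: C) -> Optional[R]:
        try:
            return extract_one(chunk)
        except Exception:
            # One bad chunk must not abort the whole document.
            logger.exception("Chunk extraction failed")
            return None

    workers = min(chunk_workers, len(chunks))
    if workers <= 1:
        # Sequential fallback: chunk_workers=1, a single chunk, or no chunks.
        return [safe(chunk) for chunk in chunks]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with chunks.
        return list(pool.map(safe, chunks))
```

The caller can then tally token usage from the returned list on the main thread, which is why no lock is needed around the budget tracker in this arrangement.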

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Live observation from item 6 (partial-date dedup) was that the model
emitted lots of different metric_name strings for what was essentially
the same metric — "confirmed cases" / "cases" / "reported cases" /
"total cases" / "total_cases" / "cholera cases" / "new cases" / "Cases"
all appeared in a single live run, and they prevented the dedup logic
from merging facts about the same event. ECDC alone produced 9
distinct metric_name variants of "case count".

Tests show that listing a canonical snake_case vocabulary directly in
the extraction prompt — with explicit guidance that qualifiers (sex,
sub-region, time-period) belong in `summary` or `location` rather than
in `metric_name` — collapses the diversity dramatically: 17 unique
metric_names → 4–6 across the same 6 real biosecurity test documents,
all drawn from the canonical list. The model can still invent a
short snake_case label when none of the canonical values fit.

The vocabulary covers the common biosecurity metrics observed in
live tests (confirmed_cases, suspected_cases, probable_cases,
confirmed_or_probable_cases, deaths, hospitalizations, recoveries,
vaccinations_administered, vaccine_doses_distributed, affected_herds,
affected_animals, new_outbreaks_declared, reproductive_number,
case_fatality_ratio).

This change works together with the value-aware dedup added in the
follow-up commit — together they let real duplicates merge cleanly
while preventing the merge from being too aggressive when the model
misattributes locations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The two-stage dedup added in item 6 groups records by (event_type,
metric_name, normalized_location) and merges any whose date buckets
overlap. But it didn't compare metric_value — so two records that
share a dedup key with overlapping dates AND disagreeing values would
still merge, silently dropping one value.

This matters in practice because LLMs occasionally misattribute
locations. Live test on the WHO cholera doc exposed exactly this:
the model emitted "In January 2026, the African Region reported the
highest number of cases (9782 cases; 13 countries)" with
location=DRC and value=9782 (the African Region figure incorrectly
tagged with DRC). Under the previous dedup logic this merged into
the legitimate DRC 6543 records, hiding the attribution error and
silently dropping the 9782 value from the dataframe.

New `_values_compatible(v1, v2)` helper allows merging when:
  - Both values are None (no count claimed)
  - Either value is None (one source omitted the count)
  - Values are equal
  - Values are within 1% relative tolerance (accommodates rounding
    e.g. "about 6500" vs "6543" — same fact, different precision)

Values further apart are treated as distinct facts and kept as
separate records, surfacing the conflict to downstream consumers
rather than burying it.

Live verification on the 6 real biosecurity documents: WHO cholera
went from 1 (over-merged, value lost) → 2 records (legitimate DRC
merge + Africa Region preserved separately). Total record count
35 → 40, all 5 additional records are legitimate distinct facts
the vocabulary-only configuration silently dropped.

Three new tests cover: distinct-value rejection, within-tolerance
merge (rounding), and one-value-None merge.
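
The merge rule can be sketched in a few lines. This version uses `math.isclose`, which measures the 1% tolerance against the larger of the two magnitudes; whether the real `_values_compatible` uses the same base is an assumption.

```python
import math
from typing import Optional


def values_compatible(
    v1: Optional[float], v2: Optional[float], rel_tol: float = 0.01
) -> bool:
    """True when two metric values may describe the same fact."""
    if v1 is None or v2 is None:
        # A source that omitted the count cannot contradict the other.
        return True
    return math.isclose(v1, v2, rel_tol=rel_tol)
```

For the cholera example, 6500 vs 6543 differs by about 0.7% and merges, while 9782 vs 6543 is kept as two separate records.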

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Item 10 stopped short of full retirement: it migrated filtering to the
shared `bioscancast.llm.base.LLMClient` protocol but left
`bioscancast.llm.client` in place because the search stage still
called it. This change finishes the migration and deletes the legacy
module — one LLM protocol for the whole codebase.

Search-stage changes:

  - `bioscancast/stages/search_stage/query_decomposition.py` now
    builds (system, user, schema) triples for both the question-type
    classifier and the sub-query decomposer. Two new JSON schemas
    (`CLASSIFY_SCHEMA`, `DECOMPOSE_SCHEMA`) constrain the model
    output to the existing QUESTION_TYPES and VALID_AXES enums.
    Calls switch from `generate_json(prompt) -> dict` to the kwargs
    form returning `LLMResponse`.
  - `bioscancast/stages/search_stage/pipeline.py` imports `LLMClient`
    from `bioscancast.llm.base` instead of the legacy module.
  - `scripts/run_search_stage.py` instantiates `OpenAILLMClient`
    (the production class for the shared protocol) instead of the
    legacy `OpenAIClient`.

Test updates:

  - `bioscancast/tests/test_query_decomposition.py` rewritten:
    `FakeLLMClient` now implements the shared protocol (kwargs +
    `LLMResponse`), responses are wrapped via a small `_resp` helper,
    and two new tests assert the right schema is passed to each call.
  - `bioscancast/tests/test_filtering_llm.py` docstring updated to
    note the legacy module is gone; the regression check is kept
    around as a guard against anyone reintroducing it.

Cleanup:

  - `bioscancast/llm/client.py` deleted.
  - `bioscancast/llm/__init__.py` no longer exports the now-defunct
    `FilteringLLMClient` / `OpenAIClient` aliases; nothing in the
    repo referenced them.
  - The follow-up TODO line in `bioscancast/insight/README.md` is
    removed — there is no remaining migration debt.

All 348 tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit deleted bioscancast/llm/client.py. The regression
test that asserted filtering didn't import from that module is now
dead weight: any attempt to import the deleted module would fail at
import time with ModuleNotFoundError — much louder than a source-text
grep assertion.

Source-text grep tests are generally fragile too: they couple to
implementation details (the literal `from bioscancast.llm.client`
string) rather than behaviour. The new-protocol tests already in
this file verify the actual filter calls work correctly, which is
what we actually care about.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The HTML parser was already calling trafilatura, but only using the
plain-text output as a `raw_text` fallback. Sections were rebuilt by
walking the entire raw DOM body — which on pages like CIDRAP (whose
body contains the target article plus three unrelated articles plus
a "top reads" sidebar plus a footer) produced 18 chunks of mostly
noise. The investigation showed this wasn't a deliberate design
decision; the original code just used trafilatura.extract() at its
default text-only setting and didn't discover the structured output
modes.

Switching to `trafilatura.extract(output_format='xml', include_tables=True)`
gives us a cleaned `<doc><main>...` tree with the article body
properly isolated. The new `_extract_sections_from_trafilatura_xml`
walks that tree the same way the DOM walker walks BeautifulSoup,
emitting heading-stack-aware sections plus tables.

The previous DOM walker is kept as a fallback: when trafilatura's
output has less than 200 chars of body text (listing pages, error
pages, or pages whose layout trafilatura's heuristics misjudge), the
parser falls through to the original code path so we never silently
extract nothing.
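
The fallback decision can be illustrated with the standard library alone, taking the XML string that `trafilatura.extract(..., output_format='xml', include_tables=True)` would return as input. The function names and the returned labels are hypothetical; only the 200-character threshold comes from the text above.

```python
import xml.etree.ElementTree as ET
from typing import Optional

MIN_BODY_CHARS = 200


def body_text_length(xml_output: Optional[str]) -> int:
    """Count whitespace-collapsed characters under <main> (or the root)."""
    if not xml_output:
        return 0
    try:
        root = ET.fromstring(xml_output)
    except ET.ParseError:
        return 0
    main = root.find("main")
    node = main if main is not None else root
    return len(" ".join("".join(node.itertext()).split()))


def choose_body_path(
    xml_output: Optional[str], min_chars: int = MIN_BODY_CHARS
) -> str:
    """Pick the structured path unless the extracted body is too thin."""
    if body_text_length(xml_output) >= min_chars:
        return "trafilatura_xml"
    return "dom_walker"
```

Treating a missing or unparseable result the same as a thin body means listing pages, error pages, and extractor failures all land on the DOM walker.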

Title, published_date, and language continue to come from raw DOM
head/meta queries — those are reliable regardless of which body path
runs.

Verified end-to-end:

  - CIDRAP fixture: 18 chunks (4 unrelated articles + nav + footer)
    → 2 chunks (just the Utah measles article). The "602 cases"
    headline survives; the insight pipeline still produces a record
    citing it.
  - ProMED fixture: chunks dropped from 7 to 6 (header tagline gone);
    the 157-row outbreak table is fully preserved in `table_rows`.
  - PDF docs (WHO mpox/cholera, CDC MMWR, ECDC, Africa CDC): unchanged
    (PDFs don't go through this parser).
  - Live spot-checks on 5 fresh HTML sources (WHO DON article, CIDRAP
    homepage, CDC HAN alert, Reuters healthcare landing, a 404 page)
    all behave correctly. Articles get clean structural extraction;
    listings get a small but useful section; 404s correctly take the
    fallback path.

Test changes:

  - CIDRAP's min-chunks floor in test_insight_real_docs_integration.py
    drops from 5 to 1, reflecting the smaller, cleaner extraction
    output (2 chunks). Other fixtures unchanged.
  - New `TestTrafilaturaXmlExtraction` class in test_extraction_html.py
    with synthetic-HTML cases for sibling-article stripping, table
    preservation, DOM-walker fallback on thin pages, and metadata
    extraction independent of which body path runs.

All 351 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…atforms

The previous date extractor checked only four patterns:
``article:published_time``, ``og:published_time``, ``<meta name="publication_date">``,
and ``<time datetime>``. That's enough for CIDRAP-style article pages
but silently returns ``None`` for sources using JSON-LD Schema.org
(common on news sites), Dublin Core (common on government / academic
sites), and the sailthru / parsely conventions widely used by news
platforms.

The new ``_iter_date_candidates`` walks a prioritised list:

  1. Publication-semantic ``<meta>``: article:published_time,
     og:article:published_time, og:published_time, sailthru.date,
     parsely-pub-date, article.published, pubdate.
  2. JSON-LD datePublished (walking nested @graph entries too —
     Article-inside-WebPage is a common JSON-LD shape).
  3. Dublin Core publication-semantic: DC.date.issued, dcterms:issued,
     DC.date.created, dcterms:created.
  4. Legacy publication_date.
  5. JSON-LD dateCreated.
  6. First ``<time datetime>``.
  7. Modification-semantic (last resort): article:modified_time,
     og:modified_time, JSON-LD dateModified.
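
The nested JSON-LD walk in step 2 could look roughly like the sketch
below. The function names are hypothetical, and the sketch returns the
raw string rather than a parsed date; it only illustrates descending
into `@graph` and other nested containers.

```python
import json


def iter_jsonld_values(node, key):
    """Yield every string stored under `key`, descending into dicts and lists."""
    if isinstance(node, dict):
        if isinstance(node.get(key), str):
            yield node[key]
        for value in node.values():
            yield from iter_jsonld_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_jsonld_values(item, key)


def jsonld_date(script_text, key="datePublished"):
    """Return the first value for `key` in a JSON-LD script body, or None."""
    try:
        data = json.loads(script_text)
    except ValueError:  # malformed JSON-LD is common in the wild
        return None
    return next(iter_jsonld_values(data, key), None)
```

Calling it once with `datePublished` and later with `dateCreated` or
`dateModified` keeps each key at its own priority tier.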

Bare ``DC.date`` is deliberately ignored — CDC HAN alerts (and many
other sites) use it as a last-rendered timestamp rather than a
publication date, and returning it would silently regress data quality.
Body-text regex extraction is also deliberately out of scope: too many
pages (especially epi articles) contain dozens of unrelated dates that
a regex would mistakenly pick up.
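
As an illustration of the `<meta>` part of the chain, here is a
simplified sketch that covers only the meta-tag tiers (it skips the
JSON-LD and `<time>` tiers that the real chain interleaves between
them). `META_PRIORITY`, `_MetaCollector`, and `first_meta_date` are
hypothetical names. Bare `DC.date` is left out of the list on purpose,
so it can never be returned.

```python
from html.parser import HTMLParser

# Meta-only subset of the priority chain; bare "dc.date" is intentionally absent.
META_PRIORITY = [
    "article:published_time", "og:article:published_time", "og:published_time",
    "sailthru.date", "parsely-pub-date", "article.published", "pubdate",
    "dc.date.issued", "dcterms:issued", "dc.date.created", "dcterms:created",
    "publication_date",
    "article:modified_time", "og:modified_time",  # last resort
]


class _MetaCollector(HTMLParser):
    """Collect <meta name|property=... content=...> pairs, keys lower-cased."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        key = a.get("property") or a.get("name")
        content = a.get("content")
        if key and content:
            # Keep the first occurrence of each key.
            self.meta.setdefault(key.lower(), content.strip())


def first_meta_date(html):
    """Return (key, raw value) for the highest-priority meta tag, or None."""
    collector = _MetaCollector()
    collector.feed(html)
    for key in META_PRIORITY:
        if key in collector.meta:
            return key, collector.meta[key]
    return None
```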

Honest live impact on the 5 sources we currently exercise:

  - WHO DON article: still None (date is in body text only — pattern
    deliberately not handled here).
  - CDC HAN alert: still None (its only ISO-format meta is ``DC.date``,
    which is a modification stamp; the cdc:* namespace entries are
    free-text and inconsistent).
  - CIDRAP homepage and fixture article: unchanged (still extracted
    correctly via article:published_time / <time>).
  - ProMED fixture: still None (no metadata anywhere — legitimately).

The change is preventive robustness for the source classes we'll
encounter beyond the current fixture set. Twelve new tests pin each
pattern in the priority chain so future additions don't shuffle the
precedence silently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds structural URL date extraction as a candidate in the
published-date priority chain. Sits below all publication-semantic
metadata (article:published_time, JSON-LD datePublished, Dublin Core
issued/created, etc.) but above the generic <time> tag and any
modification-semantic source. The placement matches the user's
instruction: meta wins when it exists, URL provides a fallback when
it doesn't.

Why URL parsing is robust enough to outrank generic <time> tags:

  - It's structurally bounded — patterns must occupy a full path
    segment or be anchored to its start. Year-shaped numbers buried
    inside slugs ("section-2024-summary") are NOT picked up.
  - Editors put dates in URLs as a publishing convention, not a
    coincidence. The signal is editorially intentional in a way that
    arbitrary <time> tags scattered around a page are not.
  - It avoids the failure mode where a sidebar's <time> tag for a
    related article is mistakenly read as the page's own date.

Patterns recognised, in order of specificity (most-specific wins):

  1. Full ISO in one segment: /2024-08-13/ or /20240813/
  2. Three consecutive segments: /2024/08/13/
  3. Year-month in one segment: /2024-08/
  4. Two consecutive segments: /2024/08/
  5. Year-prefixed slug (separator required): /2024-DON530, /2024_q3
  6. Bare year segment: /2024/

Years are bounded to [1990, 2100] to reject typoed far-past or
far-future dates. Patterns degrade gracefully: /2024/13/ (invalid
month) still resolves to 2024-01-01 via pattern 6 rather than
returning None.
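
The pattern list above can be sketched as a sequence of passes, most
specific first. This is an illustrative reimplementation, not the code
in this PR: `date_from_url` and `_make` are hypothetical names, and the
choice to default a missing month or day to 1 is inferred from the
`2024-01-01` results reported below.

```python
import re
from datetime import date
from urllib.parse import urlparse

MIN_YEAR, MAX_YEAR = 1990, 2100  # bounds quoted in the commit message


def _make(year, month=1, day=1):
    """Return a date, or None if out of range or not a real calendar date."""
    if not MIN_YEAR <= year <= MAX_YEAR:
        return None
    try:
        return date(year, month, day)
    except ValueError:
        return None


def date_from_url(url):
    segs = [s for s in urlparse(url).path.split("/") if s]
    pairs = ["/".join(p) for p in zip(segs, segs[1:])]
    triples = ["/".join(t) for t in zip(segs, segs[1:], segs[2:])]
    # Each pass must match a whole segment (or run of segments), so
    # year-shaped numbers buried inside a slug are never picked up.
    passes = [
        (segs, r"(\d{4})-(\d{2})-(\d{2})"),     # 1. /2024-08-13/
        (segs, r"(\d{4})(\d{2})(\d{2})"),       # 1. /20240813/
        (triples, r"(\d{4})/(\d{2})/(\d{2})"),  # 2. /2024/08/13/
        (segs, r"(\d{4})-(\d{2})"),             # 3. /2024-08/
        (pairs, r"(\d{4})/(\d{2})"),            # 4. /2024/08/
        (segs, r"(\d{4})[-_].+"),               # 5. /2024-DON530, /2024_q3
        (segs, r"(\d{4})"),                     # 6. /2024/
    ]
    for items, pattern in passes:
        for item in items:
            m = re.fullmatch(pattern, item)
            if m:
                found = _make(*(int(g) for g in m.groups()))
                if found:
                    return found  # an invalid match falls through to later passes
    return None
```

Because an invalid match does not stop the search, `/2024/13/` fails
pattern 4 and is then picked up by pattern 6.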

Live impact on the three problem sources from the previous commit:

  - WHO DON .../item/2024-DON530:  None -> 2024-01-01 (year prefix slug)
  - CDC HAN .../han/2024/...:      None -> 2024-01-01 (bare year segment)
  - ProMED .../promed-posts/:      None -> None (no URL date — correct)

Fixture and live CIDRAP sources unchanged — their meta tags continue
to win, as the priority chain demands.

15 new test cases pin every URL pattern, the precedence vs each
higher-tier metadata source, the precedence vs <time> and
modified-time (URL beats both), out-of-range year rejection, and
graceful degradation for partially-invalid dates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The insight stage shipped with a `use_strong_model_refinement: bool`
config flag and a placeholder branch in `InsightPipeline.run` that
only appended a "not yet implemented" note when the flag was True.
A `strong_model: str = "gpt-4o"` config sibling pointed at the
intended refinement model.

That state is the worst of all worlds: users who enable the flag get
a silent no-op (their records aren't refined, the only sign is a
note buried in `InsightRunResult.notes`), and future contributors
have to reverse-engineer the design intent. Better to remove the
scaffold entirely and revisit when there's data showing a strong
pass is actually needed.

Design intent is preserved in issue #26, labelled `post-benchmark`.
That issue captures:

  - What the two-pass design was supposed to do
  - When to revisit (after the first end-to-end benchmark run
    identifies failure modes — confidence miscalibration, misclassified
    event_type, missing structured fields, or subtle hallucinations
    surviving the substring guard)
  - Concrete scope when implementing (confidence-only vs field-fill
    vs full refinement, budget plumbing, reject-path semantics, tests)
  - Definition of done with cost guardrails

Changes:

  - `bioscancast/insight/config.py`: removed `use_strong_model_refinement`
    and `strong_model` from both INSIGHT_CONFIG and InsightConfig.
  - `bioscancast/insight/pipeline.py`: removed the placeholder branch.
  - `bioscancast/insight/README.md`: replaced the TODO bullet with a
    pointer to issue #26 explaining why the scaffold is gone.

All 384 tests still pass — the flag had no behaviour to test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…erface

contamination.py and test_historical_topup.py were both added in
feat/as-of-date-replay (#24) against the legacy bioscancast.llm.client
module, which feat/insight-stage-hardening (#27) deleted as part of the
LLM-protocol migration. The textual merge succeeded but left a broken
import and two stale fake LLM classes using the old single-arg
generate_json signature.

Migrates both to the new structured generate_json(system=, user=, schema=,
model=, max_tokens=) interface returning LLMResponse, with explicit
prompt/schema constants in contamination.py.
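
For readers unfamiliar with the new interface, a fake LLM written
against the keyword-only signature might look like the sketch below.
Only the parameter names (`system`, `user`, `schema`, `model`,
`max_tokens`) and the `LLMResponse` return type name come from this
commit message. The `data` field, the `LLMClient` protocol name, and
`FakeLLM` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class LLMResponse:
    # Hypothetical shape: the real LLMResponse fields are not shown in this PR text.
    data: dict[str, Any]


class LLMClient(Protocol):
    """Structural type for anything exposing the structured generate_json call."""

    def generate_json(self, *, system: str, user: str, schema: dict[str, Any],
                      model: str, max_tokens: int) -> LLMResponse: ...


@dataclass
class FakeLLM:
    """Test double that records each call and returns a canned payload."""

    payload: dict[str, Any]
    calls: list[dict[str, Any]] = field(default_factory=list)

    def generate_json(self, *, system, user, schema, model, max_tokens):
        self.calls.append({"system": system, "user": user, "schema": schema,
                           "model": model, "max_tokens": max_tokens})
        return LLMResponse(data=self.payload)
```

Keyword-only arguments make a stale single-argument call fail with a
TypeError immediately instead of silently passing the prompt as
`system`.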

pytest: 447 passed, 2 skipped.
@smodee smodee force-pushed the feat/insight-stage-hardening branch from 1f19eaa to 2d77493 Compare May 27, 2026 13:14