Insight stage hardening: 11 improvements + 5 deferred-but-included items by smodee · Pull Request #27 · algorithmicgovernance/BioScanCast

smodee · 2026-05-26T20:36:02Z

Summary

Hardens the insight stage so the pipeline produces usable records on real biosecurity documents (WHO sitreps, CDC MMWR, ECDC CDTR, CIDRAP, ProMED, Africa CDC). The branch started with 8 items from a consolidated action list and grew to 16 commits as testing surfaced additional issues that were small, related, and clearly worth fixing in-scope.

The unifying theme: fix the things that prevent the insight pipeline from producing good records on real inputs. Some items touch extraction-stage code (HTML/PDF parsing) because insight-stage symptoms had extraction-stage causes.

Depends on #24

Rebased onto feat/as-of-date-replay to absorb the LLM-interface migration for contamination.py and the historical_roleplay parameter for query_decomposition.py. Without this rebase, the textual merge of #24+#27 produces broken imports (#24's contamination module uses the legacy bioscancast.llm.client that #27 deletes) and a conflict in query_decomposition.py (both PRs modified it independently). Merging this PR after #24 is now self-consistent; the orchestrator branch's cleanup commit becomes redundant on its next rebase.

What's included

| # | Commit | What it does |
|---|---|---|
| 1 | `810327f` | Loosen hallucination guard to accept NFKC / terminal-punctuation / wrapping-punctuation drift while still rejecting content insertions and synthesized quotes |
| 2 | `5c50f99` | New real-document integration test in CI: 23 assertions across 6 real biosec docs, runs in ~7s with fakes |
| 3 | `99f93dd` | Drop or repair empty chunks at extraction time; recovers MMWR's borderless-table chunk and ProMED's 157-row outbreak table |
| 4 | `a47be32` | Partial-date dedup: `event_date_precision` field + two-stage dedup so `2026-01-25` (day) merges with `2026-01` (month) |
| 5 | `ecdb726` | Replace 30-country hardcoded map with `pycountry`: full ISO 3166-1 coverage, US states / regional rejection |
| 6 | `d3205a4` | Reject filename-shaped PDF `/Title` metadata (e.g. ECDC's `2026-WCP-0020 Draft.docx` no longer leaks through) |
| 7 | `49e374d` | Migrate filtering stage to the shared `LLMClient` protocol |
| 8 | `e29103d` | Parallelise per-chunk extraction with `ThreadPoolExecutor`: ~2× per-doc wall-clock |
| 9 | `efe62c3` | Controlled vocabulary for `metric_name` in the extraction prompt; collapses 17 phrasing variants to 5 canonical ones |
| 10 | `95e4738` | Value-aware dedup: refuse to merge when `metric_value` differs, catches LLM location-attribution errors |
| 11 | `ce3ba11` | Complete the LLM migration: search stage moved to shared protocol, legacy `bioscancast/llm/client.py` deleted |
| 12 | `366aed8` | Remove redundant legacy-import regression test (the module it forbade no longer exists) |
| 13 | `e324d19` | Use `trafilatura`'s structural XML output instead of throwing it away: CIDRAP chunks 18 → 2, ProMED table preserved |
| 14 | `a0b6d9c` | Broaden HTML `published_date` extraction: JSON-LD, Dublin Core (`dcterms:issued`/`DC.date.issued`), sailthru/parsely, +modification-semantic fallback |
| 15 | `537fd93` | URL path date parsing: extracts year from `/han/2024/...` (CDC HAN) and `/item/2024-DON530` (WHO DON) |
| 16 | `618484e` | Remove `use_strong_model_refinement` no-op flag; design preserved in #26 |
| 17 | `2d77493` | Migrate `contamination.py` and the two as-of-date tests from `bioscancast.llm.client` to the shared `bioscancast.llm.base.LLMClient` interface; closes the cross-PR gap with #24 |

Headline numbers — final live verification on 6 real biosecurity docs

| Metric | Before this branch | After this branch |
|---|---|---|
| Records extracted | 23 | 39 (+70%) |
| Per-run wall-clock | 101.9s | 37.8s (2.7×) |
| Unique `metric_name` strings | 9 (with phrasing drift) | 5 (all from canonical vocab) |
| Records with `iso_country_code` populated | 22% (5/23) | 74% (29/39) |
| Hallucination guard catches fabrications | yes | yes |
| Cross-doc dedup merges correctly | yes | yes |
| Africa CDC fails fast with `requires_ocr` | yes | yes |
| Estimated cost per run (gpt-4o-mini) | $0.0058 | $0.0075 |

Cost increase (~30%) is driven entirely by ProMED now contributing real content (rendered outbreak table) and ECDC producing ~2× the records — both increases are buying real value, still solidly under $0.01 per run.

What's NOT in this PR

Deferred deliberately. Not blockers for forecasting; each warrants its own dedicated scope:

- End-to-end orchestrator (item 2 from the original list): needs a coherent design for chaining search → filter → extract → insight per ForecastQuestion. Bigger scope, separate branch.
- Historical-roleplay clarification (item 13 from the original list): a benchmark-fairness question. Better addressed in a benchmark-prep PR with the full historical-replay context in view.
- Strong-model refinement pass: scaffold removed in commit 618484e; design preserved in #26 ("Strong-model refinement pass in the insight stage", labelled post-benchmark). Revisit after the first benchmark run identifies whether the cheap-model records need refinement.

Test plan

- `python -m pytest bioscancast/tests/`: 447 passed, 2 skipped, 0 failed on the rebased branch (was 221 before; +226 new tests across this PR and #24, "Add historical-replay mode for benchmark fairness")
- New real-document integration test (`bioscancast/tests/test_insight_real_docs_integration.py`) confirms the extraction + insight pipeline produces records on all 6 working biosecurity documents
- Live verification on the 6-doc fixture set + spot-checks on 5 live HTML sources (WHO DON, CDC HAN, CIDRAP homepage/article, Reuters healthcare, a 404 page); see headline numbers above
- Hallucination guard still rejects all fabricated quotes across all 6 docs
- Cross-doc dedup still merges twin facts into one record with provenance from both docs
- `contamination.py` and the two as-of-date tests from #24 pass after migration to the shared `LLMClient` interface

Reviewer notes

- All commits are individually meaningful; per-commit review is straightforward and matches the per-item structure
- The first 4 commits in the log (before commit 1 of this PR) are from #24 ("Add historical-replay mode for benchmark fairness") and are reviewed in that PR, not here
- Investigation artifacts from the iterative testing (LLM-run outputs under `data/investigations/`, the manual eval script at `scripts/eval_insight_on_real_docs.py`) are deliberately not committed
- No changes to schemas that would break existing consumers; changes are additive only (`event_date_precision` is a new optional field on `InsightRecord`)

🤖 Generated with Claude Code

Activated by setting ForecastQuestion.as_of_date; None (default) preserves live behavior unchanged.

Core:
* Schema fields published_date_source, cutoff_applied, fetch_strategy, snapshot_timestamp on SearchResult and Document for post-hoc audit
* SearchStagePipeline filters post-cutoff results; recovers undated ones from URL slug or Wayback first-seen; drops the unrecoverable
* lookup_dashboards rewrites URLs to closest Wayback snapshot at-or-before cutoff; suppresses dashboards with no pre-cutoff snapshot
* ExtractionPipeline fetches via Wayback id_ snapshots when cutoff is set; falls back to live with strategy recorded on Document
* SearchCache key incorporates as_of_date so replays don't collide
* Optional historical_roleplay decomposition prompt
* eval_stage/contamination.py adds filter_caught_contamination_rate (explicit lower bound, never silent) and retrieval_free baseline (E4)

Hardening surfaced by live testing:
* Tavily end_date accepted on the Protocol but not forwarded (verified empirically not to filter)
* Wayback CDX retries on 5xx / 429 / timeout with exponential backoff
* Sub-queries get cutoff year appended in historical mode
* Top-up round with bigger max_results when survivors < threshold
* RFC 2822 date parsing for Tavily news topic responses

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Tavily's news endpoint silently ignores `end_date` when passed alone but honors the start_date+end_date pair, returning 20/20 native pre-cutoff results across the resolved corpus (q1, q3, q7, q9). The pipeline now synthesizes `start_date = as_of_date - 365d` (configurable via `historical_lookback_days`) and forwards both bounds. The 0/20 pre-cutoff disaster on live testing of q1 is fully addressed by this change; the post-retrieval cutoff filter remains as defense-in-depth. The TavilyBackend drops a lone `end_date` with a warning rather than sending a request Tavily will misinterpret. Stale comments referring to "Tavily ignores end_date" are updated across pipeline.py and tavily_backend.py. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
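A minimal sketch of the window synthesis described above, assuming ISO `YYYY-MM-DD` strings are what the backend forwards. The function names `build_date_window` and `sanitize_date_params` are hypothetical and are not taken from the PR; only `as_of_date`, `historical_lookback_days`, `start_date` and `end_date` come from the commit message.

```python
import logging
from datetime import date, timedelta
from typing import Dict, Optional

logger = logging.getLogger(__name__)


def build_date_window(
    as_of_date: Optional[date], historical_lookback_days: int = 365
) -> Dict[str, str]:
    """Return both date bounds for a historical query, or nothing in live mode."""
    if as_of_date is None:
        return {}
    start = as_of_date - timedelta(days=historical_lookback_days)
    return {"start_date": start.isoformat(), "end_date": as_of_date.isoformat()}


def sanitize_date_params(params: Dict[str, str]) -> Dict[str, str]:
    """Drop a lone end_date (with a warning) instead of forwarding it."""
    if "end_date" in params and "start_date" not in params:
        logger.warning("Dropping end_date: no start_date was supplied with it")
        return {k: v for k, v in params.items() if k != "end_date"}
    return dict(params)
```

With the default lookback, `build_date_window(date(2026, 1, 31))` yields a window starting on 2025-01-31, and a request carrying only `end_date` loses that key before it is sent.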

The Wayback CDX endpoint rate-limits at ~60 req/min server-side. The existing reactive RETRY_BACKOFF_SECONDS = (0, 10, 30, 90, 240) ladder only fires after the server has already started returning 429s, burning ~6 min per failure. Historical-replay benchmarks routinely hit dozens of these failures, producing ~30 min wall-clock on q1.

This commit adds two complementary measures:

1. Proactive throttle in wayback.py: a module-level _throttle() gate sleeps before every urlopen to maintain a 2.0 s minimum interval (~30 req/min, half the server cap). Overridable via env var BIOSCANCAST_WAYBACK_MIN_INTERVAL_SECONDS. The retry ladder is unchanged and still handles genuine 503s / read timeouts.
2. Selective-recovery gate in pipeline._apply_cutoff_filter: skip the Wayback first-seen leg of recover_published_date() for aggregator domains (metaculus, manifold, kalshi, ...) and source_tier=="unknown". The URL-slug regex and Last-Modified strategies still run for gated results. New wayback_skipped counter in the cutoff-filter log line.

Live q1 smoke test: ~30 min -> 49 s (~37x). Test suite: 252 -> 261 passed (9 new tests covering the throttle gate, env override, retry interaction, gate decisions, and end-to-end recovery routing).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
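The proactive throttle can be sketched roughly as follows. This is an illustration under assumptions, not the code in wayback.py: the class name `MinIntervalThrottle`, the injectable `clock`/`sleep` parameters, and the `interval_from_env` helper are all hypothetical; only the 2.0 s default and the env var name come from the commit message.

```python
import os
import threading
import time


def interval_from_env(default: float = 2.0) -> float:
    """Read the minimum interval from the env var, falling back to the default."""
    raw = os.environ.get("BIOSCANCAST_WAYBACK_MIN_INTERVAL_SECONDS")
    try:
        return float(raw) if raw is not None else default
    except ValueError:
        return default


class MinIntervalThrottle:
    """Block callers so that consecutive calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock      # injectable so tests need not really sleep
        self._sleep = sleep
        self._lock = threading.Lock()
        self._last_call = None

    def wait(self):
        with self._lock:
            now = self._clock()
            if self._last_call is not None:
                remaining = self._min_interval - (now - self._last_call)
                if remaining > 0:
                    self._sleep(remaining)
                    now = self._clock()
            self._last_call = now
```

A caller would invoke `throttle.wait()` immediately before each network request. Because the gate runs before the request rather than after a 429, the reactive retry ladder is left to handle only genuine server errors.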

probe_tavily_topic.py grows from a one-query news/general comparison into a corpus iterator with config caching, synthetic-backdated stress queries, and per-knob result dumping. analyze_tavily_probe.py is new: it reads the cached probe payloads and recomputes hit-rate tables without re-paying the Tavily quota. Both scripts default to writing/reading specs/probe-results/ (gitignored by convention; create on first run). They were the workhorses behind the start_date+end_date investigation that produced commit 211f6df. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The previous strict whitespace-normalised substring check rejected real factual quotes whenever the LLM made any small punctuation or unicode adjustment, and live tests showed it does so constantly on real WHO/CDC/ECDC documents. The headline failure was CDC MMWR producing a single record from a doc that says "99 cases" five times, because every variant the model emitted had either a trailing period where the source had a comma, parens dropped around "(NMDOH)", or a smart-quote vs straight-quote mismatch.

The new layered guard `_quote_matches` accepts a quote that matches the chunk under any of three increasingly permissive normalisations:

1. NFKC + explicit typography fold (smart quotes, em/en dashes, ellipsis) + whitespace collapse.
2. Strip terminal ".;,:!?" from the quote.
3. Strip wrapping punctuation `()[]{}"'` from both sides.

The function returns the canonical chunk substring it matched against, so ChunkReference.quote always stores a verbatim chunk excerpt rather than the model's altered output.

Tests show this lifts real-fact capture from 23 to 34 records on the 6 real biosecurity test documents (a 48% increase) while still rejecting fabricated quotes, content-insertion hallucinations (extra words in a list), synthesised prefix-onto-fragment paraphrases, and wrong-number alterations.

The 30-character-window match described in the plan turned out to be unnecessary once the typography fold and paren-strip layers were in place, and would have weakened the guard on real hallucinations without buying additional real-quote coverage.

New parametrised tests cover each layer plus the rejection cases.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
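The three layers might look something like the sketch below. It is a simplification under stated assumptions: layer 3 is read here as stripping wrapping punctuation from both ends of the quote, and the function returns the matched span of the normalised chunk rather than a verbatim excerpt of the raw chunk as the real `_quote_matches` is described as doing. The name `quote_matches` and the exact fold table are illustrative.

```python
import re
import unicodedata
from typing import Optional

# Assumed typography fold: smart quotes, en/em dashes, ellipsis.
_TYPOGRAPHY = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
    "\u2026": "...",
})


def _fold(text: str) -> str:
    """Layer 1: NFKC, typography fold, whitespace collapse."""
    text = unicodedata.normalize("NFKC", text).translate(_TYPOGRAPHY)
    return re.sub(r"\s+", " ", text).strip()


def quote_matches(quote: str, chunk_text: str) -> Optional[str]:
    """Return the matched (normalised) span, or None if no layer matches."""
    chunk = _fold(chunk_text)
    layer1 = _fold(quote)
    layer2 = layer1.rstrip(".;,:!?")          # terminal punctuation drift
    layer3 = layer2.strip("()[]{}\"'")        # wrapping punctuation drift
    for candidate in (layer1, layer2, layer3):
        if candidate and candidate in chunk:
            return candidate
    return None
```

None of the layers rewrites words or digits, so a quote with an altered number or an inserted word still fails every layer and is rejected.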

Wires the real ExtractionPipeline (fetcher monkey-patched to read on-disk bytes from data/docling_eval/sources/) into the insight pipeline with two deterministic fake LLMs:

- QuoteEchoingFakeLLM: picks a number-bearing sentence from each retrieved chunk and emits a synthetic fact citing it verbatim. Exercises the happy path of the hallucination guard.
- HallucinatingFakeLLM: always emits a fabricated quote. The guard must drop every fact.

Covers 23 assertions across:

- extraction per source (Africa CDC fails with requires_ocr, the other 6 succeed with >= 5 chunks each)
- WHO mpox PDF metadata yields a publication date
- insight pipeline produces records for every text-extractable doc
- the CIDRAP article's "602 cases" headline appears in at least one record's quote
- failed-extraction docs are skipped without LLM calls
- the hallucination guard rejects every fabricated quote on every doc
- cross-document dedup merges twin facts into one record with sources from both docs

Uses >= thresholds so subsequent items in the insight-stage hardening plan (empty-chunk filter, partial-date dedup, pycountry resolution) can lift record counts without breaking this test. Runs in ~7 seconds with no live LLM calls.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Empty-text chunks cost the insight stage twice: they take a top-k slot in retrieval (BM25 indexes the heading even when the body is empty), and then trigger an LLM call against blank chunk text. Tests show the CDC MMWR borderless table case where this happened on a chunk that ranked #1 by retrieval score, a wasted call every time.

A new helper, `_drop_or_repair_empty_chunks`, runs in `ExtractionPipeline.extract_one` between `normalize_chunks` and chunk renumbering. Two paths:

- Table chunks with empty text but populated `table_data`: render the rows to a tab-separated text block so BM25 and the LLM can see the cell contents. `table_data` itself is preserved unchanged for any consumer that wants the structured form.
- Other empty chunks: drop with a DEBUG log carrying chunk_id, type and heading. An empty prose chunk almost always indicates a half-broken upstream parser section (heading without body, footer artefact); nothing the insight stage can act on.

Tests confirm the existing MMWR table chunk (previously empty-text) now carries 277 characters of rendered table content while the underlying rows stay accessible via `table_data`. Empty prose chunks disappear before reaching downstream stages.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
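The two paths can be illustrated with a small sketch. The `Chunk` dataclass below is a hypothetical stand-in for the project's real chunk type (its field names are assumptions), and `drop_or_repair_empty_chunks` only approximates the helper the commit describes.

```python
import logging
from dataclasses import dataclass, replace
from typing import List, Optional

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk shape; the real schema may differ."""
    chunk_id: str
    chunk_type: str
    heading: str
    text: str
    table_data: Optional[List[List[str]]] = None


def drop_or_repair_empty_chunks(chunks: List[Chunk]) -> List[Chunk]:
    kept = []
    for chunk in chunks:
        if chunk.text.strip():
            kept.append(chunk)
            continue
        if chunk.chunk_type == "table" and chunk.table_data:
            # Repair: render rows as tab-separated lines; table_data is untouched.
            rendered = "\n".join(
                "\t".join(str(cell) for cell in row) for row in chunk.table_data
            )
            if rendered.strip():
                kept.append(replace(chunk, text=rendered))
                continue
        # Drop: nothing for retrieval or the LLM to act on.
        logger.debug(
            "dropping empty chunk id=%s type=%s heading=%r",
            chunk.chunk_id, chunk.chunk_type, chunk.heading,
        )
    return kept
```

Renumbering would then run on the returned list, so dropped chunks leave no gaps in the chunk indices.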

The previous single-stage dedup key formatted every event_date as YYYY-MM-DD, which prevented merging two reports of the same event when one source gave a day and another gave only the month. Tests on the 6 real biosecurity documents showed WHO cholera repeatedly producing two records for the DRC January-2026 6543-cases fact, one from a prose sentence ("In January 2026 ... reported 6543 new cholera cases") and one from a table cell ("Democratic Republic of the Congo 6 543") with subtly different parsed dates.

Changes:

- New `event_date_precision` field on `InsightRecord` carries the granularity ("year"|"month"|"day") alongside the canonical `event_date` datetime (which is now the start of the period when only a partial date is known).
- `_parse_event_date` accepts YYYY, YYYY-MM, YYYY-MM-DD, and the existing free-form day-precision variants, returning a (datetime, precision) tuple.
- Extraction prompt instructs the model to use the most specific ISO date the chunk supports and NOT to invent a day-of-month when only a month is given.
- Two-stage `_deduplicate_records`: first group by (event_type, metric_name, normalized_location), then within each group walk records in order and merge each into the first surviving entry whose date bucket overlaps at the coarser precision. The surviving record adopts the finer precision so downstream consumers don't lose information.
- Confidence is taken as the max across merged records; provenance references from every merged source are preserved.

Tests cover the matrix of precision combinations (day/month overlap, year/day overlap, equal months, different days, undated vs dated, three-way mixed-precision merge, separate locations).

Live runs show records merging correctly when the model picks the same metric_name across sources; cases where the model uses subtly different metric names ("new cases" vs "Cases" vs "cholera cases") still stay separate. That is a metric-name-normalisation problem separate from this change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
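The precision-aware comparison at the heart of the two-stage dedup can be sketched as below. It covers only the three ISO shapes (not the free-form variants), and the names `parse_event_date`, `dates_overlap` and `finer` are illustrative rather than the PR's own helpers.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

ParsedDate = Tuple[datetime, str]

_FORMATS = (
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "%Y-%m-%d", "day"),
    (re.compile(r"\d{4}-\d{2}"), "%Y-%m", "month"),
    (re.compile(r"\d{4}"), "%Y", "year"),
)
_RANK = {"year": 0, "month": 1, "day": 2}


def parse_event_date(raw: str) -> Optional[ParsedDate]:
    """Parse YYYY, YYYY-MM or YYYY-MM-DD into (start-of-period, precision)."""
    raw = raw.strip()
    for pattern, fmt, precision in _FORMATS:
        if pattern.fullmatch(raw):
            try:
                return datetime.strptime(raw, fmt), precision
            except ValueError:  # e.g. month 13
                return None
    return None


def _bucket(dt: datetime, precision: str) -> tuple:
    parts = (dt.year, dt.month, dt.day)
    return parts[: _RANK[precision] + 1]


def dates_overlap(a: ParsedDate, b: ParsedDate) -> bool:
    """Compare both dates truncated to the coarser of the two precisions."""
    coarser = min(a[1], b[1], key=_RANK.__getitem__)
    return _bucket(a[0], coarser) == _bucket(b[0], coarser)


def finer(a: ParsedDate, b: ParsedDate) -> ParsedDate:
    """The merged record keeps whichever date carries more precision."""
    return max(a, b, key=lambda parsed: _RANK[parsed[1]])
```

Under this scheme `2026-01-25` (day) and `2026-01` (month) fall in the same month bucket and merge, while `2026-01-25` and `2026-01-26` stay separate because both are day-precise.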

The previous COUNTRY_TO_ISO dict hand-listed ~30 countries, primarily because nobody wanted to take on a dependency at the time. Live tests show the cost: most extracted records on the 6 real biosecurity docs had iso_country_code=None despite clear country names like "Austria", "Bulgaria", "Comoros", and "Madagascar" that the map simply didn't cover. With ~250 ISO 3166-1 entries, hand-maintenance was untenable.

pycountry covers all 249 ISO 3166-1 entries by canonical, common, and official name plus alpha-2/alpha-3 codes. The new resolver in `chunk_extractor.py` layers four steps:

1. Typography fold (smart quotes → ASCII) so "Côte d'Ivoire" with either apostrophe variant resolves.
2. Explicit not-a-country set ("Africa", "European Region", "EU/EEA", etc.). These are multi-country roll-ups, not single ISO entries, and pycountry's `search_fuzzy` produces surprising false positives here. Returns None deliberately.
3. Alias dict for forms pycountry won't match directly ("UK", "DRC", "Russia", "Burma", "Ivory Coast", "North Korea", and UK constituents like "England" / "Scotland" → "GB").
4. US subnational set: all 50 states plus DC and US territories → "US". Common in biosecurity reporting (CDC MMWR routinely phrases location as "Lea County, New Mexico").

`pycountry.countries.search_fuzzy` is deliberately not used: it would resolve "Eastern Mediterranean Region" to a single country (false positive). Strict matching only.

Compound locations like "Mubende district, Uganda" still work via the existing right-to-left comma-segment fallback.

Tests cover: bare country lookups across all 6 real docs' locations, the alias set, US states, multi-country region rejection, compound locations, alpha-2/alpha-3 codes, smart-quote variants.

pycountry pinned to >=24.0 in requirements.txt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
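To show how the layers compose, here is a dependency-free sketch. The strict-match step is injected as a `strict_lookup` callable and defaults to a tiny demo table; in the PR that role is played by pycountry's strict matching. All set contents below are truncated examples, and the function names are hypothetical.

```python
from typing import Callable, Optional

# Truncated illustrative sets; the real ones are larger.
_NOT_A_COUNTRY = {"africa", "european region", "eu/eea", "eastern mediterranean region"}
_ALIASES = {
    "uk": "GB", "england": "GB", "scotland": "GB", "drc": "CD",
    "russia": "RU", "burma": "MM", "ivory coast": "CI", "north korea": "KP",
}
_US_SUBNATIONAL = {"new mexico", "utah", "district of columbia"}
# Stand-in for a strict ISO 3166-1 name lookup (assumption: exact match only).
_DEMO_TABLE = {"uganda": "UG", "austria": "AT", "c\u00f4te d'ivoire": "CI"}

Lookup = Callable[[str], Optional[str]]


def _fold(name: str) -> str:
    """Step 1: fold smart apostrophes to ASCII and normalise case."""
    return name.replace("\u2019", "'").replace("\u2018", "'").strip().lower()


def resolve_segment(name: str, strict_lookup: Lookup) -> Optional[str]:
    key = _fold(name)
    if not key or key in _NOT_A_COUNTRY:   # step 2: multi-country roll-ups
        return None
    if key in _ALIASES:                    # step 3: aliases
        return _ALIASES[key]
    if key in _US_SUBNATIONAL:             # step 4: US states and territories
        return "US"
    return strict_lookup(key)              # strict match only, never fuzzy


def resolve_location(location: str, strict_lookup: Lookup = _DEMO_TABLE.get) -> Optional[str]:
    """Try comma-separated segments right to left, most general first."""
    for segment in reversed(location.split(",")):
        code = resolve_segment(segment, strict_lookup)
        if code:
            return code
    return None
```

Scanning right to left means "Lea County, New Mexico" resolves on its final segment without ever consulting the strict lookup for "Lea County".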

PyMuPDF surfaces whatever the PDF's /Title metadata field says, with no filtering. Real biosecurity PDFs often have stale conversion titles leaked from the source Word document. Tests show ECDC's CDTR returning "2026-WCP-0020 Draft.docx" as its title, which then displayed verbatim through Document.title and into any downstream consumer.

A new `_sanitize_title` helper on PdfParser drops:

- Titles ending in a document-format extension (.pdf, .docx, .doc, .odt, .rtf, .txt, .pages, .xlsx, .pptx, .html, case-insensitive).
- Empty / whitespace-only titles.
- Implausibly short titles (< 5 chars).

When the sanitiser returns None, the existing `parsed.title or filtered_doc.title` fallback chain in `extraction/pipeline.py` picks up the search-side title instead, which is the desired behaviour.

Tests confirm the ECDC stale title is dropped, every other filename extension variant is rejected, and real document titles (WHO sitreps, MMWR articles, ECDC CDTR) pass through unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
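The three rejection rules are simple enough to sketch directly. This is a standalone approximation of the described behaviour, written as a free function named `sanitize_title` rather than a method on PdfParser.

```python
from typing import Optional

_FILE_EXTENSIONS = (
    ".pdf", ".docx", ".doc", ".odt", ".rtf",
    ".txt", ".pages", ".xlsx", ".pptx", ".html",
)
_MIN_TITLE_LENGTH = 5


def sanitize_title(title: Optional[str]) -> Optional[str]:
    """Return a cleaned title, or None when it looks like a leaked filename."""
    if title is None:
        return None
    cleaned = title.strip()
    if len(cleaned) < _MIN_TITLE_LENGTH:      # empty, whitespace-only, or too short
        return None
    if cleaned.lower().endswith(_FILE_EXTENSIONS):
        return None
    return cleaned
```

Because the rejected case is `None`, an expression such as `sanitize_title(pdf_title) or search_title` falls through to the search-side title, mirroring the fallback chain the commit relies on.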

The filtering stage was using the older `bioscancast/llm/client.py:LLMClient` (single positional prompt string, returns plain dict), a parallel protocol that's been carrying a long-standing TODO in the insight README. The shared protocol at `bioscancast/llm/base.py:LLMClient` uses keyword-only system/user/schema/model/max_tokens and returns LLMResponse with structured token accounting, matching what insight (and soon forecasting) already uses.

Changes:

- `bioscancast/filtering/llm_filter.py`:
  * `build_filter_prompt` now returns a (system, user, schema) triple instead of a single concatenated JSON string. System carries the task instructions; user carries the question + candidates payload; schema is a real strict JSON Schema (not the previous example-dict).
  * `llm_filter_candidates` calls `llm_client.generate_json` with kwargs and reads `response.content["decisions"]` instead of the raw dict.
  * Adds default `model` and `max_tokens` parameters.
- `bioscancast/filtering/pipeline.py`: switches `LLMClient` import to the shared base protocol.
- `bioscancast/llm/client.py`: keeps the legacy single-positional protocol intact (search stage still uses it) but adds a top-of-file docstring warning new callers off.
- `bioscancast/insight/README.md`: the filtering-migration TODO is closed; replaced with the remaining search-stage migration as a follow-up.
- New `bioscancast/tests/test_filtering_llm.py` covers the new protocol path: prompt triple, strict JSON schema shape, the expected `generate_json(**kwargs)` call signature, missing-decision handling, empty-input shortcut, and a regression check that the filter module never re-imports the legacy client.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
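For readers unfamiliar with the shared protocol, the shape described above might look roughly like this. Everything here is a sketch: the `LLMResponse` token field names, the prompt wording, the schema's `id`/`keep` fields, the `max_tokens` default and the `EchoFake` test double are assumptions, not the contents of `bioscancast/llm/base.py`.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol, Tuple


@dataclass
class LLMResponse:
    content: Dict[str, Any]
    input_tokens: int = 0     # assumed field names for token accounting
    output_tokens: int = 0


class LLMClient(Protocol):
    def generate_json(self, *, system: str, user: str, schema: Dict[str, Any],
                      model: str, max_tokens: int) -> LLMResponse: ...


def build_filter_prompt(question: str, candidates: List[dict]) -> Tuple[str, str, dict]:
    system = "Decide which candidate documents are relevant to the question."
    user = json.dumps({"question": question, "candidates": candidates})
    decision = {
        "type": "object",
        "properties": {"id": {"type": "string"}, "keep": {"type": "boolean"}},
        "required": ["id", "keep"],
        "additionalProperties": False,
    }
    schema = {
        "type": "object",
        "properties": {"decisions": {"type": "array", "items": decision}},
        "required": ["decisions"],
        "additionalProperties": False,
    }
    return system, user, schema


def llm_filter_candidates(llm_client: LLMClient, question: str, candidates: List[dict],
                          model: str = "gpt-4o-mini", max_tokens: int = 1024) -> List[dict]:
    if not candidates:            # empty-input shortcut: no LLM call
        return []
    system, user, schema = build_filter_prompt(question, candidates)
    response = llm_client.generate_json(
        system=system, user=user, schema=schema, model=model, max_tokens=max_tokens
    )
    return response.content.get("decisions", [])


class EchoFake:
    """Test double that keeps every candidate and records its call kwargs."""

    def __init__(self) -> None:
        self.calls: List[dict] = []

    def generate_json(self, *, system, user, schema, model, max_tokens):
        self.calls.append({"model": model, "max_tokens": max_tokens, "schema": schema})
        ids = [c["id"] for c in json.loads(user)["candidates"]]
        return LLMResponse(content={"decisions": [{"id": i, "keep": True} for i in ids]})
```

Keyword-only arguments make it hard for a caller to swap `system` and `user` by accident, which is one practical benefit over a single positional prompt string.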

The insight pipeline's per-chunk LLM call loop was strictly sequential. Tests on the 6 real biosecurity documents showed each doc spending almost all its wall-clock in serial OpenAI request latency: WHO mpox ~30s, ECDC CDTR ~38s, MMWR ~11s for top-k=5 chunks. Since each extract_facts_from_chunk call is independent and the OpenAI sync client is thread-safe, this was leaving easy speedups on the floor.

Changes:

- `InsightPipeline.run` now dispatches per-chunk extractions to a `ThreadPoolExecutor` whose pool size is `min(chunk_workers, len(scored_chunks))`. With chunk_workers=1 or only one chunk, the code takes a sequential fallback path. Errors in one chunk are caught, logged, and don't abort the document.
- Budget accounting still happens serially on the main thread after futures complete, so BudgetTracker stays simple (no locks needed).
- New `chunk_workers: int = 6` field on `InsightConfig`. Six is a pragmatic default: it matches the typical retrieval_top_k while staying well below OpenAI's per-minute rate limits for gpt-4o-mini. Setting to 1 reproduces the previous sequential behaviour.
- `FakeLLMClient.generate_json` and `enqueue` now hold a `threading.Lock` around the response deque and counters so the test fakes stay deterministic under concurrent calls.

New tests:

- test_pipeline_parallel_chunk_extraction_produces_all_records: content-keyed fake confirms every chunk in the top-k is processed by parallel workers and records survive provenance checks.
- test_pipeline_sequential_and_parallel_produce_same_record_count: chunk_workers=1 and chunk_workers=4 yield identical record counts and identical input-token totals on the same input.
- test_pipeline_parallel_isolates_chunk_failures: a fake that throws on the 2nd chunk doesn't kill the doc; the other three still complete and the doc is marked processed.

Live verification on the 6 real biosecurity documents: wall-clock per doc drops ~2× across the board (WHO mpox 24s→13s, WHO cholera 11s→3s, MMWR 11s→3s, ECDC CDTR 37s→15s). Record counts stable within LLM stochasticity.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
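The dispatch pattern described above, with per-chunk error isolation and a sequential fallback, can be sketched as follows. `extract_all` and its `extract_one` callback are hypothetical names; the real logic lives inside `InsightPipeline.run`.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

logger = logging.getLogger(__name__)

C = TypeVar("C")
R = TypeVar("R")


def extract_all(chunks: Sequence[C], extract_one: Callable[[C], R],
                chunk_workers: int = 6) -> List[R]:
    """Run extract_one over every chunk; one failing chunk never aborts the doc."""
    if not chunks:
        return []

    def safe(chunk: C):
        try:
            return extract_one(chunk)
        except Exception:
            logger.warning("chunk extraction failed", exc_info=True)
            return None

    workers = min(chunk_workers, len(chunks))
    if workers <= 1:
        results = [safe(chunk) for chunk in chunks]      # sequential fallback
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(safe, chunks))       # map preserves input order
    # Anything that must stay single-threaded (e.g. budget accounting)
    # can run here, on the calling thread, after all futures finish.
    return [result for result in results if result is not None]
```

Using `pool.map` keeps results in chunk order, so the sequential and parallel paths return the same list for the same input, which is the property the second new test checks.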

Live observation from item 6 (partial-date dedup) was that the model emitted lots of different metric_name strings for what was essentially the same metric: "confirmed cases" / "cases" / "reported cases" / "total cases" / "total_cases" / "cholera cases" / "new cases" / "Cases" all appeared in a single live run, and they prevented the dedup logic from merging facts about the same event. ECDC alone produced 9 distinct metric_name variants of "case count".

Tests show that listing a canonical snake_case vocabulary directly in the extraction prompt, with explicit guidance that qualifiers (sex, sub-region, time-period) belong in `summary` or `location` rather than in `metric_name`, collapses the diversity dramatically: 17 unique metric_names → 4–6 across the same 6 real biosecurity test documents, all drawn from the canonical list. The model can still invent a short snake_case label when none of the canonical values fit.

The vocabulary covers the common biosecurity metrics observed in live tests (confirmed_cases, suspected_cases, probable_cases, confirmed_or_probable_cases, deaths, hospitalizations, recoveries, vaccinations_administered, vaccine_doses_distributed, affected_herds, affected_animals, new_outbreaks_declared, reproductive_number, case_fatality_ratio).

This change works together with the value-aware dedup added in the follow-up commit: together they let real duplicates merge cleanly while preventing the merge from being too aggressive when the model misattributes locations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The two-stage dedup added in item 6 groups records by (event_type, metric_name, normalized_location) and merges any whose date buckets overlap. But it didn't compare metric_value, so two records that share a dedup key with overlapping dates AND disagreeing values would still merge, silently dropping one value.

This matters in practice because LLMs occasionally misattribute locations. Live test on the WHO cholera doc exposed exactly this: the model emitted "In January 2026, the African Region reported the highest number of cases (9782 cases; 13 countries)" with location=DRC and value=9782 (the African Region figure incorrectly tagged with DRC). Under the previous dedup logic this merged into the legitimate DRC 6543 records, hiding the attribution error and silently dropping the 9782 value from the dataframe.

New `_values_compatible(v1, v2)` helper allows merging when:

- Both values are None (no count claimed)
- Either value is None (one source omitted the count)
- Values are equal
- Values are within 1% relative tolerance (accommodates rounding, e.g. "about 6500" vs "6543": same fact, different precision)

Values further apart are treated as distinct facts and kept as separate records, surfacing the conflict to downstream consumers rather than burying it.

Live verification on the 6 real biosecurity documents: WHO cholera went from 1 (over-merged, value lost) → 2 records (legitimate DRC merge + Africa Region preserved separately). Total record count 35 → 40, all 5 additional records are legitimate distinct facts the vocabulary-only configuration silently dropped.

Three new tests cover: distinct-value rejection, within-tolerance merge (rounding), and one-value-None merge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
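The four merge conditions reduce to a short predicate. The sketch below uses `math.isclose`, which measures relative difference against the larger magnitude; whether the real `_values_compatible` uses the same reference value is an assumption.

```python
import math
from typing import Optional


def values_compatible(v1: Optional[float], v2: Optional[float],
                      rel_tol: float = 0.01) -> bool:
    """True when two metric values may describe the same fact."""
    if v1 is None or v2 is None:
        # No count claimed on at least one side: nothing to contradict.
        return True
    # Equal values pass trivially; otherwise allow 1% relative drift
    # (measured against the larger of the two magnitudes).
    return math.isclose(v1, v2, rel_tol=rel_tol, abs_tol=0.0)
```

For the WHO cholera case, 6500 vs 6543 differ by about 0.66% and merge, while 9782 vs 6543 differ by roughly a third and are kept as separate records.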

Item 10 stopped short of full retirement: it migrated filtering to the shared `bioscancast.llm.base.LLMClient` protocol but left `bioscancast.llm.client` in place because the search stage still called it. This change finishes the migration and deletes the legacy module, leaving one LLM protocol for the whole codebase.

Search-stage changes:

- `bioscancast/stages/search_stage/query_decomposition.py` now builds (system, user, schema) triples for both the question-type classifier and the sub-query decomposer. Two new JSON schemas (`CLASSIFY_SCHEMA`, `DECOMPOSE_SCHEMA`) constrain the model output to the existing QUESTION_TYPES and VALID_AXES enums. Calls switch from `generate_json(prompt) -> dict` to the kwargs form returning `LLMResponse`.
- `bioscancast/stages/search_stage/pipeline.py` imports `LLMClient` from `bioscancast.llm.base` instead of the legacy module.
- `scripts/run_search_stage.py` instantiates `OpenAILLMClient` (the production class for the shared protocol) instead of the legacy `OpenAIClient`.

Test updates:

- `bioscancast/tests/test_query_decomposition.py` rewritten: `FakeLLMClient` now implements the shared protocol (kwargs + `LLMResponse`), responses are wrapped via a small `_resp` helper, and two new tests assert the right schema is passed to each call.
- `bioscancast/tests/test_filtering_llm.py` docstring updated to note the legacy module is gone; the regression check is kept around as a guard against anyone reintroducing it.

Cleanup:

- `bioscancast/llm/client.py` deleted.
- `bioscancast/llm/__init__.py` no longer exports the now-defunct `FilteringLLMClient` / `OpenAIClient` aliases; nothing in the repo referenced them.
- The follow-up TODO line in `bioscancast/insight/README.md` is removed; there is no remaining migration debt.

All 348 tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The previous commit deleted bioscancast/llm/client.py. The regression test that asserted filtering didn't import from that module is now dead weight: any attempt to import the deleted module would fail at import time with ModuleNotFoundError — much louder than a source-text grep assertion. Source-text grep tests are generally fragile too: they couple to implementation details (the literal `from bioscancast.llm.client` string) rather than behaviour. The new-protocol tests already in this file verify the actual filter calls work correctly, which is what we actually care about. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The HTML parser was already calling trafilatura, but only using the plain-text output as a `raw_text` fallback. Sections were rebuilt by walking the entire raw DOM body, which on pages like CIDRAP (whose body contains the target article plus three unrelated articles plus a "top reads" sidebar plus a footer) produced 18 chunks of mostly noise. The investigation showed this wasn't a deliberate design decision; the original code just used trafilatura.extract() at its default text-only setting and didn't discover the structured output modes.

Switching to `trafilatura.extract(output_format='xml', include_tables=True)` gives us a cleaned `<doc><main>...` tree with the article body properly isolated. The new `_extract_sections_from_trafilatura_xml` walks that tree the same way the DOM walker walks BeautifulSoup, emitting heading-stack-aware sections plus tables.

The previous DOM walker is kept as a fallback: when trafilatura's output has less than 200 chars of body text (listing pages, error pages, or pages whose layout trafilatura's heuristics misjudge), the parser falls through to the original code path so we never silently extract nothing. Title, published_date, and language continue to come from raw DOM head/meta queries, which are reliable regardless of which body path runs.

Verified end-to-end:

- CIDRAP fixture: 18 chunks (4 unrelated articles + nav + footer) → 2 chunks (just the Utah measles article). The "602 cases" headline survives; the insight pipeline still produces a record citing it.
- ProMED fixture: chunks dropped from 7 to 6 (header tagline gone); the 157-row outbreak table is fully preserved in `table_rows`.
- PDF docs (WHO mpox/cholera, CDC MMWR, ECDC, Africa CDC): unchanged (PDFs don't go through this parser).
- Live spot-checks on 5 fresh HTML sources (WHO DON article, CIDRAP homepage, CDC HAN alert, Reuters healthcare landing, a 404 page) all behave correctly. Articles get clean structural extraction; listings get a small but useful section; 404s correctly take the fallback path.

Test changes:

- CIDRAP's min-chunks floor in test_insight_real_docs_integration.py drops to 1 (was 5); that's the new, lower, cleaner extraction output. Other fixtures unchanged.
- New `TestTrafilaturaXmlExtraction` class in test_extraction_html.py with synthetic-HTML cases for sibling-article stripping, table preservation, DOM-walker fallback on thin pages, and metadata extraction independent of which body path runs.

All 351 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>


…atforms

The previous date extractor checked only four patterns: ``article:published_time``, ``og:published_time``, ``<meta name="publication_date">``, and ``<time datetime>``. That's enough for CIDRAP-style article pages but silently returns ``None`` for sources using JSON-LD Schema.org (common on news sites), Dublin Core (common on government / academic sites), and the sailthru / parsely conventions widely used by news platforms.

The new ``_iter_date_candidates`` walks a prioritised list:
1. Publication-semantic ``<meta>``: article:published_time, og:article:published_time, og:published_time, sailthru.date, parsely-pub-date, article.published, pubdate.
2. JSON-LD datePublished (walking nested @graph entries too; Article-inside-WebPage is a common JSON-LD shape).
3. Dublin Core publication-semantic: DC.date.issued, dcterms:issued, DC.date.created, dcterms:created.
4. Legacy publication_date.
5. JSON-LD dateCreated.
6. First ``<time datetime>``.
7. Modification-semantic (last resort): article:modified_time, og:modified_time, JSON-LD dateModified.

Bare ``DC.date`` is deliberately ignored: CDC HAN alerts (and many other sites) use it as a last-rendered timestamp rather than a publication date, and returning it would silently regress data quality. Body-text regex extraction is also deliberately out of scope: too many pages (especially epi articles) contain dozens of unrelated dates that a regex would mistakenly pick up.

Honest live impact on the 5 sources we currently exercise:
- WHO DON article: still None (date is in body text only; pattern deliberately not handled here).
- CDC HAN alert: still None (its only ISO-format meta is ``DC.date``, which is a modification stamp; the cdc:* namespace entries are free-text and inconsistent).
- CIDRAP homepage and fixture article: unchanged (still extracted correctly via article:published_time / <time>).
- ProMED fixture: still None (no metadata anywhere, legitimately).

The change is preventive robustness for the source classes we'll encounter beyond the current fixture set. Twelve new tests pin each pattern in the priority chain so future additions don't shuffle the precedence silently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
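The nested `@graph` walk in step 2 of the priority chain can be sketched roughly as follows. This is a minimal illustration, not the repository's `_iter_date_candidates`: the function names `iter_jsonld_values` and `first_jsonld_date` are hypothetical, and the input is assumed to be the already-extracted text of each `<script type="application/ld+json">` tag.

```python
import json
from typing import Any, Iterator, List, Optional


def iter_jsonld_values(node: Any, key: str) -> Iterator[str]:
    """Yield string values stored under `key`, descending into lists and @graph."""
    if isinstance(node, list):
        for item in node:
            yield from iter_jsonld_values(item, key)
    elif isinstance(node, dict):
        value = node.get(key)
        if isinstance(value, str):
            yield value
        # Article-inside-WebPage: nested entries live under "@graph".
        yield from iter_jsonld_values(node.get("@graph", []), key)


def first_jsonld_date(script_bodies: List[str], key: str = "datePublished") -> Optional[str]:
    """Return the first `key` value found in any JSON-LD script body (hypothetical helper)."""
    for body in script_bodies:
        try:
            data = json.loads(body)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD: skip this block rather than fail the page
        for value in iter_jsonld_values(data, key):
            return value
    return None
```

Calling the same helper with `key="dateCreated"` or `key="dateModified"` would serve the lower tiers of the chain, so the tier order stays a property of the caller rather than of the walker.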

Adds structural URL date extraction as a candidate in the published-date priority chain. Sits below all publication-semantic metadata (article:published_time, JSON-LD datePublished, Dublin Core issued/created, etc.) but above the generic <time> tag and any modification-semantic source. The placement matches the user's instruction: meta wins when it exists, URL provides a fallback when it doesn't.

Why URL parsing is robust enough to outrank generic <time> tags:
- It's structurally bounded: patterns must occupy a full path segment or be anchored to its start. Year-shaped numbers buried inside slugs ("section-2024-summary") are NOT picked up.
- Editors put dates in URLs as a publishing convention, not a coincidence. The signal is editorially intentional in a way that arbitrary <time> tags scattered around a page are not.
- It avoids the failure mode where a sidebar's <time> tag for a related article is mistakenly read as the page's own date.

Patterns recognised, in order of specificity (most-specific wins):
1. Full ISO in one segment: /2024-08-13/ or /20240813/
2. Three consecutive segments: /2024/08/13/
3. Year-month in one segment: /2024-08/
4. Two consecutive segments: /2024/08/
5. Year-prefixed slug (separator required): /2024-DON530, /2024_q3
6. Bare year segment: /2024/

Years are bounded to [1990, 2100] to reject typoed far-past / far-future dates. Patterns gracefully degrade: /2024/13/ (invalid month) still resolves to year-2024 via pass 6 rather than returning None.

Live impact on the three problem sources from the previous commit:
- WHO DON .../item/2024-DON530: None -> 2024-01-01 (year prefix slug)
- CDC HAN .../han/2024/...: None -> 2024-01-01 (bare year segment)
- ProMED .../promed-posts/: None -> None (no URL date, which is correct)

Fixture and live CIDRAP sources unchanged: their meta tags continue to win, as the priority chain demands.

15 new test cases pin: every URL pattern, the precedence vs each higher-tier metadata source, the precedence vs <time> and modified-time (URL beats both), out-of-range year rejection, and graceful degradation for partially-invalid dates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
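One possible reading of the six URL passes, written as a standalone sketch. The names `date_from_url` and `_candidates` and the exact regexes are assumptions for illustration; the repository's implementation is not shown in this commit message. Year-only and year-month matches are padded to the first day, matching the `2024-01-01` results quoted above.

```python
import re
from datetime import date
from typing import Iterator, List, Optional, Tuple
from urllib.parse import urlparse

YEAR_MIN, YEAR_MAX = 1990, 2100  # bounds quoted in the commit message


def _candidates(segs: List[str]) -> Iterator[Tuple[int, int, int]]:
    """Yield (year, month, day) guesses, most specific pattern first."""
    for s in segs:  # 1. full ISO date in one segment
        m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", s) or re.fullmatch(r"(\d{4})(\d{2})(\d{2})", s)
        if m:
            yield int(m[1]), int(m[2]), int(m[3])
    for a, b, c in zip(segs, segs[1:], segs[2:]):  # 2. /YYYY/MM/DD/
        if re.fullmatch(r"\d{4}", a) and re.fullmatch(r"\d{2}", b) and re.fullmatch(r"\d{2}", c):
            yield int(a), int(b), int(c)
    for s in segs:  # 3. year-month in one segment
        m = re.fullmatch(r"(\d{4})-(\d{2})", s)
        if m:
            yield int(m[1]), int(m[2]), 1
    for a, b in zip(segs, segs[1:]):  # 4. /YYYY/MM/
        if re.fullmatch(r"\d{4}", a) and re.fullmatch(r"\d{2}", b):
            yield int(a), int(b), 1
    for s in segs:  # 5. year-prefixed slug, separator required, anchored at segment start
        m = re.match(r"(\d{4})[-_]", s)
        if m:
            yield int(m[1]), 1, 1
    for s in segs:  # 6. bare year segment
        if re.fullmatch(r"\d{4}", s):
            yield int(s), 1, 1


def date_from_url(url: str) -> Optional[date]:
    """Return the first valid, in-range date found in the URL path, else None."""
    segs = [s for s in urlparse(url).path.split("/") if s]
    for year, month, day in _candidates(segs):
        if not YEAR_MIN <= year <= YEAR_MAX:
            continue
        try:
            return date(year, month, day)
        except ValueError:
            continue  # e.g. month 13: fall through to a coarser pattern
    return None
```

Because invalid candidates are skipped rather than treated as failures, a path such as `/2024/13/` falls through pass 4 and is picked up by pass 6, which is the graceful degradation described above.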

The insight stage shipped with a `use_strong_model_refinement: bool` config flag and a placeholder branch in `InsightPipeline.run` that only appended a "not yet implemented" note when the flag was True. A `strong_model: str = "gpt-4o"` config sibling pointed at the intended refinement model.

That state is the worst of all worlds: users who enable the flag get a silent no-op (their records aren't refined, the only sign is a note buried in `InsightRunResult.notes`), and future contributors have to reverse-engineer the design intent. Better to remove the scaffold entirely and revisit when there's data showing a strong pass is actually needed.

Design intent is preserved in #26 labelled `post-benchmark`. That issue captures:
- What the two-pass design was supposed to do
- When to revisit (after the first end-to-end benchmark run identifies failure modes: confidence miscalibration, misclassified event_type, missing structured fields, or subtle hallucinations surviving the substring guard)
- Concrete scope when implementing (confidence-only vs field-fill vs full refinement, budget plumbing, reject-path semantics, tests)
- Definition of done with cost guardrails

Changes:
- `bioscancast/insight/config.py`: removed `use_strong_model_refinement` and `strong_model` from both INSIGHT_CONFIG and InsightConfig.
- `bioscancast/insight/pipeline.py`: removed the placeholder branch.
- `bioscancast/insight/README.md`: replaced the TODO bullet with a pointer to issue #26 explaining why the scaffold is gone.

All 384 tests still pass; the flag had no behaviour to test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…erface

contamination.py and test_historical_topup.py were both added in feat/as-of-date-replay (#24) against the legacy bioscancast.llm.client module, which feat/insight-stage-hardening (#27) deleted as part of the LLM-protocol migration. The textual merge succeeded but left a broken import and two stale fake LLM classes using the old single-arg generate_json signature.

Migrates both to the new structured generate_json(system=, user=, schema=, model=, max_tokens=) interface returning LLMResponse, with explicit prompt/schema constants in contamination.py.

pytest: 447 passed, 2 skipped.
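For orientation, a test fake written against the keyword-style signature named above might look like the sketch below. Only the parameter names (`system`, `user`, `schema`, `model`, `max_tokens`) and the `LLMResponse` return type name come from the commit message; the `LLMResponse` fields (`data`, `model`) and the `FakeLLM` class are placeholders invented for illustration and may not match the real bioscancast types.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class LLMResponse:
    """Stand-in for the project's response type; these fields are assumptions."""
    data: Dict[str, Any]
    model: str


@dataclass
class FakeLLM:
    """Hypothetical test double that returns a canned payload and records each call."""
    canned: Dict[str, Any]
    calls: List[Dict[str, Any]] = field(default_factory=list)

    def generate_json(
        self,
        *,
        system: str,
        user: str,
        schema: Dict[str, Any],
        model: str,
        max_tokens: int,
    ) -> LLMResponse:
        # Keyword-only arguments make a stale single-argument call fail loudly.
        self.calls.append(
            {"system": system, "user": user, "schema": schema,
             "model": model, "max_tokens": max_tokens}
        )
        return LLMResponse(data=dict(self.canned), model=model)
```

Making the arguments keyword-only means a leftover call in the old single-argument style raises `TypeError` immediately instead of passing a prompt into the wrong slot.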

smodee and others added 4 commits May 20, 2026 11:54

smodee requested a review from rapsoj May 27, 2026 06:53

smodee and others added 17 commits May 27, 2026 15:09

smodee force-pushed the feat/insight-stage-hardening branch from 1f19eaa to 2d77493 Compare May 27, 2026 13:14

smodee mentioned this pull request May 27, 2026

Add historical-replay mode for benchmark fairness #24

Merged

4 tasks

rapsoj approved these changes May 28, 2026

rapsoj merged commit 75edb40 into main May 28, 2026

smodee mentioned this pull request May 28, 2026

End-to-end pipeline orchestrator + filter/extraction quality bundle #28

Closed

6 tasks

rapsoj merged 21 commits into main from feat/insight-stage-hardening
