Insight stage hardening: 11 improvements + 5 deferred-but-included items #27
Merged
Activated by setting ForecastQuestion.as_of_date; None (default)
preserves live behavior unchanged.
Core:
* Schema fields published_date_source, cutoff_applied, fetch_strategy,
snapshot_timestamp on SearchResult and Document for post-hoc audit
* SearchStagePipeline filters post-cutoff results; recovers undated
ones from URL slug or Wayback first-seen; drops the unrecoverable
* lookup_dashboards rewrites URLs to closest Wayback snapshot at-or-
before cutoff; suppresses dashboards with no pre-cutoff snapshot
* ExtractionPipeline fetches via Wayback id_ snapshots when cutoff
is set; falls back to live with strategy recorded on Document
* SearchCache key incorporates as_of_date so replays don't collide
* Optional historical_roleplay decomposition prompt
* eval_stage/contamination.py adds filter_caught_contamination_rate
(explicit lower bound, never silent) and retrieval_free baseline (E4)
Hardening surfaced by live testing:
* Tavily end_date accepted on the Protocol but not forwarded
(verified empirically not to filter)
* Wayback CDX retries on 5xx / 429 / timeout with exponential backoff
* Sub-queries get cutoff year appended in historical mode
* Top-up round with bigger max_results when survivors < threshold
* RFC 2822 date parsing for Tavily news topic responses
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tavily's news endpoint silently ignores `end_date` when passed alone but honors the `start_date`+`end_date` pair, returning 20/20 native pre-cutoff results across the resolved corpus (q1, q3, q7, q9).
The pipeline now synthesizes `start_date = as_of_date - 365d` (configurable via `historical_lookback_days`) and forwards both bounds. The 0/20 pre-cutoff disaster on live testing of q1 is fully addressed by this change; the post-retrieval cutoff filter remains as defense-in-depth.
The TavilyBackend drops a lone `end_date` with a warning rather than sending a request Tavily will misinterpret. Stale comments referring to "Tavily ignores end_date" are updated across pipeline.py and tavily_backend.py.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
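The date-window synthesis described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function name `build_date_window` and the ISO-string return shape are assumptions; only `as_of_date`, `historical_lookback_days`, and the 365-day default come from the commit message.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def build_date_window(
    as_of_date: Optional[date],
    historical_lookback_days: int = 365,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (start_date, end_date) as ISO strings.

    Hypothetical helper: in live mode (as_of_date is None) no bounds are
    produced, so a lone end_date is never sent to the backend.
    """
    if as_of_date is None:
        return None, None
    start = as_of_date - timedelta(days=historical_lookback_days)
    return start.isoformat(), as_of_date.isoformat()
```

Producing both bounds together, or neither, keeps the "lone `end_date`" case from arising at the call site in the first place.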
The Wayback CDX endpoint rate-limits at ~60 req/min server-side. The existing reactive RETRY_BACKOFF_SECONDS = (0, 10, 30, 90, 240) ladder only fires after the server has already started returning 429s, burning ~6 min per failure. Historical-replay benchmarks routinely hit dozens of these failures, producing ~30 min wall-clock on q1.
This commit adds two complementary measures:
1. Proactive throttle in wayback.py: a module-level _throttle() gate sleeps before every urlopen to maintain a 2.0 s minimum interval (~30 req/min, half the server cap). Overridable via env var BIOSCANCAST_WAYBACK_MIN_INTERVAL_SECONDS. The retry ladder is unchanged and still handles genuine 503s / read timeouts.
2. Selective-recovery gate in pipeline._apply_cutoff_filter: skip the Wayback first-seen leg of recover_published_date() for aggregator domains (metaculus, manifold, kalshi, ...) and source_tier=="unknown". The URL-slug regex and Last-Modified strategies still run for gated results. New wayback_skipped counter in the cutoff-filter log line.
Live q1 smoke test: ~30 min -> 49 s (~37x).
Test suite: 252 -> 261 passed (9 new tests covering the throttle gate, env override, retry interaction, gate decisions, and end-to-end recovery routing).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
probe_tavily_topic.py grows from a one-query news/general comparison into a corpus iterator with config caching, synthetic-backdated stress queries, and per-knob result dumping. analyze_tavily_probe.py is new: it reads the cached probe payloads and recomputes hit-rate tables without re-paying the Tavily quota.
Both scripts default to writing/reading specs/probe-results/ (gitignored by convention; create on first run). They were the workhorses behind the start_date+end_date investigation that produced commit 211f6df.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous strict whitespace-normalised substring check rejected real
factual quotes whenever the LLM made any small punctuation or unicode
adjustment — and live tests showed it does so constantly on real
WHO/CDC/ECDC documents. The headline failure was CDC MMWR producing a
single record from a doc that says "99 cases" five times, because every
variant the model emitted had either a trailing period where the source
had a comma, parens dropped around "(NMDOH)", or a smart-quote vs
straight-quote mismatch.
The new layered guard `_quote_matches` accepts a quote that matches the
chunk under any of three increasingly permissive normalisations:
1. NFKC + explicit typography fold (smart quotes, em/en dashes,
ellipsis) + whitespace collapse.
2. Strip terminal ".;,:!?" from the quote.
3. Strip wrapping punctuation `()[]{}"'` from both sides.
The function returns the canonical chunk substring it matched against,
so ChunkReference.quote always stores a verbatim chunk excerpt rather
than the model's altered output. Tests show this lifts real-fact
capture from 23 to 34 records on the 6 real biosecurity test
documents (a 48% increase) while still rejecting fabricated quotes,
content-insertion hallucinations (extra words in a list), synthesised
prefix-onto-fragment paraphrases, and wrong-number alterations.
The 30-character-window match described in the plan turned out to be
unnecessary once the typography fold and paren-strip layers were in
place — and would have weakened the guard on real hallucinations
without buying additional real-quote coverage.
New parametrised tests cover each layer plus the rejection cases.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
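The layered guard can be sketched as follows. This is a simplified illustration rather than the repository's `_quote_matches`: it returns the matched slice of the typography-folded chunk (the real function is described as returning a verbatim chunk excerpt), and it reads "both sides" in layer 3 as both ends of the quote, which is an assumption.

```python
import re
import unicodedata
from typing import Optional

# Explicit typography fold applied after NFKC normalisation.
_TYPOGRAPHY = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # smart single quotes
    "\u201c": '"', "\u201d": '"',   # smart double quotes
    "\u2013": "-", "\u2014": "-",   # en / em dash
    "\u2026": "...",                # ellipsis
})


def _fold(text: str) -> str:
    """Layer 1 normalisation: NFKC, typography fold, whitespace collapse."""
    text = unicodedata.normalize("NFKC", text).translate(_TYPOGRAPHY)
    return re.sub(r"\s+", " ", text).strip()


def quote_matches(quote: str, chunk_text: str) -> Optional[str]:
    """Return the matched (folded) chunk substring, or None if no layer matches."""
    chunk = _fold(chunk_text)
    layer1 = _fold(quote)
    layer2 = layer1.rstrip(".;,:!?")          # strip terminal punctuation
    layer3 = layer2.strip("()[]{}\"'")        # strip wrapping punctuation
    for candidate in (layer1, layer2, layer3):
        if candidate and candidate in chunk:
            return candidate
    return None
```

Every layer still requires a contiguous substring match, so a changed number or an inserted word fails all three; only edge punctuation and typography are forgiven.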
Wires the real ExtractionPipeline (fetcher monkey-patched to read on-disk
bytes from data/docling_eval/sources/) into the insight pipeline with
two deterministic fake LLMs:
- QuoteEchoingFakeLLM: picks a number-bearing sentence from each
  retrieved chunk and emits a synthetic fact citing it verbatim.
  Exercises the happy path of the hallucination guard.
- HallucinatingFakeLLM: always emits a fabricated quote. The guard
  must drop every fact.
Covers 23 assertions across:
- extraction per source (Africa CDC fails with requires_ocr, the
other 6 succeed with >= 5 chunks each)
- WHO mpox PDF metadata yields a publication date
- insight pipeline produces records for every text-extractable doc
- the CIDRAP article's "602 cases" headline appears in at least one
record's quote
- failed-extraction docs are skipped without LLM calls
- the hallucination guard rejects every fabricated quote on every doc
- cross-document dedup merges twin facts into one record with sources
from both docs
Uses >= thresholds so subsequent items in the insight-stage hardening
plan (empty-chunk filter, partial-date dedup, pycountry resolution)
can lift record counts without breaking this test. Runs in ~7 seconds
with no live LLM calls.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Empty-text chunks cost the insight stage twice: they take a top-k slot in retrieval (BM25 indexes the heading even when the body is empty), and then trigger an LLM call against blank chunk text. Tests show the CDC MMWR borderless table case where this happened on a chunk that ranked #1 by retrieval score, a wasted call every time.
A new helper, `_drop_or_repair_empty_chunks`, runs in `ExtractionPipeline.extract_one` between `normalize_chunks` and chunk renumbering. Two paths:
- Table chunks with empty text but populated `table_data`: render the rows to a tab-separated text block so BM25 and the LLM can see the cell contents. `table_data` itself is preserved unchanged for any consumer that wants the structured form.
- Other empty chunks: drop with a DEBUG log carrying chunk_id, type and heading. An empty prose chunk almost always indicates a half-broken upstream parser section (heading without body, footer artefact); nothing the insight stage can act on.
Tests confirm the existing MMWR table chunk (previously empty-text) now carries 277 characters of rendered table content while the underlying rows stay accessible via `table_data`. Empty prose chunks disappear before reaching downstream stages.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
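The two paths can be illustrated with a small sketch. The `Chunk` dataclass and its field names (`chunk_type`, `heading`, `text`, `table_data`) are stand-ins assumed for this example; the real chunk model may differ.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

logger = logging.getLogger(__name__)


@dataclass
class Chunk:
    """Hypothetical stand-in for the pipeline's chunk model."""
    chunk_id: str
    chunk_type: str
    heading: str = ""
    text: str = ""
    table_data: Optional[List[List[str]]] = None


def drop_or_repair_empty_chunks(chunks: List[Chunk]) -> List[Chunk]:
    kept = []
    for chunk in chunks:
        if chunk.text.strip():
            kept.append(chunk)
        elif chunk.chunk_type == "table" and chunk.table_data:
            # Repair: render rows as tab-separated lines; table_data is untouched.
            chunk.text = "\n".join(
                "\t".join(str(cell) for cell in row) for row in chunk.table_data
            )
            kept.append(chunk)
        else:
            # Drop: nothing downstream stages can act on.
            logger.debug(
                "dropping empty chunk id=%s type=%s heading=%r",
                chunk.chunk_id, chunk.chunk_type, chunk.heading,
            )
    return kept
```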
The previous single-stage dedup key formatted every event_date as
YYYY-MM-DD, which prevented merging two reports of the same event
when one source gave a day and another gave only the month — tests
on the 6 real biosecurity documents showed WHO cholera repeatedly
producing two records for the DRC January-2026 6543-cases fact, one
from a prose sentence ("In January 2026 ... reported 6543 new cholera
cases") and one from a table cell ("Democratic Republic of the Congo
6 543") with subtly different parsed dates.
Changes:
- New `event_date_precision` field on `InsightRecord` carries the
granularity ("year"|"month"|"day") alongside the canonical
`event_date` datetime (which is now the start of the period when
only a partial date is known).
- `_parse_event_date` accepts YYYY, YYYY-MM, YYYY-MM-DD, and the
existing free-form day-precision variants, returning a
(datetime, precision) tuple.
- Extraction prompt instructs the model to use the most specific
ISO date the chunk supports and NOT to invent a day-of-month
when only a month is given.
- Two-stage `_deduplicate_records`: first group by (event_type,
metric_name, normalized_location), then within each group walk
records in order and merge each into the first surviving entry
whose date bucket overlaps at the coarser precision. The
surviving record adopts the finer precision so downstream
consumers don't lose information.
- Confidence is taken as the max across merged records; provenance
references from every merged source are preserved.
Tests cover the matrix of precision combinations (day/month overlap,
year/day overlap, equal months, different days, undated vs dated,
three-way mixed-precision merge, separate locations). Live runs show
records merging correctly when the model picks the same metric_name
across sources; cases where the model uses subtly different metric
names ("new cases" vs "Cases" vs "cholera cases") still stay
separate — that's a metric-name-normalisation problem separate from
this change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
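The precision-aware overlap test at the heart of the second dedup stage can be sketched as below. This covers only the ISO forms (YYYY, YYYY-MM, YYYY-MM-DD), not the free-form variants the real `_parse_event_date` accepts, and the helper names are illustrative assumptions.

```python
from datetime import datetime
from typing import Optional, Tuple

# Most specific format first, so a full date is never read as a bare year.
_FORMATS = (("%Y-%m-%d", "day"), ("%Y-%m", "month"), ("%Y", "year"))
_RANK = {"year": 0, "month": 1, "day": 2}


def parse_event_date(raw: str) -> Optional[Tuple[datetime, str]]:
    """Return (start-of-period datetime, precision), or None if unparseable."""
    for fmt, precision in _FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt), precision
        except ValueError:
            continue
    return None


def _bucket(dt: datetime, precision: str) -> tuple:
    """Truncate a datetime to the given precision."""
    if precision == "year":
        return (dt.year,)
    if precision == "month":
        return (dt.year, dt.month)
    return (dt.year, dt.month, dt.day)


def dates_overlap(a: Tuple[datetime, str], b: Tuple[datetime, str]) -> bool:
    """Compare two parsed dates at the coarser of their two precisions."""
    coarser = min(a[1], b[1], key=_RANK.__getitem__)
    return _bucket(a[0], coarser) == _bucket(b[0], coarser)
```

Comparing at the coarser precision is what lets a month-only report merge with a day-precision report from the same month, while two different days stay separate.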
The previous COUNTRY_TO_ISO dict hand-listed ~30 countries — primarily
because nobody wanted to take on a dependency at the time. Live tests
show the cost: most extracted records on the 6 real biosecurity docs
had iso_country_code=None despite clear country names like "Austria",
"Bulgaria", "Comoros", and "Madagascar" that the map simply didn't
cover. With ~250 ISO 3166-1 entries, hand-maintenance was untenable.
pycountry covers all 249 ISO 3166-1 entries by canonical, common, and
official name plus alpha-2/alpha-3 codes. The new resolver in
`chunk_extractor.py` layers four steps:
1. Typography fold (smart quotes → ASCII) so "Côte d'Ivoire" with
either apostrophe variant resolves.
2. Explicit not-a-country set ("Africa", "European Region",
"EU/EEA", etc.) — these are multi-country roll-ups, not single
ISO entries, and pycountry's `search_fuzzy` produces surprising
false positives here. Returns None deliberately.
3. Alias dict for forms pycountry won't match directly ("UK", "DRC",
"Russia", "Burma", "Ivory Coast", "North Korea", and UK
constituents like "England" / "Scotland" → "GB").
4. US subnational set — all 50 states plus DC and US territories
→ "US". Common in biosecurity reporting (CDC MMWR routinely
phrases location as "Lea County, New Mexico").
`pycountry.countries.search_fuzzy` is deliberately not used — it
would resolve "Eastern Mediterranean Region" to a single country
(false positive). Strict matching only.
Compound locations like "Mubende district, Uganda" still work via
the existing right-to-left comma-segment fallback.
Tests cover: bare country lookups across all 6 real docs' locations,
the alias set, US states, multi-country region rejection, compound
locations, alpha-2/alpha-3 codes, smart-quote variants.
pycountry pinned to >=24.0 in requirements.txt.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
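The layering can be illustrated with a dependency-free sketch. To keep it runnable without pycountry, the strict lookup is injected as a callable and defaults to a tiny stand-in table; in the real resolver that role is played by pycountry's strict matching. All set contents here are truncated examples, and the function names are assumptions.

```python
from typing import Callable, Optional

_NOT_A_COUNTRY = {"africa", "european region", "eu/eea"}
_ALIASES = {
    "uk": "GB", "england": "GB", "scotland": "GB", "drc": "CD",
    "russia": "RU", "burma": "MM", "ivory coast": "CI", "north korea": "KP",
}
_US_SUBNATIONAL = {"new mexico", "utah", "district of columbia"}  # truncated
# Stand-in for the strict pycountry lookup (assumption for this sketch).
_STANDIN_TABLE = {"austria": "AT", "uganda": "UG", "côte d'ivoire": "CI"}


def _fold(name: str) -> str:
    """Step 1: fold smart apostrophes to ASCII and normalise case."""
    return name.replace("\u2019", "'").replace("\u2018", "'").strip().lower()


def _resolve_one(name: str, strict_lookup: Callable[[str], Optional[str]]) -> Optional[str]:
    key = _fold(name)
    if not key or key in _NOT_A_COUNTRY:   # step 2: multi-country roll-ups
        return None
    if key in _ALIASES:                     # step 3: alias dict
        return _ALIASES[key]
    if key in _US_SUBNATIONAL:              # step 4: US states and territories
        return "US"
    return strict_lookup(key)               # strict match only, never fuzzy


def resolve_iso(location: str,
                strict_lookup: Callable[[str], Optional[str]] = _STANDIN_TABLE.get) -> Optional[str]:
    code = _resolve_one(location, strict_lookup)
    if code:
        return code
    # Right-to-left comma-segment fallback for compound locations.
    for segment in reversed(location.split(",")):
        code = _resolve_one(segment, strict_lookup)
        if code:
            return code
    return None
```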
PyMuPDF surfaces whatever the PDF's /Title metadata field says, with no
filtering. Real biosecurity PDFs often have stale conversion titles
leaked from the source Word document — tests show ECDC's CDTR
returning "2026-WCP-0020 Draft.docx" as its title, which then
displayed verbatim through Document.title and into any downstream
consumer.
A new `_sanitize_title` helper on PdfParser drops:
- Titles ending in a document-format extension (.pdf, .docx, .doc,
.odt, .rtf, .txt, .pages, .xlsx, .pptx, .html, case-insensitive).
- Empty / whitespace-only titles.
- Implausibly short titles (< 5 chars).
When the sanitiser returns None, the existing
`parsed.title or filtered_doc.title` fallback chain in
`extraction/pipeline.py` picks up the search-side title instead —
which is the desired behaviour.
Tests confirm the ECDC stale title is dropped, every other
filename extension variant is rejected, and real document titles
(WHO sitreps, MMWR articles, ECDC CDTR) pass through unchanged.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
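A sanitiser with these three rules is short enough to sketch in full. The function below is an illustrative standalone version, not the `PdfParser` method itself; the extension list and the 5-character floor are taken from the commit message.

```python
import re
from typing import Optional

_FORMAT_EXTENSION = re.compile(
    r"\.(pdf|docx|doc|odt|rtf|txt|pages|xlsx|pptx|html)$", re.IGNORECASE
)
_MIN_TITLE_CHARS = 5


def sanitize_title(title: Optional[str]) -> Optional[str]:
    """Return a cleaned title, or None so the caller's fallback chain applies."""
    if title is None:
        return None
    cleaned = title.strip()
    if len(cleaned) < _MIN_TITLE_CHARS:      # empty, whitespace-only, or too short
        return None
    if _FORMAT_EXTENSION.search(cleaned):    # leaked filename such as "... Draft.docx"
        return None
    return cleaned
```

Returning `None` rather than an empty string matters here, because the `parsed.title or filtered_doc.title` fallback relies on falsiness to pick up the search-side title.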
The filtering stage was using the older
`bioscancast/llm/client.py:LLMClient` (single positional prompt string,
returns plain dict) — a parallel protocol that's been carrying a long-
standing TODO in the insight README. The shared protocol at
`bioscancast/llm/base.py:LLMClient` uses keyword-only system/user/schema/
model/max_tokens and returns LLMResponse with structured token
accounting, matching what insight (and soon forecasting) already uses.
Changes:
- `bioscancast/filtering/llm_filter.py`:
* `build_filter_prompt` now returns a (system, user, schema) triple
instead of a single concatenated JSON string. System carries the
task instructions; user carries the question + candidates
payload; schema is a real strict JSON Schema (not the previous
example-dict).
* `llm_filter_candidates` calls `llm_client.generate_json` with
kwargs and reads `response.content["decisions"]` instead of
the raw dict.
* Adds default `model` and `max_tokens` parameters.
- `bioscancast/filtering/pipeline.py`: switches `LLMClient` import to
the shared base protocol.
- `bioscancast/llm/client.py`: keeps the legacy single-positional
protocol intact (search stage still uses it) but adds a top-of-file
docstring warning new callers off.
- `bioscancast/insight/README.md`: the filtering-migration TODO is
closed; replaced with the remaining search-stage migration as a
follow-up.
- New `bioscancast/tests/test_filtering_llm.py` covers the new
protocol path: prompt triple, strict JSON schema shape, the
expected `generate_json(**kwargs)` call signature, missing-decision
handling, empty-input shortcut, and a regression check that the
filter module never re-imports the legacy client.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
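The shape of the shared protocol and the filter call path can be sketched as follows. Only the keyword names (`system`, `user`, `schema`, `model`, `max_tokens`), the `LLMResponse` return type, and the `content["decisions"]` access come from the commit message; the token field names, the schema contents, the default model string, and the `KeepAllFake` class are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol, Tuple


@dataclass
class LLMResponse:
    content: Dict[str, Any]
    input_tokens: int = 0    # hypothetical field name
    output_tokens: int = 0   # hypothetical field name


class LLMClient(Protocol):
    def generate_json(self, *, system: str, user: str, schema: Dict[str, Any],
                      model: str, max_tokens: int) -> LLMResponse: ...


# Illustrative strict JSON Schema; the real schema's fields are not shown in the PR.
DECISIONS_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {"decisions": {"type": "array", "items": {
        "type": "object",
        "properties": {"url": {"type": "string"}, "keep": {"type": "boolean"}},
        "required": ["url", "keep"],
        "additionalProperties": False,
    }}},
    "required": ["decisions"],
    "additionalProperties": False,
}


def build_filter_prompt(question: str, candidates: List[Dict[str, Any]]) -> Tuple[str, str, Dict[str, Any]]:
    system = "Decide which candidate documents are relevant to the question."
    user = json.dumps({"question": question, "candidates": candidates})
    return system, user, DECISIONS_SCHEMA


def llm_filter_candidates(llm_client: LLMClient, question: str,
                          candidates: List[Dict[str, Any]], *,
                          model: str = "gpt-4o-mini", max_tokens: int = 1024) -> List[Dict[str, Any]]:
    if not candidates:  # empty-input shortcut: no LLM call
        return []
    system, user, schema = build_filter_prompt(question, candidates)
    response = llm_client.generate_json(system=system, user=user, schema=schema,
                                        model=model, max_tokens=max_tokens)
    return response.content.get("decisions", [])


class KeepAllFake:
    """Test fake that satisfies the protocol structurally and keeps everything."""
    def generate_json(self, *, system, user, schema, model, max_tokens):
        urls = [c["url"] for c in json.loads(user)["candidates"]]
        return LLMResponse(content={"decisions": [{"url": u, "keep": True} for u in urls]})
```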
The insight pipeline's per-chunk LLM call loop was strictly sequential.
Tests on the 6 real biosecurity documents showed each doc spending
almost all its wall-clock in serial OpenAI request latency: WHO mpox
~30s, ECDC CDTR ~38s, MMWR ~11s for top-k=5 chunks. Since each
extract_facts_from_chunk call is independent and the OpenAI sync
client is thread-safe, this was leaving easy speedups on the floor.
Changes:
- `InsightPipeline.run` now dispatches per-chunk extractions to a
`ThreadPoolExecutor` whose pool size is `min(chunk_workers,
len(scored_chunks))`. With chunk_workers=1 or only one chunk, the
code takes a sequential fallback path. Errors in one chunk are
caught, logged, and don't abort the document.
- Budget accounting still happens serially on the main thread after
futures complete, so BudgetTracker stays simple (no locks needed).
- New `chunk_workers: int = 6` field on `InsightConfig`. Six is a
pragmatic default — matches the typical retrieval_top_k while
staying well below OpenAI's per-minute rate limits for
gpt-4o-mini. Setting to 1 reproduces the previous sequential
behaviour.
- `FakeLLMClient.generate_json` and `enqueue` now hold a
`threading.Lock` around the response deque and counters so the
test fakes stay deterministic under concurrent calls.
New tests:
- test_pipeline_parallel_chunk_extraction_produces_all_records:
content-keyed fake confirms every chunk in the top-k is processed
by parallel workers and records survive provenance checks.
- test_pipeline_sequential_and_parallel_produce_same_record_count:
chunk_workers=1 and chunk_workers=4 yield identical record counts
and identical input-token totals on the same input.
- test_pipeline_parallel_isolates_chunk_failures: a fake that throws
on the 2nd chunk doesn't kill the doc — the other three still
complete and the doc is marked processed.
Live verification on the 6 real biosecurity documents: wall-clock per
doc drops ~2× across the board (WHO mpox 24s→13s, WHO cholera 11s→3s,
MMWR 11s→3s, ECDC CDTR 37s→15s). Record counts stable within LLM
stochasticity.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
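The dispatch pattern described above might look roughly like this. It is a generic sketch, not `InsightPipeline.run`: the function name `extract_chunks` and the convention of returning `None` for a failed chunk are assumptions.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, TypeVar

logger = logging.getLogger(__name__)
C = TypeVar("C")
R = TypeVar("R")


def extract_chunks(chunks: Sequence[C], extract_one: Callable[[C], R],
                   chunk_workers: int = 6) -> List[Optional[R]]:
    """Run extract_one over chunks, in input order; a failed chunk yields None."""

    def safe(chunk: C) -> Optional[R]:
        try:
            return extract_one(chunk)
        except Exception:
            # One bad chunk is logged and must not abort the whole document.
            logger.warning("chunk extraction failed", exc_info=True)
            return None

    if chunk_workers <= 1 or len(chunks) <= 1:
        return [safe(chunk) for chunk in chunks]  # sequential fallback path

    pool_size = min(chunk_workers, len(chunks))
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # map() preserves input order, so results line up with chunks.
        results = list(pool.map(safe, chunks))
    # Budget accounting would happen here, serially on the calling thread.
    return results
```

Because results are only consumed after the pool has drained, any per-response bookkeeping can stay single-threaded, which is the property that keeps the budget tracker lock-free.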
Live observation from item 6 (partial-date dedup) was that the model emitted many different metric_name strings for what was essentially the same metric: "confirmed cases" / "cases" / "reported cases" / "total cases" / "total_cases" / "cholera cases" / "new cases" / "Cases" all appeared in a single live run, and they prevented the dedup logic from merging facts about the same event. ECDC alone produced 9 distinct metric_name variants of "case count".
Tests show that listing a canonical snake_case vocabulary directly in the extraction prompt, with explicit guidance that qualifiers (sex, sub-region, time-period) belong in `summary` or `location` rather than in `metric_name`, collapses the diversity dramatically: 17 unique metric_names → 4–6 across the same 6 real biosecurity test documents, all drawn from the canonical list. The model can still invent a short snake_case label when none of the canonical values fit.
The vocabulary covers the common biosecurity metrics observed in live tests (confirmed_cases, suspected_cases, probable_cases, confirmed_or_probable_cases, deaths, hospitalizations, recoveries, vaccinations_administered, vaccine_doses_distributed, affected_herds, affected_animals, new_outbreaks_declared, reproductive_number, case_fatality_ratio).
This change works together with the value-aware dedup added in the follow-up commit: together they let real duplicates merge cleanly while preventing the merge from being too aggressive when the model misattributes locations.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The two-stage dedup added in item 6 groups records by (event_type,
metric_name, normalized_location) and merges any whose date buckets
overlap. But it didn't compare metric_value — so two records that
share a dedup key with overlapping dates AND disagreeing values would
still merge, silently dropping one value.
This matters in practice because LLMs occasionally misattribute
locations. Live test on the WHO cholera doc exposed exactly this:
the model emitted "In January 2026, the African Region reported the
highest number of cases (9782 cases; 13 countries)" with
location=DRC and value=9782 (the African Region figure incorrectly
tagged with DRC). Under the previous dedup logic this merged into
the legitimate DRC 6543 records, hiding the attribution error and
silently dropping the 9782 value from the dataframe.
New `_values_compatible(v1, v2)` helper allows merging when:
- Both values are None (no count claimed)
- Either value is None (one source omitted the count)
- Values are equal
- Values are within 1% relative tolerance (accommodates rounding
e.g. "about 6500" vs "6543" — same fact, different precision)
Values further apart are treated as distinct facts and kept as
separate records, surfacing the conflict to downstream consumers
rather than burying it.
Live verification on the 6 real biosecurity documents: WHO cholera
went from 1 (over-merged, value lost) → 2 records (legitimate DRC
merge + Africa Region preserved separately). Total record count
35 → 40, all 5 additional records are legitimate distinct facts
the vocabulary-only configuration silently dropped.
Three new tests cover: distinct-value rejection, within-tolerance
merge (rounding), and one-value-None merge.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
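The merge conditions listed above reduce to a few lines. This sketch assumes a symmetric relative tolerance measured against the larger magnitude; the commit message states the 1% figure but not which value it is relative to, so that detail is an assumption.

```python
from typing import Optional


def values_compatible(v1: Optional[float], v2: Optional[float],
                      rel_tol: float = 0.01) -> bool:
    """True when two metric values may describe the same fact."""
    if v1 is None or v2 is None:
        return True   # at least one source made no count claim
    if v1 == v2:
        return True
    # Within 1% of the larger magnitude: treat as rounding, not disagreement.
    return abs(v1 - v2) <= rel_tol * max(abs(v1), abs(v2))
```

For the WHO cholera example, 6500 versus 6543 differs by 43, which is under 1% of 6543 (about 65), so those merge; 6543 versus 9782 does not, so the misattributed African Region figure survives as its own record.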
Item 10 stopped short of full retirement: it migrated filtering to the
shared `bioscancast.llm.base.LLMClient` protocol but left
`bioscancast.llm.client` in place because the search stage still
called it. This change finishes the migration and deletes the legacy
module — one LLM protocol for the whole codebase.
Search-stage changes:
- `bioscancast/stages/search_stage/query_decomposition.py` now
builds (system, user, schema) triples for both the question-type
classifier and the sub-query decomposer. Two new JSON schemas
(`CLASSIFY_SCHEMA`, `DECOMPOSE_SCHEMA`) constrain the model
output to the existing QUESTION_TYPES and VALID_AXES enums.
Calls switch from `generate_json(prompt) -> dict` to the kwargs
form returning `LLMResponse`.
- `bioscancast/stages/search_stage/pipeline.py` imports `LLMClient`
from `bioscancast.llm.base` instead of the legacy module.
- `scripts/run_search_stage.py` instantiates `OpenAILLMClient`
(the production class for the shared protocol) instead of the
legacy `OpenAIClient`.
Test updates:
- `bioscancast/tests/test_query_decomposition.py` rewritten:
`FakeLLMClient` now implements the shared protocol (kwargs +
`LLMResponse`), responses are wrapped via a small `_resp` helper,
and two new tests assert the right schema is passed to each call.
- `bioscancast/tests/test_filtering_llm.py` docstring updated to
note the legacy module is gone; the regression check is kept
around as a guard against anyone reintroducing it.
Cleanup:
- `bioscancast/llm/client.py` deleted.
- `bioscancast/llm/__init__.py` no longer exports the now-defunct
`FilteringLLMClient` / `OpenAIClient` aliases; nothing in the
repo referenced them.
- The follow-up TODO line in `bioscancast/insight/README.md` is
removed — there is no remaining migration debt.
All 348 tests still pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit deleted bioscancast/llm/client.py. The regression test that asserted filtering didn't import from that module is now dead weight: any attempt to import the deleted module would fail at import time with ModuleNotFoundError, which is much louder than a source-text grep assertion.
Source-text grep tests are generally fragile too: they couple to implementation details (the literal `from bioscancast.llm.client` string) rather than behaviour. The new-protocol tests already in this file verify that the actual filter calls work correctly, which is what matters.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The HTML parser was already calling trafilatura, but only using the
plain-text output as a `raw_text` fallback. Sections were rebuilt by
walking the entire raw DOM body — which on pages like CIDRAP (whose
body contains the target article plus three unrelated articles plus
a "top reads" sidebar plus a footer) produced 18 chunks of mostly
noise. The investigation showed this wasn't a deliberate design
decision; the original code just used trafilatura.extract() at its
default text-only setting and didn't discover the structured output
modes.
Switching to `trafilatura.extract(output_format='xml', include_tables=True)`
gives us a cleaned `<doc><main>...` tree with the article body
properly isolated. The new `_extract_sections_from_trafilatura_xml`
walks that tree the same way the DOM walker walks BeautifulSoup,
emitting heading-stack-aware sections plus tables.
The previous DOM walker is kept as a fallback: when trafilatura's
output has less than 200 chars of body text (listing pages, error
pages, or pages whose layout trafilatura's heuristics misjudge), the
parser falls through to the original code path so we never silently
extract nothing.
Title, published_date, and language continue to come from raw DOM
head/meta queries — those are reliable regardless of which body path
runs.
Verified end-to-end:
- CIDRAP fixture: 18 chunks (4 unrelated articles + nav + footer)
→ 2 chunks (just the Utah measles article). The "602 cases"
headline survives; the insight pipeline still produces a record
citing it.
- ProMED fixture: chunks dropped from 7 to 6 (header tagline gone);
the 157-row outbreak table is fully preserved in `table_rows`.
- PDF docs (WHO mpox/cholera, CDC MMWR, ECDC, Africa CDC): unchanged
(PDFs don't go through this parser).
- Live spot-checks on 5 fresh HTML sources (WHO DON article, CIDRAP
homepage, CDC HAN alert, Reuters healthcare landing, a 404 page)
all behave correctly. Articles get clean structural extraction;
listings get a small but useful section; 404s correctly take the
fallback path.
Test changes:
- CIDRAP's min-chunks floor in test_insight_real_docs_integration.py
drops to 1 (was 5) — that's the new, lower, cleaner extraction
output. Other fixtures unchanged.
- New `TestTrafilaturaXmlExtraction` class in test_extraction_html.py
with synthetic-HTML cases for sibling-article stripping, table
preservation, DOM-walker fallback on thin pages, and metadata
extraction independent of which body path runs.
All 351 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…atforms
The previous date extractor checked only four patterns:
``article:published_time``, ``og:published_time``, ``<meta name="publication_date">``,
and ``<time datetime>``. That's enough for CIDRAP-style article pages
but silently returns ``None`` for sources using JSON-LD Schema.org
(common on news sites), Dublin Core (common on government / academic
sites), and the sailthru / parsely conventions widely used by news
platforms.
The new ``_iter_date_candidates`` walks a prioritised list:
1. Publication-semantic ``<meta>``: article:published_time,
og:article:published_time, og:published_time, sailthru.date,
parsely-pub-date, article.published, pubdate.
2. JSON-LD datePublished (walking nested @graph entries too —
Article-inside-WebPage is a common JSON-LD shape).
3. Dublin Core publication-semantic: DC.date.issued, dcterms:issued,
DC.date.created, dcterms:created.
4. Legacy publication_date.
5. JSON-LD dateCreated.
6. First ``<time datetime>``.
7. Modification-semantic (last resort): article:modified_time,
og:modified_time, JSON-LD dateModified.
Bare ``DC.date`` is deliberately ignored — CDC HAN alerts (and many
other sites) use it as a last-rendered timestamp rather than a
publication date, and returning it would silently regress data quality.
Body-text regex extraction is also deliberately out of scope: too many
pages (especially epi articles) contain dozens of unrelated dates that
a regex would mistakenly pick up.
Honest live impact on the 5 sources we currently exercise:
- WHO DON article: still None (date is in body text only — pattern
deliberately not handled here).
- CDC HAN alert: still None (its only ISO-format meta is ``DC.date``,
which is a modification stamp; the cdc:* namespace entries are
free-text and inconsistent).
- CIDRAP homepage and fixture article: unchanged (still extracted
correctly via article:published_time / <time>).
- ProMED fixture: still None (no metadata anywhere — legitimately).
The change is preventive robustness for the source classes we'll
encounter beyond the current fixture set. Twelve new tests pin each
pattern in the priority chain so future additions don't shuffle the
precedence silently.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
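The nested JSON-LD walk in step 2 can be sketched with the standard library alone. This version recurses through every nested object and array rather than special-casing `@graph`, which is a simplification assumed here; the function names are illustrative.

```python
import json
from typing import Any, Iterator, Optional


def iter_jsonld_values(node: Any, key: str) -> Iterator[str]:
    """Yield string values stored under `key` anywhere in a JSON-LD tree."""
    if isinstance(node, dict):
        value = node.get(key)
        if isinstance(value, str):
            yield value
        for child in node.values():
            yield from iter_jsonld_values(child, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_jsonld_values(item, key)


def first_jsonld_date(script_text: str, key: str = "datePublished") -> Optional[str]:
    """Return the first value for `key` in a JSON-LD script body, else None."""
    try:
        data = json.loads(script_text)
    except json.JSONDecodeError:
        return None  # malformed JSON-LD is common; treat as "no candidate"
    return next(iter_jsonld_values(data, key), None)
```

Calling the same helper with `key="dateCreated"` or `key="dateModified"` at lower positions in the chain would keep the three JSON-LD tiers on one code path while preserving their separate priorities.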
Adds structural URL date extraction as a candidate in the
published-date priority chain. Sits below all publication-semantic
metadata (article:published_time, JSON-LD datePublished, Dublin Core
issued/created, etc.) but above the generic <time> tag and any
modification-semantic source. The placement matches the user's
instruction: meta wins when it exists, URL provides a fallback when
it doesn't.
Why URL parsing is robust enough to outrank generic <time> tags:
- It's structurally bounded — patterns must occupy a full path
segment or be anchored to its start. Year-shaped numbers buried
inside slugs ("section-2024-summary") are NOT picked up.
- Editors put dates in URLs as a publishing convention, not a
coincidence. The signal is editorially intentional in a way that
arbitrary <time> tags scattered around a page are not.
- It avoids the failure mode where a sidebar's <time> tag for a
related article is mistakenly read as the page's own date.
Patterns recognised, in order of specificity (most-specific wins):
1. Full ISO in one segment: /2024-08-13/ or /20240813/
2. Three consecutive segments: /2024/08/13/
3. Year-month in one segment: /2024-08/
4. Two consecutive segments: /2024/08/
5. Year-prefixed slug (separator required): /2024-DON530, /2024_q3
6. Bare year segment: /2024/
Years are bounded to [1990, 2100] to reject typoed far-past / far-
future dates. Patterns gracefully degrade — /2024/13/ (invalid month)
still resolves to year-2024 via pass 6 rather than returning None.
Live impact on the three problem sources from the previous commit:
- WHO DON .../item/2024-DON530: None -> 2024-01-01 (year prefix slug)
- CDC HAN .../han/2024/...: None -> 2024-01-01 (bare year segment)
- ProMED .../promed-posts/: None -> None (no URL date — correct)
Fixture and live CIDRAP sources unchanged — their meta tags continue
to win, as the priority chain demands.
15 new test cases pin: every URL pattern, the precedence vs each
higher-tier metadata source, the precedence vs <time> and modified-
time (URL beats both), out-of-range year rejection, and graceful
degradation for partially-invalid dates.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
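The six passes can be sketched as below. This is an illustrative reimplementation under assumptions, not the repository's extractor: the regexes and the function name `extract_url_date` are invented for the example, and partial dates resolve to the start of the period, matching the 2024-01-01 results quoted above.

```python
import re
from datetime import date
from typing import Optional
from urllib.parse import urlparse

_FULL_ISO = (re.compile(r"(\d{4})-(\d{2})-(\d{2})"), re.compile(r"(\d{4})(\d{2})(\d{2})"))
_YEAR_MONTH = re.compile(r"(\d{4})-(\d{2})")
_YEAR_SLUG = re.compile(r"\d{4}[-_].+")   # year prefix, separator required
_YEAR = re.compile(r"\d{4}")
_TWO_DIGITS = re.compile(r"\d{2}")


def _make(year: int, month: int = 1, day: int = 1) -> Optional[date]:
    """Build a date, rejecting out-of-range years and invalid month/day."""
    if not 1990 <= year <= 2100:
        return None
    try:
        return date(year, month, day)
    except ValueError:
        return None


def extract_url_date(url: str) -> Optional[date]:
    segs = [s for s in urlparse(url).path.split("/") if s]
    for seg in segs:                                   # 1. full ISO in one segment
        for pattern in _FULL_ISO:
            m = pattern.fullmatch(seg)
            found = _make(*(int(g) for g in m.groups())) if m else None
            if found is not None:
                return found
    for a, b, c in zip(segs, segs[1:], segs[2:]):      # 2. /YYYY/MM/DD/
        if _YEAR.fullmatch(a) and _TWO_DIGITS.fullmatch(b) and _TWO_DIGITS.fullmatch(c):
            found = _make(int(a), int(b), int(c))
            if found is not None:
                return found
    for seg in segs:                                   # 3. /YYYY-MM/
        m = _YEAR_MONTH.fullmatch(seg)
        found = _make(int(m.group(1)), int(m.group(2))) if m else None
        if found is not None:
            return found
    for a, b in zip(segs, segs[1:]):                   # 4. /YYYY/MM/
        if _YEAR.fullmatch(a) and _TWO_DIGITS.fullmatch(b):
            found = _make(int(a), int(b))
            if found is not None:
                return found
    for pattern in (_YEAR_SLUG, _YEAR):                # 5. year-prefixed slug, 6. bare year
        for seg in segs:
            if pattern.fullmatch(seg):
                found = _make(int(seg[:4]))
                if found is not None:
                    return found
    return None
```

Because every pattern uses `fullmatch` on a whole path segment, a year buried mid-slug never matches, and an invalid month or day simply falls through to a coarser pass instead of aborting.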
The insight stage shipped with a `use_strong_model_refinement: bool` config flag and a placeholder branch in `InsightPipeline.run` that only appended a "not yet implemented" note when the flag was True. A `strong_model: str = "gpt-4o"` config sibling pointed at the intended refinement model.
That state is the worst of all worlds: users who enable the flag get a silent no-op (their records aren't refined, the only sign is a note buried in `InsightRunResult.notes`), and future contributors have to reverse-engineer the design intent. Better to remove the scaffold entirely and revisit when there's data showing a strong pass is actually needed.
Design intent is preserved in #26, labelled `post-benchmark`. That issue captures:
- What the two-pass design was supposed to do
- When to revisit (after the first end-to-end benchmark run identifies failure modes: confidence miscalibration, misclassified event_type, missing structured fields, or subtle hallucinations surviving the substring guard)
- Concrete scope when implementing (confidence-only vs field-fill vs full refinement, budget plumbing, reject-path semantics, tests)
- Definition of done with cost guardrails
Changes:
- `bioscancast/insight/config.py`: removed `use_strong_model_refinement` and `strong_model` from both INSIGHT_CONFIG and InsightConfig.
- `bioscancast/insight/pipeline.py`: removed the placeholder branch.
- `bioscancast/insight/README.md`: replaced the TODO bullet with a pointer to issue #26 explaining why the scaffold is gone.
All 384 tests still pass; the flag had no behaviour to test.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…erface
contamination.py and test_historical_topup.py were both added in feat/as-of-date-replay (#24) against the legacy bioscancast.llm.client module, which feat/insight-stage-hardening (#27) deleted as part of the LLM-protocol migration. The textual merge succeeded but left a broken import and two stale fake LLM classes using the old single-arg generate_json signature.
Migrates both to the new structured generate_json(system=, user=, schema=, model=, max_tokens=) interface returning LLMResponse, with explicit prompt/schema constants in contamination.py.
pytest: 447 passed, 2 skipped.
1f19eaa to 2d77493
rapsoj approved these changes May 28, 2026
Summary
Hardens the insight stage so the pipeline produces usable records on real biosecurity documents (WHO sitreps, CDC MMWR, ECDC CDTR, CIDRAP, ProMED, Africa CDC). The branch started with 8 items from a consolidated action list and grew to 16 commits as testing surfaced additional issues that were small, related, and clearly worth fixing in-scope.
The unifying theme: fix the things that prevent the insight pipeline from producing good records on real inputs. Some items touch extraction-stage code (HTML/PDF parsing) because insight-stage symptoms had extraction-stage causes.
Depends on #24
Rebased onto `feat/as-of-date-replay` to absorb the LLM-interface migration for `contamination.py` and the `historical_roleplay` parameter for `query_decomposition.py`. Without this rebase, the textual merge of #24+#27 produces broken imports (#24's contamination module uses the legacy `bioscancast.llm.client` that #27 deletes) and a conflict in `query_decomposition.py` (both PRs modified it independently). Merging this PR after #24 is now self-consistent; the orchestrator branch's cleanup commit becomes redundant on its next rebase.

What's included
810327f
5c50f99
99f93dd
a47be32 `event_date_precision` field + two-stage dedup so `2026-01-25` (day) merges with `2026-01` (month)
ecdb726 `pycountry`: full ISO 3166-1 coverage, US states / regional rejection
d3205a4 `/Title` metadata (e.g. ECDC's `2026-WCP-0020 Draft.docx` no longer leaks through)
49e374d `LLMClient` protocol
e29103d `ThreadPoolExecutor`: ~2× per-doc wall-clock
efe62c3 `metric_name` in the extraction prompt: collapses 17 phrasing variants to 5 canonical ones
95e4738 `metric_value` differs, catches LLM location-attribution errors
ce3ba11 `bioscancast/llm/client.py` deleted
366aed8
e324d19 `trafilatura`'s structural XML output instead of throwing it away: CIDRAP chunks 18 → 2, ProMED table preserved
a0b6d9c `published_date` extraction: JSON-LD, Dublin Core (`dcterms:issued`/`DC.date.issued`), sailthru/parsely, +modification-semantic fallback
537fd93 `/han/2024/...` (CDC HAN) and `/item/2024-DON530` (WHO DON)
618484e `use_strong_model_refinement` no-op flag; design preserved in #26
2d77493 `contamination.py` and the two as-of-date tests from `bioscancast.llm.client` to the shared `bioscancast.llm.base.LLMClient` interface; closes the cross-PR gap with #24

Headline numbers: final live verification on 6 real biosecurity docs
Metrics compared: `metric_name` strings, `iso_country_code` populated, `requires_ocr` (table values not preserved)

Cost increase (~30%) is driven entirely by ProMED now contributing real content (rendered outbreak table) and ECDC producing ~2× the records. Both increases are buying real value, still solidly under $0.01 per run.
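For reference, the two-stage date dedup listed under "What's included" could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the branch: the `Record` class, its fields other than `event_date_precision`, the `dedup` function name, and the policy of keeping the day-precision record over the month-precision one are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical, pared-down stand-in for an insight record."""
    location: str
    metric_name: str
    event_date: str            # ISO string: "2026-01-25" or "2026-01"
    event_date_precision: str  # "day" or "month"


def dedup(records: list[Record]) -> list[Record]:
    # Stage 1: drop exact duplicates (same key, same date), keeping first-seen order.
    unique = list(dict.fromkeys(records))

    # Stage 2: fold each month-precision record into a day-precision record
    # that shares its key and falls inside that month.
    day_months = {
        (r.location, r.metric_name, r.event_date[:7])
        for r in unique
        if r.event_date_precision == "day"
    }
    return [
        r for r in unique
        if not (
            r.event_date_precision == "month"
            and (r.location, r.metric_name, r.event_date) in day_months
        )
    ]


records = [
    Record("CD", "confirmed_cases", "2026-01-25", "day"),
    Record("CD", "confirmed_cases", "2026-01", "month"),   # covered by the day record
    Record("CD", "confirmed_cases", "2026-01-25", "day"),  # exact duplicate
    Record("UG", "confirmed_cases", "2026-01", "month"),   # no day record, so it stays
]
merged = dedup(records)
```

Without a precision field, `2026-01-25` and `2026-01` compare as different dates and both records survive; tagging the precision is what lets the second stage treat the month-level record as a coarser view of the same event.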
What's NOT in this PR
Deferred deliberately. Not blockers for forecasting; each warrants its own dedicated scope:
- `ForecastQuestion`. Bigger scope, separate branch.
- 618484e; design preserved in #26, "Strong-model refinement pass in the insight stage" (labelled `post-benchmark`). Revisit after the first benchmark run identifies whether the cheap-model records need refinement.

Test plan
- `python -m pytest bioscancast/tests/`: 447 passed, 2 skipped, 0 failed on the rebased branch (was 221 before; +226 new tests across this PR and #24, "Add historical-replay mode for benchmark fairness")
- (`bioscancast/tests/test_insight_real_docs_integration.py`) confirms extraction + insight pipeline produces records on all 6 working biosecurity documents
- `contamination.py` and the two as-of-date tests from #24 pass after migration to the shared LLMClient interface

Reviewer notes
- `data/investigations/`, the manual eval script at `scripts/eval_insight_on_real_docs.py`) are deliberately not committed
- `event_date_precision` is a new optional field on `InsightRecord`)

🤖 Generated with Claude Code