Search/filter dashboard chokepoint + relevance ranking #30

Open
smodee wants to merge 8 commits into main from feat/search-filter-quality

smodee (Collaborator) commented Jun 3, 2026

Search/filter dashboard chokepoint + relevance ranking

Split from the original feat/end-to-end-orchestrator branch (was #28).
Companion PRs: #29 (orchestrator) and #31 (insight extraction
quality). All three are independent and can merge in any order.

Summary

A bundle of search-stage and filter-stage improvements driven by a systematic
review of 10 live forecasting questions (H5N1, the 2026 DRC+Uganda Ebola
outbreak, the Andes-virus cruise hantavirus cluster, mpox, Marburg) spanning
range, binary, and categorical question types, plus a deterministic offline
sweep of filter thresholds and search-stage weights against hand-labeled live
pools. The findings file (data/investigations/findings-issues-3-4-13.md)
carries the full method and numbers; most of the relevant tables are surfaced
below. Total API spend across all the live runs that produced this branch:
~$0.05.

Addresses issues #3, #4, #13, #14.

What's included

Filter chokepoint (#13, #14)

  • Dashboard-injected results bypass the keyword-overlap heuristic
    (retrieval_reason == "dashboard_lookup") — they were getting
    keyword_overlap_score == 0.000 and being dropped despite being curated
    authoritative sources.
  • Dashboard titles/snippets enriched with pathogen-specific text in
    biosecurity_sources.py (was "Dashboard: cdc.gov" / "Known mpox monitoring
    dashboard"; now real readable titles+snippets per entry).
  • heuristic_keep_threshold lowered 0.72 → 0.65. Note (from the sweep) that
    the threshold sits on a flat plateau from 0.60 to 0.775; see the #13
    finding.
  • Dashboard-bypass docs exempted from cap_per_domain_and_type so a
    curated dashboard doesn't displace an organic result on the same domain.

Search-stage relevance ranking (#4)

search_stage_score was 0.5·domain + 0.3·freshness + 0.2·(1/rank), with no
topical-relevance term, so high-authority but off-topic results ranked at
the top and consumed total_cap slots. It is now

0.45·relevance + 0.30·domain + 0.10·freshness + 0.15·(1/rank)

where relevance reuses the filter's existing keyword_overlap_score /
build_query_terms. Freshness is weighted low because it is near-uniform in
live mode. Weights sum to 1.0; the score drives ranking + truncation only.
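The reweighting above can be sketched as follows. This is a minimal illustration, not the branch's code: the function names and signatures are hypothetical, all inputs are assumed to be normalized to [0, 1], and rank is assumed to be the 1-based provider position.

```python
# Sketch of the old and new search-stage blends described above.
# Names and signatures are hypothetical; only the weights come from the PR text.
NEW_WEIGHTS = {"relevance": 0.45, "domain": 0.30, "freshness": 0.10, "rank": 0.15}


def legacy_score(domain: float, freshness: float, rank: int) -> float:
    """Old blend: no topical-relevance term."""
    return 0.5 * domain + 0.3 * freshness + 0.2 * (1.0 / rank)


def blended_score(relevance: float, domain: float, freshness: float, rank: int) -> float:
    """New blend: relevance carries the largest weight."""
    if rank < 1:
        raise ValueError("rank is assumed to be 1-based")
    return (
        NEW_WEIGHTS["relevance"] * relevance
        + NEW_WEIGHTS["domain"] * domain
        + NEW_WEIGHTS["freshness"] * freshness
        + NEW_WEIGHTS["rank"] * (1.0 / rank)
    )


# Invented inputs: an on-topic Tier 3 article at rank 3 versus an off-topic
# top-authority page at rank 1.
on_topic = dict(relevance=0.8, domain=0.6, freshness=0.5, rank=3)
off_topic = dict(relevance=0.0, domain=1.0, freshness=0.5, rank=1)

print(blended_score(**on_topic), blended_score(**off_topic))
print(legacy_score(0.6, 0.5, 3), legacy_score(1.0, 0.5, 1))
```

With these invented inputs the legacy blend ranks the off-topic page first, while the new blend ranks the on-topic article first, which is the behavior change the sweep below measures.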

Sweep numbers (micro-averaged precision@P / MAP, scored vs labels):

Variant          Weights (domain/fresh/rank[/rel])   prec@P   MAP
current          0.5/0.3/0.2                         0.694    0.506
no-freshness     0.6/0/0.4                           0.694    0.528
rank-heavy       0.4/0.1/0.5                         0.694    0.580
domain-heavy     0.7/0.1/0.2                         0.694    0.506
+relevance       0.25/0.15/0.1/0.5                   0.725    0.600
relevance-only   -                                   0.755    0.656

Reordering domain/freshness/rank leaves precision@P flat at 0.694; adding a
relevance term is the only thing that moves the needle.

Dashboard sources (#3)

  • Fixed a broken dashboard URL (cdc.gov/mpox/data-research → 404; now
    the extractable monkeypox/situation-summary page — re-running mpox after
    the URL fix went 0 → 2 insight records) and two stale redirects
    (afro.who.int ebola-disease, cdc.gov/ebola/about).
  • DASHBOARD_LOOKUP routing was an exact lowercase-key match, so the
    CSV-natural "marburg virus disease" failed to route to the marburg key
    (→ zero on-topic results). Added _resolve_pathogen_key with alias +
    substring matching (marburg virus disease→marburg, monkeypox→mpox,
    bird flu→h5n1).
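A rough sketch of the alias + substring resolution described above. The real _resolve_pathogen_key is not shown in this PR text, so the function name, the key list, and the matching order here are assumptions; only the three example mappings come from the description.

```python
from typing import Optional

# Hypothetical registry keys; the real DASHBOARD_LOOKUP keys may differ.
KNOWN_KEYS = ("marburg", "mpox", "h5n1", "ebola")

# Aliases taken from the examples in the PR description.
ALIASES = {"monkeypox": "mpox", "bird flu": "h5n1"}


def resolve_pathogen_key(name: str) -> Optional[str]:
    """Map a free-text pathogen name to a registry key, or None if no match."""
    normalized = " ".join(name.lower().split())
    # 1. Exact key match (the only path the old routing supported).
    if normalized in KNOWN_KEYS:
        return normalized
    # 2. Exact alias match.
    if normalized in ALIASES:
        return ALIASES[normalized]
    # 3. Substring match: an alias or key embedded in a longer name.
    for alias, key in ALIASES.items():
        if alias in normalized:
            return key
    for key in KNOWN_KEYS:
        if key in normalized:
            return key
    return None


print(resolve_pathogen_key("Marburg virus disease"))
print(resolve_pathogen_key("monkeypox"))
print(resolve_pathogen_key("bird flu"))
```

Substring matching is deliberately permissive; a name that happens to contain a key would also route there, so the real implementation may want tighter rules.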

Source tiers (#13)

  • Promoted ~22 national/international outlets (CNN, NBC, CBS, ABC, NPR, USA
    Today, LA Times, Politico, Axios, Forbes, Bloomberg, FT, WSJ, Economist,
    Time, The Atlantic, Ars Technica, Business Insider, …) from unknown (0.2)
    to Tier 3 trusted_media (0.6). Legitimate outbreak reporting from these
    was being floored below the filter's credibility threshold.
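For reviewers sanity-checking the tier change, here is a simplified sketch of a tier lookup with second-level-domain matching (the commit message says this is how subdomains are covered). The set contents, the function name, and the "last two labels" rule are assumptions for illustration; the real source_tiers.py may be structured differently.

```python
from urllib.parse import urlparse

# Scores quoted in the PR text; the tier table here is illustrative only.
TIER_SCORES = {"trusted_media": 0.6, "unknown": 0.2}

# A small illustrative subset, not the real list in source_tiers.py.
TRUSTED_MEDIA = {"cnn.com", "npr.org", "cbsnews.com"}


def domain_score(url: str) -> float:
    """Score a URL by its second-level domain so subdomains inherit the tier."""
    host = (urlparse(url).hostname or "").lower()
    # Naive rule: keep the last two labels. This is wrong for suffixes such
    # as co.uk, which a real implementation would need to handle.
    second_level = ".".join(host.split(".")[-2:])
    tier = "trusted_media" if second_level in TRUSTED_MEDIA else "unknown"
    return TIER_SCORES[tier]


print(domain_score("https://edition.cnn.com/health/some-outbreak-story"))
print(domain_score("https://example.org/blog/post"))
```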

No-LLM filter fallback (#13)

  • Default-off FILTER_CONFIG["no_llm_soft_fallback"] (+
    no_llm_fallback_relevance_threshold). When llm_client is None, the
    ambiguous rerank band was always rejected (fail-closed) — too aggressive
    for dev/offline/no-API-key runs. With the flag on, a borderline candidate
    is kept iff it is an official domain OR clears the relevance threshold,
    approximating the LLM-rescue path. Production (always has an LLM client)
    is unchanged.
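The fallback rule can be sketched like this. The two config keys are the ones named above; the threshold value, the function name, and its signature are assumptions, and the sketch covers only the llm_client is None case.

```python
# Sketch of the no-LLM soft fallback for a candidate in the ambiguous band.
# The 0.3 threshold is an assumed placeholder, not the value in the branch.
FILTER_CONFIG = {
    "no_llm_soft_fallback": False,  # default-off, as in the PR
    "no_llm_fallback_relevance_threshold": 0.3,
}


def keep_borderline_without_llm(is_official_domain: bool, relevance: float,
                                config: dict = FILTER_CONFIG) -> bool:
    """Decide a borderline candidate when no LLM client is available."""
    if not config["no_llm_soft_fallback"]:
        # Original behavior: fail closed.
        return False
    threshold = config["no_llm_fallback_relevance_threshold"]
    return is_official_domain or relevance >= threshold


soft = {**FILTER_CONFIG, "no_llm_soft_fallback": True}
print(keep_borderline_without_llm(False, 0.9))        # flag off
print(keep_borderline_without_llm(True, 0.0, soft))   # official domain
print(keep_borderline_without_llm(False, 0.1, soft))  # low relevance
```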

Known interaction to flag for reviewers

Promoting outlets to Tier 3 raises recall of legitimate reporting, but
because the filter's priority_score still weights credibility heavily
(and the new relevance term lives in search ranking, not the filter's keep
decision), it can also admit off-topic pieces from those same outlets (an
H5N1 run kept a CBS transcript). A sensible follow-up is to raise the
filter's keyword-overlap weight / lower its 0.25·credibility blend — that's
not in this PR.

Issues this PR addresses

  • Closes #14 (Dashboard-injected results have low keyword overlap due to
    generic titles): fixed at the root (titles) and with a backstop (bypass +
    cap exemption).
  • #4 (Tune search_stage_score weights (0.5/0.3/0.2)): addressed. Added the
    relevance term and retuned weights; the follow-up review showed the missing
    relevance signal, not the weight split, was the issue. Reviewers can likely
    close.
  • #3 (Evaluate dashboard lookup value vs organic search): substantially
    addressed. Evaluated value (bimodal extractability finding), fixed the
    broken/stale URLs, and made routing tolerant. One follow-up remains:
    deciding whether non-extractable interactive dashboards should consume
    survival slots (and trimming/expanding the list accordingly).
  • #13 (Heuristic filtering scores too low for real-world search results):
    materially improved (threshold + bypass + cap exemption + tier coverage +
    opt-in no-LLM soft fallback). The review also showed the 0.65 threshold is
    on a flat plateau (not the lever) and that the filter's
    credibility-vs-relevance balance is the remaining knob; see the follow-up
    noted under "Known interaction to flag for reviewers". Reviewers decide
    whether the original bug is resolved.

Verification

  • python -m pytest bioscancast/tests/ → 452 passed, 2 skipped (live).
    New tests cover the relevance scoring formula, tolerant dashboard routing,
    Tier 3 outlet coverage, and the no-LLM soft-fallback flag.
  • Ran q7 (historical replay) and q12 (live) end-to-end; the resulting
    artifacts were inspected for filter survival, records, and cost.
  • Follow-up: 10 fresh live questions + hand-labeled offline sweep — see
    data/investigations/findings-issues-3-4-13.md. Re-running mpox after the
    dashboard URL fix went from 0 → 2 insight records.

Reviewer checklist

  • Confirm the dashboard cap-exemption policy in
    cap_per_domain_and_type (curated dashboards never consume a domain slot).
  • Sanity-check the Tier 3 outlet additions in source_tiers.py and the
    credibility-vs-relevance interaction noted above.
  • Note the 0.65 keep threshold was left as-is after the sweep, which showed
    it sits on a flat plateau (it's not the lever).

smodee and others added 8 commits June 3, 2026 14:18

Dashboard URLs from the curated registry in
bioscancast/datasets/biosecurity_sources.py have hand-picked titles
("Dashboard: cdc.gov") and generic snippets that produce
keyword_overlap_score = 0.000 against any real forecast question. The
heuristic priority score drags them under the 0.72 keep threshold even
though they are by construction high-value sources.

Live-run evidence: q7 and q12 each injected two dashboards. All four
had keyword_overlap = 0.000. Two of those four were dropped pre-LLM,
including ourworldindata.org for q7 - which is the resolution source
named in the question's relevant_links column.

Fix: in heuristic_filter, detect retrieval_reason == "dashboard_lookup"
and auto-keep with reason_code "dashboard_lookup_bypass" and a synthetic
priority_score of 1.0. The dashboards still go through the rest of the
filtering pipeline (dedup, per-domain cap, extraction-hint assignment)
unchanged - this is the keyword-overlap chokepoint only.
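A minimal sketch of the bypass described in this commit message. The Candidate shape, the function name, and the "organic_search" value are hypothetical; the retrieval_reason check, the reason code, and the synthetic 1.0 score come from the text above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    # Hypothetical stand-in for the pipeline's search-result record.
    url: str
    retrieval_reason: str
    priority_score: float = 0.0
    selection_reasons: List[str] = field(default_factory=list)


def heuristic_auto_keep(candidate: Candidate, heuristic_score: float,
                        keep_threshold: float = 0.72) -> bool:
    """Return True if the candidate is auto-kept by the heuristic stage."""
    if candidate.retrieval_reason == "dashboard_lookup":
        # Curated dashboards skip the keyword-overlap chokepoint entirely.
        candidate.priority_score = 1.0
        candidate.selection_reasons.append("dashboard_lookup_bypass")
        return True
    candidate.priority_score = heuristic_score
    return heuristic_score >= keep_threshold


dash = Candidate("https://example.org/dashboard", "dashboard_lookup")
print(heuristic_auto_keep(dash, heuristic_score=0.0), dash.selection_reasons)
```

Candidates that are not auto-kept would still go on to the borderline/LLM path in the real filter; that part is omitted here.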

Implements item 1 from the Tier 1 roadmap. Pairs with the dashboard
title/snippet enrichment in the next commit.

The previous dashboard injection used generic strings ("Dashboard: cdc.gov",
"Known mpox monitoring dashboard") that produced keyword_overlap_score
= 0.000 against every real forecast question - 4/4 injected dashboards
in the q7/q12 live runs had this exact failure mode.

The fix: turn DASHBOARD_LOOKUP into a list of DashboardEntry dataclasses
carrying url + title + snippet, with hand-written pathogen-specific text
for each entry. The titles read as real search-result titles ("CDC H5N1
bird flu situation summary: human cases and outbreaks in the United
States") and the snippets describe what data the page hosts.
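To illustrate why the enrichment matters, here is a rough sketch of the entry shape plus a toy overlap score. The DashboardEntry fields match the commit message (url + title + snippet); the dict-of-lists layout, the placeholder URL and snippet, and toy_overlap (a stand-in for the real keyword_overlap_score) are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DashboardEntry:
    url: str
    title: str
    snippet: str


# Assumed layout: pathogen key -> list of entries. URL and snippet are placeholders.
DASHBOARD_LOOKUP = {
    "h5n1": [
        DashboardEntry(
            url="https://example.org/h5n1-dashboard",
            title=("CDC H5N1 bird flu situation summary: human cases and "
                   "outbreaks in the United States"),
            snippet="Placeholder snippet describing the data the page hosts.",
        )
    ]
}


def toy_overlap(query_terms: Iterable[str], text: str) -> float:
    """Fraction of query terms found in the text (toy scoring, not the real one)."""
    terms = list(query_terms)
    if not terms:
        return 0.0
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(1 for term in terms if term in tokens) / len(terms)


terms = ["h5n1", "human", "cases"]
entry = DASHBOARD_LOOKUP["h5n1"][0]
print(toy_overlap(terms, "Dashboard: cdc.gov"))
print(toy_overlap(terms, entry.title + " " + entry.snippet))
```

Under this toy scoring the generic title shares no terms with the query, while the enriched title matches all three.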

Pairs with the previous commit's dashboard heuristic bypass: even with
the bypass in place, better titles still help (a) the keyword-overlap
score for downstream scoring, and (b) the LLM rescue path when it
encounters other pathogen-specific dashboards we add later. The bypass
keeps low-keyword-overlap dashboards alive; this commit makes them
discoverable on their own merits.

Implements item 5 from the Tier 1/2 roadmap.

Live runs on q7 and q12 showed filter survival of 4.7% and 13.5%
respectively, even with LLM rescue enabled. The 0.72 threshold was set
without benchmarking against real Tavily output and is too tight for
the heuristic's actual signal.

With the new threshold, priority_scores in the 0.65-0.72 band are
auto-kept by heuristics instead of routed to the LLM rescue path. The
borderline threshold (0.45) is unchanged, so the LLM filter still
gates 0.45-0.65 candidates - the change just moves the auto-keep line
to better match what the heuristic can actually distinguish.
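The band routing described here can be sketched as below. The two thresholds come from the commit message; the function name, the inclusive boundaries, and the assumption that anything under the borderline threshold is rejected outright are illustrative guesses.

```python
def route_by_priority(priority_score: float,
                      keep_threshold: float = 0.65,
                      borderline_threshold: float = 0.45) -> str:
    """Route a candidate by its heuristic priority_score (sketch only)."""
    if priority_score >= keep_threshold:
        return "auto_keep"
    if priority_score >= borderline_threshold:
        return "llm_review"
    # Assumed: scores below the borderline threshold are dropped.
    return "reject"


# A 0.70 candidate moves from LLM review to auto-keep under the new threshold.
print(route_by_priority(0.70, keep_threshold=0.72))
print(route_by_priority(0.70))
```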

Implements item 2 from the Tier 1 roadmap. Pairs with the dashboard
bypass + enrichment commits to attack the filter chokepoint from
multiple angles.

q7's second live run on this branch surfaced an interaction between the
new dashboard heuristic bypass and the per-domain cap. With
max_docs_per_domain=2 and the dashboard bypass injecting one who.int
slot at synthetic priority 1.0, the cap was effectively reducing
who.int to ONE organic slot - and the slot was going to a
priority-0.7097 strategic-plan announcement page, squeezing out the
priority-0.6966 WHO mpox research event page that the baseline run had
extracted records from.

Offline filter replay on the saved q7 search.json confirms the
mechanism:

  Heuristic-keep (4 who.int / ourworldindata.org docs):
    1.0000 WHO sitreps dashboard (bypass)
    1.0000 OWID mpox dashboard (bypass)
    0.7097 WHO global strategic preparedness plan (organic)
    0.6966 WHO mpox research event (organic) <- baseline's data source

  After old cap_per_domain (max=2 per domain):
    Dashboards displace one organic each; research event capped out.

The fix: dashboard-bypass docs (selection_reasons contains
"dashboard_lookup_bypass") are always kept and do not consume a slot
against the per-domain or per-type caps. They are curated additions,
not competing organic results.

After the change all four candidates survive, and the WHO research
event page reaches insight as it did in the baseline.
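The replay above can be reproduced with a small sketch. The Doc shape, the function name, and the exempt_dashboards switch are hypothetical, and only the per-domain cap is modeled (the per-type cap is omitted); the four documents and their scores are the ones listed in this commit message.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class Doc:
    title: str
    domain: str
    priority_score: float
    selection_reasons: List[str] = field(default_factory=list)


def cap_per_domain(docs: List[Doc], max_docs_per_domain: int = 2,
                   exempt_dashboards: bool = True) -> List[Doc]:
    """Keep the top-scoring docs per domain; bypass docs may skip the cap."""
    kept: List[Doc] = []
    used = defaultdict(int)
    for doc in sorted(docs, key=lambda d: d.priority_score, reverse=True):
        is_bypass = "dashboard_lookup_bypass" in doc.selection_reasons
        if exempt_dashboards and is_bypass:
            kept.append(doc)  # always kept, consumes no slot
            continue
        if used[doc.domain] < max_docs_per_domain:
            used[doc.domain] += 1
            kept.append(doc)
    return kept


bypass = ["dashboard_lookup_bypass"]
pool = [
    Doc("WHO sitreps dashboard", "who.int", 1.0, list(bypass)),
    Doc("OWID mpox dashboard", "ourworldindata.org", 1.0, list(bypass)),
    Doc("WHO global strategic preparedness plan", "who.int", 0.7097),
    Doc("WHO mpox research event", "who.int", 0.6966),
]

old = [d.title for d in cap_per_domain(pool, exempt_dashboards=False)]
new = [d.title for d in cap_per_domain(pool)]
print(len(old), "WHO mpox research event" in old)
print(len(new), "WHO mpox research event" in new)
```

With the exemption off, the WHO dashboard takes one of the two who.int slots and the research event page is capped out; with it on, all four documents survive.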

447 tests still passing.

search_stage_score was 0.5*domain + 0.3*freshness + 0.2*(1/rank) with no topical-relevance signal, so high-authority but off-topic results ranked at the top (e.g. sports/legal/unrelated-pathogen news). It is now 0.45*relevance + 0.30*domain + 0.10*freshness + 0.15*(1/rank), reusing the filter's keyword_overlap_score/build_query_terms. Freshness is kept low because it is near-uniform in live mode. Addresses #4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The CDC mpox dashboard URL returned 404; replaced with the (extractable) monkeypox/situation-summary page. Updated two stale redirects (afro.who.int ebola-disease, cdc.gov/ebola/about). DASHBOARD_LOOKUP routing was an exact lowercase key match, so 'marburg virus disease' failed to route to the 'marburg' key; added _resolve_pathogen_key with alias + substring matching (marburg virus disease->marburg, monkeypox->mpox, bird flu->h5n1). Addresses #3.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Reputable outbreak reporting from outlets like CNN, NBC, CBS, ABC, NPR, USA Today, LA Times, Politico, Axios, Forbes, Bloomberg, FT, WSJ, Economist, Time, The Atlantic, Ars Technica and Business Insider was resolving to the 'unknown' tier (domain_score 0.2), sinking it below the filter's credibility floor. Promote them to Tier 3 (trusted_media, 0.6); second-level-domain matching covers subdomains. Relates to #13.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

When llm_client is None the ambiguous rerank band was always rejected (fail-closed), which is overly aggressive for dev/offline/no-API-key runs. Add a default-off FILTER_CONFIG flag 'no_llm_soft_fallback' (+ no_llm_fallback_relevance_threshold) that instead keeps a borderline candidate iff it is an official domain OR its keyword-overlap relevance clears the threshold, approximating the LLM-rescue path. Production (always has an LLM client) is unchanged. Addresses #13.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Dashboard-injected results have low keyword overlap due to generic titles

1 participant