Search/filter dashboard chokepoint + relevance ranking by smodee · Pull Request #30 · algorithmicgovernance/BioScanCast

smodee · 2026-06-03T11:23:16Z

Search/filter dashboard chokepoint + relevance ranking

Split from the original feat/end-to-end-orchestrator branch (was #28).
Companion PRs: #29 (orchestrator) and #31 (insight extraction
quality). All three are independent and can merge in any order.

Summary

A bundle of search-stage and filter-stage improvements driven by a systematic
review of 10 live forecasting questions (H5N1, the 2026 DRC+Uganda Ebola
outbreak, the Andes-virus cruise hantavirus cluster, mpox, Marburg) spanning
range, binary, and categorical question types, plus a deterministic offline
sweep of filter thresholds and search-stage weights against hand-labeled live
pools. The findings file (data/investigations/findings-issues-3-4-13.md)
carries the full method and numbers; most of the relevant tables are surfaced
below. Total API spend across all the live runs that produced this branch: ~$0.05.

Addresses issues #3, #4, #13, #14.

What's included

Filter chokepoint (#13, #14)

Dashboard-injected results bypass the keyword-overlap heuristic
(retrieval_reason == "dashboard_lookup") — they were getting
keyword_overlap_score == 0.000 and being dropped despite being curated
authoritative sources.
Dashboard titles/snippets enriched with pathogen-specific text in
biosecurity_sources.py (was "Dashboard: cdc.gov" / "Known mpox monitoring
dashboard"; now real readable titles+snippets per entry).
heuristic_keep_threshold lowered 0.72 → 0.65. Note (from the sweep)
the threshold sits on a flat plateau over 0.60–0.775; see the finding for
#13 (Heuristic filtering scores too low for real-world search results).
Dashboard-bypass docs exempted from cap_per_domain_and_type so a
curated dashboard doesn't displace an organic result on the same domain.
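
A minimal sketch of the bypass described above, assuming a simplified candidate record. The Candidate class and heuristic_keep function are hypothetical stand-ins for illustration; only the retrieval_reason value, the reason code, the synthetic score of 1.0, and the 0.65 threshold come from this PR.

```python
from dataclasses import dataclass, field
from typing import List

HEURISTIC_KEEP_THRESHOLD = 0.65  # value stated in this PR


@dataclass
class Candidate:
    """Hypothetical stand-in for the pipeline's search-result record."""
    url: str
    retrieval_reason: str
    priority_score: float
    selection_reasons: List[str] = field(default_factory=list)


def heuristic_keep(candidate: Candidate) -> bool:
    """Return True if the heuristic stage auto-keeps the candidate."""
    if candidate.retrieval_reason == "dashboard_lookup":
        # Curated dashboards skip the keyword-overlap gate entirely and
        # carry a synthetic top score plus a traceable reason code.
        candidate.priority_score = 1.0
        candidate.selection_reasons.append("dashboard_lookup_bypass")
        return True
    # Organic results still have to clear the keep threshold.
    return candidate.priority_score >= HEURISTIC_KEEP_THRESHOLD
```

Recording the reason code on the candidate is what lets a later stage (such as the cap) recognize bypassed dashboards without re-deriving how they were retrieved.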

Search-stage relevance ranking (#4)

search_stage_score was 0.5·domain + 0.3·freshness + 0.2·(1/rank) — no
topical-relevance term, so high-authority but off-topic results ranked at
the top and consumed total_cap slots. It is now

0.45·relevance + 0.30·domain + 0.10·freshness + 0.15·(1/rank)

where relevance reuses the filter's existing keyword_overlap_score /
build_query_terms. Freshness is weighted low because it is near-uniform in
live mode. Weights sum to 1.0; the score drives ranking + truncation only.
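
To make the effect concrete, here is a small sketch that evaluates the old and new formulas on two made-up results. The function names and the input values are illustrative assumptions; it also assumes all component scores lie in [0, 1] and that rank is 1-indexed.

```python
def old_search_stage_score(domain: float, freshness: float, rank: int) -> float:
    """Previous blend: no topical-relevance term."""
    return 0.5 * domain + 0.3 * freshness + 0.2 * (1.0 / rank)


def new_search_stage_score(
    relevance: float, domain: float, freshness: float, rank: int
) -> float:
    """Blend from this PR; the four weights sum to 1.0."""
    return (
        0.45 * relevance
        + 0.30 * domain
        + 0.10 * freshness
        + 0.15 * (1.0 / rank)
    )


# Made-up inputs: a high-authority off-topic page at provider rank 1,
# and an on-topic Tier 3 article at provider rank 3.
off_topic = {"relevance": 0.0, "domain": 1.0, "freshness": 1.0, "rank": 1}
on_topic = {"relevance": 0.8, "domain": 0.6, "freshness": 1.0, "rank": 3}

old_off = old_search_stage_score(1.0, 1.0, 1)
old_on = old_search_stage_score(0.6, 1.0, 3)
new_off = new_search_stage_score(**off_topic)
new_on = new_search_stage_score(**on_topic)
```

With these inputs the old formula ranks the off-topic page first, while the new formula ranks the on-topic article first, which is the behavior change this section describes.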

Sweep numbers (micro-averaged precision@P / MAP, scored vs labels):

Variant (domain/fresh/rank[/rel])	prec@P	MAP
current 0.5/0.3/0.2	0.694	0.506
no-freshness 0.6/0/0.4	0.694	0.528
rank-heavy 0.4/0.1/0.5	0.694	0.580
domain-heavy 0.7/0.1/0.2	0.694	0.506
+relevance 0.25/0.15/0.1/0.5	0.725	0.600
relevance-only	0.755	0.656

Reordering domain/freshness/rank leaves precision@P flat at 0.694; adding a
relevance term is the only thing that moves the needle.
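
For readers who want to reproduce numbers of this kind, a minimal sketch of the two metrics over ranked lists of 0/1 labels follows. How the sweep defines the cutoff P and pools results for micro-averaging is not spelled out in this description, so the helper names and the per-pool cutoff argument are assumptions, not the sweep's actual code.

```python
from typing import List, Sequence, Tuple


def precision_at_k(labels: Sequence[int], k: int) -> float:
    """Fraction of the top-k ranked results labeled relevant (1)."""
    return sum(labels[:k]) / k if k else 0.0


def average_precision(labels: Sequence[int]) -> float:
    """Mean of precision values taken at each relevant position."""
    hits = 0
    total = 0.0
    for position, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def micro_precision(pools: List[Tuple[Sequence[int], int]]) -> float:
    """Pool hits and slots across (labels, cutoff) pairs before dividing."""
    hits = sum(sum(labels[:k]) for labels, k in pools)
    slots = sum(k for _, k in pools)
    return hits / slots if slots else 0.0


def mean_average_precision(rankings: List[Sequence[int]]) -> float:
    """Unweighted mean of per-question average precision."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

Because precision at a fixed cutoff ignores order within the kept set while average precision rewards relevant items ranked earlier, a reweighting can move MAP without moving precision, as the table above shows.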

Dashboard sources (#3)

Fixed a broken dashboard URL (cdc.gov/mpox/data-research → 404; now
the extractable monkeypox/situation-summary page — re-running mpox after
the URL fix went 0 → 2 insight records) and two stale redirects
(afro.who.int ebola-disease, cdc.gov/ebola/about).
DASHBOARD_LOOKUP routing was an exact lowercase-key match, so the
CSV-natural "marburg virus disease" failed to route to the marburg key
(→ zero on-topic results). Added _resolve_pathogen_key with alias +
substring matching (marburg virus disease→marburg, monkeypox→mpox,
bird flu→h5n1).
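
The alias + substring routing can be sketched roughly as below. This is an illustration, not the body of _resolve_pathogen_key: the key tuple, the alias table contents beyond the three examples above, and the order of the matching steps are assumptions.

```python
from typing import Optional

# Hypothetical subset of registry keys; the real DASHBOARD_LOOKUP may differ.
KNOWN_KEYS = ("marburg", "mpox", "h5n1", "ebola")

# The three aliases named in this PR.
ALIASES = {
    "marburg virus disease": "marburg",
    "monkeypox": "mpox",
    "bird flu": "h5n1",
}


def resolve_pathogen_key(raw: str) -> Optional[str]:
    """Map a free-text pathogen name to a registry key, or None."""
    name = raw.strip().lower()
    if name in KNOWN_KEYS:          # 1. exact key match (old behavior)
        return name
    if name in ALIASES:             # 2. exact alias match
        return ALIASES[name]
    for alias, key in ALIASES.items():
        if alias in name:           # 3. alias appears inside the name
            return key
    for key in KNOWN_KEYS:
        if key in name:             # 4. key appears inside the name
            return key
    return None
```

Substring matching is deliberately tolerant, so short keys can over-match unrelated names; returning None for anything unmatched keeps the failure visible rather than routing to a wrong pathogen.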

Source tiers (#13)

Promoted ~22 national/international outlets (CNN, NBC, CBS, ABC, NPR, USA
Today, LA Times, Politico, Axios, Forbes, Bloomberg, FT, WSJ, Economist,
Time, The Atlantic, Ars Technica, Business Insider, …) from unknown (0.2)
to Tier 3 trusted_media (0.6). Legitimate outbreak reporting from these
was being floored below the filter's credibility threshold.
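
The corresponding commit message notes that second-level-domain matching covers subdomains. A rough sketch of that lookup is below; the set contents are an illustrative subset, the function names are hypothetical, and the naive "last two labels" rule is an assumption that would mis-handle suffixes such as co.uk.

```python
from urllib.parse import urlparse

# Illustrative subset of the promoted outlets, keyed by second-level domain.
TRUSTED_MEDIA = {"cnn.com", "npr.org", "politico.com", "axios.com"}

# Scores stated in this PR for the two tiers involved.
TIER_SCORES = {"trusted_media": 0.6, "unknown": 0.2}


def second_level_domain(url: str) -> str:
    """Return the last two host labels, e.g. edition.cnn.com -> cnn.com."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host


def domain_score(url: str) -> float:
    """Look up the tier score, falling back to the 'unknown' tier."""
    if second_level_domain(url) in TRUSTED_MEDIA:
        return TIER_SCORES["trusted_media"]
    return TIER_SCORES["unknown"]
```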

No-LLM filter fallback (#13)

Default-off FILTER_CONFIG["no_llm_soft_fallback"] (+
no_llm_fallback_relevance_threshold). When llm_client is None, the
ambiguous rerank band was always rejected (fail-closed) — too aggressive
for dev/offline/no-API-key runs. With the flag on, a borderline candidate
is kept iff it is an official domain OR clears the relevance threshold,
approximating the LLM-rescue path. Production (always has an LLM client)
is unchanged.
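
A sketch of the decision rule for the ambiguous band when no LLM client is available. The two config keys are from this PR; the threshold value of 0.3 and the function name are hypothetical, chosen only so the example runs.

```python
FILTER_CONFIG = {
    "no_llm_soft_fallback": False,               # default-off, per this PR
    "no_llm_fallback_relevance_threshold": 0.3,  # hypothetical value
}


def keep_borderline_without_llm(
    is_official_domain: bool,
    relevance: float,
    config: dict = FILTER_CONFIG,
) -> bool:
    """Decide a borderline candidate's fate when llm_client is None."""
    if not config["no_llm_soft_fallback"]:
        # Original behavior: fail closed and reject the whole band.
        return False
    threshold = config["no_llm_fallback_relevance_threshold"]
    # Soft fallback: keep official sources or clearly on-topic results.
    return is_official_domain or relevance >= threshold
```

For a dev or offline run, a caller would pass a copy of the config with the flag set to True; leaving the default untouched preserves the fail-closed behavior everywhere else.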

Known interaction to flag for reviewers

Promoting outlets to Tier 3 raises recall of legitimate reporting, but
because the filter's priority_score still weights credibility heavily
(and the new relevance term lives in search ranking, not the filter's keep
decision), it can also admit off-topic pieces from those same outlets (an
H5N1 run kept a CBS transcript). A sensible follow-up is to raise the
filter's keyword-overlap weight / lower its 0.25·credibility blend — that's
not in this PR.

Issues this PR addresses

Closes #14 (Dashboard-injected results have low keyword overlap due to
generic titles): the dashboard low-keyword-overlap problem is fixed at the
root (titles) and as a backstop (bypass + cap exemption).
#4 (Tune search_stage_score weights (0.5/0.3/0.2)): addressed. Added the
relevance term and retuned weights; the follow-up review showed the missing
relevance signal, not the weight split, was the issue. Reviewers can likely
close.
#3 (Evaluate dashboard lookup value vs organic search): substantially
addressed. Evaluated value (bimodal extractability finding), fixed the
broken/stale URLs, and made routing tolerant. One follow-up remains: deciding
whether non-extractable interactive dashboards should consume survival slots
(and trimming/expanding the list accordingly).
#13 (Heuristic filtering scores too low for real-world search results):
materially improved (threshold + bypass + cap exemption + tier coverage +
opt-in no-LLM soft fallback). The review also showed the 0.65 threshold is on
a flat plateau (not the lever) and that the filter's credibility-vs-relevance
balance is the remaining knob; see "Known interaction to flag for reviewers"
above. Reviewers decide whether the original bug is resolved.

Verification

python -m pytest bioscancast/tests/ — 452 passed, 2 skipped (live).
New tests cover the relevance scoring formula, tolerant dashboard routing,
Tier 3 outlet coverage, and the no-LLM soft-fallback flag.
Ran q7 (historical replay) and q12 (live) end to end and inspected the
resulting artifacts for filter survival, records, and cost.
Follow-up: 10 fresh live questions + hand-labeled offline sweep; see
data/investigations/findings-issues-3-4-13.md. Re-running mpox after the
dashboard URL fix took insight records from 0 to 2.

Reviewer checklist

Confirm the dashboard cap-exemption policy in
cap_per_domain_and_type (curated dashboards never consume a domain slot).
Sanity-check the Tier 3 outlet additions in source_tiers.py and the
credibility-vs-relevance interaction noted above.
Note the 0.65 keep threshold (lowered from 0.72 in this PR) was left as-is
after the sweep, which showed it sits on a flat plateau (it's not the lever).

Dashboard URLs from the curated registry in bioscancast/datasets/biosecurity_sources.py have generic titles ("Dashboard: cdc.gov") and generic snippets that produce keyword_overlap_score = 0.000 against any real forecast question. The heuristic priority score drags them under the 0.72 keep threshold even though they are by construction high-value sources.

Live-run evidence: q7 and q12 each injected two dashboards. All four had keyword_overlap = 0.000. Two of those four were dropped pre-LLM, including ourworldindata.org for q7, which is the resolution source named in the question's relevant_links column.

Fix: in heuristic_filter, detect retrieval_reason == "dashboard_lookup" and auto-keep with reason_code "dashboard_lookup_bypass" and a synthetic priority_score of 1.0. The dashboards still go through the rest of the filtering pipeline (dedup, per-domain cap, extraction-hint assignment) unchanged; the bypass targets the keyword-overlap chokepoint only.

Implements item 1 from the Tier 1 roadmap. Pairs with the dashboard title/snippet enrichment in the next commit.

The previous dashboard injection used generic strings ("Dashboard: cdc.gov", "Known mpox monitoring dashboard") that produced keyword_overlap_score = 0.000 against every real forecast question - 4/4 injected dashboards in the q7/q12 live runs had this exact failure mode. The fix: turn DASHBOARD_LOOKUP into a list of DashboardEntry dataclasses carrying url + title + snippet, with hand-written pathogen-specific text for each entry. The titles read as real search-result titles ("CDC H5N1 bird flu situation summary: human cases and outbreaks in the United States") and the snippets describe what data the page hosts. Pairs with the previous commit's dashboard heuristic bypass: even with the bypass in place, better titles still help (a) the keyword-overlap score for downstream scoring, and (b) the LLM rescue path when it encounters other pathogen-specific dashboards we add later. The bypass keeps low-keyword-overlap dashboards alive; this commit makes them discoverable on their own merits. Implements item 5 from the Tier 1/2 roadmap.
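
A sketch of what the enriched registry shape might look like, with a toy overlap function to show why the title text matters. The DashboardEntry field names follow the commit message; the URL, the snippet wording, the dict-of-lists layout, and toy_overlap are assumptions (the real keyword_overlap_score is not reproduced here).

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class DashboardEntry:
    url: str
    title: str
    snippet: str


# Hypothetical registry entry; URL and snippet are placeholders.
DASHBOARD_LOOKUP: Dict[str, List[DashboardEntry]] = {
    "h5n1": [
        DashboardEntry(
            url="https://example.org/h5n1-situation-summary",
            title=(
                "CDC H5N1 bird flu situation summary: human cases and "
                "outbreaks in the United States"
            ),
            snippet="Placeholder: describes the case counts the page hosts.",
        )
    ]
}


def toy_overlap(query_terms: Set[str], text: str) -> float:
    """Toy stand-in: share of query terms present in the text."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    if not query_terms:
        return 0.0
    return len(query_terms & tokens) / len(query_terms)


terms = {"h5n1", "human", "cases"}
generic_score = toy_overlap(terms, "Dashboard: cdc.gov")
enriched_score = toy_overlap(terms, DASHBOARD_LOOKUP["h5n1"][0].title)
```

Under this toy measure the generic title shares no terms with the query while the enriched title shares all three, mirroring the 0.000 failure mode described above.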

Live runs on q7 and q12 showed filter survival of 4.7% and 13.5% respectively, even with LLM rescue enabled. The 0.72 threshold was set without benchmarking against real Tavily output and is too tight for the heuristic's actual signal. With the new threshold, priority_scores in the 0.65-0.72 band are auto-kept by heuristics instead of routed to the LLM rescue path. The borderline threshold (0.45) is unchanged, so the LLM filter still gates 0.45-0.65 candidates - the change just moves the auto-keep line to better match what the heuristic can actually distinguish. Implements item 2 from the Tier 1 roadmap. Pairs with the dashboard bypass + enrichment commits to attack the filter chokepoint from multiple angles.
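
The band routing this commit message describes can be sketched as follows. The two threshold values are from the text; the function name, the label strings, the inclusive lower bounds, and the assumption that scores under 0.45 are rejected outright are illustrative guesses.

```python
HEURISTIC_KEEP_THRESHOLD = 0.65  # lowered from 0.72 in this PR
BORDERLINE_THRESHOLD = 0.45      # unchanged


def route(priority_score: float, keep: float = HEURISTIC_KEEP_THRESHOLD) -> str:
    """Classify a candidate by its heuristic priority score."""
    if priority_score >= keep:
        return "heuristic_keep"   # kept without consulting the LLM
    if priority_score >= BORDERLINE_THRESHOLD:
        return "llm_rescue"       # ambiguous band, gated by the LLM filter
    return "reject"
```

For example, a candidate scoring 0.70 is routed to "llm_rescue" when the old 0.72 value is passed as keep, and to "heuristic_keep" with the new default.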

q7's second live run on this branch surfaced an interaction between the new dashboard heuristic bypass and the per-domain cap. With max_docs_per_domain=2 and the dashboard bypass injecting one who.int slot at synthetic priority 1.0, the cap was effectively reducing who.int to ONE organic slot - and the slot was going to a priority-0.7097 strategic-plan announcement page, squeezing out the priority-0.6966 WHO mpox research event page that the baseline run had extracted records from.

Offline filter replay on the saved q7 search.json confirms the mechanism:

Heuristic-keep (4 who.int / ourworldindata.org docs):
1.0000 WHO sitreps dashboard (bypass)
1.0000 OWID mpox dashboard (bypass)
0.7097 WHO global strategic preparedness plan (organic)
0.6966 WHO mpox research event (organic) <- baseline's data source

After old cap_per_domain (max=2 per domain):
Dashboards displace one organic each; research event capped out.

The fix: dashboard-bypass docs (selection_reasons contains "dashboard_lookup_bypass") are always kept and do not consume a slot against the per-domain or per-type caps. They are curated additions, not competing organic results. After the change all four candidates survive, and the WHO research event page reaches insight as it did in the baseline. 447 tests still passing.
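
The exemption can be sketched with the four q7 candidates from the replay above. This is a simplified illustration that handles only the per-domain cap, not the per-type cap; the function name, the dict-based document shape, and the domain assigned to each document are assumptions.

```python
from collections import defaultdict
from typing import Dict, List

BYPASS = "dashboard_lookup_bypass"

# The four q7 candidates from the replay, as plain dicts.
Q7_DOCS: List[Dict] = [
    {"title": "WHO sitreps dashboard", "domain": "who.int",
     "priority_score": 1.0, "selection_reasons": [BYPASS]},
    {"title": "OWID mpox dashboard", "domain": "ourworldindata.org",
     "priority_score": 1.0, "selection_reasons": [BYPASS]},
    {"title": "WHO global strategic preparedness plan", "domain": "who.int",
     "priority_score": 0.7097, "selection_reasons": []},
    {"title": "WHO mpox research event", "domain": "who.int",
     "priority_score": 0.6966, "selection_reasons": []},
]


def cap_per_domain(docs: List[Dict], max_docs_per_domain: int = 2) -> List[Dict]:
    """Keep the top-scoring docs per domain; bypass docs use no slot."""
    kept: List[Dict] = []
    used: Dict[str, int] = defaultdict(int)
    for doc in sorted(docs, key=lambda d: d["priority_score"], reverse=True):
        if BYPASS in doc["selection_reasons"]:
            kept.append(doc)  # curated addition: always kept, no slot used
            continue
        if used[doc["domain"]] < max_docs_per_domain:
            used[doc["domain"]] += 1
            kept.append(doc)
    return kept
```

With max_docs_per_domain=2 all four candidates survive, since only the two organic who.int pages count against the cap.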

search_stage_score was 0.5*domain + 0.3*freshness + 0.2*(1/rank) with no topical-relevance signal, so high-authority but off-topic results ranked at the top (e.g. sports/legal/unrelated-pathogen news). It is now 0.45*relevance + 0.30*domain + 0.10*freshness + 0.15*(1/rank), reusing the filter's keyword_overlap_score/build_query_terms. Freshness is kept low because it is near-uniform in live mode. Addresses #4.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The CDC mpox dashboard URL returned 404; replaced with the (extractable) monkeypox/situation-summary page. Updated two stale redirects (afro.who.int ebola-disease, cdc.gov/ebola/about). DASHBOARD_LOOKUP routing was an exact lowercase key match, so 'marburg virus disease' failed to route to the 'marburg' key; added _resolve_pathogen_key with alias + substring matching (marburg virus disease->marburg, monkeypox->mpox, bird flu->h5n1). Addresses #3.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Reputable outbreak reporting from outlets like CNN, NBC, CBS, ABC, NPR, USA Today, LA Times, Politico, Axios, Forbes, Bloomberg, FT, WSJ, Economist, Time, The Atlantic, Ars Technica and Business Insider was resolving to the 'unknown' tier (domain_score 0.2), sinking it below the filter's credibility floor. Promote them to Tier 3 (trusted_media, 0.6); second-level-domain matching covers subdomains. Relates to #13.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

When llm_client is None the ambiguous rerank band was always rejected (fail-closed), which is overly aggressive for dev/offline/no-API-key runs. Add a default-off FILTER_CONFIG flag 'no_llm_soft_fallback' (+ no_llm_fallback_relevance_threshold) that instead keeps a borderline candidate iff it is an official domain OR its keyword-overlap relevance clears the threshold, approximating the LLM-rescue path. Production (always has an LLM client) is unchanged. Addresses #13.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

smodee and others added 8 commits June 3, 2026 14:18

smodee marked this pull request as ready for review June 3, 2026 11:33

smodee requested a review from rapsoj June 3, 2026 11:38

smodee wants to merge 8 commits into main from feat/search-filter-quality
