Search/filter dashboard chokepoint + relevance ranking #30
Open
smodee wants to merge 8 commits into
Dashboard URLs from the curated registry in
bioscancast/datasets/biosecurity_sources.py have hand-picked titles
("Dashboard: cdc.gov") and generic snippets that produce
keyword_overlap_score = 0.000 against any real forecast question. The
heuristic priority score drags them under the 0.72 keep threshold even
though they are by construction high-value sources.
Live-run evidence: q7 and q12 each injected two dashboards. All four
had keyword_overlap = 0.000. Two of those four were dropped pre-LLM,
including ourworldindata.org for q7 - which is the resolution source
named in the question's relevant_links column.
Fix: in heuristic_filter, detect retrieval_reason == "dashboard_lookup"
and auto-keep with reason_code "dashboard_lookup_bypass" and a synthetic
priority_score of 1.0. The dashboards still go through the rest of the
filtering pipeline (dedup, per-domain cap, extraction-hint assignment)
unchanged - this is the keyword-overlap chokepoint only.
Implements item 1 from the Tier 1 roadmap. Pairs with the dashboard
title/snippet enrichment in the next commit.
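A minimal sketch of the bypass described in this commit, under stated assumptions. The Candidate dataclass, the heuristic_decision helper, the "web_search" retrieval reason, and the example URLs are hypothetical stand-ins; only retrieval_reason == "dashboard_lookup", the "dashboard_lookup_bypass" reason code, the synthetic score of 1.0, and the 0.72 threshold come from the message above.

```python
from dataclasses import dataclass, field
from typing import List

KEEP_THRESHOLD = 0.72  # keep threshold at the time of this commit


@dataclass
class Candidate:
    # Hypothetical, simplified stand-in for the real search-result type.
    url: str
    retrieval_reason: str
    priority_score: float
    selection_reasons: List[str] = field(default_factory=list)


def heuristic_decision(doc: Candidate) -> bool:
    """Return True if the heuristic stage auto-keeps the document."""
    if doc.retrieval_reason == "dashboard_lookup":
        # Curated dashboards skip the keyword-overlap chokepoint entirely.
        doc.priority_score = 1.0
        doc.selection_reasons.append("dashboard_lookup_bypass")
        return True
    return doc.priority_score >= KEEP_THRESHOLD


# Two candidates with the same low heuristic score (placeholder URLs).
dashboard = Candidate("https://example.org/dashboard", "dashboard_lookup", 0.31)
organic = Candidate("https://example.org/news", "web_search", 0.31)
print(heuristic_decision(dashboard), dashboard.selection_reasons)
print(heuristic_decision(organic), organic.selection_reasons)
```

The sketch covers only the keep decision; dedup, the per-domain cap, and extraction-hint assignment would still run afterwards, as the message states.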
The previous dashboard injection used generic strings ("Dashboard: cdc.gov",
"Known mpox monitoring dashboard") that produced keyword_overlap_score
= 0.000 against every real forecast question - 4/4 injected dashboards
in the q7/q12 live runs had this exact failure mode.
The fix: turn DASHBOARD_LOOKUP into a list of DashboardEntry dataclasses
carrying url + title + snippet, with hand-written pathogen-specific text
for each entry. The titles read as real search-result titles ("CDC H5N1
bird flu situation summary: human cases and outbreaks in the United
States") and the snippets describe what data the page hosts.
Pairs with the previous commit's dashboard heuristic bypass: even with
the bypass in place, better titles still help (a) the keyword-overlap
score for downstream scoring, and (b) the LLM rescue path when it
encounters other pathogen-specific dashboards we add later. The bypass
keeps low-keyword-overlap dashboards alive; this commit makes them
discoverable on their own merits.
Implements item 5 from the Tier 1/2 roadmap.
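To illustrate why the enriched titles matter, here is a rough sketch of a DashboardEntry and a toy overlap measure. The field names url/title/snippet and the quoted H5N1 title come from this commit message; the dict-of-lists shape of DASHBOARD_LOOKUP, the placeholder URL and snippet, the query terms, and toy_overlap (which is not the project's keyword_overlap_score) are assumptions for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class DashboardEntry:
    url: str
    title: str
    snippet: str


# Assumed shape: pathogen key -> curated entries. URL and snippet are placeholders.
DASHBOARD_LOOKUP: Dict[str, List[DashboardEntry]] = {
    "h5n1": [
        DashboardEntry(
            url="https://example.org/h5n1-situation-summary",
            title=("CDC H5N1 bird flu situation summary: human cases and "
                   "outbreaks in the United States"),
            snippet="Placeholder text describing the data this page hosts.",
        )
    ],
}


def toy_overlap(query_terms: Set[str], text: str) -> float:
    """Fraction of query terms that appear as whole tokens in the text."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(query_terms & tokens) / len(query_terms) if query_terms else 0.0


terms = {"h5n1", "human", "cases", "united", "states"}  # illustrative query terms
entry = DASHBOARD_LOOKUP["h5n1"][0]
print(toy_overlap(terms, "Dashboard: cdc.gov"))  # generic title: no shared tokens
print(toy_overlap(terms, entry.title))           # enriched title: every term matches
```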
Live runs on q7 and q12 showed filter survival of 4.7% and 13.5%
respectively, even with LLM rescue enabled. The 0.72 keep threshold was
set without benchmarking against real Tavily output and is too tight
for the heuristic's actual signal, so this commit lowers it to 0.65.
With the new threshold, priority_scores in the 0.65-0.72 band are
auto-kept by heuristics instead of routed to the LLM rescue path. The
borderline threshold (0.45) is unchanged, so the LLM filter still gates
0.45-0.65 candidates - the change just moves the auto-keep line to
better match what the heuristic can actually distinguish.
Implements item 2 from the Tier 1 roadmap. Pairs with the dashboard
bypass + enrichment commits to attack the filter chokepoint from
multiple angles.
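The banding this commit describes can be sketched as a three-way router. The 0.65 and 0.45 values come from the PR; the function name, the outcome labels, the inclusive boundaries, and the assumption that scores below 0.45 are dropped outright are illustrative guesses rather than the actual implementation.

```python
HEURISTIC_KEEP_THRESHOLD = 0.65  # lowered from 0.72 in this commit
BORDERLINE_THRESHOLD = 0.45      # unchanged


def route(priority_score: float) -> str:
    """Classify a candidate by priority_score (hypothetical labels)."""
    if priority_score >= HEURISTIC_KEEP_THRESHOLD:
        return "auto_keep"   # kept by heuristics alone
    if priority_score >= BORDERLINE_THRESHOLD:
        return "llm_rescue"  # gated by the LLM filter
    return "reject"          # assumed: below the borderline band is dropped


for score in (0.70, 0.66, 0.50, 0.30):
    print(score, route(score))
```

A 0.70 candidate would previously have gone to the LLM rescue path under the 0.72 threshold; here it is auto-kept.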
q7's second live run on this branch surfaced an interaction between the
new dashboard heuristic bypass and the per-domain cap. With
max_docs_per_domain=2 and the dashboard bypass injecting one who.int
slot at synthetic priority 1.0, the cap was effectively reducing
who.int to ONE organic slot - and the slot was going to a
priority-0.7097 strategic-plan announcement page, squeezing out the
priority-0.6966 WHO mpox research event page that the baseline run had
extracted records from.
Offline filter replay on the saved q7 search.json confirms the
mechanism:
Heuristic-keep (4 who.int / ourworldindata.org docs):
1.0000 WHO sitreps dashboard (bypass)
1.0000 OWID mpox dashboard (bypass)
0.7097 WHO global strategic preparedness plan (organic)
0.6966 WHO mpox research event (organic) <- baseline's data source
After old cap_per_domain (max=2 per domain):
Dashboards displace one organic each; research event capped out.
The fix: dashboard-bypass docs (selection_reasons contains
"dashboard_lookup_bypass") are always kept and do not consume a slot
against the per-domain or per-type caps. They are curated additions,
not competing organic results.
After the change all four candidates survive, and the WHO research
event page reaches insight as it did in the baseline.
447 tests still passing.
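A simplified sketch of the cap exemption, replaying the three who.int scores from the q7 listing above. The dict-based document shape, the exempt_bypass switch (used here only to contrast old and new behavior), and the placeholder URL paths are assumptions; the per-type cap is omitted.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cap_per_domain(docs, max_docs_per_domain=2, exempt_bypass=True):
    """Keep the highest-priority docs per domain; bypass docs are free if exempt."""
    kept, used = [], defaultdict(int)
    for doc in sorted(docs, key=lambda d: d["priority_score"], reverse=True):
        is_bypass = "dashboard_lookup_bypass" in doc["selection_reasons"]
        if exempt_bypass and is_bypass:
            kept.append(doc)  # curated addition: always kept, no slot consumed
            continue
        domain = urlparse(doc["url"]).netloc
        if used[domain] < max_docs_per_domain:
            used[domain] += 1
            kept.append(doc)
    return kept


docs = [  # scores from the q7 replay; URL paths are placeholders
    {"url": "https://www.who.int/dashboard", "priority_score": 1.0,
     "selection_reasons": ["dashboard_lookup_bypass"]},
    {"url": "https://www.who.int/strategic-plan", "priority_score": 0.7097,
     "selection_reasons": []},
    {"url": "https://www.who.int/research-event", "priority_score": 0.6966,
     "selection_reasons": []},
]
print(len(cap_per_domain(docs, exempt_bypass=False)))  # old behavior
print(len(cap_per_domain(docs, exempt_bypass=True)))   # with the exemption
```

Without the exemption the dashboard takes one of the two who.int slots and the 0.6966 research event is capped out; with it, both organic pages fit.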
search_stage_score was 0.5*domain + 0.3*freshness + 0.2*rank with no
topical-relevance signal, so high-authority but off-topic results
ranked at the top (e.g. sports/legal/unrelated-pathogen news). It is
now 0.45*relevance + 0.30*domain + 0.10*freshness + 0.15*rank, reusing
the filter's keyword_overlap_score/build_query_terms. Freshness is
weighted low because it is near-uniform in live mode. Addresses #4.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The CDC mpox dashboard URL returned 404; replaced with the (extractable)
monkeypox/situation-summary page. Updated two stale redirects
(afro.who.int ebola-disease, cdc.gov/ebola/about). DASHBOARD_LOOKUP
routing was an exact lowercase key match, so 'marburg virus disease'
failed to route to the 'marburg' key; added _resolve_pathogen_key with
alias + substring matching (marburg virus disease -> marburg,
monkeypox -> mpox, bird flu -> h5n1). Addresses #3.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
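One way alias + substring routing could look, as a sketch only: the real _resolve_pathogen_key is not shown in this PR page, so the function below uses a different name, and the key set and matching order (exact key, exact alias, alias substring, key substring) are assumptions. The three alias pairs are the ones listed in the commit message.

```python
from typing import Optional

DASHBOARD_KEYS = ("h5n1", "mpox", "marburg", "ebola")  # illustrative subset
ALIASES = {
    "marburg virus disease": "marburg",
    "monkeypox": "mpox",
    "bird flu": "h5n1",
}


def resolve_pathogen_key(raw: str) -> Optional[str]:
    """Map a free-text pathogen name to a lookup key, or None if unmatched."""
    name = raw.strip().lower()
    if name in DASHBOARD_KEYS:
        return name
    if name in ALIASES:
        return ALIASES[name]
    # Substring fallback: an alias or key embedded in a longer name.
    for alias, key in ALIASES.items():
        if alias in name:
            return key
    for key in DASHBOARD_KEYS:
        if key in name:
            return key
    return None


for raw in ("Marburg virus disease", "monkeypox", "Bird flu (H5N1)", "hantavirus"):
    print(raw, "->", resolve_pathogen_key(raw))
```

Substring matching is permissive by design; a very short key could match inside an unrelated name, so a real implementation may want a minimum key length or word-boundary checks.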
Reputable outbreak reporting from outlets like CNN, NBC, CBS, ABC, NPR,
USA Today, LA Times, Politico, Axios, Forbes, Bloomberg, FT, WSJ,
Economist, Time, The Atlantic, Ars Technica and Business Insider was
resolving to the 'unknown' tier (domain_score 0.2), sinking it below
the filter's credibility floor. Promote them to Tier 3 (trusted_media,
0.6); second-level-domain matching covers subdomains. Relates to #13.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
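A rough sketch of how second-level-domain matching lets one entry cover all of an outlet's subdomains. The tier names and the 0.6 / 0.2 scores come from the commit message; the domain spellings, the helper names, and the naive "last two labels" rule are assumptions and not the contents of source_tiers.py.

```python
from urllib.parse import urlparse

# Illustrative subset; the exact domains in the real tier list are not shown here.
TRUSTED_MEDIA = {"cnn.com", "npr.org", "cbsnews.com"}
TIER_SCORES = {"trusted_media": 0.6, "unknown": 0.2}


def second_level_domain(url: str) -> str:
    """Naive SLD: the last two labels of the hostname."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def domain_score(url: str) -> float:
    tier = "trusted_media" if second_level_domain(url) in TRUSTED_MEDIA else "unknown"
    return TIER_SCORES[tier]


print(domain_score("https://edition.cnn.com/health/article"))  # subdomain still matches
print(domain_score("https://example.org/post"))
```

The two-label rule breaks on multi-part suffixes such as co.uk; handling those correctly would need the Public Suffix List.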
When llm_client is None the ambiguous rerank band was always rejected
(fail-closed), which is overly aggressive for dev/offline/no-API-key
runs. Add a default-off FILTER_CONFIG flag 'no_llm_soft_fallback'
(+ no_llm_fallback_relevance_threshold) that instead keeps a borderline
candidate iff it is an official domain OR its keyword-overlap relevance
clears the threshold, approximating the LLM-rescue path. Production
(always has an LLM client) is unchanged. Addresses #13.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced Jun 3, 2026
Search/filter dashboard chokepoint + relevance ranking
Summary
A bundle of search-stage and filter-stage improvements driven by a systematic
review of 10 live forecasting questions (H5N1, the 2026 DRC+Uganda Ebola
outbreak, the Andes-virus cruise hantavirus cluster, mpox, Marburg) spanning
range, binary, and categorical question types, plus a deterministic offline
sweep of filter thresholds and search-stage weights against hand-labeled live
pools. The findings file (data/investigations/findings-issues-3-4-13.md)
carries the full method and numbers; most of the relevant tables are surfaced
below. Total API spend across all the live runs that produced this branch:
~$0.05.
Addresses issues #3, #4, #13, #14.
What's included
Filter chokepoint (#13, #14)
- heuristic_filter now auto-keeps curated dashboards
  (retrieval_reason == "dashboard_lookup"); they were getting
  keyword_overlap_score == 0.000 and being dropped despite being curated
  authoritative sources.
- Dashboard titles and snippets enriched in biosecurity_sources.py (was
  "Dashboard: cdc.gov" / "Known mpox monitoring dashboard"; now real readable
  titles + snippets per entry).
- heuristic_keep_threshold lowered 0.72 → 0.65. Note (from the sweep): the
  threshold sits on a flat plateau 0.60–0.775; see the #13 finding
  ("Heuristic filtering scores too low for real-world search results").
- Dashboard-bypass docs are exempt from cap_per_domain_and_type, so a curated
  dashboard doesn't displace an organic result on the same domain.
Search-stage relevance ranking (#4)
search_stage_score was 0.5·domain + 0.3·freshness + 0.2·(1/rank), with no
topical-relevance term, so high-authority but off-topic results ranked at
the top and consumed total_cap slots. It is now
0.45·relevance + 0.30·domain + 0.10·freshness + 0.15·(1/rank),
where relevance reuses the filter's existing
keyword_overlap_score/build_query_terms. Freshness is weighted low because
it is near-uniform in live mode. Weights sum to 1.0; the score drives
ranking + truncation only.
Sweep numbers (micro-averaged precision@P / MAP, scored vs labels):
Reordering domain/freshness/rank leaves precision@P flat at 0.694; adding a
relevance term is the only thing that moves the needle.
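A small worked comparison of the old and new search-stage scores, to show how the relevance term flips an ordering. The weights are the ones stated in this PR (0.45/0.30/0.10/0.15 for the new score); the assumption that the rank term keeps its 1/rank form, the function names, and all input values are illustrative and are not taken from the sweep.

```python
def old_search_stage_score(domain: float, freshness: float, rank: int) -> float:
    # No topical-relevance term.
    return 0.5 * domain + 0.3 * freshness + 0.2 * (1.0 / rank)


def new_search_stage_score(relevance: float, domain: float,
                           freshness: float, rank: int) -> float:
    # Weights sum to 1.0; rank term assumed to stay 1/rank.
    return 0.45 * relevance + 0.30 * domain + 0.10 * freshness + 0.15 * (1.0 / rank)


# Made-up inputs: an off-topic page on a top-authority domain at rank 1,
# and an on-topic page on a mid-tier domain at rank 3.
off_topic = dict(relevance=0.0, domain=1.0, freshness=0.9, rank=1)
on_topic = dict(relevance=0.8, domain=0.6, freshness=0.9, rank=3)

for name, doc in (("off_topic", off_topic), ("on_topic", on_topic)):
    old = old_search_stage_score(doc["domain"], doc["freshness"], doc["rank"])
    new = new_search_stage_score(**doc)
    print(f"{name}: old={old:.3f} new={new:.3f}")
```

Under the old formula the off-topic page outranks the on-topic one; under the new formula the order reverses, which matches the qualitative claim that only the relevance term changes the ranking.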
Dashboard sources (#3)
- Fixed the CDC mpox dashboard URL (cdc.gov/mpox/data-research → 404; now
  the extractable monkeypox/situation-summary page; re-running mpox after
  the URL fix went 0 → 2 insight records) and two stale redirects
  (afro.who.int ebola-disease, cdc.gov/ebola/about).
- DASHBOARD_LOOKUP routing was an exact lowercase-key match, so the
  CSV-natural "marburg virus disease" failed to route to the marburg key
  (→ zero on-topic results). Added _resolve_pathogen_key with alias +
  substring matching (marburg virus disease → marburg, monkeypox → mpox,
  bird flu → h5n1).
Source tiers (#13)
Promoted reputable outlets (CNN, NBC, CBS, ABC, NPR, USA Today, LA Times,
Politico, Axios, Forbes, Bloomberg, FT, WSJ, Economist, Time, The Atlantic,
Ars Technica, Business Insider, …) from unknown (0.2) to Tier 3
trusted_media (0.6). Legitimate outbreak reporting from these was being
floored below the filter's credibility threshold.
No-LLM filter fallback (#13)
Added a default-off flag FILTER_CONFIG["no_llm_soft_fallback"]
(+ no_llm_fallback_relevance_threshold). When llm_client is None, the
ambiguous rerank band was always rejected (fail-closed), which is too aggressive
for dev/offline/no-API-key runs. With the flag on, a borderline candidate
is kept iff it is an official domain OR clears the relevance threshold,
approximating the LLM-rescue path. Production (always has an LLM client)
is unchanged.
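The no-LLM decision rule can be sketched as follows. The two config key names and the default-off behavior come from this PR; the threshold value 0.3, the function name, and the use of >= for "clears the threshold" are placeholders chosen for illustration.

```python
FILTER_CONFIG = {
    "no_llm_soft_fallback": False,               # default-off, per the PR
    "no_llm_fallback_relevance_threshold": 0.3,  # placeholder value, not from the PR
}


def keep_borderline_without_llm(is_official_domain: bool,
                                relevance: float,
                                config: dict = FILTER_CONFIG) -> bool:
    """Decide a borderline candidate when no LLM client is available."""
    if not config["no_llm_soft_fallback"]:
        return False  # fail-closed: the pre-existing behavior
    threshold = config["no_llm_fallback_relevance_threshold"]
    return is_official_domain or relevance >= threshold


soft = dict(FILTER_CONFIG, no_llm_soft_fallback=True)
print(keep_borderline_without_llm(True, 0.0))         # flag off: rejected
print(keep_borderline_without_llm(True, 0.0, soft))   # official domain: kept
print(keep_borderline_without_llm(False, 0.5, soft))  # relevant enough: kept
print(keep_borderline_without_llm(False, 0.1, soft))  # neither: rejected
```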
Known interaction to flag for reviewers
Promoting outlets to Tier 3 raises recall of legitimate reporting, but
because the filter's priority_score still weights credibility heavily
(and the new relevance term lives in search ranking, not the filter's keep
decision), it can also admit off-topic pieces from those same outlets (an
H5N1 run kept a CBS transcript). A sensible follow-up is to raise the
filter's keyword-overlap weight or lower its 0.25·credibility blend; that
is not in this PR.
Issues this PR addresses
- [...] (titles) and as a backstop (bypass + cap exemption).
- [...] term and retuned weights; the follow-up review showed the missing
  relevance signal, not the weight split, was the issue. Reviewers can
  likely close.
- [...] value (bimodal extractability finding), fixed the broken/stale URLs,
  and made routing tolerant. One follow-up remains: deciding whether
  non-extractable interactive dashboards should consume survival slots (and
  trimming/expanding the list accordingly).
- [...] bypass + cap exemption + tier coverage + opt-in no-LLM soft
  fallback). The review also showed the 0.65 threshold is on a flat plateau
  (not the lever) and that the filter's credibility-vs-relevance balance is
  the remaining knob; see the follow-up section. Reviewers decide whether
  the original bug is resolved.
Verification
- python -m pytest bioscancast/tests/: 452 passed, 2 skipped (live). New
  tests cover the relevance scoring formula, tolerant dashboard routing,
  Tier 3 outlet coverage, and the no-LLM soft-fallback flag.
- [...] artifacts inspected for filter survival, records, and cost.
- [...] data/investigations/findings-issues-3-4-13.md. Re-running mpox after
  the dashboard URL fix went from 0 → 2 insight records.
Reviewer checklist
- [...] cap_per_domain_and_type (curated dashboards never consume a domain
  slot).
- [...] source_tiers.py and the credibility-vs-relevance interaction noted
  above.
- [...] flat plateau (a sweep was done; it's not the lever).