fix(constituents): persist sector_map in local cache for outage resilience #208
Merged
The S&P 400 Wikipedia page inserted a disambiguation-warning banner table at position 0 on or around 2026-05-11. `tables[0]` then returned a 1-row, 2-column banner (columns: [0, 1]) instead of the 400-row constituents table, which had moved to position 1. `_fetch_constituents` raised "GICS sector column missing" and fell through to the local cache (which stores only ticker symbols, no sector mapping); `constituents.collect()` then raised "Sector mapping incomplete: 903 of 903 tickers missing GICS sector".

This silently aborted MorningEnrich on the 2026-05-11 weekday SF: `weekly_collector` exited 1, but `python ... 2>&1 | tee ...` with no `set -o pipefail` masked the exit code, SSM reported Success, the SF moved on, and the morning planner aborted minutes later with "daily_data: 46h stale".

Replace position-based selection with `_select_constituents_table`, which scans every table on the page and picks the largest one that has both a Symbol/Ticker column AND a GICS Sector (non-sub-industry) column. It raises explicitly if no table matches, so Wikipedia layout drift surfaces loudly instead of silently selecting the wrong table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
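The selection rule described in this commit message could look roughly like the sketch below. It is not the code from #207: the real helper presumably operates on parsed DataFrames, whereas here each table is modeled as a plain `(headers, rows)` tuple so the example runs on the standard library alone. The names `select_constituents_table` and `_find_column`, the substring-matching rule, and the sample rows are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical stand-in for a parsed HTML table: (headers, rows).
Table = Tuple[List[object], List[List[str]]]


def _find_column(headers: Sequence[object], *needles: str, exclude: str = "") -> Optional[int]:
    """Return the index of the first header containing any needle (case-insensitive)."""
    for idx, header in enumerate(headers):
        name = str(header).lower()  # banner tables may have integer headers
        if exclude and exclude in name:
            continue
        if any(needle in name for needle in needles):
            return idx
    return None


def select_constituents_table(tables: Sequence[Table]) -> Table:
    """Pick the largest table with a Symbol/Ticker column and a GICS Sector column."""
    candidates = []
    for table in tables:
        headers, _rows = table
        has_symbol = _find_column(headers, "symbol", "ticker") is not None
        # "GICS Sub-Industry" also mentions GICS, so it is excluded explicitly.
        has_sector = _find_column(headers, "gics", exclude="sub-industry") is not None
        if has_symbol and has_sector:
            candidates.append(table)
    if not candidates:
        # Fail loudly on layout drift instead of returning the wrong table.
        raise ValueError("No table with both a Symbol/Ticker and a GICS Sector column")
    return max(candidates, key=lambda table: len(table[1]))


# Illustrative data: a banner at position 0, the real table at position 1.
banner: Table = ([0, 1], [["notice icon", "disambiguation notice"]])
constituents: Table = (
    ["Symbol", "Security", "GICS Sector", "GICS Sub-Industry"],
    [["AAA", "Example A", "Materials", "Aluminum"],
     ["BBB", "Example B", "Industrials", "Construction & Engineering"]],
)
selected = select_constituents_table([banner, constituents])
```

With this rule, a banner inserted ahead of the constituents table no longer matters, because position is never consulted; a page with no matching table raises instead of returning a banner.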
…ience

The local CSV cache previously stored only ticker symbols. When the Wikipedia fetch raised (e.g. 2026-05-11: S&P 400 page schema drift before PR #207), the fallback returned `(cached_tickers, {}, {}, 0, 0)`, that is, empty sector maps. `collect()`'s coverage check then raised "Sector mapping incomplete: 903 of 903 tickers missing GICS sector" and `weekly_collector` exited 1.

The cache now persists three columns: ticker, gics_sector, sector_etf. On a successful Wikipedia fetch the full mapping is written; on the next Wikipedia outage the fallback returns a fully-populated sector_map and the coverage check passes.

The reader (`_load_from_cache`) tolerates the legacy ticker-only schema: existing EC2 caches return empty sector dicts (same behavior as before), and the next successful Wikipedia fetch upgrades the cache to the new schema in place.

Also gitignore `data/constituents_cache.csv`: test runs of `_fetch_constituents` write to the real cache path, so the file otherwise lands in `git status` as untracked test pollution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
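A minimal sketch of the three-column cache round trip and the legacy-schema tolerance described above, using only the standard `csv` module. The function names `save_cache` and `load_cache`, the return shape, and the assumption that the legacy file has a `ticker` header row are illustrative; the actual `_load_from_cache` may differ.

```python
import csv
from pathlib import Path
from typing import Dict, List, Tuple

CACHE_COLUMNS = ["ticker", "gics_sector", "sector_etf"]


def save_cache(path: Path, sector_map: Dict[str, str], sector_etf_map: Dict[str, str]) -> None:
    """Write one row per ticker with its GICS sector and sector ETF."""
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=CACHE_COLUMNS)
        writer.writeheader()
        for ticker in sorted(sector_map):
            writer.writerow({
                "ticker": ticker,
                "gics_sector": sector_map[ticker],
                "sector_etf": sector_etf_map.get(ticker, ""),
            })


def load_cache(path: Path) -> Tuple[List[str], Dict[str, str], Dict[str, str]]:
    """Read the cache; a legacy ticker-only file yields empty sector dicts."""
    tickers: List[str] = []
    sector_map: Dict[str, str] = {}
    sector_etf_map: Dict[str, str] = {}
    with path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            ticker = (row.get("ticker") or "").strip()
            if not ticker:
                continue
            tickers.append(ticker)
            # Columns missing from a legacy file come back as None.
            sector = (row.get("gics_sector") or "").strip()
            etf = (row.get("sector_etf") or "").strip()
            if sector:
                sector_map[ticker] = sector
            if etf:
                sector_etf_map[ticker] = etf
    return tickers, sector_map, sector_etf_map
```

Reading by column name rather than by position is what lets one reader serve both schemas: the legacy file simply has no `gics_sector` or `sector_etf` values to return, and the next `save_cache` call rewrites it with all three columns.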
Summary
- The local CSV cache (`data/constituents_cache.csv`) now persists three columns: `ticker`, `gics_sector`, `sector_etf` (previously: `ticker` only).
- On a Wikipedia outage, `_load_from_cache` reconstructs a fully-populated `sector_map` + `sector_etf_map`, so `collect()`'s coverage check passes instead of raising "Sector mapping incomplete".
- Gitignore `data/constituents_cache.csv`: test runs write to the real cache path, so the file otherwise shows up as untracked test pollution.

Why
Defense-in-depth for the 2026-05-11 cascade. Stacked on top of #207 (which fixes the underlying Wikipedia table-selection bug); together they ensure:
- `collect()` short-circuits with status=error before any S3 write ✓ (this PR)

Stacked on #207
This PR's diff is on top of #207. If #207 merges first, this rebases cleanly.
Test plan
- `pytest tests/test_constituents_sector_map.py`: 13/13 pass (4 new cache tests)
- `pytest tests/`: 720 passed, 1 skipped
- Legacy ticker-only cache schema is covered (`test_cache_fallback_handles_legacy_ticker_only_schema`)

🤖 Generated with Claude Code