fix(constituents): find Wikipedia table by columns, not by position #207
Merged
The S&P 400 Wikipedia page inserted a disambiguation-warning banner table at position 0 on/around 2026-05-11. `tables[0]` then returned a 1-row, 2-column banner (columns: [0, 1]) instead of the 400-row constituents table at position 1. `_fetch_constituents` raised "GICS sector column missing", fell through to the local cache (which only stores ticker symbols, no sector mapping), and `constituents.collect()` then raised "Sector mapping incomplete: 903 of 903 tickers missing GICS sector".

This silently aborted MorningEnrich on the 2026-05-11 weekday SF: `weekly_collector` exited 1, but `python ... 2>&1 | tee ...` with no `set -o pipefail` masked the exit code, SSM reported Success, the SF moved on, and the morning planner aborted minutes later with "daily_data: 46h stale".

Replace position-based selection with `_select_constituents_table`, which scans every table on the page and picks the largest one that has both a Symbol/Ticker column AND a GICS Sector (non-sub-industry) column. Raises explicitly if no match, so Wikipedia layout drift surfaces loudly instead of silently selecting the wrong table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
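The column-based selection described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `Table` stand-in, the helper names, and the exact column-matching rules are assumptions made for the example. The selector only relies on `.columns` and `len()`, so the same logic should apply to the DataFrames returned by `pd.read_html`.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Hypothetical stand-in for a parsed HTML table (exposes .columns and len())."""
    columns: list
    rows: list = field(default_factory=list)

    def __len__(self):
        return len(self.rows)


def _is_ticker_col(name):
    # Assumed matching rule: header is exactly a symbol/ticker label.
    return str(name).strip().lower() in {"symbol", "ticker", "ticker symbol"}


def _is_sector_col(name):
    text = str(name).strip().lower()
    # Accept "GICS Sector" but reject "GICS Sub-Industry".
    return "gics" in text and "sector" in text and "sub" not in text


def select_constituents_table(tables):
    """Return the largest table with both a ticker and a GICS sector column."""
    candidates = [
        t for t in tables
        if any(_is_ticker_col(c) for c in t.columns)
        and any(_is_sector_col(c) for c in t.columns)
    ]
    if not candidates:
        # Fail loudly so layout drift is visible instead of picking a wrong table.
        seen = [list(t.columns) for t in tables]
        raise ValueError(f"No constituents table found; column sets seen: {seen}")
    return max(candidates, key=len)


# Illustrative inputs: a banner like the one described, then a constituents table.
banner = Table(columns=[0, 1], rows=[["!", "banner text"]])
constituents = Table(
    columns=["Symbol", "Security", "GICS Sector", "GICS Sub-Industry"],
    rows=[["AA", "...", "Materials", "..."], ["AAL", "...", "Industrials", "..."]],
)
picked = select_constituents_table([banner, constituents])
print(picked.columns[0], len(picked))
```

Choosing the largest matching table, rather than the first, is what keeps the selection stable if another small table with similar headers appears on the page.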
This was referenced May 11, 2026
cipher813 added a commit that referenced this pull request May 11, 2026
…ience (#208)

* fix(constituents): find Wikipedia table by columns, not by position

* fix(constituents): persist sector_map in local cache for outage resilience

The local CSV cache previously stored only ticker symbols. When the Wikipedia fetch raised (e.g. 2026-05-11: S&P 400 page schema drift before PR #207), the fallback returned (cached_tickers, {}, {}, 0, 0), i.e. empty sector maps. `collect()`'s coverage check then raised "Sector mapping incomplete: 903 of 903 tickers missing GICS sector" and `weekly_collector` exited 1.

Cache now persists three columns: ticker, gics_sector, sector_etf. On a successful Wikipedia fetch the full mapping is written; on the next Wikipedia outage the fallback returns a fully-populated sector_map and the coverage check passes.

Reader (`_load_from_cache`) tolerates the legacy ticker-only schema: existing EC2 caches return empty sector dicts (same behavior as before), and the next successful Wikipedia fetch upgrades the cache to the new schema in place.

Also gitignore `data/constituents_cache.csv`: test runs of `_fetch_constituents` write to the real cache path, so the file otherwise lands in `git status` as untracked test pollution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
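A legacy-tolerant cache read along the lines described above might look like the sketch below. It is an assumption-laden illustration: the function name, the three-element return value (the source mentions a five-element tuple), the presence of a `ticker` header row in legacy files, and the sample ETF symbols are all chosen for the example rather than taken from the repository.

```python
import csv
import io


def load_from_cache(fh):
    """Read tickers plus sector maps; a ticker-only file yields empty maps."""
    reader = csv.DictReader(fh)
    tickers, sector_map, etf_map = [], {}, {}
    for row in reader:
        ticker = (row.get("ticker") or "").strip()
        if not ticker:
            continue  # skip blank lines or malformed rows
        tickers.append(ticker)
        # Columns absent from a legacy file come back as None from .get().
        sector = (row.get("gics_sector") or "").strip()
        etf = (row.get("sector_etf") or "").strip()
        if sector:
            sector_map[ticker] = sector
        if etf:
            etf_map[ticker] = etf
    return tickers, sector_map, etf_map


# Assumed legacy layout: a single "ticker" column with a header row.
legacy = io.StringIO("ticker\nAA\nAAL\n")
# New layout with the three columns named in the commit message (values illustrative).
current = io.StringIO(
    "ticker,gics_sector,sector_etf\nAA,Materials,XLB\nAAL,Industrials,XLI\n"
)
legacy_result = load_from_cache(legacy)
current_result = load_from_cache(current)
print(legacy_result)
print(current_result)
```

Reading by column name rather than by position is what lets one reader handle both schemas without a version flag.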
cipher813 added a commit that referenced this pull request May 20, 2026
…279)

Phase 3a of the pillar-decomposed attractiveness scoring arc (plan doc `alpha-engine-docs/private/attractiveness-pillars-260520.md`; ROADMAP P1 = alpha-engine-config #254 merged; lib Phase 1 = alpha-engine-lib #53 v0.22.0; research+config Phase 2 = alpha-engine-research #207 + alpha-engine-config #255 merged).

Adds 5 new fundamental fields surfaced from the existing Finnhub `/stock/metric?metric=all` response: no new API integrations, no extra rate-limit pressure on the existing collector budget. The fields back the Growth + Stewardship pillar quant composites that get added to `scoring/factor_scoring.py` in the follow-up Phase 3b PR (alpha-engine-research):

Growth pillar quant substrate (2):
- `revenue_growth_3y`: 3y revenue CAGR (Finnhub `revenueGrowth3Y`, fallback `revenueGrowth5Y`); clipped to [-0.5, 1.5]
- `eps_growth_3y`: 3y EPS CAGR (Finnhub `epsGrowth3Y`, fallbacks `epsBasicExclExtraItemsAnnual5Y` / `epsGrowth5Y`); clipped to [-1.0, 2.0]

Stewardship pillar quant substrate (3):
- `payout_ratio`: TTM dividends / NI (Finnhub `payoutRatioTTM` → `payoutRatioAnnual`); clipped to [0, 2]
- `dividend_yield`: Indicated annual yield (Finnhub `dividendYieldIndicatedAnnual` → `currentDividendYieldTTM`); clipped to [0, 0.20]
- `capex_growth_5y`: 5y CAPEX growth (Finnhub `capitalSpendingGrowth5Y`); clipped to [-1, 2]

3y CAGR is preferred over TTM YoY for ranking because it's smoother (base-effect / single-quarter anomalies average out); 5y/annual fallbacks for newer listings without a full 3y history.

Insider ownership % is NOT in this PR: Finnhub `metric=all` does not surface it; would require a separate `/stock/insider-transactions` integration (extra API calls + rate-limit pressure). Deferred to a follow-up if/when stewardship's composite weight in the backtester optimization argues for it.

The three signals shipped here cover the "capital allocation discipline" axis: payout (return-of-capital intensity), dividend yield (combined with payout, identifies low-yield + low-payout = buyback-heavy retainers), and CAPEX growth (reinvestment intensity).

Plumbed end-to-end:
- `collectors/fundamentals.py`: NEUTRAL grows to 13 fields (was 8); `_fetch_single_ticker` extracts the 5 new Finnhub fields with TTM-preferred / annual-fallback chains; values clipped to the ranges above.
- `features/registry.py`: 5 new `FeatureEntry` records (`group="fundamental"`), bringing the fundamental group from 8 → 13.
- `features/feature_engineer.py`: 5 new columns in `EXPECTED_FEATURE_COLUMNS` + DataFrame-write site populates from `fundamental_data` dict / falls back to 0.0 when fundamental_data is None (matches existing pattern).

Behavior change: when the Saturday SF DataPhase1 runs after this PR merges + deploys, `features/{date}/fundamental.parquet` carries 13 columns instead of 8. Downstream consumers (alpha-engine-research `scoring/factor_scoring.py` + the Phase 3b composite extension) treat the new columns as additive: the 4 existing composites (quality / momentum / value / low_vol) and their _n counts continue to compute identically. No behavioral coupling with Phase 3b: the new columns can sit in the parquet for a soak before research's Phase 3b PR consumes them.

Tests: 12 new tests in `tests/test_fundamentals_finnhub.py` covering NEUTRAL field completeness, field mapping for typical AAPL payload, fallback chains for each of the 5 fields, clipping bounds (extreme growth, extreme yield, payout > 2), empty-payload preservation. Suite: 1400 → 1412 passing (zero regressions, 7.47s).

Composes with: factor-substrate-260513 (this PR extends its fundamental.parquet schema); alpha-engine-research Phase 3b (consumer composites land in a follow-up PR).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
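The fallback-chain-then-clip pattern described in this commit message could be sketched as below. The metric keys and clip ranges are copied from the message; everything else is an assumption for illustration: the `FIELD_SPECS` table, the helper names, the neutral default of 0.0, and the idea that values arrive already in the same units as the clip bounds (any unit conversion in the real collector is not shown in the source).

```python
# Hypothetical spec table: output field -> (Finnhub keys in priority order, (lo, hi)).
FIELD_SPECS = {
    "revenue_growth_3y": (("revenueGrowth3Y", "revenueGrowth5Y"), (-0.5, 1.5)),
    "eps_growth_3y": (
        ("epsGrowth3Y", "epsBasicExclExtraItemsAnnual5Y", "epsGrowth5Y"),
        (-1.0, 2.0),
    ),
    "payout_ratio": (("payoutRatioTTM", "payoutRatioAnnual"), (0.0, 2.0)),
    "dividend_yield": (
        ("dividendYieldIndicatedAnnual", "currentDividendYieldTTM"),
        (0.0, 0.20),
    ),
    "capex_growth_5y": (("capitalSpendingGrowth5Y",), (-1.0, 2.0)),
}


def first_present(metric, keys):
    """Return the first key's value that is a real, non-NaN number, else None."""
    for key in keys:
        value = metric.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and value == value:  # value != value only for NaN
            return float(value)
    return None


def extract_fields(metric, neutral=0.0):
    """Resolve each field through its fallback chain, then clip to its range."""
    out = {}
    for name, (keys, (lo, hi)) in FIELD_SPECS.items():
        value = first_present(metric, keys)
        out[name] = neutral if value is None else min(max(value, lo), hi)
    return out


# Made-up payload exercising a fallback, a null primary key, clipping, and a miss.
sample = {
    "revenueGrowth5Y": 0.12,
    "epsGrowth3Y": 5.0,
    "payoutRatioTTM": None,
    "payoutRatioAnnual": 0.4,
    "dividendYieldIndicatedAnnual": 0.5,
}
fields = extract_fields(sample)
print(fields)
```

Keeping the chains and bounds in one table makes it straightforward to test each field's fallback order and clipping independently.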
Summary
- Replace the `tables[0]` heuristic in `_fetch_constituents` with `_select_constituents_table`, which scans every table from `pd.read_html` and picks the largest one that has both a Symbol/Ticker column AND a GICS Sector (non-sub-industry) column.

Why
- The S&P 400 Wikipedia page inserted a disambiguation-warning banner table at position 0 on/around 2026-05-11.
- `tables[0]` then returned a 1-row, 2-column banner (columns: `[0, 1]`) instead of the 400-row constituents table at position 1.

The cascade today (2026-05-11 weekday SF):
1. `_fetch_constituents` raised "GICS sector column missing" on the banner table
2. `collect()` raised "Sector mapping incomplete: 903 of 903 tickers missing GICS sector"
3. `weekly_collector` exited 1, but `python ... 2>&1 | tee ...` masked the exit code (no `set -o pipefail`)
4. SSM reported Success, the SF moved on, and the morning planner aborted minutes later with `daily_data: 46h stale`

This PR closes (1). Cache hardening + `pipefail` follow in PRs 2 + 3.

Test plan
- `pytest tests/test_constituents_sector_map.py`: 9/9 pass (4 prior + 5 new)
- `pytest tests/`: 716 passed, 1 skipped
- https://en.wikipedia.org/wiki/List_of_S%26P_400_companies (selector returns the 400-row constituents table, `tickers[:3] = ['AA', 'AAL', 'AAON']`)
- `weekly_collector.py --morning-enrich` succeeds against live Wikipedia

🤖 Generated with Claude Code
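The exit-code masking in the cascade above can be reproduced in isolation. The sketch below is illustrative only: it assumes `bash` and `tee` are available on the machine, and it uses `(exit 1)` as a stand-in for the failing collector process rather than the real command line.

```python
import shutil
import subprocess


def pipeline_exit_code(use_pipefail):
    """Run a failing command piped into tee and return the shell's exit code."""
    prefix = "set -o pipefail; " if use_pipefail else ""
    # `(exit 1)` stands in for the failing `python ...` collector invocation.
    script = prefix + "(exit 1) 2>&1 | tee /dev/null"
    return subprocess.run(["bash", "-c", script], check=False).returncode


if shutil.which("bash"):
    masked = pipeline_exit_code(use_pipefail=False)    # status of tee, the last command
    surfaced = pipeline_exit_code(use_pipefail=True)   # status of the failing command
    print(masked, surfaced)
```

Without `pipefail`, a pipeline reports the status of its last command, so a successful `tee` hides the failure upstream; with `pipefail`, the failing command's non-zero status becomes the pipeline's status.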