
fix(constituents): find Wikipedia table by columns, not by position #207

Merged
cipher813 merged 1 commit into main from fix/constituents-wiki-table-selection
May 11, 2026

Conversation

@cipher813
Owner

Summary

  • Replaces tables[0] heuristic in _fetch_constituents with _select_constituents_table, which scans every table from pd.read_html and picks the largest one that has both a Symbol/Ticker column AND a GICS Sector (non-sub-industry) column.
  • Raises explicitly with a clear "Wikipedia layout drift" message if no candidate matches.
  • 5 new unit tests covering: banner-table-at-index-0 skip, no-match raise, largest-candidate tiebreaker, MultiIndex flattening, and end-to-end fetch through the banner.
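
The selection rule described above might look roughly like the following. This is a minimal sketch, not the merged implementation: the function name `select_constituents_table`, the substring-based column matching, and the exact error text are assumptions. It is duck-typed on purpose, so it accepts anything with a `.columns` iterable and a row count via `len()`, which includes the DataFrames returned by `pd.read_html`.

```python
from typing import Any, List, Sequence


def _flatten_columns(table: Any) -> List[str]:
    """Return lowercase column labels, joining MultiIndex tuples into one string."""
    labels = []
    for col in table.columns:
        if isinstance(col, tuple):  # MultiIndex columns iterate as tuples
            label = " ".join(str(part) for part in col)
        else:
            label = str(col)
        labels.append(label.strip().lower())
    return labels


def select_constituents_table(tables: Sequence[Any]) -> Any:
    """Pick the largest table that has a Symbol/Ticker column and a GICS Sector column."""
    candidates = []
    for table in tables:
        names = _flatten_columns(table)
        has_symbol = any("symbol" in n or "ticker" in n for n in names)
        # "GICS Sub-Industry" must not count as the sector column.
        has_sector = any(
            "gics" in n and "sector" in n and "sub" not in n for n in names
        )
        if has_symbol and has_sector:
            candidates.append(table)
    if not candidates:
        raise ValueError(
            "Wikipedia layout drift: no table has both a Symbol/Ticker "
            "column and a GICS Sector column"
        )
    # Tiebreaker: the table with the most rows wins.
    return max(candidates, key=len)
```

A banner table whose columns are `[0, 1]` never matches either predicate, so it is skipped regardless of where it sits in the list.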

Why

S&P 400 Wikipedia page inserted a disambiguation-warning banner table at position 0 on/around 2026-05-11. tables[0] then returned a 1-row, 2-column banner (columns: [0, 1]) instead of the 400-row constituents table at position 1.

The cascade today (2026-05-11 weekday SF):

  1. _fetch_constituents raised "GICS sector column missing" on the banner table
  2. Fell through to local cache (only stores ticker symbols, no sector mapping)
  3. collect() raised "Sector mapping incomplete: 903 of 903 tickers missing GICS sector"
  4. weekly_collector exited 1, but python ... 2>&1 | tee ... masked the exit code (no set -o pipefail)
  5. SF moved on; morning planner aborted with daily_data: 46h stale
  6. Daemon emitted "upstream health failed..no order book written" Telegram

This PR closes (1). Cache hardening + pipefail follow in PRs 2 + 3.
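
Step 4 of the cascade can be reproduced in isolation. The sketch below assumes `bash` is on the PATH and uses `false | cat` as a stand-in for the real `python ... 2>&1 | tee ...` command; the helper name `pipeline_status` is hypothetical.

```python
import subprocess


def pipeline_status(script: str) -> int:
    """Run a bash snippet and return the exit status bash reports for it."""
    return subprocess.run(["bash", "-c", script]).returncode


# Without pipefail, the pipeline's status is that of the last command (cat),
# so the failure of the first command is masked.
masked = pipeline_status("false | cat")

# With pipefail, the pipeline returns the last non-zero status in the pipe.
strict = pipeline_status("set -o pipefail; false | cat")

print(masked, strict)  # prints "0 1"
```

This is why the caller saw success even though the collector had exited 1.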

Test plan

  • pytest tests/test_constituents_sector_map.py — 9/9 pass (4 prior + 5 new)
  • Full suite: pytest tests/ — 716 passed, 1 skipped
  • Live verified against current https://en.wikipedia.org/wiki/List_of_S%26P_400_companies — selector returns the 400-row constituents table, tickers[:3] = ['AA', 'AAL', 'AAON']
  • Post-merge: Saturday SF DataPhase1 constituents step (or manual weekly_collector.py --morning-enrich) succeeds against live Wikipedia

🤖 Generated with Claude Code

S&P 400 Wikipedia page inserted a disambiguation-warning banner table
at position 0 on/around 2026-05-11. `tables[0]` then returned a 1-row
2-column banner (columns: [0, 1]) instead of the 400-row constituents
table at position 1. `_fetch_constituents` raised "GICS sector column
missing", fell through to the local cache (which only stores ticker
symbols, no sector mapping), and `constituents.collect()` then raised
"Sector mapping incomplete: 903 of 903 tickers missing GICS sector".

This silently aborted MorningEnrich on the 2026-05-11 weekday SF:
`weekly_collector` exited 1, but `python ... 2>&1 | tee ...` with no
`set -o pipefail` masked the exit code, SSM reported Success, the SF
moved on, and the morning planner aborted minutes later with
"daily_data: 46h stale".

Replace position-based selection with `_select_constituents_table`,
which scans every table on the page and picks the largest one that has
both a Symbol/Ticker column AND a GICS Sector (non-sub-industry) column.
Raises explicitly if no match — Wikipedia layout drift surfaces loudly
instead of silently selecting the wrong table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit fa69c23 into main May 11, 2026
1 check passed
@cipher813 cipher813 deleted the fix/constituents-wiki-table-selection branch May 11, 2026 13:52
cipher813 added a commit that referenced this pull request May 11, 2026
…ience (#208)

* fix(constituents): find Wikipedia table by columns, not by position

* fix(constituents): persist sector_map in local cache for outage resilience

The local CSV cache previously stored only ticker symbols. When Wikipedia
fetch raised (e.g. 2026-05-11: S&P 400 page schema drift before PR #207),
the fallback returned (cached_tickers, {}, {}, 0, 0) — empty sector maps.
`collect()`'s coverage check then raised "Sector mapping incomplete:
903 of 903 tickers missing GICS sector" and `weekly_collector` exited 1.

Cache now persists three columns: ticker, gics_sector, sector_etf. On a
successful Wikipedia fetch the full mapping is written; on the next
Wikipedia outage the fallback returns a fully-populated sector_map and
the coverage check passes.

Reader (`_load_from_cache`) tolerates the legacy ticker-only schema —
existing EC2 caches return empty sector dicts (same behavior as before),
and the next successful Wikipedia fetch upgrades the cache to the new
schema in place.
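
A minimal sketch of the three-column cache and a reader that tolerates the legacy schema. The function names `write_cache` and `load_cache` are hypothetical, the legacy file is assumed to have a single `ticker` header column, and the return value is simplified to a 3-tuple rather than the 5-tuple the real fallback returns.

```python
import csv
from pathlib import Path
from typing import Dict, List, Tuple

CACHE_COLUMNS = ["ticker", "gics_sector", "sector_etf"]


def write_cache(
    path: str,
    tickers: List[str],
    sector_map: Dict[str, str],
    sector_etf_map: Dict[str, str],
) -> None:
    """Write the full mapping after a successful fetch."""
    with Path(path).open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=CACHE_COLUMNS)
        writer.writeheader()
        for ticker in tickers:
            writer.writerow(
                {
                    "ticker": ticker,
                    "gics_sector": sector_map.get(ticker, ""),
                    "sector_etf": sector_etf_map.get(ticker, ""),
                }
            )


def load_cache(path: str) -> Tuple[List[str], Dict[str, str], Dict[str, str]]:
    """Read the cache; legacy ticker-only files yield empty sector dicts."""
    tickers: List[str] = []
    sector_map: Dict[str, str] = {}
    sector_etf_map: Dict[str, str] = {}
    with Path(path).open(newline="") as fh:
        for row in csv.DictReader(fh):
            ticker = (row.get("ticker") or "").strip()
            if not ticker:
                continue
            tickers.append(ticker)
            # Columns absent in the legacy schema come back as None.
            sector = (row.get("gics_sector") or "").strip()
            etf = (row.get("sector_etf") or "").strip()
            if sector:
                sector_map[ticker] = sector
            if etf:
                sector_etf_map[ticker] = etf
    return tickers, sector_map, sector_etf_map
```

Because the reader only looks columns up by name, a legacy file needs no migration step: the next `write_cache` call simply overwrites it with the new header.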

Also gitignore `data/constituents_cache.csv` — test runs of
`_fetch_constituents` write to the real cache path, so the file
otherwise lands in `git status` as untracked test pollution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 20, 2026
…279)

Phase 3a of the pillar-decomposed attractiveness scoring arc (plan doc
`alpha-engine-docs/private/attractiveness-pillars-260520.md`; ROADMAP P1
= alpha-engine-config #254 merged; lib Phase 1 = alpha-engine-lib #53
v0.22.0; research+config Phase 2 = alpha-engine-research #207 +
alpha-engine-config #255 merged).

Adds 5 new fundamental fields surfaced from the existing Finnhub
`/stock/metric?metric=all` response — no new API integrations, no
extra rate-limit pressure on the existing collector budget. The fields
back the Growth + Stewardship pillar quant composites that get added
to `scoring/factor_scoring.py` in the follow-up Phase 3b PR
(alpha-engine-research):

Growth pillar quant substrate (2):
- `revenue_growth_3y` — 3y revenue CAGR (Finnhub `revenueGrowth3Y`,
  fallback `revenueGrowth5Y`); clipped to [-0.5, 1.5]
- `eps_growth_3y` — 3y EPS CAGR (Finnhub `epsGrowth3Y`, fallbacks
  `epsBasicExclExtraItemsAnnual5Y` / `epsGrowth5Y`); clipped to [-1.0, 2.0]

Stewardship pillar quant substrate (3):
- `payout_ratio` — TTM dividends / NI (Finnhub `payoutRatioTTM` →
  `payoutRatioAnnual`); clipped to [0, 2]
- `dividend_yield` — Indicated annual yield (Finnhub
  `dividendYieldIndicatedAnnual` → `currentDividendYieldTTM`); clipped
  to [0, 0.20]
- `capex_growth_5y` — 5y CAPEX growth (Finnhub
  `capitalSpendingGrowth5Y`); clipped to [-1, 2]

3y CAGR is preferred over TTM YoY for ranking because it's smoother
(base-effect / single-quarter anomalies average out); the 5y/annual
fallbacks apply when the 3y figure is missing, e.g. for newer listings
without a full 3y history.
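
The fallback chains and clip ranges listed above can be sketched as a small table-driven extractor. This is an illustration only: the names `FIELD_SPECS`, `first_present`, and `extract_fields` are hypothetical, the neutral default of 0.0 is a placeholder rather than the collector's actual NEUTRAL values, and the sketch assumes the payload values are already in the same units as the clip bounds (any unit conversion is omitted).

```python
import math
from typing import Dict, Mapping, Optional, Sequence, Tuple

# output field -> (Finnhub keys in preference order, lower bound, upper bound)
FIELD_SPECS: Dict[str, Tuple[Sequence[str], float, float]] = {
    "revenue_growth_3y": (("revenueGrowth3Y", "revenueGrowth5Y"), -0.5, 1.5),
    "eps_growth_3y": (
        ("epsGrowth3Y", "epsBasicExclExtraItemsAnnual5Y", "epsGrowth5Y"),
        -1.0,
        2.0,
    ),
    "payout_ratio": (("payoutRatioTTM", "payoutRatioAnnual"), 0.0, 2.0),
    "dividend_yield": (
        ("dividendYieldIndicatedAnnual", "currentDividendYieldTTM"),
        0.0,
        0.20,
    ),
    "capex_growth_5y": (("capitalSpendingGrowth5Y",), -1.0, 2.0),
}


def first_present(metrics: Mapping[str, object], keys: Sequence[str]) -> Optional[float]:
    """Return the first key's value that is present and not NaN, else None."""
    for key in keys:
        value = metrics.get(key)
        if value is None:
            continue
        number = float(value)
        if not math.isnan(number):
            return number
    return None


def extract_fields(metrics: Mapping[str, object], neutral: float = 0.0) -> Dict[str, float]:
    """Apply each fallback chain, then clip the result to its range."""
    out: Dict[str, float] = {}
    for name, (keys, low, high) in FIELD_SPECS.items():
        raw = first_present(metrics, keys)
        out[name] = neutral if raw is None else max(low, min(high, raw))
    return out
```

Keeping the chains and bounds in one table makes it easy to test each fallback and each clipping edge independently.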

Insider ownership % is NOT in this PR — Finnhub `metric=all` does not
surface it; would require a separate `/stock/insider-transactions`
integration (extra API calls + rate-limit pressure). Deferred to a
follow-up if/when stewardship's composite weight in the backtester
optimization argues for it. The three signals shipped here cover the
"capital allocation discipline" axis: payout (return-of-capital
intensity), dividend yield (combined with payout, identifies
low-yield + low-payout = buyback-heavy retainers), and CAPEX growth
(reinvestment intensity).

Plumbed end-to-end:

- `collectors/fundamentals.py`: NEUTRAL grows to 13 fields (was 8);
  `_fetch_single_ticker` extracts the 5 new Finnhub fields with
  TTM-preferred / annual-fallback chains; values clipped to the ranges
  above.
- `features/registry.py`: 5 new `FeatureEntry` records (group=
  "fundamental"), bringing the fundamental group from 8 → 13.
- `features/feature_engineer.py`: 5 new columns in
  `EXPECTED_FEATURE_COLUMNS` + DataFrame-write site populates from
  `fundamental_data` dict / falls back to 0.0 when fundamental_data
  is None (matches existing pattern).

Behavior change: when the Saturday SF DataPhase1 runs after this PR
merges + deploys, `features/{date}/fundamental.parquet` carries 13
columns instead of 8. Downstream consumers (alpha-engine-research
`scoring/factor_scoring.py` + the Phase 3b composite extension)
treat the new columns as additive — the 4 existing composites
(quality / momentum / value / low_vol) and their _n counts continue
to compute identically. No behavioral coupling with Phase 3b — the
new columns can sit in the parquet for a soak before research's
Phase 3b PR consumes them.

Tests: 12 new tests in `tests/test_fundamentals_finnhub.py` covering
NEUTRAL field completeness, field mapping for typical AAPL payload,
fallback chains for each of the 5 fields, clipping bounds (extreme
growth, extreme yield, payout > 2), empty-payload preservation.

Suite: 1400 → 1412 passing (zero regressions, 7.47s).

Composes with: factor-substrate-260513 (this PR extends its
fundamental.parquet schema); alpha-engine-research Phase 3b
(consumer composites land in a follow-up PR).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>