feat(data): per-collector value-range validation (ROADMAP L1243 / extends #215) by cipher813 · Pull Request #254 · cipher813/alpha-engine-data

cipher813 · 2026-05-18T16:33:41Z

Summary

Closes the residual scope of ROADMAP L1243 — "Per-collector value-range validation" (P1, promoted from P2 2026-05-16). L940 / #215 shipped the write-time quality gate on the OHLCV builders/daily_append.py path. This PR extends equivalent write-time value-range validation to the two feature collectors that bypass daily_append entirely and write feature-source rows straight to S3 (consumed by the predictor feature store + research scoring) — the exact FMP-zero'd-fundamentals class that already burned ~2 weeks of alpha.

Scope

`validators/price_validator.py`

Two intra-bar checks the ROADMAP residual called out as not-yet-covered by validate_today_row:
- intrabar_inconsistent — Close outside [Low, High] even when High >= Low (a split-day reporting lag / corp-action artifact that bad_ohlc's High-vs-Low check misses). Default block. Skipped when High < Low so it doesn't double-count the same corrupt bar bad_ohlc already owns.
- negative_volume — the existing zero_volume check only caught Volume == 0; a negative volume is unambiguously corrupt. Default block.
- Both added to DEFAULT_BLOCK_ANOMALY_TYPES.
New validate_feature_record(record, field_specs, ticker) — generic value-range gate for non-OHLCV records, same structured-anomaly + per-type-severity contract as validate_today_row so callers reuse identical block-set / metric wiring. Caller passes a per-field spec (nonneg / lo / hi, all optional). None values skipped (coverage is the ok_ratio gate's job).

Checks added (field → severity)

Surface	Check	Anomaly type	Default severity
`validate_today_row` (OHLCV)	`Close` outside `[Low, High]`	`intrabar_inconsistent`	block
`validate_today_row` (OHLCV)	`Volume < 0`	`negative_volume`	block
feature record (any spec'd field)	NaN / inf	`nan_or_inf`	block
feature record (field declared `nonneg`)	value `< 0`	`negative_where_nonneg`	block
feature record (field with `lo`/`hi`)	value outside band	`gross_outlier`	warn

`collectors/fundamentals.py`

Validates each fully-shaped (post-_clip) per-ticker dict before the S3 snapshot. The load-bearing residual _clip can't catch is NaN/inf (clip of NaN propagates NaN) + structural non-negativity of gross_margin/current_ratio; lo/hi mirror the clip bands so a gross outlier surfaces if a future refactor drops a _clip. Block → replace with NEUTRAL + logger.error (counted as an error so the existing ok_ratio gate accounts for it). Env-tunable block set via FUNDAMENTALS_BLOCK_ANOMALY_TYPES. CW metric AlphaEngine/Data/fundamentals_quality_*.

`collectors/alternative.py`

Validates each spec'd numeric sub-section (analyst_consensus, eps_revision, options_flow, insider_activity, institutional) of the assembled per-ticker payload before the S3 write. The pre-existing per-source ok_ratio gate only checks data presence; this closes the value-range half. Block → refuse the whole ticker write, accounted exactly like a fetch failure (existing failed/errors + ok_ratio machinery surfaces it). Env-tunable block set via ALT_BLOCK_ANOMALY_TYPES. CW metric AlphaEngine/Data/alternative_quality_*. Quality counts threaded into the manifest + every return path.

Severity defaults

Mirrors #215's split: definitely-corrupt conditions (NaN/inf, negative-where-impossible, intra-bar inconsistency, negative volume) block by default; rare-but-possibly-legitimate (gross outlier — e.g. a real -300% ROE for a wiped-out firm that defeated an upstream clamp) warns. Operators flip behavior per-collector via the env vars; unknown-type / malformed-JSON raises (NoSilentFails), matching _load_block_anomaly_types.

Tests

+39 tests (1220 → 1259); full suite 1258 passed, 1 skipped (pre-existing). No new pip deps.

tests/test_price_validator.py: TestValidateTodayRowIntrabarChecks (Close-out-of-band block, negative-volume block, zero-volume-still-warns, double-count avoidance) + TestValidateFeatureRecord (NaN/inf block, negative-where-nonneg block, gross-outlier warn, None/non-numeric/absent skip, NaN-short-circuits-further-checks). Updated the DEFAULT_BLOCK_ANOMALY_TYPES contract test for the two added intra-bar types.
tests/test_fundamentals_finnhub.py: TestFundamentalsValueRangeGate (NaN block + NEUTRAL'd, negative-margin block, gross-outlier warn, return-path field presence, env-loader malformed/unknown/promote).
tests/test_alternative_value_range_gate.py (new): block paths (NaN/negative target/negative fund count, blocked-ticker-not-written), warn path, clean path, manifest + return contract, env-tunable block set (malformed/unknown/promote/empty-list observation mode). Reuses the established _patch_collect stub pattern from test_alternative_ok_ratio_gate.py.

🤖 Generated with Claude Code

…ends #215) Extends the #215 write-time quality gate beyond builders/daily_append.py to the two feature collectors that bypass validate_today_row entirely and write feature-source rows straight to S3 / ArcticDB: - validators/price_validator.py: - Add the two intra-bar checks the ROADMAP L1243 residual called out as not-yet-covered by validate_today_row: intrabar_inconsistent (Close outside [Low, High] even when High>=Low) + negative_volume (the existing zero_volume check only caught Volume==0). Both default-block (unambiguous corruption); both added to DEFAULT_BLOCK_ANOMALY_TYPES. - New validate_feature_record(record, field_specs, ticker): generic value-range gate for non-OHLCV records. Same structured-anomaly + per-type-severity contract as validate_today_row. Checks NaN/inf (block), negative-where-field-declared-nonneg (block), gross outlier outside a caller-declared sane band (warn). None values skipped (coverage is the ok_ratio gate's job, not value-range's). - collectors/fundamentals.py: validate each per-ticker dict before the S3 snapshot. Block -> replace with NEUTRAL + logger.error (counted as an error so the ok_ratio gate already accounts for it). Env-tunable block set via FUNDAMENTALS_BLOCK_ANOMALY_TYPES. CW metric AlphaEngine/Data/fundamentals_quality_*. - collectors/alternative.py: validate each spec'd numeric sub-section of the assembled per-ticker payload before the S3 write. Block -> refuse the whole ticker write, accounted exactly like a fetch failure. Env-tunable block set via ALT_BLOCK_ANOMALY_TYPES. CW metric AlphaEngine/Data/alternative_quality_*. Quality counts threaded into the manifest + every return path. Severity defaults mirror #215's split: definitely-corrupt (NaN/inf, negative-where-impossible, intra-bar inconsistency, negative volume) blocks by default; rare-but-possibly-legitimate (gross outlier) warns. Operators flip via the per-collector env vars (unknown-type / malformed-JSON raises, NoSilentFails). Tests: +39 (1220 -> 1259). New TestValidateTodayRowIntrabarChecks + TestValidateFeatureRecord in test_price_validator.py; TestFundamentalsValueRangeGate in test_fundamentals_finnhub.py; new tests/test_alternative_value_range_gate.py. Full suite 1258 passed, 1 skipped (pre-existing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…paid-tier; clears Phase-2 gate) Root cause: FMP's `analyst-estimates` endpoint began returning 402 Payment Required on the free tier (~2026-05-17 — paid-tier only), collapsing the `eps_revision` source to ~15% populated and breaching DataPhase2's per-source populated-ratio gate (`_REQUIRED_POPULATED_RATIOS["eps_revision"] = 0.50`), so DataPhase2 returned ERROR on every weekly run (incl. 5/17). Operator (Brian) decided "swap source". This swaps `_fetch_revisions` onto yfinance `Ticker.eps_trend` — yfinance is already the integrated provider for `target_price`/`options_flow` in this same module, so the auth/availability/idiom precedent is established (no new pip dep). Field chosen: `eps_trend` is itself a revision series — a DataFrame indexed by period (0q/+1q/0y/+1y) with columns current/7daysAgo/30daysAgo/60daysAgo/90daysAgo (consensus-mean EPS estimate snapshots at those lookbacks). The annual `0y` row is used, mirroring the prior FMP `period="annual"` choice: * current_estimate ← 0y row `current` column. * revision_4w ← (current - 30daysAgo) / abs(30daysAgo) * 100 on the 0y row (same "now vs ~30d ago" semantic as the prior FMP path, but sourced from yfinance's own 30-day-ago snapshot — removing the dependency on a `archive/revisions/{date}.json` snapshot that is never written anywhere in this codebase, i.e. the old revision path was effectively dead). * streak ← count of consecutive non-negative steps in the 0y estimate walking newest→oldest across current→7d→30d→60d→90d (max 4). The prior FMP impl hardcoded streak=0 (no multi-snapshot series to derive from); this is a strictly more faithful derivation of the field's intended meaning. Field and key preserved — not dropped. Contract unchanged: return dict keys/types {current_estimate: float|None, revision_4w: float|None, streak: int} are identical, so `_has_revision_data` and all downstream consumers are unaffected. `bucket`/`run_date` params retained in the signature; the read-only S3 prior-snapshot path is kept as a non-regressing fallback (no-op in practice since no writer exists). Sample populated ratio (real yfinance calls, worktree code, 8 representative tickers incl. recent listing ARM and never-profitable RBLX): 8/8 = 100% — decisively clears the 0.50 gate. yfinance exceptions carry no API credentials (no keyed querystring), so the existing module yfinance warning idiom (_fetch_analyst's target_price block) is mirrored; the `_scrub_url_creds` helper added by open PR #255 consolidates the FMP/keyed-fetch scrub surface and is not needed on this unkeyed path. Merge order vs concurrent open PRs touching collectors/alternative.py: recommend #255 (adds _scrub_url_creds) → this → #254. This branch is based off origin/main and will likely need a trivial rebase after #255 merges (no functional conflict — different functions). Tests: new tests/test_alternative_revisions.py (13 tests) mirrors test_alternative_analyst.py style — mocks yfinance via patched sys.modules + a DataFrame eps_trend fixture; asserts contract keys/types unchanged, revision_4w sign/Δ math (positive, negative, negative-EPS abs() denominator), streak derivation (full/partial/broken runs), None-safety (empty/None df, missing 0y row, NaN 30daysAgo), loud-degrade on yfinance failure, and signature param preservation. Full suite: 1232 passed, 1 skipped (pre-existing skip, unrelated). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…paid-tier; clears Phase-2 gate) (#257) Root cause: FMP's `analyst-estimates` endpoint began returning 402 Payment Required on the free tier (~2026-05-17 — paid-tier only), collapsing the `eps_revision` source to ~15% populated and breaching DataPhase2's per-source populated-ratio gate (`_REQUIRED_POPULATED_RATIOS["eps_revision"] = 0.50`), so DataPhase2 returned ERROR on every weekly run (incl. 5/17). Operator (Brian) decided "swap source". This swaps `_fetch_revisions` onto yfinance `Ticker.eps_trend` — yfinance is already the integrated provider for `target_price`/`options_flow` in this same module, so the auth/availability/idiom precedent is established (no new pip dep). Field chosen: `eps_trend` is itself a revision series — a DataFrame indexed by period (0q/+1q/0y/+1y) with columns current/7daysAgo/30daysAgo/60daysAgo/90daysAgo (consensus-mean EPS estimate snapshots at those lookbacks). The annual `0y` row is used, mirroring the prior FMP `period="annual"` choice: * current_estimate ← 0y row `current` column. * revision_4w ← (current - 30daysAgo) / abs(30daysAgo) * 100 on the 0y row (same "now vs ~30d ago" semantic as the prior FMP path, but sourced from yfinance's own 30-day-ago snapshot — removing the dependency on a `archive/revisions/{date}.json` snapshot that is never written anywhere in this codebase, i.e. the old revision path was effectively dead). * streak ← count of consecutive non-negative steps in the 0y estimate walking newest→oldest across current→7d→30d→60d→90d (max 4). The prior FMP impl hardcoded streak=0 (no multi-snapshot series to derive from); this is a strictly more faithful derivation of the field's intended meaning. Field and key preserved — not dropped. Contract unchanged: return dict keys/types {current_estimate: float|None, revision_4w: float|None, streak: int} are identical, so `_has_revision_data` and all downstream consumers are unaffected. `bucket`/`run_date` params retained in the signature; the read-only S3 prior-snapshot path is kept as a non-regressing fallback (no-op in practice since no writer exists). Sample populated ratio (real yfinance calls, worktree code, 8 representative tickers incl. recent listing ARM and never-profitable RBLX): 8/8 = 100% — decisively clears the 0.50 gate. yfinance exceptions carry no API credentials (no keyed querystring), so the existing module yfinance warning idiom (_fetch_analyst's target_price block) is mirrored; the `_scrub_url_creds` helper added by open PR #255 consolidates the FMP/keyed-fetch scrub surface and is not needed on this unkeyed path. Merge order vs concurrent open PRs touching collectors/alternative.py: recommend #255 (adds _scrub_url_creds) → this → #254. This branch is based off origin/main and will likely need a trivial rebase after #255 merges (no functional conflict — different functions). Tests: new tests/test_alternative_revisions.py (13 tests) mirrors test_alternative_analyst.py style — mocks yfinance via patched sys.modules + a DataFrame eps_trend fixture; asserts contract keys/types unchanged, revision_4w sign/Δ math (positive, negative, negative-EPS abs() denominator), streak derivation (full/partial/broken runs), None-safety (empty/None df, missing 0y row, NaN 30daysAgo), loud-degrade on yfinance failure, and signature param preservation. Full suite: 1232 passed, 1 skipped (pre-existing skip, unrelated). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…y rollback chain since PR #254 (#274) The Phase 2 Lambda CI deploy has been failing on every push since 2026-05-18 18:20Z. PR #254 (per-collector value-range validation, merged 5/16) added top-level imports from validators.price_validator import (...) to collectors/alternative.py + collectors/fundamentals.py, but did NOT add `COPY validators/` to the Dockerfile. Every push that touched a deploy.yml-triggering path since then built a fresh image, pushed v88, ran the canary, hit `No module named 'validators'` at module load, and rolled back to v87. 10 consecutive failed deploys (5/18-18:20Z through 5/20-00:25Z) sat unnoticed because the canary correctly rolled back so prod (v87) was unaffected — but the latent break blocked ANY new code from ever reaching `live`. Surfaced by the Wave-3 PR3-wave-2 deploy (#273) when Brian saw the rollback in the CI log. Fix: one-line `COPY validators/ ${LAMBDA_TASK_ROOT}/validators/` next to the other application-code COPY directives. Regression guard: new tests/test_dockerfile_copies_match_deployed _imports.py with two assertions — 1. `test_dockerfile_copies_validators_for_collectors_imports` — explicit pin for the validators/ case so the failure surfaces as a clear test name if a future refactor drops the COPY. 2. `test_every_toplevel_local_import_in_lambda_code_is_dockerfile _copied` — generic ast-scan over every deployed Python file (lambda/, weekly_collector.py, polygon_client.py, collectors/, store/, validators/). Module-scope `from <local_pkg> import ...` / `import <local_pkg>` where <local_pkg> is a directory with __init__.py at the repo root MUST appear in the Dockerfile's COPY directives — or in the explicitly-non-deployed allowlist (`tests`, `builders`, `infrastructure`, `rag`, `features`). New top-level imports of un-COPY'd packages fail in CI now, not in the post-merge canary. Suite 1399 → 1401 green. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…279) Phase 3a of the pillar-decomposed attractiveness scoring arc (plan doc `alpha-engine-docs/private/attractiveness-pillars-260520.md`; ROADMAP P1 = alpha-engine-config #254 merged; lib Phase 1 = alpha-engine-lib #53 v0.22.0; research+config Phase 2 = alpha-engine-research #207 + alpha-engine-config #255 merged). Adds 5 new fundamental fields surfaced from the existing Finnhub `/stock/metric?metric=all` response — no new API integrations, no extra rate-limit pressure on the existing collector budget. The fields back the Growth + Stewardship pillar quant composites that get added to `scoring/factor_scoring.py` in the follow-up Phase 3b PR (alpha-engine-research): Growth pillar quant substrate (2): - `revenue_growth_3y` — 3y revenue CAGR (Finnhub `revenueGrowth3Y`, fallback `revenueGrowth5Y`); clipped to [-0.5, 1.5] - `eps_growth_3y` — 3y EPS CAGR (Finnhub `epsGrowth3Y`, fallbacks `epsBasicExclExtraItemsAnnual5Y` / `epsGrowth5Y`); clipped to [-1.0, 2.0] Stewardship pillar quant substrate (3): - `payout_ratio` — TTM dividends / NI (Finnhub `payoutRatioTTM` → `payoutRatioAnnual`); clipped to [0, 2] - `dividend_yield` — Indicated annual yield (Finnhub `dividendYieldIndicatedAnnual` → `currentDividendYieldTTM`); clipped to [0, 0.20] - `capex_growth_5y` — 5y CAPEX growth (Finnhub `capitalSpendingGrowth5Y`); clipped to [-1, 2] 3y CAGR is preferred over TTM YoY for ranking because it's smoother (base-effect / single-quarter anomalies average out); 5y/annual fallbacks for newer listings without a full 3y history. Insider ownership % is NOT in this PR — Finnhub `metric=all` does not surface it; would require a separate `/stock/insider-transactions` integration (extra API calls + rate-limit pressure). Deferred to a follow-up if/when stewardship's composite weight in the backtester optimization argues for it. The three signals shipped here cover the "capital allocation discipline" axis: payout (return-of-capital intensity), dividend yield (combined with payout, identifies low-yield + low-payout = buyback-heavy retainers), and CAPEX growth (reinvestment intensity). Plumbed end-to-end: - `collectors/fundamentals.py`: NEUTRAL grows to 13 fields (was 8); `_fetch_single_ticker` extracts the 5 new Finnhub fields with TTM-preferred / annual-fallback chains; values clipped to the ranges above. - `features/registry.py`: 5 new `FeatureEntry` records (group= "fundamental"), bringing the fundamental group from 8 → 13. - `features/feature_engineer.py`: 5 new columns in `EXPECTED_FEATURE_COLUMNS` + DataFrame-write site populates from `fundamental_data` dict / falls back to 0.0 when fundamental_data is None (matches existing pattern). Behavior change: when the Saturday SF DataPhase1 runs after this PR merges + deploys, `features/{date}/fundamental.parquet` carries 13 columns instead of 8. Downstream consumers (alpha-engine-research `scoring/factor_scoring.py` + the Phase 3b composite extension) treat the new columns as additive — the 4 existing composites (quality / momentum / value / low_vol) and their _n counts continue to compute identically. No behavioral coupling with Phase 3b — the new columns can sit in the parquet for a soak before research's Phase 3b PR consumes them. Tests: 12 new tests in `tests/test_fundamentals_finnhub.py` covering NEUTRAL field completeness, field mapping for typical AAPL payload, fallback chains for each of the 5 fields, clipping bounds (extreme growth, extreme yield, payout > 2), empty-payload preservation. Suite: 1400 → 1412 passing (zero regressions, 7.47s). Composes with: factor-substrate-260513 (this PR extends its fundamental.parquet schema); alpha-engine-research Phase 3b (consumer composites land in a follow-up PR). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 mentioned this pull request May 18, 2026

feat(data): migrate eps_revision FMP→yfinance (analyst-estimates 402 paid-tier; clears Phase-2 gate) #257

Merged

cipher813 merged commit 2c1128e into main May 18, 2026
1 check passed

cipher813 deleted the feat/per-collector-value-range-validation branch May 18, 2026 18:20

cipher813 mentioned this pull request May 20, 2026

fix(lambda): COPY validators/ into Phase 2 image — closes 2-day canary rollback chain (since PR #254) #274

Merged

cipher813 mentioned this pull request May 20, 2026

feat(fundamentals): Growth + Stewardship pillar substrate (Phase 3a) #279

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(data): per-collector value-range validation (ROADMAP L1243 / extends #215)#254

feat(data): per-collector value-range validation (ROADMAP L1243 / extends #215)#254
cipher813 merged 1 commit into
mainfrom
feat/per-collector-value-range-validation

cipher813 commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cipher813 commented May 18, 2026

Summary

Scope

validators/price_validator.py

Checks added (field → severity)

collectors/fundamentals.py

collectors/alternative.py

Severity defaults

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`validators/price_validator.py`

`collectors/fundamentals.py`

`collectors/alternative.py`