feat(data): per-collector value-range validation (ROADMAP L1243 / extends #215)#254
Merged
Merged
Conversation
…ends #215) Extends the #215 write-time quality gate beyond builders/daily_append.py to the two feature collectors that bypass validate_today_row entirely and write feature-source rows straight to S3 / ArcticDB: - validators/price_validator.py: - Add the two intra-bar checks the ROADMAP L1243 residual called out as not-yet-covered by validate_today_row: intrabar_inconsistent (Close outside [Low, High] even when High>=Low) + negative_volume (the existing zero_volume check only caught Volume==0). Both default-block (unambiguous corruption); both added to DEFAULT_BLOCK_ANOMALY_TYPES. - New validate_feature_record(record, field_specs, ticker): generic value-range gate for non-OHLCV records. Same structured-anomaly + per-type-severity contract as validate_today_row. Checks NaN/inf (block), negative-where-field-declared-nonneg (block), gross outlier outside a caller-declared sane band (warn). None values skipped (coverage is the ok_ratio gate's job, not value-range's). - collectors/fundamentals.py: validate each per-ticker dict before the S3 snapshot. Block -> replace with NEUTRAL + logger.error (counted as an error so the ok_ratio gate already accounts for it). Env-tunable block set via FUNDAMENTALS_BLOCK_ANOMALY_TYPES. CW metric AlphaEngine/Data/fundamentals_quality_*. - collectors/alternative.py: validate each spec'd numeric sub-section of the assembled per-ticker payload before the S3 write. Block -> refuse the whole ticker write, accounted exactly like a fetch failure. Env-tunable block set via ALT_BLOCK_ANOMALY_TYPES. CW metric AlphaEngine/Data/alternative_quality_*. Quality counts threaded into the manifest + every return path. Severity defaults mirror #215's split: definitely-corrupt (NaN/inf, negative-where-impossible, intra-bar inconsistency, negative volume) blocks by default; rare-but-possibly-legitimate (gross outlier) warns. Operators flip via the per-collector env vars (unknown-type / malformed-JSON raises, NoSilentFails). Tests: +39 (1220 -> 1259). New TestValidateTodayRowIntrabarChecks + TestValidateFeatureRecord in test_price_validator.py; TestFundamentalsValueRangeGate in test_fundamentals_finnhub.py; new tests/test_alternative_value_range_gate.py. Full suite 1258 passed, 1 skipped (pre-existing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
May 18, 2026
…paid-tier; clears Phase-2 gate)
Root cause: FMP's `analyst-estimates` endpoint began returning 402 Payment
Required on the free tier (~2026-05-17 — paid-tier only), collapsing the
`eps_revision` source to ~15% populated and breaching DataPhase2's
per-source populated-ratio gate (`_REQUIRED_POPULATED_RATIOS["eps_revision"]
= 0.50`), so DataPhase2 returned ERROR on every weekly run (incl. 5/17).
Operator (Brian) decided "swap source". This swaps `_fetch_revisions` onto
yfinance `Ticker.eps_trend` — yfinance is already the integrated provider
for `target_price`/`options_flow` in this same module, so the
auth/availability/idiom precedent is established (no new pip dep).
Field chosen: `eps_trend` is itself a revision series — a DataFrame
indexed by period (0q/+1q/0y/+1y) with columns
current/7daysAgo/30daysAgo/60daysAgo/90daysAgo (consensus-mean EPS estimate
snapshots at those lookbacks). The annual `0y` row is used, mirroring the
prior FMP `period="annual"` choice:
* current_estimate ← 0y row `current` column.
* revision_4w ← (current - 30daysAgo) / abs(30daysAgo) * 100 on the
0y row (same "now vs ~30d ago" semantic as the prior
FMP path, but sourced from yfinance's own 30-day-ago
snapshot — removing the dependency on a
`archive/revisions/{date}.json` snapshot that is
never written anywhere in this codebase, i.e. the
old revision path was effectively dead).
* streak ← count of consecutive non-negative steps in the 0y
estimate walking newest→oldest across
current→7d→30d→60d→90d (max 4). The prior FMP impl
hardcoded streak=0 (no multi-snapshot series to
derive from); this is a strictly more faithful
derivation of the field's intended meaning. Field
and key preserved — not dropped.
Contract unchanged: return dict keys/types
{current_estimate: float|None, revision_4w: float|None, streak: int} are
identical, so `_has_revision_data` and all downstream consumers are
unaffected. `bucket`/`run_date` params retained in the signature; the
read-only S3 prior-snapshot path is kept as a non-regressing fallback
(no-op in practice since no writer exists).
Sample populated ratio (real yfinance calls, worktree code,
8 representative tickers incl. recent listing ARM and never-profitable
RBLX): 8/8 = 100% — decisively clears the 0.50 gate.
yfinance exceptions carry no API credentials (no keyed querystring),
so the existing module yfinance warning idiom (_fetch_analyst's
target_price block) is mirrored; the `_scrub_url_creds` helper added by
open PR #255 consolidates the FMP/keyed-fetch scrub surface and is not
needed on this unkeyed path.
Merge order vs concurrent open PRs touching collectors/alternative.py:
recommend #255 (adds _scrub_url_creds) → this → #254. This branch is
based off origin/main and will likely need a trivial rebase after #255
merges (no functional conflict — different functions).
Tests: new tests/test_alternative_revisions.py (13 tests) mirrors
test_alternative_analyst.py style — mocks yfinance via patched
sys.modules + a DataFrame eps_trend fixture; asserts contract keys/types
unchanged, revision_4w sign/Δ math (positive, negative, negative-EPS
abs() denominator), streak derivation (full/partial/broken runs),
None-safety (empty/None df, missing 0y row, NaN 30daysAgo), loud-degrade
on yfinance failure, and signature param preservation. Full suite:
1232 passed, 1 skipped (pre-existing skip, unrelated).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
May 18, 2026
…paid-tier; clears Phase-2 gate) (#257) Root cause: FMP's `analyst-estimates` endpoint began returning 402 Payment Required on the free tier (~2026-05-17 — paid-tier only), collapsing the `eps_revision` source to ~15% populated and breaching DataPhase2's per-source populated-ratio gate (`_REQUIRED_POPULATED_RATIOS["eps_revision"] = 0.50`), so DataPhase2 returned ERROR on every weekly run (incl. 5/17). Operator (Brian) decided "swap source". This swaps `_fetch_revisions` onto yfinance `Ticker.eps_trend` — yfinance is already the integrated provider for `target_price`/`options_flow` in this same module, so the auth/availability/idiom precedent is established (no new pip dep). Field chosen: `eps_trend` is itself a revision series — a DataFrame indexed by period (0q/+1q/0y/+1y) with columns current/7daysAgo/30daysAgo/60daysAgo/90daysAgo (consensus-mean EPS estimate snapshots at those lookbacks). The annual `0y` row is used, mirroring the prior FMP `period="annual"` choice: * current_estimate ← 0y row `current` column. * revision_4w ← (current - 30daysAgo) / abs(30daysAgo) * 100 on the 0y row (same "now vs ~30d ago" semantic as the prior FMP path, but sourced from yfinance's own 30-day-ago snapshot — removing the dependency on a `archive/revisions/{date}.json` snapshot that is never written anywhere in this codebase, i.e. the old revision path was effectively dead). * streak ← count of consecutive non-negative steps in the 0y estimate walking newest→oldest across current→7d→30d→60d→90d (max 4). The prior FMP impl hardcoded streak=0 (no multi-snapshot series to derive from); this is a strictly more faithful derivation of the field's intended meaning. Field and key preserved — not dropped. Contract unchanged: return dict keys/types {current_estimate: float|None, revision_4w: float|None, streak: int} are identical, so `_has_revision_data` and all downstream consumers are unaffected. `bucket`/`run_date` params retained in the signature; the read-only S3 prior-snapshot path is kept as a non-regressing fallback (no-op in practice since no writer exists). Sample populated ratio (real yfinance calls, worktree code, 8 representative tickers incl. recent listing ARM and never-profitable RBLX): 8/8 = 100% — decisively clears the 0.50 gate. yfinance exceptions carry no API credentials (no keyed querystring), so the existing module yfinance warning idiom (_fetch_analyst's target_price block) is mirrored; the `_scrub_url_creds` helper added by open PR #255 consolidates the FMP/keyed-fetch scrub surface and is not needed on this unkeyed path. Merge order vs concurrent open PRs touching collectors/alternative.py: recommend #255 (adds _scrub_url_creds) → this → #254. This branch is based off origin/main and will likely need a trivial rebase after #255 merges (no functional conflict — different functions). Tests: new tests/test_alternative_revisions.py (13 tests) mirrors test_alternative_analyst.py style — mocks yfinance via patched sys.modules + a DataFrame eps_trend fixture; asserts contract keys/types unchanged, revision_4w sign/Δ math (positive, negative, negative-EPS abs() denominator), streak derivation (full/partial/broken runs), None-safety (empty/None df, missing 0y row, NaN 30daysAgo), loud-degrade on yfinance failure, and signature param preservation. Full suite: 1232 passed, 1 skipped (pre-existing skip, unrelated). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
May 20, 2026
…y rollback chain since PR #254 (#274) The Phase 2 Lambda CI deploy has been failing on every push since 2026-05-18 18:20Z. PR #254 (per-collector value-range validation, merged 5/16) added top-level imports from validators.price_validator import (...) to collectors/alternative.py + collectors/fundamentals.py, but did NOT add `COPY validators/` to the Dockerfile. Every push that touched a deploy.yml-triggering path since then built a fresh image, pushed v88, ran the canary, hit `No module named 'validators'` at module load, and rolled back to v87. 10 consecutive failed deploys (5/18-18:20Z through 5/20-00:25Z) sat unnoticed because the canary correctly rolled back so prod (v87) was unaffected — but the latent break blocked ANY new code from ever reaching `live`. Surfaced by the Wave-3 PR3-wave-2 deploy (#273) when Brian saw the rollback in the CI log. Fix: one-line `COPY validators/ ${LAMBDA_TASK_ROOT}/validators/` next to the other application-code COPY directives. Regression guard: new tests/test_dockerfile_copies_match_deployed _imports.py with two assertions — 1. `test_dockerfile_copies_validators_for_collectors_imports` — explicit pin for the validators/ case so the failure surfaces as a clear test name if a future refactor drops the COPY. 2. `test_every_toplevel_local_import_in_lambda_code_is_dockerfile _copied` — generic ast-scan over every deployed Python file (lambda/, weekly_collector.py, polygon_client.py, collectors/, store/, validators/). Module-scope `from <local_pkg> import ...` / `import <local_pkg>` where <local_pkg> is a directory with __init__.py at the repo root MUST appear in the Dockerfile's COPY directives — or in the explicitly-non-deployed allowlist (`tests`, `builders`, `infrastructure`, `rag`, `features`). New top-level imports of un-COPY'd packages fail in CI now, not in the post-merge canary. Suite 1399 → 1401 green. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
cipher813
added a commit
that referenced
this pull request
May 20, 2026
…279) Phase 3a of the pillar-decomposed attractiveness scoring arc (plan doc `alpha-engine-docs/private/attractiveness-pillars-260520.md`; ROADMAP P1 = alpha-engine-config #254 merged; lib Phase 1 = alpha-engine-lib #53 v0.22.0; research+config Phase 2 = alpha-engine-research #207 + alpha-engine-config #255 merged). Adds 5 new fundamental fields surfaced from the existing Finnhub `/stock/metric?metric=all` response — no new API integrations, no extra rate-limit pressure on the existing collector budget. The fields back the Growth + Stewardship pillar quant composites that get added to `scoring/factor_scoring.py` in the follow-up Phase 3b PR (alpha-engine-research): Growth pillar quant substrate (2): - `revenue_growth_3y` — 3y revenue CAGR (Finnhub `revenueGrowth3Y`, fallback `revenueGrowth5Y`); clipped to [-0.5, 1.5] - `eps_growth_3y` — 3y EPS CAGR (Finnhub `epsGrowth3Y`, fallbacks `epsBasicExclExtraItemsAnnual5Y` / `epsGrowth5Y`); clipped to [-1.0, 2.0] Stewardship pillar quant substrate (3): - `payout_ratio` — TTM dividends / NI (Finnhub `payoutRatioTTM` → `payoutRatioAnnual`); clipped to [0, 2] - `dividend_yield` — Indicated annual yield (Finnhub `dividendYieldIndicatedAnnual` → `currentDividendYieldTTM`); clipped to [0, 0.20] - `capex_growth_5y` — 5y CAPEX growth (Finnhub `capitalSpendingGrowth5Y`); clipped to [-1, 2] 3y CAGR is preferred over TTM YoY for ranking because it's smoother (base-effect / single-quarter anomalies average out); 5y/annual fallbacks for newer listings without a full 3y history. Insider ownership % is NOT in this PR — Finnhub `metric=all` does not surface it; would require a separate `/stock/insider-transactions` integration (extra API calls + rate-limit pressure). Deferred to a follow-up if/when stewardship's composite weight in the backtester optimization argues for it. The three signals shipped here cover the "capital allocation discipline" axis: payout (return-of-capital intensity), dividend yield (combined with payout, identifies low-yield + low-payout = buyback-heavy retainers), and CAPEX growth (reinvestment intensity). Plumbed end-to-end: - `collectors/fundamentals.py`: NEUTRAL grows to 13 fields (was 8); `_fetch_single_ticker` extracts the 5 new Finnhub fields with TTM-preferred / annual-fallback chains; values clipped to the ranges above. - `features/registry.py`: 5 new `FeatureEntry` records (group= "fundamental"), bringing the fundamental group from 8 → 13. - `features/feature_engineer.py`: 5 new columns in `EXPECTED_FEATURE_COLUMNS` + DataFrame-write site populates from `fundamental_data` dict / falls back to 0.0 when fundamental_data is None (matches existing pattern). Behavior change: when the Saturday SF DataPhase1 runs after this PR merges + deploys, `features/{date}/fundamental.parquet` carries 13 columns instead of 8. Downstream consumers (alpha-engine-research `scoring/factor_scoring.py` + the Phase 3b composite extension) treat the new columns as additive — the 4 existing composites (quality / momentum / value / low_vol) and their _n counts continue to compute identically. No behavioral coupling with Phase 3b — the new columns can sit in the parquet for a soak before research's Phase 3b PR consumes them. Tests: 12 new tests in `tests/test_fundamentals_finnhub.py` covering NEUTRAL field completeness, field mapping for typical AAPL payload, fallback chains for each of the 5 fields, clipping bounds (extreme growth, extreme yield, payout > 2), empty-payload preservation. Suite: 1400 → 1412 passing (zero regressions, 7.47s). Composes with: factor-substrate-260513 (this PR extends its fundamental.parquet schema); alpha-engine-research Phase 3b (consumer composites land in a follow-up PR). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the residual scope of ROADMAP L1243 — "Per-collector value-range validation" (P1, promoted from P2 2026-05-16). L940 / #215 shipped the write-time quality gate on the OHLCV
builders/daily_append.pypath. This PR extends equivalent write-time value-range validation to the two feature collectors that bypassdaily_appendentirely and write feature-source rows straight to S3 (consumed by the predictor feature store + research scoring) — the exact FMP-zero'd-fundamentals class that already burned ~2 weeks of alpha.Scope
validators/price_validator.pyvalidate_today_row:intrabar_inconsistent—Closeoutside[Low, High]even whenHigh >= Low(a split-day reporting lag / corp-action artifact thatbad_ohlc's High-vs-Low check misses). Default block. Skipped whenHigh < Lowso it doesn't double-count the same corrupt barbad_ohlcalready owns.negative_volume— the existingzero_volumecheck only caughtVolume == 0; a negative volume is unambiguously corrupt. Default block.DEFAULT_BLOCK_ANOMALY_TYPES.validate_feature_record(record, field_specs, ticker)— generic value-range gate for non-OHLCV records, same structured-anomaly + per-type-severity contract asvalidate_today_rowso callers reuse identical block-set / metric wiring. Caller passes a per-field spec (nonneg/lo/hi, all optional).Nonevalues skipped (coverage is the ok_ratio gate's job).Checks added (field → severity)
validate_today_row(OHLCV)Closeoutside[Low, High]intrabar_inconsistentvalidate_today_row(OHLCV)Volume < 0negative_volumenan_or_infnonneg)< 0negative_where_nonneglo/hi)gross_outliercollectors/fundamentals.pyValidates each fully-shaped (post-
_clip) per-ticker dict before the S3 snapshot. The load-bearing residual_clipcan't catch is NaN/inf (clip of NaN propagates NaN) + structural non-negativity ofgross_margin/current_ratio;lo/himirror the clip bands so a gross outlier surfaces if a future refactor drops a_clip. Block → replace withNEUTRAL+logger.error(counted as an error so the existing ok_ratio gate accounts for it). Env-tunable block set viaFUNDAMENTALS_BLOCK_ANOMALY_TYPES. CW metricAlphaEngine/Data/fundamentals_quality_*.collectors/alternative.pyValidates each spec'd numeric sub-section (
analyst_consensus,eps_revision,options_flow,insider_activity,institutional) of the assembled per-ticker payload before the S3 write. The pre-existing per-source ok_ratio gate only checks data presence; this closes the value-range half. Block → refuse the whole ticker write, accounted exactly like a fetch failure (existingfailed/errors+ ok_ratio machinery surfaces it). Env-tunable block set viaALT_BLOCK_ANOMALY_TYPES. CW metricAlphaEngine/Data/alternative_quality_*. Quality counts threaded into the manifest + every return path.Severity defaults
Mirrors #215's split: definitely-corrupt conditions (NaN/inf, negative-where-impossible, intra-bar inconsistency, negative volume) block by default; rare-but-possibly-legitimate (gross outlier — e.g. a real -300% ROE for a wiped-out firm that defeated an upstream clamp) warns. Operators flip behavior per-collector via the env vars; unknown-type / malformed-JSON raises (NoSilentFails), matching
_load_block_anomaly_types.Tests
+39 tests (1220 → 1259); full suite 1258 passed, 1 skipped (pre-existing). No new pip deps.
tests/test_price_validator.py:TestValidateTodayRowIntrabarChecks(Close-out-of-band block, negative-volume block, zero-volume-still-warns, double-count avoidance) +TestValidateFeatureRecord(NaN/inf block, negative-where-nonneg block, gross-outlier warn, None/non-numeric/absent skip, NaN-short-circuits-further-checks). Updated theDEFAULT_BLOCK_ANOMALY_TYPEScontract test for the two added intra-bar types.tests/test_fundamentals_finnhub.py:TestFundamentalsValueRangeGate(NaN block + NEUTRAL'd, negative-margin block, gross-outlier warn, return-path field presence, env-loader malformed/unknown/promote).tests/test_alternative_value_range_gate.py(new): block paths (NaN/negative target/negative fund count, blocked-ticker-not-written), warn path, clean path, manifest + return contract, env-tunable block set (malformed/unknown/promote/empty-list observation mode). Reuses the established_patch_collectstub pattern fromtest_alternative_ok_ratio_gate.py.🤖 Generated with Claude Code