Skip to content

feat(data): per-collector value-range validation (ROADMAP L1243 / extends #215)#254

Merged
cipher813 merged 1 commit into
mainfrom
feat/per-collector-value-range-validation
May 18, 2026
Merged

feat(data): per-collector value-range validation (ROADMAP L1243 / extends #215)#254
cipher813 merged 1 commit into
mainfrom
feat/per-collector-value-range-validation

Conversation

@cipher813
Copy link
Copy Markdown
Owner

Summary

Closes the residual scope of ROADMAP L1243 — "Per-collector value-range validation" (P1, promoted from P2 2026-05-16). L940 / #215 shipped the write-time quality gate on the OHLCV builders/daily_append.py path. This PR extends equivalent write-time value-range validation to the two feature collectors that bypass daily_append entirely and write feature-source rows straight to S3 (consumed by the predictor feature store + research scoring) — the exact FMP-zero'd-fundamentals class that already burned ~2 weeks of alpha.

Scope

validators/price_validator.py

  • Two intra-bar checks the ROADMAP residual called out as not-yet-covered by validate_today_row:
    • intrabar_inconsistentClose outside [Low, High] even when High >= Low (a split-day reporting lag / corp-action artifact that bad_ohlc's High-vs-Low check misses). Default block. Skipped when High < Low so it doesn't double-count the same corrupt bar bad_ohlc already owns.
    • negative_volume — the existing zero_volume check only caught Volume == 0; a negative volume is unambiguously corrupt. Default block.
    • Both added to DEFAULT_BLOCK_ANOMALY_TYPES.
  • New validate_feature_record(record, field_specs, ticker) — generic value-range gate for non-OHLCV records, same structured-anomaly + per-type-severity contract as validate_today_row so callers reuse identical block-set / metric wiring. Caller passes a per-field spec (nonneg / lo / hi, all optional). None values skipped (coverage is the ok_ratio gate's job).

Checks added (field → severity)

Surface Check Anomaly type Default severity
validate_today_row (OHLCV) Close outside [Low, High] intrabar_inconsistent block
validate_today_row (OHLCV) Volume < 0 negative_volume block
feature record (any spec'd field) NaN / inf nan_or_inf block
feature record (field declared nonneg) value < 0 negative_where_nonneg block
feature record (field with lo/hi) value outside band gross_outlier warn

collectors/fundamentals.py

Validates each fully-shaped (post-_clip) per-ticker dict before the S3 snapshot. The load-bearing residual _clip can't catch is NaN/inf (clip of NaN propagates NaN) + structural non-negativity of gross_margin/current_ratio; lo/hi mirror the clip bands so a gross outlier surfaces if a future refactor drops a _clip. Block → replace with NEUTRAL + logger.error (counted as an error so the existing ok_ratio gate accounts for it). Env-tunable block set via FUNDAMENTALS_BLOCK_ANOMALY_TYPES. CW metric AlphaEngine/Data/fundamentals_quality_*.

collectors/alternative.py

Validates each spec'd numeric sub-section (analyst_consensus, eps_revision, options_flow, insider_activity, institutional) of the assembled per-ticker payload before the S3 write. The pre-existing per-source ok_ratio gate only checks data presence; this closes the value-range half. Block → refuse the whole ticker write, accounted exactly like a fetch failure (existing failed/errors + ok_ratio machinery surfaces it). Env-tunable block set via ALT_BLOCK_ANOMALY_TYPES. CW metric AlphaEngine/Data/alternative_quality_*. Quality counts threaded into the manifest + every return path.

Severity defaults

Mirrors #215's split: definitely-corrupt conditions (NaN/inf, negative-where-impossible, intra-bar inconsistency, negative volume) block by default; rare-but-possibly-legitimate (gross outlier — e.g. a real -300% ROE for a wiped-out firm that defeated an upstream clamp) warns. Operators flip behavior per-collector via the env vars; unknown-type / malformed-JSON raises (NoSilentFails), matching _load_block_anomaly_types.

Tests

+39 tests (1220 → 1259); full suite 1258 passed, 1 skipped (pre-existing). No new pip deps.

  • tests/test_price_validator.py: TestValidateTodayRowIntrabarChecks (Close-out-of-band block, negative-volume block, zero-volume-still-warns, double-count avoidance) + TestValidateFeatureRecord (NaN/inf block, negative-where-nonneg block, gross-outlier warn, None/non-numeric/absent skip, NaN-short-circuits-further-checks). Updated the DEFAULT_BLOCK_ANOMALY_TYPES contract test for the two added intra-bar types.
  • tests/test_fundamentals_finnhub.py: TestFundamentalsValueRangeGate (NaN block + NEUTRAL'd, negative-margin block, gross-outlier warn, return-path field presence, env-loader malformed/unknown/promote).
  • tests/test_alternative_value_range_gate.py (new): block paths (NaN/negative target/negative fund count, blocked-ticker-not-written), warn path, clean path, manifest + return contract, env-tunable block set (malformed/unknown/promote/empty-list observation mode). Reuses the established _patch_collect stub pattern from test_alternative_ok_ratio_gate.py.

🤖 Generated with Claude Code

…ends #215)

Extends the #215 write-time quality gate beyond builders/daily_append.py to
the two feature collectors that bypass validate_today_row entirely and write
feature-source rows straight to S3 / ArcticDB:

- validators/price_validator.py:
  - Add the two intra-bar checks the ROADMAP L1243 residual called out as
    not-yet-covered by validate_today_row: intrabar_inconsistent (Close
    outside [Low, High] even when High>=Low) + negative_volume (the existing
    zero_volume check only caught Volume==0). Both default-block (unambiguous
    corruption); both added to DEFAULT_BLOCK_ANOMALY_TYPES.
  - New validate_feature_record(record, field_specs, ticker): generic
    value-range gate for non-OHLCV records. Same structured-anomaly +
    per-type-severity contract as validate_today_row. Checks NaN/inf
    (block), negative-where-field-declared-nonneg (block), gross outlier
    outside a caller-declared sane band (warn). None values skipped
    (coverage is the ok_ratio gate's job, not value-range's).

- collectors/fundamentals.py: validate each per-ticker dict before the S3
  snapshot. Block -> replace with NEUTRAL + logger.error (counted as an
  error so the ok_ratio gate already accounts for it). Env-tunable block
  set via FUNDAMENTALS_BLOCK_ANOMALY_TYPES. CW metric
  AlphaEngine/Data/fundamentals_quality_*.

- collectors/alternative.py: validate each spec'd numeric sub-section of
  the assembled per-ticker payload before the S3 write. Block -> refuse
  the whole ticker write, accounted exactly like a fetch failure.
  Env-tunable block set via ALT_BLOCK_ANOMALY_TYPES. CW metric
  AlphaEngine/Data/alternative_quality_*. Quality counts threaded into the
  manifest + every return path.

Severity defaults mirror #215's split: definitely-corrupt (NaN/inf,
negative-where-impossible, intra-bar inconsistency, negative volume) blocks
by default; rare-but-possibly-legitimate (gross outlier) warns. Operators
flip via the per-collector env vars (unknown-type / malformed-JSON raises,
NoSilentFails).

Tests: +39 (1220 -> 1259). New TestValidateTodayRowIntrabarChecks +
TestValidateFeatureRecord in test_price_validator.py;
TestFundamentalsValueRangeGate in test_fundamentals_finnhub.py; new
tests/test_alternative_value_range_gate.py. Full suite 1258 passed,
1 skipped (pre-existing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 2c1128e into main May 18, 2026
1 check passed
@cipher813 cipher813 deleted the feat/per-collector-value-range-validation branch May 18, 2026 18:20
cipher813 added a commit that referenced this pull request May 18, 2026
…paid-tier; clears Phase-2 gate)

Root cause: FMP's `analyst-estimates` endpoint began returning 402 Payment
Required on the free tier (~2026-05-17 — paid-tier only), collapsing the
`eps_revision` source to ~15% populated and breaching DataPhase2's
per-source populated-ratio gate (`_REQUIRED_POPULATED_RATIOS["eps_revision"]
= 0.50`), so DataPhase2 returned ERROR on every weekly run (incl. 5/17).

Operator (Brian) decided "swap source". This swaps `_fetch_revisions` onto
yfinance `Ticker.eps_trend` — yfinance is already the integrated provider
for `target_price`/`options_flow` in this same module, so the
auth/availability/idiom precedent is established (no new pip dep).

Field chosen: `eps_trend` is itself a revision series — a DataFrame
indexed by period (0q/+1q/0y/+1y) with columns
current/7daysAgo/30daysAgo/60daysAgo/90daysAgo (consensus-mean EPS estimate
snapshots at those lookbacks). The annual `0y` row is used, mirroring the
prior FMP `period="annual"` choice:

  * current_estimate ← 0y row `current` column.
  * revision_4w      ← (current - 30daysAgo) / abs(30daysAgo) * 100 on the
                        0y row (same "now vs ~30d ago" semantic as the prior
                        FMP path, but sourced from yfinance's own 30-day-ago
                        snapshot — removing the dependency on a
                        `archive/revisions/{date}.json` snapshot that is
                        never written anywhere in this codebase, i.e. the
                        old revision path was effectively dead).
  * streak           ← count of consecutive non-negative steps in the 0y
                        estimate walking newest→oldest across
                        current→7d→30d→60d→90d (max 4). The prior FMP impl
                        hardcoded streak=0 (no multi-snapshot series to
                        derive from); this is a strictly more faithful
                        derivation of the field's intended meaning. Field
                        and key preserved — not dropped.

Contract unchanged: return dict keys/types
{current_estimate: float|None, revision_4w: float|None, streak: int} are
identical, so `_has_revision_data` and all downstream consumers are
unaffected. `bucket`/`run_date` params retained in the signature; the
read-only S3 prior-snapshot path is kept as a non-regressing fallback
(no-op in practice since no writer exists).

Sample populated ratio (real yfinance calls, worktree code,
8 representative tickers incl. recent listing ARM and never-profitable
RBLX): 8/8 = 100% — decisively clears the 0.50 gate.

yfinance exceptions carry no API credentials (no keyed querystring),
so the existing module yfinance warning idiom (_fetch_analyst's
target_price block) is mirrored; the `_scrub_url_creds` helper added by
open PR #255 consolidates the FMP/keyed-fetch scrub surface and is not
needed on this unkeyed path.

Merge order vs concurrent open PRs touching collectors/alternative.py:
recommend #255 (adds _scrub_url_creds) → this → #254. This branch is
based off origin/main and will likely need a trivial rebase after #255
merges (no functional conflict — different functions).

Tests: new tests/test_alternative_revisions.py (13 tests) mirrors
test_alternative_analyst.py style — mocks yfinance via patched
sys.modules + a DataFrame eps_trend fixture; asserts contract keys/types
unchanged, revision_4w sign/Δ math (positive, negative, negative-EPS
abs() denominator), streak derivation (full/partial/broken runs),
None-safety (empty/None df, missing 0y row, NaN 30daysAgo), loud-degrade
on yfinance failure, and signature param preservation. Full suite:
1232 passed, 1 skipped (pre-existing skip, unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 18, 2026
…paid-tier; clears Phase-2 gate) (#257)

Root cause: FMP's `analyst-estimates` endpoint began returning 402 Payment
Required on the free tier (~2026-05-17 — paid-tier only), collapsing the
`eps_revision` source to ~15% populated and breaching DataPhase2's
per-source populated-ratio gate (`_REQUIRED_POPULATED_RATIOS["eps_revision"]
= 0.50`), so DataPhase2 returned ERROR on every weekly run (incl. 5/17).

Operator (Brian) decided "swap source". This swaps `_fetch_revisions` onto
yfinance `Ticker.eps_trend` — yfinance is already the integrated provider
for `target_price`/`options_flow` in this same module, so the
auth/availability/idiom precedent is established (no new pip dep).

Field chosen: `eps_trend` is itself a revision series — a DataFrame
indexed by period (0q/+1q/0y/+1y) with columns
current/7daysAgo/30daysAgo/60daysAgo/90daysAgo (consensus-mean EPS estimate
snapshots at those lookbacks). The annual `0y` row is used, mirroring the
prior FMP `period="annual"` choice:

  * current_estimate ← 0y row `current` column.
  * revision_4w      ← (current - 30daysAgo) / abs(30daysAgo) * 100 on the
                        0y row (same "now vs ~30d ago" semantic as the prior
                        FMP path, but sourced from yfinance's own 30-day-ago
                        snapshot — removing the dependency on a
                        `archive/revisions/{date}.json` snapshot that is
                        never written anywhere in this codebase, i.e. the
                        old revision path was effectively dead).
  * streak           ← count of consecutive non-negative steps in the 0y
                        estimate walking newest→oldest across
                        current→7d→30d→60d→90d (max 4). The prior FMP impl
                        hardcoded streak=0 (no multi-snapshot series to
                        derive from); this is a strictly more faithful
                        derivation of the field's intended meaning. Field
                        and key preserved — not dropped.

Contract unchanged: return dict keys/types
{current_estimate: float|None, revision_4w: float|None, streak: int} are
identical, so `_has_revision_data` and all downstream consumers are
unaffected. `bucket`/`run_date` params retained in the signature; the
read-only S3 prior-snapshot path is kept as a non-regressing fallback
(no-op in practice since no writer exists).

Sample populated ratio (real yfinance calls, worktree code,
8 representative tickers incl. recent listing ARM and never-profitable
RBLX): 8/8 = 100% — decisively clears the 0.50 gate.

yfinance exceptions carry no API credentials (no keyed querystring),
so the existing module yfinance warning idiom (_fetch_analyst's
target_price block) is mirrored; the `_scrub_url_creds` helper added by
open PR #255 consolidates the FMP/keyed-fetch scrub surface and is not
needed on this unkeyed path.

Merge order vs concurrent open PRs touching collectors/alternative.py:
recommend #255 (adds _scrub_url_creds) → this → #254. This branch is
based off origin/main and will likely need a trivial rebase after #255
merges (no functional conflict — different functions).

Tests: new tests/test_alternative_revisions.py (13 tests) mirrors
test_alternative_analyst.py style — mocks yfinance via patched
sys.modules + a DataFrame eps_trend fixture; asserts contract keys/types
unchanged, revision_4w sign/Δ math (positive, negative, negative-EPS
abs() denominator), streak derivation (full/partial/broken runs),
None-safety (empty/None df, missing 0y row, NaN 30daysAgo), loud-degrade
on yfinance failure, and signature param preservation. Full suite:
1232 passed, 1 skipped (pre-existing skip, unrelated).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 20, 2026
…y rollback chain since PR #254 (#274)

The Phase 2 Lambda CI deploy has been failing on every push since
2026-05-18 18:20Z. PR #254 (per-collector value-range validation,
merged 5/16) added top-level imports

  from validators.price_validator import (...)

to collectors/alternative.py + collectors/fundamentals.py, but did NOT
add `COPY validators/` to the Dockerfile. Every push that touched a
deploy.yml-triggering path since then built a fresh image, pushed v88,
ran the canary, hit `No module named 'validators'` at module load,
and rolled back to v87. 10 consecutive failed deploys (5/18-18:20Z
through 5/20-00:25Z) sat unnoticed because the canary correctly rolled
back so prod (v87) was unaffected — but the latent break blocked ANY
new code from ever reaching `live`. Surfaced by the Wave-3 PR3-wave-2
deploy (#273) when Brian saw the rollback in the CI log.

Fix: one-line `COPY validators/ ${LAMBDA_TASK_ROOT}/validators/` next
to the other application-code COPY directives.

Regression guard: new tests/test_dockerfile_copies_match_deployed
_imports.py with two assertions —

1. `test_dockerfile_copies_validators_for_collectors_imports` —
   explicit pin for the validators/ case so the failure surfaces as
   a clear test name if a future refactor drops the COPY.

2. `test_every_toplevel_local_import_in_lambda_code_is_dockerfile
   _copied` — generic ast-scan over every deployed Python file
   (lambda/, weekly_collector.py, polygon_client.py, collectors/,
   store/, validators/). Module-scope `from <local_pkg> import ...`
   / `import <local_pkg>` where <local_pkg> is a directory with
   __init__.py at the repo root MUST appear in the Dockerfile's
   COPY directives — or in the explicitly-non-deployed allowlist
   (`tests`, `builders`, `infrastructure`, `rag`, `features`). New
   top-level imports of un-COPY'd packages fail in CI now, not in
   the post-merge canary.

Suite 1399 → 1401 green.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 20, 2026
…279)

Phase 3a of the pillar-decomposed attractiveness scoring arc (plan doc
`alpha-engine-docs/private/attractiveness-pillars-260520.md`; ROADMAP P1
= alpha-engine-config #254 merged; lib Phase 1 = alpha-engine-lib #53
v0.22.0; research+config Phase 2 = alpha-engine-research #207 +
alpha-engine-config #255 merged).

Adds 5 new fundamental fields surfaced from the existing Finnhub
`/stock/metric?metric=all` response — no new API integrations, no
extra rate-limit pressure on the existing collector budget. The fields
back the Growth + Stewardship pillar quant composites that get added
to `scoring/factor_scoring.py` in the follow-up Phase 3b PR
(alpha-engine-research):

Growth pillar quant substrate (2):
- `revenue_growth_3y` — 3y revenue CAGR (Finnhub `revenueGrowth3Y`,
  fallback `revenueGrowth5Y`); clipped to [-0.5, 1.5]
- `eps_growth_3y` — 3y EPS CAGR (Finnhub `epsGrowth3Y`, fallbacks
  `epsBasicExclExtraItemsAnnual5Y` / `epsGrowth5Y`); clipped to [-1.0, 2.0]

Stewardship pillar quant substrate (3):
- `payout_ratio` — TTM dividends / NI (Finnhub `payoutRatioTTM` →
  `payoutRatioAnnual`); clipped to [0, 2]
- `dividend_yield` — Indicated annual yield (Finnhub
  `dividendYieldIndicatedAnnual` → `currentDividendYieldTTM`); clipped
  to [0, 0.20]
- `capex_growth_5y` — 5y CAPEX growth (Finnhub
  `capitalSpendingGrowth5Y`); clipped to [-1, 2]

3y CAGR is preferred over TTM YoY for ranking because it's smoother
(base-effect / single-quarter anomalies average out); 5y/annual
fallbacks for newer listings without a full 3y history.

Insider ownership % is NOT in this PR — Finnhub `metric=all` does not
surface it; would require a separate `/stock/insider-transactions`
integration (extra API calls + rate-limit pressure). Deferred to a
follow-up if/when stewardship's composite weight in the backtester
optimization argues for it. The three signals shipped here cover the
"capital allocation discipline" axis: payout (return-of-capital
intensity), dividend yield (combined with payout, identifies
low-yield + low-payout = buyback-heavy retainers), and CAPEX growth
(reinvestment intensity).

Plumbed end-to-end:

- `collectors/fundamentals.py`: NEUTRAL grows to 13 fields (was 8);
  `_fetch_single_ticker` extracts the 5 new Finnhub fields with
  TTM-preferred / annual-fallback chains; values clipped to the ranges
  above.
- `features/registry.py`: 5 new `FeatureEntry` records (group=
  "fundamental"), bringing the fundamental group from 8 → 13.
- `features/feature_engineer.py`: 5 new columns in
  `EXPECTED_FEATURE_COLUMNS` + DataFrame-write site populates from
  `fundamental_data` dict / falls back to 0.0 when fundamental_data
  is None (matches existing pattern).

Behavior change: when the Saturday SF DataPhase1 runs after this PR
merges + deploys, `features/{date}/fundamental.parquet` carries 13
columns instead of 8. Downstream consumers (alpha-engine-research
`scoring/factor_scoring.py` + the Phase 3b composite extension)
treat the new columns as additive — the 4 existing composites
(quality / momentum / value / low_vol) and their _n counts continue
to compute identically. No behavioral coupling with Phase 3b — the
new columns can sit in the parquet for a soak before research's
Phase 3b PR consumes them.

Tests: 12 new tests in `tests/test_fundamentals_finnhub.py` covering
NEUTRAL field completeness, field mapping for typical AAPL payload,
fallback chains for each of the 5 fields, clipping bounds (extreme
growth, extreme yield, payout > 2), empty-payload preservation.

Suite: 1400 → 1412 passing (zero regressions, 7.47s).

Composes with: factor-substrate-260513 (this PR extends its
fundamental.parquet schema); alpha-engine-research Phase 3b
(consumer composites land in a follow-up PR).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant