
Add PredictorHealthCheck to weekday pipeline#6

Merged
cipher813 merged 1 commit into main from feat/daily-health-check
Apr 8, 2026


@cipher813
Owner

Summary

  • Insert PredictorHealthCheck Lambda invoke between PredictorInference and HealthCheck in weekday Step Function
  • Non-blocking: on failure, a Catch routes execution to HealthCheck, so a failed predictor health check doesn't halt trading
  • IAM policy updated with alpha-engine-predictor-health-check* Lambda ARN
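
A minimal sketch of what the inserted state could look like, written as a Python dict that mirrors an Amazon States Language Task state. The state names PredictorInference, PredictorHealthCheck and HealthCheck come from this PR; the function name default, the `ResultPath` key and both helper names are assumptions for illustration, not the deployed definition.

```python
from typing import Any, Dict


def build_predictor_health_check_state(
    function_name: str = "alpha-engine-predictor-health-check",  # assumed name
) -> Dict[str, Any]:
    """Return a Task state that invokes the health-check Lambda.

    Both the success path (Next) and the failure path (Catch) lead to
    HealthCheck, which is what makes the step non-blocking.
    """
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
        "Catch": [
            {
                "ErrorEquals": ["States.ALL"],
                # Hypothetical key: keep the error alongside the input
                # instead of replacing the whole state input.
                "ResultPath": "$.predictor_health_check_error",
                "Next": "HealthCheck",
            }
        ],
        "Next": "HealthCheck",
    }


def is_non_blocking(state: Dict[str, Any], downstream: str) -> bool:
    """True if success and every caught failure both continue to `downstream`."""
    catchers = state.get("Catch", [])
    catches_all = any("States.ALL" in c["ErrorEquals"] for c in catchers)
    all_to_downstream = all(c["Next"] == downstream for c in catchers)
    return state.get("Next") == downstream and catches_all and all_to_downstream
```

With this shape, PredictorInference would point its `Next` at PredictorHealthCheck, and a check such as `is_non_blocking(state, "HealthCheck")` can guard the wiring in a unit test.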

Companion PR: cipher813/alpha-engine-backtester#5 (Lambda code + deploy script)

Test plan

  • Step Function already redeployed and live
  • Lambda canary passed (dry_run=true)
  • Monitor first live run tomorrow 6:05 AM PT

🤖 Generated with Claude Code

Insert daily predictor health check Lambda between PredictorInference
and HealthCheck. Non-blocking — failure continues to data health check
and executor start. IAM policy updated with new Lambda ARN.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 1ee368c into main Apr 8, 2026
1 check passed
@cipher813 cipher813 deleted the feat/daily-health-check branch April 8, 2026 02:18
cipher813 added a commit that referenced this pull request May 2, 2026
…ate preflight (#138)

Closes the prune+backfill loop that recreated 7 S&P churn-out
stragglers on every Saturday SF run. 2026-05-02 redrive #6 surfaced
the loop: pre-MorningEnrich prune (PR #134, absent_days=5) drops
stragglers ✓; Phase 1 step 8 (builders.backfill) loads ALL
predictor/price_cache/*.parquet files and writes EVERY ticker back to
ArcticDB universe — including the ones we just pruned, because their
parquet files still exist (kept for historical lookup). Loop closes;
Backtester preflight (~2 hours later) trips on the 8-day-stale rows.

## Fix 1: backfill respects current constituents

In ``builders.backfill``, load current constituents via the
``market_data/latest_weekly.json`` pointer and filter
``universe_tickers`` against it. Tickers absent from constituents
(churn-outs) get a price_cache parquet preserved (history kept) but
NO arctic row written. If a ticker comes back to S&P later, it
appears in constituents and backfill picks it up automatically.

Hard-fails on constituents-load failure (vs silently writing
everything) per feedback_no_silent_fails. Skipped in dry_run so
local smoke tests don't need S3 access.
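
The filtering rule in Fix 1 can be sketched as follows. This is an illustration under stated assumptions, not the code in ``builders.backfill``: the function name, the ``load_constituents`` callable and the exception class are hypothetical stand-ins for the real loader behind ``market_data/latest_weekly.json``.

```python
from typing import Callable, Iterable, List


class ConstituentsLoadError(RuntimeError):
    """Raised instead of silently falling back to writing every ticker."""


def select_tickers_for_arctic(
    universe_tickers: Iterable[str],
    load_constituents: Callable[[], Iterable[str]],  # hypothetical loader
    dry_run: bool = False,
) -> List[str]:
    """Return the tickers that should get an ArcticDB row.

    Parquet history is untouched either way; this only decides which
    tickers are written back to the universe.
    """
    tickers = sorted(set(universe_tickers))
    if dry_run:
        # Local smoke tests skip the filter so they need no S3 access.
        return tickers
    try:
        constituents = set(load_constituents())
    except Exception as exc:
        # Hard-fail: never recreate pruned tickers because the load failed.
        raise ConstituentsLoadError("could not load current constituents") from exc
    return [t for t in tickers if t in constituents]
```

A ticker that re-enters the index simply reappears in the constituents set, so the next backfill includes it without any special handling.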

## Fix 2: sf_preflight escalates straggler detection

``check_universe_drift`` now returns FAIL (not OK) when any straggler
is "old enough to prune" (>5 days stale). Forces operators to drop
stragglers BEFORE launching recovery SFs that skip MorningEnrich
(would otherwise burn a 120-min Backtester spot to re-discover them).
Result includes a remediation hint pointing at the prune CLI.

Validation against current state (post manual prune of 7):
  [OK]   universe_drift     1 arctic stragglers; 0 would be pruned
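
The escalation rule in Fix 2 might look roughly like the sketch below. The 5-day threshold and the message wording follow the text above; the function signature, the ``DriftResult`` type and the exact hint text are assumptions for illustration and may differ from the real ``check_universe_drift``.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Set

PRUNE_AFTER_DAYS = 5  # "old enough to prune" means more than 5 days stale


@dataclass
class DriftResult:
    status: str  # "OK" or "FAIL"
    message: str


def check_universe_drift(
    last_row_date: Dict[str, date],  # hypothetical: ticker -> newest arctic row date
    constituents: Set[str],
    today: date,
) -> DriftResult:
    """FAIL when any arctic ticker outside constituents is prunable."""
    staleness = {
        ticker: (today - newest).days
        for ticker, newest in last_row_date.items()
        if ticker not in constituents
    }
    prunable = sorted(t for t, days in staleness.items() if days > PRUNE_AFTER_DAYS)
    message = f"{len(staleness)} arctic stragglers; {len(prunable)} would be pruned"
    if prunable:
        # Remediation hint; the real CLI name is not given in this message.
        hint = "run the prune CLI before launching a recovery SF"
        return DriftResult("FAIL", f"{message}; {hint}")
    return DriftResult("OK", message)
```

A straggler that is only a few days stale still reports OK, which matches the validation line above where one straggler remains but none would be pruned.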

3 new tests in test_backfill_no_regression.py:
- backfill_skips_tickers_absent_from_constituents (the loop closure)
- backfill_hard_fails_when_constituents_load_fails (no silent
  recreate-everything fallback)
- backfill_dry_run_does_not_filter_by_constituents (CI / smoke
  doesn't need S3)

Existing test scaffolding updated to mock _load_current_constituents
across both backfill test files. 406 tests pass.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 17, 2026
…ndent; per-branch error isolation) (#251)

Research and PredictorTraining are data-independent (CLAUDE.md
Architecture: "no data flows between them"). They ran sequentially only
to "spread API load" — a now-stale rationale: predictor training
(alpha-engine-predictor/training/train_handler.py) reads ArcticDB + CPU
LightGBM and makes NO Anthropic calls (yfinance fallback removed by
predictor PR #6; train_handler yfinance docstrings are stale). Research's
only heavy load is Anthropic. They do not contend on the rate-limited API.

Restructure the sequential Research…→PredictorTraining run into an SF
Parallel state (ResearchPredictorParallel):
- Branch A: CheckSkipResearch → Research → DataPhase2 → EvalJudge chain →
  EvalRollingMean → RationaleClustering → ReplayConcordance →
  Counterfactual (everything that consumes Research output, current order,
  all CheckSkip*/quartets/fail-soft Catches intact).
- Branch B: PredictorTraining quartet + skip-gate intact.
- Join → AggregateBranchOutcomes → CheckBranchOutcomes →
  CheckSkipDriftDetection → Backtester → Parity → Evaluator (unchanged).

Per-branch error isolation (the correctness-critical requirement): SF
Parallel's default cancels siblings when one branch errors. To prevent a
strict-Research hard-fail from aborting/wasting an in-flight or
completed+S3-promoted PredictorTraining, each branch ends in a
branch-local Pass terminal (End:true) recording OK/FAILED as data — a
branch NEVER throws. The SF is failed AFTER the join (post-aggregation)
if either branch recorded FAILED, so the other branch's completed work
(incl. already-promoted predictor weights in S3) persists and the
recovery re-run's skip-set can skip whichever branch genuinely completed
(Research-fail + Predictor-done → re-run with skip_predictor_training).
Parallel-level Catch → existing shared HandleFailure (no new error
channel); Parallel Retry is a documented no-op (MaxAttempts:0) so a
completed PredictorTraining is never re-run.
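
The post-join decision described above can be illustrated with a small pure function. It is a sketch only: the outcome record shape, the branch keys and the ``skip_research`` flag name are hypothetical (only ``skip_predictor_training`` appears in this message), and the real logic lives in the AggregateBranchOutcomes and CheckBranchOutcomes states.

```python
from typing import Dict, List

# Hypothetical mapping from branch key to the skip flag a recovery re-run
# would set. Only skip_predictor_training is named in the commit message.
SKIP_FLAGS: Dict[str, str] = {
    "research": "skip_research",
    "predictor_training": "skip_predictor_training",
}


def aggregate_branch_outcomes(outcomes: List[Dict[str, str]]) -> Dict[str, object]:
    """Combine per-branch OK/FAILED records produced by the Pass terminals.

    Branches never throw; they report status as data. The state machine is
    failed only after the join, and only if some branch recorded FAILED.
    """
    failed = [o["branch"] for o in outcomes if o["status"] != "OK"]
    completed = [o["branch"] for o in outcomes if o["status"] == "OK"]
    return {
        "should_fail": bool(failed),
        "failed_branches": failed,
        # A recovery re-run can skip whichever branch genuinely completed.
        "rerun_skip_flags": [SKIP_FLAGS[b] for b in completed] if failed else [],
    }
```

For the Research-fail plus Predictor-done case, the result asks for a failure after the join and suggests ``skip_predictor_training`` for the re-run, so the promoted weights are not retrained.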

Inbound edges (RegimeRetrospectiveEval Next+Catch,
CheckSkipRegimeRetrospectiveEval skip choice) re-pointed to
ResearchPredictorParallel.

Tests: new tests/test_sf_research_predictor_parallel_wiring.py (72 tests:
sibling branches; Branch-A/B contents; per-branch isolation incl. no
in-branch escape to HandleFailure; post-join fail-if-either-FAILED;
ec2_instance_id reaches Branch B; Backtester after join; no dangling
targets anywhere). Updated test_sf_eval_judge_wiring.py (flattened state
fixture + old cross-boundary edge assertions retargeted to BranchAComplete)
and test_sf_regime_substrate_wiring.py (inbound edge → Parallel). Full
suite green: 1207 passed, 1 skipped (pre-existing pandas FutureWarnings
in daily_append.py, unrelated).

DEPLOY HELD — prod SF-topology change; do not merge/redeploy/trigger
until the user directs. CLAUDE.md:100 "spread API load" rationale is
stale and must be corrected on merge (flagged, not edited — that file is
outside this repo).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>