Add PredictorHealthCheck to weekday pipeline #6
Merged
Insert daily predictor health check Lambda between PredictorInference and HealthCheck. Non-blocking: on failure, execution continues to the data health check and executor start. IAM policy updated with the new Lambda ARN.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
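A minimal sketch of how the non-blocking step could look in Amazon States Language, written as Python dicts so the wiring can be checked in a unit test. Only the state names come from this PR; the function name `alpha-engine-predictor-health-check`, the `ResultPath` keys and the `insert_after` helper are assumptions for illustration, not the PR's actual definition.

```python
import json

# Hypothetical ASL Task state for the new step. Only the state names come
# from the PR; the function name and ResultPath keys are assumptions.
PREDICTOR_HEALTH_CHECK_STATE = {
    "Type": "Task",
    # Optimized Step Functions integration for invoking a Lambda.
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
        "FunctionName": "alpha-engine-predictor-health-check",  # assumed
        "Payload.$": "$",
    },
    # Keep the original state input; store the Lambda result beside it.
    "ResultPath": "$.predictor_health",
    "Next": "HealthCheck",
    # Non-blocking: any error is recorded and the run moves on to
    # HealthCheck instead of failing the execution.
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.predictor_health_error",
            "Next": "HealthCheck",
        }
    ],
}


def insert_after(states: dict, prev: str, name: str, state: dict) -> dict:
    """Return a copy of `states` with `name` spliced in after `prev`.

    Hypothetical helper: requires `prev` to currently transition to the
    state the new step will hand off to (the new state's own "Next").
    """
    if states[prev].get("Next") != state["Next"]:
        raise ValueError(f"{prev} does not currently transition to {state['Next']}")
    updated = dict(states)
    updated[prev] = {**states[prev], "Next": name}
    updated[name] = state
    return updated


# Stub neighbours standing in for the real weekday states.
states = {
    "PredictorInference": {"Type": "Pass", "Next": "HealthCheck"},
    "HealthCheck": {"Type": "Pass", "End": True},
}
states = insert_after(
    states, "PredictorInference", "PredictorHealthCheck", PREDICTOR_HEALTH_CHECK_STATE
)
print(json.dumps(states["PredictorHealthCheck"]["Catch"], indent=2))
```

Routing both `Next` and the catch-all `Catch` to `HealthCheck` is what makes such a step non-blocking: success and failure converge on the same downstream state.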
cipher813 added a commit that referenced this pull request on May 2, 2026
…ate preflight (#138)

Closes the prune+backfill loop that recreated 7 S&P churn-out stragglers on every Saturday SF run.

2026-05-02 redrive #6 surfaced the loop: the pre-MorningEnrich prune (PR #134, absent_days=5) drops stragglers ✓; Phase 1 step 8 (builders.backfill) then loads ALL predictor/price_cache/*.parquet files and writes EVERY ticker back to the ArcticDB universe, including the ones just pruned, because their parquet files still exist (kept for historical lookup). The loop closes, and the Backtester preflight (~2 hours later) trips on the 8-day-stale rows.

## Fix 1: backfill respects current constituents

In ``builders.backfill``, load current constituents via the ``market_data/latest_weekly.json`` pointer and filter ``universe_tickers`` against it. Tickers absent from constituents (churn-outs) keep their price_cache parquet (history preserved) but get NO arctic row written. If a ticker returns to the S&P later, it appears in constituents and backfill picks it up automatically. Hard-fails on constituents-load failure (instead of silently writing everything), per feedback_no_silent_fails. The filter is skipped in dry_run so local smoke tests don't need S3 access.

## Fix 2: sf_preflight escalates straggler detection

``check_universe_drift`` now returns FAIL (not OK) when any straggler is "old enough to prune" (>5 days stale). This forces operators to drop stragglers BEFORE launching recovery SFs that skip MorningEnrich, which would otherwise burn a 120-min Backtester spot re-discovering them. The result includes a remediation hint pointing at the prune CLI.
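Fix 2's escalation rule can be sketched as a small pure function. This assumes a straggler is an ArcticDB ticker missing from the current constituents and measures staleness from the date of its last row; both definitions, the `CheckResult` type and the tickers `AAPL`/`OLD1` are hypothetical, not the real `check_universe_drift` signature. Only the >5-day threshold and the message shape come from this commit.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date

# Same threshold as the pre-MorningEnrich prune (absent_days=5).
PRUNE_ABSENT_DAYS = 5


@dataclass
class CheckResult:
    name: str
    status: str  # "OK" or "FAIL"
    message: str
    remediation: str | None = None


def check_universe_drift(
    last_seen: dict[str, date], constituents: set[str], today: date
) -> CheckResult:
    """Hypothetical preflight: FAIL if any straggler is old enough to prune.

    `last_seen` maps each ArcticDB universe ticker to the date of its most
    recent row (an assumed input shape).
    """
    # Stragglers: tickers still in ArcticDB but no longer constituents.
    stragglers = {t: d for t, d in last_seen.items() if t not in constituents}
    prunable = sorted(
        t for t, d in stragglers.items() if (today - d).days > PRUNE_ABSENT_DAYS
    )
    message = f"{len(stragglers)} arctic stragglers; {len(prunable)} would be pruned"
    if prunable:
        return CheckResult(
            "universe_drift",
            "FAIL",
            message,
            # Placeholder hint; the real prune CLI invocation is not shown in the PR.
            remediation=f"run the prune CLI before launching a recovery SF ({', '.join(prunable)})",
        )
    return CheckResult("universe_drift", "OK", message)


today = date(2026, 5, 2)
constituents = {"AAPL"}
# OLD1 last seen 8 days ago: prunable, so the check escalates to FAIL.
failing = check_universe_drift(
    {"AAPL": date(2026, 5, 1), "OLD1": date(2026, 4, 24)}, constituents, today
)
# OLD1 last seen 3 days ago: still a straggler, but not yet prunable.
passing = check_universe_drift(
    {"AAPL": date(2026, 5, 1), "OLD1": date(2026, 4, 29)}, constituents, today
)
print(f"[{failing.status}] {failing.name} {failing.message}")
print(f"[{passing.status}] {passing.name} {passing.message}")
```

Failing only on the prunable subset keeps a freshly churned-out ticker from blocking a launch while still catching rows that would trip the Backtester preflight later.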
Validation against current state (post manual prune of 7):

[OK] universe_drift 1 arctic stragglers; 0 would be pruned

3 new tests in test_backfill_no_regression.py:
- backfill_skips_tickers_absent_from_constituents (the loop closure)
- backfill_hard_fails_when_constituents_load_fails (no silent recreate-everything fallback)
- backfill_dry_run_does_not_filter_by_constituents (CI / smoke doesn't need S3)

Existing test scaffolding updated to mock _load_current_constituents across both backfill test files. 406 tests pass.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
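A hedged sketch of Fix 1's filtering step, written so the three behaviors covered by the new tests (skip churn-outs, hard-fail on load failure, no filtering in dry_run) can be exercised without S3. `select_arctic_tickers`, `ConstituentsLoadError` and the callable loader are hypothetical stand-ins, not the real `builders.backfill` or `_load_current_constituents` signatures.

```python
from __future__ import annotations

from typing import Callable, Iterable


class ConstituentsLoadError(RuntimeError):
    """Raised instead of silently writing every parquet ticker to ArcticDB."""


def select_arctic_tickers(
    universe_tickers: Iterable[str],
    load_constituents: Callable[[], set[str]],
    dry_run: bool = False,
) -> tuple[list[str], list[str]]:
    """Split backfill tickers into (write_to_arctic, parquet_only).

    Hypothetical helper. `load_constituents` stands in for reading the
    market_data/latest_weekly.json pointer; its signature is assumed.
    """
    tickers = sorted(set(universe_tickers))
    if dry_run:
        # Dry runs skip the S3 read entirely, so nothing is filtered.
        return tickers, []
    try:
        constituents = load_constituents()
    except Exception as exc:
        # Hard-fail: no recreate-everything fallback.
        raise ConstituentsLoadError("cannot load current constituents") from exc
    # Churn-outs keep their price_cache parquet but get no arctic row.
    keep = [t for t in tickers if t in constituents]
    parquet_only = [t for t in tickers if t not in constituents]
    return keep, parquet_only


def _s3_unavailable() -> set[str]:
    raise OSError("simulated S3 failure")


# Mirrors the three new tests, in plain-assert form.
keep, parquet_only = select_arctic_tickers(["MSFT", "AAPL", "OLD1"], lambda: {"AAPL", "MSFT"})
assert keep == ["AAPL", "MSFT"] and parquet_only == ["OLD1"]

try:
    select_arctic_tickers(["AAPL"], _s3_unavailable)
    raise AssertionError("expected ConstituentsLoadError")
except ConstituentsLoadError:
    pass

# In dry_run the loader is never called, so an S3 outage does not matter.
assert select_arctic_tickers(["OLD1", "AAPL"], _s3_unavailable, dry_run=True) == (["AAPL", "OLD1"], [])
```

Passing the loader as a callable is one way to get the mockability the commit describes: tests inject a stub instead of patching S3 access.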
cipher813 added a commit that referenced this pull request on May 17, 2026
…ndent; per-branch error isolation) (#251)

Research and PredictorTraining are data-independent (CLAUDE.md Architecture: "no data flows between them"). They ran sequentially only to "spread API load", a now-stale rationale: predictor training (alpha-engine-predictor/training/train_handler.py) reads ArcticDB + CPU LightGBM and makes NO Anthropic calls (yfinance fallback removed by predictor PR #6; the train_handler yfinance docstrings are stale). Research's only heavy load is Anthropic, so the two do not contend on the rate-limited API.

Restructure the sequential Research…→PredictorTraining run into an SF Parallel state (ResearchPredictorParallel):
- Branch A: CheckSkipResearch → Research → DataPhase2 → EvalJudge chain → EvalRollingMean → RationaleClustering → ReplayConcordance → Counterfactual (everything that consumes Research output, in the current order, with all CheckSkip*/quartets/fail-soft Catches intact).
- Branch B: PredictorTraining quartet + skip-gate intact.
- Join → AggregateBranchOutcomes → CheckBranchOutcomes → CheckSkipDriftDetection → Backtester → Parity → Evaluator (unchanged).

Per-branch error isolation (the correctness-critical requirement): by default, SF Parallel cancels sibling branches when one branch errors. To prevent a strict-Research hard-fail from aborting or wasting an in-flight or completed+S3-promoted PredictorTraining, each branch ends in a branch-local Pass terminal (End:true) that records OK/FAILED as data; a branch NEVER throws. The SF is failed AFTER the join (post-aggregation) if either branch recorded FAILED, so the other branch's completed work (incl. already-promoted predictor weights in S3) persists, and the recovery re-run's skip-set can skip whichever branch genuinely completed (Research-fail + Predictor-done → re-run with skip_predictor_training). Parallel-level Catch → existing shared HandleFailure (no new error channel); Parallel Retry is a documented no-op (MaxAttempts:0) so a completed PredictorTraining is never re-run.
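The post-join contract described above can be sketched in plain Python to show why recording outcomes as data is enough to both fail the run and build the recovery skip-set. In the SF this presumably lives in the AggregateBranchOutcomes and CheckBranchOutcomes states; the payload shape, the `skip_research` flag and the `FailAfterJoin` state name are invented for illustration, and only `skip_predictor_training` is named in the commit.

```python
from __future__ import annotations

# Hypothetical post-join logic. A Parallel state's output is an array with
# one entry per branch, in branch order; each branch's terminal Pass state
# is assumed to emit {"branch": <name>, "status": "OK" | "FAILED"}.
SKIP_FLAG_BY_BRANCH = {
    "research": "skip_research",  # hypothetical flag name
    "predictor_training": "skip_predictor_training",  # flag named in the PR
}


def aggregate_branch_outcomes(branch_outputs: list[dict]) -> dict:
    """Summarize branch results without raising (branches never throw)."""
    failed = [o["branch"] for o in branch_outputs if o["status"] == "FAILED"]
    completed = [o["branch"] for o in branch_outputs if o["status"] == "OK"]
    return {
        "failed_branches": failed,
        "completed_branches": completed,
        # A recovery re-run can skip whichever branch genuinely completed.
        "recovery_skip_flags": [SKIP_FLAG_BY_BRANCH[b] for b in completed],
    }


def check_branch_outcomes(aggregate: dict) -> str:
    """Pick the next state: fail only after the join, if either branch FAILED."""
    if aggregate["failed_branches"]:
        return "FailAfterJoin"  # hypothetical failure state name
    return "CheckSkipDriftDetection"


# Research hard-failed while PredictorTraining finished and promoted weights.
outputs = [
    {"branch": "research", "status": "FAILED"},
    {"branch": "predictor_training", "status": "OK"},
]
agg = aggregate_branch_outcomes(outputs)
print(agg["recovery_skip_flags"], check_branch_outcomes(agg))
```

Because neither branch raises, the Parallel state itself always succeeds and both outputs reach the join; the decision to fail is deferred until the completed branch's work is already safe.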
Inbound edges (RegimeRetrospectiveEval Next+Catch, CheckSkipRegimeRetrospectiveEval skip choice) re-pointed to ResearchPredictorParallel.

Tests: new tests/test_sf_research_predictor_parallel_wiring.py (72 tests: sibling branches; Branch-A/B contents; per-branch isolation incl. no in-branch escape to HandleFailure; post-join fail-if-either-FAILED; ec2_instance_id reaches Branch B; Backtester after join; no dangling targets anywhere). Updated test_sf_eval_judge_wiring.py (flattened state fixture + old cross-boundary edge assertions retargeted to BranchAComplete) and test_sf_regime_substrate_wiring.py (inbound edge → Parallel). Full suite green: 1207 passed, 1 skipped (pre-existing pandas FutureWarnings in daily_append.py, unrelated).

DEPLOY HELD: prod SF-topology change; do not merge/redeploy/trigger until the user directs. The CLAUDE.md:100 "spread API load" rationale is stale and must be corrected on merge (flagged, not edited; that file is outside this repo).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
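The "no dangling targets anywhere" and "no in-branch escape to HandleFailure" checks could plausibly share one recursive walker, because ASL does not let a state inside a Parallel branch transition to a state outside that branch. A sketch under that assumption, over a dict-shaped definition; `find_dangling_targets` and the toy topology are hypothetical, not the contents of tests/test_sf_research_predictor_parallel_wiring.py.

```python
from __future__ import annotations


def _targets(state: dict) -> list[str]:
    """Every transition target one ASL state can take."""
    targets = [state[k] for k in ("Next", "Default") if k in state]
    targets += [rule["Next"] for rule in state.get("Choices", []) if "Next" in rule]
    targets += [catcher["Next"] for catcher in state.get("Catch", [])]
    return targets


def find_dangling_targets(machine: dict, scope: str = "root") -> list[str]:
    """List transitions whose target is not a state in the same scope.

    ASL only lets a state transition within its own (sub)machine, so each
    Parallel branch is checked as a separate scope. Map states are ignored
    in this sketch.
    """
    states = machine["States"]
    problems = []
    if machine["StartAt"] not in states:
        problems.append(f"{scope}:StartAt->{machine['StartAt']}")
    for name, state in states.items():
        problems += [f"{scope}:{name}->{t}" for t in _targets(state) if t not in states]
        for i, branch in enumerate(state.get("Branches", [])):
            problems += find_dangling_targets(branch, f"{scope}/{name}[{i}]")
    return problems


LAMBDA = "arn:aws:states:::lambda:invoke"  # placeholder resource
CATCH_ALL = [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}]

# Toy topology: Branch A wrongly escapes to the top-level HandleFailure.
machine = {
    "StartAt": "ResearchPredictorParallel",
    "States": {
        "ResearchPredictorParallel": {
            "Type": "Parallel",
            "Next": "Backtester",
            "Catch": CATCH_ALL,  # allowed: Parallel-level Catch is top-level
            "Branches": [
                {
                    "StartAt": "Research",
                    "States": {
                        "Research": {
                            "Type": "Task",
                            "Resource": LAMBDA,
                            "Next": "BranchAComplete",
                            "Catch": CATCH_ALL,  # not allowed: leaves the branch
                        },
                        "BranchAComplete": {"Type": "Pass", "End": True},
                    },
                },
                {
                    "StartAt": "PredictorTraining",
                    "States": {
                        "PredictorTraining": {"Type": "Task", "Resource": LAMBDA, "Next": "BranchBComplete"},
                        "BranchBComplete": {"Type": "Pass", "End": True},
                    },
                },
            ],
        },
        "Backtester": {"Type": "Pass", "End": True},
        "HandleFailure": {"Type": "Fail"},
    },
}

print(find_dangling_targets(machine))
```

Checking each branch as its own scope means the same rule catches both an outright typo in a state name and a branch state that tries to route its errors to a top-level handler.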
Summary
- `PredictorHealthCheck` Lambda invoke between `PredictorInference` and `HealthCheck` in the weekday Step Function
- IAM policy updated with the `alpha-engine-predictor-health-check*` Lambda ARN
- Companion PR: cipher813/alpha-engine-backtester#5 (Lambda code + deploy script)
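A hedged sketch of the IAM change, assuming the Step Functions execution role receives a `lambda:InvokeFunction` statement scoped by the `alpha-engine-predictor-health-check*` pattern from the summary. The region, account ID and `resource_allows` helper are placeholders, and `fnmatch` only approximates IAM wildcard matching; it is not the IAM policy evaluator.

```python
import fnmatch
import json

REGION = "us-east-1"  # placeholder
ACCOUNT_ID = "123456789012"  # placeholder
FUNCTION_PREFIX = "alpha-engine-predictor-health-check"  # pattern from the PR summary

# Hypothetical statement granting invoke permission on the new Lambda. The
# trailing "*" matches the bare function ARN and any ":version" or ":alias"
# qualified form.
INVOKE_STATEMENT = {
    "Effect": "Allow",
    "Action": "lambda:InvokeFunction",
    "Resource": f"arn:aws:lambda:{REGION}:{ACCOUNT_ID}:function:{FUNCTION_PREFIX}*",
}
POLICY = {"Version": "2012-10-17", "Statement": [INVOKE_STATEMENT]}


def resource_allows(statement: dict, arn: str) -> bool:
    """Rough local approximation of IAM '*' matching against one ARN."""
    return fnmatch.fnmatchcase(arn, statement["Resource"])


base_arn = f"arn:aws:lambda:{REGION}:{ACCOUNT_ID}:function:{FUNCTION_PREFIX}"
print(json.dumps(POLICY, indent=2))
print(resource_allows(INVOKE_STATEMENT, base_arn))
print(resource_allows(INVOKE_STATEMENT, base_arn + ":$LATEST"))
```

The trade-off of a trailing wildcard without a colon is that it also matches any other function whose name starts with the same prefix; a tighter alternative would list the bare ARN and the `:*` qualified form separately.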
Test plan
🤖 Generated with Claude Code