feat(sf): split DataPhase1 → MorningEnrich + DataPhase1(phase1) — preflight task split P0#249
Merged
…reflight task split P0

Standing rule (preflight-task-split-260516.md): every preflight-bearing action is its own SF task; a downstream failure must never re-run a completed upstream task. Accept the extra spot-launch cost.

Origin: 2026-05-16 Saturday SF DataPhase1 ran spot_data_weekly.sh --data-only = morning-enrich (~28 min) THEN phase1 on one spot, with phase1's preflight buried 28 minutes behind a completed morning-enrich. Every phase1 recovery re-paid the 28-min morning-enrich. A fast-fail that fires 28 minutes deep is not a fast-fail.

Changes:
- spot_data_weekly.sh: add --morning-enrich-only / --phase1-only run modes (RUN_MODE morning-enrich-only / phase1-only). morning-enrich and phase1+prune are now independently gated by DO_MORNING_ENRICH / DO_PHASE1 derived from RUN_MODE. --data-only preserved (runs both) for manual/adhoc backward-compat. Per-mode MODE_LABEL feeds the spot-side S3 log key + heartbeat dimension so a morning-enrich-only run is not mislabeled data-phase1. Shared scaffolding (log capture, S3 EXIT-trap upload, watchdog, heartbeat) works for all three modes.
- preflight.py: dedicated "morning_enrich" mode whose checks are the UNION of what _run_morning_enrich needs (AWS_REGION env, polygon + FRED secret presence + reachability probes, S3 bucket + writeable sentinel, ArcticDB libraries present). Deliberately NO ArcticDB-freshness check -- morning-enrich is part of what makes it fresh. weekly_collector.main() now maps --morning-enrich -> "morning_enrich" (was the dependency-blind "daily" which skipped polygon/FRED probes).
- step_function.json: new MorningEnrich quartet (CheckSkipMorningEnrich / MorningEnrich / WaitForMorningEnrich / CheckMorningEnrichStatus + MorningEnrichWait + ExtractMorningEnrichError) inserted BEFORE DataPhase1, mirroring the RAGIngestion/DataPhase1 quartets exactly (same Retry/Catch/Heartbeat/Timeout/HandleFailure wiring + a skip_morning_enrich Choice). MorningEnrich runs --morning-enrich-only; DataPhase1 switched --data-only -> --phase1-only. Chain: InitializeInput -> CheckSkipMorningEnrich -> MorningEnrich -> ... -> CheckSkipDataPhase1 -> DataPhase1 -> (existing next, unchanged). All downstream states untouched.
- tests: +44 tests across test_sf_morning_enrich_split_wiring.py, test_spot_data_weekly_run_modes.py, test_weekly_collector_preflight_mode_mapping.py, and extended test_preflight.py (morning_enrich mode: probes polygon+FRED, no arcticdb-freshness, fail-fast on missing secret).

Full suite: 1094 passed, 1 skipped (clean-main baseline ~1050; +44 new). bash -n + SF JSON parse validated.

DEPLOY HELD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
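
The run-mode gating described in the commit message above can be modelled in a few lines. This is only a sketch in Python of logic that actually lives in the bash script spot_data_weekly.sh, whose source is not shown here. The names `RunPlan` and `resolve_run_mode` are hypothetical, and the `mode_label` strings are assumptions (the source only says that a morning-enrich-only run must not be labeled data-phase1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunPlan:
    run_mode: str            # value of RUN_MODE
    do_morning_enrich: bool  # models DO_MORNING_ENRICH
    do_phase1: bool          # models DO_PHASE1 (phase1 + prune)
    mode_label: str          # models MODE_LABEL (label strings are assumed)


# Assumed flag -> plan table; the three flags come from the PR text,
# the label values are illustrative only.
_MODES = {
    "--data-only": RunPlan("data-only", True, True, "data-phase1"),
    "--morning-enrich-only": RunPlan("morning-enrich-only", True, False, "morning-enrich"),
    "--phase1-only": RunPlan("phase1-only", False, True, "data-phase1"),
}


def resolve_run_mode(flag: str) -> RunPlan:
    """Map a CLI flag to its independent DO_* gates, rejecting unknown flags."""
    try:
        return _MODES[flag]
    except KeyError:
        raise ValueError(f"unknown run-mode flag: {flag!r}") from None


enrich = resolve_run_mode("--morning-enrich-only")
phase1 = resolve_run_mode("--phase1-only")
both = resolve_run_mode("--data-only")
```

The point of the split is visible in the table: a phase1 retry resolves to a plan with `do_morning_enrich` false, so it cannot re-pay the morning-enrich step.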
cipher813 added a commit that referenced this pull request on May 18, 2026
…e; rule shipped disabled) (#258)

Foundational spine of ROADMAP "Scheduled Friday-PM 'shell run'" (P1, added 2026-05-16): the *prevention* half of Saturday-SF reliability (the *containment* half, preflight-task-split, shipped 2026-05-16 in data #249/#250). Surfaces a Saturday-fatal bootstrap break ~11.5h before the unattended Sat 02:00 PT firing, inside an operator-awake Friday-evening fix window, instead of as a Saturday-morning-after lost-week incident.

STRICT SUPERSET: shell_run absent/false ⇒ byte-identical to today's real Saturday run. Only two existing edges change, each routed through a new Choice whose Default is the pre-spine target:
- InitializeInput.Next: CheckSkipMorningEnrich -> CheckShellRun (Default -> CheckSkipMorningEnrich; unchanged for the real run)
- WaitForWeeklySubstrateHealthCheck.Next: NotifyComplete -> CheckShellRunNotify (Default -> NotifyComplete; the real Saturday SUCCESS email is untouched)

shell_run propagation (mirrors the existing skip_*/JsonMerge precedent exactly; no new mechanism invented):
- CheckShellRun (Choice): {"shell_run": true} -> ApplyShellRunDefaults
- ApplyShellRunDefaults (Pass): States.JsonMerge(<all 16 skip_*=true>, $, false) layers every skip flag = true UNDER the execution input so an explicit per-flag override still wins (e.g. {"shell_run":true,"skip_research":false} still runs Research).

Every workload state already has a Choice-gated skip_*, so the whole workload no-ops via the EXISTING skip mechanism.
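
The "defaults UNDER the execution input" layering can be illustrated with a plain dictionary merge. This is a sketch, not the Step Functions definition itself: it assumes that a shallow States.JsonMerge(a, b, false) lets keys in the second argument win, which Python's `{**a, **b}` reproduces. The function name `apply_shell_run_defaults` is hypothetical, and only three of the 16 skip flags are listed because only those names appear in the text.

```python
def apply_shell_run_defaults(execution_input: dict, skip_flags: list) -> dict:
    """Layer skip_*=True defaults under the execution input (shallow merge).

    Keys present in execution_input override the defaults, mirroring the
    second-argument-wins behaviour assumed for States.JsonMerge(..., false).
    """
    defaults = {flag: True for flag in skip_flags}
    return {**defaults, **execution_input}


# Subset of the 16 flags; the remaining names are not given in the source.
SKIP_FLAGS = ["skip_morning_enrich", "skip_data_phase1", "skip_research"]

merged = apply_shell_run_defaults(
    {"shell_run": True, "skip_research": False},
    SKIP_FLAGS,
)
```

With this input, `merged` keeps `skip_research` false (so Research would still run) while every flag the operator did not mention falls back to true.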
Per-state dry-vs-skip inventory under shell_run (spine = pure-skip; per-module --preflight-only/--dry-run "spots boot + smoke" are SCOPED FOLLOW-ONS):
- SKIPPED via existing skip_* gate (16): MorningEnrich, DataPhase1, RAGIngestion, RegimeSubstrate, RegimeRetrospectiveEval, Research, DataPhase2, EvalJudge(+RollingMean), RationaleClustering, ReplayConcordance, Counterfactual, PredictorTraining, DriftDetection, Backtester, Parity, Evaluator
- STILL RUNS (read-only, no skip gate by design; exactly the bootstrap/transport smoke the shell run wants Friday PM): SaturdayHealthCheck, WeeklySubstrateHealthCheck. Their shell_run-aware missing-Friday-bar tolerance is ROADMAP owed-work item 5 (scoped follow-on).
- NOTIFY: NotifyShellRunComplete (shell-run-tagged Subject, reuses the exact NotifyComplete SNS substrate: alpha-engine-alerts topic, same Resource).

Friday EventBridge rule (CFN, the documented infra-as-code home for EventBridge rules; SaturdayTrigger/WeekdayTrigger live there): FridayShellRunTrigger, cron(30 21 ? * FRI *) = 21:30 UTC Fri = 14:30 PT (PDT, dominant season) / 13:30 PT (PST). Chosen AFTER the Friday EOD SF (~1:25 PT) so it never collides with PostMarketData/EODReconcile/StopTradingInstance on the trading instance, and ~11.5h BEFORE the real Sat 09:00 UTC firing. Targets the SAME alpha-engine-saturday-pipeline SF (NOT a parallel SF) with {"shell_run": true}, same EventBridgeSfnRoleArn; the existing states:StartExecution grant is SF-ARN-scoped so NO IAM change is needed.

SHIPPED State: DISABLED (zero-risk merge). Additive observability, NOT a backstop (the "fail loud, no backstop" design decision stands). Operator enable step:
    aws events enable-rule --name alpha-engine-friday-shell-run --region us-east-1

Consolidated-notify decision: shell-run SUCCESS is delivered by reusing the existing NotifyComplete SNS pattern with a SHELL RUN-tagged Subject (zero new infra).
A shell-run FAILURE reuses the unchanged HandleFailure (its 20 inbound error edges deliberately NOT re-pointed: high churn, zero added operator value, and would perturb the real Saturday failure path's risk surface; the FAILED alert's Friday execution timestamp/ID is the actionable signal). The richer per-state pass/fail report (ROADMAP design point 5) is a scoped follow-on.

Scoped per-module follow-on PRs (repo -> state -> dry mode needed; NOT done here, these convert "skipped" to "spots boot + smoke"):
- alpha-engine-data -> DataPhase1/MorningEnrich -> spot_data_weekly.sh --preflight-only (preflight + universe-freshness scan, no polygon/FMP writes); shell_run-aware tolerance for "Friday bar not yet present"
- alpha-engine-data -> RAGIngestion -> spot_data_weekly.sh --rag-only --preflight-only (corpus reachability + secrets, no SEC/embedding writes)
- alpha-engine-predictor -> PredictorTraining -> spot_train.sh --preflight-only (load + WF-gate-shape check, NO predictor/weights/promotion)
- alpha-engine-backtester -> Backtester/Parity/Evaluator -> spot_backtest.sh --mode=smoke + simulate-dry, NO config/*.json auto-apply (freeze_evaluator pattern is the model)
- alpha-engine-data -> SaturdayHealthCheck/WeeklySubstrateHealthCheck -> shell_run-aware missing-Friday-bar tolerance (ROADMAP owed-work item 5)
(Research/predictor-inference/executor already have --dry-run/--simulate; wiring those into the SF states is part of the per-state follow-ons above.)

Tests: tests/test_sf_friday_shell_run_wiring.py (23 cases: strict-superset edges, JsonMerge user-input-wins order, every skip-gate covered by the defaults blob, full happy-path traversal for shell_run true vs absent, Friday rule DISABLED + same-SF + shell_run=true + cron). Updated two pre-spine wiring tests (morning_enrich_split, substrate_check) to assert through the new gates while pinning Default == pre-spine target.

Full suite: 1242 passed, 1 skipped (pre-existing, unrelated). No new pip deps. No secrets.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
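
The timing arithmetic quoted for the Friday rule (21:30 UTC Friday, 14:30 or 13:30 Pacific, ~11.5h ahead of the Sat 09:00 UTC firing) can be checked with the standard library. This is a sketch under stated assumptions: the calendar dates are arbitrary example dates chosen here, and fixed UTC-7 / UTC-8 offsets stand in for PDT / PST rather than a real time-zone database lookup.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets used as stand-ins for Pacific daylight / standard time.
PDT = timezone(timedelta(hours=-7), "PDT")
PST = timezone(timedelta(hours=-8), "PST")

# Example dates (assumption): 2026-05-22 is a Friday, 2026-05-23 the Saturday.
friday_fire = datetime(2026, 5, 22, 21, 30, tzinfo=timezone.utc)   # cron(30 21 ? * FRI *)
saturday_fire = datetime(2026, 5, 23, 9, 0, tzinfo=timezone.utc)   # real Saturday firing

# Lead time between the shell run and the real run, in hours.
lead_hours = (saturday_fire - friday_fire).total_seconds() / 3600

# Local wall-clock time of the Friday firing under each offset.
in_pdt = friday_fire.astimezone(PDT)
in_pst = friday_fire.astimezone(PST)
```

Under either offset the firing stays on Friday afternoon local time, which is what keeps the fix window inside operator-awake hours.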
Standing rule + origin
Per the plan doc alpha-engine-docs/private/preflight-task-split-260516.md (§3–4, authoritative): every preflight-bearing action is its own SF task; a downstream failure must never re-run a completed upstream task. The extra spot-launch cost (~6–10 min, ~$0.005 per split) is explicitly weighed and accepted over launch economy.

Origin: 2026-05-16 Saturday SF DataPhase1 ran spot_data_weekly.sh --data-only = morning-enrich (~28 min) then phase1 on one spot, with phase1's preflight buried 28 minutes behind a completed morning-enrich. Every phase1 recovery re-paid the 28-min morning-enrich. A fast-fail that fires 28 minutes deep is not a fast-fail. RAGIngestion is the canonical split precedent this PR mirrors.

Changes
1. infrastructure/spot_data_weekly.sh: added --morning-enrich-only / --phase1-only run modes (RUN_MODE morning-enrich-only / phase1-only). morning-enrich and phase1+prune are independently gated by DO_MORNING_ENRICH / DO_PHASE1 derived from RUN_MODE. --data-only preserved (runs both) for manual/adhoc backward-compat. Per-mode MODE_LABEL drives the spot-side S3 log key (health/morning_enrich_log/… vs health/data_phase1_log/…) and the heartbeat dimension so a morning-enrich-only run is not mislabeled data-phase1. Shared scaffolding (log capture, S3 EXIT-trap upload, watchdog, heartbeat) works for all three modes.
2. preflight.py + weekly_collector.py: new dedicated morning_enrich DataPreflight mode whose checks are the UNION of what _run_morning_enrich actually needs: AWS_REGION env, polygon + FRED secret presence (_check_secrets), polygon + FRED reachability probes, S3 bucket + writeable sentinel, ArcticDB libraries present. Deliberately NO check_arcticdb_fresh: morning-enrich is part of what makes ArcticDB fresh, so a freshness gate at its own entry would be circular. weekly_collector.main() now maps --morning-enrich → "morning_enrich" (was the dependency-blind "daily", which never probed polygon/FRED, so a drifted key failed 28 min into the spot run). Mode whitelist + docstring updated.
3. infrastructure/step_function.json: new MorningEnrich quartet (CheckSkipMorningEnrich / MorningEnrich / WaitForMorningEnrich / CheckMorningEnrichStatus, plus MorningEnrichWait + ExtractMorningEnrichError) inserted before DataPhase1, mirroring the RAGIngestion / DataPhase1 quartets exactly: same Retry (States.TaskFailed, MaxAttempts 1), same States.ALL → HandleFailure Catch with ResultPath $.error, same HeartbeatSeconds/TimeoutSeconds (5400/5460), same skip-input Choice shape (skip_morning_enrich, the analogue of skip_data_phase1). MorningEnrich runs --morning-enrich-only; DataPhase1 switched --data-only → --phase1-only. Chain: InitializeInput → CheckSkipMorningEnrich → MorningEnrich → WaitForMorningEnrich → CheckMorningEnrichStatus (success) → CheckSkipDataPhase1 → DataPhase1 → (existing next, unchanged). Every existing downstream state untouched.
4. Tests: +44 tests:
   - test_sf_morning_enrich_split_wiring.py: quartet presence, happy-path reachability (MorningEnrich strictly before DataPhase1), --morning-enrich-only / --phase1-only command shapes, HandleFailure Catch, pipefail + S3-log-trap invariants, ResultPath isolation.
   - test_spot_data_weekly_run_modes.py: flag→RUN_MODE parsing, independent DO_* gating, SKIP_RAG_BLOCK, per-mode MODE_LABEL + heartbeat (grep-style, mirrors test_spot_env_source_aws_region.py).
   - test_weekly_collector_preflight_mode_mapping.py: pins --morning-enrich → "morning_enrich" (not "daily").
   - test_preflight.py: extended with TestMorningEnrichMode (probes polygon+FRED, no arcticdb-freshness via check_arcticdb_fresh patch assertion, fail-fast on missing secret).

Validation
- bash -n infrastructure/spot_data_weekly.sh: OK
- python3 -c "import json; json.load(open('infrastructure/step_function.json'))": OK
- Full suite (pytest tests/ -q): 1094 passed, 1 skipped, zero failures (clean-main baseline ~1050; +44 new). 5 pre-existing daily_append.py concat FutureWarnings, unrelated.

Deploy
DEPLOY IS HELD. This is review-ready only. The in-flight recovery Saturday SF run must complete green (proving the #247/#248 preflight fixes end-to-end) before any SF redeploy. The Saturday SF must NOT be redeployed while a recovery execution is live on it.
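
The happy-path reachability property from Changes item 3 (MorningEnrich strictly before DataPhase1) can be sketched as a small graph walk. This is an illustration only, not the contents of test_sf_morning_enrich_split_wiring.py: the state names come from the PR text, but every Type, Next, Choices and Default value in `TOY_STATES`, the terminal state `Done`, and the helper `walk` are assumptions made for the example.

```python
# Toy state machine. Names are from the PR; the wiring is assumed.
TOY_STATES = {
    "InitializeInput": {"Type": "Pass", "Next": "CheckSkipMorningEnrich"},
    "CheckSkipMorningEnrich": {
        "Type": "Choice",
        "Choices": [{"Variable": "$.skip_morning_enrich", "BooleanEquals": True,
                     "Next": "CheckSkipDataPhase1"}],
        "Default": "MorningEnrich",
    },
    "MorningEnrich": {"Type": "Task", "Next": "WaitForMorningEnrich"},
    "WaitForMorningEnrich": {"Type": "Task", "Next": "CheckMorningEnrichStatus"},
    "CheckMorningEnrichStatus": {
        "Type": "Choice",
        "Choices": [{"Variable": "$.status", "StringEquals": "Success",
                     "Next": "CheckSkipDataPhase1"}],
        "Default": "MorningEnrichWait",
    },
    "MorningEnrichWait": {"Type": "Wait", "Seconds": 60, "Next": "WaitForMorningEnrich"},
    "CheckSkipDataPhase1": {
        "Type": "Choice",
        "Choices": [{"Variable": "$.skip_data_phase1", "BooleanEquals": True,
                     "Next": "Done"}],
        "Default": "DataPhase1",
    },
    "DataPhase1": {"Type": "Task", "Next": "Done"},
    "Done": {"Type": "Succeed"},  # hypothetical terminal state
}


def walk(states: dict, start: str, picks: dict) -> list:
    """Follow Next edges from start; at Choice states take the branch in picks."""
    path, seen, name = [], set(), start
    while name is not None:
        if name in seen:
            raise ValueError(f"loop detected at {name}")
        seen.add(name)
        path.append(name)
        state = states[name]
        if state["Type"] == "Choice":
            targets = {c["Next"] for c in state["Choices"]} | {state.get("Default")}
            if picks[name] not in targets:
                raise ValueError(f"{picks[name]} is not a branch of {name}")
            name = picks[name]
        else:
            name = state.get("Next")  # None for terminal states
    return path


# Branches taken on the real (non-skipped, successful) run.
PICKS = {
    "CheckSkipMorningEnrich": "MorningEnrich",
    "CheckMorningEnrichStatus": "CheckSkipDataPhase1",
    "CheckSkipDataPhase1": "DataPhase1",
}
path = walk(TOY_STATES, "InitializeInput", PICKS)
```

Against the real infrastructure/step_function.json the same walk would be fed the loaded States mapping, with the branch picks adjusted to whatever the actual Choice rules are.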
🤖 Generated with Claude Code