feat(sf): split DataPhase1 → MorningEnrich + DataPhase1(phase1) — preflight task split P0#249

Merged
cipher813 merged 1 commit into main from feat/split-dataphase1-morningenrich
May 16, 2026

@cipher813
Owner

Standing rule + origin

Per the plan doc alpha-engine-docs/private/preflight-task-split-260516.md (§3–4, authoritative): every preflight-bearing action is its own SF task; a downstream failure must never re-run a completed upstream task. The extra spot-launch cost (~6–10 min, ~$0.005 per split) is explicitly weighed and accepted over launch economy.

Origin: 2026-05-16 Saturday SF DataPhase1 ran spot_data_weekly.sh --data-only = morning-enrich (~28 min) then phase1 on one spot, with phase1's preflight buried 28 minutes behind a completed morning-enrich. Every phase1 recovery re-paid the 28-min morning-enrich. A fast-fail that fires 28 minutes deep is not a fast-fail. RAGIngestion is the canonical split precedent this PR mirrors.

Changes

1. infrastructure/spot_data_weekly.sh — added --morning-enrich-only / --phase1-only run modes (RUN_MODE morning-enrich-only / phase1-only). morning-enrich and phase1+prune are independently gated by DO_MORNING_ENRICH / DO_PHASE1 derived from RUN_MODE. --data-only preserved (runs both) for manual/adhoc backward-compat. Per-mode MODE_LABEL drives the spot-side S3 log key (health/morning_enrich_log/… vs health/data_phase1_log/…) and the heartbeat dimension so a morning-enrich-only run is not mislabeled data-phase1. Shared scaffolding (log capture, S3 EXIT-trap upload, watchdog, heartbeat) works for all three modes.

2. preflight.py + weekly_collector.py — new dedicated morning_enrich DataPreflight mode whose checks are the UNION of what _run_morning_enrich actually needs: AWS_REGION env, polygon + FRED secret presence (_check_secrets), polygon + FRED reachability probes, S3 bucket + writeable sentinel, ArcticDB libraries present. Deliberately NO check_arcticdb_fresh — morning-enrich is part of what makes ArcticDB fresh, so a freshness gate at its own entry would be circular. weekly_collector.main() now maps --morning-enrich → "morning_enrich" (was the dependency-blind "daily", which never probed polygon/FRED — a drifted key failed 28 min into the spot run). Mode whitelist + docstring updated.

3. infrastructure/step_function.json — new MorningEnrich quartet (CheckSkipMorningEnrich / MorningEnrich / WaitForMorningEnrich / CheckMorningEnrichStatus, plus MorningEnrichWait + ExtractMorningEnrichError) inserted before DataPhase1, mirroring the RAGIngestion/DataPhase1 quartets exactly: same Retry (States.TaskFailed, MaxAttempts 1), same States.ALL → HandleFailure Catch with ResultPath $.error, same HeartbeatSeconds/TimeoutSeconds (5400/5460), same skip-input Choice shape (skip_morning_enrich, the analogue of skip_data_phase1). MorningEnrich runs --morning-enrich-only; DataPhase1 switched --data-only → --phase1-only. Chain: InitializeInput → CheckSkipMorningEnrich → MorningEnrich → WaitForMorningEnrich → CheckMorningEnrichStatus (success) → CheckSkipDataPhase1 → DataPhase1 → (existing next, unchanged). Every existing downstream state untouched.

4. Tests — +44 tests:

  • test_sf_morning_enrich_split_wiring.py — quartet presence, happy-path reachability (MorningEnrich strictly before DataPhase1), --morning-enrich-only / --phase1-only command shapes, HandleFailure Catch, pipefail + S3-log-trap invariants, ResultPath isolation.
  • test_spot_data_weekly_run_modes.py — flag→RUN_MODE parsing, independent DO_* gating, SKIP_RAG_BLOCK, per-mode MODE_LABEL + heartbeat (grep-style, mirrors test_spot_env_source_aws_region.py).
  • test_weekly_collector_preflight_mode_mapping.py — pins --morning-enrich → "morning_enrich" (not "daily").
  • test_preflight.py — extended with TestMorningEnrichMode (probes polygon+FRED, no arcticdb-freshness via check_arcticdb_fresh patch assertion, fail-fast on missing secret).
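
The run-mode gating described in item 1 can be modelled as a small lookup table. The real logic lives in bash inside spot_data_weekly.sh; this is only a Python sketch of the same idea. The RunPlan type, the plan_for and s3_log_prefix helpers, the RUN_MODE string for --data-only, and the exact MODE_LABEL values are assumptions inferred from the log-key prefixes quoted above, not the script's actual contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunPlan:
    """Hypothetical model of what spot_data_weekly.sh derives from its flag."""
    run_mode: str
    do_morning_enrich: bool
    do_phase1: bool
    mode_label: str  # assumed to feed the S3 log key and the heartbeat dimension


# Flag -> plan. The "data-only" RUN_MODE string and its label are assumptions.
PLANS = {
    "--morning-enrich-only": RunPlan("morning-enrich-only", True, False, "morning_enrich"),
    "--phase1-only": RunPlan("phase1-only", False, True, "data_phase1"),
    "--data-only": RunPlan("data-only", True, True, "data_phase1"),
}


def plan_for(flag: str) -> RunPlan:
    """Return the plan for a flag, rejecting anything not in the table."""
    try:
        return PLANS[flag]
    except KeyError:
        raise ValueError(f"unknown run-mode flag: {flag}") from None


def s3_log_prefix(plan: RunPlan) -> str:
    """Build the per-mode log prefix, e.g. health/morning_enrich_log/."""
    return f"health/{plan.mode_label}_log/"


if __name__ == "__main__":
    for flag in PLANS:
        plan = plan_for(flag)
        print(flag, plan.do_morning_enrich, plan.do_phase1, s3_log_prefix(plan))
```

The point of the table form is that each stage has its own boolean gate, so a morning-enrich-only run can never fall through into phase1, and its logs land under a prefix that names what actually ran.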

Validation

  • bash -n infrastructure/spot_data_weekly.sh — OK
  • python3 -c "import json; json.load(open('infrastructure/step_function.json'))" — OK
  • Full suite (pytest tests/ -q): 1094 passed, 1 skipped, zero failures (clean-main baseline ~1050; +44 new). 5 pre-existing daily_append.py concat FutureWarnings, unrelated.
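
A JSON parse check only proves the definition is well-formed. The property the split exists to guarantee is that MorningEnrich is its own task, strictly ahead of DataPhase1. A minimal sketch of that kind of wiring assertion follows, run against a toy definition rather than the real step_function.json. The Resource placeholder, the Parameters shape, the $.status variable, the Wait duration, the skip targets, and the Done state are all assumptions made for illustration; ExtractMorningEnrichError and HandleFailure are omitted from the toy because the happy path never reaches them.

```python
def task_state(command: str, next_state: str) -> dict:
    """Task state with the Retry/Catch/Heartbeat/Timeout values quoted in the PR."""
    return {
        "Type": "Task",
        "Resource": "<task-resource-arn>",   # placeholder, not the real resource
        "Parameters": {"command": command},  # assumed parameter shape
        "HeartbeatSeconds": 5400,
        "TimeoutSeconds": 5460,
        "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 1}],
        "Catch": [{"ErrorEquals": ["States.ALL"],
                   "ResultPath": "$.error",
                   "Next": "HandleFailure"}],
        "Next": next_state,
    }


def skip_choice(flag: str, skip_to: str, run_state: str) -> dict:
    """Choice that jumps to skip_to when the flag is true, else runs the task."""
    return {
        "Type": "Choice",
        "Choices": [{"Variable": f"$.{flag}", "BooleanEquals": True, "Next": skip_to}],
        "Default": run_state,
    }


def build_states() -> dict:
    """Toy state map following the chain listed in the PR description."""
    return {
        "InitializeInput": {"Type": "Pass", "Next": "CheckSkipMorningEnrich"},
        "CheckSkipMorningEnrich": skip_choice(
            "skip_morning_enrich", "CheckSkipDataPhase1", "MorningEnrich"),
        "MorningEnrich": task_state(
            "spot_data_weekly.sh --morning-enrich-only", "WaitForMorningEnrich"),
        # Stand-in for the real polling state, whose type is not shown in the PR.
        "WaitForMorningEnrich": {"Type": "Pass", "Next": "CheckMorningEnrichStatus"},
        "CheckMorningEnrichStatus": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.status", "StringEquals": "Success",
                         "Next": "CheckSkipDataPhase1"}],
            "Default": "MorningEnrichWait",
        },
        "MorningEnrichWait": {"Type": "Wait", "Seconds": 60,
                              "Next": "WaitForMorningEnrich"},
        "CheckSkipDataPhase1": skip_choice("skip_data_phase1", "Done", "DataPhase1"),
        "DataPhase1": task_state("spot_data_weekly.sh --phase1-only", "Done"),
        "Done": {"Type": "Succeed"},
    }


def happy_path(states: dict, start: str, choices: dict) -> list:
    """Follow Next edges; at a Choice take choices[name] if given, else Default."""
    path = []
    current = start
    while current is not None and current not in path:
        path.append(current)
        state = states[current]
        if state["Type"] == "Choice":
            current = choices.get(current, state.get("Default"))
        else:
            current = state.get("Next")
    return path


if __name__ == "__main__":
    states = build_states()
    path = happy_path(states, "InitializeInput",
                      {"CheckMorningEnrichStatus": "CheckSkipDataPhase1"})
    print(" -> ".join(path))
```

Walking the graph instead of grepping the JSON means the ordering check fails if an edge is re-pointed, even when every state name is still present.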

Deploy

DEPLOY IS HELD. This is review-ready only. The in-flight recovery Saturday SF run must complete green (proving the #247/#248 preflight fixes end-to-end) before any SF redeploy. The Saturday SF must NOT be redeployed while a recovery execution is live on it.

🤖 Generated with Claude Code

…reflight task split P0

Standing rule (preflight-task-split-260516.md): every preflight-bearing
action is its own SF task; a downstream failure must never re-run a
completed upstream task. Accept the extra spot-launch cost.

Origin: 2026-05-16 Saturday SF DataPhase1 ran spot_data_weekly.sh
--data-only = morning-enrich (~28 min) THEN phase1 on one spot, with
phase1's preflight buried 28 minutes behind a completed morning-enrich.
Every phase1 recovery re-paid the 28-min morning-enrich. A fast-fail
that fires 28 minutes deep is not a fast-fail.

Changes:
- spot_data_weekly.sh: add --morning-enrich-only / --phase1-only run
  modes (RUN_MODE morning-enrich-only / phase1-only). morning-enrich and
  phase1+prune are now independently gated by DO_MORNING_ENRICH /
  DO_PHASE1 derived from RUN_MODE. --data-only preserved (runs both) for
  manual/adhoc backward-compat. Per-mode MODE_LABEL feeds the spot-side
  S3 log key + heartbeat dimension so a morning-enrich-only run is not
  mislabeled data-phase1. Shared scaffolding (log capture, S3 EXIT-trap
  upload, watchdog, heartbeat) works for all three modes.
- preflight.py: dedicated "morning_enrich" mode whose checks are the
  UNION of what _run_morning_enrich needs (AWS_REGION env, polygon +
  FRED secret presence + reachability probes, S3 bucket + writeable
  sentinel, ArcticDB libraries present). Deliberately NO ArcticDB-
  freshness check -- morning-enrich is part of what makes it fresh.
  weekly_collector.main() now maps --morning-enrich -> "morning_enrich"
  (was the dependency-blind "daily" which skipped polygon/FRED probes).
- step_function.json: new MorningEnrich quartet (CheckSkipMorningEnrich
  / MorningEnrich / WaitForMorningEnrich / CheckMorningEnrichStatus +
  MorningEnrichWait + ExtractMorningEnrichError) inserted BEFORE
  DataPhase1, mirroring the RAGIngestion/DataPhase1 quartets exactly
  (same Retry/Catch/Heartbeat/Timeout/HandleFailure wiring + a
  skip_morning_enrich Choice). MorningEnrich runs
  --morning-enrich-only; DataPhase1 switched --data-only ->
  --phase1-only. Chain: InitializeInput -> CheckSkipMorningEnrich ->
  MorningEnrich -> ... -> CheckSkipDataPhase1 -> DataPhase1 ->
  (existing next, unchanged). All downstream states untouched.
- tests: +44 tests across test_sf_morning_enrich_split_wiring.py,
  test_spot_data_weekly_run_modes.py,
  test_weekly_collector_preflight_mode_mapping.py, and extended
  test_preflight.py (morning_enrich mode: probes polygon+FRED, no
  arcticdb-freshness, fail-fast on missing secret).
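
The "union of what the action needs" idea behind the morning_enrich preflight mode can be sketched as a per-mode check list that runs in order and stops at the first failure. This is an illustration only, not the contents of preflight.py: apart from check_arcticdb_fresh and the secrets check, which the PR names, every function name, the env-dict interface, and the POLYGON_API_KEY / FRED_API_KEY key names are hypothetical, and the network and storage probes are empty placeholders.

```python
from typing import Callable, Dict, List, Mapping

Env = Mapping[str, str]
Check = Callable[[Env], None]


def check_aws_region_env(env: Env) -> None:
    if not env.get("AWS_REGION"):
        raise RuntimeError("AWS_REGION is not set")


def check_secrets(env: Env) -> None:
    # Key names are assumptions; the PR only says polygon + FRED secrets.
    missing = [k for k in ("POLYGON_API_KEY", "FRED_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError("missing secrets: " + ", ".join(missing))


def probe_polygon(env: Env) -> None:
    """Placeholder: a real probe would make a cheap authenticated request."""


def probe_fred(env: Env) -> None:
    """Placeholder: a real probe would make a cheap authenticated request."""


def check_s3_writeable(env: Env) -> None:
    """Placeholder: a real check would write and delete a sentinel object."""


def check_arcticdb_libraries(env: Env) -> None:
    """Placeholder: a real check would list the expected libraries."""


def check_arcticdb_fresh(env: Env) -> None:
    """Placeholder: defined here only to show it is left out of the mode below."""


# morning_enrich lists only what the action needs. The freshness check is
# excluded because morning-enrich is part of what makes the data fresh.
CHECKS_BY_MODE: Dict[str, List[Check]] = {
    "morning_enrich": [
        check_aws_region_env,
        check_secrets,
        probe_polygon,
        probe_fred,
        check_s3_writeable,
        check_arcticdb_libraries,
    ],
}


def run_preflight(mode: str, env: Env) -> List[str]:
    """Run the mode's checks in order; the first failure raises immediately."""
    if mode not in CHECKS_BY_MODE:
        raise ValueError(f"unknown preflight mode: {mode}")
    ran = []
    for check in CHECKS_BY_MODE[mode]:
        check(env)
        ran.append(check.__name__)
    return ran


if __name__ == "__main__":
    env = {"AWS_REGION": "us-east-1", "POLYGON_API_KEY": "x", "FRED_API_KEY": "y"}
    print(run_preflight("morning_enrich", env))
```

Because the checks run before any work starts, a drifted key surfaces in seconds at task entry instead of partway through the spot run.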

Full suite: 1094 passed, 1 skipped (clean-main baseline ~1050; +44 new).
bash -n + SF JSON parse validated. DEPLOY HELD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit fe9507f into main May 16, 2026
1 check passed
@cipher813 cipher813 deleted the feat/split-dataphase1-morningenrich branch May 16, 2026 14:53
cipher813 added a commit that referenced this pull request May 18, 2026
…e; rule shipped disabled) (#258)

Foundational spine of ROADMAP "Scheduled Friday-PM 'shell run'" (P1, added
2026-05-16) — the *prevention* half of Saturday-SF reliability (the
*containment* half, preflight-task-split, shipped 2026-05-16 in data
#249/#250). Surfaces a Saturday-fatal bootstrap break ~11.5h before the
unattended Sat 02:00 PT firing, inside an operator-awake Friday-evening fix
window, instead of as a Saturday-morning-after lost-week incident.

STRICT SUPERSET — shell_run absent/false ⇒ byte-identical to today's real
Saturday run. Only two existing edges change, each routed through a new
Choice whose Default is the pre-spine target:
  InitializeInput.Next: CheckSkipMorningEnrich -> CheckShellRun
    (Default -> CheckSkipMorningEnrich; unchanged for the real run)
  WaitForWeeklySubstrateHealthCheck.Next: NotifyComplete -> CheckShellRunNotify
    (Default -> NotifyComplete; the real Saturday SUCCESS email is untouched)

shell_run propagation (mirrors the existing skip_*/JsonMerge precedent
exactly — no new mechanism invented):
  CheckShellRun (Choice): {"shell_run": true} -> ApplyShellRunDefaults
  ApplyShellRunDefaults (Pass): States.JsonMerge(<all 16 skip_*=true>, $, false)
    layers every skip flag = true UNDER the execution input so an explicit
    per-flag override still wins (e.g. {"shell_run":true,"skip_research":false}
    still runs Research). Every workload state already has a Choice-gated
    skip_*, so the whole workload no-ops via the EXISTING skip mechanism.
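
The layering order is the important detail here: with the shallow form of States.JsonMerge, keys from the second argument override the first, so placing the defaults first lets the execution input win. A Python sketch of the same semantics follows. The function name and the three-flag list are illustrative assumptions; the real defaults blob covers all 16 skip flags.

```python
def apply_shell_run_defaults(execution_input: dict, skip_flags: list) -> dict:
    """Shallow-merge all-true skip defaults UNDER the execution input."""
    defaults = {flag: True for flag in skip_flags}
    # Later keys win in a dict merge, mirroring JsonMerge(defaults, $, false).
    return {**defaults, **execution_input}


if __name__ == "__main__":
    # Illustrative subset of flags, not the full list of 16.
    flags = ["skip_morning_enrich", "skip_data_phase1", "skip_research"]
    merged = apply_shell_run_defaults(
        {"shell_run": True, "skip_research": False}, flags)
    print(merged)
```

With that input, skip_research stays false while the other flags default to true, which is the "explicit per-flag override still wins" behaviour described above.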

Per-state dry-vs-skip inventory under shell_run (spine = pure-skip; per-module
--preflight-only/--dry-run "spots boot + smoke" are SCOPED FOLLOW-ONS):
  SKIPPED via existing skip_* gate (16): MorningEnrich, DataPhase1,
    RAGIngestion, RegimeSubstrate, RegimeRetrospectiveEval, Research,
    DataPhase2, EvalJudge(+RollingMean), RationaleClustering,
    ReplayConcordance, Counterfactual, PredictorTraining, DriftDetection,
    Backtester, Parity, Evaluator
  STILL RUNS (read-only, no skip gate by design — exactly the bootstrap/
    transport smoke the shell run wants Friday PM): SaturdayHealthCheck,
    WeeklySubstrateHealthCheck. Their shell_run-aware missing-Friday-bar
    tolerance is ROADMAP owed-work item 5 (scoped follow-on).
  NOTIFY: NotifyShellRunComplete (shell-run-tagged Subject, reuses the exact
    NotifyComplete SNS substrate — alpha-engine-alerts topic, same Resource).

Friday EventBridge rule (CFN, the documented infra-as-code home for
EventBridge rules — SaturdayTrigger/WeekdayTrigger live there):
  FridayShellRunTrigger, cron(30 21 ? * FRI *) = 21:30 UTC Fri =
  14:30 PT (PDT, dominant season) / 13:30 PT (PST). Chosen AFTER the Friday
  EOD SF (~1:25 PT) so it never collides with PostMarketData/EODReconcile/
  StopTradingInstance on the trading instance, and ~11.5h BEFORE the real
  Sat 09:00 UTC firing. Targets the SAME alpha-engine-saturday-pipeline SF
  (NOT a parallel SF) with {"shell_run": true}, same EventBridgeSfnRoleArn —
  the existing states:StartExecution grant is SF-ARN-scoped so NO IAM change
  is needed.
  SHIPPED State: DISABLED — zero-risk merge. Additive observability, NOT a
  backstop (the "fail loud, no backstop" design decision stands).
  Operator enable step:
    aws events enable-rule --name alpha-engine-friday-shell-run --region us-east-1
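
The timing arithmetic above can be checked with the standard library. EventBridge cron expressions are evaluated in UTC, so the local firing time shifts by an hour between PDT and PST while the lead over the Saturday 09:00 UTC firing stays fixed. The helper name and the two sample Fridays are illustrative choices, not part of the change.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

PT = ZoneInfo("America/Los_Angeles")


def shell_run_times(year: int, month: int, day: int):
    """For a given Friday, return (local PT firing time, lead over the Saturday run)."""
    # cron(30 21 ? * FRI *) fires at 21:30 UTC on Fridays.
    fire = datetime(year, month, day, 21, 30, tzinfo=timezone.utc)
    # The real run fires at 09:00 UTC the following day.
    real = datetime(year, month, day, 9, 0, tzinfo=timezone.utc) + timedelta(days=1)
    return fire.astimezone(PT), real - fire


if __name__ == "__main__":
    for friday in [(2026, 5, 22), (2026, 1, 16)]:  # one PDT Friday, one PST Friday
        local, lead = shell_run_times(*friday)
        print(local.strftime("%a %H:%M %Z"), "lead:", lead)
```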

Consolidated-notify decision: shell-run SUCCESS is delivered by reusing the
existing NotifyComplete SNS pattern with a SHELL RUN-tagged Subject (zero new
infra). A shell-run FAILURE reuses the unchanged HandleFailure (its 20
inbound error edges deliberately NOT re-pointed: high churn, zero added
operator value, and would perturb the real Saturday failure path's risk
surface — the FAILED alert's Friday execution timestamp/ID is the actionable
signal). The richer per-state pass/fail report (ROADMAP design point 5) is a
scoped follow-on.

Scoped per-module follow-on PRs (repo -> state -> dry mode needed; NOT done
here — these convert "skipped" to "spots boot + smoke"):
  alpha-engine-data -> DataPhase1/MorningEnrich -> spot_data_weekly.sh
    --preflight-only (preflight + universe-freshness scan, no polygon/FMP
    writes); shell_run-aware tolerance for "Friday bar not yet present"
  alpha-engine-data -> RAGIngestion -> spot_data_weekly.sh --rag-only
    --preflight-only (corpus reachability + secrets, no SEC/embedding writes)
  alpha-engine-predictor -> PredictorTraining -> spot_train.sh --preflight-only
    (load + WF-gate-shape check, NO predictor/weights/ promotion)
  alpha-engine-backtester -> Backtester/Parity/Evaluator -> spot_backtest.sh
    --mode=smoke + simulate-dry, NO config/*.json auto-apply
    (freeze_evaluator pattern is the model)
  alpha-engine-data -> SaturdayHealthCheck/WeeklySubstrateHealthCheck ->
    shell_run-aware missing-Friday-bar tolerance (ROADMAP owed-work item 5)
  (Research/predictor-inference/executor already have --dry-run/--simulate;
   wiring those into the SF states is part of the per-state follow-ons above.)

Tests: tests/test_sf_friday_shell_run_wiring.py (23 cases — strict-superset
edges, JsonMerge user-input-wins order, every skip-gate covered by the
defaults blob, full happy-path traversal for shell_run true vs absent,
Friday rule DISABLED + same-SF + shell_run=true + cron). Updated two
pre-spine wiring tests (morning_enrich_split, substrate_check) to assert
through the new gates while pinning Default == pre-spine target. Full suite:
1242 passed, 1 skipped (pre-existing, unrelated). No new pip deps. No secrets.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>