feat(sf): Friday-PM shell_run dry-pass of the Saturday pipeline (spine; rule shipped disabled) #258
Merged
…e; rule shipped disabled)

Foundational spine of ROADMAP "Scheduled Friday-PM 'shell run'" (P1, added 2026-05-16): the *prevention* half of Saturday-SF reliability (the *containment* half, preflight-task-split, shipped 2026-05-16 in data #249/#250). Surfaces a Saturday-fatal bootstrap break ~11.5h before the unattended Sat 02:00 PT firing, inside an operator-awake Friday-evening fix window, instead of as a Saturday-morning-after lost-week incident.

STRICT SUPERSET: shell_run absent/false ⇒ byte-identical to today's real Saturday run. Only two existing edges change, each routed through a new Choice whose Default is the pre-spine target:

  InitializeInput.Next: CheckSkipMorningEnrich -> CheckShellRun (Default -> CheckSkipMorningEnrich; unchanged for the real run)
  WaitForWeeklySubstrateHealthCheck.Next: NotifyComplete -> CheckShellRunNotify (Default -> NotifyComplete; the real Saturday SUCCESS email is untouched)

shell_run propagation (mirrors the existing skip_*/JsonMerge precedent exactly; no new mechanism invented):

  CheckShellRun (Choice): {"shell_run": true} -> ApplyShellRunDefaults
  ApplyShellRunDefaults (Pass): States.JsonMerge(<all 16 skip_*=true>, $, false) layers every skip flag = true UNDER the execution input so an explicit per-flag override still wins (e.g. {"shell_run":true,"skip_research":false} still runs Research).

Every workload state already has a Choice-gated skip_*, so the whole workload no-ops via the EXISTING skip mechanism.
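The layering order described above can be sketched in Python. This is only an illustration of shallow-merge semantics, where the second object wins on a key clash, not the state machine itself. Apart from `skip_research`, the flag names below are hypothetical stand-ins for the 16 real `skip_*` flags, and `apply_shell_run_defaults` is an invented helper name.

```python
# Hypothetical defaults blob: only skip_research is named in the PR text;
# the other two keys are placeholders for the remaining skip_* flags.
SHELL_RUN_DEFAULTS = {
    "skip_research": True,
    "skip_backtester": True,
    "skip_predictor_training": True,
}


def json_merge_shallow(first: dict, second: dict) -> dict:
    """Emulate a shallow JSON merge: on a key clash, the second object wins."""
    merged = dict(first)
    merged.update(second)
    return merged


def apply_shell_run_defaults(execution_input: dict) -> dict:
    """Mimic CheckShellRun followed by ApplyShellRunDefaults (assumed behaviour)."""
    if execution_input.get("shell_run") is not True:
        # Default branch: the input passes through untouched.
        return execution_input
    # Defaults go first, so anything the operator set explicitly overrides them.
    return json_merge_shallow(SHELL_RUN_DEFAULTS, execution_input)


if __name__ == "__main__":
    print(apply_shell_run_defaults({"shell_run": True, "skip_research": False}))
    print(apply_shell_run_defaults({}))
```

Putting the defaults in the first argument is what makes the explicit override win; swapping the two arguments would force every flag to true regardless of the execution input.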
Per-state dry-vs-skip inventory under shell_run (spine = pure-skip; per-module --preflight-only/--dry-run "spots boot + smoke" are SCOPED FOLLOW-ONS):

  SKIPPED via existing skip_* gate (16): MorningEnrich, DataPhase1, RAGIngestion, RegimeSubstrate, RegimeRetrospectiveEval, Research, DataPhase2, EvalJudge(+RollingMean), RationaleClustering, ReplayConcordance, Counterfactual, PredictorTraining, DriftDetection, Backtester, Parity, Evaluator
  STILL RUNS (read-only, no skip gate by design; exactly the bootstrap/transport smoke the shell run wants Friday PM): SaturdayHealthCheck, WeeklySubstrateHealthCheck. Their shell_run-aware missing-Friday-bar tolerance is ROADMAP owed-work item 5 (scoped follow-on).
  NOTIFY: NotifyShellRunComplete (shell-run-tagged Subject, reuses the exact NotifyComplete SNS substrate: alpha-engine-alerts topic, same Resource).

Friday EventBridge rule (CFN, the documented infra-as-code home for EventBridge rules; SaturdayTrigger/WeekdayTrigger live there): FridayShellRunTrigger, cron(30 21 ? * FRI *) = 21:30 UTC Fri = 14:30 PT (PDT, dominant season) / 13:30 PT (PST). Chosen AFTER the Friday EOD SF (~1:25 PT) so it never collides with PostMarketData/EODReconcile/StopTradingInstance on the trading instance, and ~11.5h BEFORE the real Sat 09:00 UTC firing. Targets the SAME alpha-engine-saturday-pipeline SF (NOT a parallel SF) with {"shell_run": true}, same EventBridgeSfnRoleArn; the existing states:StartExecution grant is SF-ARN-scoped so NO IAM change is needed. SHIPPED State: DISABLED, a zero-risk merge. Additive observability, NOT a backstop (the "fail loud, no backstop" design decision stands).

Operator enable step:

  aws events enable-rule --name alpha-engine-friday-shell-run --region us-east-1

Consolidated-notify decision: shell-run SUCCESS is delivered by reusing the existing NotifyComplete SNS pattern with a SHELL RUN-tagged Subject (zero new infra). A shell-run FAILURE reuses the unchanged HandleFailure (its 20 inbound error edges deliberately NOT re-pointed: high churn, zero added operator value, and would perturb the real Saturday failure path's risk surface; the FAILED alert's Friday execution timestamp/ID is the actionable signal). The richer per-state pass/fail report (ROADMAP design point 5) is a scoped follow-on.

Scoped per-module follow-on PRs (repo -> state -> dry mode needed; NOT done here; these convert "skipped" to "spots boot + smoke"):

  alpha-engine-data -> DataPhase1/MorningEnrich -> spot_data_weekly.sh --preflight-only (preflight + universe-freshness scan, no polygon/FMP writes); shell_run-aware tolerance for "Friday bar not yet present"
  alpha-engine-data -> RAGIngestion -> spot_data_weekly.sh --rag-only --preflight-only (corpus reachability + secrets, no SEC/embedding writes)
  alpha-engine-predictor -> PredictorTraining -> spot_train.sh --preflight-only (load + WF-gate-shape check, NO predictor/weights/ promotion)
  alpha-engine-backtester -> Backtester/Parity/Evaluator -> spot_backtest.sh --mode=smoke + simulate-dry, NO config/*.json auto-apply (freeze_evaluator pattern is the model)
  alpha-engine-data -> SaturdayHealthCheck/WeeklySubstrateHealthCheck -> shell_run-aware missing-Friday-bar tolerance (ROADMAP owed-work item 5)

(Research/predictor-inference/executor already have --dry-run/--simulate; wiring those into the SF states is part of the per-state follow-ons above.)

Tests: tests/test_sf_friday_shell_run_wiring.py (23 cases: strict-superset edges, JsonMerge user-input-wins order, every skip-gate covered by the defaults blob, full happy-path traversal for shell_run true vs absent, Friday rule DISABLED + same-SF + shell_run=true + cron). Updated two pre-spine wiring tests (morning_enrich_split, substrate_check) to assert through the new gates while pinning Default == pre-spine target. Full suite: 1242 passed, 1 skipped (pre-existing, unrelated). No new pip deps. No secrets.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
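The "every skip-gate covered by the defaults blob" check mentioned in the Tests note could look roughly like the sketch below. It is not the code in tests/test_sf_friday_shell_run_wiring.py. The toy definition, the `CheckSkipResearch` state name, the `skip_morning_enrich` flag name, and both helper functions are assumptions made for illustration, and only top-level `Variable` rules are inspected (nested `And`/`Or` rules are ignored).

```python
import re

# Matches a Choice rule variable such as "$.skip_research".
SKIP_VARIABLE = re.compile(r"\$\.(skip_\w+)")

# Toy state machine definition, NOT the real Saturday pipeline.
TOY_DEFINITION = {
    "States": {
        "CheckSkipMorningEnrich": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.skip_morning_enrich", "BooleanEquals": True,
                 "Next": "CheckSkipResearch"}
            ],
            "Default": "MorningEnrich",
        },
        "MorningEnrich": {"Type": "Task", "Next": "CheckSkipResearch"},
        "CheckSkipResearch": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.skip_research", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": "Research",
        },
        "Research": {"Type": "Task", "Next": "Done"},
        "Done": {"Type": "Succeed"},
    }
}


def skip_flags_gated(definition: dict) -> set:
    """Collect every skip_* flag that some Choice state branches on."""
    flags = set()
    for state in definition["States"].values():
        if state.get("Type") != "Choice":
            continue
        for rule in state.get("Choices", []):
            match = SKIP_VARIABLE.fullmatch(rule.get("Variable", ""))
            if match:
                flags.add(match.group(1))
    return flags


def uncovered_flags(definition: dict, defaults: dict) -> set:
    """Return gated skip flags that the shell-run defaults blob forgot to set."""
    return skip_flags_gated(definition) - set(defaults)


if __name__ == "__main__":
    # A defaults blob missing one flag is reported as drift.
    print(uncovered_flags(TOY_DEFINITION, {"skip_research": True}))
```

A guard of this shape fails as soon as someone adds a new skip-gated state without extending the defaults blob, which is the drift the test is meant to catch.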
cipher813 added a commit that referenced this pull request May 18, 2026
…un instead of skip (#260)

* feat(sf): shell-run keystone: spot --preflight-only + Lambda --dry-run instead of skip

Converts #258's pure-skip shell_run into actual boot+dry execution of the Saturday SF workload. ApplyShellRunDefaults no longer force-sets all 16 skip_* true; it now sets a single preflight_args=" --preflight-only" suffix var (driving the 7 spot states' States.Format command), Lambda dry flags for the 4 verified-clean Lambda states, and hard-skips ONLY the 5 documented no-clean-dry-path exceptions. InitializeInput seeds the control vars at non-dry identity values so the shell_run-absent path is byte-identical (spots) / behaviourally identical (Lambdas) to today's real Saturday run.

Invariant preserved + test-proven: shell_run absent/false ⇒ every spot command string char-for-char unchanged (TestByteIdenticalAbsentPath resolves the States.Array/States.Format intrinsics with preflight_args="" and asserts equality against origin/main).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(sf): hermetic byte-identical baseline: fix CI (origin/main not in shallow PR checkout)

The keystone byte-identical proof shelled out to `git show origin/main:infrastructure/step_function.json` at test time. GitHub Actions' shallow PR checkout has no `origin/main` local ref → `subprocess.CalledProcessError ... exit status 128` → `test` check failed.

Replace the live-git `orig_sf` fixture with a committed frozen baseline `tests/fixtures/sf_prekeystone_spot_commands.json` (the RESOLVED pre-keystone spot command lists captured from origin/main; handles the states already on commands.$: Backtester/Parity/Evaluator). The proof is now hermetic and still a true regression guard against the strict-superset invariant. Docstring documents deliberate-regeneration.

Suite: 1337 passed, 1 skipped (unchanged). Keystone file 43/43.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
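The byte-identical invariant for the spot commands comes down to appending an empty suffix. A rough Python sketch follows, assuming a simplified stand-in for `States.Format` that only fills `{}` placeholders in order; the command string and template are hypothetical, since the real ones live in infrastructure/step_function.json.

```python
def states_format(template: str, *args: str) -> str:
    """Simplified stand-in for States.Format: fill each '{}' placeholder in order."""
    parts = template.split("{}")
    if len(parts) - 1 != len(args):
        raise ValueError("placeholder/argument count mismatch")
    out = parts[0]
    for arg, tail in zip(args, parts[1:]):
        out += arg + tail
    return out


# Hypothetical pre-keystone command and its keystone template.
BASELINE_COMMAND = "bash spot_train.sh"
KEYSTONE_TEMPLATE = "bash spot_train.sh{}"


def resolve(preflight_args: str) -> str:
    """Resolve the spot command for a given preflight_args control var."""
    return states_format(KEYSTONE_TEMPLATE, preflight_args)


if __name__ == "__main__":
    # Identity value seeded by InitializeInput: the command is unchanged.
    print(repr(resolve("")))
    # Shell-run value set by ApplyShellRunDefaults: the dry flag is appended.
    print(repr(resolve(" --preflight-only")))
```

Carrying the leading space inside the suffix value, rather than in the template, is what lets the empty string reproduce the original command character for character.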
cipher813 added a commit that referenced this pull request May 18, 2026
…perator request) (#262)

Per Brian: the Friday shell-run should fire at 1:45 PM PT. Retimes the existing alpha-engine-friday-shell-run EventBridge rule (#258) from cron(30 21 ? * FRI *) (14:30 PT in PDT) to cron(45 20 ? * FRI *) (1:45 PM PT in PDT / 12:45 PM PT in PST). Single rule retimed, NOT a second redundant rule (per feedback_dont_add_redundant_scheduled_infra). Still after the Friday EOD SF (~1:25 PT, no trading-instance collision) and ~12h before the real Sat 09:00 UTC firing. Rule stays State: DISABLED; the operator enables it deliberately via `aws events enable-rule`. Comments + the wiring test's pinned cron/timing updated. Suite 1337 passed.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
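The UTC-to-Pacific arithmetic behind both schedules can be checked with a short Python sketch. It assumes Python 3.9+ with the IANA time zone database available; `describe` is an invented helper, and the two sample Fridays (2026-05-22 for PDT, 2026-12-04 for PST) are arbitrary choices, not dates from the PR.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")
SATURDAY_FIRING_UTC = time(9, 0)  # real Saturday run at 09:00 UTC, per the PR text


def describe(friday: date, hour_utc: int, minute_utc: int):
    """Return (Pacific wall-clock time, lead time before Saturday's run)."""
    if friday.weekday() != 4:  # Monday is 0, so Friday is 4
        raise ValueError("expected a Friday")
    fired = datetime.combine(friday, time(hour_utc, minute_utc), tzinfo=timezone.utc)
    saturday = datetime.combine(
        friday + timedelta(days=1), SATURDAY_FIRING_UTC, tzinfo=timezone.utc
    )
    return fired.astimezone(PACIFIC).strftime("%H:%M %Z"), saturday - fired


if __name__ == "__main__":
    for label, (hour, minute) in {"#258": (21, 30), "#262": (20, 45)}.items():
        for friday in (date(2026, 5, 22), date(2026, 12, 4)):
            local, lead = describe(friday, hour, minute)
            print(label, friday, local, lead)
```

Because the EventBridge cron is fixed in UTC, the Pacific wall-clock time moves by an hour between daylight and standard time while the lead time before the Saturday firing stays constant.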
Summary
Foundational spine of ROADMAP "Scheduled Friday-PM 'shell run' — automated full-fidelity preflight of the Saturday SF" (P1, added 2026-05-16). The prevention half of Saturday-SF reliability; the containment half (preflight-task-split) shipped 2026-05-16 in data #249/#250.
A scheduled Friday-PM dry/no-op pass of the entire Saturday SF turns a Saturday-fatal breakage from a Saturday-morning-after lost-week incident into a Friday-evening operator fix window (~11.5h before the real Sat 02:00 PT firing). Motivated by the recent multi-week Saturday-SF cascade history.
Design
Strict superset. shell_run absent/false ⇒ byte-identical to today's real Saturday run. Mirrors the existing skip_*/States.JsonMerge precedent exactly; no new mechanism invented. Only two existing edges change, each routed through a new Choice whose Default is the pre-spine target:

InitializeInput.Next: CheckSkipMorningEnrich → CheckShellRun (CheckShellRun.Default → CheckSkipMorningEnrich; unchanged)
WaitForWeeklySubstrateHealthCheck.Next: NotifyComplete → CheckShellRunNotify (CheckShellRunNotify.Default → NotifyComplete; SUCCESS email untouched)

shell_run=true → ApplyShellRunDefaults (Pass): States.JsonMerge(<all 16 skip_*=true>, $, false) layers every skip flag = true under the execution input, so an explicit per-flag override still wins ({"shell_run":true,"skip_research":false} still runs Research). Every workload state already has a Choice-gated skip_*, so the whole workload no-ops via the existing skip mechanism.

Per-state dry-vs-skip inventory under shell_run

Spine = pure-skip (per-module "spots boot + smoke" dry paths are scoped follow-ons):

SKIPPED (16), via the existing skip_* Choice gate, force-true'd by ApplyShellRunDefaults: MorningEnrich, DataPhase1, RAGIngestion, RegimeSubstrate, RegimeRetrospectiveEval, Research, DataPhase2, EvalJudge(+RollingMean), RationaleClustering, ReplayConcordance, Counterfactual, PredictorTraining, DriftDetection, Backtester, Parity, Evaluator
STILL RUNS (read-only, no skip gate by design): SaturdayHealthCheck, WeeklySubstrateHealthCheck
NOTIFY: NotifyShellRunComplete (shell-run-tagged Subject; reuses the exact NotifyComplete SNS substrate)

Friday cron + justification
cron(30 21 ? * FRI *) = 21:30 UTC Fri = 14:30 PT (PDT, dominant season) / 13:30 PT (PST). Chosen after the Friday EOD SF (~1:25 PT) so the shell run never collides with PostMarketData/EODReconcile/StopTradingInstance on the trading instance, and ~11.5h before the real Sat 09:00 UTC firing, so a red report Friday evening gives a full operator fix window. shell_run mode fetches nothing, so the Saturday "polygon T+1 must have settled" timing constraint is moot here.

Operator enable step

The rule ships State: DISABLED, a zero-risk merge. Additive observability, not a backstop (the "fail loud, no backstop" design decision stands). Brian enables deliberately:

aws events enable-rule --name alpha-engine-friday-shell-run --region us-east-1

It targets the same alpha-engine-saturday-pipeline SF (NOT a parallel SF) with {"shell_run": true}, same EventBridgeSfnRoleArn. The existing states:StartExecution grant is SF-ARN-scoped (infrastructure/iam/alpha-engine-eventbridge-sfn-role.json) so no IAM change is needed.

Consolidated-notify decision
Shell-run SUCCESS reuses the existing NotifyComplete SNS pattern with a SHELL RUN-tagged Subject (zero new infra). A shell-run FAILURE reuses the unchanged HandleFailure: its 20 inbound error edges were deliberately not re-pointed (high churn, zero added operator value, would perturb the real Saturday failure path's risk surface; the FAILED alert's Friday execution timestamp/ID is the actionable signal). The richer per-state pass/fail report (ROADMAP design point 5) is a scoped follow-on.

Scoped per-module follow-on PRs (NOT in this spine)
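Convert "skipped" → "spots boot + smoke" (repo → state → dry mode needed):

alpha-engine-data → DataPhase1/MorningEnrich → spot_data_weekly.sh --preflight-only (preflight + universe-freshness scan, no polygon/FMP writes)
alpha-engine-data → RAGIngestion → spot_data_weekly.sh --rag-only --preflight-only (corpus reachability + secrets, no SEC/embedding writes)
alpha-engine-predictor → PredictorTraining → spot_train.sh --preflight-only (load + WF-gate-shape check, NO predictor/weights/ promotion)
alpha-engine-backtester → Backtester/Parity/Evaluator → spot_backtest.sh --mode=smoke + simulate-dry, NO config/*.json auto-apply (freeze_evaluator pattern)
alpha-engine-data → SaturdayHealthCheck/WeeklySubstrateHealthCheck → shell_run-aware missing-Friday-bar tolerance (ROADMAP owed-work item 5)

Research / predictor-inference / executor already have --dry-run/--simulate; wiring those into the SF states is part of the per-state follow-ons above.

Tests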
tests/test_sf_friday_shell_run_wiring.py: 23 cases covering strict-superset edges, JsonMerge user-input-wins order, every skip-gate covered by the defaults blob (drift guard), full happy-path traversal for shell_run true vs absent, and Friday rule DISABLED + same-SF + shell_run=true + cron. Updated two pre-spine wiring tests (morning_enrich_split, substrate_check) to assert through the new gates while pinning Default == pre-spine target. Full suite: 1242 passed, 1 skipped (pre-existing, unrelated). No new pip deps. No secrets.

🤖 Generated with Claude Code