
feat(sf): Friday-PM shell_run dry-pass of the Saturday pipeline (spine; rule shipped disabled) #258

Merged: cipher813 merged 2 commits into main from feat/sf-friday-shell-run on May 18, 2026


@cipher813
Owner

Summary

Foundational spine of ROADMAP "Scheduled Friday-PM 'shell run' — automated full-fidelity preflight of the Saturday SF" (P1, added 2026-05-16). The prevention half of Saturday-SF reliability; the containment half (preflight-task-split) shipped 2026-05-16 in data #249/#250.

A scheduled Friday-PM dry/no-op pass of the entire Saturday SF turns a Saturday-fatal breakage from a Saturday-morning-after lost-week incident into a Friday-evening operator fix window (~11.5h before the real Sat 02:00 PT firing). Motivated by the recent multi-week Saturday-SF cascade history.

Design

Strict superset. shell_run absent/false ⇒ byte-identical to today's real Saturday run. Mirrors the existing skip_* / States.JsonMerge precedent exactly — no new mechanism invented. Only two existing edges change, each routed through a new Choice whose Default is the pre-spine target:

| Edge | Pre-spine | Post-spine | Real Saturday run |
|---|---|---|---|
| InitializeInput.Next | CheckSkipMorningEnrich | CheckShellRun | CheckShellRun.Default → CheckSkipMorningEnrich (unchanged) |
| WaitForWeeklySubstrateHealthCheck.Next | NotifyComplete | CheckShellRunNotify | CheckShellRunNotify.Default → NotifyComplete (SUCCESS email untouched) |
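
The "new Choice whose Default is the pre-spine target" pattern can be sketched in Python as a dict plus a tiny resolver. The state names come from the table above. The rule fields (`Variable`, `BooleanEquals`) are standard Choice-rule fields, but whether the real definition is written this way is an assumption, and `next_state` is a hypothetical helper, not Step Functions code.

```python
# Hypothetical, simplified model of the CheckShellRun gate.
CHECK_SHELL_RUN = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.shell_run", "BooleanEquals": True, "Next": "ApplyShellRunDefaults"},
    ],
    "Default": "CheckSkipMorningEnrich",  # the pre-spine target of InitializeInput.Next
}


def next_state(choice: dict, execution_input: dict) -> str:
    """Return the state a simplified Choice would route to for this input."""
    for rule in choice["Choices"]:
        key = rule["Variable"].removeprefix("$.")
        # A missing key never matches, so an input without shell_run falls to Default.
        if key in execution_input and execution_input[key] is rule["BooleanEquals"]:
            return rule["Next"]
    return choice["Default"]


print(next_state(CHECK_SHELL_RUN, {}))                    # real Saturday run
print(next_state(CHECK_SHELL_RUN, {"shell_run": False}))  # explicit false
print(next_state(CHECK_SHELL_RUN, {"shell_run": True}))   # Friday shell run
```

Only the `shell_run: true` input leaves the pre-spine path; every other input resolves to the same target the edge pointed at before the change.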

shell_run=true → ApplyShellRunDefaults (Pass): States.JsonMerge(<all 16 skip_*=true>, $, false) layers every skip flag = true under the execution input, so an explicit per-flag override still wins ({"shell_run":true,"skip_research":false} still runs Research). Every workload state already has a Choice-gated skip_*, so the whole workload no-ops via the existing skip mechanism.
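
A minimal Python sketch of the merge order described above. It models `States.JsonMerge(defaults, $, false)` as a shallow merge in which keys from the second argument override the first. `skip_research` is taken from the example in the text; the other two flag names and both helper functions are hypothetical stand-ins for the real 16-flag blob.

```python
# Hypothetical subset of the defaults blob; the real one lists all 16 skip_* flags.
SHELL_RUN_DEFAULTS = {"skip_research": True, "skip_backtester": True, "skip_parity": True}


def json_merge_shallow(first: dict, second: dict) -> dict:
    """Shallow merge in which keys from `second` override keys from `first`."""
    return {**first, **second}


def apply_shell_run_defaults(execution_input: dict) -> dict:
    # Defaults go first and the execution input second, so explicit flags win.
    return json_merge_shallow(SHELL_RUN_DEFAULTS, execution_input)


merged = apply_shell_run_defaults({"shell_run": True, "skip_research": False})
print(merged)
```

Swapping the two arguments would invert the behaviour: the defaults would then clobber any per-flag override, which is the ordering mistake the wiring test's "user-input-wins" case is meant to catch.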

Per-state dry-vs-skip inventory under shell_run

Spine = pure-skip (per-module "spots boot + smoke" dry paths are scoped follow-ons):

| State(s) | Under shell_run | Mechanism |
|---|---|---|
| MorningEnrich, DataPhase1, RAGIngestion, RegimeSubstrate, RegimeRetrospectiveEval, Research, DataPhase2, EvalJudge(+RollingMean), RationaleClustering, ReplayConcordance, Counterfactual, PredictorTraining, DriftDetection, Backtester, Parity, Evaluator (16) | SKIPPED | existing skip_* Choice gate, force-true'd by ApplyShellRunDefaults |
| SaturdayHealthCheck, WeeklySubstrateHealthCheck | STILL RUNS (read-only: exactly the bootstrap/transport smoke the shell run wants) | no skip gate by design; shell_run-aware missing-Friday-bar tolerance = ROADMAP owed-work item 5 (follow-on) |
| (success notify) | NotifyShellRunComplete | shell-run-tagged Subject, reuses the exact NotifyComplete SNS substrate |

Friday cron + justification

cron(30 21 ? * FRI *) = 21:30 UTC Fri = 14:30 PT (PDT, dominant season) / 13:30 PT (PST). Chosen after the Friday EOD SF (~1:25 PT) so the shell run never collides with PostMarketData/EODReconcile/StopTradingInstance on the trading instance, and ~11.5h before the real Sat 09:00 UTC firing — a red report Friday evening gives a full operator fix window. shell_run mode fetches nothing, so the Saturday "polygon T+1 must have settled" timing constraint is moot here.
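
The timing arithmetic above can be checked with a short Python sketch. It uses fixed UTC offsets for PDT and PST rather than a timezone database, which is a simplification, and the 2026-05-22 date is just an illustrative Friday.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
PDT = timezone(timedelta(hours=-7), "PDT")  # fixed offset, summer
PST = timezone(timedelta(hours=-8), "PST")  # fixed offset, winter

# 2026-05-22 is a Friday; cron(30 21 ? * FRI *) fires at 21:30 UTC that day.
shell_run_fire = datetime(2026, 5, 22, 21, 30, tzinfo=UTC)
# The real Saturday run fires at 09:00 UTC the next day.
saturday_fire = datetime(2026, 5, 23, 9, 0, tzinfo=UTC)

local_pdt = shell_run_fire.astimezone(PDT)
local_pst = shell_run_fire.astimezone(PST)
fix_window_hours = (saturday_fire - shell_run_fire).total_seconds() / 3600

print(local_pdt.strftime("%H:%M"), local_pst.strftime("%H:%M"), fix_window_hours)
```

Because the rule is expressed in UTC, the local firing time shifts by an hour between seasons while the gap to the Saturday firing stays constant.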

Operator enable step

The rule ships State: DISABLED — zero-risk merge. Additive observability, not a backstop (the "fail loud, no backstop" design decision stands). Brian enables deliberately:

aws events enable-rule --name alpha-engine-friday-shell-run --region us-east-1

It targets the same alpha-engine-saturday-pipeline SF (NOT a parallel SF) with {"shell_run": true}, same EventBridgeSfnRoleArn. The existing states:StartExecution grant is SF-ARN-scoped (infrastructure/iam/alpha-engine-eventbridge-sfn-role.json) so no IAM change is needed.
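
As a rough illustration of the rule described here, the following Python dict mirrors the shape of a CloudFormation `AWS::Events::Rule` resource. The rule name, cron expression, state, and input payload come from this description; the ARNs are placeholders, the target `Id` is hypothetical, and the real template may well use CloudFormation intrinsics instead of literal strings.

```python
import json

# Placeholder ARNs; the real template presumably references these via CFN intrinsics.
SATURDAY_SF_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:alpha-engine-saturday-pipeline"
EVENTBRIDGE_SFN_ROLE_ARN = "arn:aws:iam::123456789012:role/placeholder-eventbridge-sfn-role"

friday_shell_run_trigger = {
    "Type": "AWS::Events::Rule",
    "Properties": {
        "Name": "alpha-engine-friday-shell-run",
        "ScheduleExpression": "cron(30 21 ? * FRI *)",
        "State": "DISABLED",  # shipped off; the operator enables it deliberately
        "Targets": [
            {
                "Id": "FridayShellRunTarget",  # hypothetical target id
                "Arn": SATURDAY_SF_ARN,  # same state machine as the Saturday run
                "RoleArn": EVENTBRIDGE_SFN_ROLE_ARN,
                # The target input is passed as a JSON string.
                "Input": json.dumps({"shell_run": True}),
            }
        ],
    },
}

print(json.dumps(friday_shell_run_trigger, indent=2))
```

The only thing distinguishing a Friday execution from a Saturday one in this shape is the `Input` payload, which is why no second state machine or IAM grant is required.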

Consolidated-notify decision

Shell-run SUCCESS reuses the existing NotifyComplete SNS pattern with a SHELL RUN-tagged Subject (zero new infra). A shell-run FAILURE reuses the unchanged HandleFailure — its 20 inbound error edges were deliberately not re-pointed (high churn, zero added operator value, would perturb the real Saturday failure path's risk surface; the FAILED alert's Friday execution timestamp/ID is the actionable signal). The richer per-state pass/fail report (ROADMAP design point 5) is a scoped follow-on.

Scoped per-module follow-on PRs (NOT in this spine)

Convert "skipped" → "spots boot + smoke":

| Repo | State | Dry mode needed |
|---|---|---|
| alpha-engine-data | DataPhase1 / MorningEnrich | spot_data_weekly.sh --preflight-only (preflight + universe-freshness scan, no polygon/FMP writes) |
| alpha-engine-data | RAGIngestion | spot_data_weekly.sh --rag-only --preflight-only (corpus reachability + secrets, no SEC/embedding writes) |
| alpha-engine-predictor | PredictorTraining | spot_train.sh --preflight-only (load + WF-gate-shape check, NO predictor/weights/ promotion) |
| alpha-engine-backtester | Backtester / Parity / Evaluator | spot_backtest.sh --mode=smoke + simulate-dry, NO config/*.json auto-apply (freeze_evaluator pattern) |
| alpha-engine-data | SaturdayHealthCheck / WeeklySubstrateHealthCheck | shell_run-aware missing-Friday-bar tolerance (ROADMAP owed-work item 5) |

Research / predictor-inference / executor already have --dry-run/--simulate; wiring those into the SF states is part of the per-state follow-ons above.

Tests

tests/test_sf_friday_shell_run_wiring.py — 23 cases: strict-superset edges, JsonMerge user-input-wins order, every skip-gate covered by the defaults blob (drift guard), full happy-path traversal for shell_run true vs absent, Friday rule DISABLED + same-SF + shell_run=true + cron. Updated two pre-spine wiring tests (morning_enrich_split, substrate_check) to assert through the new gates while pinning Default == pre-spine target. Full suite: 1242 passed, 1 skipped (pre-existing, unrelated). No new pip deps. No secrets.
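
The "every skip-gate covered by the defaults blob" drift guard might look roughly like the sketch below. This is not the contents of the real test file: the two-state machine, the `skip_backtester` flag name, and the helper are hypothetical miniatures that only illustrate the idea of failing when a new skip gate is added without a matching default.

```python
# Hypothetical miniature of the state machine; the real file has 16 skip gates.
STATES = {
    "CheckSkipResearch": {
        "Type": "Choice",
        "Choices": [{"Variable": "$.skip_research", "BooleanEquals": True, "Next": "DataPhase2"}],
        "Default": "Research",
    },
    "CheckSkipBacktester": {
        "Type": "Choice",
        "Choices": [{"Variable": "$.skip_backtester", "BooleanEquals": True, "Next": "Parity"}],
        "Default": "Backtester",
    },
}
SHELL_RUN_DEFAULTS = {"skip_research": True, "skip_backtester": True}


def skip_flags_in_gates(states: dict) -> set[str]:
    """Collect every skip_* variable referenced by a Choice rule."""
    flags = set()
    for state in states.values():
        if state.get("Type") != "Choice":
            continue
        for rule in state.get("Choices", []):
            name = rule.get("Variable", "").removeprefix("$.")
            if name.startswith("skip_"):
                flags.add(name)
    return flags


def test_every_skip_gate_is_covered_by_defaults():
    missing = skip_flags_in_gates(STATES) - set(SHELL_RUN_DEFAULTS)
    assert not missing, f"skip gates missing from shell-run defaults: {sorted(missing)}"


test_every_skip_gate_is_covered_by_defaults()
```

Deriving the expected flag set from the state machine itself, rather than hard-coding it in the test, is what makes this a drift guard: adding a gate without extending the defaults fails the suite.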

🤖 Generated with Claude Code

cipher813 and others added 2 commits May 18, 2026 10:08
…e; rule shipped disabled)

Foundational spine of ROADMAP "Scheduled Friday-PM 'shell run'" (P1, added
2026-05-16) — the *prevention* half of Saturday-SF reliability (the
*containment* half, preflight-task-split, shipped 2026-05-16 in data
#249/#250). Surfaces a Saturday-fatal bootstrap break ~11.5h before the
unattended Sat 02:00 PT firing, inside an operator-awake Friday-evening fix
window, instead of as a Saturday-morning-after lost-week incident.

STRICT SUPERSET — shell_run absent/false ⇒ byte-identical to today's real
Saturday run. Only two existing edges change, each routed through a new
Choice whose Default is the pre-spine target:
  InitializeInput.Next: CheckSkipMorningEnrich -> CheckShellRun
    (Default -> CheckSkipMorningEnrich; unchanged for the real run)
  WaitForWeeklySubstrateHealthCheck.Next: NotifyComplete -> CheckShellRunNotify
    (Default -> NotifyComplete; the real Saturday SUCCESS email is untouched)

shell_run propagation (mirrors the existing skip_*/JsonMerge precedent
exactly — no new mechanism invented):
  CheckShellRun (Choice): {"shell_run": true} -> ApplyShellRunDefaults
  ApplyShellRunDefaults (Pass): States.JsonMerge(<all 16 skip_*=true>, $, false)
    layers every skip flag = true UNDER the execution input so an explicit
    per-flag override still wins (e.g. {"shell_run":true,"skip_research":false}
    still runs Research). Every workload state already has a Choice-gated
    skip_*, so the whole workload no-ops via the EXISTING skip mechanism.

Per-state dry-vs-skip inventory under shell_run (spine = pure-skip; per-module
--preflight-only/--dry-run "spots boot + smoke" are SCOPED FOLLOW-ONS):
  SKIPPED via existing skip_* gate (16): MorningEnrich, DataPhase1,
    RAGIngestion, RegimeSubstrate, RegimeRetrospectiveEval, Research,
    DataPhase2, EvalJudge(+RollingMean), RationaleClustering,
    ReplayConcordance, Counterfactual, PredictorTraining, DriftDetection,
    Backtester, Parity, Evaluator
  STILL RUNS (read-only, no skip gate by design — exactly the bootstrap/
    transport smoke the shell run wants Friday PM): SaturdayHealthCheck,
    WeeklySubstrateHealthCheck. Their shell_run-aware missing-Friday-bar
    tolerance is ROADMAP owed-work item 5 (scoped follow-on).
  NOTIFY: NotifyShellRunComplete (shell-run-tagged Subject, reuses the exact
    NotifyComplete SNS substrate — alpha-engine-alerts topic, same Resource).

Friday EventBridge rule (CFN, the documented infra-as-code home for
EventBridge rules — SaturdayTrigger/WeekdayTrigger live there):
  FridayShellRunTrigger, cron(30 21 ? * FRI *) = 21:30 UTC Fri =
  14:30 PT (PDT, dominant season) / 13:30 PT (PST). Chosen AFTER the Friday
  EOD SF (~1:25 PT) so it never collides with PostMarketData/EODReconcile/
  StopTradingInstance on the trading instance, and ~11.5h BEFORE the real
  Sat 09:00 UTC firing. Targets the SAME alpha-engine-saturday-pipeline SF
  (NOT a parallel SF) with {"shell_run": true}, same EventBridgeSfnRoleArn —
  the existing states:StartExecution grant is SF-ARN-scoped so NO IAM change
  is needed.
  SHIPPED State: DISABLED — zero-risk merge. Additive observability, NOT a
  backstop (the "fail loud, no backstop" design decision stands).
  Operator enable step:
    aws events enable-rule --name alpha-engine-friday-shell-run --region us-east-1

Consolidated-notify decision: shell-run SUCCESS is delivered by reusing the
existing NotifyComplete SNS pattern with a SHELL RUN-tagged Subject (zero new
infra). A shell-run FAILURE reuses the unchanged HandleFailure (its 20
inbound error edges deliberately NOT re-pointed: high churn, zero added
operator value, and would perturb the real Saturday failure path's risk
surface — the FAILED alert's Friday execution timestamp/ID is the actionable
signal). The richer per-state pass/fail report (ROADMAP design point 5) is a
scoped follow-on.

Scoped per-module follow-on PRs (repo -> state -> dry mode needed; NOT done
here — these convert "skipped" to "spots boot + smoke"):
  alpha-engine-data -> DataPhase1/MorningEnrich -> spot_data_weekly.sh
    --preflight-only (preflight + universe-freshness scan, no polygon/FMP
    writes); shell_run-aware tolerance for "Friday bar not yet present"
  alpha-engine-data -> RAGIngestion -> spot_data_weekly.sh --rag-only
    --preflight-only (corpus reachability + secrets, no SEC/embedding writes)
  alpha-engine-predictor -> PredictorTraining -> spot_train.sh --preflight-only
    (load + WF-gate-shape check, NO predictor/weights/ promotion)
  alpha-engine-backtester -> Backtester/Parity/Evaluator -> spot_backtest.sh
    --mode=smoke + simulate-dry, NO config/*.json auto-apply
    (freeze_evaluator pattern is the model)
  alpha-engine-data -> SaturdayHealthCheck/WeeklySubstrateHealthCheck ->
    shell_run-aware missing-Friday-bar tolerance (ROADMAP owed-work item 5)
  (Research/predictor-inference/executor already have --dry-run/--simulate;
   wiring those into the SF states is part of the per-state follow-ons above.)

Tests: tests/test_sf_friday_shell_run_wiring.py (23 cases — strict-superset
edges, JsonMerge user-input-wins order, every skip-gate covered by the
defaults blob, full happy-path traversal for shell_run true vs absent,
Friday rule DISABLED + same-SF + shell_run=true + cron). Updated two
pre-spine wiring tests (morning_enrich_split, substrate_check) to assert
through the new gates while pinning Default == pre-spine target. Full suite:
1242 passed, 1 skipped (pre-existing, unrelated). No new pip deps. No secrets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit a59c95f into main May 18, 2026
1 check passed
@cipher813 cipher813 deleted the feat/sf-friday-shell-run branch May 18, 2026 18:36
cipher813 added a commit that referenced this pull request May 18, 2026
…un instead of skip (#260)

* feat(sf): shell-run keystone — spot --preflight-only + Lambda --dry-run instead of skip

Converts #258's pure-skip shell_run into actual boot+dry execution of the
Saturday SF workload. ApplyShellRunDefaults no longer force-sets all 16
skip_* true; it now sets a single preflight_args=" --preflight-only"
suffix var (driving the 7 spot states' States.Format command), Lambda dry
flags for the 4 verified-clean Lambda states, and hard-skips ONLY the 5
documented no-clean-dry-path exceptions. InitializeInput seeds the control
vars at non-dry identity values so the shell_run-absent path is
byte-identical (spots) / behaviourally identical (Lambdas) to today's real
Saturday run.

Invariant preserved + test-proven: shell_run absent/false ⇒ every spot
command string char-for-char unchanged (TestByteIdenticalAbsentPath
resolves the States.Array/States.Format intrinsics with preflight_args=""
and asserts equality against origin/main).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(sf): hermetic byte-identical baseline — fix CI (origin/main not in shallow PR checkout)

The keystone byte-identical proof shelled out to
`git show origin/main:infrastructure/step_function.json` at test time.
GitHub Actions' shallow PR checkout has no `origin/main` local ref →
`subprocess.CalledProcessError ... exit status 128` → `test` check failed.

Replace the live-git `orig_sf` fixture with a committed frozen baseline
`tests/fixtures/sf_prekeystone_spot_commands.json` (the RESOLVED
pre-keystone spot command lists captured from origin/main; handles the
states already on commands.$ — Backtester/Parity/Evaluator). The proof
is now hermetic and still a true regression guard against the
strict-superset invariant. Docstring documents deliberate-regeneration.

Suite: 1337 passed, 1 skipped (unchanged). Keystone file 43/43.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
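
The byte-identical invariant from the keystone commit above can be illustrated with a small Python sketch. `states_format` is a simplified stand-in for the `States.Format` intrinsic (it only fills `{}` placeholders in order), and the command template and baseline string are hypothetical; only the `spot_train.sh` name and the `" --preflight-only"` suffix value come from the commit message.

```python
def states_format(template: str, *args: str) -> str:
    """Simplified stand-in for States.Format: fill each {} placeholder in order."""
    parts = template.split("{}")
    if len(parts) - 1 != len(args):
        raise ValueError("placeholder count does not match argument count")
    out = parts[0]
    for arg, tail in zip(args, parts[1:]):
        out += arg + tail
    return out


# Hypothetical command template and pre-keystone baseline string.
TEMPLATE = "bash spot_train.sh{}"
BASELINE = "bash spot_train.sh"

real_run = states_format(TEMPLATE, "")                    # preflight_args seeded to ""
shell_run = states_format(TEMPLATE, " --preflight-only")  # value set for a shell run

assert real_run == BASELINE  # the absent path is char-for-char unchanged
print(shell_run)
```

Carrying the leading space inside the suffix value, rather than in the template, is what lets the empty string reproduce the original command exactly.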
cipher813 added a commit that referenced this pull request May 18, 2026
…perator request) (#262)

Per Brian: the Friday shell-run should fire at 1:45 PM PT. Retimes the
existing alpha-engine-friday-shell-run EventBridge rule (#258) from
cron(30 21 ? * FRI *) (14:30 PT PDT) to cron(45 20 ? * FRI *) (1:45 PM
PT PDT / 12:45 PM PT PST). Single rule retimed, NOT a second redundant
rule (per feedback_dont_add_redundant_scheduled_infra). Still after the
Friday EOD SF (~1:25 PT, no trading-instance collision) and ~12h before
the real Sat 09:00 UTC firing. Rule stays State: DISABLED — operator
enables deliberately via `aws events enable-rule`. Comments + the
wiring test's pinned cron/timing updated. Suite 1337 passed.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>