feat(sf): add WeeklySubstrateHealthCheck state at end of Saturday SF #175

Merged
cipher813 merged 2 commits into main from feat/sf-substrate-weekly-check on May 6, 2026

@cipher813
Owner

Summary

  • Inserts WeeklySubstrateHealthCheck + WaitForWeeklySubstrateHealthCheck between the existing WaitForSaturdayHealthCheck and NotifyComplete.
  • The new states invoke python -m alpha_engine_lib.transparency --cadence weekly --alert on the dashboard EC2 (the Sat SF dispatcher), running the row-driven substrate health checker shipped in alpha-engine-lib v0.5.0 (lib PR #23).
  • 15 new wiring tests pin chain ordering, Catch semantics, command shape, and ResultPath isolation. 490 total passing.

Why

The Phase 2 → 3 gate is "≥99% of inventory rows pass for 8 consecutive weeks." Without an enforced check, that is a number computed retrospectively from data. With a per-row substrate check plus a per-row CloudWatch alarm, it becomes a property the system continuously enforces: a failed row pages in <24h instead of waiting to be caught at the next manual review.
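
As a rough illustration of the gate arithmetic, the sketch below evaluates the "≥99% for 8 consecutive weeks" rule over a list of weekly pass fractions. The function names and the choice to evaluate the most recent eight weeks are assumptions made for illustration; this is not the lib's implementation.

```python
from typing import Dict, Sequence


def pass_fraction(row_results: Dict[str, bool]) -> float:
    """Fraction of inventory rows that passed in one weekly run."""
    if not row_results:
        return 0.0
    return sum(row_results.values()) / len(row_results)


def gate_met(weekly_pass_fractions: Sequence[float],
             threshold: float = 0.99,
             required_weeks: int = 8) -> bool:
    """True when the most recent `required_weeks` runs all meet the threshold.

    Assumption: the gate looks at the trailing window only, so a single
    failing week resets the count.
    """
    if len(weekly_pass_fractions) < required_weeks:
        return False
    recent = weekly_pass_fractions[-required_weeks:]
    return all(fraction >= threshold for fraction in recent)
```

With this reading, seven green weeks followed by one week below 99% does not satisfy the gate, and neither does a history shorter than eight weeks.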

The existing artifact-freshness SaturdayHealthCheck and behavioral DriftDetection continue to run unchanged. The substrate check is a different abstraction (row-driven inventory validation with content assertions, not just last-modified age) and runs alongside them. To avoid maintaining two staleness vectors long-term, SaturdayHealthCheck retirement is planned after ~4-6 weeks of green substrate runs, once row coverage proves out.

Wiring

... → Counterfactual → SaturdayHealthCheck → WaitForSaturdayHealthCheck
    → WeeklySubstrateHealthCheck → WaitForWeeklySubstrateHealthCheck → NotifyComplete

Both new states are non-blocking (Catch routes to NotifyComplete), following the same pattern as SaturdayHealthCheck. The pipeline halts only on hard infra failure; row-level failures fire CloudWatch alarms via the --alert flag, publishing to the existing alpha-engine-alerts SNS topic.
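
The chain ordering and ResultPath isolation described here can be sketched as a small wiring check. The state fragment below is hypothetical: only the state names come from this PR, while the state types, ResultPath values, and Catch bodies are assumptions standing in for the real definition.

```python
# Hypothetical fragment of the Saturday state machine. Only wiring fields are
# shown; the ResultPath values and state types are assumptions.
CATCH_TO_NOTIFY = [{"ErrorEquals": ["States.ALL"], "Next": "NotifyComplete"}]

STATES = {
    "WaitForSaturdayHealthCheck": {
        "Type": "Task",
        "ResultPath": "$.health_check_wait",
        "Catch": CATCH_TO_NOTIFY,
        "Next": "WeeklySubstrateHealthCheck",
    },
    "WeeklySubstrateHealthCheck": {
        "Type": "Task",
        "ResultPath": "$.substrate_check",
        "Catch": CATCH_TO_NOTIFY,
        "Next": "WaitForWeeklySubstrateHealthCheck",
    },
    "WaitForWeeklySubstrateHealthCheck": {
        "Type": "Task",
        "ResultPath": "$.substrate_check_wait",
        "Catch": CATCH_TO_NOTIFY,
        "Next": "NotifyComplete",
    },
    "NotifyComplete": {"Type": "Task", "End": True},
}


def walk_chain(states, start):
    """Follow the happy-path Next pointers from `start` until a state has none."""
    chain = [start]
    while "Next" in states[chain[-1]]:
        nxt = states[chain[-1]]["Next"]
        if nxt in chain:
            raise ValueError(f"cycle detected at {nxt}")
        chain.append(nxt)
    return chain


def result_paths_are_isolated(states):
    """True when no two states write their result to the same ResultPath."""
    paths = [s["ResultPath"] for s in states.values() if "ResultPath" in s]
    return len(paths) == len(set(paths))
```

A wiring test in this style would load the real definition instead of the inline dict and assert on the same two properties.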

Dependencies

  • alpha-engine-dashboard PR bumping the alpha-engine-lib pin to v0.5.0

Test plan

  • pytest tests/test_sf_substrate_check_wiring.py — 15 new tests passing
  • pytest (full suite) — 490 total passing, no regressions
  • After merge + deploy: run an ad-hoc Sat SF test (using skip_* flags to short-circuit upstream stages) to exercise the new states end-to-end before the Sat 5/9 production run
  • Verify AlphaEngine/Substrate namespace appears in CloudWatch with per-row SubstrateRowOK metrics + 3 aggregates (SubstrateChecksOK / Failed / Pending)

Follow-ups

  • CloudWatch alarms PR (per-row alarm on SubstrateRowOK < 1 + aggregate alarm on SubstrateChecksFailed > 0 → SNS)
  • Daily-cadence equivalent state at end of EOD weekday SF
  • Deprecate SaturdayHealthCheck after 4-6 weeks of green substrate runs

🤖 Generated with Claude Code

cipher813 and others added 2 commits May 6, 2026 13:24
Inserts ``WeeklySubstrateHealthCheck`` + ``WaitForWeeklySubstrateHealthCheck``
between the existing ``WaitForSaturdayHealthCheck`` and ``NotifyComplete``.
The new states invoke ``python -m alpha_engine_lib.transparency
--cadence weekly --alert`` on the dashboard EC2 (Sat SF dispatcher),
running the row-driven substrate health checker shipped in
alpha-engine-lib v0.5.0.

The substrate check is the enforced half of the Phase 2 → 3 observation
gate. Per-row CloudWatch metrics emit to AlphaEngine/Substrate so
individual rows have their own alarms — a failed row pages immediately
and decrements the 8-week observation-gate denominator for that row
instead of letting the failure get noticed retrospectively.

The existing artifact-freshness ``SaturdayHealthCheck`` and behavioral
``DriftDetection`` continue to run unchanged. The substrate check is a
different abstraction (row-driven inventory validation, content
assertions) and runs in parallel; SaturdayHealthCheck retirement is
planned after ~4-6 weeks of green substrate runs.

Both new states are non-blocking (Catch routes to NotifyComplete) per
the same pattern as SaturdayHealthCheck — pipeline halts only on
hard infra failure, row-level failures fire CloudWatch alarms.

15 new wiring tests pin the chain ordering, Catch semantics, command
shape (``--cadence weekly``, ``--alert``, dashboard EC2, git pull
before run), and ResultPath isolation between freshness and substrate
states. 490 total passing.

Requires: alpha-engine-dashboard PR bumping lib pin to v0.5.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit c7676d9 into main May 6, 2026
1 check passed
@cipher813 cipher813 deleted the feat/sf-substrate-weekly-check branch May 6, 2026 20:30
cipher813 added a commit that referenced this pull request May 6, 2026
…substrate (#176)

Adds infrastructure/setup_substrate_alarms.sh — idempotent operator
script that creates one CloudWatch alarm per inventory row plus one
aggregate failure alarm. All point to the existing alpha-engine-alerts
SNS topic.

Per-row alarm (alpha-engine-substrate-<row_id>): fires when
SubstrateRowOK metric for that row drops below 1. The lib emits
1=ok/not_yet_effective, 0=fail, so a single fail in a 24h window
triggers SNS via Statistic=Minimum.

Aggregate alarm (alpha-engine-substrate-aggregate-failures): fires
when SubstrateChecksFailed > 0. Safety net for accidental per-row
alarm deletion — per-row alarms remain authoritative for which row
failed.

treat-missing-data=notBreaching keeps weekly-cadence rows quiet
between Sat-SF emissions; only emitted-and-failed datapoints fire.
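
The alarm semantics above (Minimum statistic, LessThanThreshold against 1, missing data treated as notBreaching) can be modelled with a short simulation. This is a simplified single-evaluation-period model written for illustration, not CloudWatch's actual evaluation logic; the function name is hypothetical.

```python
from typing import Sequence


def per_row_alarm_state(datapoints: Sequence[float], threshold: float = 1.0) -> str:
    """Simplified model of one evaluation window for a per-row alarm.

    - No datapoints in the window: treated as not breaching, so the alarm is OK.
    - Otherwise the Minimum statistic is compared with LessThanThreshold.
    """
    if not datapoints:
        return "OK"  # treat-missing-data=notBreaching
    return "ALARM" if min(datapoints) < threshold else "OK"
```

Under this model a weekly-cadence row that emits nothing between Saturday runs stays quiet, while one failing datapoint (0) among passing ones (1) in the window is enough to trip the alarm.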

Row enumeration sources from alpha_engine_lib.transparency.load_inventory()
so adding a row to the YAML and re-running this script automatically
adds the corresponding alarm. No hardcoded row list to drift.
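
To illustrate how row enumeration can drive alarm creation, the sketch below builds one set of alarm parameters per row plus the aggregate alarm. The real script is a shell script; this Python version uses the keyword names accepted by boto3's `put_metric_alarm` and makes no AWS calls. The `RowId` dimension name, the 24h period, the aggregate alarm's `Maximum` statistic, and the plain list of row IDs standing in for `load_inventory()` are all assumptions.

```python
NAMESPACE = "AlphaEngine/Substrate"


def per_row_alarm_kwargs(row_id: str, topic_arn: str) -> dict:
    """Parameters for one per-row alarm (fires when SubstrateRowOK < 1)."""
    return {
        "AlarmName": f"alpha-engine-substrate-{row_id}",
        "Namespace": NAMESPACE,
        "MetricName": "SubstrateRowOK",
        "Dimensions": [{"Name": "RowId", "Value": row_id}],  # dimension name assumed
        "Statistic": "Minimum",
        "Period": 86400,  # 24h window, assumed from the description
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


def aggregate_alarm_kwargs(topic_arn: str) -> dict:
    """Parameters for the safety-net alarm (fires when SubstrateChecksFailed > 0)."""
    return {
        "AlarmName": "alpha-engine-substrate-aggregate-failures",
        "Namespace": NAMESPACE,
        "MetricName": "SubstrateChecksFailed",
        "Statistic": "Maximum",  # assumed
        "Period": 86400,
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


def build_alarms(row_ids, topic_arn: str) -> list:
    """One alarm per inventory row, followed by the aggregate alarm."""
    alarms = [per_row_alarm_kwargs(row_id, topic_arn) for row_id in row_ids]
    alarms.append(aggregate_alarm_kwargs(topic_arn))
    return alarms
```

Because `put_metric_alarm` creates or updates an alarm by name, passing each dict to it on every run would be one way to get the re-runnable behaviour described above.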

Bumps alpha-engine-lib pin v0.3.0 → v0.5.0 so the test imports of
DEFAULT_NAMESPACE_OUT (added in lib #23) resolve.

15 new tests pin namespace alignment with lib, SNS target, row
enumeration source, alarm semantics (LessThanThreshold + Minimum +
notBreaching), and execution order (topic check before alarm
creation). 505 total passing.

Operator runs once after data #175 deploys:
  pip install -r requirements.txt  # gets v0.5.0
  ./infrastructure/setup_substrate_alarms.sh

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 7, 2026
#178)

Mirrors the Saturday-SF WeeklySubstrateHealthCheck (PR #175) into the
weekday EOD SF, running ``python -m alpha_engine_lib.transparency
--cadence daily --alert`` on the dashboard EC2 between EODReconcile
success and StopTradingInstance.

Closes the Phase 2 → 3 gap where rows 4/5/6 of transparency_inventory
(lineage, risk_events, residual_pct) only got checked once per week
despite emitting daily — a bad emission Mon-Thu would otherwise sit
undetected until Saturday's run.

The same per-row CloudWatch alarms from PR #176 cover daily emissions
(SubstrateRowOK metric is cadence-agnostic). No new alarms needed.

Both new states are non-blocking (Catch routes to StopTradingInstance)
so a substrate-check infra failure can never leave the trading EC2
running overnight (cost-guard requirement).
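
The cost-guard property can be expressed as a check that every guarded state has a catch-all routing to StopTradingInstance. The fragment below is a hypothetical stand-in for the EOD definition: `DailySubstrateHealthCheck` and `StopTradingInstance` are named in this commit, while `WaitForDailySubstrateHealthCheck` and the Catch bodies are assumed by analogy with the Saturday SF.

```python
# Hypothetical fragment of the EOD state machine; only wiring fields are shown.
EOD_STATES = {
    "DailySubstrateHealthCheck": {
        "Type": "Task",
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "StopTradingInstance"}],
        "Next": "WaitForDailySubstrateHealthCheck",  # state name assumed
    },
    "WaitForDailySubstrateHealthCheck": {
        "Type": "Task",
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "StopTradingInstance"}],
        "Next": "StopTradingInstance",
    },
    "StopTradingInstance": {"Type": "Task", "End": True},
}


def states_missing_catch_all(states, guarded, fallback):
    """Return guarded state names that lack a States.ALL catch routing to `fallback`."""
    missing = []
    for name in guarded:
        catches = states[name].get("Catch", [])
        covered = any(
            "States.ALL" in catch.get("ErrorEquals", []) and catch.get("Next") == fallback
            for catch in catches
        )
        if not covered:
            missing.append(name)
    return missing
```

An empty result means any failure in the substrate states still reaches the instance-stop step.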

Refactors update_eod_pipeline_sf.sh to read the SF definition from
infrastructure/step_function_eod.json instead of an inline heredoc,
matching the deploy_step_function.sh pattern for the Saturday SF. The
JSON file is now the single source of truth; wiring tests pin its
contents. Eliminates the two-staleness-vectors antipattern that had
the heredoc and the JSON file diverging silently.

17 new wiring tests pin chain ordering, Catch semantics, command
shape (--cadence daily, --alert, dashboard EC2, git pull before run),
ResultPath isolation, and instance targeting (dashboard EC2 not
trading EC2). 519 total passing.

Requires: alpha-engine PR adding ec2_instance_id to the daemon's EOD
SF input (DailySubstrateHealthCheck targets \$.ec2_instance_id, which
the daemon trigger now populates with the dashboard EC2 instance id).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 18, 2026
…n dry path — closes DriftDetection skip-exception) (#261)

Adds a `--preflight-only` modifier to infrastructure/spot_drift_detection.sh,
mirroring the merged #259 (spot_data_weekly.sh) / predictor #175 /
backtester #224 pattern. Closes the DriftDetection skip-exception in
ROADMAP "Friday shell-run — per-module dry-path activation" — the one
per-module SF step still SKIPPED rather than dry-run on the Friday shell_run.

Insertion point
---------------
`PREFLIGHT_ONLY=0` modifier var initialised before the arg-parse loop
(orthogonal to RUN_MODE, `set -u` safe); `--preflight-only) PREFLIGHT_ONLY=1`
added to the case loop. The guard block is inserted AFTER the smoke-only
block and strictly BEFORE the "# ── Full drift detection ──" section (the
`run_remote bash -s <<DRIFT` heredoc) and before the trailing
`aws cloudwatch put-metric-data` heartbeat.

No-scan / no-write proof
------------------------
`monitoring.drift_detector` (in alpha-engine-predictor, on the sibling-clone
PYTHONPATH) is the SOLE code path that does any S3 get_object/put_object of
the drift report or SNS publish on alert; the launcher's CloudWatch
put-metric-data heartbeat trails it. The PREFLIGHT_ONLY guard `exit 0`s
strictly before the `<<DRIFT` heredoc, so the scan, the SNS publish, the S3
put_object, and the CloudWatch emit are all statically unreachable. The
preflight itself runs only BasePreflight.check_env_vars (env read) +
BasePreflight.check_s3_bucket (bucket HEAD) + an `importlib.import_module`
of the drift module (import-only — boto3 clients + check_drift()/main()
sit behind `if __name__ == "__main__"`, which an import does not trigger).
Zero external API data fetch, zero S3/CW/SNS/config mutation; exit 0
because a passed preflight is a healthy outcome (SSM/SF report Success).
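
The import-only claim rests on standard Python behaviour: `importlib.import_module` executes a module's top-level code, but anything behind `if __name__ == "__main__"` does not run. A minimal demonstration, using a throwaway module whose name and contents are hypothetical stand-ins for the drift module:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in module: the side effect sits behind the __main__ guard.
MODULE_SOURCE = '''
SIDE_EFFECTS = []

def check_drift():
    SIDE_EFFECTS.append("scan")

if __name__ == "__main__":
    check_drift()
'''


def import_only(module_name: str, source: str):
    """Write `source` to a temporary directory and import it by name."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, f"{module_name}.py").write_text(source)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            return importlib.import_module(module_name)
        finally:
            sys.path.remove(tmp)


module = import_only("fake_drift_detector", MODULE_SOURCE)
```

After the import, `module.check_drift` is defined but `module.SIDE_EFFECTS` is still empty. The guarantee covers only code behind the guard; any top-level statement in the imported module would still execute.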

Preflight substrate reused
--------------------------
The drift workload binary lives in alpha-engine-predictor (no
--preflight-only of its own; out of scope to modify here) and this repo's
preflight.py DataPreflight modes (daily/morning_enrich/phase1/phase2) are
data-collection scoped — none maps to drift. Per the canonical-lib
fallback the preflight composes `alpha_engine_lib.preflight.BasePreflight`
DIRECTLY (env-vars + S3 HEAD) — no bespoke preflight scaffolding duplicated.

Verbatim flag name: `--preflight-only`

Tests
-----
New tests/test_spot_drift_detection_preflight_only.py (5 static
greps/source-position assertions, mirroring
tests/test_preflight_only_dry_path.py): flag parses as a modifier;
guard precedes DRIFT + heartbeat; exit 0 before DRIFT; no scan/S3/CW/SNS
in block; canonical BasePreflight reused (no scaffolding). `bash -n`
clean. Full data suite: 1342 passed, 1 skipped (pre-existing), 5
pre-existing warnings.
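
A source-position assertion of the kind described can be sketched as follows. The embedded script is a hypothetical miniature of the launcher, not the real spot_drift_detection.sh, and the list of forbidden tokens is an assumption.

```python
# Hypothetical miniature of the launcher script, for illustration only.
SCRIPT = '''#!/usr/bin/env bash
set -euo pipefail
PREFLIGHT_ONLY=0
for arg in "$@"; do
  case "$arg" in
    --preflight-only) PREFLIGHT_ONLY=1 ;;
  esac
done
if [ "$PREFLIGHT_ONLY" = "1" ]; then
  echo "preflight passed"
  exit 0
fi
# ── Full drift detection ──
run_remote bash -s <<DRIFT
python -m monitoring.drift_detector
DRIFT
aws cloudwatch put-metric-data --namespace Example --metric-name Heartbeat --value 1
'''

GUARD_MARKER = 'if [ "$PREFLIGHT_ONLY"'
FORBIDDEN_IN_GUARD = ("put-metric-data", "drift_detector", "sns publish", "s3 cp")


def guard_block(script: str) -> str:
    """Text of the preflight guard, from its `if` up to the closing `fi`."""
    start = script.index(GUARD_MARKER)
    end = script.index("\nfi", start)
    return script[start:end]


def check_dry_path(script: str) -> list:
    """Return a list of problems; an empty list means the dry path looks safe."""
    problems = []
    guard_at = script.index(GUARD_MARKER)
    heredoc_at = script.index("<<DRIFT")
    heartbeat_at = script.index("aws cloudwatch put-metric-data")
    if not guard_at < heredoc_at:
        problems.append("guard does not precede the DRIFT heredoc")
    if not guard_at < heartbeat_at:
        problems.append("guard does not precede the heartbeat")
    block = guard_block(script)
    if "exit 0" not in block:
        problems.append("guard does not exit 0")
    for token in FORBIDDEN_IN_GUARD:
        if token in block:
            problems.append(f"guard contains {token}")
    return problems
```

Checks like these are static: they read the script as text and never execute it, which matches the grep-style approach described above.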

Independent of #260: that PR touches spot_data_weekly.sh + the Lambda
dry-run keystone (a different file); the Saturday/Friday SF rewire to
route the DriftDetection state at this `--preflight-only` flag under the
Friday shell_run is a separate follow-on (no step_function.json change here).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>