feat(sf): add WeeklySubstrateHealthCheck state at end of Saturday SF by cipher813 · Pull Request #175 · cipher813/alpha-engine-data

cipher813 · 2026-05-06T20:25:13Z

Summary

Inserts WeeklySubstrateHealthCheck + WaitForWeeklySubstrateHealthCheck between the existing WaitForSaturdayHealthCheck and NotifyComplete.
New states invoke python -m alpha_engine_lib.transparency --cadence weekly --alert on the dashboard EC2 (the Sat SF dispatcher), running the row-driven substrate health checker shipped in alpha-engine-lib v0.5.0 (alpha-engine-lib PR #23).
15 new wiring tests pin chain ordering, Catch semantics, command shape, and ResultPath isolation. 490 total passing.

Why

The Phase 2 → 3 gate is "≥99% of inventory rows pass for 8 consecutive weeks." Without an enforced check, that's a number computed retrospectively from data; with a per-row substrate check + per-row CloudWatch alarm, it becomes a property the system continuously enforces — a failed row pages in <24h instead of being caught at next eyeball pass.
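One way to read the gate as arithmetic is sketched below. It assumes "pass rate" means the fraction of inventory rows passing within a single weekly run and that each of the 8 most recent runs must clear 99%. The function names and the empty-run rule are illustrative and are not taken from alpha-engine-lib.

```python
from typing import Sequence

GATE_PASS_RATE = 0.99  # ">=99% of inventory rows pass"
GATE_WEEKS = 8         # "for 8 consecutive weeks"


def weekly_pass_rate(row_results: Sequence[bool]) -> float:
    """Fraction of inventory rows that passed in one weekly run."""
    if not row_results:
        # Assumption: a run with no rows cannot count toward the gate.
        return 0.0
    return sum(row_results) / len(row_results)


def gate_met(weekly_runs: Sequence[Sequence[bool]]) -> bool:
    """True when each of the most recent GATE_WEEKS runs meets GATE_PASS_RATE."""
    if len(weekly_runs) < GATE_WEEKS:
        return False
    recent = weekly_runs[-GATE_WEEKS:]
    return all(weekly_pass_rate(run) >= GATE_PASS_RATE for run in recent)
```

Under this reading a single bad week resets the count, which is why catching a failed row within a day matters more than computing the rate after the fact.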

The existing artifact-freshness SaturdayHealthCheck and behavioral DriftDetection continue to run unchanged. The substrate check is a different abstraction (row-driven inventory validation with content assertions, not just last-modified age) and runs alongside them. To avoid keeping two overlapping staleness checks long-term (the two-staleness-vectors problem), SaturdayHealthCheck retirement is planned after ~4-6 weeks of green substrate runs, once row coverage proves out.

Wiring

... → Counterfactual → SaturdayHealthCheck → WaitForSaturdayHealthCheck
    → WeeklySubstrateHealthCheck → WaitForWeeklySubstrateHealthCheck → NotifyComplete

Both new states are non-blocking (Catch routes to NotifyComplete), following the same pattern as SaturdayHealthCheck. The pipeline halts only on hard infra failure; row-level failures fire CloudWatch alarms via the --alert flag, which publishes to the existing alpha-engine-alerts SNS topic.
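The chain ordering and Catch routing described above can be checked mechanically. The sketch below uses a hypothetical, trimmed state map: only the Next and Catch fields are modeled, the `States.ALL` error matcher is an assumption, and the real definition in the repository may differ in every other detail.

```python
# Hypothetical, trimmed view of the tail of the Saturday state machine.
STATES = {
    "SaturdayHealthCheck": {"Next": "WaitForSaturdayHealthCheck"},
    "WaitForSaturdayHealthCheck": {"Next": "WeeklySubstrateHealthCheck"},
    "WeeklySubstrateHealthCheck": {
        "Next": "WaitForWeeklySubstrateHealthCheck",
        # Assumed matcher; the PR only says Catch routes to NotifyComplete.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyComplete"}],
    },
    "WaitForWeeklySubstrateHealthCheck": {
        "Next": "NotifyComplete",
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyComplete"}],
    },
    "NotifyComplete": {"End": True},
}


def walk_chain(states: dict, start: str) -> list:
    """Follow Next pointers from start until a state has no Next."""
    chain, seen, current = [], set(), start
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        chain.append(current)
        current = states[current].get("Next")
    return chain


def catch_targets(states: dict, name: str) -> set:
    """Set of states that the named state's Catch clauses route to."""
    return {clause["Next"] for clause in states[name].get("Catch", [])}
```

A wiring test in this style would assert that `walk_chain` returns the five states in the order shown in the diagram and that both new states catch to `NotifyComplete` only.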

Dependencies

alpha-engine-lib PR #23 ✅ merged, tagged v0.5.0
alpha-engine-dashboard PR #64 (lib pin bump v0.2.2 → v0.5.0): must merge before this PR so that the pip install -r requirements.txt step inside the new SF state on the dashboard EC2 pulls the right lib version

Test plan

pytest tests/test_sf_substrate_check_wiring.py — 15 new tests passing
pytest (full suite) — 490 total passing, no regressions
After merge + deploy: ad-hoc Sat SF test run (skip_* flags to short-circuit upstream stages) exercises the new states end-to-end before Sat 5/9 production fire
Verify AlphaEngine/Substrate namespace appears in CloudWatch with per-row SubstrateRowOK metrics + 3 aggregates (SubstrateChecksOK / Failed / Pending)
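For the last verification step, it may help to picture what one run's datapoints could look like. This is a sketch under assumptions, not the library's emitter: the dimension name `RowId`, the name `SubstrateChecksPending` (expanded from "Failed / Pending" above), and the mapping of `not_yet_effective` to "pending" are all guesses. The 1/0 encoding of SubstrateRowOK follows the description in the #176 commit message.

```python
from typing import Dict, List

NAMESPACE = "AlphaEngine/Substrate"  # namespace named in the test plan
ROW_DIMENSION = "RowId"              # hypothetical dimension name


def build_metric_data(row_status: Dict[str, str]) -> List[dict]:
    """Build CloudWatch MetricData entries from a {row_id: status} mapping.

    Status is one of "ok", "not_yet_effective", or "fail".
    """
    counts = {"ok": 0, "not_yet_effective": 0, "fail": 0}
    data = []
    for row_id, status in sorted(row_status.items()):
        if status not in counts:
            raise ValueError(f"unknown status {status!r} for row {row_id!r}")
        counts[status] += 1
        data.append({
            "MetricName": "SubstrateRowOK",
            "Dimensions": [{"Name": ROW_DIMENSION, "Value": row_id}],
            "Value": 0.0 if status == "fail" else 1.0,  # 1=ok/not_yet_effective, 0=fail
            "Unit": "Count",
        })
    aggregates = {
        "SubstrateChecksOK": counts["ok"],
        "SubstrateChecksFailed": counts["fail"],
        "SubstrateChecksPending": counts["not_yet_effective"],  # assumed mapping
    }
    for name, value in aggregates.items():
        data.append({"MetricName": name, "Value": float(value), "Unit": "Count"})
    return data
```

The returned list has the shape that boto3's `put_metric_data(Namespace=..., MetricData=...)` accepts, so three rows would produce three per-row datapoints plus the three aggregates to look for in the console.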

Follow-ups

CloudWatch alarms PR (per-row alarm on SubstrateRowOK < 1 + aggregate alarm on SubstrateChecksFailed > 0 → SNS)
Daily-cadence equivalent state at end of EOD weekday SF
Deprecate SaturdayHealthCheck after 4-6 weeks of green substrate runs

🤖 Generated with Claude Code

Inserts ``WeeklySubstrateHealthCheck`` + ``WaitForWeeklySubstrateHealthCheck`` between the existing ``WaitForSaturdayHealthCheck`` and ``NotifyComplete``. The new states invoke ``python -m alpha_engine_lib.transparency --cadence weekly --alert`` on the dashboard EC2 (Sat SF dispatcher), running the row-driven substrate health checker shipped in alpha-engine-lib v0.5.0.

The substrate check is the enforced half of the Phase 2 → 3 observation gate. Per-row CloudWatch metrics emit to AlphaEngine/Substrate so individual rows have their own alarms: a failed row pages immediately and decrements the 8-week observation-gate denominator for that row instead of letting the failure get noticed retrospectively.

The existing artifact-freshness ``SaturdayHealthCheck`` and behavioral ``DriftDetection`` continue to run unchanged. The substrate check is a different abstraction (row-driven inventory validation, content assertions) and runs in parallel; SaturdayHealthCheck retirement is planned after ~4-6 weeks of green substrate runs.

Both new states are non-blocking (Catch routes to NotifyComplete) per the same pattern as SaturdayHealthCheck: the pipeline halts only on hard infra failure, and row-level failures fire CloudWatch alarms.

15 new wiring tests pin the chain ordering, Catch semantics, command shape (``--cadence weekly``, ``--alert``, dashboard EC2, git pull before run), and ResultPath isolation between freshness and substrate states. 490 total passing.

Requires: alpha-engine-dashboard PR bumping lib pin to v0.5.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…substrate (#176)

Adds infrastructure/setup_substrate_alarms.sh, an idempotent operator script that creates one CloudWatch alarm per inventory row plus one aggregate failure alarm. All point to the existing alpha-engine-alerts SNS topic.

Per-row alarm (alpha-engine-substrate-<row_id>): fires when the SubstrateRowOK metric for that row drops below 1. The lib emits 1=ok/not_yet_effective, 0=fail, so a single fail in a 24h window triggers SNS via Statistic=Minimum.

Aggregate alarm (alpha-engine-substrate-aggregate-failures): fires when SubstrateChecksFailed > 0. Safety net for accidental per-row alarm deletion; per-row alarms remain authoritative for which row failed.

treat-missing-data=notBreaching keeps weekly-cadence rows quiet between Sat-SF emissions; only emitted-and-failed datapoints fire.

Row enumeration sources from alpha_engine_lib.transparency.load_inventory(), so adding a row to the YAML and re-running this script automatically adds the corresponding alarm. No hardcoded row list to drift.

Bumps alpha-engine-lib pin v0.3.0 → v0.5.0 so the test imports of DEFAULT_NAMESPACE_OUT (added in lib #23) resolve.

15 new tests pin namespace alignment with lib, SNS target, row enumeration source, alarm semantics (LessThanThreshold + Minimum + notBreaching), and execution order (topic check before alarm creation). 505 total passing.

Operator runs once after data #175 deploys:

    pip install -r requirements.txt   # gets v0.5.0
    ./infrastructure/setup_substrate_alarms.sh

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
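The per-row alarm semantics above (LessThanThreshold + Minimum + notBreaching) can be illustrated with a small sketch. The script itself is shell; the Python below only mirrors the settings the message names. The dimension name `RowId`, the 86400-second period standing in for the "24h window", and the single evaluation period are assumptions, and `alarm_state` is a deliberately simplified model of how CloudWatch evaluates one period.

```python
from typing import Sequence


def per_row_alarm_kwargs(row_id: str, sns_topic_arn: str) -> dict:
    """Keyword arguments in the shape boto3's put_metric_alarm accepts."""
    return {
        "AlarmName": f"alpha-engine-substrate-{row_id}",
        "Namespace": "AlphaEngine/Substrate",
        "MetricName": "SubstrateRowOK",
        "Dimensions": [{"Name": "RowId", "Value": row_id}],  # hypothetical name
        "Statistic": "Minimum",
        "Period": 86400,           # assumed: one 24h window
        "EvaluationPeriods": 1,    # assumed
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def alarm_state(datapoints: Sequence[float], threshold: float = 1.0) -> str:
    """Simplified one-period evaluation: Minimum statistic vs. threshold."""
    if not datapoints:
        # notBreaching: a period with no data is treated as within threshold.
        return "OK"
    return "ALARM" if min(datapoints) < threshold else "OK"
```

Because the statistic is the minimum, one 0 among any number of 1s in the window is enough to alarm, while a week with no emissions stays quiet.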

#178)

Mirrors the Saturday-SF WeeklySubstrateHealthCheck (PR #175) into the weekday EOD SF, running ``python -m alpha_engine_lib.transparency --cadence daily --alert`` on the dashboard EC2 between EODReconcile success and StopTradingInstance.

Closes the Phase 2 → 3 gap where rows 4/5/6 of transparency_inventory (lineage, risk_events, residual_pct) only got checked once per week despite emitting daily; a bad emission Mon-Thu would otherwise sit undetected until Saturday's run.

The same per-row CloudWatch alarms from PR #176 cover daily emissions (the SubstrateRowOK metric is cadence-agnostic). No new alarms needed.

Both new states are non-blocking (Catch routes to StopTradingInstance) so a substrate-check infra failure can never leave the trading EC2 running overnight (cost-guard requirement).

Refactors update_eod_pipeline_sf.sh to read the SF definition from infrastructure/step_function_eod.json instead of an inline heredoc, matching the deploy_step_function.sh pattern for the Saturday SF. The JSON file is now the single source of truth; wiring tests pin its contents. Eliminates the two-staleness-vectors antipattern that had the heredoc and the JSON file diverging silently.

17 new wiring tests pin chain ordering, Catch semantics, command shape (--cadence daily, --alert, dashboard EC2, git pull before run), ResultPath isolation, and instance targeting (dashboard EC2 not trading EC2). 519 total passing.

Requires: alpha-engine PR adding ec2_instance_id to the daemon's EOD SF input (DailySubstrateHealthCheck targets $.ec2_instance_id, which the daemon trigger now populates with the dashboard EC2 instance id).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
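The cost-guard requirement amounts to a reachability property: from the new states, every success path and every error path must pass through StopTradingInstance. A minimal sketch of such a check follows. The JSON is a hypothetical, trimmed stand-in for infrastructure/step_function_eod.json (Resource, Parameters, and ResultPath are omitted), and the name `WaitForDailySubstrateHealthCheck` is assumed by analogy with the Saturday SF.

```python
import json

# Hypothetical, trimmed stand-in for the real EOD definition file.
EOD_DEFINITION = json.loads("""
{
  "StartAt": "EODReconcile",
  "States": {
    "EODReconcile": {"Type": "Task", "Next": "DailySubstrateHealthCheck"},
    "DailySubstrateHealthCheck": {
      "Type": "Task",
      "Next": "WaitForDailySubstrateHealthCheck",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "StopTradingInstance"}]
    },
    "WaitForDailySubstrateHealthCheck": {
      "Type": "Task",
      "Next": "StopTradingInstance",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "StopTradingInstance"}]
    },
    "StopTradingInstance": {"Type": "Task", "End": true}
  }
}
""")


def successors(state: dict) -> list:
    """All states reachable in one step, via Next or any Catch clause."""
    out = [state["Next"]] if "Next" in state else []
    out += [clause["Next"] for clause in state.get("Catch", [])]
    return out


def always_reaches(states: dict, start: str, target: str, _path: tuple = ()) -> bool:
    """True if every path from start (success or error) passes through target."""
    if start == target:
        return True
    if start in _path:
        return False  # a cycle that never reaches the target
    nxt = successors(states[start])
    if not nxt:
        return False  # terminal state reached without visiting the target
    return all(always_reaches(states, s, target, _path + (start,)) for s in nxt)
```

Checking both edge types together is the point: a Catch that routed to a Fail state would pass a pure Next-chain test while still leaving the instance running.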

…n dry path — closes DriftDetection skip-exception) (#261)

Adds a `--preflight-only` modifier to infrastructure/spot_drift_detection.sh, mirroring the merged #259 (spot_data_weekly.sh) / predictor #175 / backtester #224 pattern. Closes the DriftDetection skip-exception in ROADMAP "Friday shell-run — per-module dry-path activation": the one per-module SF step still SKIPPED rather than dry-run on the Friday shell_run.

Insertion point
---------------
`PREFLIGHT_ONLY=0` modifier var initialised before the arg-parse loop (orthogonal to RUN_MODE, `set -u` safe); `--preflight-only) PREFLIGHT_ONLY=1` added to the case loop. The guard block is inserted AFTER the smoke-only block and strictly BEFORE the "# ── Full drift detection ──" section (the `run_remote bash -s <<DRIFT` heredoc) and before the trailing `aws cloudwatch put-metric-data` heartbeat.

No-scan / no-write proof
------------------------
`monitoring.drift_detector` (in alpha-engine-predictor, on the sibling-clone PYTHONPATH) is the SOLE code path that does any S3 get_object/put_object of the drift report or SNS publish on alert; the launcher's CloudWatch put-metric-data heartbeat trails it. The PREFLIGHT_ONLY guard `exit 0`s strictly before the `<<DRIFT` heredoc, so the scan, the SNS publish, the S3 put_object, and the CloudWatch emit are all statically unreachable. The preflight itself runs only BasePreflight.check_env_vars (env read) + BasePreflight.check_s3_bucket (bucket HEAD) + an `importlib.import_module` of the drift module (import-only: boto3 clients + check_drift()/main() sit behind `if __name__ == "__main__"`, which an import does not trigger). Zero external API data fetch, zero S3/CW/SNS/config mutation; exit 0 because a passed preflight is a healthy outcome (SSM/SF report Success).

Preflight substrate reused
--------------------------
The drift workload binary lives in alpha-engine-predictor (no --preflight-only of its own; out of scope to modify here) and this repo's preflight.py DataPreflight modes (daily/morning_enrich/phase1/phase2) are data-collection scoped; none maps to drift. Per the canonical-lib fallback the preflight composes `alpha_engine_lib.preflight.BasePreflight` DIRECTLY (env-vars + S3 HEAD), with no bespoke preflight scaffolding duplicated.

Verbatim flag name: `--preflight-only`

Tests
-----
New tests/test_spot_drift_detection_preflight_only.py (5 static greps/source-position assertions, mirroring tests/test_preflight_only_dry_path.py): flag parses as a modifier; guard precedes DRIFT + heartbeat; exit 0 before DRIFT; no scan/S3/CW/SNS in block; canonical BasePreflight reused (no scaffolding). `bash -n` clean. Full data suite: 1342 passed, 1 skipped (pre-existing), 5 pre-existing warnings.

Independent of #260: that PR touches spot_data_weekly.sh + the Lambda dry-run keystone (a different file); the Saturday/Friday SF rewire to route the DriftDetection state at this `--preflight-only` flag under the Friday shell_run is a separate follow-on (no step_function.json change here).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
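The "static greps/source-position assertions" mentioned under Tests can be pictured roughly as follows. This is a sketch, not the repository's test file: `SCRIPT` is a hypothetical miniature of spot_drift_detection.sh built only from the fragments quoted in the message, and the guard's exact shell syntax and the forbidden-token list are assumptions.

```python
# Hypothetical miniature of the launcher script; the real file is longer.
SCRIPT = """\
PREFLIGHT_ONLY=0
for arg in "$@"; do
  case "$arg" in
    --preflight-only) PREFLIGHT_ONLY=1 ;;
  esac
done
if [ "$PREFLIGHT_ONLY" -eq 1 ]; then
  echo "preflight passed"
  exit 0
fi
# ── Full drift detection ──
run_remote bash -s <<DRIFT
python -m monitoring.drift_detector
DRIFT
aws cloudwatch put-metric-data --namespace Example --metric-name Heartbeat --value 1
"""

GUARD_OPEN = 'if [ "$PREFLIGHT_ONLY" -eq 1 ]'  # assumed guard syntax
# Assumed tokens that would indicate a scan or a write inside the guard.
FORBIDDEN = ("<<DRIFT", "drift_detector", "put-metric-data", "sns publish", "put-object")


def guard_block(script: str) -> str:
    """Text of the guard, from its opening `if` up to its closing `fi`."""
    start = script.index(GUARD_OPEN)
    end = script.index("\nfi", start)
    return script[start:end]


def guard_is_safe(script: str) -> bool:
    """Guard precedes the heredoc and heartbeat, exits 0, and does no work."""
    guard_at = script.index(GUARD_OPEN)
    heredoc_at = script.index("<<DRIFT")
    heartbeat_at = script.index("aws cloudwatch put-metric-data")
    block = guard_block(script)
    return (
        guard_at < heredoc_at < heartbeat_at
        and "exit 0" in block
        and not any(token in block for token in FORBIDDEN)
    )
```

Position checks like these never execute the shell script, so they can run in the ordinary pytest suite without AWS credentials.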

cipher813 and others added 2 commits May 6, 2026 13:24

Merge branch 'main' into feat/sf-substrate-weekly-check

821cbe3

cipher813 merged commit c7676d9 into main May 6, 2026
1 check passed

cipher813 deleted the feat/sf-substrate-weekly-check branch May 6, 2026 20:30

cipher813 mentioned this pull request May 6, 2026

feat(alarms): per-row + aggregate CloudWatch alarms for transparency substrate #176

Merged

5 tasks

cipher813 mentioned this pull request May 7, 2026

feat(sf): add DailySubstrateHealthCheck state at end of weekday EOD SF #178

Merged

5 tasks

cipher813 mentioned this pull request May 18, 2026

feat(data): spot_drift_detection.sh --preflight-only (Friday shell-run dry path — closes DriftDetection skip-exception) #261

Merged
