feat(sf): add WeeklySubstrateHealthCheck state at end of Saturday SF #175
Merged
Inserts ``WeeklySubstrateHealthCheck`` + ``WaitForWeeklySubstrateHealthCheck`` between the existing ``WaitForSaturdayHealthCheck`` and ``NotifyComplete``. The new states invoke ``python -m alpha_engine_lib.transparency --cadence weekly --alert`` on the dashboard EC2 (Sat SF dispatcher), running the row-driven substrate health checker shipped in alpha-engine-lib v0.5.0.

The substrate check is the enforced half of the Phase 2 → 3 observation gate. Per-row CloudWatch metrics emit to AlphaEngine/Substrate so individual rows have their own alarms: a failed row pages immediately and decrements the 8-week observation-gate denominator for that row instead of letting the failure get noticed retrospectively.

The existing artifact-freshness ``SaturdayHealthCheck`` and behavioral ``DriftDetection`` continue to run unchanged. The substrate check is a different abstraction (row-driven inventory validation, content assertions) and runs in parallel; SaturdayHealthCheck retirement is planned after ~4-6 weeks of green substrate runs.

Both new states are non-blocking (Catch routes to NotifyComplete) per the same pattern as SaturdayHealthCheck: the pipeline halts only on hard infra failure, and row-level failures fire CloudWatch alarms.

15 new wiring tests pin the chain ordering, Catch semantics, command shape (``--cadence weekly``, ``--alert``, dashboard EC2, git pull before run), and ResultPath isolation between freshness and substrate states. 490 total passing.

Requires: alpha-engine-dashboard PR bumping lib pin to v0.5.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
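As an illustration of the chain ordering and Catch semantics that the wiring tests pin, here is a minimal sketch. The state bodies are hypothetical and heavily trimmed: only the four state names and the ``NotifyComplete`` Catch target come from the description above, while the ``ResultPath`` values and the helper functions are assumptions made for the example.

```python
# Hypothetical, trimmed fragment of the Saturday SF definition.
# State names and the Catch target are from the PR description;
# ResultPath values and all other fields are illustrative assumptions.
STATES = {
    "WaitForSaturdayHealthCheck": {
        "Type": "Task",
        "ResultPath": "$.saturday_health_check",
        "Next": "WeeklySubstrateHealthCheck",
    },
    "WeeklySubstrateHealthCheck": {
        "Type": "Task",
        "ResultPath": "$.substrate_check",
        "Next": "WaitForWeeklySubstrateHealthCheck",
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyComplete"}],
    },
    "WaitForWeeklySubstrateHealthCheck": {
        "Type": "Task",
        "ResultPath": "$.substrate_check_wait",
        "Next": "NotifyComplete",
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyComplete"}],
    },
    "NotifyComplete": {"Type": "Task", "End": True},
}


def walk_chain(states, start):
    """Follow Next pointers from `start` until a state has no Next."""
    order, seen, name = [], set(), start
    while name is not None and name not in seen:
        seen.add(name)
        order.append(name)
        name = states[name].get("Next")
    return order


def catch_targets(state):
    """Return the set of states that this state's Catch clauses route to."""
    return {clause["Next"] for clause in state.get("Catch", [])}


def result_paths(states):
    """Collect every ResultPath so overlap between states can be detected."""
    return [s["ResultPath"] for s in states.values() if "ResultPath" in s]
```

A wiring test in this style would assert the walk order, that both new states catch to ``NotifyComplete``, and that no two states share a ``ResultPath``.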
cipher813 added a commit that referenced this pull request on May 6, 2026
…substrate (#176)

Adds infrastructure/setup_substrate_alarms.sh, an idempotent operator script that creates one CloudWatch alarm per inventory row plus one aggregate failure alarm. All point to the existing alpha-engine-alerts SNS topic.

Per-row alarm (alpha-engine-substrate-<row_id>): fires when the SubstrateRowOK metric for that row drops below 1. The lib emits 1=ok/not_yet_effective, 0=fail, so a single fail in a 24h window triggers SNS via Statistic=Minimum.

Aggregate alarm (alpha-engine-substrate-aggregate-failures): fires when SubstrateChecksFailed > 0. Safety net for accidental per-row alarm deletion; per-row alarms remain authoritative for which row failed.

treat-missing-data=notBreaching keeps weekly-cadence rows quiet between Sat-SF emissions; only emitted-and-failed datapoints fire.

Row enumeration sources from alpha_engine_lib.transparency.load_inventory() so adding a row to the YAML and re-running this script automatically adds the corresponding alarm. No hardcoded row list to drift.

Bumps alpha-engine-lib pin v0.3.0 → v0.5.0 so the test imports of DEFAULT_NAMESPACE_OUT (added in lib #23) resolve.

15 new tests pin namespace alignment with lib, SNS target, row enumeration source, alarm semantics (LessThanThreshold + Minimum + notBreaching), and execution order (topic check before alarm creation). 505 total passing.

Operator runs once after data #175 deploys:

    pip install -r requirements.txt  # gets v0.5.0
    ./infrastructure/setup_substrate_alarms.sh

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
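The per-row alarm semantics (LessThanThreshold + Minimum + notBreaching) can be sketched in Python. This is not the operator script, which is a shell script; it is an illustrative model. The dictionary keys mirror the keyword arguments of boto3's `put_metric_alarm`, but the `RowId` dimension name, the single 24h evaluation period, and both helper functions are assumptions made for the example.

```python
def per_row_alarm_params(row_id, sns_topic_arn, namespace="AlphaEngine/Substrate"):
    """Build put_metric_alarm-style parameters for one inventory row.

    The "RowId" dimension name and the single 24h period are assumptions;
    the alarm name pattern, metric name, and semantics follow the commit text.
    """
    return {
        "AlarmName": f"alpha-engine-substrate-{row_id}",
        "Namespace": namespace,
        "MetricName": "SubstrateRowOK",
        "Dimensions": [{"Name": "RowId", "Value": row_id}],
        "Statistic": "Minimum",
        "Period": 86400,  # one 24h window, in seconds
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def would_alarm(datapoints, threshold=1.0):
    """Model one evaluation window: Minimum statistic, LessThanThreshold.

    An empty window models treat-missing-data=notBreaching, so a row that
    did not emit in the window stays quiet.
    """
    if not datapoints:
        return False
    return min(datapoints) < threshold
```

With these semantics a window of `[1, 1, 0]` breaches because the minimum is 0, while an empty window (a weekly row between Saturday emissions) does not.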
cipher813 added a commit that referenced this pull request on May 7, 2026
#178)

Mirrors the Saturday-SF WeeklySubstrateHealthCheck (PR #175) into the weekday EOD SF, running ``python -m alpha_engine_lib.transparency --cadence daily --alert`` on the dashboard EC2 between EODReconcile success and StopTradingInstance.

Closes the Phase 2 → 3 gap where rows 4/5/6 of transparency_inventory (lineage, risk_events, residual_pct) only got checked once per week despite emitting daily: a bad emission Mon-Thu would otherwise sit undetected until Saturday's run.

The same per-row CloudWatch alarms from PR #176 cover daily emissions (the SubstrateRowOK metric is cadence-agnostic). No new alarms needed.

Both new states are non-blocking (Catch routes to StopTradingInstance) so a substrate-check infra failure can never leave the trading EC2 running overnight (cost-guard requirement).

Refactors update_eod_pipeline_sf.sh to read the SF definition from infrastructure/step_function_eod.json instead of an inline heredoc, matching the deploy_step_function.sh pattern for the Saturday SF. The JSON file is now the single source of truth; wiring tests pin its contents. Eliminates the two-staleness-vectors antipattern that had the heredoc and the JSON file diverging silently.

17 new wiring tests pin chain ordering, Catch semantics, command shape (--cadence daily, --alert, dashboard EC2, git pull before run), ResultPath isolation, and instance targeting (dashboard EC2 not trading EC2). 519 total passing.

Requires: alpha-engine PR adding ec2_instance_id to the daemon's EOD SF input (DailySubstrateHealthCheck targets $.ec2_instance_id, which the daemon trigger now populates with the dashboard EC2 instance id).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
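The cost-guard property (no path out of the substrate states can skip StopTradingInstance) can be checked as a small graph walk over the definition JSON. The sketch below is an assumption-laden stand-in: the embedded JSON is not the contents of infrastructure/step_function_eod.json, the name `WaitForDailySubstrateHealthCheck` is guessed by analogy with the Saturday SF, and the walk only models explicit `Next` and `Catch` edges (an uncaught error on a state without a catch-all is not modeled, and cycles are not treated as failures).

```python
import json

# Hypothetical stand-in for the EOD definition file; trimmed to the
# fields the check needs.
EOD_DEFINITION = json.loads("""
{
  "StartAt": "EODReconcile",
  "States": {
    "EODReconcile": {"Type": "Task", "Next": "DailySubstrateHealthCheck"},
    "DailySubstrateHealthCheck": {
      "Type": "Task",
      "Next": "WaitForDailySubstrateHealthCheck",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "StopTradingInstance"}]
    },
    "WaitForDailySubstrateHealthCheck": {
      "Type": "Task",
      "Next": "StopTradingInstance",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "StopTradingInstance"}]
    },
    "StopTradingInstance": {"Type": "Task", "End": true}
  }
}
""")


def successors(state):
    """All states reachable in one step via Next or any Catch clause."""
    nxt = [state["Next"]] if "Next" in state else []
    return nxt + [clause["Next"] for clause in state.get("Catch", [])]


def always_reaches(states, start, target):
    """True if no Next/Catch path from `start` terminates before `target`."""
    stack, seen = [start], set()
    while stack:
        name = stack.pop()
        if name == target or name in seen:
            continue
        seen.add(name)
        succ = successors(states[name])
        if not succ:  # terminal state reached without passing through target
            return False
        stack.extend(succ)
    return True
```

Calling `always_reaches(EOD_DEFINITION["States"], "DailySubstrateHealthCheck", "StopTradingInstance")` holds for this fragment; rerouting either Catch to a terminal state that is not StopTradingInstance makes it fail.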
cipher813 added a commit that referenced this pull request on May 18, 2026
…n dry path, closes DriftDetection skip-exception) (#261)

Adds a `--preflight-only` modifier to infrastructure/spot_drift_detection.sh, mirroring the merged #259 (spot_data_weekly.sh) / predictor #175 / backtester #224 pattern. Closes the DriftDetection skip-exception in ROADMAP "Friday shell-run — per-module dry-path activation": the one per-module SF step still SKIPPED rather than dry-run on the Friday shell_run.

Insertion point
---------------
`PREFLIGHT_ONLY=0` modifier var initialised before the arg-parse loop (orthogonal to RUN_MODE, `set -u` safe); `--preflight-only) PREFLIGHT_ONLY=1` added to the case loop. The guard block is inserted AFTER the smoke-only block and strictly BEFORE the "# ── Full drift detection ──" section (the `run_remote bash -s <<DRIFT` heredoc) and before the trailing `aws cloudwatch put-metric-data` heartbeat.

No-scan / no-write proof
------------------------
`monitoring.drift_detector` (in alpha-engine-predictor, on the sibling-clone PYTHONPATH) is the SOLE code path that does any S3 get_object/put_object of the drift report or SNS publish on alert; the launcher's CloudWatch put-metric-data heartbeat trails it. The PREFLIGHT_ONLY guard `exit 0`s strictly before the `<<DRIFT` heredoc, so the scan, the SNS publish, the S3 put_object, and the CloudWatch emit are all statically unreachable. The preflight itself runs only BasePreflight.check_env_vars (env read) + BasePreflight.check_s3_bucket (bucket HEAD) + an `importlib.import_module` of the drift module (import-only: boto3 clients + check_drift()/main() sit behind `if __name__ == "__main__"`, which an import does not trigger). Zero external API data fetch, zero S3/CW/SNS/config mutation; exit 0 because a passed preflight is a healthy outcome (SSM/SF report Success).

Preflight substrate reused
--------------------------
The drift workload binary lives in alpha-engine-predictor (no --preflight-only of its own; out of scope to modify here) and this repo's preflight.py DataPreflight modes (daily/morning_enrich/phase1/phase2) are data-collection scoped; none maps to drift. Per the canonical-lib fallback the preflight composes `alpha_engine_lib.preflight.BasePreflight` DIRECTLY (env-vars + S3 HEAD), with no bespoke preflight scaffolding duplicated.

Verbatim flag name: `--preflight-only`

Tests
-----
New tests/test_spot_drift_detection_preflight_only.py (5 static greps/source-position assertions, mirroring tests/test_preflight_only_dry_path.py): flag parses as a modifier; guard precedes DRIFT + heartbeat; exit 0 before DRIFT; no scan/S3/CW/SNS in block; canonical BasePreflight reused (no scaffolding). `bash -n` clean. Full data suite: 1342 passed, 1 skipped (pre-existing), 5 pre-existing warnings.

Independent of #260: that PR touches spot_data_weekly.sh + the Lambda dry-run keystone (a different file); the Saturday/Friday SF rewire to route the DriftDetection state at this `--preflight-only` flag under the Friday shell_run is a separate follow-on (no step_function.json change here).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
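The "static greps/source-position assertions" approach can be sketched as follows. The embedded shell text is a hypothetical miniature, not the contents of infrastructure/spot_drift_detection.sh: the guard's `if` syntax, the echo line, and the heartbeat arguments are assumptions, and the check functions are illustrative rather than the tests in the commit.

```python
# Hypothetical miniature of the launcher; only the flag name, the
# PREFLIGHT_ONLY variable, the <<DRIFT heredoc, and the trailing
# put-metric-data heartbeat are taken from the commit message.
SCRIPT = '''\
PREFLIGHT_ONLY=0
for arg in "$@"; do
  case "$arg" in
    --preflight-only) PREFLIGHT_ONLY=1 ;;
  esac
done

if [ "$PREFLIGHT_ONLY" -eq 1 ]; then
  echo "preflight passed"
  exit 0
fi

# ── Full drift detection ──
run_remote bash -s <<DRIFT
python -m monitoring.drift_detector
DRIFT
aws cloudwatch put-metric-data --namespace Example --metric-name Heartbeat --value 1
'''

GUARD_OPEN = 'if [ "$PREFLIGHT_ONLY" -eq 1 ]'


def guard_block(script):
    """Return (start, end, text) of the preflight guard block."""
    start = script.index(GUARD_OPEN)
    end = script.index("\nfi", start) + len("\nfi")
    return start, end, script[start:end]


def check_preflight_guard(script):
    """Static source-position checks; nothing is executed."""
    _, end, block = guard_block(script)
    heredoc = script.index("<<DRIFT")
    heartbeat = script.index("aws cloudwatch put-metric-data")
    side_effects = ("aws ", "put_object", "drift_detector")
    return {
        "flag_parsed": "--preflight-only) PREFLIGHT_ONLY=1" in script,
        "guard_before_heredoc": end < heredoc,
        "guard_before_heartbeat": end < heartbeat,
        "exits_zero": "exit 0" in block,
        "no_side_effects": not any(tok in block for tok in side_effects),
    }
```

Because the checks compare string offsets only, they stay valid without AWS credentials or a remote host, which is the point of a static proof that the scan and writes are unreachable.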
Summary
- Inserts `WeeklySubstrateHealthCheck` + `WaitForWeeklySubstrateHealthCheck` between the existing `WaitForSaturdayHealthCheck` and `NotifyComplete`.
- The new states invoke `python -m alpha_engine_lib.transparency --cadence weekly --alert` on the dashboard EC2 (the Sat SF dispatcher), running the row-driven substrate health checker shipped in alpha-engine-lib v0.5.0 (lib PR #23).
Why
The Phase 2 → 3 gate is "≥99% of inventory rows pass for 8 consecutive weeks." Without an enforced check, that's a number computed retrospectively from data; with a per-row substrate check + per-row CloudWatch alarm, it becomes a property the system continuously enforces — a failed row pages in <24h instead of being caught at next eyeball pass.
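To make the gate concrete, here is a minimal sketch of one possible reading: each of the last 8 weekly runs must individually reach a 99% row pass rate. The function name, the input shape, and the treatment of rows that are not yet effective are assumptions; the text above does not say how the project computes the number.

```python
def gate_satisfied(weekly_results, min_pass_rate=0.99, required_weeks=8):
    """Evaluate the observation gate under an assumed per-week reading.

    weekly_results: list of (rows_passed, rows_total) tuples, oldest first,
    one entry per weekly substrate run.
    """
    if len(weekly_results) < required_weeks:
        return False  # not enough history yet
    recent = weekly_results[-required_weeks:]
    return all(
        total > 0 and passed / total >= min_pass_rate
        for passed, total in recent
    )
```

Under this reading a single week below the threshold inside the trailing window resets the gate, which is the "continuously enforced" behavior the paragraph describes.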
The existing artifact-freshness `SaturdayHealthCheck` and behavioral `DriftDetection` continue to run unchanged. The substrate check is a different abstraction (row-driven inventory validation, content assertions, not just last-modified age) and runs in parallel. Two-staleness-vectors avoidance: `SaturdayHealthCheck` retirement is planned after ~4-6 weeks of green substrate runs once row coverage proves out.
Wiring
Both new states are non-blocking (Catch routes to `NotifyComplete`) per the same pattern as `SaturdayHealthCheck`. Pipeline halts only on hard infra failure; row-level failures fire CloudWatch alarms via the `--alert` flag publishing to the existing `alpha-engine-alerts` SNS topic.
Dependencies
- […] `pip install -r requirements.txt` step inside the new SF state pulls the right lib version
Test plan
- `pytest tests/test_sf_substrate_check_wiring.py`: 15 new tests passing
- `pytest` (full suite): 490 total passing, no regressions
- […] `skip_*` flags to short-circuit upstream stages) exercises the new states end-to-end before Sat 5/9 production fire
- `AlphaEngine/Substrate` namespace appears in CloudWatch with per-row `SubstrateRowOK` metrics + 3 aggregates (`SubstrateChecksOK`/`Failed`/`Pending`)
Follow-ups
- […] `SubstrateRowOK < 1` + aggregate alarm on `SubstrateChecksFailed > 0` → SNS)
- […] `SaturdayHealthCheck` after 4-6 weeks of green substrate runs

🤖 Generated with Claude Code