feat(alarms): per-row + aggregate CloudWatch alarms for transparency substrate by cipher813 · Pull Request #176 · cipher813/alpha-engine-data

cipher813 · 2026-05-06T20:35:13Z

Summary

infrastructure/setup_substrate_alarms.sh — idempotent operator script that creates one alarm per inventory row + one aggregate failure alarm, all targeting the existing alpha-engine-alerts SNS topic.
Bumps alpha-engine-lib pin v0.3.0 → v0.5.0 so the new transparency module is importable for row enumeration.
15 new tests pin namespace alignment, SNS target, row enumeration source, and alarm semantics. 505 total passing.

Why

The substrate health checker (lib v0.5.0 + alpha-engine-data #175) emits per-row CloudWatch metrics every Sat SF run. Without alarms, those metrics are observability surface only. The point of the Phase 2 → 3 gate is "≥99% sustained for 8 weeks enforced via paging on every miss" — alarms are the enforcement layer.

Alarm shape

Per-row (one per inventory row, named alpha-engine-substrate-<row_id>):

LessThanThreshold 1, Statistic=Minimum, Period=86400, EvaluationPeriods=1
The lib emits 1=ok/not_yet_effective, 0=fail. Any single fail in a 24h window fires.
treat-missing-data=notBreaching keeps weekly-cadence rows quiet between Sat-SF emissions.
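
The per-row semantics above can be illustrated with a small model of a single 24h evaluation window. This is only a sketch of the described behaviour, not CloudWatch's evaluator: `row_alarm_breaches` is a hypothetical helper, and an empty list stands in for a window with no emission.

```python
from typing import Sequence

THRESHOLD = 1.0  # LessThanThreshold 1: breach when the window Minimum is below 1


def row_alarm_breaches(datapoints: Sequence[float]) -> bool:
    """Model one 24h window (Period=86400, EvaluationPeriods=1).

    An empty window models "no emission in this period"; under
    treat-missing-data=notBreaching it is treated as not breaching.
    """
    if not datapoints:
        return False
    # Statistic=Minimum: a single 0 (fail) pulls the window below the threshold.
    return min(datapoints) < THRESHOLD


print(row_alarm_breaches([1, 1, 1]))  # every check ok / not_yet_effective
print(row_alarm_breaches([1, 0, 1]))  # one failed check in the window
print(row_alarm_breaches([]))         # quiet day between Sat-SF emissions
```

Using Minimum rather than Average means one failing datapoint cannot be diluted by passing ones in the same window.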

Aggregate (alpha-engine-substrate-aggregate-failures):

GreaterThanThreshold 0 on SubstrateChecksFailed
Safety net for accidental per-row alarm deletion — per-row alarms remain authoritative.
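
For readers who think in boto3 rather than the AWS CLI, the two alarm shapes could be expressed as keyword arguments for `put_metric_alarm`. This is a sketch under assumptions, not the script's actual implementation: the namespace, the dimension name, the SNS topic ARN, and the aggregate alarm's Statistic, Period, and missing-data setting are all placeholders that this description does not specify.

```python
from typing import Any, Dict

# Hypothetical placeholders: the real values live in the lib and the script.
NAMESPACE = "Example/Substrate"
ROW_DIMENSION = "RowId"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:alpha-engine-alerts"


def per_row_alarm_params(row_id: str) -> Dict[str, Any]:
    """Per-row alarm: fire when the 24h Minimum of SubstrateRowOK is below 1."""
    return {
        "AlarmName": f"alpha-engine-substrate-{row_id}",
        "Namespace": NAMESPACE,
        "MetricName": "SubstrateRowOK",
        "Dimensions": [{"Name": ROW_DIMENSION, "Value": row_id}],
        "Statistic": "Minimum",
        "Period": 86400,
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [SNS_TOPIC_ARN],
    }


def aggregate_alarm_params() -> Dict[str, Any]:
    """Aggregate alarm: fire when SubstrateChecksFailed is above 0."""
    return {
        "AlarmName": "alpha-engine-substrate-aggregate-failures",
        "Namespace": NAMESPACE,
        "MetricName": "SubstrateChecksFailed",
        "Statistic": "Maximum",              # assumption, not stated in the PR
        "Period": 86400,                     # assumption, mirrors the per-row alarm
        "EvaluationPeriods": 1,              # assumption
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption
        "AlarmActions": [SNS_TOPIC_ARN],
    }


print(per_row_alarm_params("row_a")["AlarmName"])
print(aggregate_alarm_params()["ComparisonOperator"])
```

Either dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. Because `put_metric_alarm` creates or updates an alarm by name, re-running it with the same name is naturally idempotent.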

Row enumeration is sourced, not hardcoded

The script pulls row IDs at runtime via:

from alpha_engine_lib.transparency import load_inventory
print(' '.join(r['id'] for r in load_inventory()['inventory']))

Adding a row to transparency_inventory.yaml and re-running this script automatically creates the corresponding alarm. Removing a row leaves the alarm in INSUFFICIENT_DATA — safer than silently deleting.
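
As a rough illustration of what "sourced, not hardcoded" buys, the sketch below derives the expected alarm names from an inventory-shaped dict and lists alarms whose row no longer exists. The dict shape follows the one-liner above (`load_inventory()['inventory']` with an `id` per row); the row IDs, `expected_alarm_names`, and `orphaned_alarms` are hypothetical and not part of the script.

```python
from typing import Any, Dict, Iterable, List, Set

ALARM_PREFIX = "alpha-engine-substrate-"
AGGREGATE_ALARM = "alpha-engine-substrate-aggregate-failures"


def expected_alarm_names(inventory: Dict[str, Any]) -> List[str]:
    """One alarm per inventory row, plus the aggregate alarm."""
    names = [f"{ALARM_PREFIX}{row['id']}" for row in inventory["inventory"]]
    return names + [AGGREGATE_ALARM]


def orphaned_alarms(existing: Iterable[str], inventory: Dict[str, Any]) -> Set[str]:
    """Alarms that still exist although their row left the inventory."""
    return set(existing) - set(expected_alarm_names(inventory))


# Stand-in for load_inventory(); the row IDs are made up for the example.
sample_inventory = {"inventory": [{"id": "row_a"}, {"id": "row_b"}]}
existing = [
    "alpha-engine-substrate-row_a",
    "alpha-engine-substrate-row_b",
    "alpha-engine-substrate-row_old",
    AGGREGATE_ALARM,
]

print(expected_alarm_names(sample_inventory))
print(orphaned_alarms(existing, sample_inventory))
```

A report like this would let an operator decide deliberately whether to delete a leftover alarm, in keeping with the script's choice not to delete anything itself.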

Relationship to existing health checks

SaturdayHealthCheck (artifact freshness) and DriftDetection (predictor behavior) continue to run unchanged; they cover different failure modes than the substrate check. SaturdayHealthCheck asks "are upstream/intermediate artifacts fresh?" (Polygon ingestion broke, slim cache refresh failed); the substrate check asks "are the measurement-substrate emissions correct?" (trade lineage null, residual_pct > 1%). The checks are complementary, not redundant, and no retirement is planned.

Operator workflow

After this PR + data #175 deploy:

pip install -r requirements.txt   # picks up lib v0.5.0
./infrastructure/setup_substrate_alarms.sh

First Sat SF run (5/9) emits metrics; alarms transition from INSUFFICIENT_DATA to OK/ALARM.

Test plan

pytest tests/test_substrate_alarms_script.py — 15 passing
pytest (full data suite) — 505 total, no regressions
bash -n setup_substrate_alarms.sh — syntax OK
Operator runs script post-deploy, confirms 9 per-row + 1 aggregate alarm exist
After first Sat 5/9 SF: aws cloudwatch describe-alarms --alarm-name-prefix alpha-engine-substrate- shows all alarms transitioning out of INSUFFICIENT_DATA
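
The last check in the plan could also be scripted. The sketch below takes a response shaped like boto3's `describe_alarms` output (a `MetricAlarms` list whose entries carry `AlarmName` and `StateValue`) and lists the alarms still waiting for data. The sample response and the `alarms_still_pending` helper are illustrative only.

```python
from typing import Any, Dict, List


def alarms_still_pending(response: Dict[str, Any]) -> List[str]:
    """Names of alarms that have not yet left INSUFFICIENT_DATA."""
    return sorted(
        alarm["AlarmName"]
        for alarm in response.get("MetricAlarms", [])
        if alarm["StateValue"] == "INSUFFICIENT_DATA"
    )


# Hand-written sample in the shape of a describe_alarms response.
sample_response = {
    "MetricAlarms": [
        {"AlarmName": "alpha-engine-substrate-row_a", "StateValue": "OK"},
        {"AlarmName": "alpha-engine-substrate-row_b", "StateValue": "ALARM"},
        {"AlarmName": "alpha-engine-substrate-row_c", "StateValue": "INSUFFICIENT_DATA"},
    ]
}

print(alarms_still_pending(sample_response))
```

Against a live account the input would come from `boto3.client("cloudwatch").describe_alarms(AlarmNamePrefix="alpha-engine-substrate-")`; that call is paginated, so a paginator is the safer choice if the alarm list grows.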

Follow-ups

Daily-cadence equivalent SF state at end of EOD weekday SF (uses the same alarms — metric names are cadence-agnostic)

🤖 Generated with Claude Code

…substrate

Adds infrastructure/setup_substrate_alarms.sh, an idempotent operator script that creates one CloudWatch alarm per inventory row plus one aggregate failure alarm. All point to the existing alpha-engine-alerts SNS topic.

Per-row alarm (alpha-engine-substrate-<row_id>): fires when the SubstrateRowOK metric for that row drops below 1. The lib emits 1=ok/not_yet_effective, 0=fail, so a single fail in a 24h window triggers SNS via Statistic=Minimum.

Aggregate alarm (alpha-engine-substrate-aggregate-failures): fires when SubstrateChecksFailed > 0. Safety net for accidental per-row alarm deletion; per-row alarms remain authoritative for which row failed.

treat-missing-data=notBreaching keeps weekly-cadence rows quiet between Sat-SF emissions; only emitted-and-failed datapoints fire.

Row enumeration sources from alpha_engine_lib.transparency.load_inventory() so adding a row to the YAML and re-running this script automatically adds the corresponding alarm. No hardcoded row list to drift.

Bumps alpha-engine-lib pin v0.3.0 → v0.5.0 so the test imports of DEFAULT_NAMESPACE_OUT (added in lib #23) resolve.

15 new tests pin namespace alignment with lib, SNS target, row enumeration source, alarm semantics (LessThanThreshold + Minimum + notBreaching), and execution order (topic check before alarm creation). 505 total passing.

Operator runs once after data #175 deploys:

    pip install -r requirements.txt   # gets v0.5.0
    ./infrastructure/setup_substrate_alarms.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#178)

Mirrors the Saturday-SF WeeklySubstrateHealthCheck (PR #175) into the weekday EOD SF, running ``python -m alpha_engine_lib.transparency --cadence daily --alert`` on the dashboard EC2 between EODReconcile success and StopTradingInstance.

Closes the Phase 2 → 3 gap where rows 4/5/6 of transparency_inventory (lineage, risk_events, residual_pct) only got checked once per week despite emitting daily: a bad emission Mon-Thu would otherwise sit undetected until Saturday's run.

The same per-row CloudWatch alarms from PR #176 cover daily emissions (the SubstrateRowOK metric is cadence-agnostic). No new alarms needed.

Both new states are non-blocking (Catch routes to StopTradingInstance) so a substrate-check infra failure can never leave the trading EC2 running overnight (cost-guard requirement).

Refactors update_eod_pipeline_sf.sh to read the SF definition from infrastructure/step_function_eod.json instead of an inline heredoc, matching the deploy_step_function.sh pattern for the Saturday SF. The JSON file is now the single source of truth; wiring tests pin its contents. Eliminates the two-staleness-vectors antipattern that had the heredoc and the JSON file diverging silently.

17 new wiring tests pin chain ordering, Catch semantics, command shape (--cadence daily, --alert, dashboard EC2, git pull before run), ResultPath isolation, and instance targeting (dashboard EC2 not trading EC2). 519 total passing.

Requires: alpha-engine PR adding ec2_instance_id to the daemon's EOD SF input (DailySubstrateHealthCheck targets $.ec2_instance_id, which the daemon trigger now populates with the dashboard EC2 instance id).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
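
A wiring test for the non-blocking property might look roughly like the sketch below. The state-machine dict is a hypothetical, heavily trimmed stand-in for infrastructure/step_function_eod.json: only the state names EODReconcile, DailySubstrateHealthCheck, and StopTradingInstance come from the commit message, only one of the two new states is shown, and the ResultPath values and both helper functions are invented for the example.

```python
from typing import Any, Dict

# Hypothetical, trimmed Amazon States Language definition (Resource fields omitted).
definition: Dict[str, Any] = {
    "StartAt": "EODReconcile",
    "States": {
        "EODReconcile": {"Type": "Task", "Next": "DailySubstrateHealthCheck"},
        "DailySubstrateHealthCheck": {
            "Type": "Task",
            "ResultPath": "$.substrate_check",  # assumed path
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "ResultPath": "$.substrate_error",  # assumed path
                    "Next": "StopTradingInstance",
                }
            ],
            "Next": "StopTradingInstance",
        },
        "StopTradingInstance": {"Type": "Task", "End": True},
    },
}


def is_non_blocking(sf: Dict[str, Any], state: str, fallback: str) -> bool:
    """True if every error in `state` is caught and routed to `fallback`."""
    catchers = sf["States"][state].get("Catch", [])
    catches_all = any("States.ALL" in c.get("ErrorEquals", []) for c in catchers)
    routes_to_fallback = all(c.get("Next") == fallback for c in catchers)
    return catches_all and routes_to_fallback


def runs_between(sf: Dict[str, Any], before: str, state: str, after: str) -> bool:
    """True if the success path is before -> state -> after."""
    states = sf["States"]
    return states[before].get("Next") == state and states[state].get("Next") == after


print(is_non_blocking(definition, "DailySubstrateHealthCheck", "StopTradingInstance"))
print(runs_between(definition, "EODReconcile", "DailySubstrateHealthCheck", "StopTradingInstance"))
```

Checking both the success path and the Catch path captures the cost-guard requirement: whatever happens in the health check, execution still reaches StopTradingInstance.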

…e seed + backfill (#206)

* feat(signal_returns): write calibrator-v1 context on score_performance seed + backfill

Root-cause closure for the 2026-05-09 Saturday SF evaluator P0 (weight_optimizer ERROR: "None of [Index(['quant_score','qual_score'])] are in the [columns]"; auto-rollback Sharpe -42.2% vs baseline).

Producer audit revealed two parallel writers diverged silently after research migration #12 (2026-05-08):

  * scoring/performance_tracker.py::record_new_buy_scores writes ALL 5 canonical context columns, but has zero production callers.
  * collectors/signal_returns.py::_seed_score_performance is the actual production writer (runs weekly in DataPhase1) and only wrote (symbol, score_date, score, price_on_date). The 5 canonical columns (quant_score, qual_score, conviction, sector_modifier, market_regime) were never populated.

Single-fact-single-writer rebuild:

  * _seed_score_performance now extracts the 5 context fields from the same signals.json payload that drives the BUY filter: single source-of-truth fetch per signals.json, no second round-trip.
  * New _backfill_score_context repairs legacy rows whose canonical columns are NULL. UPDATE-WHERE-NULL so re-runs are no-ops once every row has a source.
  * _ensure_score_performance_schema mirrors research migration #12 defensively in case DataPhase1 ever fires against a fresh research.db before research's cold-start migrations run.

Composes with backtester #176 (PR-day consumer-side coalesce fix). With this PR the producer becomes authoritative; the next backtester PR can retire the S3 round-trip in weight_optimizer.load_with_subscores.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(signal_returns): drift gate: canonical context coverage CW gauge

Locks the producer-side contract established in the previous commit: after seed + backfill complete, query score_performance for rows with score_date >= 2026-05-17 (first Sat SF after this PR merges) and emit the coverage percentage as a CloudWatch gauge:

    AlphaEngine/Data/score_performance_canonical_coverage_pct

Coverage = fraction of post-cutover rows with ALL 5 canonical context columns populated (quant_score, qual_score, conviction, sector_modifier, market_regime). 100% is the contract; the gauge is always emitted (including 100.0) so alarm baselines stay continuous.

Mirrors the chronic-gap drift detection pattern at weekly_collector.py:_check_chronic_gap_polygon_recovery: same best-effort emit, same observability-not-load-bearing posture. A follow-up alpha-engine-lib transparency_inventory entry can wire this into the substrate health alarm if desired; the metric itself is the drift signal.

Tripwire test asserts _CANONICAL_CONTEXT_COLUMNS stays in lockstep with the seed INSERT: adding a 6th column to one without the other would make the drift gate blind to that field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
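
The UPDATE-WHERE-NULL backfill and the coverage gauge described in this commit message can be sketched against an in-memory SQLite table. Everything here is illustrative: the column types, the sample rows, the helper names `backfill_context` and `coverage_pct`, and the choice to report 100.0 when no post-cutover rows exist are assumptions, not the collector's actual code.

```python
import sqlite3
from typing import Any, Dict

CANONICAL = ("quant_score", "qual_score", "conviction", "sector_modifier", "market_regime")
CUTOVER = "2026-05-17"


def backfill_context(conn: sqlite3.Connection, symbol: str, score_date: str,
                     context: Dict[str, Any]) -> None:
    """Fill only NULL canonical columns; rows that are already complete are untouched."""
    assignments = ", ".join(f"{c} = COALESCE({c}, :{c})" for c in CANONICAL)
    any_null = " OR ".join(f"{c} IS NULL" for c in CANONICAL)
    conn.execute(
        f"UPDATE score_performance SET {assignments} "
        f"WHERE symbol = :symbol AND score_date = :score_date AND ({any_null})",
        {**context, "symbol": symbol, "score_date": score_date},
    )


def coverage_pct(conn: sqlite3.Connection) -> float:
    """Percent of post-cutover rows with all canonical columns populated."""
    all_set = " AND ".join(f"{c} IS NOT NULL" for c in CANONICAL)
    total, covered = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(CASE WHEN {all_set} THEN 1 ELSE 0 END), 0) "
        "FROM score_performance WHERE score_date >= ?",
        (CUTOVER,),
    ).fetchone()
    return 100.0 if total == 0 else 100.0 * covered / total  # empty case is an assumption


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE score_performance (symbol TEXT, score_date TEXT, score REAL, "
    "price_on_date REAL, quant_score REAL, qual_score REAL, conviction TEXT, "
    "sector_modifier REAL, market_regime TEXT)"
)
conn.executemany(
    "INSERT INTO score_performance (symbol, score_date, score, price_on_date) "
    "VALUES (?, ?, ?, ?)",
    [("AAA", "2026-05-17", 80.0, 10.0), ("BBB", "2026-05-24", 75.0, 20.0)],
)
ctx = {"quant_score": 61.5, "qual_score": 70.0, "conviction": "rising",
       "sector_modifier": 1.0, "market_regime": "neutral"}

print(coverage_pct(conn))
backfill_context(conn, "AAA", "2026-05-17", ctx)
print(coverage_pct(conn))
backfill_context(conn, "AAA", "2026-05-17", {**ctx, "quant_score": 99.0})  # no-op re-run
print(conn.execute("SELECT quant_score FROM score_performance WHERE symbol = 'AAA'").fetchone()[0])
```

The second backfill call changes nothing because the row no longer has a NULL canonical column, which is the re-run-is-a-no-op property the commit message describes.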

cipher813 merged commit b3318cc into main May 6, 2026
1 check passed

cipher813 deleted the feat/substrate-alarms branch May 6, 2026 20:37

cipher813 mentioned this pull request May 7, 2026

feat(sf): add DailySubstrateHealthCheck state at end of weekday EOD SF #178

Merged

5 tasks

cipher813 mentioned this pull request May 10, 2026

feat(signal_returns): write calibrator-v1 context on score_performance seed + backfill #206

Merged

5 tasks

cipher813 mentioned this pull request May 13, 2026

docs(config): banner on config.yaml.example + delete stale flow-doctor.yaml.example #225

Merged

3 tasks

cipher813 merged 1 commit into main from feat/substrate-alarms
