feat(alarms): per-row + aggregate CloudWatch alarms for transparency substrate #176
Merged
…substrate

Adds infrastructure/setup_substrate_alarms.sh, an idempotent operator script that creates one CloudWatch alarm per inventory row plus one aggregate failure alarm. All point to the existing alpha-engine-alerts SNS topic.

Per-row alarm (alpha-engine-substrate-<row_id>): fires when the SubstrateRowOK metric for that row drops below 1. The lib emits 1=ok/not_yet_effective, 0=fail, so a single fail in a 24h window triggers SNS via Statistic=Minimum.

Aggregate alarm (alpha-engine-substrate-aggregate-failures): fires when SubstrateChecksFailed > 0. Safety net for accidental per-row alarm deletion; per-row alarms remain authoritative for which row failed.

treat-missing-data=notBreaching keeps weekly-cadence rows quiet between Sat-SF emissions; only emitted-and-failed datapoints fire.

Row enumeration sources from alpha_engine_lib.transparency.load_inventory() so adding a row to the YAML and re-running this script automatically adds the corresponding alarm. No hardcoded row list to drift.

Bumps alpha-engine-lib pin v0.3.0 → v0.5.0 so the test imports of DEFAULT_NAMESPACE_OUT (added in lib #23) resolve.

15 new tests pin namespace alignment with lib, SNS target, row enumeration source, alarm semantics (LessThanThreshold + Minimum + notBreaching), and execution order (topic check before alarm creation). 505 total passing.

Operator runs once after data #175 deploys:

    pip install -r requirements.txt  # gets v0.5.0
    ./infrastructure/setup_substrate_alarms.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
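The per-row semantics described in this commit message can be illustrated with a small model. This is a simplified sketch of a single 24h evaluation period, not CloudWatch's actual evaluation logic; the function name `evaluate_row_alarm` is hypothetical and only the threshold, the Minimum statistic, and the notBreaching behavior are taken from the text.

```python
from typing import Sequence


def evaluate_row_alarm(datapoints: Sequence[float], threshold: float = 1.0) -> str:
    """Simplified model of one evaluation period for a per-row alarm.

    datapoints: SubstrateRowOK values emitted in the window (1 = ok, 0 = fail).
    """
    if not datapoints:
        # Assumption: with treat-missing-data=notBreaching, an empty window
        # counts as within threshold, so the alarm does not fire.
        return "OK"
    # Statistic=Minimum with LessThanThreshold 1: any single 0 breaches.
    return "ALARM" if min(datapoints) < threshold else "OK"


if __name__ == "__main__":
    print(evaluate_row_alarm([1, 1, 0, 1]))  # one failure among passes
    print(evaluate_row_alarm([1, 1]))        # all passes
    print(evaluate_row_alarm([]))            # weekly row, no emission today
```

Using the minimum rather than the average is what keeps one failed datapoint from being diluted by passing ones in the same window.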
cipher813 added a commit that referenced this pull request on May 7, 2026
#178)

Mirrors the Saturday-SF WeeklySubstrateHealthCheck (PR #175) into the weekday EOD SF, running ``python -m alpha_engine_lib.transparency --cadence daily --alert`` on the dashboard EC2 between EODReconcile success and StopTradingInstance.

Closes the Phase 2 → 3 gap where rows 4/5/6 of transparency_inventory (lineage, risk_events, residual_pct) only got checked once per week despite emitting daily: a bad emission Mon-Thu would otherwise sit undetected until Saturday's run.

The same per-row CloudWatch alarms from PR #176 cover daily emissions (SubstrateRowOK metric is cadence-agnostic). No new alarms needed.

Both new states are non-blocking (Catch routes to StopTradingInstance) so a substrate-check infra failure can never leave the trading EC2 running overnight (cost-guard requirement).

Refactors update_eod_pipeline_sf.sh to read the SF definition from infrastructure/step_function_eod.json instead of an inline heredoc, matching the deploy_step_function.sh pattern for the Saturday SF. The JSON file is now the single source of truth; wiring tests pin its contents. Eliminates the two-staleness-vectors antipattern that had the heredoc and the JSON file diverging silently.

17 new wiring tests pin chain ordering, Catch semantics, command shape (--cadence daily, --alert, dashboard EC2, git pull before run), ResultPath isolation, and instance targeting (dashboard EC2 not trading EC2). 519 total passing.

Requires: alpha-engine PR adding ec2_instance_id to the daemon's EOD SF input (DailySubstrateHealthCheck targets \$.ec2_instance_id, which the daemon trigger now populates with the dashboard EC2 instance id).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request on May 10, 2026
…e seed + backfill (#206)

* feat(signal_returns): write calibrator-v1 context on score_performance seed + backfill

Root-cause closure for the 2026-05-09 Saturday SF evaluator P0 (weight_optimizer ERROR: "None of [Index(['quant_score','qual_score'])] are in the [columns]"; auto-rollback Sharpe -42.2% vs baseline).

Producer audit revealed two parallel writers diverged silently after research migration #12 (2026-05-08):

  * scoring/performance_tracker.py::record_new_buy_scores writes ALL 5 canonical context columns, but has zero production callers.
  * collectors/signal_returns.py::_seed_score_performance is the actual production writer (runs weekly in DataPhase1) and only wrote (symbol, score_date, score, price_on_date). The 5 canonical columns (quant_score, qual_score, conviction, sector_modifier, market_regime) were never populated.

Single-fact-single-writer rebuild:

  * _seed_score_performance now extracts the 5 context fields from the same signals.json payload that drives the BUY filter: a single source-of-truth fetch per signals.json, no second round-trip.
  * New _backfill_score_context repairs legacy rows whose canonical columns are NULL. UPDATE-WHERE-NULL so re-runs are no-ops once every row has a source.
  * _ensure_score_performance_schema mirrors research migration #12 defensively in case DataPhase1 ever fires against a fresh research.db before research's cold-start migrations run.

Composes with backtester #176 (PR-day consumer-side coalesce fix). With this PR the producer becomes authoritative; the next backtester PR can retire the S3 round-trip in weight_optimizer.load_with_subscores.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(signal_returns): drift gate - canonical context coverage CW gauge

Locks the producer-side contract established in the previous commit: after seed + backfill complete, query score_performance for rows with score_date >= 2026-05-17 (first Sat SF after this PR merges) and emit the coverage percentage as a CloudWatch gauge:

    AlphaEngine/Data/score_performance_canonical_coverage_pct

Coverage = fraction of post-cutover rows with ALL 5 canonical context columns populated (quant_score, qual_score, conviction, sector_modifier, market_regime). 100% is the contract; the gauge is always emitted (including 100.0) so alarm baselines stay continuous.

Mirrors the chronic-gap drift detection pattern at weekly_collector.py:_check_chronic_gap_polygon_recovery: same best-effort emit, same observability-not-load-bearing posture. A follow-up alpha-engine-lib transparency_inventory entry can wire this into the substrate health alarm if desired; the metric itself is the drift signal.

Tripwire test asserts _CANONICAL_CONTEXT_COLUMNS stays in lockstep with the seed INSERT; adding a 6th column to one without the other would make the drift gate blind to that field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
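The coverage definition in the drift-gate commit can be worked through on a toy table. This is a sketch under stated assumptions, not the collector's code: the table schema is simplified, the function names `coverage_pct` and `demo_db` are hypothetical, and returning 100.0 when there are no post-cutover rows is an assumption the source does not confirm. Only the five column names and the 2026-05-17 cutover come from the commit message.

```python
import sqlite3

CANONICAL_COLUMNS = (
    "quant_score", "qual_score", "conviction", "sector_modifier", "market_regime",
)
CUTOVER_DATE = "2026-05-17"  # ISO dates compare correctly as strings


def coverage_pct(conn: sqlite3.Connection, cutover: str = CUTOVER_DATE) -> float:
    """Percent of post-cutover rows with all canonical columns non-NULL."""
    all_present = " AND ".join(f"{col} IS NOT NULL" for col in CANONICAL_COLUMNS)
    total, covered = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(CASE WHEN {all_present} THEN 1 ELSE 0 END), 0) "
        "FROM score_performance WHERE score_date >= ?",
        (cutover,),
    ).fetchone()
    if total == 0:
        return 100.0  # assumption: no rows to check is treated as full coverage
    return 100.0 * covered / total


def demo_db() -> sqlite3.Connection:
    """In-memory table with made-up rows (simplified, assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE score_performance (symbol TEXT, score_date TEXT, score REAL, "
        "price_on_date REAL, quant_score REAL, qual_score REAL, conviction REAL, "
        "sector_modifier REAL, market_regime TEXT)"
    )
    rows = [
        ("AAA", "2026-05-10", 70.0, 10.0, None, None, None, None, None),   # pre-cutover, ignored
        ("BBB", "2026-05-17", 80.0, 20.0, 75.0, 85.0, 0.6, 1.0, "bull"),   # fully populated
        ("CCC", "2026-05-17", 65.0, 30.0, 60.0, 70.0, None, 1.0, "bull"),  # conviction missing
        ("DDD", "2026-05-24", 90.0, 40.0, 88.0, 92.0, 0.9, 1.1, "neutral"),
    ]
    conn.executemany("INSERT INTO score_performance VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    print(round(coverage_pct(demo_db()), 2))  # 2 of 3 post-cutover rows covered -> 66.67
```

A row counts as covered only if every canonical column is populated, so a single NULL field is enough to pull the gauge below 100.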
Summary
* Adds `infrastructure/setup_substrate_alarms.sh`: an idempotent operator script that creates one alarm per inventory row plus one aggregate failure alarm, all targeting the existing `alpha-engine-alerts` SNS topic.
* Bumps the `alpha-engine-lib` pin v0.3.0 → v0.5.0 so the new transparency module is importable for row enumeration.
Why
The substrate health checker (lib v0.5.0 + alpha-engine-data #175) emits per-row CloudWatch metrics every Sat SF run. Without alarms, those metrics are observability surface only. The point of the Phase 2 → 3 gate is "≥99% sustained for 8 weeks enforced via paging on every miss" — alarms are the enforcement layer.
Alarm shape
Per-row (one per inventory row, named `alpha-engine-substrate-<row_id>`):
* `LessThanThreshold` 1, `Statistic=Minimum`, `Period=86400`, `EvaluationPeriods=1`
* `treat-missing-data=notBreaching` keeps weekly-cadence rows quiet between Sat-SF emissions.
Aggregate (`alpha-engine-substrate-aggregate-failures`):
* `GreaterThanThreshold` 0 on `SubstrateChecksFailed`
Row enumeration is sourced, not hardcoded
The script pulls row IDs at runtime via `alpha_engine_lib.transparency.load_inventory()`.
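A minimal sketch of how enumerated row IDs could be turned into alarm definitions, not the script's actual code. The row IDs in the demo, the namespace value, the `row_id` dimension name, the SNS ARN, and the aggregate alarm's statistic and period are all hypothetical placeholders; the alarm names, metric names, thresholds, and per-row settings follow the alarm shape described in this PR. The dictionary keys are the parameter names of CloudWatch's PutMetricAlarm API.

```python
from typing import Iterable

# Hypothetical placeholders; real values live in alpha-engine-lib and the AWS account.
NAMESPACE = "Example/Substrate"  # stand-in for the lib's DEFAULT_NAMESPACE_OUT
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:000000000000:alpha-engine-alerts"


def per_row_alarm(row_id: str) -> dict:
    """Alarm that fires when SubstrateRowOK for one row drops below 1."""
    return {
        "AlarmName": f"alpha-engine-substrate-{row_id}",
        "Namespace": NAMESPACE,
        "MetricName": "SubstrateRowOK",
        "Dimensions": [{"Name": "row_id", "Value": row_id}],  # dimension name is assumed
        "Statistic": "Minimum",
        "Period": 86400,
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [SNS_TOPIC_ARN],
    }


def aggregate_alarm() -> dict:
    """Safety-net alarm that fires when SubstrateChecksFailed > 0."""
    return {
        "AlarmName": "alpha-engine-substrate-aggregate-failures",
        "Namespace": NAMESPACE,
        "MetricName": "SubstrateChecksFailed",
        "Statistic": "Maximum",  # assumed; the PR does not state it
        "Period": 86400,         # assumed; the PR does not state it
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [SNS_TOPIC_ARN],
    }


def build_alarms(row_ids: Iterable[str]) -> list[dict]:
    """One alarm per inventory row, plus the aggregate alarm at the end."""
    return [per_row_alarm(row_id) for row_id in row_ids] + [aggregate_alarm()]


if __name__ == "__main__":
    # In the real script the IDs would come from load_inventory(); these are made up.
    for spec in build_alarms(["lineage", "risk_events"]):
        print(spec["AlarmName"], spec["ComparisonOperator"], spec["Threshold"])
```

Each dictionary could be passed to boto3 as `put_metric_alarm(**spec)`. Because PutMetricAlarm updates an existing alarm of the same name rather than duplicating it, re-running over the same inventory is idempotent.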
Adding a row to `transparency_inventory.yaml` and re-running this script automatically creates the corresponding alarm. Removing a row leaves the alarm in `INSUFFICIENT_DATA`, which is safer than silently deleting it.
Relationship to existing health checks
SaturdayHealthCheck (artifact freshness) and DriftDetection (predictor behavior) continue to run unchanged because they cover different failure modes than the substrate check. SaturdayHealthCheck asks "are upstream/intermediate artifacts fresh?" (Polygon ingestion broke, slim cache refresh failed); the substrate check asks "are the measurement-substrate emissions correct?" (trade lineage null, residual_pct > 1%). Complementary, not redundant. No retirement planned.
Operator workflow
After this PR + data #175 deploy:
    pip install -r requirements.txt  # picks up lib v0.5.0
    ./infrastructure/setup_substrate_alarms.sh

First Sat SF run (5/9) emits metrics; alarms transition from `INSUFFICIENT_DATA` to `OK`/`ALARM`.
Test plan
* `pytest tests/test_substrate_alarms_script.py`: 15 passing
* `pytest` (full data suite): 505 total, no regressions
* `bash -n setup_substrate_alarms.sh`: syntax OK
* `aws cloudwatch describe-alarms --alarm-name-prefix alpha-engine-substrate-` shows all alarms transitioning out of `INSUFFICIENT_DATA`
Follow-ups
🤖 Generated with Claude Code