feat(alarms): per-row + aggregate CloudWatch alarms for transparency substrate #176

Merged
cipher813 merged 1 commit into main from feat/substrate-alarms
May 6, 2026

Conversation

cipher813 commented May 6, 2026

Summary

  • infrastructure/setup_substrate_alarms.sh — idempotent operator script that creates one alarm per inventory row + one aggregate failure alarm, all targeting the existing alpha-engine-alerts SNS topic.
  • Bumps alpha-engine-lib pin v0.3.0 → v0.5.0 so the new transparency module is importable for row enumeration.
  • 15 new tests pin namespace alignment, SNS target, row enumeration source, and alarm semantics. 505 total passing.

Why

The substrate health checker (lib v0.5.0 + alpha-engine-data #175) emits per-row CloudWatch metrics every Sat SF run. Without alarms, those metrics are observability surface only. The point of the Phase 2 → 3 gate is "≥99% sustained for 8 weeks enforced via paging on every miss" — alarms are the enforcement layer.

Alarm shape

Per-row (one per inventory row, named alpha-engine-substrate-<row_id>):

  • LessThanThreshold 1, Statistic=Minimum, Period=86400, EvaluationPeriods=1
  • The lib emits 1=ok/not_yet_effective, 0=fail. Any single fail in a 24h window fires.
  • treat-missing-data=notBreaching keeps weekly-cadence rows quiet between Sat-SF emissions.

Aggregate (alpha-engine-substrate-aggregate-failures):

  • GreaterThanThreshold 0 on SubstrateChecksFailed
  • Safety net for accidental per-row alarm deletion — per-row alarms remain authoritative.
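As a rough illustration of the two alarm shapes above, the sketch below renders them as plain Python dicts whose keys mirror the keyword arguments of boto3's `put_metric_alarm`. The operator script itself is shell, so this is only a restatement of the parameters, not the script's code. The namespace, SNS topic ARN, and dimension name are placeholders (the real namespace comes from the lib's `DEFAULT_NAMESPACE_OUT`), and the aggregate alarm's statistic, period, and missing-data handling are assumptions because the description only states its comparison and metric.

```python
from typing import Any

# Placeholders: the real namespace is the lib's DEFAULT_NAMESPACE_OUT and the
# real ARN belongs to the existing alpha-engine-alerts topic.
NAMESPACE = "Example/Substrate"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:alpha-engine-alerts"


def per_row_alarm(row_id: str) -> dict[str, Any]:
    """Parameters for one per-row alarm, as described in the PR."""
    return {
        "AlarmName": f"alpha-engine-substrate-{row_id}",
        "Namespace": NAMESPACE,
        "MetricName": "SubstrateRowOK",
        # Dimension name is an assumption; the PR does not show it.
        "Dimensions": [{"Name": "row_id", "Value": row_id}],
        "Statistic": "Minimum",          # one 0 in the window drags the minimum to 0
        "Period": 86400,                 # 24h window
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [SNS_TOPIC_ARN],
    }


def aggregate_alarm() -> dict[str, Any]:
    """Parameters for the aggregate safety-net alarm."""
    return {
        "AlarmName": "alpha-engine-substrate-aggregate-failures",
        "Namespace": NAMESPACE,
        "MetricName": "SubstrateChecksFailed",
        "Statistic": "Maximum",              # assumed
        "Period": 86400,                     # assumed
        "EvaluationPeriods": 1,              # assumed
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumed
        "AlarmActions": [SNS_TOPIC_ARN],
    }
```

With boto3, each dict could be applied as `cloudwatch.put_metric_alarm(**params)`; that call creates or updates the alarm by name, which is what makes re-running safe.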

Row enumeration is sourced, not hardcoded

The script pulls row IDs at runtime via:

```python
from alpha_engine_lib.transparency import load_inventory
print(' '.join(r['id'] for r in load_inventory()['inventory']))
```

Adding a row to transparency_inventory.yaml and re-running this script automatically creates the corresponding alarm. Removing a row leaves the alarm in INSUFFICIENT_DATA — safer than silently deleting.
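The add/remove behaviour described above can be sketched as a small planning function. It assumes the inventory has the `{'inventory': [{'id': ...}]}` shape implied by the snippet, and it takes the set of existing alarm names as an input rather than querying CloudWatch. `plan_alarms` and the row IDs in the usage note are hypothetical, not names from the script.

```python
PREFIX = "alpha-engine-substrate-"
AGGREGATE = PREFIX + "aggregate-failures"


def plan_alarms(inventory: dict, existing_alarm_names: set[str]) -> tuple[list[str], list[str]]:
    """Return (alarms to create or update, orphaned alarms left untouched).

    Every inventory row gets an alarm on each run, so new rows are picked up
    automatically. Alarms whose row has disappeared are reported but never
    deleted, matching the "safer than silently deleting" stance.
    """
    wanted = [PREFIX + row["id"] for row in inventory["inventory"]]
    wanted.append(AGGREGATE)
    orphaned = sorted(existing_alarm_names - set(wanted))
    return wanted, orphaned
```

For example, with rows `lineage` and `residual_pct` in the inventory and an existing alarm for a removed row `risk_events`, the function returns three alarms to create or update and lists `alpha-engine-substrate-risk_events` as orphaned.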

Relationship to existing health checks

SaturdayHealthCheck (artifact freshness) and DriftDetection (predictor behavior) continue to run unchanged, because they cover different failure modes than the substrate check. SaturdayHealthCheck asks "are upstream/intermediate artifacts fresh?" (Polygon ingestion broke, slim cache refresh failed); the substrate check asks "are the measurement-substrate emissions correct?" (trade lineage null, residual_pct > 1%). The checks are complementary, not redundant, and no retirement is planned.

Operator workflow

After this PR + data #175 deploy:

```shell
pip install -r requirements.txt   # picks up lib v0.5.0
./infrastructure/setup_substrate_alarms.sh
```

First Sat SF run (5/9) emits metrics; alarms transition from INSUFFICIENT_DATA to OK/ALARM.

Test plan

  • pytest tests/test_substrate_alarms_script.py — 15 passing
  • pytest (full data suite) — 505 total, no regressions
  • bash -n setup_substrate_alarms.sh — syntax OK
  • Operator runs script post-deploy, confirms 9 per-row + 1 aggregate alarm exist
  • After first Sat 5/9 SF: aws cloudwatch describe-alarms --alarm-name-prefix alpha-engine-substrate- shows all alarms transitioning out of INSUFFICIENT_DATA
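The last check in the plan could be automated along these lines. This is a minimal sketch that parses the JSON printed by `aws cloudwatch describe-alarms` (which lists metric alarms under `MetricAlarms`, each with `AlarmName` and `StateValue`) and reports alarms that have not yet left INSUFFICIENT_DATA. The helper name `pending_alarms` is hypothetical.

```python
import json


def pending_alarms(describe_alarms_json: str) -> list[str]:
    """Names of alarms still in INSUFFICIENT_DATA in a describe-alarms payload."""
    payload = json.loads(describe_alarms_json)
    return sorted(
        alarm["AlarmName"]
        for alarm in payload.get("MetricAlarms", [])
        if alarm["StateValue"] == "INSUFFICIENT_DATA"
    )
```

An empty result after the first Saturday run would indicate that every alarm has moved to OK or ALARM.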

Follow-ups

  • Add an equivalent daily-cadence state at the end of the weekday EOD SF (it reuses the same alarms, since metric names are cadence-agnostic)

🤖 Generated with Claude Code

…substrate

Adds infrastructure/setup_substrate_alarms.sh — idempotent operator
script that creates one CloudWatch alarm per inventory row plus one
aggregate failure alarm. All point to the existing alpha-engine-alerts
SNS topic.

Per-row alarm (alpha-engine-substrate-<row_id>): fires when
SubstrateRowOK metric for that row drops below 1. The lib emits
1=ok/not_yet_effective, 0=fail, so a single fail in a 24h window
triggers SNS via Statistic=Minimum.

Aggregate alarm (alpha-engine-substrate-aggregate-failures): fires
when SubstrateChecksFailed > 0. Safety net for accidental per-row
alarm deletion — per-row alarms remain authoritative for which row
failed.

treat-missing-data=notBreaching keeps weekly-cadence rows quiet
between Sat-SF emissions; only emitted-and-failed datapoints fire.

Row enumeration sources from alpha_engine_lib.transparency.load_inventory()
so adding a row to the YAML and re-running this script automatically
adds the corresponding alarm. No hardcoded row list to drift.

Bumps alpha-engine-lib pin v0.3.0 → v0.5.0 so the test imports of
DEFAULT_NAMESPACE_OUT (added in lib #23) resolve.

15 new tests pin namespace alignment with lib, SNS target, row
enumeration source, alarm semantics (LessThanThreshold + Minimum +
notBreaching), and execution order (topic check before alarm
creation). 505 total passing.

Operator runs once after data #175 deploys:
  pip install -r requirements.txt  # gets v0.5.0
  ./infrastructure/setup_substrate_alarms.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 merged commit b3318cc into main May 6, 2026
1 check passed
cipher813 deleted the feat/substrate-alarms branch May 6, 2026 20:37
cipher813 added a commit that referenced this pull request May 7, 2026
#178)

Mirrors the Saturday-SF WeeklySubstrateHealthCheck (PR #175) into the
weekday EOD SF, running ``python -m alpha_engine_lib.transparency
--cadence daily --alert`` on the dashboard EC2 between EODReconcile
success and StopTradingInstance.

Closes the Phase 2 → 3 gap where rows 4/5/6 of transparency_inventory
(lineage, risk_events, residual_pct) only got checked once per week
despite emitting daily — a bad emission Mon-Thu would otherwise sit
undetected until Saturday's run.

The same per-row CloudWatch alarms from PR #176 cover daily emissions
(SubstrateRowOK metric is cadence-agnostic). No new alarms needed.

Both new states are non-blocking (Catch routes to StopTradingInstance)
so a substrate-check infra failure can never leave the trading EC2
running overnight (cost-guard requirement).
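The non-blocking property could be pinned by a wiring check like the one sketched below. The state is written as a Python dict in Amazon States Language shape; its `Resource` and `ResultPath` values are illustrative assumptions, not the contents of infrastructure/step_function_eod.json, and `is_non_blocking` is a hypothetical helper.

```python
def is_non_blocking(state: dict, fallback: str = "StopTradingInstance") -> bool:
    """True if some catch-all Catch clause on the state routes to the fallback state."""
    return any(
        "States.ALL" in clause.get("ErrorEquals", []) and clause.get("Next") == fallback
        for clause in state.get("Catch", [])
    )


# Illustrative state definition; field values are assumptions.
daily_substrate_health_check = {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
    "ResultPath": "$.substrate_check",
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.substrate_check_error",
            "Next": "StopTradingInstance",
        }
    ],
    "Next": "StopTradingInstance",
}
```

Because both the success path and the catch-all path lead to StopTradingInstance, a failure inside the check cannot skip the instance shutdown.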

Refactors update_eod_pipeline_sf.sh to read the SF definition from
infrastructure/step_function_eod.json instead of an inline heredoc,
matching the deploy_step_function.sh pattern for the Saturday SF. The
JSON file is now the single source of truth; wiring tests pin its
contents. Eliminates the two-staleness-vectors antipattern that had
the heredoc and the JSON file diverging silently.

17 new wiring tests pin chain ordering, Catch semantics, command
shape (--cadence daily, --alert, dashboard EC2, git pull before run),
ResultPath isolation, and instance targeting (dashboard EC2 not
trading EC2). 519 total passing.

Requires: alpha-engine PR adding ec2_instance_id to the daemon's EOD
SF input (DailySubstrateHealthCheck targets $.ec2_instance_id, which
the daemon trigger now populates with the dashboard EC2 instance id).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 10, 2026
…e seed + backfill (#206)

* feat(signal_returns): write calibrator-v1 context on score_performance seed + backfill

Root-cause closure for the 2026-05-09 Saturday SF evaluator P0
(weight_optimizer ERROR: "None of [Index(['quant_score','qual_score'])]
are in the [columns]"; auto-rollback Sharpe -42.2% vs baseline).

Producer audit revealed two parallel writers diverged silently after
research migration #12 (2026-05-08):
  * scoring/performance_tracker.py::record_new_buy_scores writes ALL 5
    canonical context columns — but has zero production callers.
  * collectors/signal_returns.py::_seed_score_performance is the actual
    production writer (runs weekly in DataPhase1) and only wrote
    (symbol, score_date, score, price_on_date). The 5 canonical
    columns (quant_score, qual_score, conviction, sector_modifier,
    market_regime) were never populated.

Single-fact-single-writer rebuild:
  * _seed_score_performance now extracts the 5 context fields from the
    same signals.json payload that drives the BUY filter — single
    source-of-truth fetch per signals.json, no second round-trip.
  * New _backfill_score_context repairs legacy rows whose canonical
    columns are NULL. UPDATE-WHERE-NULL so re-runs are no-ops once
    every row has a source.
  * _ensure_score_performance_schema mirrors research migration #12
    defensively in case DataPhase1 ever fires against a fresh
    research.db before research's cold-start migrations run.
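A minimal sketch of the UPDATE-WHERE-NULL idea, using an in-memory SQLite table. The column types, the demo table, and the function signature are assumptions for illustration; only the table name and the five context column names come from the commit message, and this is not the body of _backfill_score_context.

```python
import sqlite3

CONTEXT_COLUMNS = ("quant_score", "qual_score", "conviction", "sector_modifier", "market_regime")


def create_demo_table(conn: sqlite3.Connection) -> None:
    # Column types are assumed for the demo.
    conn.execute(
        "CREATE TABLE score_performance (symbol TEXT, score_date TEXT, score REAL, "
        "price_on_date REAL, quant_score REAL, qual_score REAL, conviction TEXT, "
        "sector_modifier REAL, market_regime TEXT)"
    )


def backfill_score_context(conn: sqlite3.Connection, symbol: str, score_date: str, context: dict) -> int:
    """Fill NULL context columns for one row; return the number of rows updated."""
    # COALESCE keeps any value that is already present and fills only the NULLs.
    assignments = ", ".join(f"{c} = COALESCE({c}, :{c})" for c in CONTEXT_COLUMNS)
    # Rows with no NULL context column are skipped, so re-runs become no-ops.
    null_check = " OR ".join(f"{c} IS NULL" for c in CONTEXT_COLUMNS)
    params = {c: context.get(c) for c in CONTEXT_COLUMNS}
    params.update(symbol=symbol, score_date=score_date)
    cur = conn.execute(
        f"UPDATE score_performance SET {assignments} "
        f"WHERE symbol = :symbol AND score_date = :score_date AND ({null_check})",
        params,
    )
    return cur.rowcount
```

Calling the function twice with a complete context updates the legacy row once and then matches nothing on the second call.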

Composes with backtester #176 (PR-day consumer-side coalesce fix). With
this PR the producer becomes authoritative; the next backtester PR can
retire the S3 round-trip in weight_optimizer.load_with_subscores.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(signal_returns): drift gate — canonical context coverage CW gauge

Locks the producer-side contract established in the previous commit:
after seed + backfill complete, query score_performance for rows with
score_date >= 2026-05-17 (first Sat SF after this PR merges) and emit
the coverage percentage as a CloudWatch gauge:

  AlphaEngine/Data/score_performance_canonical_coverage_pct

Coverage = fraction of post-cutover rows with ALL 5 canonical context
columns populated (quant_score, qual_score, conviction,
sector_modifier, market_regime). 100% is the contract; the gauge is
always emitted (including 100.0) so alarm baselines stay continuous.
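The coverage definition above can be sketched as a pure function over already-fetched rows. The column tuple mirrors the list in the commit message (the source names it _CANONICAL_CONTEXT_COLUMNS); the row-dict shape, the ISO-string date comparison, and returning 100.0 when there are no post-cutover rows are assumptions made for this illustration.

```python
CANONICAL_CONTEXT_COLUMNS = (
    "quant_score", "qual_score", "conviction", "sector_modifier", "market_regime",
)
CUTOVER_DATE = "2026-05-17"  # ISO dates compare correctly as strings


def canonical_coverage_pct(rows: list[dict], cutover: str = CUTOVER_DATE) -> float:
    """Percentage of post-cutover rows with all canonical context columns populated."""
    post_cutover = [r for r in rows if r["score_date"] >= cutover]
    if not post_cutover:
        # Assumption: with nothing to check, report full coverage so the gauge
        # is still emitted and the alarm baseline stays continuous.
        return 100.0
    complete = sum(
        1
        for r in post_cutover
        if all(r.get(c) is not None for c in CANONICAL_CONTEXT_COLUMNS)
    )
    return 100.0 * complete / len(post_cutover)
```

A row counts as covered only when all five columns are non-NULL, so one missing field on one row is enough to pull the gauge below 100.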

Mirrors the chronic-gap drift detection pattern at
weekly_collector.py:_check_chronic_gap_polygon_recovery — same
best-effort emit, same observability-not-load-bearing posture. A
follow-up alpha-engine-lib transparency_inventory entry can wire this
into the substrate health alarm if desired; the metric itself is the
drift signal.

Tripwire test asserts _CANONICAL_CONTEXT_COLUMNS stays in lockstep
with the seed INSERT — adding a 6th column to one without the other
would make the drift gate blind to that field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>