feat(deploy): Telegram + SNS alert on 3 canary rollback paths (L221 — 2/5)#184

Merged
cipher813 merged 1 commit into main from feat/canary-rollback-alerts-l221
May 21, 2026

@cipher813
Owner

Summary

2/5 of the L221 fleet pass. Predictor's `infrastructure/deploy.sh` has 3 canary failure-exit paths — inference (L266), regime (L375), regime-eval (L473) — each gets a `python3 -m alpha_engine_lib.alerts publish` before `exit 1`.

The regime + regime-eval alerts include the names + versions of the upstream Lambdas already promoted at that point, since an image-wide bug surfacing on the later canary requires operator triage of the already-live siblings.
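A minimal sketch of how such an alert body could be assembled, assuming the deploy script knows which siblings it has already promoted. The helper `build_alert_body`, the message wording, and the Lambda names and versions are all hypothetical placeholders; the actual text passed to `alpha_engine_lib.alerts publish` is not shown in this PR description.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: build an alert body naming the failed canary plus any
# sibling Lambdas that were already promoted earlier in the same deploy.
build_alert_body() {
  local failed="$1"
  shift
  local body="Canary FAILED for ${failed}; auto-rollback triggered."
  if [ "$#" -gt 0 ]; then
    body="${body} Already promoted (needs triage):"
    local entry
    for entry in "$@"; do
      body="${body} ${entry}"
    done
  fi
  printf '%s\n' "$body"
}

# Placeholder names and versions, for illustration only.
# The last canary to run lists both upstream siblings.
regime_eval_body="$(build_alert_body "example-regime-eval" \
  "example-inference@v41" "example-regime@v17")"
# The first canary to run has no promoted siblings to report.
inference_body="$(build_alert_body "example-inference")"

echo "$regime_eval_body"
echo "$inference_body"
```

The first canary in the chain has nothing upstream to report, so its body stays short. Later canaries append each sibling that is already live.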

Implementation

3 minimal inserts; `|| true` ensures the alert publish never overrides the deploy's `exit 1`. Lib alerts v0.21.0 already in requirements.
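The best-effort pattern can be sketched in isolation as follows. Here `publish_alert` is a hypothetical stand-in for `python3 -m alpha_engine_lib.alerts publish` and is made to fail on purpose, and `canary_failure_path` is an illustrative name rather than a function from `deploy.sh`.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for the real alert CLI; it always fails here to
# simulate an alerting outage (bad credentials, network error, etc.).
publish_alert() {
  echo "alert attempted: $*" >&2
  return 42
}

# Illustrative failure path: alert best-effort, then fail the deploy.
canary_failure_path() {
  echo "canary failed for $1; rolled back"
  publish_alert "canary failed for $1" || true   # a failed publish is swallowed
  exit 1                                          # the deploy still exits 1
}

# Run the path in a subshell so its exit status can be inspected.
status=0
( canary_failure_path "inference" ) || status=$?
echo "deploy exit status: $status"
```

Even though the stand-in publisher returns 42, the subshell exits with 1. If the script runs under `set -e`, dropping the `|| true` would let a failing publish abort the script with the publisher's exit code instead of the intended `exit 1`.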

Fleet pass scope

| repo | sites | status |
|---|---|---|
| alpha-engine-research | 1 | #216 |
| alpha-engine-predictor | 3 | this PR |
| alpha-engine-data | 6 total (1 main + 5 sub-lambdas) | TBD |
| alpha-engine-backtester | 3 (health / counterfactual / concordance) | TBD |
| alpha-engine-dashboard | n/a | n/a |

Test plan

  • `bash -n infrastructure/deploy.sh` clean
  • Pre-commit hooks pass
  • First production exercise on next canary failure

🤖 Generated with Claude Code

Independent-channel surveillance on the 3 canary failure paths
(inference / regime / regime-eval) so a silent auto-rollback is no
longer the failure mode. The 2-day silent rollback chain in the
alpha-engine-data #274 retrospective is the recurrence class.

The regime + regime-eval alerts include the names + versions of the
upstream Lambdas already promoted at that point — operator triage
information critical when an image-wide bug surfaces only on the
later canary. Best-effort; trailing || true never overrides exit 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit d1b39e3 into main May 21, 2026
1 check passed
@cipher813 cipher813 deleted the feat/canary-rollback-alerts-l221 branch May 21, 2026 23:43
cipher813 added a commit that referenced this pull request May 22, 2026
…b pin v0.17.0→v0.24.0 (#185)

Adds ``--dedup-key "canary-fail-${LAMBDA}-v${VERSION}"`` to all three
canary-failure alert publish calls in ``infrastructure/deploy.sh``
(inference Lambda + regime Lambda + regime-eval Lambda). An image-wide
rebuild that breaks N Lambdas' canaries within the hour now collapses
to one alert per (Lambda, version) — closes the N-emails-for-one-event
class on the 2026-05-21 L221 canary-alert surveillance shipped 5/21
late evening (PR #184 merged 23:43Z).
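The effect of keying on (Lambda, version) can be illustrated with a local simulation. The key format is taken from the commit message above; everything else is an assumption. `publish_once` and the temp-file "seen" set are hypothetical stand-ins for whatever the library does internally, the Lambda names and versions are placeholders, and the time window ("within the hour") is not modeled.

```shell
#!/usr/bin/env bash
set -eu

# Key format from the commit message: canary-fail-${LAMBDA}-v${VERSION}.
dedup_key() {
  printf 'canary-fail-%s-v%s\n' "$1" "$2"
}

# Hypothetical stand-in for library-side dedup: remember keys in a temp file
# and only "send" the first alert seen for each key.
seen_file="$(mktemp)"
sent=0
publish_once() {
  key="$(dedup_key "$1" "$2")"
  if grep -qxF -- "$key" "$seen_file"; then
    echo "suppressed duplicate: $key"
  else
    echo "$key" >> "$seen_file"
    sent=$((sent + 1))
    echo "sent: $key"
  fi
}

publish_once "example-inference" 42
publish_once "example-inference" 42   # same (Lambda, version): suppressed
publish_once "example-regime" 42      # different Lambda: separate alert
publish_once "example-inference" 43   # new version: separate alert

rm -f "$seen_file"
echo "alerts sent: $sent"
```

Four publish attempts produce three alerts in this simulation, because the repeated key for the same Lambda and version is suppressed.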

Originally filed as a P1 owed before the 4 L221 canary PRs merged;
those PRs merged 5/21 23:42-23:51Z so this becomes a post-merge
retrofit per the ROADMAP's downgrade rule.

Lib pin v0.17.0 → v0.24.0 in lockstep across requirements.txt +
requirements-lambda.txt (drift between them shipped a stale lib to
prod 2026-05-07 per the existing comment). v0.24.0 ships the
``--dedup-key`` CLI flag this retrofit relies on; v0.18-0.23 additive
features come transitively.

Suite: 1131 passed.

Composes with [[reference_alpha_engine_lib_alerts_v0_24_0_dedup]]
(consumer migration #4 of the v0.24.0 substrate).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>