fix(cf): recover orchestration stack from ROLLBACK_COMPLETE + harden deploy script by cipher813 · Pull Request #75 · cipher813/alpha-engine-data

cipher813 · 2026-04-20T23:52:24Z

Recovery PR for the broken CloudFormation stack surfaced during tonight's deploy-drift bootstrap (commit c88ed15 ran `deploy-infrastructure.sh`, which silently succeeded despite the stack sitting in ROLLBACK_COMPLETE since 21:04 UTC today).

What's broken

`alpha-engine-orchestration` stack state: ROLLBACK_COMPLETE
Root cause: `create-stack` at 21:04 UTC rolled back on "State machine already exists" — the state machines, EventBridge rules, Scheduler, SNS topic, and 6 of 7 alarms were all created directly via AWS CLI earlier in the system's life, never through CloudFormation.
Consequences:
- `UnscoredBuyCandidatesGap` alarm from PR feat(sf): coverage-gap self-heal between Predictor and executor #72 was never created
- `git-sha` tag from PR feat(drift): stamp git-sha at deploy + DeployDriftCheck SF gate (Phase 2+3) #74 was never applied
- Deploy-drift probe (feat(drift): stamp git-sha at deploy + DeployDriftCheck SF gate (Phase 2+3) #74) reads `_read_stack_tag → None`, degrades to `has_drift=false` — silently blind to the broken-stack case. (Hardening of that probe is tracked separately on alpha-engine-predictor.)
- Previous `deploy-infrastructure.sh` silently swallowed the update-stack error — the exact `no_silent_fails` pattern we've been fighting all session.

What's in this PR

`cloudformation/resources-to-import.json` — 15 resources to adopt via CF import change-set. Physical IDs pulled from live probes tonight.
`cloudformation/RECOVERY.md` — runbook: delete-stack → create-change-set IMPORT → execute-change-set → `deploy-infrastructure.sh`.
`deploy-infrastructure.sh` hardening:
- Terminal stack states (ROLLBACK_COMPLETE etc.) now exit non-zero with a pointer to RECOVERY.md.
- Removed `|| echo "no updates needed"` silent-swallow. Only "No updates are to be performed" is an acceptable no-op; every other update-stack failure exits non-zero.
- `stack-update-complete` wait added so exit code reflects actual stack state.

Execution plan (to run after merge)

Follow `RECOVERY.md` start to finish:
```bash
cd infrastructure/cloudformation
aws cloudformation delete-stack --stack-name alpha-engine-orchestration
aws cloudformation wait stack-delete-complete --stack-name alpha-engine-orchestration
aws cloudformation create-change-set \
--stack-name alpha-engine-orchestration \
--change-set-name bootstrap-import \
--change-set-type IMPORT \
--resources-to-import file://resources-to-import.json \
--template-body file://alpha-engine-orchestration.yaml \
--capabilities CAPABILITY_NAMED_IAM

Review change-set in console, then:

aws cloudformation execute-change-set \
--stack-name alpha-engine-orchestration --change-set-name bootstrap-import
aws cloudformation wait stack-import-complete --stack-name alpha-engine-orchestration
cd ..
bash deploy-infrastructure.sh
```

Verify:
```bash
aws cloudwatch describe-alarms \
--alarm-names alpha-engine-predictor-unscored-buy-candidates \
--query "MetricAlarms[0].AlarmName" --output text
```

Tests

No Python surface → no unit tests.
Validation is the recovery procedure itself + `deploy-infrastructure.sh` hitting the new error paths on next run.

Draft status

Holding to let you eyeball the resources-to-import.json list before execution. Import is safe (read-only adoption — no resources modified) but one-way; I'd rather not surprise you at 11:50 PM.

🤖 Generated with Claude Code

…deploy script Addresses the 2026-04-20 incident surfaced during the deploy-drift bootstrap run: the alpha-engine-orchestration CloudFormation stack has been sitting in ROLLBACK_COMPLETE since 21:04 UTC today (stack creation attempt rolled back when it hit "State machine already exists" on the SaturdayPipeline resource — every state machine, EventBridge rule, Scheduler, SNS topic, and most alarms were created directly via AWS CLI earlier in the system's life, not through CloudFormation). Consequences of the stack being in ROLLBACK_COMPLETE since today: - UnscoredBuyCandidatesGap alarm from PR #72 was never created. - git-sha tag from PR #74 was never applied. - Drift-check Lambda action (deploy_drift.check_deploy_drift) reads `_read_stack_tag → None`, which currently degrades to has_drift=false. So the freshly-shipped drift architecture is silently blind to the broken-stack case. Hardening of that probe lives in a sibling PR on alpha-engine-predictor. - Previous deploy-infrastructure.sh silently swallowed the error in the update-stack branch — exactly the no_silent_fails pattern we were fighting. Fixed here. Changes ------- 1. `cloudformation/resources-to-import.json` (new) — lists the 15 pre-existing resources the stack needs to adopt via CloudFormation import change-set: SNS topic + subscription, 2 state machines, 3 EventBridge rules, Scheduler schedule, 6 pre-existing alarms. Physical IDs pulled from live AWS probes. Two template resources are intentionally NOT in this list (ResearchAlertsPermission, UnscoredBuyCandidatesGap) — those are created fresh in step 4 of the recovery runbook. 2. `cloudformation/RECOVERY.md` (new) — step-by-step runbook for the import change-set flow: delete-stack → create-change-set (IMPORT) → execute-change-set → deploy-infrastructure.sh. Includes verify commands and a note on keeping resources-to-import.json current when new resources are added to the template. 3. `deploy-infrastructure.sh` hardening: - Detect terminal stack states (ROLLBACK_COMPLETE, ROLLBACK_FAILED, UPDATE_ROLLBACK_FAILED, CREATE_FAILED, DELETE_FAILED) up-front and exit non-zero with a pointer to RECOVERY.md. Prevents a broken stack from silently re-entering the update path on every deploy. - Replace the `|| echo "no updates needed"` silent swallow with a real error check: only "No updates are to be performed" stderr is an acceptable no-op. Every other update-stack failure (IAM denial, template validation, resource conflicts) now exits non-zero. - Wait on stack-update-complete when update-stack succeeds, so the deploy script's exit code reflects actual stack state. No tests added — infrastructure shell script with no Python surface. Validation is the recovery procedure itself + deploy-infrastructure.sh exercising the new error paths on its next run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Found-during-execution fixes from tonight's recovery run: 1. `resources-to-import.json`: - `AWS::SNS::Subscription` primary identifier is `Arn`, not `SubscriptionArn` (AWS rejected at CreateChangeSet validation). - `AWS::Events::Rule` primary identifier is `Arn`, not `Name`. - `AWS::Scheduler::Schedule` primary identifier IS `Name` (inconsistent with most AWS services that prefer ARN). - Added 4 heartbeat alarms that were in the template but missed from the initial list: BacktesterHeartbeat, ExecutorEodHeartbeat, PredictorTrainingHeartbeat, RAGIngestionHeartbeat. - Added ResearchAlertsErrors: initial probe used the wrong name (alpha-research-alerts-errors by analogy to the EventBridge rule) when the template's AlarmName is alpha-engine-research-alerts-errors. Caused one rollback cycle before I spotted it. 2. `RECOVERY.md` — new Gotchas section documenting all four AWS behaviors that aren't obvious from the CloudFormation docs: - Per-resource-type identifier naming - DeletionPolicy: Retain required on imported resources - Outputs: forbidden in import template - Probe resource names against template Properties, not from convention Post-recovery state (verified): - Stack: UPDATE_COMPLETE - git-sha tag: present, matches the PR commit - UnscoredBuyCandidatesGap alarm: exists, threshold=1.0 on AlphaEngine/Predictor/unscored_buy_candidates_count - All 22 template resources tracked by the stack Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 and others added 2 commits April 20, 2026 16:51

cipher813 marked this pull request as ready for review April 21, 2026 00:33

cipher813 merged commit 909628e into main Apr 21, 2026
1 check passed

cipher813 deleted the feat/cf-stack-import-fix branch April 21, 2026 00:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cf): recover orchestration stack from ROLLBACK_COMPLETE + harden deploy script#75

fix(cf): recover orchestration stack from ROLLBACK_COMPLETE + harden deploy script#75
cipher813 merged 2 commits into
mainfrom
feat/cf-stack-import-fix

cipher813 commented Apr 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cipher813 commented Apr 20, 2026

What's broken

What's in this PR

Execution plan (to run after merge)

Review change-set in console, then:

Tests

Draft status

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant