Skip to content

feat(drift): stamp git-sha at deploy + DeployDriftCheck SF gate (Phase 2+3)#74

Merged
cipher813 merged 1 commit into
mainfrom
feat/sf-deploy-drift-check
Apr 20, 2026
Merged

feat(drift): stamp git-sha at deploy + DeployDriftCheck SF gate (Phase 2+3)#74
cipher813 merged 1 commit into
mainfrom
feat/sf-deploy-drift-check

Conversation

@cipher813
Copy link
Copy Markdown
Owner

Phase 2+3 of the deploy-drift architecture. Pairs with alpha-engine-predictor #44 (Phase 1 preflight + Phase 2+3 Lambda-side probe).

Summary

  • deploy-infrastructure.sh stamps git SHA into BOTH Step Function Comment fields ([git:<sha>] prefix, re-stamp-idempotent) and CloudFormation stack tags (`Key=git-sha,Value=`).
  • New DeployDriftCheck first state in the weekday SF invokes the predictor Lambda's action=check_deploy_drift, which returns has_drift/sf_drift/cf_drift + diagnostic fields.
  • New DeployDriftGate Choice state routes to HandleFailure on drift; else falls through to prior StartAt (StartExecutorEC2).

Why

2026-04-20 evening exposed that predictor Lambda auto-deploys on merge but SF + CF need manual `deploy-infrastructure.sh`. Impossible to track by hand. Structural fix: stamp artifacts, verify at preflight.

Degraded modes (explicitly non-blocking)

  • Missing SF stamp (legacy deploy predates this PR) → no drift flag
  • Missing CF tag (legacy stack) → no drift flag
  • GitHub API outage → no drift flag

These matter because the bootstrap deploy of this PR itself won't have the stamps in the currently-running artifacts. First deploy of `deploy-infrastructure.sh` with this PR stamps forward; subsequent drift is caught.

Test plan

  • SF JSON validates: 26 states, no dangling refs, no unreachable
  • Python JSON-munging for stamp injection is idempotent (re-running deploy.sh strips prior prefix before re-prepending)
  • Bootstrap deploy: run `bash infrastructure/deploy-infrastructure.sh` once to ship the stamped SF + tagged stack
  • First live validation: next weekday SF run should log `DeployDriftCheck` with has_drift=false
  • Negative validation: before calling this fully battle-tested, manually tweak the SF Comment prefix to a non-matching SHA and confirm the SF fails at DeployDriftGate

Rollout order

  1. Merge alpha-engine-predictor Migrate DataPhase1+RAGIngestion (bundled) and DriftDetection to spot #44 first (adds `action=check_deploy_drift` to the Lambda)
  2. Wait for auto-deploy to publish new predictor Lambda version
  3. Then merge this PR + run `bash infrastructure/deploy-infrastructure.sh` once to stamp

Merging this PR before #44 leaves a window where SF calls a Lambda action that doesn't exist → DeployDriftCheck task fails → HandleFailure fires → SF pipeline blocked.

🤖 Generated with Claude Code

Pairs with alpha-engine-predictor PR #44 (preflight + check_deploy_drift
Lambda action). Together these close the deploy-drift visibility gap
that made the 2026-04-20 coverage-gap session unmanageable: two deploy
paths (auto CI vs manual deploy-infrastructure.sh) with no way to tell
which side had shipped without diffing SHAs by hand.

Stamp at deploy
---------------
1. `deploy-infrastructure.sh` now reads `$GITHUB_SHA` (CI) or
   `git rev-parse HEAD` (local), injects `[git:<sha>]` prefix into the
   top-level `Comment` field of both step_function.json +
   step_function_daily.json before upload / update-state-machine.
   Re-stamping strips any prior `[git:…]` so it's idempotent.
2. CloudFormation stack gets `--tags Key=git-sha,Value=<sha>` on both
   create-stack and update-stack paths.

SF gate
-------
- New `DeployDriftCheck` state as the first state (was StartExecutorEC2).
  Invokes predictor Lambda `action=check_deploy_drift` which returns
  `{has_drift, sf_drift, cf_drift, upstream_sha, sf_sha, stack_sha, ...}`.
- New `DeployDriftGate` Choice state: if `has_drift=true`, route to
  HandleFailure. Else fall through to StartExecutorEC2 (prior StartAt).
- Degraded modes (missing stamps on legacy artifacts, GitHub outage)
  set has_drift=false so the gate doesn't block recoverable scenarios.

State graph: 26 states total (was 24). All Next/Default/Catch refs
resolve; no unreachable states.

Bootstrap
---------
First deploy via `bash infrastructure/deploy-infrastructure.sh` is the
one time this check CAN'T catch its own absence — but that's inherent
to any self-detecting system. After that single bootstrap, every
subsequent drift surfaces at the next weekday SF run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 marked this pull request as ready for review April 20, 2026 23:29
@cipher813 cipher813 merged commit accc240 into main Apr 20, 2026
1 check passed
@cipher813 cipher813 deleted the feat/sf-deploy-drift-check branch April 20, 2026 23:29
cipher813 added a commit that referenced this pull request Apr 21, 2026
…deploy script (#75)

* fix(cf): recover orchestration stack from ROLLBACK_COMPLETE + harden deploy script

Addresses the 2026-04-20 incident surfaced during the deploy-drift
bootstrap run: the alpha-engine-orchestration CloudFormation stack has
been sitting in ROLLBACK_COMPLETE since 21:04 UTC today (stack creation
attempt rolled back when it hit "State machine already exists" on the
SaturdayPipeline resource — every state machine, EventBridge rule,
Scheduler, SNS topic, and most alarms were created directly via AWS
CLI earlier in the system's life, not through CloudFormation).

Consequences of the stack being in ROLLBACK_COMPLETE since today:
- UnscoredBuyCandidatesGap alarm from PR #72 was never created.
- git-sha tag from PR #74 was never applied.
- Drift-check Lambda action (deploy_drift.check_deploy_drift) reads
  `_read_stack_tag → None`, which currently degrades to has_drift=false.
  So the freshly-shipped drift architecture is silently blind to the
  broken-stack case. Hardening of that probe lives in a sibling PR on
  alpha-engine-predictor.
- Previous deploy-infrastructure.sh silently swallowed the error in
  the update-stack branch — exactly the no_silent_fails pattern we
  were fighting. Fixed here.

Changes
-------

1. `cloudformation/resources-to-import.json` (new) — lists the 15
   pre-existing resources the stack needs to adopt via CloudFormation
   import change-set: SNS topic + subscription, 2 state machines, 3
   EventBridge rules, Scheduler schedule, 6 pre-existing alarms.
   Physical IDs pulled from live AWS probes. Two template resources
   are intentionally NOT in this list (ResearchAlertsPermission,
   UnscoredBuyCandidatesGap) — those are created fresh in step 4 of
   the recovery runbook.

2. `cloudformation/RECOVERY.md` (new) — step-by-step runbook for the
   import change-set flow: delete-stack → create-change-set
   (IMPORT) → execute-change-set → deploy-infrastructure.sh. Includes
   verify commands and a note on keeping resources-to-import.json
   current when new resources are added to the template.

3. `deploy-infrastructure.sh` hardening:
   - Detect terminal stack states (ROLLBACK_COMPLETE, ROLLBACK_FAILED,
     UPDATE_ROLLBACK_FAILED, CREATE_FAILED, DELETE_FAILED) up-front and
     exit non-zero with a pointer to RECOVERY.md. Prevents a broken
     stack from silently re-entering the update path on every deploy.
   - Replace the `|| echo "no updates needed"` silent swallow with a
     real error check: only "No updates are to be performed" stderr is
     an acceptable no-op. Every other update-stack failure (IAM denial,
     template validation, resource conflicts) now exits non-zero.
   - Wait on stack-update-complete when update-stack succeeds, so the
     deploy script's exit code reflects actual stack state.

No tests added — infrastructure shell script with no Python surface.
Validation is the recovery procedure itself + deploy-infrastructure.sh
exercising the new error paths on its next run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cf): resources-to-import completeness + RECOVERY.md gotchas

Found-during-execution fixes from tonight's recovery run:

1. `resources-to-import.json`:
   - `AWS::SNS::Subscription` primary identifier is `Arn`, not
     `SubscriptionArn` (AWS rejected at CreateChangeSet validation).
   - `AWS::Events::Rule` primary identifier is `Arn`, not `Name`.
   - `AWS::Scheduler::Schedule` primary identifier IS `Name` (inconsistent
     with most AWS services that prefer ARN).
   - Added 4 heartbeat alarms that were in the template but missed from
     the initial list: BacktesterHeartbeat, ExecutorEodHeartbeat,
     PredictorTrainingHeartbeat, RAGIngestionHeartbeat.
   - Added ResearchAlertsErrors: initial probe used the wrong name
     (alpha-research-alerts-errors by analogy to the EventBridge rule)
     when the template's AlarmName is alpha-engine-research-alerts-errors.
     Caused one rollback cycle before I spotted it.

2. `RECOVERY.md` — new Gotchas section documenting all four AWS
   behaviors that aren't obvious from the CloudFormation docs:
   - Per-resource-type identifier naming
   - DeletionPolicy: Retain required on imported resources
   - Outputs: forbidden in import template
   - Probe resource names against template Properties, not from
     convention

Post-recovery state (verified):
- Stack: UPDATE_COMPLETE
- git-sha tag: present, matches the PR commit
- UnscoredBuyCandidatesGap alarm: exists, threshold=1.0 on
  AlphaEngine/Predictor/unscored_buy_candidates_count
- All 22 template resources tracked by the stack

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant