feat(sf): coverage-gap self-heal between Predictor and executor by cipher813 · Pull Request #72 · cipher813/alpha-engine-data

cipher813 · 2026-04-20T21:34:41Z

Phase 2 of the Research↔Predictor coverage-gap closure. Pairs with:

alpha-engine-predictor #42: --tickers flag + check_coverage action
alpha-engine #72: executor read-time guard + CloudWatch metric

Summary

Adds 5 new states between PredictorInference and PredictorHealthCheck:

CheckPredictorCoverage: Lambda invoke with action=check_coverage, returns {missing_tickers, has_gap, ...}
CoverageGapChoice: if has_gap=true, go to ReinvokePredictor; else straight to PredictorHealthCheck
ReinvokePredictor: Lambda invoke with action=predict + tickers=$.missing_tickers. Predictor PR #42 merges these predictions into the existing predictions/{date}.json.
RecheckCoverage: second coverage check after re-invoke
FinalCoverageGate: if the gap STILL exists after one re-run, fail hard. No infinite-loop path.
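The control flow of the five states can be sketched in plain Python. This is only an illustration of the single-retry shape, not the state machine definition itself: `run_self_heal`, `CoverageGapError`, and the two callables are hypothetical names standing in for the Lambda invocations.

```python
from typing import Callable, Dict, List


class CoverageGapError(RuntimeError):
    """Hypothetical error: the gap survived the single re-invoke."""


def run_self_heal(
    check_coverage: Callable[[], Dict],
    predict: Callable[[List[str]], None],
) -> str:
    """Mirror the five new states; returns the name of the next state."""
    first = check_coverage()                  # CheckPredictorCoverage
    if not first["has_gap"]:                  # CoverageGapChoice (default branch)
        return "PredictorHealthCheck"

    predict(first["missing_tickers"])         # ReinvokePredictor (exactly once)

    second = check_coverage()                 # RecheckCoverage
    if second["has_gap"]:                     # FinalCoverageGate
        raise CoverageGapError(
            f"still unscored after one retry: {second['missing_tickers']}"
        )
    return "PredictorHealthCheck"
```

Because `predict` is called at most once and there is no loop, a persistent gap can only end in a hard failure, which is the property the PR relies on.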

Why

2026-04-20: executor bought SNDK/WDC/BIIB/XEL at market open. 4 of 5 live entries bypassed the GBM veto gate because Research had produced buy_candidates that the first PredictorInference run didn't score. The invariant "every buy_candidate has a prediction before the executor sees signals.json" was enforced nowhere. This PR enforces it in orchestration.
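A minimal sketch of the invariant as a check, assuming the buy candidates are a list of ticker strings and the predictions are a mapping keyed by ticker. The function name and both input shapes are assumptions for illustration; only the `missing_tickers` and `has_gap` keys come from the PR description.

```python
from typing import Dict, Iterable


def check_coverage(buy_candidates: Iterable[str], predictions: Dict[str, dict]) -> Dict:
    """Report every buy candidate that has no prediction entry."""
    missing = sorted({t for t in buy_candidates if t not in predictions})
    return {"missing_tickers": missing, "has_gap": bool(missing)}


# Placeholder tickers and values, not real data.
result = check_coverage(["AAA", "BBB", "CCC"], {"AAA": {"score": 0.1}})
print(result)
```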

Validation

JSON parses
All 24 states have valid Next/Default/Catch targets (no missing, no unreachable)
Single-retry design prevents runaway Lambda invocations
Deploy + dry-run on tomorrow's weekday SF execution (blocked on predictor #42 + executor #72 merging first)
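The reference check in the second item could look roughly like the sketch below. It assumes a flat Amazon States Language definition (no nested Parallel or Map branches) and is not the script actually used for this PR. The example definition is a cut-down stand-in for the real 24-state machine: the `has_gap` variable path and the Succeed/Fail terminal states are assumptions.

```python
from typing import List, Set, Tuple


def targets_of(state: dict) -> List[str]:
    """Collect every transition target named by one state."""
    out = [state[k] for k in ("Next", "Default") if k in state]
    out += [rule["Next"] for rule in state.get("Choices", []) if "Next" in rule]
    out += [c["Next"] for c in state.get("Catch", []) if "Next" in c]
    return out


def validate(definition: dict) -> Tuple[Set[str], Set[str]]:
    """Return (missing targets, unreachable states)."""
    states = definition["States"]
    missing = {t for s in states.values() for t in targets_of(s) if t not in states}
    seen: Set[str] = set()
    stack = [definition["StartAt"]]
    while stack:  # depth-first walk from StartAt
        name = stack.pop()
        if name in seen or name not in states:
            continue
        seen.add(name)
        stack.extend(targets_of(states[name]))
    return missing, set(states) - seen


catch_all = [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}]
gap = {"Variable": "$.coverage_result.Payload.has_gap", "BooleanEquals": True}
EXAMPLE = {
    "StartAt": "PredictorInference",
    "States": {
        "PredictorInference": {"Type": "Task", "Next": "CheckPredictorCoverage"},
        "CheckPredictorCoverage": {"Type": "Task", "Next": "CoverageGapChoice",
                                   "Catch": catch_all},
        "CoverageGapChoice": {"Type": "Choice", "Default": "PredictorHealthCheck",
                              "Choices": [{**gap, "Next": "ReinvokePredictor"}]},
        "ReinvokePredictor": {"Type": "Task", "Next": "RecheckCoverage"},
        "RecheckCoverage": {"Type": "Task", "Next": "FinalCoverageGate"},
        "FinalCoverageGate": {"Type": "Choice", "Default": "PredictorHealthCheck",
                              "Choices": [{**gap, "Next": "HandleFailure"}]},
        "PredictorHealthCheck": {"Type": "Succeed"},  # stand-in terminal state
        "HandleFailure": {"Type": "Fail"},            # stand-in terminal state
    },
}
print(validate(EXAMPLE))
```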

Deploy order

alpha-engine-predictor #42 (merge + deploy Lambda; the check_coverage action must exist before the SF calls it)
alpha-engine #72 (merge + deploy executor)
This PR (merge + deploy SF definition)

Deploying out of order leaves a window where the SF calls a Lambda action that doesn't exist → SF failure on the next weekday run.

Status

Draft until predictor #42 is merged + the Lambda live alias is bumped to include check_coverage. Promoting out of draft is the signal for the SF deploy.

🤖 Generated with Claude Code

Closes the Research↔Predictor coverage gap at the orchestration layer (Phase 2). Pairs with alpha-engine-predictor PR #42 (--tickers flag + check_coverage action) and alpha-engine PR #72 (executor read-time guard).

Problem
-------
2026-04-20: executor daemon bought SNDK/WDC/BIIB/XEL at market open despite 7 buy_candidates having no prediction row. GBM veto gate was structurally unreachable for those tickers (no prediction → no veto). 4 of 5 live entries (~80% of capital) routed around a risk control.

Architecture
------------
The invariant is "every buy_candidate must have a prediction before the executor sees signals.json". Previously enforced nowhere. Now enforced in two layers:

- **Self-heal (this PR):** PredictorInference → CheckPredictorCoverage → (if gap) ReinvokePredictor with tickers=missing → RecheckCoverage → (if STILL gap) HandleFailure. Single retry, no infinite loop.
- **Defense-in-depth (predictor #42 + executor #72):** both predictor write-time and executor read-time refuse to proceed on a coverage gap. These fire if the self-heal mechanism above ever regresses.

State graph added
-----------------
    PredictorInference
    └→ CheckPredictorCoverage (new, Lambda action=check_coverage)
       └→ CoverageGapChoice (new)
          ├─ has_gap=true → ReinvokePredictor (new, Lambda action=predict
          │                 + tickers=$.coverage_result.Payload.missing_tickers)
          │    └→ RecheckCoverage (new)
          │       └→ FinalCoverageGate (new)
          │          ├─ still has_gap → HandleFailure
          │          └─ default → PredictorHealthCheck
          └─ default → PredictorHealthCheck

All state references validated: 24 states total, no missing Next targets, no unreachable states.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 4 (CW alarm) of the Research↔Predictor coverage-gap closure. Bundled into the same PR as Phase 2 (SF self-heal) since both are infra living in this repo.

New alarm
---------
- Namespace: AlphaEngine/Predictor
- MetricName: unscored_buy_candidates_count
- Emitted by executor's signal_reader on every run (0 on success, >0 on gap)
- Threshold: Maximum ≥ 1 over any 1-hour window
- Action: existing alpha-engine-alerts SNS topic
- TreatMissingData: notBreaching (executor is off-hours M-F only)

Semantics
---------
A positive value means the SF self-heal (CheckPredictorCoverage → ReinvokePredictor) failed to close the gap before the executor read predictions.json: either orchestration regressed or a ticker is genuinely un-scorable. Long-term regression guard for the coverage invariant.

cfn-lint clean (only pre-existing W2001 warnings on unused parameters).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
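The alarm rule described in this commit message (Maximum ≥ 1 over the window, missing data treated as not breaching) reduces to a small predicate. The sketch below models only that rule for a single evaluation window. It is a simplification, not CloudWatch's full evaluation logic, and `alarm_breaches` is a hypothetical name.

```python
from typing import Sequence


def alarm_breaches(datapoints: Sequence[float], threshold: float = 1.0) -> bool:
    """True when the window's Maximum is >= threshold.

    An empty window models TreatMissingData=notBreaching: no data, no alarm.
    """
    if not datapoints:
        return False
    return max(datapoints) >= threshold


print(alarm_breaches([]))         # executor did not run in this window
print(alarm_breaches([0, 0]))     # runs with full coverage
print(alarm_breaches([0, 3]))     # one run saw 3 unscored candidates
```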

…deploy script (#75)

* fix(cf): recover orchestration stack from ROLLBACK_COMPLETE + harden deploy script

Addresses the 2026-04-20 incident surfaced during the deploy-drift bootstrap run: the alpha-engine-orchestration CloudFormation stack has been sitting in ROLLBACK_COMPLETE since 21:04 UTC today (stack creation attempt rolled back when it hit "State machine already exists" on the SaturdayPipeline resource; every state machine, EventBridge rule, Scheduler, SNS topic, and most alarms were created directly via AWS CLI earlier in the system's life, not through CloudFormation).

Consequences of the stack being in ROLLBACK_COMPLETE since today:

- UnscoredBuyCandidatesGap alarm from PR #72 was never created.
- git-sha tag from PR #74 was never applied.
- Drift-check Lambda action (deploy_drift.check_deploy_drift) reads `_read_stack_tag → None`, which currently degrades to has_drift=false. So the freshly-shipped drift architecture is silently blind to the broken-stack case. Hardening of that probe lives in a sibling PR on alpha-engine-predictor.
- Previous deploy-infrastructure.sh silently swallowed the error in the update-stack branch, exactly the no_silent_fails pattern we were fighting. Fixed here.

Changes
-------
1. `cloudformation/resources-to-import.json` (new): lists the 15 pre-existing resources the stack needs to adopt via CloudFormation import change-set: SNS topic + subscription, 2 state machines, 3 EventBridge rules, Scheduler schedule, 6 pre-existing alarms. Physical IDs pulled from live AWS probes. Two template resources are intentionally NOT in this list (ResearchAlertsPermission, UnscoredBuyCandidatesGap); those are created fresh in step 4 of the recovery runbook.
2. `cloudformation/RECOVERY.md` (new): step-by-step runbook for the import change-set flow: delete-stack → create-change-set (IMPORT) → execute-change-set → deploy-infrastructure.sh. Includes verify commands and a note on keeping resources-to-import.json current when new resources are added to the template.
3. `deploy-infrastructure.sh` hardening:
   - Detect terminal stack states (ROLLBACK_COMPLETE, ROLLBACK_FAILED, UPDATE_ROLLBACK_FAILED, CREATE_FAILED, DELETE_FAILED) up-front and exit non-zero with a pointer to RECOVERY.md. Prevents a broken stack from silently re-entering the update path on every deploy.
   - Replace the `|| echo "no updates needed"` silent swallow with a real error check: only "No updates are to be performed" stderr is an acceptable no-op. Every other update-stack failure (IAM denial, template validation, resource conflicts) now exits non-zero.
   - Wait on stack-update-complete when update-stack succeeds, so the deploy script's exit code reflects actual stack state.

No tests added: infrastructure shell script with no Python surface. Validation is the recovery procedure itself + deploy-infrastructure.sh exercising the new error paths on its next run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cf): resources-to-import completeness + RECOVERY.md gotchas

Found-during-execution fixes from tonight's recovery run:

1. `resources-to-import.json`:
   - `AWS::SNS::Subscription` primary identifier is `Arn`, not `SubscriptionArn` (AWS rejected at CreateChangeSet validation).
   - `AWS::Events::Rule` primary identifier is `Arn`, not `Name`.
   - `AWS::Scheduler::Schedule` primary identifier IS `Name` (inconsistent with most AWS services that prefer ARN).
   - Added 4 heartbeat alarms that were in the template but missed from the initial list: BacktesterHeartbeat, ExecutorEodHeartbeat, PredictorTrainingHeartbeat, RAGIngestionHeartbeat.
   - Added ResearchAlertsErrors: initial probe used the wrong name (alpha-research-alerts-errors by analogy to the EventBridge rule) when the template's AlarmName is alpha-engine-research-alerts-errors. Caused one rollback cycle before I spotted it.
2. `RECOVERY.md`: new Gotchas section documenting all four AWS behaviors that aren't obvious from the CloudFormation docs:
   - Per-resource-type identifier naming
   - DeletionPolicy: Retain required on imported resources
   - Outputs: forbidden in import template
   - Probe resource names against template Properties, not from convention

Post-recovery state (verified):

- Stack: UPDATE_COMPLETE
- git-sha tag: present, matches the PR commit
- UnscoredBuyCandidatesGap alarm: exists, threshold=1.0 on AlphaEngine/Predictor/unscored_buy_candidates_count
- All 22 template resources tracked by the stack

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
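The two exit-code rules added to `deploy-infrastructure.sh` can be restated as pure functions. The real script is shell. This Python version is only a re-expression of the decision logic described in the commit message, and the function names are hypothetical.

```python
# Terminal states listed in the commit message: the stack cannot be updated
# from these, so the deploy should stop and point at RECOVERY.md.
TERMINAL_STATES = frozenset({
    "ROLLBACK_COMPLETE",
    "ROLLBACK_FAILED",
    "UPDATE_ROLLBACK_FAILED",
    "CREATE_FAILED",
    "DELETE_FAILED",
})

NO_OP_MARKER = "No updates are to be performed"


def preflight_exit_code(stack_status: str) -> int:
    """Non-zero when the stack is in a state that needs manual recovery."""
    return 1 if stack_status in TERMINAL_STATES else 0


def update_exit_code(returncode: int, stderr: str) -> int:
    """Map an update-stack result to the script's exit code.

    Only the explicit "no updates" message counts as an acceptable no-op;
    every other failure keeps its non-zero code instead of being swallowed.
    """
    if returncode == 0:
        return 0
    if NO_OP_MARKER in stderr:
        return 0
    return returncode
```

Matching on the specific no-op message rather than accepting any failure keeps IAM denials and template validation errors visible.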

cipher813 and others added 3 commits April 20, 2026 14:28

Merge branch 'main' into feat/sf-coverage-gap-choice

33df35c

cipher813 marked this pull request as ready for review April 20, 2026 22:50

cipher813 merged commit 77e0956 into main Apr 20, 2026
1 check passed

cipher813 deleted the feat/sf-coverage-gap-choice branch April 20, 2026 22:53

cipher813 mentioned this pull request Apr 20, 2026

fix(cf): recover orchestration stack from ROLLBACK_COMPLETE + harden deploy script #75

Merged

cipher813 merged 3 commits into main from feat/sf-coverage-gap-choice
