Migrate DataPhase1+RAGIngestion (bundled) and DriftDetection to spot#44
Merged
DataPhase1 now runs on a self-terminating c5.large spot instance (same pattern as Backtester + PredictorTraining) instead of hammering the t3.micro. The micro becomes a dispatcher: pulls the latest launcher script, sources .env, invokes bash infrastructure/spot_data_phase1.sh. All heavy Python work (yfinance, polygon, FRED, ArcticDB append) runs on the spot.

Rationale: the 2026-04-16 OOM incident showed that running data-refresh workloads on a 1 GB RAM instance is fragile-by-design. Even though Saturday DataPhase1 has historically fit in micro RAM (it uses different code paths than the daily feature compute that OOM'd today), consolidating all heavy weekly compute onto self-terminating spots aligns DataPhase1 with the existing Backtester/PredictorTraining pattern and removes the 1 GB ceiling from future data-refresh growth.

Also: SaturdayHealthCheck SSM command repointed from /home/ec2-user/alpha-engine-data (health_checker.py was deleted from that repo in #43) to /home/ec2-user/alpha-engine-dashboard where it now lives. Mirrors the same fix applied to the weekday HealthCheck step in #42.

Files:
- new infrastructure/spot_data_phase1.sh (spot launcher, mirrors spot_backtest.sh)
- edit infrastructure/step_function.json (DataPhase1 + SaturdayHealthCheck commands)

Timeout bumped 1800 → 2700s to accommodate spot bootstrap overhead (~7 min for instance launch + pip install on top of ~20 min workload).

Deferred (separate PR): migrate RAGIngestion + DriftDetection to spot as well. They still run on the micro and need alpha-engine-data restored on ae-dashboard for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit to cipher813/alpha-engine-dashboard that referenced this pull request Apr 16, 2026
Reverts the removal from #19. The Saturday Step Function still runs RAGIngestion + DriftDetection on the micro from alpha-engine-data, and the new DataPhase1 spot launcher (cipher813/alpha-engine-data#44) lives at alpha-engine-data/infrastructure/spot_data_phase1.sh. The micro invokes it as a dispatcher and needs the repo checked out.

Context: #19 assumed ae-dashboard had no runtime need for alpha-engine-data once health_checker + trading_calendar moved here. That was wrong: the Saturday SF has 4 separate steps that target the micro from alpha-engine-data (DataPhase1, RAGIngestion, DriftDetection, SaturdayHealthCheck). I missed this when recommending #19.

After cipher813/alpha-engine-data#44 merges and RAG/Drift also migrate to spot (planned follow-up), alpha-engine-data may still need to be cloned on the micro for the launcher scripts but the heavy .venv becomes unnecessary. At that point this line can be removed again along with a lean-clone pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extends the DataPhase1-to-spot migration to cover all three Saturday
SF steps that were running heavy alpha-engine-data workloads on the
t3.micro:
- DataPhase1 + RAGIngestion now share a single spot instance via
spot_data_weekly.sh (renamed from spot_data_phase1.sh). Both
workloads use the same alpha-engine-data clone + pip install —
bundling saves ~7 min of bootstrap overhead and one spot request.
RAGIngestion SF state chain (RAGIngestion + WaitForRAGIngestion +
CheckRAGStatus + RAGWait + ExtractRAGError) removed; DataPhase1's
success now wires directly to Research.
- DriftDetection moves to its own spot via spot_drift_detection.sh.
Launcher clones BOTH alpha-engine-data and alpha-engine-predictor
(drift_detector lives in data/monitoring/ but imports from
predictor via PYTHONPATH). Overkill cost-wise for the ~5 min
workload (~7 min bootstrap + ~5 min work vs ~5 min on micro), but
completes the architectural goal: zero heavy venvs on the micro.
Net effect on ae-dashboard after next boot-pull:
- alpha-engine-data: cloned (for launcher scripts only, ~300 lines bash)
- alpha-engine-data/.venv: can be deleted permanently
- 0 heavy Python workloads running on the t3.micro at any point in
the Saturday pipeline
Timeout bumps:
- DataPhase1 (bundled): 2700s → 3600s (phase1 ~20min + rag ~15min + bootstrap ~7min)
- DriftDetection: 300s → 1200s (bootstrap ~7min + workload ~5min)
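
The timeout bumps above can be sanity-checked with a little arithmetic. This is a minimal sketch, not code from the PR: the minute estimates and timeout values are the ones quoted in this commit message, while the `STEPS` table and both helper functions are hypothetical names introduced for illustration.

```python
# Estimates (minutes) and timeouts (seconds) as quoted in the commit message.
# The dict layout and helper names are illustrative assumptions.
BOOTSTRAP_MIN = 7  # spot launch + pip install

STEPS = {
    "DataPhase1 (bundled)": {"work_min": [20, 15], "timeout_s": 3600},  # phase1 + rag
    "DriftDetection": {"work_min": [5], "timeout_s": 1200},
}


def expected_seconds(work_min, bootstrap_min=BOOTSTRAP_MIN):
    """Expected wall-clock time: one bootstrap plus every bundled workload."""
    return (bootstrap_min + sum(work_min)) * 60


def headroom_seconds(step):
    """Slack between the configured timeout and the expected runtime."""
    return step["timeout_s"] - expected_seconds(step["work_min"])


for name, step in STEPS.items():
    print(f"{name}: expected {expected_seconds(step['work_min'])}s, "
          f"headroom {headroom_seconds(step)}s")
```

Under these estimates the bundled step is expected to take 2520s against a 3600s timeout, and DriftDetection 720s against 1200s, so both leave several minutes of slack for slow spot fulfilment.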
SF state count: 34 → 30 (-4 RAG chain states).
Followup roadmap P2: bundle DriftDetection onto PredictorTraining's
spot since drift reads predictor weights produced by that step —
would save another bootstrap cycle.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
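
Removing a state chain and rewiring DataPhase1 straight to Research is the kind of edit that can leave dangling references or orphaned states in step_function.json. The following is a minimal sketch of a pre-deploy check, assuming a flat Amazon States Language definition (it does not descend into Parallel or Map branches). The `toy` definition is a hypothetical three-state example that reuses state names mentioned in this PR; it is not the real state machine.

```python
def transition_targets(state):
    """Every state name this state can hand off to (Next, Default, Choices, Catch)."""
    targets = []
    for key in ("Next", "Default"):
        if key in state:
            targets.append(state[key])
    for rule in state.get("Choices", []):
        if "Next" in rule:
            targets.append(rule["Next"])
    for catcher in state.get("Catch", []):
        targets.append(catcher["Next"])
    return targets


def validate(definition):
    """Return (dangling_refs, unreachable_states), both sorted lists of names."""
    states = definition["States"]
    dangling = sorted({t for s in states.values()
                       for t in transition_targets(s) if t not in states})
    # Depth-first walk from StartAt to find every reachable state.
    seen, stack = set(), [definition["StartAt"]]
    while stack:
        name = stack.pop()
        if name in seen or name not in states:
            continue
        seen.add(name)
        stack.extend(transition_targets(states[name]))
    return dangling, sorted(set(states) - seen)


# Hypothetical, heavily trimmed definition for illustration only.
toy = {
    "StartAt": "DataPhase1",
    "States": {
        "DataPhase1": {
            "Type": "Task",
            "Next": "Research",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
        },
        "Research": {"Type": "Task", "End": True},
        "HandleFailure": {"Type": "Fail"},
    },
}

print(validate(toy))
```

A leftover state from the removed chain would show up in the second list, and a reference to a deleted state in the first, so an empty pair is the condition to require before uploading.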
This was referenced Apr 16, 2026
cipher813 added a commit that referenced this pull request Apr 20, 2026
Pairs with alpha-engine-predictor PR #44 (preflight + check_deploy_drift Lambda action). Together these close the deploy-drift visibility gap that made the 2026-04-20 coverage-gap session unmanageable: two deploy paths (auto CI vs manual deploy-infrastructure.sh) with no way to tell which side had shipped without diffing SHAs by hand.

Stamp at deploy
---------------
1. `deploy-infrastructure.sh` now reads `$GITHUB_SHA` (CI) or `git rev-parse HEAD` (local), injects `[git:<sha>]` prefix into the top-level `Comment` field of both step_function.json + step_function_daily.json before upload / update-state-machine. Re-stamping strips any prior `[git:…]` so it's idempotent.
2. CloudFormation stack gets `--tags Key=git-sha,Value=<sha>` on both create-stack and update-stack paths.

SF gate
-------
- New `DeployDriftCheck` state as the first state (was StartExecutorEC2). Invokes predictor Lambda `action=check_deploy_drift` which returns `{has_drift, sf_drift, cf_drift, upstream_sha, sf_sha, stack_sha, ...}`.
- New `DeployDriftGate` Choice state: if `has_drift=true`, route to HandleFailure. Else fall through to StartExecutorEC2 (prior StartAt).
- Degraded modes (missing stamps on legacy artifacts, GitHub outage) set has_drift=false so the gate doesn't block recoverable scenarios.

State graph: 26 states total (was 24). All Next/Default/Catch refs resolve; no unreachable states.

Bootstrap
---------
First deploy via `bash infrastructure/deploy-infrastructure.sh` is the one time this check CAN'T catch its own absence, but that's inherent to any self-detecting system. After that single bootstrap, every subsequent drift surfaces at the next weekday SF run.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
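
The stamping and gating behaviour described in this commit can be sketched in a few lines. This is an illustration under stated assumptions, not the code in deploy-infrastructure.sh or the predictor Lambda: the function names `stamp_comment` and `check_deploy_drift` and the exact comparison rule are hypothetical, and only the `[git:<sha>]` prefix, the `Comment` field, and the returned field names come from the text above.

```python
import re

# Matches a leading "[git:...]" stamp plus any whitespace after it.
_STAMP = re.compile(r"^\[git:[^\]]*\]\s*")


def stamp_comment(definition, sha):
    """Return a copy of the definition whose Comment starts with [git:<sha>].

    Any earlier stamp is stripped first, so re-stamping is idempotent.
    """
    stamped = dict(definition)
    bare = _STAMP.sub("", definition.get("Comment", ""))
    stamped["Comment"] = f"[git:{sha}] {bare}".rstrip()
    return stamped


def check_deploy_drift(upstream_sha, sf_sha, stack_sha):
    """Assumed rule: a side has drifted when its stamp differs from upstream.

    Degraded inputs (no upstream SHA, or a missing stamp) count as no drift,
    mirroring the "degraded modes set has_drift=false" behaviour.
    """
    def drifted(deployed_sha):
        if not upstream_sha or not deployed_sha:
            return False
        return deployed_sha != upstream_sha

    sf_drift, cf_drift = drifted(sf_sha), drifted(stack_sha)
    return {
        "has_drift": sf_drift or cf_drift,
        "sf_drift": sf_drift,
        "cf_drift": cf_drift,
        "upstream_sha": upstream_sha,
        "sf_sha": sf_sha,
        "stack_sha": stack_sha,
    }


once = stamp_comment({"Comment": "Saturday pipeline"}, "abc123")
twice = stamp_comment(once, "def456")
print(once["Comment"])
print(twice["Comment"])
print(check_deploy_drift("def456", "def456", "abc123")["has_drift"])
```

Treating missing data as "no drift" keeps the gate from blocking a run when the check itself cannot be trusted, at the cost of silently passing unstamped legacy artifacts.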
Summary
Moves all three heavy Saturday-pipeline SSM steps off ae-dashboard (t3.micro, 1 GB RAM) onto self-terminating spot EC2 instances. Completes the producer/observability separation started in the DailyData move earlier today — zero heavy Python workloads run on the micro at any point in the Saturday pipeline.
Changes
Workload migrations (3 steps → 2 spots)
Bundling rationale: DataPhase1 + RAGIngestion run back-to-back on the same repo (alpha-engine-data) with the same pip install. Sharing one spot saves ~7 min of bootstrap overhead vs two separate spots. Trade-off: any failure fails the bundle — acceptable since partial Saturday failures typically require a full-pipeline rerun anyway.
New launcher scripts (~280-300 lines each, mirror spot_backtest.sh)
Step Function edits
Net effect on ae-dashboard
Pre-merge / deploy checklist
```
ae-dashboard "cd /home/ec2-user/alpha-engine-data && export HOME=/home/ec2-user && bash infrastructure/spot_data_weekly.sh --smoke-only"
ae-dashboard "cd /home/ec2-user/alpha-engine-data && export HOME=/home/ec2-user && bash infrastructure/spot_drift_detection.sh --smoke-only"
```
Each launches a spot, validates imports + `--dry-run`, terminates. Catches IAM/AMI/subnet/PAT regressions before the real run.
Followup (roadmap P2)
Bundle DriftDetection onto PredictorTraining's spot: drift reads predictor weights written by training, so the two steps have a natural data dependency. Would save another bootstrap cycle. Added to ROADMAP.md for a later session.
🤖 Generated with Claude Code