Migrate DataPhase1+RAGIngestion (bundled) and DriftDetection to spot by cipher813 · Pull Request #44 · cipher813/alpha-engine-data

cipher813 · 2026-04-16T17:12:20Z

Summary

Moves all three heavy Saturday-pipeline SSM steps off ae-dashboard (t3.micro, 1 GB RAM) onto self-terminating spot EC2 instances. Completes the producer/observability separation started in the DailyData move earlier today — zero heavy Python workloads run on the micro at any point in the Saturday pipeline.

Changes

Workload migrations (3 steps → 2 spots)

| Step | Before | After |
| --- | --- | --- |
| DataPhase1 | micro, weekly_collector --phase 1 | spot (bundled) |
| RAGIngestion | micro, run_weekly_ingestion.sh | bundled onto same spot as DataPhase1 |
| DriftDetection | micro, monitoring.drift_detector | separate spot |

Bundling rationale: DataPhase1 + RAGIngestion run back-to-back on the same repo (alpha-engine-data) with the same pip install. Sharing one spot saves ~7 min of bootstrap overhead vs two separate spots. Trade-off: any failure fails the bundle — acceptable since partial Saturday failures typically require a full-pipeline rerun anyway.

New launcher scripts (~280-300 lines each, mirror spot_backtest.sh)

`infrastructure/spot_data_weekly.sh` — clones alpha-engine-data, runs `weekly_collector.py --phase 1` then `run_weekly_ingestion.sh` sequentially, emits two heartbeats (data-phase1, rag-ingestion), terminates
`infrastructure/spot_drift_detection.sh` — clones alpha-engine-data + alpha-engine-predictor (drift_detector needs both), runs `python -m monitoring.drift_detector --alert`, emits heartbeat, terminates
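For reviewers who want a picture of the bundled control flow, here is a minimal sketch of the on-instance sequence: run phase 1, emit its heartbeat, run RAG ingestion, emit its heartbeat, and fail the whole bundle on the first error. The function names (`emit_heartbeat`, `run_step`, `run_bundle`) are hypothetical and are not taken from `spot_data_weekly.sh`; the heartbeat here only prints, whereas the real launcher presumably writes somewhere the pipeline monitors.

```shell
#!/usr/bin/env bash
# Illustrative only: helper names are assumptions, not the launcher's real ones.

# Record that a step finished. Stubbed as a print for this sketch.
emit_heartbeat() {
  echo "heartbeat:$1"
}

# Run one named step; emit its heartbeat only if the command succeeds.
run_step() {
  local name="$1"; shift
  "$@" || { echo "failed:$name" >&2; return 1; }
  emit_heartbeat "$name"
}

# Run both workloads back-to-back. The first failure fails the bundle,
# so rag-ingestion never starts if data-phase1 breaks.
run_bundle() {
  local phase1_cmd="$1" rag_cmd="$2"
  run_step data-phase1 bash -c "$phase1_cmd" || return 1
  run_step rag-ingestion bash -c "$rag_cmd" || return 1
}
```

A call such as `run_bundle "python weekly_collector.py --phase 1" "bash run_weekly_ingestion.sh"` would mirror the order described above. Instance termination is left out of the sketch.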

Step Function edits

DataPhase1: command now invokes `spot_data_weekly.sh`. Timeout bumped 1800→3600s (phase1 ~20min + rag ~15min + spot bootstrap ~7min).
RAGIngestion chain removed: RAGIngestion + WaitForRAGIngestion + CheckRAGStatus + RAGWait + ExtractRAGError states all deleted. DataPhase1 success now wires directly to Research. State count: 34 → 30.
DriftDetection: command now invokes `spot_drift_detection.sh`. Timeout bumped 300→1200s.
SaturdayHealthCheck: path repointed from `/home/ec2-user/alpha-engine-data` to `/home/ec2-user/alpha-engine-dashboard` (health_checker lives there now, per #43, "Delete trading_calendar + health_checker (moved to alpha-engine-dashboard)").
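The new timeouts can be sanity-checked against the approximate durations quoted in this PR. The sketch below only does the arithmetic; the variable and function names are illustrative, and the ~5 min DriftDetection workload figure comes from the commit message rather than from a measurement here.

```shell
# Convert whole minutes to seconds.
to_seconds() { echo $(( $1 * 60 )); }

# Bundle: phase1 ~20 min + rag ~15 min + spot bootstrap ~7 min.
bundle_estimate=$(( $(to_seconds 20) + $(to_seconds 15) + $(to_seconds 7) ))
# Drift: spot bootstrap ~7 min + workload ~5 min.
drift_estimate=$(( $(to_seconds 7) + $(to_seconds 5) ))

echo "bundle: ${bundle_estimate}s of 3600s, headroom $(( 3600 - bundle_estimate ))s"
echo "drift:  ${drift_estimate}s of 1200s, headroom $(( 1200 - drift_estimate ))s"
```

On these estimates the bundle needs about 2520s and DriftDetection about 720s, so both timeouts leave several minutes of slack for a slow bootstrap.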

Net effect on ae-dashboard

`alpha-engine-data` still cloned (needed for launcher scripts)
`alpha-engine-data/.venv` can be deleted permanently post-deploy (no runtime Python workload references it anymore)
Micro stays lean — serves as a bash+AWS-CLI dispatcher for three workloads
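To make the dispatcher role concrete, the following sketch builds the kind of command string the micro would run for a given launcher: change into the repo, fast-forward to the latest launcher script, then invoke it. `dispatch_command` is a hypothetical helper, and the `git pull --ff-only` step is an assumption about how the latest script gets pulled; the actual SSM command in `step_function.json` may be shaped differently.

```shell
# Hypothetical helper: print the dispatcher command for one launcher.
# Arguments: repo directory, launcher path, then any launcher flags.
dispatch_command() {
  local repo_dir="$1" launcher="$2"; shift 2
  printf 'cd %s && git pull --ff-only && bash %s' "$repo_dir" "$launcher"
  if [ "$#" -gt 0 ]; then
    printf ' %s' "$*"
  fi
  printf '\n'
}

dispatch_command /home/ec2-user/alpha-engine-data infrastructure/spot_data_weekly.sh
dispatch_command /home/ec2-user/alpha-engine-data infrastructure/spot_drift_detection.sh
```

Nothing in this path needs a Python interpreter on the micro, which is why the `.venv` can go.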

Pre-merge / deploy checklist

Merge cipher813/alpha-engine-dashboard#20 (restores alpha-engine-data in boot-pull REPOS)
On ae-dashboard: `git clone` alpha-engine-data (it is no longer present after the earlier de-bloat `rm -rf`)
Merge this PR
`bash infrastructure/deploy_step_function_saturday.sh` from your laptop (on updated main)
Strongly recommended: smoke-test both launchers from ae-dashboard before Saturday:
```
ae-dashboard "cd /home/ec2-user/alpha-engine-data && export HOME=/home/ec2-user && bash infrastructure/spot_data_weekly.sh --smoke-only"
ae-dashboard "cd /home/ec2-user/alpha-engine-data && export HOME=/home/ec2-user && bash infrastructure/spot_drift_detection.sh --smoke-only"
```
Each launches a spot, validates imports + `--dry-run`, terminates. Catches IAM/AMI/subnet/PAT regressions before the real run.
After Saturday's run completes green: `ae-dashboard "rm -rf /home/ec2-user/alpha-engine-data/.venv"` to reclaim ~300 MB and validate the zero-heavy-venv end state.
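For context on the smoke test in the checklist above, a launcher could branch on `--smoke-only` roughly as sketched here. Both helpers (`parse_mode`, `workload_for_mode`) are hypothetical, and appending `--dry-run` to the workload is an assumption drawn from the checklist wording, not a confirmed flag of any particular script.

```shell
# Hypothetical helper: decide the run mode from the launcher's arguments.
parse_mode() {
  local mode="full" arg
  for arg in "$@"; do
    case "$arg" in
      --smoke-only) mode="smoke" ;;
      *) echo "unknown argument: $arg" >&2; return 2 ;;
    esac
  done
  echo "$mode"
}

# Hypothetical helper: in smoke mode, append --dry-run to the workload
# command (assumes the workload accepts that flag).
workload_for_mode() {
  local mode="$1"; shift
  if [ "$mode" = "smoke" ]; then
    printf '%s --dry-run\n' "$*"
  else
    printf '%s\n' "$*"
  fi
}
```

With these, `workload_for_mode "$(parse_mode "$@")" python weekly_collector.py --phase 1` yields the command to run on the spot in either mode.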

Followup (roadmap P2)

Bundle DriftDetection onto PredictorTraining's spot: drift reads the predictor weights written by training, so the two steps have a natural data dependency. This would save another bootstrap cycle. Added to ROADMAP.md for a later session.

🤖 Generated with Claude Code

DataPhase1 now runs on a self-terminating c5.large spot instance (same pattern as Backtester + PredictorTraining) instead of hammering the t3.micro. The micro becomes a dispatcher: pulls the latest launcher script, sources .env, invokes bash infrastructure/spot_data_phase1.sh. All heavy Python work (yfinance, polygon, FRED, ArcticDB append) runs on the spot.

Rationale: the 2026-04-16 OOM incident showed that running data-refresh workloads on a 1 GB RAM instance is fragile-by-design. Even though Saturday DataPhase1 has historically fit in micro RAM (it uses different code paths than the daily feature compute that OOM'd today), consolidating all heavy weekly compute onto self-terminating spots aligns DataPhase1 with the existing Backtester/PredictorTraining pattern and removes the 1 GB ceiling from future data-refresh growth.

Also: SaturdayHealthCheck SSM command repointed from /home/ec2-user/alpha-engine-data (health_checker.py was deleted from that repo in #43) to /home/ec2-user/alpha-engine-dashboard where it now lives. Mirrors the same fix applied to the weekday HealthCheck step in #42.

Files:
- new infrastructure/spot_data_phase1.sh (spot launcher, mirrors spot_backtest.sh)
- edit infrastructure/step_function.json (DataPhase1 + SaturdayHealthCheck commands)

Timeout bumped 1800 → 2700s to accommodate spot bootstrap overhead (~7 min for instance launch + pip install on top of ~20 min workload).

Deferred (separate PR): migrate RAGIngestion + DriftDetection to spot as well. They still run on the micro and need alpha-engine-data restored on ae-dashboard for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reverts the removal from #19. The Saturday Step Function still runs RAGIngestion + DriftDetection on the micro from alpha-engine-data, and the new DataPhase1 spot launcher (cipher813/alpha-engine-data#44) lives at alpha-engine-data/infrastructure/spot_data_phase1.sh; the micro invokes it as a dispatcher and needs the repo checked out.

Context: #19 assumed ae-dashboard had no runtime need for alpha-engine-data once health_checker + trading_calendar moved here. That was wrong: the Saturday SF has 4 separate steps that target the micro from alpha-engine-data (DataPhase1, RAGIngestion, DriftDetection, SaturdayHealthCheck). I missed this when recommending #19.

After cipher813/alpha-engine-data#44 merges and RAG/Drift also migrate to spot (planned follow-up), alpha-engine-data may still need to be cloned on the micro for the launcher scripts but the heavy .venv becomes unnecessary. At that point this line can be removed again along with a lean-clone pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Extends the DataPhase1-to-spot migration to cover all three Saturday SF steps that were running heavy alpha-engine-data workloads on the t3.micro:

- DataPhase1 + RAGIngestion now share a single spot instance via spot_data_weekly.sh (renamed from spot_data_phase1.sh). Both workloads use the same alpha-engine-data clone + pip install; bundling saves ~7 min of bootstrap overhead and one spot request. RAGIngestion SF state chain (RAGIngestion + WaitForRAGIngestion + CheckRAGStatus + RAGWait + ExtractRAGError) removed; DataPhase1's success now wires directly to Research.
- DriftDetection moves to its own spot via spot_drift_detection.sh. Launcher clones BOTH alpha-engine-data and alpha-engine-predictor (drift_detector lives in data/monitoring/ but imports from predictor via PYTHONPATH). Overkill cost-wise for the ~5 min workload (~7 min bootstrap + ~5 min work vs ~5 min on micro), but completes the architectural goal: zero heavy venvs on the micro.

Net effect on ae-dashboard after next boot-pull:

- alpha-engine-data: cloned (for launcher scripts only, ~300 lines bash)
- alpha-engine-data/.venv: can be deleted permanently
- 0 heavy Python workloads running on the t3.micro at any point in the Saturday pipeline

Timeout bumps:

- DataPhase1 (bundled): 2700s → 3600s (phase1 ~20min + rag ~15min + bootstrap ~7min)
- DriftDetection: 300s → 1200s (bootstrap ~7min + workload ~5min)

SF state count: 34 → 30 (-4 RAG chain states).

Followup roadmap P2: bundle DriftDetection onto PredictorTraining's spot since drift reads predictor weights produced by that step; this would save another bootstrap cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pairs with alpha-engine-predictor PR #44 (preflight + check_deploy_drift Lambda action). Together these close the deploy-drift visibility gap that made the 2026-04-20 coverage-gap session unmanageable: two deploy paths (auto CI vs manual deploy-infrastructure.sh) with no way to tell which side had shipped without diffing SHAs by hand.

Stamp at deploy
---------------

1. `deploy-infrastructure.sh` now reads `$GITHUB_SHA` (CI) or `git rev-parse HEAD` (local), injects `[git:<sha>]` prefix into the top-level `Comment` field of both step_function.json + step_function_daily.json before upload / update-state-machine. Re-stamping strips any prior `[git:…]` so it's idempotent.
2. CloudFormation stack gets `--tags Key=git-sha,Value=<sha>` on both create-stack and update-stack paths.

SF gate
-------

- New `DeployDriftCheck` state as the first state (was StartExecutorEC2). Invokes predictor Lambda `action=check_deploy_drift` which returns `{has_drift, sf_drift, cf_drift, upstream_sha, sf_sha, stack_sha, ...}`.
- New `DeployDriftGate` Choice state: if `has_drift=true`, route to HandleFailure. Else fall through to StartExecutorEC2 (prior StartAt).
- Degraded modes (missing stamps on legacy artifacts, GitHub outage) set has_drift=false so the gate doesn't block recoverable scenarios.

State graph: 26 states total (was 24). All Next/Default/Catch refs resolve; no unreachable states.

Bootstrap
---------

First deploy via `bash infrastructure/deploy-infrastructure.sh` is the one time this check CAN'T catch its own absence, but that's inherent to any self-detecting system. After that single bootstrap, every subsequent drift surfaces at the next weekday SF run.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
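The idempotent stamping described in this commit message can be illustrated on a plain comment string. This is a sketch under assumptions: `resolve_sha` and `stamp_comment` are hypothetical names, and the real `deploy-infrastructure.sh` edits the `Comment` field inside the JSON files rather than a bare string.

```shell
# Pick the commit to stamp: CI's $GITHUB_SHA if set, else the local HEAD.
resolve_sha() {
  if [ -n "${GITHUB_SHA:-}" ]; then
    echo "$GITHUB_SHA"
  else
    git rev-parse HEAD
  fi
}

# Prefix a comment with [git:<sha>], first stripping any earlier stamp
# so that re-stamping never stacks prefixes.
stamp_comment() {
  local sha="$1" comment="$2"
  comment="$(printf '%s' "$comment" | sed -E 's/^\[git:[^]]*\] ?//')"
  printf '[git:%s] %s\n' "$sha" "$comment"
}
```

Calling `stamp_comment` twice with different SHAs leaves exactly one prefix, which is the property the drift check relies on when it compares the stamped SHA against upstream.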

cipher813 mentioned this pull request Apr 16, 2026

Restore alpha-engine-data in boot-pull REPOS (pair with spot migration) cipher813/alpha-engine-dashboard#20

Merged

cipher813 changed the title ~~Migrate DataPhase1 to spot + fix SaturdayHealthCheck path~~ Migrate DataPhase1+RAGIngestion (bundled) and DriftDetection to spot Apr 16, 2026

cipher813 merged commit 2286773 into main Apr 16, 2026
1 check passed

cipher813 deleted the feat/spot-migration-data-phase1 branch April 16, 2026 17:34

This was referenced Apr 16, 2026

Fetch alpha-engine-lib PAT from SSM on the spot (not via .env on dispatcher) #45

Merged

Upload alpha-engine-config/data/config.yaml to spot #46

Merged

feat(drift): stamp git-sha at deploy + DeployDriftCheck SF gate (Phase 2+3) #74

Merged

cipher813 merged 2 commits into main from feat/spot-migration-data-phase1
