Migrate DataPhase1+RAGIngestion (bundled) and DriftDetection to spot #44

Merged: cipher813 merged 2 commits into main from feat/spot-migration-data-phase1 on Apr 16, 2026

cipher813 commented Apr 16, 2026

Summary

Moves all three heavy Saturday-pipeline SSM steps off ae-dashboard (t3.micro, 1 GB RAM) onto self-terminating spot EC2 instances. Completes the producer/observability separation started in the DailyData move earlier today — zero heavy Python workloads run on the micro at any point in the Saturday pipeline.

Changes

Workload migrations (3 steps → 2 spots)

| Step | Before | After |
|---|---|---|
| DataPhase1 | micro, `weekly_collector --phase 1` | spot (bundled) |
| RAGIngestion | micro, `run_weekly_ingestion.sh` | bundled onto same spot as DataPhase1 |
| DriftDetection | micro, `monitoring.drift_detector` | separate spot |

Bundling rationale: DataPhase1 + RAGIngestion run back-to-back on the same repo (alpha-engine-data) with the same pip install. Sharing one spot saves ~7 min of bootstrap overhead vs two separate spots. Trade-off: any failure fails the bundle — acceptable since partial Saturday failures typically require a full-pipeline rerun anyway.

New launcher scripts (~280-300 lines each, mirror spot_backtest.sh)

  • `infrastructure/spot_data_weekly.sh` — clones alpha-engine-data, runs `weekly_collector.py --phase 1` then `run_weekly_ingestion.sh` sequentially, emits two heartbeats (data-phase1, rag-ingestion), terminates
  • `infrastructure/spot_drift_detection.sh` — clones alpha-engine-data + alpha-engine-predictor (drift_detector needs both), runs `python -m monitoring.drift_detector --alert`, emits heartbeat, terminates
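
The bundled flow described for `spot_data_weekly.sh` (two workloads in sequence, one heartbeat per workload, the whole bundle failing on the first error) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual launcher: `run_bundle`, `emit_heartbeat`, and the `HEARTBEAT_LOG` file are hypothetical names, the real heartbeat mechanism is not shown in this PR, and `true` stands in for the real workload commands.

```bash
#!/usr/bin/env bash
# Sketch only: the real spot_data_weekly.sh is not reproduced here.
# run_bundle, emit_heartbeat and HEARTBEAT_LOG are hypothetical.

HEARTBEAT_LOG="${HEARTBEAT_LOG:-/tmp/heartbeats.log}"

# Stand-in for the real heartbeat: append "<name> ok" to a local file.
emit_heartbeat() {
  printf '%s ok\n' "$1" >> "$HEARTBEAT_LOG"
}

# Run "name=command" steps in order. Stop at the first failing step so
# the bundle fails as a unit; emit a heartbeat after each success.
run_bundle() {
  local step name cmd
  for step in "$@"; do
    name="${step%%=*}"   # text before the first '='
    cmd="${step#*=}"     # text after the first '='
    if ! bash -c "$cmd"; then
      echo "step ${name} failed; aborting bundle" >&2
      return 1
    fi
    emit_heartbeat "$name"
  done
}

# Demo with placeholder commands in place of the phase 1 collector and
# the RAG ingestion script. Appends "data-phase1 ok" and
# "rag-ingestion ok" to $HEARTBEAT_LOG.
run_bundle "data-phase1=true" "rag-ingestion=true"
```

Because the loop returns on the first failure, a failed phase 1 never emits the `rag-ingestion` heartbeat, which matches the trade-off noted in the bundling rationale.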

Step Function edits

  • DataPhase1: command now invokes `spot_data_weekly.sh`. Timeout bumped 1800→3600s (phase1 ~20min + rag ~15min + spot bootstrap ~7min).
  • RAGIngestion chain removed: RAGIngestion + WaitForRAGIngestion + CheckRAGStatus + RAGWait + ExtractRAGError states all deleted. DataPhase1 success now wires directly to Research. State count: 34 → 30.
  • DriftDetection: command now invokes `spot_drift_detection.sh`. Timeout bumped 300→1200s.
  • SaturdayHealthCheck: path repointed from `/home/ec2-user/alpha-engine-data` to `/home/ec2-user/alpha-engine-dashboard` (health_checker lives there now, per #43, "Delete trading_calendar + health_checker (moved to alpha-engine-dashboard)").
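
As a quick check on the new timeouts, the estimates quoted in this PR (phase1 ~20 min, RAG ~15 min, spot bootstrap ~7 min, drift workload ~5 min) can be summed and compared against the configured limits. The helper names below are hypothetical and the figures are the rough estimates from this PR, not measurements.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; the minute figures are the
# approximate estimates quoted in this PR.

# budget_seconds MIN...: sum minute estimates and print the total in seconds.
budget_seconds() {
  local total=0 m
  for m in "$@"; do
    total=$(( total + m ))
  done
  echo $(( total * 60 ))
}

# headroom_seconds TIMEOUT MIN...: print timeout minus the estimated budget.
headroom_seconds() {
  local timeout="$1"
  shift
  echo $(( timeout - $(budget_seconds "$@") ))
}

budget_seconds 20 15 7          # bundled DataPhase1: prints 2520
headroom_seconds 3600 20 15 7   # prints 1080
headroom_seconds 1200 7 5       # DriftDetection: prints 480
```

On these estimates the bundled step has about 18 minutes of slack and DriftDetection about 8 minutes.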

Net effect on ae-dashboard

  • `alpha-engine-data` still cloned (needed for launcher scripts)
  • `alpha-engine-data/.venv` can be deleted permanently post-deploy (no runtime Python workload references it anymore)
  • Micro stays lean — serves as a bash+AWS-CLI dispatcher for three workloads

Pre-merge / deploy checklist

  • Merge cipher813/alpha-engine-dashboard#20 (restores alpha-engine-data in boot-pull REPOS)
  • On ae-dashboard: `git clone` alpha-engine-data (it is no longer present after the de-bloat `rm -rf` earlier)
  • Merge this PR
  • `bash infrastructure/deploy_step_function_saturday.sh` from your laptop (on updated main)
  • Strongly recommended: smoke-test both launchers from ae-dashboard before Saturday:
    ```
    ae-dashboard "cd /home/ec2-user/alpha-engine-data && export HOME=/home/ec2-user && bash infrastructure/spot_data_weekly.sh --smoke-only"
    ae-dashboard "cd /home/ec2-user/alpha-engine-data && export HOME=/home/ec2-user && bash infrastructure/spot_drift_detection.sh --smoke-only"
    ```
    Each launches a spot, validates imports + `--dry-run`, terminates. Catches IAM/AMI/subnet/PAT regressions before the real run.
  • After Saturday's run completes green: `ae-dashboard "rm -rf /home/ec2-user/alpha-engine-data/.venv"` to reclaim ~300 MB and validate the zero-heavy-venv end state.
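
For readers unfamiliar with the `--smoke-only` pattern used in the checklist above, flag handling of this kind could look something like the sketch below. This is an assumption about shape only: `parse_mode` is a hypothetical helper and the real launchers may parse arguments differently.

```bash
#!/usr/bin/env bash
# Sketch only: parse_mode is hypothetical; the real launchers' argument
# handling is not shown in this PR.

# parse_mode ARGS...: print "smoke" if --smoke-only is present, else "full".
# Unknown arguments are rejected with exit status 2.
parse_mode() {
  local mode="full" arg
  for arg in "$@"; do
    case "$arg" in
      --smoke-only) mode="smoke" ;;
      *)
        echo "unknown argument: $arg" >&2
        return 2
        ;;
    esac
  done
  echo "$mode"
}

parse_mode --smoke-only   # prints: smoke
parse_mode                # prints: full
```

Rejecting unknown flags keeps a typo such as `--smoke` from silently triggering a full run.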

Followup (roadmap P2)

Bundle DriftDetection onto PredictorTraining's spot: drift reads predictor weights written by training, so the two steps have a natural data dependency. This would save another bootstrap cycle. Added to ROADMAP.md for a later session.

🤖 Generated with Claude Code

DataPhase1 now runs on a self-terminating c5.large spot instance
(same pattern as Backtester + PredictorTraining) instead of
hammering the t3.micro. The micro becomes a dispatcher: pulls the
latest launcher script, sources .env, invokes bash
infrastructure/spot_data_phase1.sh. All heavy Python work
(yfinance, polygon, FRED, ArcticDB append) runs on the spot.

Rationale: the 2026-04-16 OOM incident showed that running
data-refresh workloads on a 1 GB RAM instance is fragile-by-design.
Even though Saturday DataPhase1 has historically fit in micro RAM
(it uses different code paths than the daily feature compute that
OOM'd today), consolidating all heavy weekly compute onto
self-terminating spots aligns DataPhase1 with the existing
Backtester/PredictorTraining pattern and removes the 1 GB ceiling
from future data-refresh growth.

Also: SaturdayHealthCheck SSM command repointed from
/home/ec2-user/alpha-engine-data (health_checker.py was deleted
from that repo in #43) to /home/ec2-user/alpha-engine-dashboard
where it now lives. Mirrors the same fix applied to the weekday
HealthCheck step in #42.

Files:
  - new  infrastructure/spot_data_phase1.sh      (spot launcher, mirrors spot_backtest.sh)
  - edit infrastructure/step_function.json        (DataPhase1 + SaturdayHealthCheck commands)

Timeout bumped 1800 → 2700s to accommodate spot bootstrap overhead
(~7 min for instance launch + pip install on top of ~20 min workload).

Deferred (separate PR): migrate RAGIngestion + DriftDetection to
spot as well. They still run on the micro and need alpha-engine-data
restored on ae-dashboard for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit to cipher813/alpha-engine-dashboard that referenced this pull request Apr 16, 2026
Reverts the removal from #19. The Saturday Step Function still runs
RAGIngestion + DriftDetection on the micro from alpha-engine-data,
and the new DataPhase1 spot launcher (cipher813/alpha-engine-data#44)
lives at alpha-engine-data/infrastructure/spot_data_phase1.sh — the
micro invokes it as a dispatcher and needs the repo checked out.

Context: #19 assumed ae-dashboard had no runtime need for
alpha-engine-data once health_checker + trading_calendar moved here.
That was wrong — the Saturday SF has 4 separate steps that target
the micro from alpha-engine-data (DataPhase1, RAGIngestion,
DriftDetection, SaturdayHealthCheck). I missed this when
recommending #19.

After cipher813/alpha-engine-data#44 merges and RAG/Drift also
migrate to spot (planned follow-up), alpha-engine-data may still
need to be cloned on the micro for the launcher scripts but the
heavy .venv becomes unnecessary. At that point this line can be
removed again along with a lean-clone pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extends the DataPhase1-to-spot migration to cover all three Saturday
SF steps that were running heavy alpha-engine-data workloads on the
t3.micro:

- DataPhase1 + RAGIngestion now share a single spot instance via
  spot_data_weekly.sh (renamed from spot_data_phase1.sh). Both
  workloads use the same alpha-engine-data clone + pip install —
  bundling saves ~7 min of bootstrap overhead and one spot request.
  RAGIngestion SF state chain (RAGIngestion + WaitForRAGIngestion +
  CheckRAGStatus + RAGWait + ExtractRAGError) removed; DataPhase1's
  success now wires directly to Research.

- DriftDetection moves to its own spot via spot_drift_detection.sh.
  Launcher clones BOTH alpha-engine-data and alpha-engine-predictor
  (drift_detector lives in data/monitoring/ but imports from
  predictor via PYTHONPATH). Overkill cost-wise for the ~5 min
  workload (~7 min bootstrap + ~5 min work vs ~5 min on micro), but
  completes the architectural goal: zero heavy venvs on the micro.

Net effect on ae-dashboard after next boot-pull:
  - alpha-engine-data: cloned (for launcher scripts only, ~300 lines bash)
  - alpha-engine-data/.venv: can be deleted permanently
  - 0 heavy Python workloads running on the t3.micro at any point in
    the Saturday pipeline

Timeout bumps:
  - DataPhase1 (bundled): 2700s → 3600s (phase1 ~20min + rag ~15min + bootstrap ~7min)
  - DriftDetection: 300s → 1200s (bootstrap ~7min + workload ~5min)

SF state count: 34 → 30 (-4 RAG chain states).

Followup roadmap P2: bundle DriftDetection onto PredictorTraining's
spot since drift reads predictor weights produced by that step —
would save another bootstrap cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 changed the title from "Migrate DataPhase1 to spot + fix SaturdayHealthCheck path" to "Migrate DataPhase1+RAGIngestion (bundled) and DriftDetection to spot" on Apr 16, 2026
cipher813 merged commit 2286773 into main on Apr 16, 2026
1 check passed
cipher813 deleted the feat/spot-migration-data-phase1 branch on April 16, 2026 17:34
cipher813 added a commit that referenced this pull request Apr 20, 2026
Pairs with alpha-engine-predictor PR #44 (preflight + check_deploy_drift
Lambda action). Together these close the deploy-drift visibility gap
that made the 2026-04-20 coverage-gap session unmanageable: two deploy
paths (auto CI vs manual deploy-infrastructure.sh) with no way to tell
which side had shipped without diffing SHAs by hand.

Stamp at deploy
---------------
1. `deploy-infrastructure.sh` now reads `$GITHUB_SHA` (CI) or
   `git rev-parse HEAD` (local), injects `[git:<sha>]` prefix into the
   top-level `Comment` field of both step_function.json +
   step_function_daily.json before upload / update-state-machine.
   Re-stamping strips any prior `[git:…]` so it's idempotent.
2. CloudFormation stack gets `--tags Key=git-sha,Value=<sha>` on both
   create-stack and update-stack paths.

SF gate
-------
- New `DeployDriftCheck` state as the first state (was StartExecutorEC2).
  Invokes predictor Lambda `action=check_deploy_drift` which returns
  `{has_drift, sf_drift, cf_drift, upstream_sha, sf_sha, stack_sha, ...}`.
- New `DeployDriftGate` Choice state: if `has_drift=true`, route to
  HandleFailure. Else fall through to StartExecutorEC2 (prior StartAt).
- Degraded modes (missing stamps on legacy artifacts, GitHub outage)
  set has_drift=false so the gate doesn't block recoverable scenarios.

State graph: 26 states total (was 24). All Next/Default/Catch refs
resolve; no unreachable states.

Bootstrap
---------
First deploy via `bash infrastructure/deploy-infrastructure.sh` is the
one time this check CAN'T catch its own absence — but that's inherent
to any self-detecting system. After that single bootstrap, every
subsequent drift surfaces at the next weekday SF run.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
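
The idempotent re-stamping described in the commit above (strip any prior `[git:…]` prefix, then prepend the current SHA) can be sketched as follows. This is a minimal illustration, not the code in `deploy-infrastructure.sh`: `resolve_sha` and `stamp_comment` are hypothetical names, and the single space after the prefix is an assumption.

```bash
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, and the space separator
# after the [git:<sha>] prefix is an assumption.

# resolve_sha: use $GITHUB_SHA when set (CI), otherwise the local HEAD.
resolve_sha() {
  echo "${GITHUB_SHA:-$(git rev-parse HEAD)}"
}

# stamp_comment SHA COMMENT: remove any leading [git:...] prefix, then
# prepend a fresh one, so stamping twice gives the same result as once.
stamp_comment() {
  local sha="$1" comment="$2" stripped
  stripped="$(printf '%s' "$comment" | sed -E 's/^\[git:[^]]*\] ?//')"
  printf '[git:%s] %s\n' "$sha" "$stripped"
}

first="$(stamp_comment abc123 "Saturday pipeline")"
stamp_comment abc123 "$first"   # prints: [git:abc123] Saturday pipeline
stamp_comment def456 "$first"   # prints: [git:def456] Saturday pipeline
```

Stripping before stamping is what keeps repeated deploys from accumulating prefixes in the `Comment` field.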