Point CheckTradingDay + HealthCheck SSM commands at alpha-engine-dashboard by cipher813 · Pull Request #42 · cipher813/alpha-engine-data

cipher813 · 2026-04-16T16:29:34Z

Summary

Part 2/3 of the ae-dashboard de-bloat split. Updates the Step Function's two SSM commands that currently invoke scripts from `/home/ec2-user/alpha-engine-data` to run from `/home/ec2-user/alpha-engine-dashboard` instead.

Both scripts (`trading_calendar.py`, `health_checker.py`) were copied verbatim into the dashboard repo in cipher813/alpha-engine-dashboard#18. CLI contract unchanged — `"TRADING DAY"` / `"MARKET_CLOSED"` stdout markers and `--alert` flag behavior are identical.
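
The CLI contract above comes down to a caller finding one of two markers in the script's stdout. A minimal sketch of that check, assuming the markers appear as plain substrings; the function name `classify_calendar_output` is hypothetical, and the way the Step Function actually inspects the output is not shown in this PR:

```python
from typing import Optional

# Markers quoted in the PR description.
TRADING_DAY_MARKER = "TRADING DAY"
MARKET_CLOSED_MARKER = "MARKET_CLOSED"


def classify_calendar_output(stdout: str) -> Optional[bool]:
    """Return True for a trading day, False for a closed market,
    and None when neither marker is present (contract violated)."""
    # Check the closed marker first so an ambiguous output fails safe.
    if MARKET_CLOSED_MARKER in stdout:
        return False
    if TRADING_DAY_MARKER in stdout:
        return True
    return None
```

Returning `None` rather than guessing lets the caller treat a missing marker as a failure instead of silently trading or skipping.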

Change

`infrastructure/step_function_daily.json`, two hunks:

```diff
- "cd /home/ec2-user/alpha-engine-data",
+ "cd /home/ec2-user/alpha-engine-dashboard",
  "source .venv/bin/activate",
  "python trading_calendar.py"
```

```diff
- "cd /home/ec2-user/alpha-engine-data",
+ "cd /home/ec2-user/alpha-engine-dashboard",
  "source .venv/bin/activate",
  "python health_checker.py --alert 2>&1 | tee /var/log/health-check.log"
```

No other states touched. No CloudFormation changes needed.

Pre-merge requirements

- cipher813/alpha-engine-dashboard#18 merged
- ae-dashboard has pulled the new dashboard files (either via the installed `boot-pull.timer` or a manual `git -C /home/ec2-user/alpha-engine-dashboard pull`)

Deploy

Operator runs the existing deploy script post-merge:

```bash
bash infrastructure/deploy_step_function_daily.sh
```

Test plan

- `python3 -c "import json; json.load(open('infrastructure/step_function_daily.json'))"`: JSON valid
- Post-deploy, on the next weekday run: `CheckTradingDay` logs under `/var/log/amazon/ssm/...` show `cd /home/ec2-user/alpha-engine-dashboard`, and `HealthCheck` writes `/var/log/health-check.log` with identical report shape
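
The JSON-validity check could be extended to catch a stale path before deploy. A minimal sketch, assuming each SSM command list is stored as a JSON array of strings somewhere in the definition; the helper names are hypothetical and the exact nesting of the state machine is not taken from the repo:

```python
import json
from typing import Any, Iterator, List

OLD_REPO = "/home/ec2-user/alpha-engine-data"
MOVED_SCRIPTS = ("trading_calendar.py", "health_checker.py")


def iter_command_lists(node: Any) -> Iterator[List[str]]:
    """Yield every non-empty list made up only of strings, at any depth."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_command_lists(value)
    elif isinstance(node, list):
        if node and all(isinstance(item, str) for item in node):
            yield node
        else:
            for item in node:
                yield from iter_command_lists(item)


def find_stale_command_lists(definition: Any) -> List[List[str]]:
    """Return command lists that run a moved script from the old repo path."""
    stale = []
    for commands in iter_command_lists(definition):
        runs_moved = any(s in c for c in commands for s in MOVED_SCRIPTS)
        uses_old = any(OLD_REPO in c for c in commands)
        if runs_moved and uses_old:
            stale.append(commands)
    return stale


def load_definition(path: str) -> Any:
    """Parse the state machine file; raises ValueError on invalid JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

With this PR applied, `find_stale_command_lists(load_definition("infrastructure/step_function_daily.json"))` would be expected to return an empty list. Other states that legitimately `cd` into alpha-engine-data are not flagged, because only lists that also mention one of the moved scripts count.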

Part 3/3

After a clean weekday run validates new paths: separate PR to delete `trading_calendar.py` + `health_checker.py` + their tests from alpha-engine-data, plus cross-repo update to remove `alpha-engine-data` from alpha-engine-dashboard's `boot-pull.sh` REPOS list and delete the clone on ae-dashboard.

🤖 Generated with Claude Code

…board

Part 2/3 of ae-dashboard de-bloat. Once this deploys, the Step Function invokes trading_calendar.py and health_checker.py from /home/ec2-user/alpha-engine-dashboard instead of /home/ec2-user/alpha-engine-data.

Both scripts were copied to the dashboard repo in cipher813/alpha-engine-dashboard#18: identical content, same CLI contract (TRADING DAY / MARKET_CLOSED stdout markers, --alert flag).

Pre-merge requirements:
1. cipher813/alpha-engine-dashboard#18 must be merged
2. ae-dashboard must have pulled the new dashboard repo files (daily boot-pull.timer or manual pull)

No other SF states touched. No CloudFormation or deploy-script changes needed: operator runs the existing infrastructure/deploy_step_function_daily.sh after merge to apply the updated definition.

Part 3/3 (file deletion + alpha-engine-data removal from ae-dashboard's boot-pull REPOS list + clone removal) follows after next weekday run verifies the new paths work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


…oard) (#43)

Part 3/3 of ae-dashboard de-bloat. Both scripts were copied verbatim to alpha-engine-dashboard in cipher813/alpha-engine-dashboard#18, and the Step Function SSM commands were repointed in #42 (merged 2026-04-16) to run from /home/ec2-user/alpha-engine-dashboard. This PR removes the original copies from alpha-engine-data now that nothing in the weekday Step Function or Saturday pipeline references them anymore.

The data repo is now scoped purely to data-production code (collectors, builders, features, weekly_collector), which matches the producer-vs-observability seam documented in the earlier commits.

Pre-merge requirements (MERGE ORDER IS IMPORTANT):
1. Friday 2026-04-17 weekday Step Function run must complete successfully using the new dashboard paths (verify CheckTradingDay logs `cd /home/ec2-user/alpha-engine-dashboard`, HealthCheck writes /var/log/health-check.log normally)
2. Update ae-dashboard crontab: there's a `0 */6 * * *` entry still running `cd /home/ec2-user/alpha-engine-data && .venv/bin/python health_checker.py --alert`. Operator must crontab -e on ae-dashboard and swap the path to /home/ec2-user/alpha-engine-dashboard before this PR merges, else the cron breaks until that edit happens

Pairs with cipher813/alpha-engine-dashboard#19 (removes alpha-engine-data from the dashboard's boot-pull.sh REPOS list).

Tests: full suite 49 passed (was 71 before moves; the 22 delta is 26 tests deleted with the moved files + 5 new test_module_health tests added earlier + others).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
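
The crontab edit in requirement 2 is a one-path substitution on the health-check entry. A minimal sketch of that substitution, assuming the crontab is available as text (for example from `crontab -l`); the function name `repoint_health_check` is hypothetical and nothing here is taken from either repo:

```python
OLD_REPO = "/home/ec2-user/alpha-engine-data"
NEW_REPO = "/home/ec2-user/alpha-engine-dashboard"


def repoint_health_check(crontab_text: str) -> str:
    """Swap the repo path on active crontab lines that run health_checker.py.

    Comment lines and entries for other scripts are left untouched.
    """
    rewritten = []
    for line in crontab_text.splitlines():
        is_comment = line.lstrip().startswith("#")
        if not is_comment and "health_checker.py" in line and OLD_REPO in line:
            line = line.replace(OLD_REPO, NEW_REPO)
        rewritten.append(line)
    # Preserve the trailing newline, which crontab files normally end with.
    trailing = "\n" if crontab_text.endswith("\n") else ""
    return "\n".join(rewritten) + trailing
```

The operator would still review the result before installing it; the sketch only shows which lines the swap should and should not touch.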


…44)

* Migrate DataPhase1 to spot + fix SaturdayHealthCheck path

DataPhase1 now runs on a self-terminating c5.large spot instance (same pattern as Backtester + PredictorTraining) instead of hammering the t3.micro. The micro becomes a dispatcher: pulls the latest launcher script, sources .env, invokes bash infrastructure/spot_data_phase1.sh. All heavy Python work (yfinance, polygon, FRED, ArcticDB append) runs on the spot.

Rationale: the 2026-04-16 OOM incident showed that running data-refresh workloads on a 1 GB RAM instance is fragile-by-design. Even though Saturday DataPhase1 has historically fit in micro RAM (it uses different code paths than the daily feature compute that OOM'd today), consolidating all heavy weekly compute onto self-terminating spots aligns DataPhase1 with the existing Backtester/PredictorTraining pattern and removes the 1 GB ceiling from future data-refresh growth.

Also: SaturdayHealthCheck SSM command repointed from /home/ec2-user/alpha-engine-data (health_checker.py was deleted from that repo in #43) to /home/ec2-user/alpha-engine-dashboard where it now lives. Mirrors the same fix applied to the weekday HealthCheck step in #42.

Files:
- new infrastructure/spot_data_phase1.sh (spot launcher, mirrors spot_backtest.sh)
- edit infrastructure/step_function.json (DataPhase1 + SaturdayHealthCheck commands)

Timeout bumped 1800 → 2700s to accommodate spot bootstrap overhead (~7 min for instance launch + pip install on top of ~20 min workload).

Deferred (separate PR): migrate RAGIngestion + DriftDetection to spot as well. They still run on the micro and need alpha-engine-data restored on ae-dashboard for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Bundle DataPhase1+RAGIngestion on one spot; migrate DriftDetection

Extends the DataPhase1-to-spot migration to cover all three Saturday SF steps that were running heavy alpha-engine-data workloads on the t3.micro:

- DataPhase1 + RAGIngestion now share a single spot instance via spot_data_weekly.sh (renamed from spot_data_phase1.sh). Both workloads use the same alpha-engine-data clone + pip install, so bundling saves ~7 min of bootstrap overhead and one spot request. RAGIngestion SF state chain (RAGIngestion + WaitForRAGIngestion + CheckRAGStatus + RAGWait + ExtractRAGError) removed; DataPhase1's success now wires directly to Research.
- DriftDetection moves to its own spot via spot_drift_detection.sh. Launcher clones BOTH alpha-engine-data and alpha-engine-predictor (drift_detector lives in data/monitoring/ but imports from predictor via PYTHONPATH). Overkill cost-wise for the ~5 min workload (~7 min bootstrap + ~5 min work vs ~5 min on micro), but completes the architectural goal: zero heavy venvs on the micro.

Net effect on ae-dashboard after next boot-pull:
- alpha-engine-data: cloned (for launcher scripts only, ~300 lines bash)
- alpha-engine-data/.venv: can be deleted permanently
- 0 heavy Python workloads running on the t3.micro at any point in the Saturday pipeline

Timeout bumps:
- DataPhase1 (bundled): 2700s → 3600s (phase1 ~20min + rag ~15min + bootstrap ~7min)
- DriftDetection: 300s → 1200s (bootstrap ~7min + workload ~5min)

SF state count: 34 → 30 (-4 RAG chain states).

Followup roadmap P2: bundle DriftDetection onto PredictorTraining's spot since drift reads predictor weights produced by that step, which would save another bootstrap cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
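
The timeout figures in these commit messages can be checked against the quoted duration estimates. A small worked calculation, assuming the approximate minute figures above are the only contributors to runtime; the helper name `headroom_seconds` is hypothetical:

```python
def headroom_seconds(timeout_s: int, *phase_minutes: int) -> int:
    """Seconds left in the timeout after the estimated phases finish."""
    return timeout_s - 60 * sum(phase_minutes)


# Estimates in minutes, as quoted in the commit messages (all approximate).
BOOTSTRAP, PHASE1, RAG, DRIFT = 7, 20, 15, 5

budgets = {
    "DataPhase1 (spot, unbundled)": headroom_seconds(2700, BOOTSTRAP, PHASE1),
    "DataPhase1 + RAGIngestion (bundled)": headroom_seconds(3600, BOOTSTRAP, PHASE1, RAG),
    "DriftDetection": headroom_seconds(1200, BOOTSTRAP, DRIFT),
}

for step, spare in budgets.items():
    print(f"{step}: {spare} s spare")
```

Under these estimates the three states keep 1080 s, 1080 s, and 480 s of slack respectively, so each bump leaves several minutes for slow spot launches.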

* feat(sf): coverage-gap self-heal between Predictor and executor

Closes the Research↔Predictor coverage gap at the orchestration layer (Phase 2). Pairs with alpha-engine-predictor PR #42 (--tickers flag + check_coverage action) and alpha-engine PR #72 (executor read-time guard).

Problem
-------
2026-04-20: executor daemon bought SNDK/WDC/BIIB/XEL at market open despite 7 buy_candidates having no prediction row. GBM veto gate was structurally unreachable for those tickers (no prediction → no veto). 4 of 5 live entries (~80% of capital) routed around a risk control.

Architecture
------------
The invariant is "every buy_candidate must have a prediction before the executor sees signals.json". Previously enforced nowhere. Now enforced in two layers:

- **Self-heal (this PR):** PredictorInference → CheckPredictorCoverage → (if gap) ReinvokePredictor with tickers=missing → RecheckCoverage → (if STILL gap) HandleFailure. Single retry, no infinite loop.
- **Defense-in-depth (predictor #42 + executor #72):** both predictor write-time and executor read-time refuse to proceed on a coverage gap. These fire if the self-heal mechanism above ever regresses.

State graph added
-----------------
PredictorInference
  └→ CheckPredictorCoverage (new, Lambda action=check_coverage)
       └→ CoverageGapChoice (new)
            ├─ has_gap=true → ReinvokePredictor (new, Lambda action=predict
            │                   + tickers=$.coverage_result.Payload.missing_tickers)
            │                   └→ RecheckCoverage (new)
            │                        └→ FinalCoverageGate (new)
            │                             ├─ still has_gap → HandleFailure
            │                             └─ default → PredictorHealthCheck
            └─ default → PredictorHealthCheck

All state references validated: 24 states total, no missing Next targets, no unreachable states.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cf): CloudWatch alarm on unscored_buy_candidates_count

Phase 4 (CW alarm) of the Research↔Predictor coverage-gap closure. Bundled into the same PR as Phase 2 (SF self-heal) since both are infra living in this repo.

New alarm
---------
- Namespace: AlphaEngine/Predictor
- MetricName: unscored_buy_candidates_count
- Emitted by executor's signal_reader on every run (0 on success, >0 on gap)
- Threshold: Maximum ≥ 1 over any 1-hour window
- Action: existing alpha-engine-alerts SNS topic
- TreatMissingData: notBreaching (executor is off-hours M-F only)

Semantics
---------
A positive value means the SF self-heal (CheckPredictorCoverage → ReinvokePredictor) failed to close the gap before the executor read predictions.json: either orchestration regressed or a ticker is genuinely un-scorable. Long-term regression guard for the coverage invariant.

cfn-lint clean (only pre-existing W2001 warnings on unused parameters).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
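
The single-retry self-heal described above can be read as ordinary control flow. A minimal sketch, assuming two hypothetical callables stand in for the Lambda invocations (`predict` for the predictor, `check_coverage` for the `check_coverage` action); the real states live in the Step Function JSON, which is not reproduced here:

```python
from typing import Callable, List, Optional


def run_self_heal(
    predict: Callable[[Optional[List[str]]], None],
    check_coverage: Callable[[], List[str]],
) -> str:
    """Return the name of the state the flow would transition to next."""
    predict(None)                   # PredictorInference: full run
    missing = check_coverage()      # CheckPredictorCoverage
    if not missing:                 # CoverageGapChoice: default branch
        return "PredictorHealthCheck"

    predict(missing)                # ReinvokePredictor: missing tickers only
    still_missing = check_coverage()  # RecheckCoverage
    if still_missing:               # FinalCoverageGate: gap persists
        return "HandleFailure"
    return "PredictorHealthCheck"
```

There is no loop: the predictor is invoked at most twice, which matches the "single retry" wording in the commit message.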

cipher813 merged commit c76aa58 into main Apr 16, 2026
1 check passed

cipher813 deleted the chore/step-function-point-to-dashboard-paths branch April 16, 2026 16:41

cipher813 mentioned this pull request Apr 16, 2026

Delete trading_calendar + health_checker (moved to alpha-engine-dashboard) #43

Merged

5 tasks

cipher813 mentioned this pull request Apr 16, 2026

Migrate DataPhase1+RAGIngestion (bundled) and DriftDetection to spot #44

Merged

6 tasks

cipher813 mentioned this pull request Apr 20, 2026

feat(sf): coverage-gap self-heal between Predictor and executor #72

Merged

4 tasks

cipher813 merged 1 commit into main from chore/step-function-point-to-dashboard-paths
