Point CheckTradingDay + HealthCheck SSM commands at alpha-engine-dashboard #42
Merged
…board

Part 2/3 of ae-dashboard de-bloat. Once this deploys, the Step Function invokes trading_calendar.py and health_checker.py from /home/ec2-user/alpha-engine-dashboard instead of /home/ec2-user/alpha-engine-data. Both scripts were copied to the dashboard repo in cipher813/alpha-engine-dashboard#18: identical content, same CLI contract (TRADING DAY / MARKET_CLOSED stdout markers, --alert flag).

Pre-merge requirements:
1. cipher813/alpha-engine-dashboard#18 must be merged
2. ae-dashboard must have pulled the new dashboard repo files (daily boot-pull.timer or manual pull)

No other SF states touched. No CloudFormation or deploy-script changes needed; the operator runs the existing infrastructure/deploy_step_function_daily.sh after merge to apply the updated definition.

Part 3/3 (file deletion + alpha-engine-data removal from ae-dashboard's boot-pull REPOS list + clone removal) follows after the next weekday run verifies the new paths work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
Apr 16, 2026
…oard) (#43)

Part 3/3 of ae-dashboard de-bloat. Both scripts were copied verbatim to alpha-engine-dashboard in cipher813/alpha-engine-dashboard#18, and the Step Function SSM commands were repointed in #42 (merged 2026-04-16) to run from /home/ec2-user/alpha-engine-dashboard. This PR removes the original copies from alpha-engine-data now that nothing in the weekday Step Function or Saturday pipeline references them anymore.

The data repo is now scoped purely to data-production code (collectors, builders, features, weekly_collector), which matches the producer-vs-observability seam documented in the earlier commits.

Pre-merge requirements (MERGE ORDER IS IMPORTANT):
1. Friday 2026-04-17 weekday Step Function run must complete successfully using the new dashboard paths (verify CheckTradingDay logs `cd /home/ec2-user/alpha-engine-dashboard`, HealthCheck writes /var/log/health-check.log normally)
2. Update ae-dashboard crontab: there's a `0 */6 * * *` entry still running `cd /home/ec2-user/alpha-engine-data && .venv/bin/python health_checker.py --alert`. Operator must crontab -e on ae-dashboard and swap the path to /home/ec2-user/alpha-engine-dashboard before this PR merges, else the cron breaks until that edit happens

Pairs with cipher813/alpha-engine-dashboard#19 (removes alpha-engine-data from the dashboard's boot-pull.sh REPOS list).

Tests: full suite 49 passed (was 71 before moves; the 22 delta is 26 tests deleted with the moved files + 5 new test_module_health tests added earlier + others).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
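The crontab path swap in requirement 2 is a manual `crontab -e` step. As a rough illustration of the intended edit only, the sketch below rewrites a crontab's text in Python; the function name `repoint_health_check` is hypothetical and nothing like it is known to exist in either repo.

```python
OLD_DIR = "/home/ec2-user/alpha-engine-data"
NEW_DIR = "/home/ec2-user/alpha-engine-dashboard"


def repoint_health_check(crontab_text: str) -> str:
    """Swap the repo path on lines that run health_checker.py.

    Hypothetical helper: lines that do not mention health_checker.py are
    returned unchanged, and line endings are preserved.
    """
    lines = crontab_text.splitlines(keepends=True)
    return "".join(
        line.replace(OLD_DIR, NEW_DIR) if "health_checker.py" in line else line
        for line in lines
    )
```

Because the new directory name does not contain the old one, running the rewrite a second time leaves the text unchanged.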
cipher813
added a commit
that referenced
this pull request
Apr 16, 2026
…44)

* Migrate DataPhase1 to spot + fix SaturdayHealthCheck path

DataPhase1 now runs on a self-terminating c5.large spot instance (same pattern as Backtester + PredictorTraining) instead of hammering the t3.micro. The micro becomes a dispatcher: pulls the latest launcher script, sources .env, invokes bash infrastructure/spot_data_phase1.sh. All heavy Python work (yfinance, polygon, FRED, ArcticDB append) runs on the spot.

Rationale: the 2026-04-16 OOM incident showed that running data-refresh workloads on a 1 GB RAM instance is fragile-by-design. Even though Saturday DataPhase1 has historically fit in micro RAM (it uses different code paths than the daily feature compute that OOM'd today), consolidating all heavy weekly compute onto self-terminating spots aligns DataPhase1 with the existing Backtester/PredictorTraining pattern and removes the 1 GB ceiling from future data-refresh growth.

Also: SaturdayHealthCheck SSM command repointed from /home/ec2-user/alpha-engine-data (health_checker.py was deleted from that repo in #43) to /home/ec2-user/alpha-engine-dashboard where it now lives. Mirrors the same fix applied to the weekday HealthCheck step in #42.

Files:
- new infrastructure/spot_data_phase1.sh (spot launcher, mirrors spot_backtest.sh)
- edit infrastructure/step_function.json (DataPhase1 + SaturdayHealthCheck commands)

Timeout bumped 1800 → 2700s to accommodate spot bootstrap overhead (~7 min for instance launch + pip install on top of ~20 min workload).

Deferred (separate PR): migrate RAGIngestion + DriftDetection to spot as well. They still run on the micro and need alpha-engine-data restored on ae-dashboard for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Bundle DataPhase1+RAGIngestion on one spot; migrate DriftDetection

Extends the DataPhase1-to-spot migration to cover all three Saturday SF steps that were running heavy alpha-engine-data workloads on the t3.micro:
- DataPhase1 + RAGIngestion now share a single spot instance via spot_data_weekly.sh (renamed from spot_data_phase1.sh). Both workloads use the same alpha-engine-data clone + pip install; bundling saves ~7 min of bootstrap overhead and one spot request. RAGIngestion SF state chain (RAGIngestion + WaitForRAGIngestion + CheckRAGStatus + RAGWait + ExtractRAGError) removed; DataPhase1's success now wires directly to Research.
- DriftDetection moves to its own spot via spot_drift_detection.sh. Launcher clones BOTH alpha-engine-data and alpha-engine-predictor (drift_detector lives in data/monitoring/ but imports from predictor via PYTHONPATH). Overkill cost-wise for the ~5 min workload (~7 min bootstrap + ~5 min work vs ~5 min on micro), but completes the architectural goal: zero heavy venvs on the micro.

Net effect on ae-dashboard after next boot-pull:
- alpha-engine-data: cloned (for launcher scripts only, ~300 lines bash)
- alpha-engine-data/.venv: can be deleted permanently
- 0 heavy Python workloads running on the t3.micro at any point in the Saturday pipeline

Timeout bumps:
- DataPhase1 (bundled): 2700s → 3600s (phase1 ~20min + rag ~15min + bootstrap ~7min)
- DriftDetection: 300s → 1200s (bootstrap ~7min + workload ~5min)

SF state count: 34 → 30 (-4 RAG chain states).

Followup roadmap P2: bundle DriftDetection onto PredictorTraining's spot since drift reads predictor weights produced by that step, which would save another bootstrap cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
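The timeout bumps can be sanity-checked with simple arithmetic on the approximate durations quoted in the commit message. The sketch below is illustrative only; `headroom_seconds` is a hypothetical helper, and the inputs are the rough estimates from the message rather than measured values.

```python
def headroom_seconds(timeout_s: int, *phase_minutes: float) -> float:
    """Seconds left in a timeout after the estimated phases run back to back."""
    return timeout_s - 60 * sum(phase_minutes)


# Estimates quoted in the commit message (all approximate).
bundled = headroom_seconds(3600, 20, 15, 7)  # phase1 + rag + bootstrap
drift = headroom_seconds(1200, 7, 5)         # bootstrap + workload

print(bundled, drift)
```

Under those estimates the bundled DataPhase1 step keeps 1080 s (18 min) of slack and DriftDetection keeps 480 s (8 min).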
cipher813
added a commit
that referenced
this pull request
Apr 20, 2026
* feat(sf): coverage-gap self-heal between Predictor and executor

Closes the Research↔Predictor coverage gap at the orchestration layer (Phase 2). Pairs with alpha-engine-predictor PR #42 (--tickers flag + check_coverage action) and alpha-engine PR #72 (executor read-time guard).

Problem
-------
2026-04-20: executor daemon bought SNDK/WDC/BIIB/XEL at market open despite 7 buy_candidates having no prediction row. GBM veto gate was structurally unreachable for those tickers (no prediction → no veto). 4 of 5 live entries (~80% of capital) routed around a risk control.

Architecture
------------
The invariant is "every buy_candidate must have a prediction before the executor sees signals.json". Previously enforced nowhere. Now enforced in two layers:

- **Self-heal (this PR):** PredictorInference → CheckPredictorCoverage → (if gap) ReinvokePredictor with tickers=missing → RecheckCoverage → (if STILL gap) HandleFailure. Single retry, no infinite loop.
- **Defense-in-depth (predictor #42 + executor #72):** both predictor write-time and executor read-time refuse to proceed on a coverage gap. These fire if the self-heal mechanism above ever regresses.

State graph added
-----------------
    PredictorInference
      └→ CheckPredictorCoverage (new, Lambda action=check_coverage)
           └→ CoverageGapChoice (new)
                ├─ has_gap=true → ReinvokePredictor (new, Lambda action=predict
                │                  + tickers=$.coverage_result.Payload.missing_tickers)
                │     └→ RecheckCoverage (new)
                │          └→ FinalCoverageGate (new)
                │               ├─ still has_gap → HandleFailure
                │               └─ default → PredictorHealthCheck
                └─ default → PredictorHealthCheck

All state references validated: 24 states total, no missing Next targets, no unreachable states.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cf): CloudWatch alarm on unscored_buy_candidates_count

Phase 4 (CW alarm) of the Research↔Predictor coverage-gap closure. Bundled into the same PR as Phase 2 (SF self-heal) since both are infra living in this repo.

New alarm
---------
- Namespace: AlphaEngine/Predictor
- MetricName: unscored_buy_candidates_count
- Emitted by executor's signal_reader on every run (0 on success, >0 on gap)
- Threshold: Maximum ≥ 1 over any 1-hour window
- Action: existing alpha-engine-alerts SNS topic
- TreatMissingData: notBreaching (executor is off-hours M-F only)

Semantics
---------
A positive value means the SF self-heal (CheckPredictorCoverage → ReinvokePredictor) failed to close the gap before the executor read predictions.json: either orchestration regressed or a ticker is genuinely un-scorable. Long-term regression guard for the coverage invariant.

cfn-lint clean (only pre-existing W2001 warnings on unused parameters).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
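The self-heal flow described in this commit (check coverage, retry once for the missing tickers, recheck, then fail or continue) can be sketched in plain Python. This is only a model of the control flow under assumed data shapes; the real logic lives in Step Function states and a Lambda, and the names `missing_tickers`, `self_heal`, and the `reinvoke` callback are hypothetical.

```python
from typing import Callable, Iterable


def missing_tickers(buy_candidates: Iterable[str], predicted: Iterable[str]) -> list[str]:
    """Buy candidates that have no prediction row, in sorted order."""
    return sorted(set(buy_candidates) - set(predicted))


def self_heal(
    buy_candidates: Iterable[str],
    predicted: set[str],
    reinvoke: Callable[[list[str]], set[str]],
) -> str:
    """Model check -> single retry -> recheck; return the next state name."""
    candidates = list(buy_candidates)
    gap = missing_tickers(candidates, predicted)
    if not gap:
        return "PredictorHealthCheck"
    # Single retry, scoped to the missing tickers only (no loop).
    predicted = predicted | reinvoke(gap)
    if missing_tickers(candidates, predicted):
        return "HandleFailure"
    return "PredictorHealthCheck"
```

In this model the number of tickers left in the gap after the retry plays the role of `unscored_buy_candidates_count`: zero on success, positive when the retry could not close the gap.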
Summary
Part 2/3 of the ae-dashboard de-bloat split. Updates the Step Function's two SSM commands that currently invoke scripts from `/home/ec2-user/alpha-engine-data` to run from `/home/ec2-user/alpha-engine-dashboard` instead.
Both scripts (`trading_calendar.py`, `health_checker.py`) were copied verbatim into the dashboard repo in cipher813/alpha-engine-dashboard#18. CLI contract unchanged — `"TRADING DAY"` / `"MARKET_CLOSED"` stdout markers and `--alert` flag behavior are identical.
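A minimal sketch of how a caller could consume that stdout contract, assuming the script prints exactly one of the two markers somewhere in its output. The function name `classify_trading_day` and the choice to return `"unknown"` on ambiguous output are illustrative assumptions, not code from either repo or from the Step Function definition.

```python
from typing import Literal

TRADING_MARKER = "TRADING DAY"
CLOSED_MARKER = "MARKET_CLOSED"


def classify_trading_day(stdout: str) -> Literal["trading", "closed", "unknown"]:
    """Map captured stdout to a state using the two documented markers.

    Hypothetical helper: the real Step Function does its own matching.
    """
    has_trading = TRADING_MARKER in stdout
    has_closed = CLOSED_MARKER in stdout
    if has_trading and not has_closed:
        return "trading"
    if has_closed and not has_trading:
        return "closed"
    # Neither or both markers present: report unknown rather than guess.
    return "unknown"
```

Since the markers are plain substrings, the check behaves the same regardless of which repo directory the script runs from, which is the property this PR relies on.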
Change
`infrastructure/step_function_daily.json`, two hunks:
```diff
"source .venv/bin/activate",
"python trading_calendar.py"
```
```diff
"source .venv/bin/activate",
"python health_checker.py --alert 2>&1 | tee /var/log/health-check.log"
```
No other states touched. No CloudFormation changes needed.
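One way to confirm that no command still points at the old clone is to scan every string in the definition for the old path. The sketch below assumes only that the definition is valid JSON; the helper names are hypothetical, and the sample layout in the usage note is an assumed shape rather than the real contents of `step_function_daily.json`.

```python
import json
from typing import Any, Iterator

OLD_PATH = "/home/ec2-user/alpha-engine-data"


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value nested anywhere inside parsed JSON."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_stale_commands(definition_json: str, old_path: str = OLD_PATH) -> list[str]:
    """Return every string in the definition that still mentions old_path."""
    return [s for s in iter_strings(json.loads(definition_json)) if old_path in s]


# Assumed sample shape, for illustration only.
sample = json.dumps({
    "States": {
        "CheckTradingDay": {"Parameters": {"commands": [
            "cd /home/ec2-user/alpha-engine-dashboard",
            "python trading_calendar.py",
        ]}},
    }
})
print(find_stale_commands(sample))
```

An empty list means no string in the definition references the old directory; the old path is not a substring of the new one, so repointed commands are not flagged.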
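Pre-merge requirements
1. cipher813/alpha-engine-dashboard#18 must be merged
2. ae-dashboard must have pulled the new dashboard repo files (daily `boot-pull.timer` or manual pull)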
Deploy
Operator runs the existing deploy script post-merge:
```bash
bash infrastructure/deploy_step_function_daily.sh
```
Test plan
Part 3/3
After a clean weekday run validates new paths: separate PR to delete `trading_calendar.py` + `health_checker.py` + their tests from alpha-engine-data, plus cross-repo update to remove `alpha-engine-data` from alpha-engine-dashboard's `boot-pull.sh` REPOS list and delete the clone on ae-dashboard.
🤖 Generated with Claude Code