Point CheckTradingDay + HealthCheck SSM commands at alpha-engine-dashboard#42

Merged
cipher813 merged 1 commit into main from chore/step-function-point-to-dashboard-paths on Apr 16, 2026

@cipher813
Owner

Summary

Part 2/3 of the ae-dashboard de-bloat split. Updates the Step Function's two SSM commands that currently invoke scripts from `/home/ec2-user/alpha-engine-data` to run from `/home/ec2-user/alpha-engine-dashboard` instead.

Both scripts (`trading_calendar.py`, `health_checker.py`) were copied verbatim into the dashboard repo in cipher813/alpha-engine-dashboard#18. CLI contract unchanged — `"TRADING DAY"` / `"MARKET_CLOSED"` stdout markers and `--alert` flag behavior are identical.
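The stdout contract above can be illustrated with a small classifier. This is a minimal sketch, not code from either repo: the enum, the function name, and the substring matching are assumptions, and the PR does not show how the Step Function itself inspects the output. Only the two marker strings come from the description.

```python
from enum import Enum


class TradingDayStatus(Enum):
    TRADING_DAY = "TRADING DAY"
    MARKET_CLOSED = "MARKET_CLOSED"
    UNKNOWN = "UNKNOWN"


def classify_trading_day_output(stdout: str) -> TradingDayStatus:
    """Map captured stdout to a status using the two markers named in this PR.

    Hypothetical helper: substring matching is an assumption about how a
    caller might consume the markers.
    """
    if "MARKET_CLOSED" in stdout:
        return TradingDayStatus.MARKET_CLOSED
    if "TRADING DAY" in stdout:
        return TradingDayStatus.TRADING_DAY
    # Neither marker present: surface that instead of guessing.
    return TradingDayStatus.UNKNOWN
```

Returning an explicit `UNKNOWN` keeps a silent contract break (for example, a renamed marker) visible to the caller rather than defaulting to either branch.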

Change

`infrastructure/step_function_daily.json`, two hunks:

```diff
-  "cd /home/ec2-user/alpha-engine-data",
+  "cd /home/ec2-user/alpha-engine-dashboard",
   "source .venv/bin/activate",
   "python trading_calendar.py"
```

```diff
-  "cd /home/ec2-user/alpha-engine-data",
+  "cd /home/ec2-user/alpha-engine-dashboard",
   "source .venv/bin/activate",
   "python health_checker.py --alert 2>&1 | tee /var/log/health-check.log"
```

No other states touched. No CloudFormation changes needed.

Pre-merge requirements

  • cipher813/alpha-engine-dashboard#18 merged
  • ae-dashboard has pulled the new dashboard files (either via installed `boot-pull.timer` or manual `git -C /home/ec2-user/alpha-engine-dashboard pull`)

Deploy

Operator runs the existing deploy script post-merge:

```bash
bash infrastructure/deploy_step_function_daily.sh
```

Test plan

  • `python3 -c "import json; json.load(open('infrastructure/step_function_daily.json'))"` — JSON valid
  • Post-deploy: next weekday run — `CheckTradingDay` logs `/var/log/amazon/ssm/...` show `cd /home/ec2-user/alpha-engine-dashboard`, `HealthCheck` writes `/var/log/health-check.log` with identical report shape
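The JSON-validity check could be extended to list every string in the definition that still mentions the old path. The sketch below is an illustration only: the `SAMPLE` shape (`States`, `Parameters`, `commands`) is a hypothetical stand-in, not the real layout of `infrastructure/step_function_daily.json`, and other states may legitimately keep the old path, so the output is meant for manual review.

```python
from typing import Any, Iterator

OLD_PATH = "/home/ec2-user/alpha-engine-data"

# Hypothetical definition fragment; the real file's structure may differ.
SAMPLE = {
    "States": {
        "CheckTradingDay": {"Parameters": {"commands": [
            "cd /home/ec2-user/alpha-engine-dashboard",
            "source .venv/bin/activate",
            "python trading_calendar.py",
        ]}},
        "HealthCheck": {"Parameters": {"commands": [
            "cd /home/ec2-user/alpha-engine-data",  # stale on purpose
            "source .venv/bin/activate",
        ]}},
    }
}


def iter_strings(node: Any, path: str = "$") -> Iterator[tuple[str, str]]:
    """Yield (json_path, text) for every string value in a JSON tree."""
    if isinstance(node, str):
        yield path, node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_strings(value, f"{path}[{index}]")


def find_references(definition: Any, needle: str = OLD_PATH) -> list[str]:
    """Return the JSON paths of all strings that contain `needle`."""
    return [path for path, text in iter_strings(definition) if needle in text]
```

Against the real file, the same function would be called on `json.load(open("infrastructure/step_function_daily.json"))`. Note that the old path is not a substring of the new one, so a plain substring test does not flag already-migrated commands.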

Part 3/3

After a clean weekday run validates new paths: separate PR to delete `trading_calendar.py` + `health_checker.py` + their tests from alpha-engine-data, plus cross-repo update to remove `alpha-engine-data` from alpha-engine-dashboard's `boot-pull.sh` REPOS list and delete the clone on ae-dashboard.

🤖 Generated with Claude Code

…board

Part 2/3 of ae-dashboard de-bloat. Once this deploys, the Step Function
invokes trading_calendar.py and health_checker.py from
/home/ec2-user/alpha-engine-dashboard instead of /home/ec2-user/
alpha-engine-data. Both scripts were copied to the dashboard repo in
cipher813/alpha-engine-dashboard#18 — identical content, same CLI
contract (TRADING DAY / MARKET_CLOSED stdout markers, --alert flag).

Pre-merge requirements:
  1. cipher813/alpha-engine-dashboard#18 must be merged
  2. ae-dashboard must have pulled the new dashboard repo files
     (daily boot-pull.timer or manual pull)

No other SF states touched. No CloudFormation or deploy-script
changes needed — operator runs the existing
infrastructure/deploy_step_function_daily.sh after merge to apply
the updated definition.

Part 3/3 (file deletion + alpha-engine-data removal from ae-dashboard's
boot-pull REPOS list + clone removal) follows after next weekday run
verifies the new paths work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit c76aa58 into main Apr 16, 2026
1 check passed
@cipher813 cipher813 deleted the chore/step-function-point-to-dashboard-paths branch April 16, 2026 16:41
cipher813 added a commit that referenced this pull request Apr 16, 2026
…oard) (#43)

Part 3/3 of ae-dashboard de-bloat. Both scripts were copied verbatim
to alpha-engine-dashboard in cipher813/alpha-engine-dashboard#18, and
the Step Function SSM commands were repointed in
#42 (merged 2026-04-16) to run from
/home/ec2-user/alpha-engine-dashboard.

This PR removes the original copies from alpha-engine-data now that
nothing in the weekday Step Function or Saturday pipeline references
them anymore. The data repo is now scoped purely to data-production
code (collectors, builders, features, weekly_collector) — matches the
producer-vs-observability seam documented in the earlier commits.

Pre-merge requirements (MERGE ORDER IS IMPORTANT):
  1. Friday 2026-04-17 weekday Step Function run must complete
     successfully using the new dashboard paths (verify CheckTradingDay
     logs `cd /home/ec2-user/alpha-engine-dashboard`, HealthCheck writes
     /var/log/health-check.log normally)
  2. Update ae-dashboard crontab — there's a
     `0 */6 * * *` entry still running `cd /home/ec2-user/alpha-engine-data
     && .venv/bin/python health_checker.py --alert`. Operator must
     crontab -e on ae-dashboard and swap the path to
     /home/ec2-user/alpha-engine-dashboard before this PR merges, else
     the cron breaks until that edit happens
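The crontab swap in requirement 2 amounts to a one-line path substitution. The following is a minimal sketch of that rewrite as a pure text transform; the function name is hypothetical, and applying the result (for example via `crontab -e`) is left to the operator as the commit message describes.

```python
OLD_PATH = "/home/ec2-user/alpha-engine-data"
NEW_PATH = "/home/ec2-user/alpha-engine-dashboard"


def repoint_health_check(crontab_text: str) -> str:
    """Swap the repo path on active crontab lines that run health_checker.py.

    Comment lines and entries for other scripts are left untouched.
    """
    rewritten = []
    for line in crontab_text.splitlines():
        is_comment = line.lstrip().startswith("#")
        if not is_comment and "health_checker.py" in line:
            line = line.replace(OLD_PATH, NEW_PATH)
        rewritten.append(line)
    trailing_newline = "\n" if crontab_text.endswith("\n") else ""
    return "\n".join(rewritten) + trailing_newline
```

The transform is idempotent: running it on an already-updated crontab changes nothing, because the old path does not occur inside the new one.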

Pairs with cipher813/alpha-engine-dashboard#19 (removes
alpha-engine-data from the dashboard's boot-pull.sh REPOS list).

Tests: full suite 49 passed (was 71 before moves — the 22 delta is
26 tests deleted with the moved files + 5 new test_module_health
tests added earlier + others).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 16, 2026
…44)

* Migrate DataPhase1 to spot + fix SaturdayHealthCheck path

DataPhase1 now runs on a self-terminating c5.large spot instance
(same pattern as Backtester + PredictorTraining) instead of
hammering the t3.micro. The micro becomes a dispatcher: pulls the
latest launcher script, sources .env, invokes bash
infrastructure/spot_data_phase1.sh. All heavy Python work
(yfinance, polygon, FRED, ArcticDB append) runs on the spot.

Rationale: the 2026-04-16 OOM incident showed that running
data-refresh workloads on a 1 GB RAM instance is fragile-by-design.
Even though Saturday DataPhase1 has historically fit in micro RAM
(it uses different code paths than the daily feature compute that
OOM'd today), consolidating all heavy weekly compute onto
self-terminating spots aligns DataPhase1 with the existing
Backtester/PredictorTraining pattern and removes the 1 GB ceiling
from future data-refresh growth.

Also: SaturdayHealthCheck SSM command repointed from
/home/ec2-user/alpha-engine-data (health_checker.py was deleted
from that repo in #43) to /home/ec2-user/alpha-engine-dashboard
where it now lives. Mirrors the same fix applied to the weekday
HealthCheck step in #42.

Files:
  - new  infrastructure/spot_data_phase1.sh      (spot launcher, mirrors spot_backtest.sh)
  - edit infrastructure/step_function.json        (DataPhase1 + SaturdayHealthCheck commands)

Timeout bumped 1800 → 2700s to accommodate spot bootstrap overhead
(~7 min for instance launch + pip install on top of ~20 min workload).

Deferred (separate PR): migrate RAGIngestion + DriftDetection to
spot as well. They still run on the micro and need alpha-engine-data
restored on ae-dashboard for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Bundle DataPhase1+RAGIngestion on one spot; migrate DriftDetection

Extends the DataPhase1-to-spot migration to cover all three Saturday
SF steps that were running heavy alpha-engine-data workloads on the
t3.micro:

- DataPhase1 + RAGIngestion now share a single spot instance via
  spot_data_weekly.sh (renamed from spot_data_phase1.sh). Both
  workloads use the same alpha-engine-data clone + pip install —
  bundling saves ~7 min of bootstrap overhead and one spot request.
  RAGIngestion SF state chain (RAGIngestion + WaitForRAGIngestion +
  CheckRAGStatus + RAGWait + ExtractRAGError) removed; DataPhase1's
  success now wires directly to Research.

- DriftDetection moves to its own spot via spot_drift_detection.sh.
  Launcher clones BOTH alpha-engine-data and alpha-engine-predictor
  (drift_detector lives in data/monitoring/ but imports from
  predictor via PYTHONPATH). Overkill cost-wise for the ~5 min
  workload (~7 min bootstrap + ~5 min work vs ~5 min on micro), but
  completes the architectural goal: zero heavy venvs on the micro.

Net effect on ae-dashboard after next boot-pull:
  - alpha-engine-data: cloned (for launcher scripts only, ~300 lines bash)
  - alpha-engine-data/.venv: can be deleted permanently
  - 0 heavy Python workloads running on the t3.micro at any point in
    the Saturday pipeline

Timeout bumps:
  - DataPhase1 (bundled): 2700s → 3600s (phase1 ~20min + rag ~15min + bootstrap ~7min)
  - DriftDetection: 300s → 1200s (bootstrap ~7min + workload ~5min)
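The timeout figures can be sanity-checked against the phase estimates quoted in these commit messages. This is a small arithmetic sketch; the helper name is hypothetical, and the minute values are the approximate ("~") figures from the text, not measurements.

```python
def headroom_seconds(timeout_s: int, *phase_minutes: float) -> float:
    """Seconds left in a state timeout after the estimated phases finish."""
    return timeout_s - 60 * sum(phase_minutes)


# DataPhase1 alone: ~7 min bootstrap + ~20 min workload under 2700s.
data_phase1 = headroom_seconds(2700, 7, 20)

# Bundled DataPhase1 + RAGIngestion: ~20 + ~15 + ~7 min under 3600s.
bundled = headroom_seconds(3600, 20, 15, 7)

# DriftDetection: ~7 min bootstrap + ~5 min workload under 1200s.
drift = headroom_seconds(1200, 7, 5)
```

Under those estimates both DataPhase1 settings leave 1080 seconds (18 minutes) of slack, and DriftDetection leaves 480 seconds (8 minutes).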

SF state count: 34 → 30 (-4 RAG chain states).

Followup roadmap P2: bundle DriftDetection onto PredictorTraining's
spot since drift reads predictor weights produced by that step —
would save another bootstrap cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 20, 2026
* feat(sf): coverage-gap self-heal between Predictor and executor

Closes the Research↔Predictor coverage gap at the orchestration layer
(Phase 2). Pairs with alpha-engine-predictor PR #42 (--tickers flag +
check_coverage action) and alpha-engine PR #72 (executor read-time guard).

Problem
-------
2026-04-20: executor daemon bought SNDK/WDC/BIIB/XEL at market open
despite 7 buy_candidates having no prediction row. GBM veto gate was
structurally unreachable for those tickers (no prediction → no veto).
4 of 5 live entries (~80% of capital) routed around a risk control.

Architecture
------------
The invariant is "every buy_candidate must have a prediction before the
executor sees signals.json". Previously enforced nowhere. Now enforced
in two layers:

- **Self-heal (this PR):** PredictorInference → CheckPredictorCoverage
  → (if gap) ReinvokePredictor with tickers=missing → RecheckCoverage →
  (if STILL gap) HandleFailure. Single retry — no infinite loop.
- **Defense-in-depth (predictor #42 + executor #72):** both predictor
  write-time and executor read-time refuse to proceed on a coverage gap.
  These fire if the self-heal mechanism above ever regresses.

State graph added
-----------------
PredictorInference
  └→ CheckPredictorCoverage (new, Lambda action=check_coverage)
       └→ CoverageGapChoice (new)
            ├─ has_gap=true  → ReinvokePredictor (new, Lambda action=predict
            │                   + tickers=$.coverage_result.Payload.missing_tickers)
            │                   └→ RecheckCoverage (new)
            │                        └→ FinalCoverageGate (new)
            │                             ├─ still has_gap → HandleFailure
            │                             └─ default       → PredictorHealthCheck
            └─ default        → PredictorHealthCheck

All state references validated: 24 states total, no missing Next
targets, no unreachable states.
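The single-retry control flow in the state graph can be sketched as plain Python. This is an illustration of the branching only: the function and its callables are hypothetical stand-ins for the Lambda invocations, and the returned strings are the next-state names from the graph above.

```python
from typing import Callable


def run_coverage_gate(
    check_coverage: Callable[[], list[str]],
    reinvoke_predictor: Callable[[list[str]], None],
) -> str:
    """Return the name of the state the graph would transition to next.

    `check_coverage` stands in for the check_coverage action and returns
    the missing tickers; `reinvoke_predictor` stands in for ReinvokePredictor.
    """
    missing = check_coverage()            # CheckPredictorCoverage
    if not missing:                       # CoverageGapChoice, default branch
        return "PredictorHealthCheck"

    reinvoke_predictor(missing)           # ReinvokePredictor, tickers=missing
    still_missing = check_coverage()      # RecheckCoverage
    if still_missing:                     # FinalCoverageGate, gap persists
        return "HandleFailure"
    return "PredictorHealthCheck"         # FinalCoverageGate, default branch
```

Because the retry is written out once rather than as a loop, the sketch mirrors the "single retry, no infinite loop" property directly: a second gap always ends in `HandleFailure`.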

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cf): CloudWatch alarm on unscored_buy_candidates_count

Phase 4 (CW alarm) of the Research↔Predictor coverage-gap closure.
Bundled into the same PR as Phase 2 (SF self-heal) since both are infra
living in this repo.

New alarm
---------
- Namespace:  AlphaEngine/Predictor
- MetricName: unscored_buy_candidates_count
- Emitted by executor's signal_reader on every run (0 on success, >0 on gap)
- Threshold: Maximum ≥ 1 over any 1-hour window
- Action: existing alpha-engine-alerts SNS topic
- TreatMissingData: notBreaching (executor is off-hours M-F only)
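The alarm settings listed above can be read as a simple rule over the datapoints in one evaluation window. The sketch below is a deliberately simplified model, not CloudWatch's actual evaluation logic: it assumes a single window, uses a hypothetical function name, and only encodes the Maximum ≥ 1 threshold and the `notBreaching` treatment of missing data.

```python
from typing import Sequence


def evaluate_alarm(datapoints: Sequence[float], threshold: float = 1.0) -> str:
    """Simplified single-window model of the alarm described above."""
    if not datapoints:
        # TreatMissingData: notBreaching, so an idle executor never alarms.
        return "OK"
    # Statistic is Maximum: one run with a gap is enough to trip the alarm.
    return "ALARM" if max(datapoints) >= threshold else "OK"
```

Using Maximum rather than an average means a single run with unscored candidates cannot be diluted by clean runs in the same hour.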

Semantics
---------
A positive value means the SF self-heal (CheckPredictorCoverage →
ReinvokePredictor) failed to close the gap before the executor read
predictions.json — either orchestration regressed or a ticker is
genuinely un-scorable. Long-term regression guard for the coverage
invariant.

cfn-lint clean (only pre-existing W2001 warnings on unused parameters).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>