Split Evaluator from Backtester into independent Step Function step by cipher813 · Pull Request #23 · cipher813/alpha-engine-data

cipher813 · 2026-04-12T14:29:51Z

Summary

Add a dedicated Evaluator step to the Saturday Step Function, running evaluate.py on the always-on EC2 (not a spot instance). evaluate.py reads simulation artifacts from S3, so it is data-independent from backtest.py.

Why now

The system is underperforming the market and the evaluator output is the primary tool for diagnosing why — signal quality, attribution, grading, regression detection, optimizer recommendations. Having eval coupled to the backtester meant every evaluate.py bug required a full 10-minute spot relaunch to retest. With the split, eval failures can be retried in seconds.

Pipeline flow

Before: `... → CheckBacktesterStatus → SaturdayHealthCheck`
After: `... → CheckBacktesterStatus → Evaluator → WaitForEvaluator → CheckEvaluatorStatus → SaturdayHealthCheck`

5 new states

Evaluator: SSM sendCommand running `evaluate.py --mode all --upload` on always-on EC2
WaitForEvaluator: SSM getCommandInvocation poll
CheckEvaluatorStatus: Choice (Success/InProgress/Pending/Default→Error)
EvaluatorWait: 15s poll interval (fast — eval takes ~10s)
ExtractEvaluatorError: Pass → HandleFailure
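
For readers who want to see how the five states could fit together, here is a rough sketch that builds them as an Amazon States Language fragment from Python. It is not the deployed definition: the instance id, the `ResultPath` names, the use of AWS SDK service integrations, and the routing of InProgress/Pending through `EvaluatorWait` are all assumptions made for illustration.

```python
import json

# Assumption: placeholder id for the always-on EC2 instance (not given in the PR).
INSTANCE_ID = "i-0123456789abcdef0"
# Repo path taken from the Prerequisites section.
REPO_DIR = "/home/ec2-user/alpha-engine-backtester"


def evaluator_states(success_next="SaturdayHealthCheck", failure_next="HandleFailure"):
    """Return a hypothetical ASL fragment containing the five Evaluator states."""
    status = "$.evaluatorStatus.Status"  # assumed location of the SSM status
    return {
        "Evaluator": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
            "Parameters": {
                "DocumentName": "AWS-RunShellScript",
                "InstanceIds": [INSTANCE_ID],
                "Parameters": {
                    "commands": [
                        f"cd {REPO_DIR}",
                        ".venv/bin/python evaluate.py --mode all --upload",
                    ]
                },
            },
            "ResultPath": "$.evaluator",
            "Next": "WaitForEvaluator",
        },
        "WaitForEvaluator": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:ssm:getCommandInvocation",
            "Parameters": {
                "CommandId.$": "$.evaluator.Command.CommandId",
                "InstanceId": INSTANCE_ID,
            },
            "ResultPath": "$.evaluatorStatus",
            "Next": "CheckEvaluatorStatus",
        },
        "CheckEvaluatorStatus": {
            "Type": "Choice",
            "Choices": [
                {"Variable": status, "StringEquals": "Success", "Next": success_next},
                {"Variable": status, "StringEquals": "InProgress", "Next": "EvaluatorWait"},
                {"Variable": status, "StringEquals": "Pending", "Next": "EvaluatorWait"},
            ],
            "Default": "ExtractEvaluatorError",  # anything else is an error
        },
        "EvaluatorWait": {"Type": "Wait", "Seconds": 15, "Next": "WaitForEvaluator"},
        "ExtractEvaluatorError": {"Type": "Pass", "Next": failure_next},
    }


def trace_success_path(states, start="Evaluator"):
    """Follow Next links, taking the Success branch at the Choice state."""
    path, name = [], start
    while name in states:
        path.append(name)
        state = states[name]
        if state["Type"] == "Choice":
            name = next(c["Next"] for c in state["Choices"]
                        if c["StringEquals"] == "Success")
        else:
            name = state["Next"]
    path.append(name)  # first state outside the fragment
    return path


if __name__ == "__main__":
    print(json.dumps(evaluator_states(), indent=2))
    print(" → ".join(trace_success_path(evaluator_states())))
```

Tracing the success branch reproduces the "After" flow above; the wait and error states only appear on the polling and failure branches.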

Prerequisites

Backtester venv installed on always-on EC2 (`python3.11 -m venv .venv` + `pip install -r requirements.txt` + `pip install -e flow-doctor`). Done via SSM 2026-04-12.
`alpha-engine-backtester` repo cloned at `/home/ec2-user/alpha-engine-backtester` on the always-on instance. Done earlier in 2026-04-11 session.

Live deployment

Applied directly to the live state machine. This PR is the repo-side record.

Follow-up

Remove evaluate.py call from `spot_backtest.sh` (currently runs twice — harmless but redundant)
Add Evaluator to weekday SF for daily eval cadence

🤖 Generated with Claude Code

The Backtester and Evaluator were coupled in a single SSM command (spot_backtest.sh runs both sequentially). If evaluate.py crashed, the entire Backtester step failed and had to rerun from scratch, including the 10-min spot launch + simulation.

Split the Evaluator into its own Step Function step that runs on the always-on EC2 (not a spot instance). evaluate.py reads simulation artifacts from S3, so it's data-independent from backtest.py. Runs in ~10 seconds and doesn't need the spot instance's full environment.

## Pipeline flow change

Before: ... → CheckBacktesterStatus → SaturdayHealthCheck
After: ... → CheckBacktesterStatus → Evaluator → WaitForEvaluator → CheckEvaluatorStatus → SaturdayHealthCheck

## New states (5)

- Evaluator: SSM sendCommand (evaluate.py --mode all --upload)
- WaitForEvaluator: SSM getCommandInvocation poll
- CheckEvaluatorStatus: Choice (Success/InProgress/Pending/Error)
- EvaluatorWait: 15s poll interval (vs 60s for backtester)
- ExtractEvaluatorError: Pass → HandleFailure

## Design decisions

- Runs on always-on EC2 with backtester venv (installed 2026-04-12). No spot instance overhead.
- 5-min timeout (300s): evaluate.py takes ~10s, so 30x headroom.
- 15s poll interval: evaluate.py is fast, so we don't need the 60s backtester poll cadence.
- Independent retry: 1 retry with 30s backoff.
- Independent Catch → HandleFailure for evaluator-specific errors.

## Enables

- Rerun eval without rerunning backtester (tail-start from Evaluator step)
- Run eval at a different cadence than backtester (e.g., add to weekday SF or standalone EventBridge trigger); P1 on roadmap
- Faster feedback loop for eval bugs (no 10-min spot bootstrap)

## Follow-up (not in this PR)

- Remove the evaluate.py call from spot_backtest.sh so it only runs backtest.py. Currently eval runs twice (on spot + on EC2), which is harmless but redundant.
- Wire Evaluator into the weekday Step Function for daily eval cadence.

SF definition: 35 states (was 30).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
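
The poll interval and timeout choices in this commit can be reasoned about with a small local model of the wait loop. The sketch below is only an analogy in plain Python, not how Step Functions executes the states; `wait_for_command`, `get_status`, and the injectable `sleep` are hypothetical names introduced for illustration.

```python
import time

STILL_RUNNING = {"Pending", "InProgress"}


def wait_for_command(get_status, poll_interval_s=15, timeout_s=300, sleep=time.sleep):
    """Poll get_status() until Success, an error status, or the timeout.

    get_status is any zero-argument callable returning an SSM-style status
    string; sleep is injectable so the loop can be exercised without waiting.
    """
    waited = 0
    while True:
        status = get_status()
        if status == "Success":
            return waited
        if status not in STILL_RUNNING:
            # Mirrors the Choice state's default branch: anything else is an error.
            raise RuntimeError(f"evaluator command ended with status {status!r}")
        if waited + poll_interval_s > timeout_s:
            raise TimeoutError(f"no terminal status after {waited}s")
        sleep(poll_interval_s)
        waited += poll_interval_s


if __name__ == "__main__":
    # Simulated run: two polls still running, then success.
    statuses = iter(["Pending", "InProgress", "Success"])
    print(wait_for_command(lambda: next(statuses), sleep=lambda s: None))
```

With a 15s interval and a 300s budget, the loop allows 20 waits before giving up, which is where the headroom over a ~10s run comes from.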

SSM RunCommand does not set HOME. The Evaluator step was missing `export HOME=/home/ec2-user`, unlike PredictorTraining, which had it. Without HOME, Python's Path.home() resolves to /root (the SSM runner user), which could cause config search path misses.

executionTimeout was 300 s (5 min), dangerously short for `evaluate.py --mode all --upload`, which pulls research.db from S3, runs all analysis modules, uploads results, and sends email. Increased to 1800 s (30 min). State-level TimeoutSeconds updated to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
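
The HOME behaviour described in this commit is easy to reproduce locally on a POSIX system. The helper below, `home_seen_by_python`, is a hypothetical name used only for this demonstration; it temporarily sets HOME and reports what `Path.home()` resolves to.

```python
import os
from pathlib import Path


def home_seen_by_python(env_home):
    """Return Path.home() as resolved while HOME is set to env_home (POSIX)."""
    saved = os.environ.get("HOME")
    os.environ["HOME"] = env_home
    try:
        return Path.home()
    finally:
        # Restore the caller's environment exactly as it was.
        if saved is None:
            os.environ.pop("HOME", None)
        else:
            os.environ["HOME"] = saved


if __name__ == "__main__":
    # With the export in place, config lookups land under the expected user.
    print(home_seen_by_python("/home/ec2-user"))
```

When HOME is absent, `Path.home()` falls back to the password database entry for the current user, which is how a command run as root ends up with /root.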

…substrate (#176)

Adds infrastructure/setup_substrate_alarms.sh, an idempotent operator script that creates one CloudWatch alarm per inventory row plus one aggregate failure alarm. All point to the existing alpha-engine-alerts SNS topic.

Per-row alarm (alpha-engine-substrate-<row_id>): fires when the SubstrateRowOK metric for that row drops below 1. The lib emits 1=ok/not_yet_effective, 0=fail, so a single fail in a 24h window triggers SNS via Statistic=Minimum.

Aggregate alarm (alpha-engine-substrate-aggregate-failures): fires when SubstrateChecksFailed > 0. Safety net for accidental per-row alarm deletion; per-row alarms remain authoritative for which row failed.

treat-missing-data=notBreaching keeps weekly-cadence rows quiet between Sat-SF emissions; only emitted-and-failed datapoints fire.

Row enumeration sources from alpha_engine_lib.transparency.load_inventory(), so adding a row to the YAML and re-running this script automatically adds the corresponding alarm. No hardcoded row list to drift.

Bumps alpha-engine-lib pin v0.3.0 → v0.5.0 so the test imports of DEFAULT_NAMESPACE_OUT (added in lib #23) resolve.

15 new tests pin namespace alignment with lib, SNS target, row enumeration source, alarm semantics (LessThanThreshold + Minimum + notBreaching), and execution order (topic check before alarm creation). 505 total passing.

Operator runs once after data #175 deploys:

    pip install -r requirements.txt  # gets v0.5.0
    ./infrastructure/setup_substrate_alarms.sh

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
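
The per-row alarm semantics (LessThanThreshold + Minimum + notBreaching) can be illustrated with a deliberately simplified model of a single evaluation window. This is a sketch of the rule as described in the commit message, not CloudWatch's actual evaluation engine; `row_alarm_state` is a hypothetical helper.

```python
def row_alarm_state(datapoints, threshold=1.0):
    """Model one 24h window of a per-row metric (1 = ok, 0 = fail).

    Statistic=Minimum means one failing datapoint is enough to breach;
    an empty window models treat-missing-data=notBreaching.
    """
    if not datapoints:
        return "OK"  # no emission in the window, treated as not breaching
    return "ALARM" if min(datapoints) < threshold else "OK"


if __name__ == "__main__":
    print(row_alarm_state([]))         # quiet between weekly emissions
    print(row_alarm_state([1, 1, 1]))  # every check passed
    print(row_alarm_state([1, 0, 1]))  # a single failure in the window
```

Using the minimum rather than the average is what lets a single failed check surface even when other datapoints in the same window are healthy.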

cipher813 and others added 2 commits April 12, 2026 07:29

cipher813 merged commit f5bc09a into main Apr 14, 2026
1 check passed

cipher813 deleted the feat/split-evaluator-step-function branch April 14, 2026 14:49

cipher813 mentioned this pull request May 6, 2026

feat(sf): add WeeklySubstrateHealthCheck state at end of Saturday SF #175

Merged

4 tasks

cipher813 merged 2 commits into main from feat/split-evaluator-step-function
