Split Evaluator from Backtester into independent Step Function step #23
Merged
The Backtester and Evaluator were coupled in a single SSM command
(spot_backtest.sh runs both sequentially). If evaluate.py crashed,
the entire Backtester step failed and had to rerun from scratch
including the 10-min spot launch + simulation.
Split the Evaluator into its own Step Function step that runs on
the always-on EC2 (not a spot instance). evaluate.py reads
simulation artifacts from S3, so it's data-independent from
backtest.py. Runs in ~10 seconds and doesn't need the spot
instance's full environment.
## Pipeline flow change
Before: ... → CheckBacktesterStatus → SaturdayHealthCheck
After: ... → CheckBacktesterStatus → Evaluator →
WaitForEvaluator → CheckEvaluatorStatus →
SaturdayHealthCheck
## New states (5)
- Evaluator: SSM sendCommand (evaluate.py --mode all --upload)
- WaitForEvaluator: SSM getCommandInvocation poll
- CheckEvaluatorStatus: Choice (Success/InProgress/Pending/Error)
- EvaluatorWait: 15s poll interval (vs 60s for backtester)
- ExtractEvaluatorError: Pass → HandleFailure
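
The five states above could be wired roughly as follows. This is a minimal sketch written as a Python dict that serializes to Amazon States Language, not the actual definition from this PR: the use of the `aws-sdk:ssm` integrations, the instance ID, the `python` prefix on the command, the `ResultPath` names, and the exact `Next` wiring are all assumptions made for illustration.

```python
import json

# Hypothetical placeholder; the real instance ID is not shown in this PR.
INSTANCE_ID = "i-0123456789abcdef0"
STATUS_PATH = "$.evaluator_status.Status"  # assumed ResultPath layout

evaluator_states = {
    # Kick off evaluate.py on the always-on EC2 via SSM.
    "Evaluator": {
        "Type": "Task",
        "Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
        "Parameters": {
            "DocumentName": "AWS-RunShellScript",
            "InstanceIds": [INSTANCE_ID],
            "Parameters": {"commands": ["python evaluate.py --mode all --upload"]},
        },
        "ResultPath": "$.evaluator",
        "Next": "WaitForEvaluator",
    },
    # Poll the command invocation for its current status.
    "WaitForEvaluator": {
        "Type": "Task",
        "Resource": "arn:aws:states:::aws-sdk:ssm:getCommandInvocation",
        "Parameters": {
            "CommandId.$": "$.evaluator.Command.CommandId",
            "InstanceId": INSTANCE_ID,
        },
        "ResultPath": "$.evaluator_status",
        "Next": "CheckEvaluatorStatus",
    },
    # Branch on status; anything not listed is treated as an error.
    "CheckEvaluatorStatus": {
        "Type": "Choice",
        "Choices": [
            {"Variable": STATUS_PATH, "StringEquals": "Success", "Next": "SaturdayHealthCheck"},
            {"Variable": STATUS_PATH, "StringEquals": "InProgress", "Next": "EvaluatorWait"},
            {"Variable": STATUS_PATH, "StringEquals": "Pending", "Next": "EvaluatorWait"},
        ],
        "Default": "ExtractEvaluatorError",
    },
    # 15s poll interval, then poll again.
    "EvaluatorWait": {"Type": "Wait", "Seconds": 15, "Next": "WaitForEvaluator"},
    # Hand the failure to the shared failure handler.
    "ExtractEvaluatorError": {"Type": "Pass", "Next": "HandleFailure"},
}


def next_targets(states):
    """Collect every state name referenced by Next, Default, or a Choice rule."""
    targets = set()
    for state in states.values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for rule in state.get("Choices", []):
            targets.add(rule["Next"])
    return targets


# States referenced here but defined elsewhere in the state machine.
external_targets = next_targets(evaluator_states) - set(evaluator_states)
asl_fragment = json.dumps(evaluator_states, indent=2)
```

`external_targets` should contain only `SaturdayHealthCheck` and `HandleFailure`, which is a quick way to confirm the fragment has no dangling transitions before merging it into the full definition.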
## Design decisions
- Runs on always-on EC2 with backtester venv (installed
2026-04-12). No spot instance overhead.
- 5-min timeout (300s) — evaluate.py takes ~10s. 30x headroom.
- 15s poll interval — evaluate.py is fast so we don't need the
60s backtester poll cadence.
- Independent retry: 1 retry with 30s backoff.
- Independent Catch → HandleFailure for evaluator-specific errors.
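
The timeout, retry, and catch decisions above map onto Task-state fields along these lines. This is a sketch under stated assumptions: the numbers come from the bullets, while matching on `States.ALL`, the `BackoffRate` of 1.0, and the `$.error` result path are illustrative choices rather than values confirmed by this PR.

```python
TYPICAL_RUNTIME_S = 10   # evaluate.py takes ~10s
TIMEOUT_S = 300          # 5-min timeout
POLL_INTERVAL_S = 15     # EvaluatorWait interval

# Fields that would be merged into the Evaluator Task state.
evaluator_error_handling = {
    "TimeoutSeconds": TIMEOUT_S,
    "Retry": [
        {
            "ErrorEquals": ["States.ALL"],  # assumption: retry on any error
            "IntervalSeconds": 30,          # 30s backoff
            "MaxAttempts": 1,               # 1 retry
            "BackoffRate": 1.0,             # assumption: no exponential growth
        }
    ],
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],  # assumption: catch any error
            "ResultPath": "$.error",        # assumed path for the error payload
            "Next": "HandleFailure",
        }
    ],
}

# Headroom of the timeout over the typical runtime.
headroom = TIMEOUT_S // TYPICAL_RUNTIME_S
# Upper bound on polls within one timeout window at the 15s cadence.
max_polls = TIMEOUT_S // POLL_INTERVAL_S
```

With these values `headroom` is 30 and `max_polls` is 20, which matches the "30x headroom" figure in the bullet above.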
## Enables
- Rerun eval without rerunning backtester (tail-start from
Evaluator step)
- Run eval at a different cadence than backtester (e.g., add to
weekday SF or standalone EventBridge trigger) — P1 on roadmap
- Faster feedback loop for eval bugs (no 10-min spot bootstrap)
## Follow-up (not in this PR)
- Remove the evaluate.py call from spot_backtest.sh so it only
runs backtest.py. Currently eval runs twice (on spot + on EC2) —
harmless but redundant.
- Wire Evaluator into the weekday Step Function for daily eval
cadence.
SF definition: 35 states (was 30).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SSM RunCommand does not set HOME. The Evaluator step was missing `export HOME=/home/ec2-user`, unlike PredictorTraining which had it. Without HOME, Python's Path.home() resolves to /root (the SSM runner user), which could cause config search path misses.

executionTimeout was 300 s (5 min), dangerously short for `evaluate.py --mode all --upload`, which pulls research.db from S3, runs all analysis modules, uploads results, and sends email. Increased to 1800 s (30 min). State-level TimeoutSeconds updated to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
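
A sketch of how the fixed command parameters might be assembled. The `with_home` helper and the `python` prefix on the command are hypothetical and used only for illustration; `commands` and `executionTimeout` are the parameter names of the `AWS-RunShellScript` document, and SSM expects each parameter value as a list of strings.

```python
EXECUTION_TIMEOUT_S = 1800  # 30 min, per this commit


def with_home(commands, home="/home/ec2-user"):
    """Prepend an explicit HOME export to a list of shell commands (idempotent)."""
    export = f"export HOME={home}"
    if commands and commands[0] == export:
        return list(commands)
    return [export, *commands]


# Parameters block for an AWS-RunShellScript sendCommand call.
ssm_parameters = {
    "commands": with_home(["python evaluate.py --mode all --upload"]),
    "executionTimeout": [str(EXECUTION_TIMEOUT_S)],
}

# The state-level timeout is kept equal to the SSM execution timeout.
state_timeout_seconds = EXECUTION_TIMEOUT_S
```

Making the helper idempotent means it can be applied to every SSM step without double-exporting HOME on steps that already set it.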
cipher813 added a commit that referenced this pull request May 6, 2026
…substrate (#176)

Adds infrastructure/setup_substrate_alarms.sh, an idempotent operator script that creates one CloudWatch alarm per inventory row plus one aggregate failure alarm. All point to the existing alpha-engine-alerts SNS topic.

Per-row alarm (alpha-engine-substrate-<row_id>): fires when the SubstrateRowOK metric for that row drops below 1. The lib emits 1=ok/not_yet_effective, 0=fail, so a single fail in a 24h window triggers SNS via Statistic=Minimum.

Aggregate alarm (alpha-engine-substrate-aggregate-failures): fires when SubstrateChecksFailed > 0. Safety net for accidental per-row alarm deletion; per-row alarms remain authoritative for which row failed.

treat-missing-data=notBreaching keeps weekly-cadence rows quiet between Sat-SF emissions; only emitted-and-failed datapoints fire.

Row enumeration sources from alpha_engine_lib.transparency.load_inventory() so adding a row to the YAML and re-running this script automatically adds the corresponding alarm. No hardcoded row list to drift.

Bumps alpha-engine-lib pin v0.3.0 → v0.5.0 so the test imports of DEFAULT_NAMESPACE_OUT (added in lib #23) resolve.

15 new tests pin namespace alignment with lib, SNS target, row enumeration source, alarm semantics (LessThanThreshold + Minimum + notBreaching), and execution order (topic check before alarm creation). 505 total passing.

Operator runs once after data #175 deploys:

    pip install -r requirements.txt  # gets v0.5.0
    ./infrastructure/setup_substrate_alarms.sh

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
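
The per-row alarm semantics described in that commit can be sketched as keyword arguments for CloudWatch's `PutMetricAlarm` call (for example via boto3's `put_metric_alarm`). The referenced script is a shell script, so this is an illustration and not its actual code; the `per_row_alarm_kwargs` helper, the `RowId` dimension name, the namespace value, and the example topic ARN are all assumptions.

```python
def per_row_alarm_kwargs(row_id, sns_topic_arn, namespace):
    """Build PutMetricAlarm arguments for one inventory row (hypothetical helper)."""
    return {
        "AlarmName": f"alpha-engine-substrate-{row_id}",
        "Namespace": namespace,
        "MetricName": "SubstrateRowOK",
        # Assumed dimension name; the real one is defined by the lib.
        "Dimensions": [{"Name": "RowId", "Value": row_id}],
        # Minimum over a 24h window: a single 0 (fail) datapoint breaches.
        "Statistic": "Minimum",
        "Period": 86400,
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        # Weekly-cadence rows emit nothing most days; stay quiet when no data.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder account, region, namespace, and row ID for illustration only.
example_kwargs = per_row_alarm_kwargs(
    row_id="example_row",
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:alpha-engine-alerts",
    namespace="ExampleNamespace",
)
```

Generating the arguments from the row ID alone is what lets the alarm set follow the inventory: enumerating rows and calling the helper once per row yields one alarm per row with no hardcoded list.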
Summary
Add a dedicated Evaluator step to the Saturday Step Function, running evaluate.py on the always-on EC2 (not a spot instance). Reads simulation artifacts from S3 — data-independent from backtest.py.
Why now
The system is underperforming the market and the evaluator output is the primary tool for diagnosing why — signal quality, attribution, grading, regression detection, optimizer recommendations. Having eval coupled to the backtester meant every evaluate.py bug required a full 10-minute spot relaunch to retest. With the split, eval failures can be retried in seconds.
Pipeline flow
Before: `... → CheckBacktesterStatus → SaturdayHealthCheck`
After: `... → CheckBacktesterStatus → Evaluator → WaitForEvaluator → CheckEvaluatorStatus → SaturdayHealthCheck`
5 new states
Prerequisites
Live deployment
Applied directly to the live state machine. This PR is the repo-side record.
Follow-up
🤖 Generated with Claude Code