Split Evaluator from Backtester into independent Step Function step #23

Merged
cipher813 merged 2 commits into main from feat/split-evaluator-step-function
Apr 14, 2026
@cipher813
Owner

Summary

Add a dedicated Evaluator step to the Saturday Step Function that runs evaluate.py on the always-on EC2 instance (not a spot instance). evaluate.py reads simulation artifacts from S3, so it is data-independent from backtest.py.

Why now

The system is underperforming the market and the evaluator output is the primary tool for diagnosing why — signal quality, attribution, grading, regression detection, optimizer recommendations. Having eval coupled to the backtester meant every evaluate.py bug required a full 10-minute spot relaunch to retest. With the split, eval failures can be retried in seconds.

Pipeline flow

Before: `... → CheckBacktesterStatus → SaturdayHealthCheck`
After: `... → CheckBacktesterStatus → Evaluator → WaitForEvaluator → CheckEvaluatorStatus → SaturdayHealthCheck`

5 new states

  • Evaluator: SSM sendCommand running `evaluate.py --mode all --upload` on always-on EC2
  • WaitForEvaluator: SSM getCommandInvocation poll
  • CheckEvaluatorStatus: Choice (Success/InProgress/Pending/Default→Error)
  • EvaluatorWait: 15s poll interval (fast — eval takes ~10s)
  • ExtractEvaluatorError: Pass → HandleFailure
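
A minimal sketch of how the polling loop formed by these states could route, written as plain Python rather than the real state machine definition. The state names come from the list above. The assumption that EvaluatorWait loops back to WaitForEvaluator, and the helper names `next_state` and `trace`, are illustrative and not taken from the repo.

```python
from typing import Iterable, List

POLL_INTERVAL_SECONDS = 15  # EvaluatorWait interval from the list above


def next_state(ssm_status: str) -> str:
    """Route one SSM invocation status the way CheckEvaluatorStatus is described."""
    if ssm_status == "Success":
        return "SaturdayHealthCheck"
    if ssm_status in ("InProgress", "Pending"):
        return "EvaluatorWait"
    # Default branch: Failed, TimedOut, Cancelled, and anything unexpected.
    return "ExtractEvaluatorError"


def trace(statuses: Iterable[str]) -> List[str]:
    """Walk the states visited for a sequence of polled statuses."""
    path = ["Evaluator"]
    for status in statuses:
        path += ["WaitForEvaluator", "CheckEvaluatorStatus"]
        target = next_state(status)
        path.append(target)
        if target == "ExtractEvaluatorError":
            path.append("HandleFailure")
        if target != "EvaluatorWait":
            break  # terminal for this sketch: success or failure
        # Assumption: EvaluatorWait sleeps, then loops back to WaitForEvaluator.
    return path


if __name__ == "__main__":
    print(" -> ".join(trace(["Pending", "InProgress", "Success"])))
    print(" -> ".join(trace(["Failed"])))
```

Routing every unrecognised status to the error path means a new or unexpected SSM status fails loudly instead of polling forever.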

Prerequisites

  • Backtester venv installed on always-on EC2 (`python3.11 -m venv .venv` + `pip install -r requirements.txt` + `pip install -e flow-doctor`). Done via SSM 2026-04-12.
  • `alpha-engine-backtester` repo cloned at `/home/ec2-user/alpha-engine-backtester` on the always-on instance. Done earlier, in the 2026-04-11 session.

Live deployment

Applied directly to the live state machine. This PR is the repo-side record.

Follow-up

  • Remove evaluate.py call from `spot_backtest.sh` (currently runs twice — harmless but redundant)
  • Add Evaluator to weekday SF for daily eval cadence

🤖 Generated with Claude Code

cipher813 and others added 2 commits April 12, 2026 07:29
The Backtester and Evaluator were coupled in a single SSM command
(spot_backtest.sh runs both sequentially). If evaluate.py crashed,
the entire Backtester step failed and had to rerun from scratch
including the 10-min spot launch + simulation.

Split the Evaluator into its own Step Function step that runs on
the always-on EC2 (not a spot instance). evaluate.py reads
simulation artifacts from S3, so it's data-independent from
backtest.py. Runs in ~10 seconds and doesn't need the spot
instance's full environment.

## Pipeline flow change

Before: ... → CheckBacktesterStatus → SaturdayHealthCheck
After:  ... → CheckBacktesterStatus → Evaluator →
              WaitForEvaluator → CheckEvaluatorStatus →
              SaturdayHealthCheck

## New states (5)

- Evaluator: SSM sendCommand (evaluate.py --mode all --upload)
- WaitForEvaluator: SSM getCommandInvocation poll
- CheckEvaluatorStatus: Choice (Success/InProgress/Pending/Error)
- EvaluatorWait: 15s poll interval (vs 60s for backtester)
- ExtractEvaluatorError: Pass → HandleFailure

## Design decisions

- Runs on always-on EC2 with backtester venv (installed
  2026-04-12). No spot instance overhead.
- 5-min timeout (300s) — evaluate.py takes ~10s. 30x headroom.
- 15s poll interval — evaluate.py is fast so we don't need the
  60s backtester poll cadence.
- Independent retry: 1 retry with 30s backoff.
- Independent Catch → HandleFailure for evaluator-specific errors.
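
For illustration, the decisions above could translate into a Task state roughly like the following, built here as a Python dict and serialised to Amazon States Language JSON. This is a sketch, not the deployed definition: the exact shell commands, the `States.ALL` matchers, the instance ID, and the function name `evaluator_task` are assumptions. The timeout is a parameter so the same sketch covers other values.

```python
import json
from typing import Any, Dict


def evaluator_task(instance_id: str, timeout_seconds: int = 300) -> Dict[str, Any]:
    """Build a hypothetical ASL Task state for the Evaluator step."""
    return {
        "Type": "Task",
        # Step Functions AWS SDK integration for SSM SendCommand.
        "Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
        "Parameters": {
            "DocumentName": "AWS-RunShellScript",
            "InstanceIds": [instance_id],
            "Parameters": {
                # Assumed command lines; the repo path and .venv come from the PR text.
                "commands": [
                    "cd /home/ec2-user/alpha-engine-backtester",
                    ".venv/bin/python evaluate.py --mode all --upload",
                ],
                "executionTimeout": [str(timeout_seconds)],
            },
        },
        "TimeoutSeconds": timeout_seconds,  # state-level timeout kept in step
        # "1 retry with 30s backoff"; matching States.ALL is an assumption.
        "Retry": [
            {"ErrorEquals": ["States.ALL"], "IntervalSeconds": 30, "MaxAttempts": 1}
        ],
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
        "Next": "WaitForEvaluator",
    }


if __name__ == "__main__":
    # Placeholder instance ID, not a real instance.
    print(json.dumps(evaluator_task("i-0123456789abcdef0"), indent=2))
```

Keeping `executionTimeout` and the state-level `TimeoutSeconds` derived from one argument avoids the two drifting apart when the timeout is tuned.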

## Enables

- Rerun eval without rerunning backtester (tail-start from
  Evaluator step)
- Run eval at a different cadence than backtester (e.g., add to
  weekday SF or standalone EventBridge trigger) — P1 on roadmap
- Faster feedback loop for eval bugs (no 10-min spot bootstrap)

## Follow-up (not in this PR)

- Remove the evaluate.py call from spot_backtest.sh so it only
  runs backtest.py. Currently eval runs twice (on spot + on EC2) —
  harmless but redundant.
- Wire Evaluator into the weekday Step Function for daily eval
  cadence.

SF definition: 35 states (was 30).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

SSM RunCommand does not set HOME. The Evaluator step was missing
`export HOME=/home/ec2-user`, unlike PredictorTraining which had it.
Without HOME, Python's Path.home() resolves to /root (the SSM runner
user), which could cause config search path misses.

executionTimeout was 300 s (5 min) — dangerously short for
`evaluate.py --mode all --upload` which pulls research.db from S3,
runs all analysis modules, uploads results, and sends email. Increased
to 1800 s (30 min). State-level TimeoutSeconds updated to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
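
A small sketch of why the `export HOME` line matters, assuming a POSIX host: `Path.home()` follows the `HOME` environment variable when it is set and otherwise falls back to the password-database entry for the current user. The helper names `home_seen_by_python` and `with_home` are hypothetical and not part of the repo.

```python
import os
from pathlib import Path
from typing import List

EC2_HOME = "/home/ec2-user"  # value exported in the fixed Evaluator step


def home_seen_by_python(home_value: str) -> Path:
    """Return Path.home() as seen with HOME set to home_value (POSIX only)."""
    saved = os.environ.get("HOME")
    os.environ["HOME"] = home_value
    try:
        return Path.home()
    finally:
        # Restore the caller's environment so the demo has no side effects.
        if saved is None:
            del os.environ["HOME"]
        else:
            os.environ["HOME"] = saved


def with_home(commands: List[str], home: str = EC2_HOME) -> List[str]:
    """Prefix an SSM command list with an explicit HOME export."""
    return [f"export HOME={home}"] + list(commands)


if __name__ == "__main__":
    print(home_seen_by_python(EC2_HOME))
    print(with_home(["python evaluate.py --mode all --upload"]))
```

Any config lookup built on `Path.home()` then resolves under `/home/ec2-user` regardless of which user the SSM agent runs as.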
@cipher813 cipher813 merged commit f5bc09a into main Apr 14, 2026
1 check passed
@cipher813 cipher813 deleted the feat/split-evaluator-step-function branch April 14, 2026 14:49
cipher813 added a commit that referenced this pull request May 6, 2026
…substrate (#176)

Adds infrastructure/setup_substrate_alarms.sh — idempotent operator
script that creates one CloudWatch alarm per inventory row plus one
aggregate failure alarm. All point to the existing alpha-engine-alerts
SNS topic.

Per-row alarm (alpha-engine-substrate-<row_id>): fires when
SubstrateRowOK metric for that row drops below 1. The lib emits
1=ok/not_yet_effective, 0=fail, so a single fail in a 24h window
triggers SNS via Statistic=Minimum.

Aggregate alarm (alpha-engine-substrate-aggregate-failures): fires
when SubstrateChecksFailed > 0. Safety net for accidental per-row
alarm deletion — per-row alarms remain authoritative for which row
failed.

treat-missing-data=notBreaching keeps weekly-cadence rows quiet
between Sat-SF emissions; only emitted-and-failed datapoints fire.
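
As a rough illustration of the alarm semantics described above, the per-row settings could map onto CloudWatch `put_metric_alarm` keyword arguments as below. This is a Python sketch only; the actual script is a shell script and its contents are not shown here. The namespace value, the `RowId` dimension name, the SNS ARN, and both function names are placeholders.

```python
from typing import Any, Dict, Sequence


def per_row_alarm_kwargs(
    row_id: str,
    sns_topic_arn: str,
    namespace: str = "ExampleNamespace",  # placeholder; the real value lives in the lib
) -> Dict[str, Any]:
    """Keyword arguments in the shape CloudWatch put_metric_alarm accepts."""
    return {
        "AlarmName": f"alpha-engine-substrate-{row_id}",
        "Namespace": namespace,
        "MetricName": "SubstrateRowOK",
        "Dimensions": [{"Name": "RowId", "Value": row_id}],  # hypothetical dimension
        "Statistic": "Minimum",
        "Period": 86400,  # one 24h window
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def would_fire(datapoints: Sequence[float]) -> bool:
    """Model the rule: Minimum over the window < 1, missing data never breaches."""
    if not datapoints:
        return False  # notBreaching: quiet between weekly emissions
    return min(datapoints) < 1


if __name__ == "__main__":
    arn = "arn:aws:sns:us-east-1:123456789012:alpha-engine-alerts"  # placeholder ARN
    print(per_row_alarm_kwargs("example_row", arn)["AlarmName"])
    print(would_fire([1, 1, 0]), would_fire([1, 1]), would_fire([]))
```

With `Statistic=Minimum`, one failing datapoint in the window is enough to trip the alarm, while an empty window stays quiet.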

Row enumeration sources from alpha_engine_lib.transparency.load_inventory()
so adding a row to the YAML and re-running this script automatically
adds the corresponding alarm. No hardcoded row list to drift.

Bumps alpha-engine-lib pin v0.3.0 → v0.5.0 so the test imports of
DEFAULT_NAMESPACE_OUT (added in lib #23) resolve.

15 new tests pin namespace alignment with lib, SNS target, row
enumeration source, alarm semantics (LessThanThreshold + Minimum +
notBreaching), and execution order (topic check before alarm
creation). 505 total passing.

Operator runs once after data #175 deploys:
  pip install -r requirements.txt  # gets v0.5.0
  ./infrastructure/setup_substrate_alarms.sh

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>