feat(sf): retire dedicated Evaluator states, consolidate into Backtester by cipher813 · Pull Request #93 · cipher813/alpha-engine-data

cipher813 · 2026-04-24T21:20:10Z

Summary

Consolidates the 8-state dedicated Evaluator subflow into the Backtester step. Spot becomes the authoritative runner for backtest + parity + evaluator as a single atomic SSM command.

Deleted states (180 lines)

CheckSkipEvaluator
CheckEvaluatorFreeze
EvaluatorFrozen
Evaluator
WaitForEvaluator
CheckEvaluatorStatus
EvaluatorWait
ExtractEvaluatorError

Rewired transitions

CheckSkipBacktester skip path: now goes to SaturdayHealthCheck (was CheckSkipEvaluator)
CheckBacktesterStatus success path: now goes to SaturdayHealthCheck (was CheckSkipEvaluator)
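The deletion and rewiring above can be sketched in a few lines of Python. This is a minimal illustration rather than the script used for this PR: the `retire_states` helper and the toy definition are hypothetical, and the real state shapes in the SF JSON are not shown in this description. It assumes a flat state machine where every reference to a deleted state should move to one replacement target, which holds here because both rewired transitions go to SaturdayHealthCheck.

```python
import copy


def retire_states(definition, deleted, new_target):
    """Return a copy of an ASL-style definition with `deleted` states removed
    and every Next/Default that pointed at one of them sent to `new_target`."""
    result = copy.deepcopy(definition)
    for name in deleted:
        result["States"].pop(name, None)

    def retarget(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in ("Next", "Default") and isinstance(value, str) and value in deleted:
                    node[key] = new_target
                else:
                    retarget(value)
        elif isinstance(node, list):
            for item in node:
                retarget(item)

    retarget(result["States"])
    return result


# Toy definition (hypothetical shapes; only the state names come from this PR).
toy = {
    "StartAt": "CheckSkipBacktester",
    "States": {
        "CheckSkipBacktester": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.skip_backtester", "BooleanEquals": True,
                         "Next": "CheckSkipEvaluator"}],
            "Default": "Backtester",
        },
        "Backtester": {"Type": "Task", "Next": "CheckBacktesterStatus"},
        "CheckBacktesterStatus": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.status", "StringEquals": "Success",
                         "Next": "CheckSkipEvaluator"}],
            "Default": "HandleFailure",
        },
        "CheckSkipEvaluator": {"Type": "Pass", "Next": "SaturdayHealthCheck"},
        "SaturdayHealthCheck": {"Type": "Pass", "End": True},
        "HandleFailure": {"Type": "Fail"},
    },
}

consolidated = retire_states(toy, {"CheckSkipEvaluator"}, "SaturdayHealthCheck")
print(consolidated["States"]["CheckSkipBacktester"]["Choices"][0]["Next"])
```

The helper works on a deep copy, so the original definition stays available for diffing against the consolidated one.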

Motivation

The Evaluator was originally split from Backtester on 2026-04-12 "so eval can run at a different cadence." In production it has only ever run as the Sat SF tail — the cadence flexibility was never exercised. The split cost 2× IAM/SSM/CloudWatch wiring, 2× venv drift surface, and ~3–5 min of extra wall-clock per run for a capability that wasn't used.

Cadence flexibility is a trigger-side question (add an EventBridge rule → SSM or → spot_backtest.sh --skip-stages=backtest,parity, ~15 min of CF work) rather than a pipeline-shape question. Consolidating doesn't remove that flexibility — it just decouples it from the weekly shape.

Breaking input-contract changes

skip_evaluator: true — no longer honored (use skip_backtester: true to skip all three stages)
freeze_evaluator: true — no longer honored (use --freeze-evaluator on spot_backtest.sh manually)
skip_backtester: true — now skips evaluator too (previously ran evaluator on old artifacts via CheckSkipEvaluator side-path)
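Because the legacy keys are now silently ignored, a caller-side pre-flight check may help catch stale execution inputs. The following is a sketch only: `legacy_input_warnings` is a hypothetical helper that is not part of this PR, and it encodes just the three contract changes listed above.

```python
# Advice text mirrors the contract changes listed in this PR description.
LEGACY_KEYS = {
    "skip_evaluator": "no longer honored; use skip_backtester: true to skip all three stages",
    "freeze_evaluator": "no longer honored; pass --freeze-evaluator to spot_backtest.sh manually",
}


def legacy_input_warnings(execution_input):
    """Return one warning string per input key whose meaning changed in this PR."""
    warnings = []
    for key, advice in LEGACY_KEYS.items():
        if execution_input.get(key) is True:
            warnings.append(f"{key}: {advice}")
    if execution_input.get("skip_backtester") is True:
        warnings.append("skip_backtester: now skips evaluator too")
    return warnings


for line in legacy_input_warnings({"skip_evaluator": True, "skip_backtester": True}):
    print(line)
```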

Coordination

Coordinated with alpha-engine-backtester PR #74 — spot_backtest.sh gains --skip-stages + --freeze-evaluator + evaluator as 3rd default-on stage.
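As a rough model of the flag semantics described here, the stage selection could look like the Python below. This is illustrative only: the real logic lives in the bash script spot_backtest.sh in alpha-engine-backtester and is not shown in this PR. The stage order (backtest, parity, evaluator) is inferred from the description, and raising an error on an unknown stage name is an assumption.

```python
# Assumed default order: evaluator is described as the 3rd default-on stage.
DEFAULT_STAGES = ("backtest", "parity", "evaluator")


def stages_to_run(skip_stages=""):
    """Model of --skip-stages=a,b: return the default stages minus skipped ones."""
    skipped = {s.strip() for s in skip_stages.split(",") if s.strip()}
    unknown = skipped - set(DEFAULT_STAGES)
    if unknown:
        # Assumption: the real script's handling of unknown names is not documented here.
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    return [stage for stage in DEFAULT_STAGES if stage not in skipped]


print(stages_to_run())                   # weekly Saturday run: all three stages
print(stages_to_run("backtest,parity"))  # off-cycle evaluator-only run
```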

Deploy sequencing (order matters):

1. Merge alpha-engine-backtester PR #74
2. Merge this PR
3. Run bash infrastructure/deploy-infrastructure.sh from always-on EC2 or laptop

If reversed: next Sat SF hits old spot_backtest.sh (no evaluator stage) + new SF (no dedicated step) → no evaluator run that week.

Validation

python3 -m json.tool — valid JSON
0 dangling references to deleted states (verified via grep)
Line count: 824 → 644 (-180)
deploy-infrastructure.sh dry-run
First Sat SF run with consolidated shape
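The dangling-reference check was done with grep. A structural version of the same check can be sketched in Python as follows. It assumes a flat state machine (no nested Parallel or Map scopes, which would need per-scope handling), and the function names and sample JSON are illustrative.

```python
import json


def iter_targets(node):
    """Yield every state name referenced via Next, Default, or StartAt."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("Next", "Default", "StartAt") and isinstance(value, str):
                yield value
            else:
                yield from iter_targets(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_targets(item)


def dangling_references(definition):
    """Return sorted names that are referenced but not defined under States."""
    defined = set(definition["States"])
    return sorted({t for t in iter_targets(definition) if t not in defined})


# Sample with one leftover reference to a deleted state (illustrative shape).
broken = json.loads("""
{
  "StartAt": "CheckBacktesterStatus",
  "States": {
    "CheckBacktesterStatus": {
      "Type": "Choice",
      "Choices": [{"Variable": "$.status", "StringEquals": "Success",
                   "Next": "CheckSkipEvaluator"}],
      "Default": "SaturdayHealthCheck"
    },
    "SaturdayHealthCheck": {"Type": "Pass", "End": true}
  }
}
""")

print(dangling_references(broken))
```

An empty list corresponds to the "0 dangling references" result reported above.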

🤖 Generated with Claude Code

The Saturday Step Function's dedicated Evaluator subflow (8 states) was originally split from Backtester on 2026-04-12 "so eval can run at a different cadence." That cadence was never exercised: in production the evaluator only ever runs as the Sat SF tail after Backtester. The split added orchestration complexity (2x IAM/SSM/CloudWatch wiring, 2x venv drift surface, ~3-5 min extra wall-clock) for a capability that was never used.

Consolidation moves evaluator into the spot_backtest.sh Backtester step (coordinated alpha-engine-backtester PR adds it as 3rd default-on stage). Spot is the authoritative runner for backtest + parity + evaluator as a single atomic SSM command; SF orchestrates one step instead of seven.

Deleted states:
- CheckSkipEvaluator
- CheckEvaluatorFreeze
- EvaluatorFrozen
- Evaluator
- WaitForEvaluator
- CheckEvaluatorStatus
- EvaluatorWait
- ExtractEvaluatorError

Rewired transitions:
- CheckSkipBacktester skip path: CheckSkipEvaluator → SaturdayHealthCheck
- CheckBacktesterStatus success path: CheckSkipEvaluator → SaturdayHealthCheck

Input-contract changes (legacy inputs no longer honored):
- skip_evaluator: true → use skip_backtester (skips all three stages)
- freeze_evaluator: true → use --freeze-evaluator on spot CLI manually
- skip_backtester: true → now skips evaluator too

If off-cycle evaluator runs are ever needed, add an EventBridge rule that triggers spot_backtest.sh --skip-stages=backtest,parity (~15 min CF work), or an EventBridge → SSM path that runs evaluate.py on always-on EC2. Cadence flexibility is trigger-side, not pipeline-shape-side.

Stats: 824 → 644 lines (-180). python3 -m json.tool validates. 0 dangling state references.

Coordination: merge after alpha-engine-backtester PR lands on main so spots clone the new spot_backtest.sh (with evaluator stage + flags) before this SF update goes live. Then run deploy-infrastructure.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wires the alpha-engine-research-eval-rolling-mean Lambda (research PR #93) into the Saturday SF after the eval-judge converge point, and installs a single CloudWatch alarm that fires when ANY agent's rolling-4-week-mean drops below 3.0.

SF flow update: eval-judge branches now converge to EvalRollingMean instead of SaturdayHealthCheck:

    CheckBacktesterStatus (Success)
      → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence
          → EvalJudgeFirstSaturday ─┐
                                    ├→ EvalRollingMean → SaturdayHealthCheck → NotifyComplete
          → EvalJudgeWeekly ────────┘

EvalRollingMean is non-blocking: Catch States.ALL → SaturdayHealthCheck (never HandleFailure). Even on weeks where eval-judge had infra failures, the rolling-mean Lambda still runs against whatever 4 weeks of prior data ARE in CloudWatch; the trailing window is unaffected by the current week's hiccup.

Why we converge to rolling-mean rather than spawn a separate EventBridge rule (per session discussion): "Don't add redundant paths around load-bearing scheduled infra". The Saturday SF is the system's single authoritative weekly path. EventBridge would have been an implicit-timing-dependency parallel schedule ("fire 4 hours after the SF and hope eval-judge finished"). SF wiring makes the dependency explicit, runs only after the current week's raw metric was actually emitted, and gives a single SF execution trace covering the whole eval pipeline.

infrastructure/setup_eval_quality_alarm.sh
- One-shot idempotent script. Creates a single CloudWatch alarm "alpha-engine-eval-quality-regression" using a SEARCH metric expression to discover every (judged_agent_id, criterion, judge_model) combo at evaluation time, MIN-reduces them, and fires when the min drops below 3.0. SEARCH means new agents + criteria added later are auto-monitored without re-running the script.
- Reuses the existing alpha-engine-alerts SNS topic, so eval regressions land in the same operator inbox as pipeline failures.
- treat-missing-data=ignore so the alarm sits at INSUFFICIENT_DATA until 4 weeks of metric history accrue (no false pages on bootstrap).

IAM update: alpha-engine-research-eval-rolling-mean* added to the SF-role LambdaInvoke list in BOTH deploy_step_function.sh and deploy_step_function_daily.sh (shared-policy convention requires sync).

Tests: 427 → 433 (+6 EvalRollingMean assertions on top of the existing 21 LLM-judge wiring tests, which were updated to expect the new converge target). Coverage: Lambda alias + start-time payload + 300s timeout + non-blocking Catch + retry posture.

Deploy order:
1. From alpha-engine-research: ./infrastructure/deploy.sh eval_rolling_mean (creates the Lambda alias)
2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh (updates SF JSON + IAM)
3. From alpha-engine-data: ./infrastructure/setup_eval_quality_alarm.sh (installs the alarm; idempotent so safe to re-run)

First eligible alarm firing: 4 weeks after eval-judge starts emitting raw scores. Until then alarm stays in INSUFFICIENT_DATA.

Out of scope (PR 4d): Streamlit quality-trend dashboard page in alpha-engine-dashboard.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
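The alarm rule described in this commit message (rolling 4-week mean per combo, MIN across combos, threshold 3.0) can be modelled locally as below. This is a simplified sketch, not the Lambda or the CloudWatch expression itself: the function names and sample combos are hypothetical, and it collapses CloudWatch's missing-data handling into a plain INSUFFICIENT_DATA result when no combo has 4 weeks of history yet.

```python
from statistics import mean

WINDOW_WEEKS = 4   # rolling window length stated in the commit message
THRESHOLD = 3.0    # alarm fires when the minimum rolling mean drops below this


def rolling_mean(weekly_scores, window=WINDOW_WEEKS):
    """Mean of the trailing `window` scores, or None if history is too short."""
    if len(weekly_scores) < window:
        return None
    return mean(weekly_scores[-window:])


def alarm_state(series_by_combo, threshold=THRESHOLD):
    """MIN-reduce the rolling means over all combos and compare to the threshold."""
    means = [m for m in map(rolling_mean, series_by_combo.values()) if m is not None]
    if not means:
        return "INSUFFICIENT_DATA"
    return "ALARM" if min(means) < threshold else "OK"


# Keys follow the (judged_agent_id, criterion, judge_model) shape; values are made up.
history = {
    ("agent_a", "clarity", "judge_x"): [4.0, 4.0, 4.0, 4.0],
    ("agent_b", "clarity", "judge_x"): [4.0, 3.5, 2.0, 2.0],
}
print(alarm_state(history))
```

One weak combo is enough to trip the alarm, which matches the "ANY agent" wording: the second series averages 2.875 over its last four weeks.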

cipher813 merged commit 74efb1d into main Apr 24, 2026
1 check passed

cipher813 deleted the feat/consolidate-evaluator-into-spot branch April 24, 2026 21:24

cipher813 mentioned this pull request May 3, 2026

feat(sf): PR 4c rolling-mean SF wiring + CloudWatch alarm #140

Merged

4 tasks

cipher813 merged 1 commit into main from feat/consolidate-evaluator-into-spot
