feat(sf): retire dedicated Evaluator states, consolidate into Backtester #93
Merged
The Saturday Step Function's dedicated Evaluator subflow (8 states) was originally split from Backtester on 2026-04-12 "so eval can run at a different cadence." That cadence was never exercised: in production the evaluator only ever runs as the Sat SF tail after Backtester. The split added orchestration complexity (2x IAM/SSM/CloudWatch wiring, 2x venv drift surface, ~3-5 min extra wall-clock) for a capability that was never used.

Consolidation moves evaluator into the spot_backtest.sh Backtester step (coordinated alpha-engine-backtester PR adds it as 3rd default-on stage). Spot is the authoritative runner for backtest + parity + evaluator as a single atomic SSM command; SF orchestrates one step instead of seven.

Deleted states:
- CheckSkipEvaluator
- CheckEvaluatorFreeze
- EvaluatorFrozen
- Evaluator
- WaitForEvaluator
- CheckEvaluatorStatus
- EvaluatorWait
- ExtractEvaluatorError

Rewired transitions:
- CheckSkipBacktester skip path: CheckSkipEvaluator → SaturdayHealthCheck
- CheckBacktesterStatus success path: CheckSkipEvaluator → SaturdayHealthCheck

Input-contract changes (legacy inputs no longer honored):
- skip_evaluator: true → use skip_backtester (skips all three stages)
- freeze_evaluator: true → use --freeze-evaluator on spot CLI manually
- skip_backtester: true → now skips evaluator too

If off-cycle evaluator runs are ever needed, add an EventBridge rule that triggers spot_backtest.sh --skip-stages=backtest,parity (~15 min CF work), or an EventBridge → SSM path that runs evaluate.py on always-on EC2. Cadence flexibility is trigger-side, not pipeline-shape-side.

Stats: 824 → 644 lines (-180). python3 -m json.tool validates. 0 dangling state references.

Coordination: merge after alpha-engine-backtester PR lands on main so spots clone the new spot_backtest.sh (with evaluator stage + flags) before this SF update goes live. Then run deploy-infrastructure.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
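The "0 dangling state references" check mentioned in the commit message can be reproduced with a few lines of Python. The sketch below is an assumption about how such a check might look, not the script used for this PR: `dangling_refs` and the sample `sf` definition are hypothetical, and the walk only covers top-level states (nested Parallel branches or Map processors have their own scope and would need a recursive pass).

```python
from __future__ import annotations


def dangling_refs(definition: dict) -> set[str]:
    """Return state names that are referenced but never defined."""
    states = definition.get("States", {})
    referenced = {definition["StartAt"]} if "StartAt" in definition else set()
    for state in states.values():
        # Direct transitions (Next) and Choice fallbacks (Default).
        for key in ("Next", "Default"):
            if key in state:
                referenced.add(state[key])
        # Choice rules and Catch handlers each carry their own Next.
        for rule in state.get("Choices", []) + state.get("Catch", []):
            if "Next" in rule:
                referenced.add(rule["Next"])
    return referenced - set(states)


# Hypothetical, heavily trimmed definition: Backtester still points at a
# deleted state, which is the kind of leftover this check is meant to catch.
sf = {
    "StartAt": "CheckSkipBacktester",
    "States": {
        "CheckSkipBacktester": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.skip_backtester",
                    "BooleanEquals": True,
                    "Next": "SaturdayHealthCheck",
                }
            ],
            "Default": "Backtester",
        },
        "Backtester": {"Type": "Pass", "Next": "CheckSkipEvaluator"},
        "SaturdayHealthCheck": {"Type": "Succeed"},
    },
}

print(sorted(dangling_refs(sf)))  # ['CheckSkipEvaluator']
```

Running the same function over the deployed definition (loaded with `json.load`) should return an empty set once every transition has been rewired.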
cipher813 added a commit that referenced this pull request on May 3, 2026
Wires the alpha-engine-research-eval-rolling-mean Lambda (research PR #93) into the Saturday SF after the eval-judge converge point, and installs a single CloudWatch alarm that fires when ANY agent's rolling-4-week-mean drops below 3.0.

SF flow update: eval-judge branches now converge to EvalRollingMean instead of SaturdayHealthCheck:

    CheckBacktesterStatus (Success)
      → CheckSkipEvalJudge
      → ComputeEvalCadence
      → CheckMonthlyCadence
          → EvalJudgeFirstSaturday ─┐
                                    ├→ EvalRollingMean → SaturdayHealthCheck → NotifyComplete
          → EvalJudgeWeekly ────────┘

EvalRollingMean is non-blocking: Catch States.ALL → SaturdayHealthCheck (never HandleFailure). Even on weeks where eval-judge had infra failures, the rolling-mean Lambda still runs against whatever 4 weeks of prior data ARE in CloudWatch; the trailing window is unaffected by the current week's hiccup.

Why we converge to rolling-mean rather than spawn a separate EventBridge rule (per session discussion): "Don't add redundant paths around load-bearing scheduled infra". The Saturday SF is the system's single authoritative weekly path. EventBridge would have been an implicit-timing-dependency parallel schedule ("fire 4 hours after the SF and hope eval-judge finished"). SF wiring makes the dependency explicit, runs only after the current week's raw metric was actually emitted, and gives a single SF execution trace covering the whole eval pipeline.

infrastructure/setup_eval_quality_alarm.sh
- One-shot idempotent script. Creates a single CloudWatch alarm "alpha-engine-eval-quality-regression" using a SEARCH metric expression to discover every (judged_agent_id, criterion, judge_model) combo at evaluation time, MIN-reduces them, and fires when the min drops below 3.0. SEARCH means new agents + criteria added later are auto-monitored without re-running the script.
- Reuses the existing alpha-engine-alerts SNS topic, so eval regressions land in the same operator inbox as pipeline failures.
- treat-missing-data=ignore so the alarm sits at INSUFFICIENT_DATA until 4 weeks of metric history accrue (no false pages on bootstrap).

IAM update: alpha-engine-research-eval-rolling-mean* added to the SF-role LambdaInvoke list in BOTH deploy_step_function.sh and deploy_step_function_daily.sh (shared-policy convention requires sync).

Tests: 427 → 433 (+6 EvalRollingMean assertions on top of the existing 21 LLM-judge wiring tests that were updated to expect the new converge target). Covers Lambda alias + start-time payload + 300s timeout + non-blocking Catch + retry posture.

Deploy order:
1. From alpha-engine-research: ./infrastructure/deploy.sh eval_rolling_mean (creates the Lambda alias)
2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh (updates SF JSON + IAM)
3. From alpha-engine-data: ./infrastructure/setup_eval_quality_alarm.sh (installs the alarm; idempotent so safe to re-run)

First eligible alarm firing: 4 weeks after eval-judge starts emitting raw scores. Until then alarm stays in INSUFFICIENT_DATA.

Out of scope (PR 4d): Streamlit quality-trend dashboard page in alpha-engine-dashboard.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
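To make the "non-blocking Catch" wiring concrete, here is a minimal sketch of what the EvalRollingMean state and a matching wiring assertion could look like. It is illustrative only: the `:live` alias, the payload shape, the retry values, and both ResultPath names are assumptions, and `is_non_blocking` is a hypothetical helper rather than one of the six assertions added in this commit.

```python
# Sketch of the state as a Python dict (the deployed definition is JSON).
eval_rolling_mean = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
        # Alias name "live" is assumed for illustration.
        "FunctionName": "alpha-engine-research-eval-rolling-mean:live",
        # Assumed payload shape: pass the execution start time through.
        "Payload": {"start_time.$": "$$.Execution.StartTime"},
    },
    "TimeoutSeconds": 300,
    # Retry values below are placeholders, not the repo's retry posture.
    "Retry": [
        {
            "ErrorEquals": [
                "Lambda.ServiceException",
                "Lambda.TooManyRequestsException",
            ],
            "IntervalSeconds": 5,
            "MaxAttempts": 2,
            "BackoffRate": 2.0,
        }
    ],
    # Non-blocking: any error continues to the health check, never HandleFailure.
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.eval_rolling_mean_error",
            "Next": "SaturdayHealthCheck",
        }
    ],
    "ResultPath": "$.eval_rolling_mean",
    "Next": "SaturdayHealthCheck",
}


def is_non_blocking(state: dict, continue_to: str) -> bool:
    """True if the state catches States.ALL and every Catch continues to continue_to."""
    catches = state.get("Catch", [])
    catches_all = any("States.ALL" in c.get("ErrorEquals", []) for c in catches)
    all_continue = all(c.get("Next") == continue_to for c in catches)
    return catches_all and all_continue


print(is_non_blocking(eval_rolling_mean, "SaturdayHealthCheck"))  # True

# A variant that routes failures to HandleFailure would block the pipeline.
blocking = {
    **eval_rolling_mean,
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
}
print(is_non_blocking(blocking, "SaturdayHealthCheck"))  # False
```

Setting ResultPath on the Catch keeps the original execution input intact and records the error beside it, so downstream states still see the data they expect.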
Summary
Consolidates the 8-state dedicated Evaluator subflow into the Backtester step. Spot becomes the authoritative runner for backtest + parity + evaluator as a single atomic SSM command.

Deleted states (180 lines)
- CheckSkipEvaluator
- CheckEvaluatorFreeze
- EvaluatorFrozen
- Evaluator
- WaitForEvaluator
- CheckEvaluatorStatus
- EvaluatorWait
- ExtractEvaluatorError

Rewired transitions
- CheckSkipBacktester skip path: CheckSkipEvaluator → SaturdayHealthCheck
- CheckBacktesterStatus success path: CheckSkipEvaluator → SaturdayHealthCheck

Motivation
The Evaluator was originally split from Backtester on 2026-04-12 "so eval can run at a different cadence." In production it has only ever run as the Sat SF tail; the cadence flexibility was never exercised. The split cost 2× IAM/SSM/CloudWatch wiring, 2× venv drift surface, and ~3–5 min of extra wall-clock per run for a capability that wasn't used.

Cadence flexibility is a trigger-side question (add an EventBridge rule → SSM or → spot_backtest.sh --skip-stages=backtest,parity, ~15 min of CF work) rather than a pipeline-shape question. Consolidating doesn't remove that flexibility; it just decouples it from the weekly shape.

Breaking input-contract changes
- skip_evaluator: true is no longer honored (use skip_backtester: true to skip all three stages)
- freeze_evaluator: true is no longer honored (use --freeze-evaluator on spot_backtest.sh manually)
- skip_backtester: true now skips evaluator too (previously ran evaluator on old artifacts via CheckSkipEvaluator side-path)

Coordination
Coordinated with alpha-engine-backtester PR #74: spot_backtest.sh gains --skip-stages + --freeze-evaluator + evaluator as 3rd default-on stage.

Deploy sequencing (order matters):
- Merge alpha-engine-backtester PR #74 to main first, so spots clone the new spot_backtest.sh
- Then merge this PR and run bash infrastructure/deploy-infrastructure.sh from always-on EC2 or laptop

If reversed: next Sat SF hits old spot_backtest.sh (no evaluator stage) + new SF (no dedicated step) → no evaluator run that week.

Validation
- python3 -m json.tool: valid JSON

🤖 Generated with Claude Code
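Because the legacy inputs listed under the breaking input-contract changes are now silently ignored, a pre-flight check on the execution input can help catch stale callers. The following is a minimal sketch under stated assumptions: `LEGACY_INPUTS` and `legacy_input_warnings` are hypothetical names, and nothing in this PR adds such a check to the Step Function itself.

```python
from __future__ import annotations

# Legacy keys and the replacement described in this PR.
LEGACY_INPUTS = {
    "skip_evaluator": "no longer honored; set skip_backtester: true to skip all three stages",
    "freeze_evaluator": "no longer honored; pass --freeze-evaluator to spot_backtest.sh manually",
}


def legacy_input_warnings(execution_input: dict) -> list[str]:
    """Return one warning per truthy input whose meaning changed in this PR."""
    warnings = [
        f"{key}: {hint}"
        for key, hint in LEGACY_INPUTS.items()
        if execution_input.get(key)
    ]
    # Still honored, but with wider effect than before.
    if execution_input.get("skip_backtester"):
        warnings.append("skip_backtester: now skips the evaluator too")
    return warnings


for line in legacy_input_warnings({"skip_evaluator": True, "skip_backtester": True}):
    print(line)
```

A caller could run this against the JSON it is about to pass to the state machine and fail fast instead of discovering a missing evaluator run the following week.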