feat(sf): retire dedicated Evaluator states, consolidate into Backtester #93

Merged
cipher813 merged 1 commit into main from feat/consolidate-evaluator-into-spot on Apr 24, 2026

@cipher813 (Owner)

Summary

Consolidates the 8-state dedicated Evaluator subflow into the Backtester step. Spot becomes the authoritative runner for backtest + parity + evaluator as a single atomic SSM command.

Deleted states (180 lines)

  • CheckSkipEvaluator
  • CheckEvaluatorFreeze
  • EvaluatorFrozen
  • Evaluator
  • WaitForEvaluator
  • CheckEvaluatorStatus
  • EvaluatorWait
  • ExtractEvaluatorError

Rewired transitions

  • CheckSkipBacktester skip path: CheckSkipEvaluator → SaturdayHealthCheck
  • CheckBacktesterStatus success path: CheckSkipEvaluator → SaturdayHealthCheck
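
A minimal sketch of the rewiring above, assuming the state machine is a flat Amazon States Language definition loaded as a Python dict. The helper name `retire_states` and the toy definition are illustrative only and are not taken from this repo; the real definition has many more states.

```python
import copy

def retire_states(definition, retired, replacement):
    """Return a copy of an ASL definition with the `retired` states removed
    and every Next/Default that pointed at one of them sent to `replacement`."""
    result = copy.deepcopy(definition)
    states = result["States"]
    for name in retired:
        states.pop(name, None)

    def repoint(node):
        # Walk nested dicts/lists so Choice rules and Catch blocks are covered.
        if isinstance(node, dict):
            for key, value in node.items():
                if key in ("Next", "Default") and value in retired:
                    node[key] = replacement
                else:
                    repoint(value)
        elif isinstance(node, list):
            for item in node:
                repoint(item)

    repoint(states)
    return result

# Toy definition (hypothetical shapes; only the state names come from this PR).
toy = {
    "StartAt": "CheckSkipBacktester",
    "States": {
        "CheckSkipBacktester": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.skip_backtester",
                         "BooleanEquals": True,
                         "Next": "CheckSkipEvaluator"}],
            "Default": "CheckBacktesterStatus",
        },
        "CheckBacktesterStatus": {"Type": "Pass", "Next": "CheckSkipEvaluator"},
        "CheckSkipEvaluator": {"Type": "Pass", "Next": "Evaluator"},
        "Evaluator": {"Type": "Pass", "Next": "SaturdayHealthCheck"},
        "SaturdayHealthCheck": {"Type": "Succeed"},
    },
}

rewired = retire_states(toy, {"CheckSkipEvaluator", "Evaluator"}, "SaturdayHealthCheck")
print(sorted(rewired["States"]))
```

Both inbound paths now land on SaturdayHealthCheck, and the input definition is left untouched because the helper works on a deep copy.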

Motivation

The Evaluator was originally split from Backtester on 2026-04-12 "so eval can run at a different cadence." In production it has only ever run as the Sat SF tail — the cadence flexibility was never exercised. The split cost 2× IAM/SSM/CloudWatch wiring, 2× venv drift surface, and ~3–5 min of extra wall-clock per run for a capability that wasn't used.

Cadence flexibility is a trigger-side question (add an EventBridge rule → SSM or → spot_backtest.sh --skip-stages=backtest,parity, ~15 min of CF work) rather than a pipeline-shape question. Consolidating doesn't remove that flexibility — it just decouples it from the weekly shape.

Breaking input-contract changes

  • skip_evaluator: true — no longer honored (use skip_backtester: true to skip all three stages)
  • freeze_evaluator: true — no longer honored (use --freeze-evaluator on spot_backtest.sh manually)
  • skip_backtester: true — now skips evaluator too (previously ran evaluator on old artifacts via CheckSkipEvaluator side-path)
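
Because the legacy keys are silently ignored after this change, a caller-side pre-flight check may help catch stale execution inputs. The sketch below is an assumption, not something this PR ships: `legacy_input_warnings` and its message text are hypothetical, and only the key names and replacement advice come from the list above.

```python
# Legacy execution-input keys and the replacement named in this PR.
LEGACY_INPUTS = {
    "skip_evaluator": "use skip_backtester: true (skips backtest, parity, and evaluator)",
    "freeze_evaluator": "run spot_backtest.sh with --freeze-evaluator manually",
}

def legacy_input_warnings(execution_input):
    """Return one warning per legacy key that is set to true in the input."""
    warnings = []
    for key, advice in LEGACY_INPUTS.items():
        if execution_input.get(key) is True:
            warnings.append(f"{key} is no longer honored; {advice}")
    return warnings

for line in legacy_input_warnings({"skip_evaluator": True, "skip_backtester": False}):
    print(line)
```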

Coordination

Coordinated with alpha-engine-backtester PR #74: spot_backtest.sh gains --skip-stages + --freeze-evaluator + evaluator as 3rd default-on stage.

Deploy sequencing (order matters):

  1. Merge alpha-engine-backtester PR #74
  2. Merge this PR
  3. Run bash infrastructure/deploy-infrastructure.sh from always-on EC2 or laptop

If reversed: next Sat SF hits old spot_backtest.sh (no evaluator stage) + new SF (no dedicated step) → no evaluator run that week.

Validation

  • python3 -m json.tool — valid JSON
  • 0 dangling references to deleted states (verified via grep)
  • Line count: 824 → 644 (-180)
  • deploy-infrastructure.sh dry-run
  • First Sat SF run with consolidated shape
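
The dangling-reference check was done with grep; a structural version could look like the sketch below. It assumes a flat definition (no nested Parallel or Map branches, which have their own state scope), and the function name `dangling_targets` is hypothetical.

```python
def dangling_targets(definition):
    """List transition targets (StartAt, Next, Default) that name no state."""
    states = definition["States"]
    targets = set()
    if "StartAt" in definition:
        targets.add(definition["StartAt"])

    def collect(node):
        # Recurse so targets inside Choices and Catch entries are included.
        if isinstance(node, dict):
            for key, value in node.items():
                if key in ("Next", "Default") and isinstance(value, str):
                    targets.add(value)
                else:
                    collect(value)
        elif isinstance(node, list):
            for item in node:
                collect(item)

    collect(states)
    return sorted(targets - set(states))

# Toy definition that still points at a deleted state.
broken = {
    "StartAt": "CheckBacktesterStatus",
    "States": {
        "CheckBacktesterStatus": {"Type": "Pass", "Next": "CheckSkipEvaluator"},
        "SaturdayHealthCheck": {"Type": "Succeed"},
    },
}
print(dangling_targets(broken))  # ['CheckSkipEvaluator']
```

In practice the definition would come from `json.load` on the state machine file, and an empty list corresponds to the "0 dangling references" result reported above.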

🤖 Generated with Claude Code

The Saturday Step Function's dedicated Evaluator subflow (8 states) was
originally split from Backtester on 2026-04-12 "so eval can run at a
different cadence." That cadence was never exercised — in production the
evaluator only ever runs as the Sat SF tail after Backtester. The split
added orchestration complexity (2x IAM/SSM/CloudWatch wiring, 2x venv
drift surface, ~3-5 min extra wall-clock) for a capability that was
never used.

Consolidation moves evaluator into the spot_backtest.sh Backtester step
(coordinated alpha-engine-backtester PR adds it as 3rd default-on stage).
Spot is the authoritative runner for backtest + parity + evaluator as a
single atomic SSM command; SF orchestrates one step instead of seven.

Deleted states:
- CheckSkipEvaluator
- CheckEvaluatorFreeze
- EvaluatorFrozen
- Evaluator
- WaitForEvaluator
- CheckEvaluatorStatus
- EvaluatorWait
- ExtractEvaluatorError

Rewired transitions:
- CheckSkipBacktester skip path: CheckSkipEvaluator → SaturdayHealthCheck
- CheckBacktesterStatus success path: CheckSkipEvaluator → SaturdayHealthCheck

Input-contract changes (legacy inputs no longer honored):
- skip_evaluator: true → use skip_backtester (skips all three stages)
- freeze_evaluator: true → use --freeze-evaluator on spot CLI manually
- skip_backtester: true → now skips evaluator too

If off-cycle evaluator runs are ever needed, add an EventBridge rule that
triggers spot_backtest.sh --skip-stages=backtest,parity (~15 min CF work),
or an EventBridge → SSM path that runs evaluate.py on always-on EC2.
Cadence flexibility is trigger-side, not pipeline-shape-side.

Stats: 824 → 644 lines (-180). python3 -m json.tool validates. 0 dangling
state references.

Coordination: merge after alpha-engine-backtester PR lands on main so
spots clone the new spot_backtest.sh (with evaluator stage + flags)
before this SF update goes live. Then run deploy-infrastructure.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 merged commit 74efb1d into main on Apr 24, 2026
1 check passed
cipher813 deleted the feat/consolidate-evaluator-into-spot branch on April 24, 2026 21:24
cipher813 added a commit that referenced this pull request on May 3, 2026
Wires the alpha-engine-research-eval-rolling-mean Lambda (research
PR #93) into the Saturday SF after the eval-judge converge point,
and installs a single CloudWatch alarm that fires when ANY agent's
rolling-4-week-mean drops below 3.0.

SF flow update — eval-judge branches now converge to EvalRollingMean
instead of SaturdayHealthCheck:

  CheckBacktesterStatus (Success)
    → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence
        → EvalJudgeFirstSaturday ─┐
                                  ├→ EvalRollingMean → SaturdayHealthCheck → NotifyComplete
        → EvalJudgeWeekly ────────┘

EvalRollingMean is non-blocking: Catch States.ALL → SaturdayHealthCheck
(never HandleFailure). Even on weeks where eval-judge had infra
failures, the rolling-mean Lambda still runs against whatever 4
weeks of prior data ARE in CloudWatch — the trailing window is
unaffected by the current week's hiccup.
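
A rough sketch of what a non-blocking EvalRollingMean state could look like, written as a Python dict so the Catch behavior can be asserted. The alias suffix `:live`, the payload field `start_time`, and the `ResultPath` name are assumptions for illustration; the 300s timeout and the Catch target come from this commit message. Retry settings are omitted because they are not specified here.

```python
import json

eval_rolling_mean = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
        # Alias suffix is assumed; the function name prefix is from the commit.
        "FunctionName": "alpha-engine-research-eval-rolling-mean:live",
        # Payload field name is assumed; the value is the execution start time.
        "Payload": {"start_time.$": "$$.Execution.StartTime"},
    },
    "TimeoutSeconds": 300,
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.eval_rolling_mean_error",  # assumed name
            "Next": "SaturdayHealthCheck",
        }
    ],
    "Next": "SaturdayHealthCheck",
}

def is_non_blocking(state, continue_to):
    """True if success and every caught error both continue to `continue_to`,
    and at least one catcher matches States.ALL."""
    catchers = state.get("Catch", [])
    catches_all = any("States.ALL" in c.get("ErrorEquals", []) for c in catchers)
    all_continue = all(c.get("Next") == continue_to for c in catchers)
    return catches_all and all_continue and state.get("Next") == continue_to

print(json.dumps(eval_rolling_mean, indent=2))
print(is_non_blocking(eval_rolling_mean, "SaturdayHealthCheck"))
```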

Why we converge to rolling-mean rather than spawn a separate
EventBridge rule (per session discussion): "Don't add redundant paths
around load-bearing scheduled infra" — the Saturday SF is the system's
single authoritative weekly path. EventBridge would have been an
implicit-timing-dependency parallel schedule ("fire 4 hours after the
SF and hope eval-judge finished"). SF wiring makes the dependency
explicit, runs only after the current week's raw metric was actually
emitted, and gives a single SF execution trace covering the whole
eval pipeline.

infrastructure/setup_eval_quality_alarm.sh
  - One-shot idempotent script. Creates a single CloudWatch alarm
    "alpha-engine-eval-quality-regression" using a SEARCH metric
    expression to discover every (judged_agent_id, criterion,
    judge_model) combo at evaluation time, MIN-reduces them, and
    fires when the min drops below 3.0. SEARCH means new agents +
    criteria added later are auto-monitored without re-running the
    script.
  - Reuses the existing alpha-engine-alerts SNS topic — eval
    regressions land in the same operator inbox as pipeline failures.
  - treat-missing-data=ignore so the alarm sits at INSUFFICIENT_DATA
    until 4 weeks of metric history accrue (no false pages on
    bootstrap).

IAM update: alpha-engine-research-eval-rolling-mean* added to the
SF-role LambdaInvoke list in BOTH deploy_step_function.sh and
deploy_step_function_daily.sh (shared-policy convention requires
sync).

Tests 427 → 433 (+6 EvalRollingMean assertions on top of the existing
21 LLM-judge wiring tests that were updated to expect the new
converge target — Lambda alias + start-time payload + 300s timeout +
non-blocking Catch + retry posture).

Deploy order:
  1. From alpha-engine-research: ./infrastructure/deploy.sh eval_rolling_mean
     (creates the Lambda alias)
  2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh
     (updates SF JSON + IAM)
  3. From alpha-engine-data: ./infrastructure/setup_eval_quality_alarm.sh
     (installs the alarm; idempotent so safe to re-run)

First eligible alarm firing: 4 weeks after eval-judge starts emitting
raw scores. Until then alarm stays in INSUFFICIENT_DATA.

Out of scope (PR 4d): Streamlit quality-trend dashboard page in
alpha-engine-dashboard.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>