feat(sf): PR 4c rolling-mean SF wiring + CloudWatch alarm by cipher813 · Pull Request #140 · cipher813/alpha-engine-data

cipher813 · 2026-05-03T14:32:07Z

Summary

Wires the rolling-mean Lambda (research PR #93) into the Saturday SF after the eval-judge converge point, and installs a single CloudWatch alarm that fires when ANY agent's rolling-4-week-mean drops below 3.0.

SF flow update

Both eval-judge branches now converge to EvalRollingMean (instead of SaturdayHealthCheck directly):

CheckBacktesterStatus (Success)
  → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence
      → EvalJudgeFirstSaturday ─┐
                                ├→ EvalRollingMean → SaturdayHealthCheck → NotifyComplete
      → EvalJudgeWeekly ────────┘

EvalRollingMean is non-blocking: Catch States.ALL → SaturdayHealthCheck. Even when eval-judge has infra failures, rolling-mean still runs — the trailing 4-week window is unaffected by the current week's hiccup.
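The non-blocking behaviour can be pictured as a Task state whose success path and Catch path both land on SaturdayHealthCheck. The fragment below is only a sketch: the state names, the 300s timeout, and the Catch target come from this PR, while the `:live` alias suffix, the `start_time` payload key, and the helper `is_non_blocking` are assumptions made for illustration and may not match the real SF JSON.

```python
# Hypothetical ASL fragment for the EvalRollingMean state, written as a Python dict.
# Assumed for illustration: the ":live" alias suffix and the "start_time" payload key.
EVAL_ROLLING_MEAN = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
        "FunctionName": "alpha-engine-research-eval-rolling-mean:live",
        "Payload": {"start_time.$": "$$.Execution.StartTime"},
    },
    "TimeoutSeconds": 300,
    # Any error falls through to the health check instead of failing the run.
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "SaturdayHealthCheck"}],
    "Next": "SaturdayHealthCheck",
}


def is_non_blocking(state: dict, fallthrough: str) -> bool:
    """True when both the success path and every Catch route reach `fallthrough`."""
    catch_targets = {rule["Next"] for rule in state.get("Catch", [])}
    return state.get("Next") == fallthrough and catch_targets == {fallthrough}
```

A wiring test in the style of the existing suite could then assert `is_non_blocking(EVAL_ROLLING_MEAN, "SaturdayHealthCheck")`.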

Why SF wiring instead of EventBridge

Per session discussion and the codebase convention "Don't add redundant paths around load-bearing scheduled infra": the Saturday SF is the single authoritative weekly path. A separate EventBridge rule would have been a parallel schedule with an implicit timing dependency ("fire 4 hours after the SF and hope eval-judge finished"). SF wiring makes the dependency explicit, runs only after the current week's raw metric was actually emitted, and gives one SF execution trace covering the full eval pipeline.

Alarm design

infrastructure/setup_eval_quality_alarm.sh is idempotent. Creates one alarm alpha-engine-eval-quality-regression using a SEARCH metric expression that discovers every (judged_agent_id, criterion, judge_model) combo at evaluation time and MIN-reduces them — so new agents/criteria added later are auto-monitored without re-running the script. Reuses the existing alpha-engine-alerts SNS topic. treat-missing-data=ignore keeps the alarm in INSUFFICIENT_DATA until 4 weeks of metric history accrue (no false pages on bootstrap).
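The alarm semantics described above (MIN-reduce across every discovered combo, compare against 3.0, ignore missing data) can be mimicked in plain Python. This is a sketch of the decision logic only, not the CloudWatch expression or the script itself; the function name `next_alarm_state` and the sample agent, criterion, and model names are hypothetical.

```python
from typing import Mapping, Tuple

# (judged_agent_id, criterion, judge_model), as named in the alarm design.
Combo = Tuple[str, str, str]
THRESHOLD = 3.0


def next_alarm_state(
    current: str, means: Mapping[Combo, float], threshold: float = THRESHOLD
) -> str:
    """Mimic the alarm: MIN across all combos, breach when the min is below threshold."""
    if not means:
        # Missing data is ignored, so the alarm keeps whatever state it already had.
        return current
    return "ALARM" if min(means.values()) < threshold else "OK"


# Hypothetical datapoints: one weak combo is enough to trip the alarm.
sample = {
    ("agent_a", "clarity", "judge_x"): 3.4,
    ("agent_b", "clarity", "judge_x"): 2.9,
}
state = next_alarm_state("OK", sample)
```

Because the reduction runs over whatever combos are present, adding a new agent or criterion only adds a key to `means`; no alarm definition has to change, which is the property the SEARCH expression is meant to provide.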

Test plan

python -m pytest tests/ -q → 433 passed (was 427).
6 new EvalRollingMean assertions: alias + start-time payload + 300s timeout + non-blocking Catch + retry posture.
Existing 21 LLM-judge wiring tests updated to expect the new converge target.
bash -n setup_eval_quality_alarm.sh syntax-clean.

Deploy order

1. From alpha-engine-research: ./infrastructure/deploy.sh eval_rolling_mean (creates the Lambda alias)
2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh (updates SF JSON + IAM)
3. From alpha-engine-data: ./infrastructure/setup_eval_quality_alarm.sh (installs the alarm)

Out of scope (PR 4d)

Streamlit quality-trend dashboard page in alpha-engine-dashboard (per-agent line charts × dimensions; prompt-version → quality-score correlation chart per ROADMAP §1633).

🤖 Generated with Claude Code

Wires the alpha-engine-research-eval-rolling-mean Lambda (research PR #93) into the Saturday SF after the eval-judge converge point, and installs a single CloudWatch alarm that fires when ANY agent's rolling-4-week-mean drops below 3.0.

SF flow update: eval-judge branches now converge to EvalRollingMean instead of SaturdayHealthCheck:

CheckBacktesterStatus (Success)
  → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence
      → EvalJudgeFirstSaturday ─┐
                                ├→ EvalRollingMean → SaturdayHealthCheck → NotifyComplete
      → EvalJudgeWeekly ────────┘

EvalRollingMean is non-blocking: Catch States.ALL → SaturdayHealthCheck (never HandleFailure). Even on weeks where eval-judge had infra failures, the rolling-mean Lambda still runs against whatever 4 weeks of prior data ARE in CloudWatch; the trailing window is unaffected by the current week's hiccup.

Why we converge to rolling-mean rather than spawn a separate EventBridge rule (per session discussion): "Don't add redundant paths around load-bearing scheduled infra". The Saturday SF is the system's single authoritative weekly path. EventBridge would have been an implicit-timing-dependency parallel schedule ("fire 4 hours after the SF and hope eval-judge finished"). SF wiring makes the dependency explicit, runs only after the current week's raw metric was actually emitted, and gives a single SF execution trace covering the whole eval pipeline.

infrastructure/setup_eval_quality_alarm.sh
- One-shot idempotent script. Creates a single CloudWatch alarm "alpha-engine-eval-quality-regression" using a SEARCH metric expression to discover every (judged_agent_id, criterion, judge_model) combo at evaluation time, MIN-reduces them, and fires when the min drops below 3.0. SEARCH means new agents + criteria added later are auto-monitored without re-running the script.
- Reuses the existing alpha-engine-alerts SNS topic, so eval regressions land in the same operator inbox as pipeline failures.
- treat-missing-data=ignore so the alarm sits at INSUFFICIENT_DATA until 4 weeks of metric history accrue (no false pages on bootstrap).

IAM update: alpha-engine-research-eval-rolling-mean* added to the SF-role LambdaInvoke list in BOTH deploy_step_function.sh and deploy_step_function_daily.sh (shared-policy convention requires sync).

Tests 427 → 433: +6 EvalRollingMean assertions (Lambda alias + start-time payload + 300s timeout + non-blocking Catch + retry posture) on top of the existing 21 LLM-judge wiring tests, which were updated to expect the new converge target.

Deploy order:
1. From alpha-engine-research: ./infrastructure/deploy.sh eval_rolling_mean (creates the Lambda alias)
2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh (updates SF JSON + IAM)
3. From alpha-engine-data: ./infrastructure/setup_eval_quality_alarm.sh (installs the alarm; idempotent so safe to re-run)

First eligible alarm firing: 4 weeks after eval-judge starts emitting raw scores. Until then the alarm stays in INSUFFICIENT_DATA.

Out of scope (PR 4d): Streamlit quality-trend dashboard page in alpha-engine-dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Caught 2026-05-03 in SF eval-pipeline-validation-5: Research succeeded and wrote new-format captures to S3, but the eval-judge state silently never fired because the operator had passed skip_backtester=true to skip the long-running backtester for validation purposes.

PR 4c (#140) wired the eval-pipeline states between Backtester success and SaturdayHealthCheck:

CheckBacktesterStatus.Success → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence
  → EvalJudgeFirstSaturday or EvalJudgeWeekly → EvalRollingMean → SaturdayHealthCheck

But CheckSkipBacktester.skip routed directly to SaturdayHealthCheck, bypassing the eval-pipeline entirely. Production Sat 5/9 won't hit this (skip_backtester defaults false; Backtester runs and routes through eval-judge correctly), but operator manual skips for any non-eval validation purpose silently dropped the eval state.

Fix: route skip_backtester=true → CheckSkipEvalJudge instead of SaturdayHealthCheck. Eval pipeline now fires on every SF execution where the operator hasn't explicitly skip_eval_judge'd it.

tests/test_sf_eval_judge_wiring.py, TestSkipBacktesterPreservesEvalJudge: pins the routing so a future "simplification" can't re-introduce the silent bypass. Tests 433 → 434 (+1 wiring assertion).

Pairs with alpha-engine-research PR #104 (RubricEvalLLMOutput defense + judge max_tokens to strategic tier; closes the 5/32 remaining failure class observed in this same SF run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
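The rerouting can be illustrated with a toy Choice state and a tiny resolver. This is a sketch under assumptions: the state names CheckSkipBacktester and CheckSkipEvalJudge and the skip_backtester flag come from the commit message, while the exact Choice rule shape, the "Backtester" default target, and the helper `resolve_choice` are hypothetical and may differ from the real SF JSON.

```python
# Hypothetical shape of the CheckSkipBacktester Choice state after the fix.
# Assumed for illustration: the rule layout and the "Backtester" default target.
CHECK_SKIP_BACKTESTER = {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.skip_backtester",
            "BooleanEquals": True,
            # Before the fix this pointed at SaturdayHealthCheck.
            "Next": "CheckSkipEvalJudge",
        }
    ],
    "Default": "Backtester",
}


def resolve_choice(state: dict, execution_input: dict) -> str:
    """Return the next state for simple top-level BooleanEquals rules."""
    for rule in state["Choices"]:
        key = rule["Variable"].removeprefix("$.")
        if execution_input.get(key) is rule["BooleanEquals"]:
            return rule["Next"]
    return state["Default"]
```

A pinning test in the spirit of TestSkipBacktesterPreservesEvalJudge would assert that `resolve_choice(CHECK_SKIP_BACKTESTER, {"skip_backtester": True})` is `"CheckSkipEvalJudge"` and never `"SaturdayHealthCheck"`.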

* fix(sf): skip_backtester preserves eval-judge skip-gate path

* chore: drop dead ALPHA_ENGINE_LIB_TOKEN PAT plumbing

alpha-engine-lib was flipped public 2026-05-03; PAT auth machinery that existed to install from a private repo is now dead weight. Removed across 6 files (net −87 lines).

CI:
- .github/workflows/ci.yml: drop "Configure git auth" step
- .github/workflows/deploy.yml: drop the secondary actions/checkout for cipher813/alpha-engine-lib + the LIB_REPO_DIR env on the deploy step

Docker / deploy:
- Dockerfile: replace `COPY vendor/alpha-engine-lib` + local pip install with `pip install "alpha-engine-lib[flow_doctor] @ git+https://github.com/cipher813/alpha-engine-lib@v0.3.0"`. The [flow_doctor]-only install for Lambda is preserved (Lambda doesn't need [arcticdb] or [rag]); requirements.txt's [arcticdb,flow_doctor,rag] extras still apply for the EC2 install path.
- infrastructure/deploy.sh: drop the vendor/alpha-engine-lib staging block + cleanup_lib_staging trap. Replace with one-line comment explaining lib comes from public git+https now.

EC2 spot scripts:
- infrastructure/spot_data_weekly.sh: drop SSM PAT fetch + insteadOf rewrite from the DEPS step. Update inline comments referencing the old mechanism (3 spots).
- infrastructure/spot_drift_detection.sh: same removal.

Companion follow-ups (not in this PR):
- Delete ALPHA_ENGINE_LIB_TOKEN GitHub Actions secret on this repo
- Delete /alpha-engine/lib-token SSM SecureString (us-east-1)
- vendor/alpha-engine-lib local checkout can be removed (gitignored, not in any commit)

Per ROADMAP follow-up "P3 Drop ALPHA_ENGINE_LIB_TOKEN PAT plumbing" added 2026-05-03. Second of 6 consumer-repo PRs in this cleanup arc; prototype landed in alpha-engine PR #128.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Companion to alpha-engine-backtester #140 (counterfactual Lambda implementation). Closes the agent-justification triple end-to-end: all three signals now fire from the Saturday SF on the same trailing-8-week corpus.

SF chain after this PR:

... → EvalJudge{Weekly,FirstSaturday} → EvalRollingMean
    → CheckSkipRationaleClustering → RationaleClustering
    → CheckSkipReplayConcordance → ReplayConcordance
    → CheckSkipCounterfactual → Counterfactual
    → SaturdayHealthCheck → ...

Three independent skip-gates, each landing on the next signal's gate rather than the health check (so skipping one signal doesn't bundle-skip the others):
- {"skip_rationale_clustering": true} → CheckSkipReplayConcordance
- {"skip_replay_concordance": true} → CheckSkipCounterfactual
- {"skip_counterfactual": true} → SaturdayHealthCheck

Default Counterfactual payload pins production cadence:
- end_time_iso = SF execution start time
- window_days = 56 (8 weeks)
- max_depth = 3 ("3-deep rule")

No target_models payload: counterfactual doesn't replay against a target model; sklearn fits a tree on actual (input → decision) pairs.

IAM updates:
- github-actions-lambda-deploy.json: alpha-engine-replay-counterfactual added to LambdaUpdate + LambdaInvokeCanary lists. Asymmetric-IAM-grant antipattern compliance (5th Lambda of this shape); the durable CreateFunction grant from data #165 already covers create + update.
- deploy_step_function.sh: SF role inline LambdaInvoke list updated with the new function ARN so SF can invoke it.

Tests:
- TestStatesPresent: CheckSkipCounterfactual + Counterfactual added to required-states pin.
- TestSkipReplayConcordance: rerouted assertion (now lands at CheckSkipCounterfactual, not SaturdayHealthCheck).
- TestReplayConcordance: success + Catch reroutes (same).
- TestSkipCounterfactual: skip_counterfactual flag → SaturdayHealthCheck.
- TestCounterfactual: live alias, payload required fields (end_time_iso, window_days=56, max_depth=3), 600s timeout matches Lambda cap, success + Catch routes, retry posture.

Suite 459 → 467.

Composes with the agent-justification dashboard surface. The three agent-justification metrics emit under the same AlphaEngine/Eval namespace as the two eval-judge metrics:
- agent_quality_score (eval-judge)
- agent_quality_score_4w_mean (eval-rolling-mean)
- agent_rationale_template_concentration (clustering)
- agent_cheap_model_concordance (concordance)
- agent_counterfactual_rule_fit (this one)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
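The "each gate lands on the next gate" design can be checked with a small path simulator. This is an illustrative sketch, not the SF definition: the flag and state names are taken from the commit message, while `GATES` and `executed_path` are hypothetical helpers, and the simulation assumes every task succeeds (Catch routes are not modelled).

```python
# (skip flag, task state) pairs in chain order, using names from the commit message.
GATES = [
    ("skip_rationale_clustering", "RationaleClustering"),
    ("skip_replay_concordance", "ReplayConcordance"),
    ("skip_counterfactual", "Counterfactual"),
]


def executed_path(execution_input: dict) -> list[str]:
    """List the states visited after EvalRollingMean for a given set of skip flags."""
    path = []
    for flag, task in GATES:
        path.append(f"CheckSkip{task}")  # every gate is always evaluated
        if not execution_input.get(flag, False):
            path.append(task)  # the task runs unless its own flag is set
    path.append("SaturdayHealthCheck")
    return path


# Skipping concordance still runs clustering and counterfactual.
only_concordance_skipped = executed_path({"skip_replay_concordance": True})
```

Had each gate routed straight to SaturdayHealthCheck on skip, the path for `{"skip_rationale_clustering": True}` would have dropped all three signals; routing to the next gate keeps the flags independent.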

cipher813 merged commit e1171b5 into main May 3, 2026
1 check passed

cipher813 deleted the feat/llm-judge-rolling-mean-sf-pr4c branch May 3, 2026 14:33

cipher813 mentioned this pull request May 6, 2026

feat(sf): wire counterfactual rule fit Lambda into Saturday SF #168

Merged

4 tasks

cipher813 merged 1 commit into main from feat/llm-judge-rolling-mean-sf-pr4c
