feat(sf): PR 4c rolling-mean SF wiring + CloudWatch alarm #140
Merged
Wires the alpha-engine-research-eval-rolling-mean Lambda (research PR #93) into the Saturday SF after the eval-judge converge point, and installs a single CloudWatch alarm that fires when ANY agent's rolling-4-week-mean drops below 3.0.

SF flow update: eval-judge branches now converge to EvalRollingMean instead of SaturdayHealthCheck:

  CheckBacktesterStatus (Success) → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence
    → EvalJudgeFirstSaturday ─┐
                              ├→ EvalRollingMean → SaturdayHealthCheck → NotifyComplete
    → EvalJudgeWeekly ────────┘

EvalRollingMean is non-blocking: Catch States.ALL → SaturdayHealthCheck (never HandleFailure). Even on weeks where eval-judge had infra failures, the rolling-mean Lambda still runs against whatever 4 weeks of prior data ARE in CloudWatch; the trailing window is unaffected by the current week's hiccup.

Why we converge to rolling-mean rather than spawn a separate EventBridge rule (per session discussion): "Don't add redundant paths around load-bearing scheduled infra". The Saturday SF is the system's single authoritative weekly path. EventBridge would have been an implicit-timing-dependency parallel schedule ("fire 4 hours after the SF and hope eval-judge finished"). SF wiring makes the dependency explicit, runs only after the current week's raw metric was actually emitted, and gives a single SF execution trace covering the whole eval pipeline.

infrastructure/setup_eval_quality_alarm.sh
- One-shot idempotent script. Creates a single CloudWatch alarm "alpha-engine-eval-quality-regression" using a SEARCH metric expression to discover every (judged_agent_id, criterion, judge_model) combo at evaluation time, MIN-reduces them, and fires when the min drops below 3.0. SEARCH means new agents + criteria added later are auto-monitored without re-running the script.
- Reuses the existing alpha-engine-alerts SNS topic, so eval regressions land in the same operator inbox as pipeline failures.
- treat-missing-data=ignore so the alarm sits at INSUFFICIENT_DATA until 4 weeks of metric history accrue (no false pages on bootstrap).

IAM update: alpha-engine-research-eval-rolling-mean* added to the SF-role LambdaInvoke list in BOTH deploy_step_function.sh and deploy_step_function_daily.sh (shared-policy convention requires sync).

Tests 427 → 433 (+6 EvalRollingMean assertions on top of the existing 21 LLM-judge wiring tests that were updated to expect the new converge target: Lambda alias + start-time payload + 300s timeout + non-blocking Catch + retry posture).

Deploy order:
1. From alpha-engine-research: ./infrastructure/deploy.sh eval_rolling_mean (creates the Lambda alias)
2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh (updates SF JSON + IAM)
3. From alpha-engine-data: ./infrastructure/setup_eval_quality_alarm.sh (installs the alarm; idempotent so safe to re-run)

First eligible alarm firing: 4 weeks after eval-judge starts emitting raw scores. Until then alarm stays in INSUFFICIENT_DATA.

Out of scope (PR 4d): Streamlit quality-trend dashboard page in alpha-engine-dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
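The alarm behaviour described in this commit message (MIN-reduce across every discovered series, compare against 3.0, and leave the state alone when there is no data) can be modeled in plain Python. This is only a simplified sketch of the decision rule, not CloudWatch's evaluation logic and not the contents of setup_eval_quality_alarm.sh; the function name, the dictionary shape, and the agent/criterion/judge values are hypothetical.

```python
from typing import Dict, Optional, Tuple

# From the commit message: the alarm fires when the minimum rolling mean drops below 3.0.
THRESHOLD = 3.0

# Hypothetical key mirroring the (judged_agent_id, criterion, judge_model) dimensions.
SeriesKey = Tuple[str, str, str]


def evaluate_alarm(
    latest: Dict[SeriesKey, Optional[float]],
    previous_state: str = "INSUFFICIENT_DATA",
) -> str:
    """Simplified model of MIN-reduce + threshold + treat-missing-data=ignore.

    `latest` maps each discovered series to its most recent rolling-4-week
    mean, or None when that series has no datapoint yet.
    """
    values = [v for v in latest.values() if v is not None]
    if not values:
        # "ignore" means missing data does not change the alarm state, so a
        # freshly created alarm stays in INSUFFICIENT_DATA during bootstrap.
        return previous_state
    # One low series is enough to trip the alarm because of the MIN reduction.
    return "ALARM" if min(values) < THRESHOLD else "OK"


# Illustrative values only; these agents and scores are made up.
scores = {
    ("agent_a", "reasoning", "judge_x"): 3.8,
    ("agent_b", "reasoning", "judge_x"): 2.7,
}
state = evaluate_alarm(scores)
```

Because the reduction is a minimum over whatever series exist at evaluation time, a series added later participates without any change to the rule, which is the property the commit message attributes to SEARCH.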
cipher813 added a commit that referenced this pull request on May 3, 2026
Caught 2026-05-03 in SF eval-pipeline-validation-5: Research succeeded and wrote new-format captures to S3, but the eval-judge state silently never fired because the operator had passed skip_backtester=true to skip the long-running backtester for validation purposes.

PR 4c (#140) wired the eval-pipeline states between Backtester success and SaturdayHealthCheck:

  CheckBacktesterStatus.Success → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence → EvalJudgeFirstSaturday or EvalJudgeWeekly → EvalRollingMean → SaturdayHealthCheck

But CheckSkipBacktester.skip routed directly to SaturdayHealthCheck, bypassing the eval-pipeline entirely. Production Sat 5/9 won't hit this (skip_backtester defaults false; Backtester runs and routes through eval-judge correctly), but operator manual skips for any non-eval validation purpose silently dropped the eval state.

Fix: route skip_backtester=true → CheckSkipEvalJudge instead of SaturdayHealthCheck. Eval pipeline now fires on every SF execution where the operator hasn't explicitly skip_eval_judge'd it.

tests/test_sf_eval_judge_wiring.py: TestSkipBacktesterPreservesEvalJudge pins the routing so a future "simplification" can't re-introduce the silent bypass. Tests 433 → 434 (+1 wiring assertion).

Pairs with alpha-engine-research PR #104 (RubricEvalLLMOutput defense + judge max_tokens to strategic tier; closes the 5/32 remaining failure class observed in this same SF run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
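A routing pin of the kind this commit describes might look roughly like the sketch below. It is not the actual TestSkipBacktesterPreservesEvalJudge; the trimmed state-machine fragment, the `$.skip_backtester` variable path, the `Backtester` default target, and the `skip_target` helper are all assumptions made for illustration.

```python
# Hypothetical, trimmed Amazon States Language fragment expressed as a dict.
# Only CheckSkipBacktester is shown; the real definition's layout may differ.
STATES = {
    "CheckSkipBacktester": {
        "Type": "Choice",
        "Choices": [
            {
                "Variable": "$.skip_backtester",  # assumed input path
                "BooleanEquals": True,
                # After the fix; before it this pointed at SaturdayHealthCheck.
                "Next": "CheckSkipEvalJudge",
            }
        ],
        "Default": "Backtester",  # assumed name of the non-skip target
    },
}


def skip_target(states: dict, gate: str, flag_variable: str) -> str:
    """Return the state a Choice gate routes to when its skip flag is true."""
    for choice in states[gate]["Choices"]:
        if choice.get("Variable") == flag_variable and choice.get("BooleanEquals") is True:
            return choice["Next"]
    raise AssertionError(f"{gate} has no skip branch for {flag_variable}")


def test_skip_backtester_preserves_eval_judge() -> None:
    # Pins the routing: skipping the backtester must still enter the eval gate.
    target = skip_target(STATES, "CheckSkipBacktester", "$.skip_backtester")
    assert target == "CheckSkipEvalJudge"
    assert target != "SaturdayHealthCheck"


test_skip_backtester_preserves_eval_judge()
```

Asserting on the named target rather than on the overall shape of the definition keeps the test narrow: it fails only if someone re-routes the skip branch.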
cipher813 added a commit that referenced this pull request on May 3, 2026
* fix(sf): skip_backtester preserves eval-judge skip-gate path

Caught 2026-05-03 in SF eval-pipeline-validation-5: Research succeeded and wrote new-format captures to S3, but the eval-judge state silently never fired because the operator had passed skip_backtester=true to skip the long-running backtester for validation purposes.

PR 4c (#140) wired the eval-pipeline states between Backtester success and SaturdayHealthCheck:

  CheckBacktesterStatus.Success → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence → EvalJudgeFirstSaturday or EvalJudgeWeekly → EvalRollingMean → SaturdayHealthCheck

But CheckSkipBacktester.skip routed directly to SaturdayHealthCheck, bypassing the eval-pipeline entirely. Production Sat 5/9 won't hit this (skip_backtester defaults false; Backtester runs and routes through eval-judge correctly), but operator manual skips for any non-eval validation purpose silently dropped the eval state.

Fix: route skip_backtester=true → CheckSkipEvalJudge instead of SaturdayHealthCheck. Eval pipeline now fires on every SF execution where the operator hasn't explicitly skip_eval_judge'd it.

tests/test_sf_eval_judge_wiring.py: TestSkipBacktesterPreservesEvalJudge pins the routing so a future "simplification" can't re-introduce the silent bypass. Tests 433 → 434 (+1 wiring assertion).

Pairs with alpha-engine-research PR #104 (RubricEvalLLMOutput defense + judge max_tokens to strategic tier; closes the 5/32 remaining failure class observed in this same SF run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: drop dead ALPHA_ENGINE_LIB_TOKEN PAT plumbing

alpha-engine-lib was flipped public 2026-05-03; PAT auth machinery that existed to install from a private repo is now dead weight. Removed across 6 files (net −87 lines).

CI:
- .github/workflows/ci.yml: drop "Configure git auth" step
- .github/workflows/deploy.yml: drop the secondary actions/checkout for cipher813/alpha-engine-lib + the LIB_REPO_DIR env on the deploy step

Docker / deploy:
- Dockerfile: replace `COPY vendor/alpha-engine-lib` + local pip install with `pip install "alpha-engine-lib[flow_doctor] @ git+https://github.com/cipher813/alpha-engine-lib@v0.3.0"`. The [flow_doctor]-only install for Lambda is preserved (Lambda doesn't need [arcticdb] or [rag]); requirements.txt's [arcticdb,flow_doctor,rag] extras still apply for the EC2 install path.
- infrastructure/deploy.sh: drop the vendor/alpha-engine-lib staging block + cleanup_lib_staging trap. Replace with one-line comment explaining lib comes from public git+https now.

EC2 spot scripts:
- infrastructure/spot_data_weekly.sh: drop SSM PAT fetch + insteadOf rewrite from the DEPS step. Update inline comments referencing the old mechanism (3 spots).
- infrastructure/spot_drift_detection.sh: same removal.

Companion follow-ups (not in this PR):
- Delete ALPHA_ENGINE_LIB_TOKEN GitHub Actions secret on this repo
- Delete /alpha-engine/lib-token SSM SecureString (us-east-1)
- vendor/alpha-engine-lib local checkout can be removed (gitignored, not in any commit)

Per ROADMAP follow-up "P3 Drop ALPHA_ENGINE_LIB_TOKEN PAT plumbing" added 2026-05-03. Second of 6 consumer-repo PRs in this cleanup arc; prototype landed in alpha-engine PR #128.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request on May 6, 2026
Companion to alpha-engine-backtester #140 (counterfactual Lambda implementation). Closes the agent-justification triple end-to-end: all three signals now fire from the Saturday SF on the same trailing-8-week corpus.

SF chain after this PR:

  ... → EvalJudge{Weekly,FirstSaturday} → EvalRollingMean → CheckSkipRationaleClustering → RationaleClustering → CheckSkipReplayConcordance → ReplayConcordance → CheckSkipCounterfactual → Counterfactual → SaturdayHealthCheck → ...

Three independent skip-gates, each landing on the next signal's gate rather than the health check (so skipping one signal doesn't bundle-skip the others):
- {"skip_rationale_clustering": true} → CheckSkipReplayConcordance
- {"skip_replay_concordance": true} → CheckSkipCounterfactual
- {"skip_counterfactual": true} → SaturdayHealthCheck

Default Counterfactual payload pins production cadence:
- end_time_iso = SF execution start time
- window_days = 56 (8 weeks)
- max_depth = 3 ("3-deep rule")

No target_models payload: counterfactual doesn't replay against a target model; sklearn fits a tree on actual (input → decision) pairs.

IAM updates:
- github-actions-lambda-deploy.json: alpha-engine-replay-counterfactual added to LambdaUpdate + LambdaInvokeCanary lists. Asymmetric-IAM-grant antipattern compliance; 5th Lambda this shape; durable CreateFunction grant from data #165 already covers create + update.
- deploy_step_function.sh: SF role inline LambdaInvoke list updated with the new function ARN so SF can invoke it.

Tests:
- TestStatesPresent: CheckSkipCounterfactual + Counterfactual added to required-states pin.
- TestSkipReplayConcordance: rerouted assertion (now lands at CheckSkipCounterfactual, not SaturdayHealthCheck).
- TestReplayConcordance: success + Catch reroutes (same).
- TestSkipCounterfactual: skip_counterfactual flag → SaturdayHealthCheck.
- TestCounterfactual: live alias, payload required fields (end_time_iso, window_days=56, max_depth=3), 600s timeout matches Lambda cap, success + Catch routes, retry posture.

Suite 459 → 467.

Composes with the agent-justification dashboard surface. All three metrics emit under the AlphaEngine/Eval namespace:
- agent_quality_score (eval-judge)
- agent_quality_score_4w_mean (eval-rolling-mean)
- agent_rationale_template_concentration (clustering)
- agent_cheap_model_concordance (concordance)
- agent_counterfactual_rule_fit (this one)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
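The three skip-gates and the default Counterfactual payload described above can be sketched as a small Python model. It only mirrors the routing rules and payload fields listed in the commit message; the `CHAIN` table and the two helper functions are hypothetical and do not come from the repository.

```python
from typing import Dict, List

# (skip flag, state that runs when the flag is absent or false), in SF order.
CHAIN = [
    ("skip_rationale_clustering", "RationaleClustering"),
    ("skip_replay_concordance", "ReplayConcordance"),
    ("skip_counterfactual", "Counterfactual"),
]


def states_run(execution_input: Dict[str, bool]) -> List[str]:
    """List the signal states that run for a given SF execution input.

    Each gate hands off to the next gate whether or not its own signal was
    skipped, so skipping one signal never skips the others.
    """
    run = [state for flag, state in CHAIN if not execution_input.get(flag, False)]
    return run + ["SaturdayHealthCheck"]


def counterfactual_payload(execution_start_iso: str) -> Dict[str, object]:
    """Default payload fields named in the commit message."""
    return {
        "end_time_iso": execution_start_iso,  # SF execution start time
        "window_days": 56,                    # 8 weeks
        "max_depth": 3,                       # "3-deep rule"
    }


only_concordance_skipped = states_run({"skip_replay_concordance": True})
payload = counterfactual_payload("2026-05-09T00:00:00Z")  # illustrative timestamp
```

Modeling the chain as an ordered list makes the "no bundle-skip" property easy to see: every flag filters exactly one entry.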
Summary
Wires the rolling-mean Lambda (research PR #93) into the Saturday SF after the eval-judge converge point, and installs a single CloudWatch alarm that fires when ANY agent's rolling-4-week-mean drops below 3.0.
SF flow update
Both eval-judge branches now converge to EvalRollingMean (instead of SaturdayHealthCheck directly):
  EvalJudgeFirstSaturday / EvalJudgeWeekly → EvalRollingMean → SaturdayHealthCheck → NotifyComplete
EvalRollingMean is non-blocking: Catch States.ALL → SaturdayHealthCheck. Even when eval-judge has infra failures, rolling-mean still runs; the trailing 4-week window is unaffected by the current week's hiccup.
Why SF wiring instead of EventBridge
Per session discussion + the codebase convention ("Don't add redundant paths around load-bearing scheduled infra"): the Saturday SF is the single authoritative weekly path. EventBridge would have been an implicit-timing-dependency parallel schedule. SF wiring makes the dependency explicit, runs only after the current week's raw metric was actually emitted, and gives one SF execution trace covering the full eval pipeline.
Alarm design
infrastructure/setup_eval_quality_alarm.sh is idempotent. Creates one alarm alpha-engine-eval-quality-regression using a SEARCH metric expression that discovers every (judged_agent_id, criterion, judge_model) combo at evaluation time and MIN-reduces them, so new agents/criteria added later are auto-monitored without re-running the script. Reuses the existing alpha-engine-alerts SNS topic. treat-missing-data=ignore keeps the alarm in INSUFFICIENT_DATA until 4 weeks of metric history accrue (no false pages on bootstrap).
Test plan
- python -m pytest tests/ -q → 433 passed (was 427).
- EvalRollingMean assertions: alias + start-time payload + 300s timeout + non-blocking Catch + retry posture.
- bash -n setup_eval_quality_alarm.sh syntax-clean.
Deploy order
1. alpha-engine-research: ./infrastructure/deploy.sh eval_rolling_mean (creates the Lambda alias)
2. alpha-engine-data: ./infrastructure/deploy_step_function.sh (updates SF JSON + IAM)
3. alpha-engine-data: ./infrastructure/setup_eval_quality_alarm.sh (installs the alarm)
Out of scope (PR 4d)
Streamlit quality-trend dashboard page in alpha-engine-dashboard (per-agent line charts × dimensions; prompt-version → quality-score correlation chart per ROADMAP §1633).
🤖 Generated with Claude Code
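As a rough illustration of the wiring properties listed in the test plan (alias, start-time payload, 300s timeout, non-blocking Catch, retry posture), the sketch below writes one possible EvalRollingMean Task state as a Python dict and checks it. The alias name `live`, the payload key `start_time`, and the retry values are assumptions; only the 300-second timeout, the catch-all to SaturdayHealthCheck, and the function name come from this PR.

```python
# Hypothetical Task state; field values marked "assumed" are not from the PR.
EVAL_ROLLING_MEAN = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
        # Function name is from the PR; the ":live" alias suffix is assumed.
        "FunctionName": "alpha-engine-research-eval-rolling-mean:live",
        # Payload key name is assumed; the value reads the execution start
        # time from the Step Functions context object.
        "Payload": {"start_time.$": "$$.Execution.StartTime"},
    },
    "TimeoutSeconds": 300,
    "Retry": [
        {   # assumed retry posture for transient Lambda errors
            "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
            "IntervalSeconds": 5,
            "MaxAttempts": 2,
            "BackoffRate": 2.0,
        }
    ],
    # Non-blocking: any error continues to the health check.
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "SaturdayHealthCheck"}],
    "Next": "SaturdayHealthCheck",
}


def is_non_blocking(state: dict, fallback: str) -> bool:
    """True when a catch-all clause routes every error to `fallback`."""
    return any(
        "States.ALL" in clause["ErrorEquals"] and clause["Next"] == fallback
        for clause in state.get("Catch", [])
    )


non_blocking = is_non_blocking(EVAL_ROLLING_MEAN, "SaturdayHealthCheck")
routes_to_failure = is_non_blocking(EVAL_ROLLING_MEAN, "HandleFailure")
```

Checking the Catch target separately from the success `Next` is what distinguishes a non-blocking state from one that merely succeeds into the health check.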