feat(sf): PR 4c rolling-mean SF wiring + CloudWatch alarm #140

Merged: cipher813 merged 1 commit into main from feat/llm-judge-rolling-mean-sf-pr4c on May 3, 2026
Summary

Wires the rolling-mean Lambda (research PR #93) into the Saturday SF after the eval-judge converge point, and installs a single CloudWatch alarm that fires when ANY agent's rolling-4-week-mean drops below 3.0.

SF flow update

Both eval-judge branches now converge to EvalRollingMean (instead of SaturdayHealthCheck directly):

CheckBacktesterStatus (Success)
  → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence
      → EvalJudgeFirstSaturday ─┐
                                ├→ EvalRollingMean → SaturdayHealthCheck → NotifyComplete
      → EvalJudgeWeekly ────────┘

EvalRollingMean is non-blocking: Catch States.ALL → SaturdayHealthCheck. Even when eval-judge has infra failures, rolling-mean still runs — the trailing 4-week window is unaffected by the current week's hiccup.
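
A minimal sketch of the wiring described above, written as a Python dict that mirrors Amazon States Language. The state names, the `States.ALL` catch, and the 300s timeout come from this PR. The `Resource` value and the eval-judge `Catch` entries are assumptions for illustration, and the real definition lives in the SF JSON.

```python
# Hypothetical fragment of the Saturday state machine (not the real SF JSON).
STATES = {
    "EvalJudgeFirstSaturday": {
        "Type": "Task",
        "Next": "EvalRollingMean",
        # Assumption: eval-judge failures also fall through to rolling-mean.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "EvalRollingMean"}],
    },
    "EvalJudgeWeekly": {
        "Type": "Task",
        "Next": "EvalRollingMean",
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "EvalRollingMean"}],
    },
    "EvalRollingMean": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",  # assumed integration
        "TimeoutSeconds": 300,
        "Next": "SaturdayHealthCheck",
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "SaturdayHealthCheck"}],
    },
    "SaturdayHealthCheck": {"Type": "Task", "Next": "NotifyComplete"},
}


def is_non_blocking(state: dict, fallthrough: str) -> bool:
    """True when success and any caught error both lead to `fallthrough`."""
    catches_all = any(
        "States.ALL" in rule["ErrorEquals"] and rule["Next"] == fallthrough
        for rule in state.get("Catch", [])
    )
    return state.get("Next") == fallthrough and catches_all


def converge_target(states: dict, branches: list) -> set:
    """Collect the distinct success targets of the given branch states."""
    return {states[name]["Next"] for name in branches}
```

A wiring test in this style can pin both properties: the two eval-judge branches share one converge target, and EvalRollingMean cannot route anywhere except the health check.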

Why SF wiring instead of EventBridge

Per the session discussion and the codebase convention ("Don't add redundant paths around load-bearing scheduled infra"), the Saturday SF is the single authoritative weekly path. An EventBridge rule would have been a parallel schedule with an implicit timing dependency. SF wiring makes the dependency explicit, runs only after the current week's raw metric was actually emitted, and gives one SF execution trace covering the full eval pipeline.

Alarm design

infrastructure/setup_eval_quality_alarm.sh is idempotent. Creates one alarm alpha-engine-eval-quality-regression using a SEARCH metric expression that discovers every (judged_agent_id, criterion, judge_model) combo at evaluation time and MIN-reduces them — so new agents/criteria added later are auto-monitored without re-running the script. Reuses the existing alpha-engine-alerts SNS topic. treat-missing-data=ignore keeps the alarm in INSUFFICIENT_DATA until 4 weeks of metric history accrue (no false pages on bootstrap).
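
The MIN-reduce rule can be modelled locally. The sketch below is not the CloudWatch evaluator and does not call AWS. It only illustrates the decision the alarm is meant to make: take the latest rolling mean for every (judged_agent_id, criterion, judge_model) combo, reduce with MIN, and compare against 3.0. The agent and judge names in the usage note are hypothetical, and treating "no data at all" as INSUFFICIENT_DATA is a simplification of the bootstrap behaviour described above.

```python
from typing import Mapping, Optional, Tuple

# (judged_agent_id, criterion, judge_model)
Combo = Tuple[str, str, str]

THRESHOLD = 3.0  # from the PR: fire when the minimum drops below 3.0


def alarm_state(
    latest_means: Mapping[Combo, Optional[float]],
    threshold: float = THRESHOLD,
) -> str:
    """Local model of the alarm: MIN over every combo that has data."""
    present = [value for value in latest_means.values() if value is not None]
    if not present:
        # Nothing emitted yet (bootstrap period): no page is sent.
        return "INSUFFICIENT_DATA"
    return "ALARM" if min(present) < threshold else "OK"
```

For example, `alarm_state({("agent_a", "clarity", "judge_x"): 3.4, ("agent_b", "clarity", "judge_x"): 2.8})` returns `"ALARM"`, because one agent's mean is below the threshold even though the other is healthy. A combo that appears later is picked up simply by being present in the mapping, which mirrors the auto-discovery property described above.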

Test plan

  • python -m pytest tests/ -q → 433 passed (was 427).
  • 6 new EvalRollingMean assertions: alias + start-time payload + 300s timeout + non-blocking Catch + retry posture.
  • Existing 21 LLM-judge wiring tests updated to expect the new converge target.
  • bash -n setup_eval_quality_alarm.sh syntax-clean.

Deploy order

  1. From alpha-engine-research: ./infrastructure/deploy.sh eval_rolling_mean (creates the Lambda alias)
  2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh (updates SF JSON + IAM)
  3. From alpha-engine-data: ./infrastructure/setup_eval_quality_alarm.sh (installs the alarm)

Out of scope (PR 4d)

  • Streamlit quality-trend dashboard page in alpha-engine-dashboard (per-agent line charts × dimensions; prompt-version → quality-score correlation chart per ROADMAP §1633).

🤖 Generated with Claude Code

Wires the alpha-engine-research-eval-rolling-mean Lambda (research
PR #93) into the Saturday SF after the eval-judge converge point,
and installs a single CloudWatch alarm that fires when ANY agent's
rolling-4-week-mean drops below 3.0.

SF flow update — eval-judge branches now converge to EvalRollingMean
instead of SaturdayHealthCheck:

  CheckBacktesterStatus (Success)
    → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence
        → EvalJudgeFirstSaturday ─┐
                                  ├→ EvalRollingMean → SaturdayHealthCheck → NotifyComplete
        → EvalJudgeWeekly ────────┘

EvalRollingMean is non-blocking: Catch States.ALL → SaturdayHealthCheck
(never HandleFailure). Even on weeks where eval-judge had infra
failures, the rolling-mean Lambda still runs against whatever 4
weeks of prior data ARE in CloudWatch — the trailing window is
unaffected by the current week's hiccup.
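
A sketch of why a missing current-week sample does not break the trailing window. The real Lambda's implementation is not shown in this PR. The function name, the 28-day window constant, and the sample layout below are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from statistics import fmean
from typing import Iterable, Optional, Tuple


def rolling_mean(
    samples: Iterable[Tuple[datetime, float]],
    end_time: datetime,
    window_days: int = 28,  # assumed: 4 weeks expressed in days
) -> Optional[float]:
    """Mean of whatever scores fall in (end_time - window, end_time].

    Returns None when the window holds no samples, so the caller can
    skip emitting a metric instead of emitting a misleading zero.
    """
    start = end_time - timedelta(days=window_days)
    in_window = [score for ts, score in samples if start < ts <= end_time]
    return fmean(in_window) if in_window else None


if __name__ == "__main__":
    utc = timezone.utc
    weekly_scores = [
        (datetime(2026, 3, 28, tzinfo=utc), 2.0),  # older than 4 weeks, ignored
        (datetime(2026, 4, 11, tzinfo=utc), 3.0),
        (datetime(2026, 4, 18, tzinfo=utc), 4.0),
        (datetime(2026, 4, 25, tzinfo=utc), 5.0),
        # no sample for 2026-05-02: the eval-judge run failed that week
    ]
    print(rolling_mean(weekly_scores, datetime(2026, 5, 2, tzinfo=utc)))
```

With the current week missing, the mean is taken over the three samples that are present, which is the behaviour the non-blocking Catch relies on.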

Why we converge to rolling-mean rather than spawn a separate
EventBridge rule (per session discussion): "Don't add redundant paths
around load-bearing scheduled infra" — the Saturday SF is the system's
single authoritative weekly path. EventBridge would have been an
implicit-timing-dependency parallel schedule ("fire 4 hours after the
SF and hope eval-judge finished"). SF wiring makes the dependency
explicit, runs only after the current week's raw metric was actually
emitted, and gives a single SF execution trace covering the whole
eval pipeline.

infrastructure/setup_eval_quality_alarm.sh
  - One-shot idempotent script. Creates a single CloudWatch alarm
    "alpha-engine-eval-quality-regression" using a SEARCH metric
    expression to discover every (judged_agent_id, criterion,
    judge_model) combo at evaluation time, MIN-reduces them, and
    fires when the min drops below 3.0. SEARCH means new agents +
    criteria added later are auto-monitored without re-running the
    script.
  - Reuses the existing alpha-engine-alerts SNS topic — eval
    regressions land in the same operator inbox as pipeline failures.
  - treat-missing-data=ignore so the alarm sits at INSUFFICIENT_DATA
    until 4 weeks of metric history accrue (no false pages on
    bootstrap).

IAM update: alpha-engine-research-eval-rolling-mean* added to the
SF-role LambdaInvoke list in BOTH deploy_step_function.sh and
deploy_step_function_daily.sh (shared-policy convention requires
sync).

Tests 427 → 433 (+6 EvalRollingMean assertions on top of the existing
21 LLM-judge wiring tests that were updated to expect the new
converge target — Lambda alias + start-time payload + 300s timeout +
non-blocking Catch + retry posture).

Deploy order:
  1. From alpha-engine-research: ./infrastructure/deploy.sh eval_rolling_mean
     (creates the Lambda alias)
  2. From alpha-engine-data: ./infrastructure/deploy_step_function.sh
     (updates SF JSON + IAM)
  3. From alpha-engine-data: ./infrastructure/setup_eval_quality_alarm.sh
     (installs the alarm; idempotent so safe to re-run)

First eligible alarm firing: 4 weeks after eval-judge starts emitting
raw scores. Until then alarm stays in INSUFFICIENT_DATA.

Out of scope (PR 4d): Streamlit quality-trend dashboard page in
alpha-engine-dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 merged commit e1171b5 into main on May 3, 2026
1 check passed
cipher813 deleted the feat/llm-judge-rolling-mean-sf-pr4c branch on May 3, 2026 14:33
cipher813 added a commit that referenced this pull request May 3, 2026
Caught 2026-05-03 in SF eval-pipeline-validation-5: Research succeeded
and wrote new-format captures to S3, but the eval-judge state silently
never fired because the operator had passed skip_backtester=true to
skip the long-running backtester for validation purposes.

PR 4c (#140) wired the eval-pipeline states between Backtester success
and SaturdayHealthCheck:

  CheckBacktesterStatus.Success
    → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence
        → EvalJudgeFirstSaturday or EvalJudgeWeekly → EvalRollingMean
    → SaturdayHealthCheck

But CheckSkipBacktester.skip routed directly to SaturdayHealthCheck,
bypassing the eval-pipeline entirely. Production Sat 5/9 won't hit
this (skip_backtester defaults false; Backtester runs and routes
through eval-judge correctly), but operator manual skips for any
non-eval validation purpose silently dropped the eval state.

Fix: route skip_backtester=true → CheckSkipEvalJudge instead of
SaturdayHealthCheck. Eval pipeline now fires on every SF execution
where the operator hasn't explicitly skip_eval_judge'd it.

tests/test_sf_eval_judge_wiring.py — TestSkipBacktesterPreservesEvalJudge:
  pins the routing so a future "simplification" can't re-introduce
  the silent bypass.
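
A sketch of what such a routing pin might look like. The Choice-state layout, the `$.skip_backtester` variable path, and the `Backtester` default target are assumptions. Only the state names CheckSkipBacktester and CheckSkipEvalJudge and the flag name come from this commit.

```python
# Hypothetical post-fix Choice state, mirroring Amazon States Language.
CHECK_SKIP_BACKTESTER = {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.skip_backtester",
            "BooleanEquals": True,
            # Before the fix this pointed at SaturdayHealthCheck.
            "Next": "CheckSkipEvalJudge",
        }
    ],
    "Default": "Backtester",  # assumed name of the normal path
}


def next_state(choice_state: dict, execution_input: dict) -> str:
    """Resolve a Choice state that only uses top-level BooleanEquals rules."""
    for rule in choice_state["Choices"]:
        key = rule["Variable"].split(".", 1)[1]  # "$.flag" -> "flag"
        if execution_input.get(key) is rule["BooleanEquals"]:
            return rule["Next"]
    return choice_state["Default"]


def test_skip_backtester_preserves_eval_judge() -> None:
    """Pin the routing so the eval pipeline is never silently bypassed."""
    target = next_state(CHECK_SKIP_BACKTESTER, {"skip_backtester": True})
    assert target == "CheckSkipEvalJudge"
    assert target != "SaturdayHealthCheck"
```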

Tests 433 → 434 (+1 wiring assertion).

Pairs with alpha-engine-research PR #104 (RubricEvalLLMOutput
defense + judge max_tokens to strategic tier — closes the 5/32
remaining failure class observed in this same SF run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 3, 2026
* fix(sf): skip_backtester preserves eval-judge skip-gate path

* chore: drop dead ALPHA_ENGINE_LIB_TOKEN PAT plumbing

alpha-engine-lib was flipped public 2026-05-03; PAT auth machinery
that existed to install from a private repo is now dead weight.
Removed across 6 files (net −87 lines).

CI:
- .github/workflows/ci.yml — drop "Configure git auth" step
- .github/workflows/deploy.yml — drop the secondary
  actions/checkout for cipher813/alpha-engine-lib + the LIB_REPO_DIR
  env on the deploy step

Docker / deploy:
- Dockerfile — replace `COPY vendor/alpha-engine-lib` + local pip
  install with `pip install "alpha-engine-lib[flow_doctor] @
  git+https://github.com/cipher813/alpha-engine-lib@v0.3.0"`. The
  [flow_doctor]-only install for Lambda is preserved (Lambda doesn't
  need [arcticdb] or [rag]); requirements.txt's
  [arcticdb,flow_doctor,rag] extras still apply for the EC2 install
  path.
- infrastructure/deploy.sh — drop the vendor/alpha-engine-lib
  staging block + cleanup_lib_staging trap. Replace with one-line
  comment explaining lib comes from public git+https now.

EC2 spot scripts:
- infrastructure/spot_data_weekly.sh — drop SSM PAT fetch + insteadOf
  rewrite from the DEPS step. Update inline comments referencing the
  old mechanism (3 spots).
- infrastructure/spot_drift_detection.sh — same removal.

Companion follow-ups (not in this PR):
- Delete ALPHA_ENGINE_LIB_TOKEN GitHub Actions secret on this repo
- Delete /alpha-engine/lib-token SSM SecureString (us-east-1)
- vendor/alpha-engine-lib local checkout can be removed (gitignored,
  not in any commit)

Per ROADMAP follow-up "P3 Drop ALPHA_ENGINE_LIB_TOKEN PAT plumbing"
added 2026-05-03. Second of 6 consumer-repo PRs in this cleanup arc;
prototype landed in alpha-engine PR #128.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 6, 2026
Companion to alpha-engine-backtester #140 (counterfactual Lambda
implementation). Closes the agent-justification triple end-to-end —
all three signals now fire from the Saturday SF on the same trailing-
8-week corpus.

SF chain after this PR:

  ... → EvalJudge{Weekly,FirstSaturday} → EvalRollingMean
      → CheckSkipRationaleClustering → RationaleClustering
      → CheckSkipReplayConcordance → ReplayConcordance
      → CheckSkipCounterfactual → Counterfactual
      → SaturdayHealthCheck → ...

Three independent skip-gates, each landing on the next signal's gate
rather than the health check (so skipping one signal doesn't bundle-
skip the others):

- {"skip_rationale_clustering": true} → CheckSkipReplayConcordance
- {"skip_replay_concordance": true}    → CheckSkipCounterfactual
- {"skip_counterfactual": true}        → SaturdayHealthCheck
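
The gate chaining can be sketched as a small simulation. The flag and state names are taken from the list above. The function itself is a hypothetical model of the routing, not code from the repo.

```python
from typing import Dict, List, Tuple

# (skip flag, task state guarded by that flag), in SF order.
GATES: List[Tuple[str, str]] = [
    ("skip_rationale_clustering", "RationaleClustering"),
    ("skip_replay_concordance", "ReplayConcordance"),
    ("skip_counterfactual", "Counterfactual"),
]


def states_run(execution_input: Dict[str, bool]) -> List[str]:
    """Task states that execute after EvalRollingMean for a given input.

    Each gate hands off to the next gate, so one skip flag removes only
    its own task. The health check always runs last.
    """
    ran = [task for flag, task in GATES if not execution_input.get(flag, False)]
    return ran + ["SaturdayHealthCheck"]
```

For instance, `states_run({"skip_replay_concordance": True})` yields RationaleClustering, Counterfactual, then SaturdayHealthCheck. Had each gate landed on the health check instead, the same input would also have dropped Counterfactual.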

Default Counterfactual payload pins production cadence:

  - end_time_iso = SF execution start time
  - window_days = 56 (8 weeks)
  - max_depth = 3 ("3-deep rule")

No target_models payload — counterfactual doesn't replay against a
target model; sklearn fits a tree on actual (input → decision) pairs.
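
A sketch of the default payload as a test might pin it. The three field names and values come from the list above. The builder function and the sample timestamp are hypothetical, and in the SF definition the end time would come from the execution context rather than from Python.

```python
WINDOW_DAYS = 8 * 7  # 8 weeks, per the commit message
MAX_DEPTH = 3        # the "3-deep rule"


def counterfactual_payload(execution_start_iso: str) -> dict:
    """Default Counterfactual payload; deliberately has no target_models key."""
    return {
        "end_time_iso": execution_start_iso,
        "window_days": WINDOW_DAYS,
        "max_depth": MAX_DEPTH,
    }


if __name__ == "__main__":
    # Hypothetical execution start time for illustration only.
    print(counterfactual_payload("2026-05-09T09:00:00Z"))
```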

IAM updates:

- github-actions-lambda-deploy.json: alpha-engine-replay-counterfactual
  added to LambdaUpdate + LambdaInvokeCanary lists. Asymmetric-IAM-
  grant antipattern compliance — 5th Lambda this shape; durable
  CreateFunction grant from data #165 already covers create + update.
- deploy_step_function.sh: SF role inline LambdaInvoke list updated
  with the new function ARN so SF can invoke it.

Tests:

- TestStatesPresent: CheckSkipCounterfactual + Counterfactual added
  to required-states pin.
- TestSkipReplayConcordance: rerouted assertion (now lands at
  CheckSkipCounterfactual, not SaturdayHealthCheck).
- TestReplayConcordance: success + Catch reroutes (same).
- TestSkipCounterfactual: skip_counterfactual flag → SaturdayHealthCheck.
- TestCounterfactual: live alias, payload required fields
  (end_time_iso, window_days=56, max_depth=3), 600s timeout matches
  Lambda cap, success + Catch routes, retry posture.

Suite 459 → 467.

Composes with the agent-justification dashboard surface: the three
agent-justification metrics join the two eval-quality metrics under
the AlphaEngine/Eval namespace:

  - agent_quality_score (eval-judge)
  - agent_quality_score_4w_mean (eval-rolling-mean)
  - agent_rationale_template_concentration (clustering)
  - agent_cheap_model_concordance (concordance)
  - agent_counterfactual_rule_fit (this one)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>