fix(iam): grant SF role InvokeFunction on 4 missing Lambdas + preflight by cipher813 · Pull Request #188 · cipher813/alpha-engine-data

cipher813 · 2026-05-07T22:18:27Z

Summary

Investigation triggered by today's eval email Agent Justification Stack section showing "Concordance — no_recent_sf_run" while the SF clearly traversed ReplayConcordance. Pulled today's SF state for execution validation-eval-final-20260507T201416:

rationale_clustering_error: AccessDenied — User: ... is not authorized to perform:
  lambda:InvokeFunction on resource: alpha-engine-research-rationale-clustering:live ...
replay_concordance_error: same shape, replay-concordance:live

The codified IAM policy listed 5 Lambda ARN patterns; 3 of the agent-justification-arc Lambdas (rationale-clustering, replay-concordance, replay-counterfactual) were missing — they were added to the SF defn 2026-05-06 in the 17-PR all-nighter without the matching IAM grant. The SF state-level Catch[States.ALL] swallowed AccessDenied silently → ~3 weeks of missing data on 3 of the 4 quadfecta sources, never visible to the operator.
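Because the state-level Catch keeps the execution green, the only trace of the failure is the `*_error` keys left in the execution state. A minimal sketch of surfacing them follows, assuming the state has been loaded as a dict and that swallowed failures always land in top-level keys ending in `_error` (as in the excerpt above). The function name and the sample `run_date` key are hypothetical, not part of the repo.

```python
from typing import Any, Dict


def find_swallowed_errors(state: Dict[str, Any], suffix: str = "_error") -> Dict[str, Any]:
    """Return every top-level key ending in `suffix` that carries a non-empty value."""
    return {key: value for key, value in state.items() if key.endswith(suffix) and value}


# Illustrative state shaped like the excerpt above (values abbreviated).
state = {
    "rationale_clustering_error": "AccessDenied",
    "replay_concordance_error": "AccessDenied",
    "run_date": "2026-05-07",  # hypothetical non-error key
}

swallowed = find_swallowed_errors(state)
for key, value in sorted(swallowed.items()):
    print(f"{key}: {value}")
```

A check like this could run at the end of an execution and fail loudly, so a caught error does not stay invisible to the operator.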

The new preflight test surfaced a 4th gap (predictor-health-check, weekday SF) before this PR even merged — same class, different SF.

Per the asymmetric-IAM-grant antipattern memory

This is the 5th incident of this class. The codified-policy + check-drift.py loop catches policy/AWS divergence; this preflight closes the OTHER half of the symmetric grant — policy vs SF-defn divergence. Adding a new SF Lambda invocation is now a one-line policy update or the test fails at PR time.

Changes

infrastructure/iam/alpha-engine-step-functions-role.json — add 4 ARN patterns:
- alpha-engine-research-rationale-clustering*
- alpha-engine-replay-concordance*
- alpha-engine-replay-counterfactual*
- alpha-engine-predictor-health-check*
Live policy applied via apply.sh (verified OK before opening PR — concordance Lambda will start writing to S3 on next SF firing).
tests/test_sf_iam_lambda_grants.py (new) — parametrized static check across the 3 SF defns; asserts every Lambda ARN any Task state would invoke is grantable under the policy. Verified to fail when grants are missing.
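
The static check described above can be sketched roughly as follows. This is not the code in tests/test_sf_iam_lambda_grants.py. It is a minimal illustration that assumes Lambda Task states carry a full function ARN (either directly in `Resource` or in `Parameters.FunctionName` under the `arn:aws:states:::lambda:invoke` integration) and that approximates IAM `*`/`?` wildcards with `fnmatch`. All function names, the account ID, and the region are placeholders.

```python
import fnmatch
from typing import Any, Dict, Iterator, List


def _as_list(value: Any) -> List[Any]:
    """IAM and ASL fields may be a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def iter_task_states(states: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every Task state, descending into Parallel branches and Map bodies."""
    for state in states.values():
        if state.get("Type") == "Task":
            yield state
        for branch in state.get("Branches", []):
            yield from iter_task_states(branch.get("States", {}))
        for key in ("Iterator", "ItemProcessor"):
            if key in state:
                yield from iter_task_states(state[key].get("States", {}))


def invoked_lambda_arns(definition: Dict[str, Any]) -> List[str]:
    """Collect the Lambda ARNs that Task states in the definition would invoke."""
    arns: List[str] = []
    for task in iter_task_states(definition.get("States", {})):
        resource = task.get("Resource", "")
        if resource.startswith("arn:aws:states:::lambda:invoke"):
            name = task.get("Parameters", {}).get("FunctionName")
            if isinstance(name, str):  # dynamic "FunctionName.$" values are skipped
                arns.append(name)
        elif resource.startswith("arn:aws:lambda:"):
            arns.append(resource)
    return arns


def invoke_resource_patterns(policy: Dict[str, Any]) -> List[str]:
    """Resource patterns from Allow statements that cover lambda:InvokeFunction."""
    patterns: List[str] = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = [action.lower() for action in _as_list(stmt.get("Action", []))]
        if any(fnmatch.fnmatchcase("lambda:invokefunction", a) for a in actions):
            patterns.extend(_as_list(stmt.get("Resource", [])))
    return patterns


def missing_grants(definition: Dict[str, Any], policy: Dict[str, Any]) -> List[str]:
    """Return invoked ARNs that no policy resource pattern matches."""
    patterns = invoke_resource_patterns(policy)
    return [
        arn for arn in invoked_lambda_arns(definition)
        if not any(fnmatch.fnmatchcase(arn, pattern) for pattern in patterns)
    ]


# Placeholder account/region; only the function names come from this PR.
ARN_PREFIX = "arn:aws:lambda:us-east-1:123456789012:function:"
definition = {"States": {"RationaleClustering": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": ARN_PREFIX + "alpha-engine-research-rationale-clustering:live"},
    "End": True,
}}}
policy = {"Version": "2012-10-17", "Statement": [{
    "Effect": "Allow",
    "Action": "lambda:InvokeFunction",
    "Resource": [ARN_PREFIX + "alpha-engine-replay-concordance*"],
}]}

missing_before = missing_grants(definition, policy)
policy["Statement"][0]["Resource"].append(ARN_PREFIX + "alpha-engine-research-rationale-clustering*")
missing_after = missing_grants(definition, policy)
```

In a pytest wrapper, each SF definition file would become one parametrized case asserting `missing_grants(...) == []`, which is what turns a forgotten grant into a PR-time failure.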

Test plan

pytest tests/test_sf_iam_lambda_grants.py — 2/2 pass + 1 skip (eod SF has no Lambda invocations)
pytest (full suite) — 548/548 pass (+2 new for the SF-grant invariant)
Live policy applied via apply.sh (confirmed OK)
Trigger SF after merge — concordance Lambda should now write to decision_artifacts/_replay_summary/ and the email's Agent Justification Stack section should show real concordance numbers

Investigation triggered by today's eval email Agent Justification Stack section showing "Concordance — no_recent_sf_run" while the SF state clearly traversed ReplayConcordance. SF execution state for today's validation-eval-final-20260507T201416 carried:

rationale_clustering_error: AccessDenied — User: ... is not authorized to perform:
  lambda:InvokeFunction on resource: alpha-engine-research-rationale-clustering:live ...
replay_concordance_error: same shape, replay-concordance:live

The codified IAM policy (infrastructure/iam/alpha-engine-step-functions-role.json) lists 5 Lambda ARN patterns; 3 of the agent-justification-arc Lambdas (rationale-clustering, replay-concordance, replay-counterfactual) are missing. They were added to the SF defn 2026-05-06 in the 17-PR all-nighter without the matching IAM grant. The SF state-level Catch[States.ALL] swallowed the AccessDenied silently → ~3 weeks of missing data on the 3-of-4 quadfecta sources, never visible to the operator.

Fix:
- alpha-engine-step-functions-role.json: add 4 ARN patterns (rationale-clustering, replay-concordance, replay-counterfactual, predictor-health-check). The 4th was discovered by the new preflight test: same class, different SF defn (weekday).
- apply.sh applied live (verified OK).

Preflight (closes the symmetric-grant gap that allowed this):
- tests/test_sf_iam_lambda_grants.py: static parser walks every Task state in step_function.json + step_function_daily.json + step_function_eod.json, extracts every invoked Lambda ARN, asserts each one is grantable under the codified policy. Catches this exact class at PR time. Verified to fail when grants are missing (correctly flagged predictor-health-check before the policy was updated to include it).

Per the asymmetric-IAM-grant antipattern memory: this is the 5th incident of this class. The codified-policy + check-drift loop catches policy/AWS divergence; this preflight closes the OTHER half, policy vs SF-defn divergence.

Adding a new SF Lambda invocation is now a one-line policy update or the test fails at PR time.

548 tests pass (+2 new for the SF-grant invariant), 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…m_log_capture CLI (#290)

Closes the 2026-05-22 Friday-PM dry-pass failure (Alpha Engine Saturday Pipeline FAILED at MorningEnrich, exit 127, "trap: s3: invalid signal specification"). Lifts the inline `trap 'aws s3 cp /var/log/X.log "s3://..." || true' EXIT` + `tee /var/log/X.log` pattern out of JSON and into the alpha_engine_lib.ssm_log_capture Python CLI (lib v0.25.0).

**Root cause** of the dry-pass break: PR #253 (merged 2026-05-17) flipped 8 Saturday-SF spot states from plain `commands` JSON arrays to `commands.$ States.Array(...)` so they could splice `$.run_date` / `$.preflight_args` via States.Format. Inside States.Array arg strings, ASL's documented escape for an inner single quote is `\'`, but the AWS ASL evaluator does NOT unescape `\'` to `'` in practice; it passes the backslash through literally. The trap line `'trap \'cmd\' EXIT'` therefore rendered into the SSM `_script.sh` as `trap \'cmd\' EXIT`. Bash interpreted the `\'` outside quotes as a literal apostrophe stripped of its quoting power, then word-split the line and passed every token after `aws` to `trap` as a signal name. Exit 127, line 7.

Same defect lurked in ALL 8 Saturday-SF spot states (MorningEnrich first only because it's the first to run; everything downstream would have failed the same way). The 2026-05-22 Friday-PM shell-run dry-pass, designed exactly to catch break classes like this ~12h before the Saturday 09:00 UTC firing, exposed the latent regression on its first execution under the broken pattern (no Saturday SF had run between #253 merge and the dry-pass). The Friday-PM dry-pass safety net worked as designed.

**Fix shape (institutional, per CLAUDE.md SOTA sub-sub-rule):** Each affected state now invokes a single States.Format-rendered token list with no bash trap and no inner single quotes:

    /home/ec2-user/alpha-engine-dashboard/.venv/bin/python \
      -m alpha_engine_lib.ssm_log_capture run \
      --slug X --log /var/log/X.log -- bash <launcher> ...

The lib CLI internalizes the tee + S3 ship-on-exit + exit-code propagation. No ASL escape surface. Tested independently in lib v0.25.0 (20 unit tests; see alpha-engine-lib PR #57).

States converted (slug → launcher):
- morning-enrich → spot_data_weekly.sh --morning-enrich-only
- data-weekly → spot_data_weekly.sh --phase1-only
- rag-ingestion → spot_data_weekly.sh --rag-only
- predictor-training → spot_train.sh --full-only (alpha-engine-predictor)
- drift-detection → spot_drift_detection.sh
- backtester → spot_backtest.sh --skip-stages=parity,evaluator (alpha-engine-backtester)
- parity → spot_backtest.sh --skip-stages=backtest,evaluator (alpha-engine-backtester)
- evaluator → spot_backtest.sh --skip-stages=backtest,parity (alpha-engine-backtester)

The `States.Format('export RUN_DATE=\\'{}\\'', $.run_date)` line in Backtester/Parity/Evaluator stays as-is: `States.Format` template unescaping IS reliable (different ASL code path than States.Array arg unescaping).

**Test gaps closed in the same diff:**
1. `test_sf_ssm_pipefail_wiring._iter_ssm_command_blocks` previously only iterated plain `commands` arrays and silently skipped every `commands.$ States.Array(...)` state. That was why the broken trap form went undetected from #253 (2026-05-17) until the dry-pass actually fired in production (2026-05-22). The helper now uses the shared `tests.sf_command_utils.extract_commands` to handle both forms uniformly.
2. `test_long_ssm_steps_ship_log_to_s3` now accepts EITHER an inline bash trap (plain `commands` form, used by weekday + EOD SFs) OR the alpha_engine_lib.ssm_log_capture CLI (commands.$ States.Array form, used by Saturday SF). Either shape satisfies the S3-log-capture invariant; the test enforces presence + correctness without prescribing one form.
3. `tests/fixtures/sf_prekeystone_spot_commands.json` regenerated from the new SF (the fixture's docstring explicitly permits this for "deliberate, reviewed change to a spot state's absent-path command"; this is exactly that case). Pre-keystone form had the broken `\'`-escape trap baked in; new baseline reflects the lib-CLI shape. The Friday-PM keystone byte-identicality proof still holds (absent path = lib-CLI invocation with preflight_args="").
4. `test_morning_enrich_has_s3_log_trap_before_work` and the analogous parity test renamed/rewritten to assert the lib-CLI presence instead of the inline-trap presence. Both also assert that NO inline trap coexists with the lib CLI in the same state (would race on /var/log).
5. `test_spot_command_carries_preflight_only_under_shell_run` updated to assert the new shell_run command shape (lib CLI + `--` separator + launcher + --preflight-only).

**Dependency chain (must merge in order):**
1. alpha-engine-lib #57: adds ssm_log_capture, auto-tags v0.25.0
2. alpha-engine-data #289: bumps lib pin to v0.25.0 (spot side)
3. alpha-engine-predictor #188: bumps lib pin (PredictorTraining spot)
4. alpha-engine-backtester #243: bumps lib pin (Backtester/Parity/Evaluator spots)
5. alpha-engine-dashboard #118: bumps lib pin (ae-dashboard = SSM target, hosts the .venv where the lib CLI runs)
6. **THIS PR**: converts SF JSON to invoke the lib CLI
7. Run alpha-engine-data/infrastructure/deploy_step_function.sh
8. Redrive the Friday-PM dry-pass; confirm 8/8 spot states green
9. Saturday 09:00 UTC SF runs cleanly

Suite: 1430 passed, 1 skipped, 7 warnings (FutureWarnings unrelated to this change).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
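
The tee + ship-on-exit + exit-code propagation that the message above attributes to the lib CLI could look roughly like the sketch below. This is not the alpha_engine_lib.ssm_log_capture implementation. `run_with_log_capture` and its `ship` callback are hypothetical names, and the S3 upload is replaced by an injectable function so the example runs without AWS credentials.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List, Sequence


def run_with_log_capture(
    command: Sequence[str],
    log_path: Path,
    ship: Callable[[Path], None],
) -> int:
    """Run `command`, tee its output to `log_path`, ship the log, return the child's exit code."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    returncode = 1
    try:
        with log_path.open("w", encoding="utf-8") as log:
            proc = subprocess.Popen(
                list(command),
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # merge stderr so the log is complete
                text=True,
            )
            assert proc.stdout is not None
            for line in proc.stdout:  # tee: echo to our stdout and write to the log
                sys.stdout.write(line)
                log.write(line)
            returncode = proc.wait()
    finally:
        # Ship on every exit path; a failed upload must not mask the child's
        # exit code (the role `|| true` played in the inline trap).
        try:
            ship(log_path)
        except Exception as exc:
            print(f"log ship failed: {exc}", file=sys.stderr)
    return returncode


# Demo: a child that prints one line and exits 3; "shipping" just records the log text.
shipped: List[str] = []
with tempfile.TemporaryDirectory() as tmp:
    exit_code = run_with_log_capture(
        [sys.executable, "-c", "import sys; print('working'); sys.exit(3)"],
        Path(tmp) / "demo.log",
        ship=lambda path: shipped.append(path.read_text(encoding="utf-8")),
    )
```

Passing the command as an argument list rather than a shell string is what removes the quoting surface: nothing in the invocation needs an inner single quote.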

cipher813 merged commit 5aad809 into main May 7, 2026
2 checks passed

cipher813 deleted the fix/sf-iam-quadfecta-lambdas branch May 7, 2026 22:24

cipher813 mentioned this pull request May 22, 2026

fix(sf): keystone ApplyShellRunDefaults — swap JsonMerge args so shellDefaults wins #291

Merged

7 tasks

cipher813 merged 1 commit into main from fix/sf-iam-quadfecta-lambdas
