fix(sf): convert 8 broken Saturday spot states to alpha_engine_lib.ssm_log_capture CLI#290
Merged
Merged
Conversation
…m_log_capture CLI Closes the 2026-05-22 Friday-PM dry-pass failure (Alpha Engine Saturday Pipeline FAILED at MorningEnrich, exit 127, "trap: s3: invalid signal specification"). Lifts the inline `trap 'aws s3 cp /var/log/X.log "s3://..." || true' EXIT` + `tee /var/log/X.log` pattern out of JSON and into the alpha_engine_lib.ssm_log_capture Python CLI (lib v0.25.0). **Root cause** of the dry-pass break: PR #253 (merged 2026-05-17) flipped 8 Saturday-SF spot states from plain `commands` JSON arrays to `commands.$ States.Array(...)` so they could splice `$.run_date` / `$.preflight_args` via States.Format. Inside States.Array arg strings, ASL's documented escape for an inner single quote is `\'` — but the AWS ASL evaluator does NOT unescape `\'` to `'` in practice; it passes the backslash through literally. The trap line `'trap \'cmd\' EXIT'` therefore rendered into the SSM `_script.sh` as `trap \'cmd\' EXIT`. Bash interpreted the `\'` outside quotes as a literal apostrophe stripped of its quoting power, then word-split the line and passed every token after `aws` to `trap` as a signal name. Exit 127, line 7. Same defect lurked in ALL 8 Saturday-SF spot states (MorningEnrich first only because it's the first to run; everything downstream would have failed the same way). The 2026-05-22 Friday-PM shell-run dry-pass — designed exactly to catch break classes like this ~12h before the Saturday 09:00 UTC firing — exposed the latent regression on its first execution under the broken pattern (no Saturday SF had run between #253 merge and the dry-pass). The Friday-PM dry-pass safety net worked as designed. **Fix shape (institutional, per CLAUDE.md SOTA sub-sub-rule):** Each affected state now invokes a single States.Format-rendered token list with no bash trap and no inner single quotes: /home/ec2-user/alpha-engine-dashboard/.venv/bin/python \ -m alpha_engine_lib.ssm_log_capture run \ --slug X --log /var/log/X.log -- bash <launcher> ... The lib CLI internalizes the tee + S3 ship-on-exit + exit-code propagation. No ASL escape surface. Tested independently in lib v0.25.0 (20 unit tests; see alpha-engine-lib PR #57). States converted (slug → launcher): - morning-enrich → spot_data_weekly.sh --morning-enrich-only - data-weekly → spot_data_weekly.sh --phase1-only - rag-ingestion → spot_data_weekly.sh --rag-only - predictor-training → spot_train.sh --full-only (alpha-engine-predictor) - drift-detection → spot_drift_detection.sh - backtester → spot_backtest.sh --skip-stages=parity,evaluator (alpha-engine-backtester) - parity → spot_backtest.sh --skip-stages=backtest,evaluator (alpha-engine-backtester) - evaluator → spot_backtest.sh --skip-stages=backtest,parity (alpha-engine-backtester) The `States.Format('export RUN_DATE=\\'{}\\'', $.run_date)` line in Backtester/Parity/Evaluator stays as-is — `States.Format` template unescaping IS reliable (different ASL code path than States.Array arg unescaping). **Test gaps closed in the same diff:** 1. `test_sf_ssm_pipefail_wiring._iter_ssm_command_blocks` previously only iterated plain `commands` arrays — silently skipped every `commands.$ States.Array(...)` state. That was why the broken trap form went undetected from #253 (2026-05-17) until the dry-pass actually fired in production (2026-05-22). The helper now uses the shared `tests.sf_command_utils.extract_commands` to handle both forms uniformly. 2. `test_long_ssm_steps_ship_log_to_s3` now accepts EITHER an inline bash trap (plain `commands` form, used by weekday + EOD SFs) OR the alpha_engine_lib.ssm_log_capture CLI (commands.$ States.Array form, used by Saturday SF). Either shape satisfies the S3-log-capture invariant; the test enforces presence + correctness without prescribing one form. 3. `tests/fixtures/sf_prekeystone_spot_commands.json` regenerated from the new SF (the fixture's docstring explicitly permits this for "deliberate, reviewed change to a spot state's absent-path command" — this is exactly that case). Pre-keystone form had the broken `\'`-escape trap baked in; new baseline reflects the lib-CLI shape. The Friday-PM keystone byte-identicality proof still holds (absent path = lib-CLI invocation with preflight_args=""). 4. `test_morning_enrich_has_s3_log_trap_before_work` and the analogous parity test renamed/rewritten to assert the lib-CLI presence instead of the inline-trap presence. Both also assert that NO inline trap coexists with the lib CLI in the same state (would race on /var/log). 5. `test_spot_command_carries_preflight_only_under_shell_run` updated to assert the new shell_run command shape (lib CLI + `--` separator + launcher + --preflight-only). **Dependency chain (must merge in order):** 1. alpha-engine-lib #57 — adds ssm_log_capture, auto-tags v0.25.0 2. alpha-engine-data #289 — bumps lib pin to v0.25.0 (spot side) 3. alpha-engine-predictor #188 — bumps lib pin (PredictorTraining spot) 4. alpha-engine-backtester #243 — bumps lib pin (Backtester/Parity/Evaluator spots) 5. alpha-engine-dashboard #118 — bumps lib pin (ae-dashboard = SSM target, hosts the .venv where the lib CLI runs) 6. **THIS PR** — converts SF JSON to invoke the lib CLI 7. Run alpha-engine-data/infrastructure/deploy_step_function.sh 8. Redrive the Friday-PM dry-pass; confirm 8/8 spot states green 9. Saturday 09:00 UTC SF runs cleanly Suite: 1430 passed, 1 skipped, 7 warnings (FutureWarnings unrelated to this change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the 2026-05-22 Friday-PM dry-pass failure ("Alpha Engine Saturday Pipeline FAILED" at MorningEnrich, exit 127,
trap: s3: invalid signal specification). Lifts the inlinetrap 'aws s3 cp /var/log/X.log "s3://..." || true' EXIT+tee /var/log/X.logpattern out of JSON and into thealpha_engine_lib.ssm_log_capturePython CLI (lib v0.25.0).The Friday-PM shell-run dry-pass safety net (#258 + #260 + #263) caught a latent regression introduced by #253 ~12h before the Saturday 09:00 UTC firing. Exactly the failure class it was built to catch.
Root cause
PR #253 (merged 2026-05-17) flipped 8 Saturday-SF spot states from plain
commandsJSON arrays tocommands.$ States.Array(...)so they could splice$.run_date/$.preflight_argsviaStates.Format. InsideStates.Arrayarg strings, ASL's documented escape for an inner single quote is\'— but the AWS ASL evaluator does NOT unescape\'to'in practice; it passes the backslash through literally.The trap line
'trap \'cmd\' EXIT'therefore rendered into the SSM_script.shastrap \'cmd\' EXIT. Bash interpreted\'outside quotes as a literal apostrophe stripped of its quoting power, then word-split the line and passed every token afterawstotrapas a signal name. Exit 127, line 7. Same defect lurked in ALL 8 Saturday-SF spot states.Fix
Each affected state now invokes a single
States.Format-rendered token list — no bash trap, no inner single quotes:The lib CLI internalizes tee + S3 ship-on-exit + exit-code propagation. No ASL escape surface by construction.
spot_data_weekly.sh --morning-enrich-onlyspot_data_weekly.sh --phase1-onlyspot_data_weekly.sh --rag-onlyspot_train.sh --full-onlyspot_drift_detection.shspot_backtest.sh --skip-stages=parity,evaluatorspot_backtest.sh --skip-stages=backtest,evaluatorspot_backtest.sh --skip-stages=backtest,parityStates.Format('export RUN_DATE=\'{}\'', $.run_date)in Backtester/Parity/Evaluator stays as-is —States.Formattemplate unescaping IS reliable (different ASL code path).Test gaps closed in the same PR
test_sf_ssm_pipefail_wiring._iter_ssm_command_blockspreviously only iterated plaincommandsarrays — silently skipped everycommands.$ States.Array(...)state. That's why the broken trap went undetected from 2026-05-17 → 2026-05-22 dry-pass fire. Now iterates both forms via the sharedextract_commandshelper.test_long_ssm_steps_ship_log_to_s3accepts EITHER the inline bash trap (plain commands form) OR the lib CLI (commands.$ form) as satisfying the invariant.sf_prekeystone_spot_commands.jsonfixture regenerated per its own docstring permission ("Regenerate ONLY on a deliberate, reviewed change to a spot state's absent-path command"). Pre-keystone form had the broken trap baked in.test_sf_morning_enrich_split_wiring.pyandtest_sf_backtest_parity_split_wiring.pyrewritten to assert lib-CLI presence + no-inline-trap-coexistence.test_spot_command_carries_preflight_only_under_shell_runupdated for the new shell_run command shape.Dependency chain (merge in order)
ssm_log_capture, auto-tags v0.25.0.venvwhere the lib CLI runs)alpha-engine-data/infrastructure/deploy_step_function.shTest plan
/home/ec2-user/alpha-engine-dashboard/.venv/bin/python -m alpha_engine_lib.ssm_log_capture --helpon ae-dashboard returns the CLI usagebash infrastructure/deploy_step_function.shto push updated SF to AWSalpha-engine-saturday-pipelineexecution (Saturday SF), or re-fire the Friday-PMalpha-engine-friday-shell-runrule--preflight-onlyclean-exit) without trap-related failures_ssm_logs/<slug>/<date>/<host>-<time>.logkeys land under the canonical layout🤖 Generated with Claude Code