Skip to content

fix(sf): convert 8 broken Saturday spot states to alpha_engine_lib.ssm_log_capture CLI#290

Merged
cipher813 merged 2 commits into
mainfrom
fix/sf-states-to-ssm-log-capture-cli
May 22, 2026
Merged

fix(sf): convert 8 broken Saturday spot states to alpha_engine_lib.ssm_log_capture CLI#290
cipher813 merged 2 commits into
mainfrom
fix/sf-states-to-ssm-log-capture-cli

Conversation

@cipher813
Copy link
Copy Markdown
Owner

Summary

Closes the 2026-05-22 Friday-PM dry-pass failure ("Alpha Engine Saturday Pipeline FAILED" at MorningEnrich, exit 127, trap: s3: invalid signal specification). Lifts the inline trap 'aws s3 cp /var/log/X.log "s3://..." || true' EXIT + tee /var/log/X.log pattern out of JSON and into the alpha_engine_lib.ssm_log_capture Python CLI (lib v0.25.0).

The Friday-PM shell-run dry-pass safety net (#258 + #260 + #263) caught a latent regression introduced by #253 ~12h before the Saturday 09:00 UTC firing. Exactly the failure class it was built to catch.

Root cause

PR #253 (merged 2026-05-17) flipped 8 Saturday-SF spot states from plain commands JSON arrays to commands.$ States.Array(...) so they could splice $.run_date / $.preflight_args via States.Format. Inside States.Array arg strings, ASL's documented escape for an inner single quote is \' — but the AWS ASL evaluator does NOT unescape \' to ' in practice; it passes the backslash through literally.

The trap line 'trap \'cmd\' EXIT' therefore rendered into the SSM _script.sh as trap \'cmd\' EXIT. Bash interpreted \' outside quotes as a literal apostrophe stripped of its quoting power, then word-split the line and passed every token after aws to trap as a signal name. Exit 127, line 7. Same defect lurked in ALL 8 Saturday-SF spot states.

Fix

Each affected state now invokes a single States.Format-rendered token list — no bash trap, no inner single quotes:

/home/ec2-user/alpha-engine-dashboard/.venv/bin/python \
  -m alpha_engine_lib.ssm_log_capture run \
  --slug X --log /var/log/X.log -- bash <launcher> ...

The lib CLI internalizes tee + S3 ship-on-exit + exit-code propagation. No ASL escape surface by construction.

State Slug Launcher
MorningEnrich morning-enrich spot_data_weekly.sh --morning-enrich-only
DataPhase1 data-weekly spot_data_weekly.sh --phase1-only
RAGIngestion rag-ingestion spot_data_weekly.sh --rag-only
PredictorTraining predictor-training spot_train.sh --full-only
DriftDetection drift-detection spot_drift_detection.sh
Backtester backtester spot_backtest.sh --skip-stages=parity,evaluator
Parity parity spot_backtest.sh --skip-stages=backtest,evaluator
Evaluator evaluator spot_backtest.sh --skip-stages=backtest,parity

States.Format('export RUN_DATE=\'{}\'', $.run_date) in Backtester/Parity/Evaluator stays as-is — States.Format template unescaping IS reliable (different ASL code path).

Test gaps closed in the same PR

  1. test_sf_ssm_pipefail_wiring._iter_ssm_command_blocks previously only iterated plain commands arrays — silently skipped every commands.$ States.Array(...) state. That's why the broken trap went undetected from 2026-05-17 → 2026-05-22 dry-pass fire. Now iterates both forms via the shared extract_commands helper.
  2. test_long_ssm_steps_ship_log_to_s3 accepts EITHER the inline bash trap (plain commands form) OR the lib CLI (commands.$ form) as satisfying the invariant.
  3. sf_prekeystone_spot_commands.json fixture regenerated per its own docstring permission ("Regenerate ONLY on a deliberate, reviewed change to a spot state's absent-path command"). Pre-keystone form had the broken trap baked in.
  4. Per-state tests in test_sf_morning_enrich_split_wiring.py and test_sf_backtest_parity_split_wiring.py rewritten to assert lib-CLI presence + no-inline-trap-coexistence.
  5. test_spot_command_carries_preflight_only_under_shell_run updated for the new shell_run command shape.

Dependency chain (merge in order)

  1. alpha-engine-lib #57 — adds ssm_log_capture, auto-tags v0.25.0
  2. alpha-engine-data #289 — bumps lib pin v0.24.0 → v0.25.0 (spot side)
  3. alpha-engine-predictor #188 — bumps lib pin (PredictorTraining spot)
  4. alpha-engine-backtester #243 — bumps lib pin (Backtester/Parity/Evaluator spots)
  5. alpha-engine-dashboard #118 — bumps lib pin v0.16.0 → v0.25.0 (ae-dashboard = SSM target, hosts the .venv where the lib CLI runs)
  6. THIS PR — converts SF JSON to invoke the lib CLI
  7. Run alpha-engine-data/infrastructure/deploy_step_function.sh
  8. Redrive the Friday-PM dry-pass; confirm 8/8 spot states green
  9. Saturday 09:00 UTC firing runs cleanly

Test plan

  • Suite green locally: 1430 passed, 1 skipped (FutureWarnings unrelated)
  • Deps 1–5 merged in order; v0.25.0 tag exists
  • Merge this PR
  • Verify /home/ec2-user/alpha-engine-dashboard/.venv/bin/python -m alpha_engine_lib.ssm_log_capture --help on ae-dashboard returns the CLI usage
  • Run bash infrastructure/deploy_step_function.sh to push updated SF to AWS
  • Redrive the failed alpha-engine-saturday-pipeline execution (Saturday SF), or re-fire the Friday-PM alpha-engine-friday-shell-run rule
  • Confirm 8/8 spot states reach Success (or --preflight-only clean-exit) without trap-related failures
  • Spot-check S3 _ssm_logs/<slug>/<date>/<host>-<time>.log keys land under the canonical layout
  • Saturday 09:00 UTC: real Saturday SF firing runs cleanly

🤖 Generated with Claude Code

…m_log_capture CLI

Closes the 2026-05-22 Friday-PM dry-pass failure (Alpha Engine Saturday
Pipeline FAILED at MorningEnrich, exit 127, "trap: s3: invalid signal
specification"). Lifts the inline `trap 'aws s3 cp /var/log/X.log
"s3://..." || true' EXIT` + `tee /var/log/X.log` pattern out of JSON
and into the alpha_engine_lib.ssm_log_capture Python CLI (lib v0.25.0).

**Root cause** of the dry-pass break:
PR #253 (merged 2026-05-17) flipped 8 Saturday-SF spot states from
plain `commands` JSON arrays to `commands.$ States.Array(...)` so they
could splice `$.run_date` / `$.preflight_args` via States.Format.
Inside States.Array arg strings, ASL's documented escape for an inner
single quote is `\'` — but the AWS ASL evaluator does NOT unescape
`\'` to `'` in practice; it passes the backslash through literally.

The trap line `'trap \'cmd\' EXIT'` therefore rendered into the SSM
`_script.sh` as `trap \'cmd\' EXIT`. Bash interpreted the `\'`
outside quotes as a literal apostrophe stripped of its quoting power,
then word-split the line and passed every token after `aws` to `trap`
as a signal name. Exit 127, line 7. Same defect lurked in ALL 8
Saturday-SF spot states (MorningEnrich first only because it's the
first to run; everything downstream would have failed the same way).

The 2026-05-22 Friday-PM shell-run dry-pass — designed exactly to
catch break classes like this ~12h before the Saturday 09:00 UTC
firing — exposed the latent regression on its first execution under
the broken pattern (no Saturday SF had run between #253 merge and the
dry-pass). The Friday-PM dry-pass safety net worked as designed.

**Fix shape (institutional, per CLAUDE.md SOTA sub-sub-rule):**
Each affected state now invokes a single States.Format-rendered token
list with no bash trap and no inner single quotes:

  /home/ec2-user/alpha-engine-dashboard/.venv/bin/python \
    -m alpha_engine_lib.ssm_log_capture run \
    --slug X --log /var/log/X.log -- bash <launcher> ...

The lib CLI internalizes the tee + S3 ship-on-exit + exit-code
propagation. No ASL escape surface. Tested independently in lib v0.25.0
(20 unit tests; see alpha-engine-lib PR #57).

States converted (slug → launcher):
- morning-enrich → spot_data_weekly.sh --morning-enrich-only
- data-weekly → spot_data_weekly.sh --phase1-only
- rag-ingestion → spot_data_weekly.sh --rag-only
- predictor-training → spot_train.sh --full-only (alpha-engine-predictor)
- drift-detection → spot_drift_detection.sh
- backtester → spot_backtest.sh --skip-stages=parity,evaluator (alpha-engine-backtester)
- parity → spot_backtest.sh --skip-stages=backtest,evaluator (alpha-engine-backtester)
- evaluator → spot_backtest.sh --skip-stages=backtest,parity (alpha-engine-backtester)

The `States.Format('export RUN_DATE=\\'{}\\'', $.run_date)` line in
Backtester/Parity/Evaluator stays as-is — `States.Format` template
unescaping IS reliable (different ASL code path than States.Array arg
unescaping).

**Test gaps closed in the same diff:**

1. `test_sf_ssm_pipefail_wiring._iter_ssm_command_blocks` previously
   only iterated plain `commands` arrays — silently skipped every
   `commands.$ States.Array(...)` state. That was why the broken trap
   form went undetected from #253 (2026-05-17) until the dry-pass
   actually fired in production (2026-05-22). The helper now uses the
   shared `tests.sf_command_utils.extract_commands` to handle both
   forms uniformly.

2. `test_long_ssm_steps_ship_log_to_s3` now accepts EITHER an inline
   bash trap (plain `commands` form, used by weekday + EOD SFs) OR the
   alpha_engine_lib.ssm_log_capture CLI (commands.$ States.Array form,
   used by Saturday SF). Either shape satisfies the S3-log-capture
   invariant; the test enforces presence + correctness without
   prescribing one form.

3. `tests/fixtures/sf_prekeystone_spot_commands.json` regenerated from
   the new SF (the fixture's docstring explicitly permits this for
   "deliberate, reviewed change to a spot state's absent-path
   command" — this is exactly that case). Pre-keystone form had the
   broken `\'`-escape trap baked in; new baseline reflects the lib-CLI
   shape. The Friday-PM keystone byte-identicality proof still holds
   (absent path = lib-CLI invocation with preflight_args="").

4. `test_morning_enrich_has_s3_log_trap_before_work` and the analogous
   parity test renamed/rewritten to assert the lib-CLI presence
   instead of the inline-trap presence. Both also assert that NO
   inline trap coexists with the lib CLI in the same state (would
   race on /var/log).

5. `test_spot_command_carries_preflight_only_under_shell_run` updated
   to assert the new shell_run command shape (lib CLI + `--` separator
   + launcher + --preflight-only).

**Dependency chain (must merge in order):**
1. alpha-engine-lib #57 — adds ssm_log_capture, auto-tags v0.25.0
2. alpha-engine-data #289 — bumps lib pin to v0.25.0 (spot side)
3. alpha-engine-predictor #188 — bumps lib pin (PredictorTraining spot)
4. alpha-engine-backtester #243 — bumps lib pin (Backtester/Parity/Evaluator spots)
5. alpha-engine-dashboard #118 — bumps lib pin (ae-dashboard = SSM target, hosts the .venv where the lib CLI runs)
6. **THIS PR** — converts SF JSON to invoke the lib CLI
7. Run alpha-engine-data/infrastructure/deploy_step_function.sh
8. Redrive the Friday-PM dry-pass; confirm 8/8 spot states green
9. Saturday 09:00 UTC SF runs cleanly

Suite: 1430 passed, 1 skipped, 7 warnings (FutureWarnings unrelated to this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 marked this pull request as ready for review May 22, 2026 22:33
@cipher813 cipher813 merged commit 63a15ff into main May 22, 2026
1 check passed
@cipher813 cipher813 deleted the fix/sf-states-to-ssm-log-capture-cli branch May 22, 2026 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant