Skip to content

Normalize soft-failure error shape in Saturday Step Function#15

Merged
cipher813 merged 1 commit into
mainfrom
fix/step-function-soft-failure-error-normalization
Apr 11, 2026
Merged

Normalize soft-failure error shape in Saturday Step Function#15
cipher813 merged 1 commit into
mainfrom
fix/step-function-soft-failure-error-normalization

Conversation

@cipher813
Copy link
Copy Markdown
Owner

Summary

Tonight's Saturday Step Function failure surfaced two bugs stacked. The first — alpha-engine-data writing "breadth": null into macro.json → research macro_agent crashing with AttributeError — was fixed in #13.

This PR fixes the second: when any Check* Choice state routed through its Default edge to HandleFailure, HandleFailure's Message template crashed the execution with:

States.Runtime: The JsonPath argument for the field '$.error'
could not be found in the input

Because $.error is only populated by the Catch blocks on Task failures — the Choice.Default path (soft failure, where a Lambda returned {"status": "ERROR"} or an SSM command ended non-Success) never sets it. Net effect:

  1. The SNS alert never fired — the paged-on email relied on HandleFailure running successfully, and that state crashed before it could publish.
  2. The execution died with a confusing States.Runtime meta-error instead of the real cause.
  3. The real cause (the AttributeError in macro_agent) was buried under a JSONPath lookup failure, making tonight's triage take longer than it should have.

Fix

Insert a dedicated Pass extractor for each of the 5 Check* states. Each extractor copies the relevant poll object (or the Research Lambda Payload) into $.error under a shape that identifies which phase failed, then falls through to the existing HandleFailure. The Catch-based hard-failure path is unchanged — it still writes directly to $.error via ResultPath, so both paths converge on the same shape.

State machine grew from 25 → 30 states. The 5 new states:

Source Choice state New extractor Reads from
CheckDataPhase1Status.Default ExtractDataPhase1Error $.data_phase1_poll
CheckRAGStatus.Default ExtractRAGError $.rag_poll
CheckResearchStatus.Default ExtractResearchError $.research_result.Payload
CheckPredictorStatus.Default ExtractPredictorError $.predictor_poll
CheckBacktesterStatus.Default ExtractBacktesterError $.backtester_poll

Already applied to the live state machine

I ran `aws stepfunctions update-state-machine --definition file://infrastructure/step_function.json` before opening this PR so tonight's restart picks up the fix:

updateDate: 2026-04-10T18:00:23-07:00
revisionId: d836ff1d-1a8c-4cf7-b8dd-69a1262fad86

Verified via `describe-state-machine` that the live definition has 30 states, all 5 Extract* states present, and `CheckResearchStatus.Default` now points to `ExtractResearchError`. This commit keeps the repo source of truth aligned with the live definition.

Test plan

  • JSON validates (`python3 -c "import json; json.load(open(...))"`)
  • All 5 Check*.Default targets verified against expected Extract* names
  • Every Extract* state has `Type=Pass`, `Next=HandleFailure`, `ResultPath=$.error`
  • `HandleFailure` Message template unchanged
  • Live SM update succeeded
  • End-to-end test: restart tonight's failed Saturday execution and confirm (a) Research runs green now that Fix macro breadth: compute from slim cache instead of writing null #13 is deployed, or (b) if something else soft-fails, HandleFailure fires SNS instead of `States.Runtime`

Out of scope

🤖 Generated with Claude Code

Tonight's pipeline failure surfaced two bugs stacked. The first (macro
breadth=null) was fixed in #13. This PR fixes the second: when any
Check* Choice state routed through its Default edge to HandleFailure,
HandleFailure's Message template crashed with:

    States.Runtime: The JsonPath argument for the field '$.error'
    could not be found in the input

because `$.error` is only populated by the Catch blocks on Task
failures — the Choice.Default path (soft failure, where a Lambda
returned {"status": "ERROR"} or an SSM command ended non-Success)
never sets it. Net effect: the SNS alert never fired, the execution
died with a confusing meta-error, and the real cause (the
AttributeError in macro_agent) was buried under a JSONPath lookup
failure.

Fix: insert a dedicated Pass state for each of the 5 Check* states.
Each extractor copies the relevant poll object (or the Research
Lambda Payload) into `$.error` under a shape that identifies which
phase failed, then falls through to the existing HandleFailure state.
The Catch-based hard-failure path is unchanged — it still writes
directly to `$.error` via ResultPath.

State machine now has 30 states (was 25). The 5 new states are:

- ExtractDataPhase1Error   (CheckDataPhase1Status.Default)
- ExtractRAGError          (CheckRAGStatus.Default)
- ExtractResearchError     (CheckResearchStatus.Default)
- ExtractPredictorError    (CheckPredictorStatus.Default)
- ExtractBacktesterError   (CheckBacktesterStatus.Default)

The live state machine has already been updated via
`aws stepfunctions update-state-machine` so tonight's restart picks
this up. This commit keeps the source of truth in the repo aligned
with the live definition.

Verified: every Extract state Pass has Type=Pass, Next=HandleFailure,
ResultPath=$.error. HandleFailure's SNS Message template is unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 4f42a05 into main Apr 11, 2026
1 check passed
@cipher813 cipher813 deleted the fix/step-function-soft-failure-error-normalization branch April 11, 2026 01:02
cipher813 added a commit that referenced this pull request Apr 11, 2026
Tonight's pipeline rerun passed DataPhase1 + RAG + Research cleanly
(validating the #13 and #15 fixes) but DataPhase2 failed with:

    Runtime.ImportModuleError: No module named 'ssm_secrets'

Root cause: lambda/handler.py:22 does `from ssm_secrets import
load_secrets` at module top-level, but the Dockerfile never copied
ssm_secrets.py into the image. The old pre-#14 image was apparently
built manually at a time when this import didn't exist, so it ran
fine. #14's GitHub Actions auto-deploy rebuilt the image from the
current Dockerfile, published version 2, and flipped the `live` alias
before the canary step failed on an unrelated IAM permission (see
follow-up note below) — so the alias is now stuck on the broken v2.

Fix: add `COPY ssm_secrets.py ${LAMBDA_TASK_ROOT}/` to the Dockerfile.
Audited lambda/handler.py + collectors/alternative.py import graphs
via AST walk: the only first-party modules pulled in at handler
import time are `collectors` (already copied) and `ssm_secrets` (now
copied). ssm_secrets.py itself only imports stdlib (logging, os), so
it doesn't transitively pull anything else.

Also added `ssm_secrets.py` and `lambda/**` to the deploy workflow's
paths filter — both were missing, which means changes to handler.py
or the secrets loader would silently not retrigger a Lambda rebuild.

Known follow-up (NOT in scope for this PR): the GitHub Actions OIDC
role `github-actions-lambda-deploy` lacks `lambda:InvokeFunction` +
`lambda:UpdateAlias` on `alpha-engine-data-collector:live`. That's
why infrastructure/deploy.sh's canary step fails after every
successful build, and why the post-canary rollback also fails
silently — leaving the alias stranded on whatever version just got
published. Two options: (a) add those perms to the OIDC role, or
(b) short-circuit the canary step in CI. Separate infrastructure PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant