Normalize soft-failure error shape in Saturday Step Function#15
Merged
cipher813 merged 1 commit intoApr 11, 2026
Merged
Conversation
Tonight's pipeline failure surfaced two bugs stacked. The first (macro breadth=null) was fixed in #13. This PR fixes the second: when any Check* Choice state routed through its Default edge to HandleFailure, HandleFailure's Message template crashed with: States.Runtime: The JsonPath argument for the field '$.error' could not be found in the input because `$.error` is only populated by the Catch blocks on Task failures — the Choice.Default path (soft failure, where a Lambda returned {"status": "ERROR"} or an SSM command ended non-Success) never sets it. Net effect: the SNS alert never fired, the execution died with a confusing meta-error, and the real cause (the AttributeError in macro_agent) was buried under a JSONPath lookup failure. Fix: insert a dedicated Pass state for each of the 5 Check* states. Each extractor copies the relevant poll object (or the Research Lambda Payload) into `$.error` under a shape that identifies which phase failed, then falls through to the existing HandleFailure state. The Catch-based hard-failure path is unchanged — it still writes directly to `$.error` via ResultPath. State machine now has 30 states (was 25). The 5 new states are: - ExtractDataPhase1Error (CheckDataPhase1Status.Default) - ExtractRAGError (CheckRAGStatus.Default) - ExtractResearchError (CheckResearchStatus.Default) - ExtractPredictorError (CheckPredictorStatus.Default) - ExtractBacktesterError (CheckBacktesterStatus.Default) The live state machine has already been updated via `aws stepfunctions update-state-machine` so tonight's restart picks this up. This commit keeps the source of truth in the repo aligned with the live definition. Verified: every Extract state Pass has Type=Pass, Next=HandleFailure, ResultPath=$.error. HandleFailure's SNS Message template is unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
Apr 11, 2026
Tonight's pipeline rerun passed DataPhase1 + RAG + Research cleanly (validating the #13 and #15 fixes) but DataPhase2 failed with: Runtime.ImportModuleError: No module named 'ssm_secrets' Root cause: lambda/handler.py:22 does `from ssm_secrets import load_secrets` at module top-level, but the Dockerfile never copied ssm_secrets.py into the image. The old pre-#14 image was apparently built manually at a time when this import didn't exist, so it ran fine. #14's GitHub Actions auto-deploy rebuilt the image from the current Dockerfile, published version 2, and flipped the `live` alias before the canary step failed on an unrelated IAM permission (see follow-up note below) — so the alias is now stuck on the broken v2. Fix: add `COPY ssm_secrets.py ${LAMBDA_TASK_ROOT}/` to the Dockerfile. Audited lambda/handler.py + collectors/alternative.py import graphs via AST walk: the only first-party modules pulled in at handler import time are `collectors` (already copied) and `ssm_secrets` (now copied). ssm_secrets.py itself only imports stdlib (logging, os), so it doesn't transitively pull anything else. Also added `ssm_secrets.py` and `lambda/**` to the deploy workflow's paths filter — both were missing, which means changes to handler.py or the secrets loader would silently not retrigger a Lambda rebuild. Known follow-up (NOT in scope for this PR): the GitHub Actions OIDC role `github-actions-lambda-deploy` lacks `lambda:InvokeFunction` + `lambda:UpdateAlias` on `alpha-engine-data-collector:live`. That's why infrastructure/deploy.sh's canary step fails after every successful build, and why the post-canary rollback also fails silently — leaving the alias stranded on whatever version just got published. Two options: (a) add those perms to the OIDC role, or (b) short-circuit the canary step in CI. Separate infrastructure PR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Tonight's Saturday Step Function failure surfaced two bugs stacked. The first —
alpha-engine-datawriting"breadth": nullintomacro.json→ researchmacro_agentcrashing withAttributeError— was fixed in #13.This PR fixes the second: when any
Check*Choice state routed through itsDefaultedge toHandleFailure,HandleFailure's Message template crashed the execution with:Because
$.erroris only populated by theCatchblocks on Task failures — theChoice.Defaultpath (soft failure, where a Lambda returned{"status": "ERROR"}or an SSM command ended non-Success) never sets it. Net effect:HandleFailurerunning successfully, and that state crashed before it could publish.States.Runtimemeta-error instead of the real cause.AttributeErrorinmacro_agent) was buried under a JSONPath lookup failure, making tonight's triage take longer than it should have.Fix
Insert a dedicated
Passextractor for each of the 5Check*states. Each extractor copies the relevant poll object (or the Research Lambda Payload) into$.errorunder a shape that identifies which phase failed, then falls through to the existingHandleFailure. TheCatch-based hard-failure path is unchanged — it still writes directly to$.errorviaResultPath, so both paths converge on the same shape.State machine grew from 25 → 30 states. The 5 new states:
CheckDataPhase1Status.DefaultExtractDataPhase1Error$.data_phase1_pollCheckRAGStatus.DefaultExtractRAGError$.rag_pollCheckResearchStatus.DefaultExtractResearchError$.research_result.PayloadCheckPredictorStatus.DefaultExtractPredictorError$.predictor_pollCheckBacktesterStatus.DefaultExtractBacktesterError$.backtester_pollAlready applied to the live state machine
I ran `aws stepfunctions update-state-machine --definition file://infrastructure/step_function.json` before opening this PR so tonight's restart picks up the fix:
Verified via `describe-state-machine` that the live definition has 30 states, all 5 Extract* states present, and `CheckResearchStatus.Default` now points to `ExtractResearchError`. This commit keeps the repo source of truth aligned with the live definition.
Test plan
Out of scope
🤖 Generated with Claude Code