fix(eventbridge): CFN-canonical cron rules + target-uniqueness chokepoint (closes 2026-05-26 dup-target incident) #322
Merged
…add target-uniqueness chokepoint test (closes 2026-05-26 duplicate-target incident)

Root cause of the 2026-05-26 weekday-pipeline failure: PR #317 (33c3753, 2026-05-25 evening) added pipeline_role tagging to both the CFN template AND the deploy scripts, with different target IDs (Id=1 from scripts, Id=<rule>-pipeline from CFN). EventBridge couldn't dedupe (different IDs meant different targets), so the alpha-engine-weekday rule shipped with TWO targets pointing at the SAME state machine. Every weekday cron firing fanned to two parallel SF executions; both ran MorningEnrich on the trading instance; both opened ArcticDB and reached daily_append's update_batch/write_batch; the C++ engine emitted 321 unique-symbol E_NON_INCREASING_INDEX_VERSION (code 5090) races; the 5%-threshold gate hard-failed both runs at 35.6% (n_err=322 of 905). The Saturday rule had the identical defect (first exposure Sat 5/30).

This PR makes EventBridge rules + targets CFN-canonical (deploy scripts no longer call put-rule or put-targets), migrates the enable_standalone_scanner=true flag onto the CFN saturday target (was script-only; re-deploying CFN would have silently reverted L1995 Phase 3), and adds two substrate gates:

- TestDeployScriptsHaveNoEventBridgeWrites: both deploy scripts must not contain executable `aws events put-rule` or `aws events put-targets` lines. Closes the dual-write path that enabled the bug.
- TestCFNTargetUniqueness: each cron-triggered AWS::Events::Rule in the orchestration CFN has exactly 1 entry under Targets:. PR #317's existing tests validated input contents but not target count; this is the gap that let the defect through CI.

Operational state at PR-open time:

- Duplicate targets removed manually via `aws events remove-targets` (Id=weekday-pipeline + Id=saturday-pipeline) so Wed 5/27 + Sat 5/30 fire correctly without waiting for this PR to merge.
- Today's failed weekday SF redriven via operator-launched execution (pipeline_role=recovery).
- EB IAM role bootstrap (trust policy + create-role) kept in the saturday deploy script so a fresh region/account can still bootstrap via that script alone. Inline policy remains codified in alpha-engine repo's infrastructure/iam/.

Follow-up (filed at wind-down):

- P1: SF-side MutualExclusionGuard (DynamoDB conditional PUT) so any future duplicate trigger (operator double-paste, EB bug, cross-region replay) hard-fails before any SSM SendCommand.
- P1: alpha-engine-lib producer-side universe-writer lock (S3 conditional PUT), which covers manual `python -m builders.daily_append` runs that don't go through SF.

Test: `pytest tests/test_deploy_step_function_eventbridge_input.py` (11 passed) + full suite (1539 passed, 1 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
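The target-count gate described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the repo's actual `TestCFNTargetUniqueness`: it assumes the template has already been parsed into a plain dict, and the function name, logical IDs, schedule expression, and ARNs are all hypothetical.

```python
from typing import Any


def cron_rules_with_wrong_target_count(template: dict[str, Any]) -> dict[str, int]:
    """Return {logical_id: target_count} for every scheduled rule whose
    Targets list does not contain exactly one entry."""
    offenders: dict[str, int] = {}
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::Events::Rule":
            continue
        props = resource.get("Properties", {})
        # Only cron/rate-triggered rules are in scope; event-pattern rules are skipped.
        if "ScheduleExpression" not in props:
            continue
        count = len(props.get("Targets", []))
        if count != 1:
            offenders[logical_id] = count
    return offenders


# Hypothetical parsed template: one rule with a duplicated target (which the
# gate should reject) and one rule with a single target (which should pass).
template = {
    "Resources": {
        "WeekdayRule": {
            "Type": "AWS::Events::Rule",
            "Properties": {
                "ScheduleExpression": "cron(0 13 ? * MON-FRI *)",
                "Targets": [
                    {"Id": "1", "Arn": "arn:aws:states:example"},
                    {"Id": "weekday-pipeline", "Arn": "arn:aws:states:example"},
                ],
            },
        },
        "SaturdayRule": {
            "Type": "AWS::Events::Rule",
            "Properties": {
                "ScheduleExpression": "cron(0 13 ? * SAT *)",
                "Targets": [{"Id": "saturday-pipeline", "Arn": "arn:aws:states:example"}],
            },
        },
    }
}

offenders = cron_rules_with_wrong_target_count(template)
print(offenders)
```

A check like this only sees the template. In the incident the second target came from the deploy scripts, which is why the commit pairs it with a gate that forbids EventBridge writes in those scripts.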
cipher813 added a commit that referenced this pull request May 27, 2026
…oint module (#328)

Closes the remaining 6 findings from the 2026-05-27 wider L302 audit (P0 retrospective on PR #317's content-vs-uniqueness CI gap that caused the 2026-05-26 dup-EB-target trading-day miss). PR #322 closed the EventBridge target instance; this closes the meta-pattern across the rest of this repo's CI: tests pin WHAT was put, not HOW MANY were put or whether anything ELSE was put.

New module tests/test_sf_payload_uniqueness.py with 29 tests across 6 classes, one per audit finding:

| Class | L302 Finding | What it closes |
|---|---|---|
| TestSaturdaySFPayloadFieldSetsClosed | F2 eval_judge_wiring + F4 aggregate_costs + cross-cutting | Lambda Payload field-set drift across all 14 Saturday SF Lambda calls |
| TestWeekdaySFPayloadFieldSetsClosed | (extension) | Lambda Payload field-set drift across 6 weekday SF Lambda calls |
| TestSFRoleInvokeFunctionStatementCount | F3 iam_lambda_grants | exactly 1 lambda:InvokeFunction Statement in alpha-engine-step-functions-role.json (catches stale overlapping ARN statements from pre-2026 refactors) |
| TestWeekdaySSMFlowDoctorOrdering | F5 ssm_pipefail_wiring | FLOW_DOCTOR_ENABLED=1 in first 3 commands (closes 2026-05-11 ordering-incident recurrence path) |
| TestEODSFTopLevelFieldsClosed | F6 eod_substrate_check_wiring | top-level $.field namespace closed across input + intermediate ResultPath fields (catches silent collisions) |
| TestSaturdaySFSpotStateCount | F7 friday_shell_run_wiring | exactly 8 spot-launching states (catches orphaned legacy state from incomplete refactor) |

Shape per surface: pin a closed registry of expected keys/states, fail loud when actual diverges. Mirrors PR #322's TestCFNTargetUniqueness pattern; same chokepoint shape applied to 6 more surfaces.

Suite: 1567 → 1596 passed (+29 net).

Composes with #322, [[reference-eventbridge-target-uniqueness-invariant]], [[feedback-mocked-tests-dont-validate-external-api-contract]], [[feedback-audit-findings-become-roadmap-followups]] and the L302 P0-retrospective entry in ROADMAP.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
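The "closed registry" shape this commit describes (pin the expected key set, fail loudly on any divergence) might look something like the sketch below. The function names, the registry contents, and the sample payload are hypothetical and are not taken from tests/test_sf_payload_uniqueness.py.

```python
def field_set_drift(expected: set, actual: dict) -> dict:
    """Compare a payload's keys against a closed registry of expected keys."""
    actual_keys = set(actual)
    return {
        "missing": sorted(expected - actual_keys),
        "unexpected": sorted(actual_keys - expected),
    }


def assert_field_set_closed(surface: str, expected: set, actual: dict) -> None:
    """Fail loudly if the payload has fewer OR more keys than the registry."""
    drift = field_set_drift(expected, actual)
    if drift["missing"] or drift["unexpected"]:
        raise AssertionError(f"{surface}: payload field set drifted: {drift}")


# Hypothetical registry and payload: the extra "debug" key is the kind of
# silent addition a content-only assertion would not notice.
EXPECTED_FIELDS = {"run_date", "pipeline_role"}
payload = {"run_date": "2026-05-30", "pipeline_role": "weekly", "debug": True}

drift = field_set_drift(EXPECTED_FIELDS, payload)
print(drift)
```

Checking both directions is the point of the pattern: a test that only asserts the presence of expected keys passes even when something else has been added alongside them.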
5 tasks
cipher813 added a commit that referenced this pull request May 27, 2026
…332)

* feat(sf): L274 SF MutualExclusionGuard via DynamoDB conditional PUT

Closes the architectural hole the L286a/b producer-side universe_writer_lock covers for manual daily_append invocations. That lock guards the data writer; this mutex guards the SF entry point. Together they cover both halves of single-writer-per-cadence-cell.

Trigger of record: 2026-05-26 duplicate-EventBridge-target incident (PR #322 closed the SPECIFIC CFN-vs-deploy-script shape via CI; this mutex closes the broader class: operator double-paste, EventBridge internal retry-on-throttle, cross-region replay coincidence, any future shape the CI chokepoint misses).

Design:

- CFN adds ExecutionMutexTable (PAY_PER_REQUEST, mutex_key PK only, intentionally no TTL; minute-bucket key encodes its own staleness window).
- IAM adds dynamodb:PutItem to alpha-engine-step-functions-role scoped to the mutex table ARN (no DeleteItem; design has no release path).
- Each SF (Saturday, weekday, EOD) inserts a 3-state mutex chain before its existing first state: CheckMutexRole (Choice over $.pipeline_role allowlist {daily, weekly, eod, shell-run}) → AcquireMutex (DynamoDB.PutItem with ConditionExpression attribute_not_exists) → MutexConflict (Fail). The ConditionalCheckFailedException Catch routes to MutexConflict; the States.ALL Catch fails OPEN to the former first state per [[feedback_no_silent_fails]] secondary-observability carve-out (DDB outage must not block trading).
- mutex_key = '{state-machine-name}#{pipeline_role}#{YYYY-MM-DDTHH:MM}' (UTC minute bucket from $$.Execution.StartTime).

Deploy ordering (CRITICAL):

1. CFN apply (creates DynamoDB table). REQUIRED FIRST.
2. infrastructure/iam/apply.sh alpha-engine-step-functions-role (grants dynamodb:PutItem on the table).
3. infrastructure/deploy_step_function.sh + deploy_step_function_daily.sh + update_eod_pipeline_sf.sh (uploads new SF JSONs that reference the table + IAM grant).

Out of order = next cron-fired cadence SF fails AccessDenied or ResourceNotFoundException at AcquireMutex. (Fail-open Catch catches the States.ALL case so cadence runs still proceed, but the CW alarm noise is avoidable.)

Tests: 48 new in test_sf_mutex_wiring.py covering states present + wiring + allowlist + Catch behavior + CFN table schema + IAM grant scope. Three pre-existing tests updated to account for the InitializeInput.Next reroute (test_sf_friday_shell_run_wiring, test_sf_morning_enrich_split_wiring, test_sf_payload_uniqueness EOD field registry).

Suite: 1626 → 1674 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger drift-check after operator-applied IAM grant

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
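The mutex design above (minute-bucket key, conditional PUT, conflict on duplicate, fail-open on anything else) can be modelled in a few lines. This sketch stands in an in-memory dict for the DynamoDB table so it runs without AWS access. The class and function names, the state machine name, and the timestamps are hypothetical, and letting non-allowlisted roles bypass the mutex is an assumption about what the Choice state's default branch does.

```python
from datetime import datetime, timezone

# Allowlist quoted from the commit message.
MUTEX_ROLES = {"daily", "weekly", "eod", "shell-run"}


class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""


class InMemoryMutexTable:
    """Models PutItem with ConditionExpression attribute_not_exists(mutex_key)."""

    def __init__(self) -> None:
        self._items: dict = {}

    def put_if_absent(self, mutex_key: str) -> None:
        if mutex_key in self._items:
            raise ConditionalCheckFailed(mutex_key)
        self._items[mutex_key] = {"mutex_key": mutex_key}


def build_mutex_key(state_machine_name: str, pipeline_role: str, start_time_iso: str) -> str:
    """'{state-machine-name}#{pipeline_role}#{YYYY-MM-DDTHH:MM}' in UTC."""
    started = datetime.fromisoformat(start_time_iso.replace("Z", "+00:00"))
    started = started.astimezone(timezone.utc)
    return f"{state_machine_name}#{pipeline_role}#{started:%Y-%m-%dT%H:%M}"


def enter_pipeline(table, state_machine_name: str, pipeline_role: str, start_time_iso: str) -> str:
    """Return 'proceed' or 'conflict' for one execution attempt."""
    if pipeline_role not in MUTEX_ROLES:
        return "proceed"  # assumption: roles outside the allowlist skip the mutex
    key = build_mutex_key(state_machine_name, pipeline_role, start_time_iso)
    try:
        table.put_if_absent(key)
    except ConditionalCheckFailed:
        return "conflict"  # duplicate trigger in the same minute bucket
    except Exception:
        return "proceed"  # fail open: a table outage must not block the run
    return "proceed"


table = InMemoryMutexTable()
key = build_mutex_key("example-weekday-sf", "daily", "2026-05-26T13:05:02Z")
first = enter_pipeline(table, "example-weekday-sf", "daily", "2026-05-26T13:05:02Z")
second = enter_pipeline(table, "example-weekday-sf", "daily", "2026-05-26T13:05:41Z")
print(key, first, second)
```

Under this keying, two starts that straddle a minute boundary would produce different keys, so the guard as sketched here only collapses duplicates whose start times fall in the same UTC minute.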
Summary

- PR #317 added `pipeline_role` tagging in BOTH the CFN template AND the deploy scripts with different target IDs. EventBridge couldn't dedupe → `alpha-engine-weekday` shipped with TWO targets pointing to the same SF → every weekday cron fanned to 2 parallel SF executions → both spawned MorningEnrich on the same trading instance → 321 unique-symbol `E_NON_INCREASING_INDEX_VERSION` (5090) races in `daily_append`'s `update_batch`/`write_batch` → 5%-threshold gate hard-failed both at 35.6% (n_err=322/905). `alpha-engine-saturday` had the identical defect (first exposure Sat 5/30).
- EventBridge rules + targets are now CFN-canonical (deploy scripts no longer call `put-rule`/`put-targets`).
- `enable_standalone_scanner=true` migrated from script to CFN saturday target (was script-only; re-deploying CFN would have silently reverted L1995 Phase 3).
- `TestDeployScriptsHaveNoEventBridgeWrites`: both deploy scripts must not contain executable `aws events put-rule`/`put-targets` lines
- `TestCFNTargetUniqueness`: each cron-triggered `AWS::Events::Rule` has exactly 1 entry under `Targets:`

Operational state at PR-open time

- Duplicate targets removed manually via `aws events remove-targets` (Id=`weekday-pipeline` + Id=`saturday-pipeline`); Wed 5/27 + Sat 5/30 will fire correctly without waiting for this PR to merge.
- Today's failed weekday SF redriven via operator-launched execution (`pipeline_role=recovery`).
- EB IAM role bootstrap (trust policy + `create-role`) kept in the saturday deploy script so a fresh region/account can still bootstrap via that script alone. Inline policy remains codified in `alpha-engine` repo's `infrastructure/iam/`.

Follow-up (filed at session wind-down)

- P1: SF-side `MutualExclusionGuard` (DynamoDB conditional PUT) so any future duplicate trigger (operator double-paste, EB bug, cross-region replay) hard-fails before any SSM `SendCommand`.
- P1: `alpha-engine-lib` producer-side universe-writer lock (S3 conditional PUT), which covers manual `python -m builders.daily_append` runs that don't go through SF.
- `TestEventBridgeInput` chokepoint validated target input contents but not target count uniqueness. Pattern to watch: when tests pin SoT-A content, also pin SoT-A uniqueness.

Test plan

- `pytest tests/test_deploy_step_function_eventbridge_input.py` (11 passed)
- `alpha-engine-weekday` and `alpha-engine-saturday`
- `aws cloudformation deploy --template-file infrastructure/cloudformation/alpha-engine-orchestration.yaml ...` to re-converge the CFN-managed state with the live state (idempotent: current live Id="1" gets renamed to "weekday-pipeline"/"saturday-pipeline" per the CFN template; behavior unchanged)

🤖 Generated with Claude Code
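For the deploy-script gate named in the summary, a check for executable `aws events put-rule` / `put-targets` lines could be approximated as below. This is a rough sketch rather than the repo's `TestDeployScriptsHaveNoEventBridgeWrites`: the function name and sample script are hypothetical, and comment handling is deliberately naive (whole-line comments and ` #` trailing comments only).

```python
import re

# Matches the two EventBridge write subcommands the gate forbids.
FORBIDDEN = re.compile(r"\baws\s+events\s+(put-rule|put-targets)\b")


def executable_eventbridge_writes(script_text: str) -> list:
    """Return (line_number, line) pairs for non-comment lines that invoke
    `aws events put-rule` or `aws events put-targets`."""
    hits = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # whole-line comment (also skips the shebang)
        code = stripped.split(" #", 1)[0]  # naive trailing-comment strip
        if FORBIDDEN.search(code):
            hits.append((lineno, stripped))
    return hits


# Hypothetical deploy script: the commented-out put-rule is allowed,
# the live put-targets call on line 4 is not.
sample_script = """\
#!/usr/bin/env bash
# aws events put-rule --name alpha-engine-weekday   (moved to CFN)
aws cloudformation deploy --template-file template.yaml
aws events put-targets --rule alpha-engine-weekday --targets file://targets.json
"""

hits = executable_eventbridge_writes(sample_script)
print(hits)
```

Skipping comments matters here because a script may legitimately keep a note about where the rule definitions moved, and a plain substring search would flag that note.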