fix(eventbridge): CFN-canonical cron rules + target-uniqueness chokepoint (closes 2026-05-26 dup-target incident) #322

Merged: cipher813 merged 1 commit into main from fix/eb-target-single-sot-uniqueness-chokepoint on May 26, 2026

@cipher813 (Owner)

Summary

  • Root cause of the 2026-05-26 weekday-pipeline failure: PR #317 (feat(eventbridge): Option-D pipeline_role tags on all 3 SF cron rules) added pipeline_role tagging in BOTH the CFN template AND the deploy scripts, with different target IDs. EventBridge couldn't dedupe → alpha-engine-weekday shipped with TWO targets pointing to the same SF → every weekday cron fanned to 2 parallel SF executions → both spawned MorningEnrich on the same trading instance → 321 unique-symbol E_NON_INCREASING_INDEX_VERSION (5090) races in daily_append's update_batch/write_batch → 5%-threshold gate hard-failed both at 35.6% (n_err=322/905). alpha-engine-saturday had the identical defect (first exposure Sat 5/30).
  • Fix: EB rules + targets are now CFN-canonical (deploy scripts no longer call put-rule / put-targets). enable_standalone_scanner=true migrated from script to CFN saturday target (was script-only — re-deploying CFN would have silently reverted L1995 Phase 3).
  • Substrate gates (the chokepoint that PR #317 lacked):
    • TestDeployScriptsHaveNoEventBridgeWrites: both deploy scripts must not contain executable aws events put-rule / put-targets lines
    • TestCFNTargetUniqueness: each cron-triggered AWS::Events::Rule has exactly 1 entry under Targets:
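
As a rough illustration of the first gate, the check can be approximated by scanning each deploy script for non-comment lines that invoke an EventBridge write. This is a minimal sketch, not the repo's actual test: the function name `executable_eb_writes` and the comment-stripping rule are assumptions made for the example.

```python
import re

# Matches an `aws events put-rule` or `aws events put-targets` invocation.
_EB_WRITE = re.compile(r"\baws\s+events\s+put-(rule|targets)\b")


def executable_eb_writes(script_text: str) -> list:
    """Return the non-comment lines that invoke an EventBridge write.

    Hypothetical helper: whole-line comments are skipped, and a trailing
    " # ..." comment is dropped naively (a '#' inside quotes is not handled).
    """
    hits = []
    for raw in script_text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # documentation, not executable
        code = line.split(" #", 1)[0]
        if _EB_WRITE.search(code):
            hits.append(raw)
    return hits


sample_script = """\
# aws events put-rule --name example   (kept as documentation only)
aws cloudformation deploy --template-file example.yaml
  aws events put-targets --rule example --targets file://targets.json
"""
offending = executable_eb_writes(sample_script)
```

A test built on this would assert that `offending` is empty for every deploy script, so a reintroduced `put-targets` call fails CI instead of creating a second source of truth.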

Operational state at PR-open time

  • Duplicate live targets removed manually via aws events remove-targets (Id=weekday-pipeline + Id=saturday-pipeline) — Wed 5/27 + Sat 5/30 will fire correctly without waiting for this PR to merge.
  • Today's failed weekday SF redriven via operator-launched execution (pipeline_role=recovery).
  • EB IAM role bootstrap (trust policy + create-role) kept in the saturday deploy script so a fresh region/account can still bootstrap via that script alone. Inline policy remains codified in alpha-engine repo's infrastructure/iam/.

Follow-up (filed at session wind-down)

  • P1: SF-side MutualExclusionGuard (DynamoDB conditional PUT) so any future duplicate trigger (operator double-paste, EB bug, cross-region replay) hard-fails before any SSM SendCommand.
  • P1: alpha-engine-lib producer-side universe-writer lock (S3 conditional PUT) — covers manual python -m builders.daily_append runs that don't go through SF.
  • P0 retrospective: PR #317 audit. The existing TestEventBridgeInput chokepoint validated the contents of each target's input but not how many targets each rule had. Pattern to watch: when tests pin SoT-A content, also pin SoT-A uniqueness.

Test plan

  • pytest tests/test_deploy_step_function_eventbridge_input.py (11 passed)
  • Full suite: 1539 passed, 1 skipped
  • Live EB state confirmed single-target on both alpha-engine-weekday and alpha-engine-saturday
  • Today's recovery SF in flight (started 12:29 UTC, monitoring) — confirms duplicate-target removal unblocks the path
  • After merge, operator runs aws cloudformation deploy --template-file infrastructure/cloudformation/alpha-engine-orchestration.yaml ... to re-converge the CFN-managed state with the live state (idempotent — current live Id="1" gets renamed to "weekday-pipeline"/"saturday-pipeline" per the CFN template; behavior unchanged)
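
The live single-target confirmation above could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the client is passed in so the logic can be exercised without AWS access. With boto3, the client would be `boto3.client("events")`, whose `list_targets_by_rule` call returns a `Targets` list and an optional `NextToken`.

```python
def count_targets(events_client, rule_name):
    """Count targets on one rule, following NextToken pagination."""
    total, token = 0, None
    while True:
        kwargs = {"Rule": rule_name}
        if token:
            kwargs["NextToken"] = token
        page = events_client.list_targets_by_rule(**kwargs)
        total += len(page.get("Targets", []))
        token = page.get("NextToken")
        if not token:
            return total


def rules_not_single_target(events_client, rule_names):
    """Return {rule: count} for every rule whose target count is not 1."""
    counts = {name: count_targets(events_client, name) for name in rule_names}
    return {name: n for name, n in counts.items() if n != 1}
```

Run against `["alpha-engine-weekday", "alpha-engine-saturday"]`, an empty result would indicate that both rules are back to exactly one target.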

🤖 Generated with Claude Code

…add target-uniqueness chokepoint test (closes 2026-05-26 duplicate-target incident)

Root cause of the 2026-05-26 weekday-pipeline failure:

PR #317 (33c3753, 2026-05-25 evening) added pipeline_role tagging to
both the CFN template AND the deploy scripts, with different target
IDs (Id=1 from scripts, Id=<rule>-pipeline from CFN). EventBridge
couldn't dedupe — different IDs meant different targets — so the
alpha-engine-weekday rule shipped with TWO targets pointing at the
SAME state machine. Every weekday cron firing fanned to two parallel
SF executions; both ran MorningEnrich on the trading instance; both
opened ArcticDB and reached daily_append's update_batch/write_batch;
the C++ engine emitted 321 unique-symbol E_NON_INCREASING_INDEX_VERSION
(code 5090) races; the 5%-threshold gate hard-failed both runs at
35.6% (n_err=322 of 905). Saturday rule had the identical defect
(first exposure Sat 5/30).
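
To make the failure mode concrete, here is a toy in-memory model of a rule's target set. It is only an illustration of the behavior described above (targets are keyed by Id, so two writes with different Ids yield two targets); the class name, the placeholder ARN, and the strict `>` comparison in the gate are assumptions, not the real service or pipeline code.

```python
class RuleTargets:
    """Toy model of one EventBridge rule's targets, keyed by target Id."""

    def __init__(self):
        self._by_id = {}

    def put_targets(self, targets):
        # Re-putting an existing Id overwrites it; a new Id adds a target.
        for target in targets:
            self._by_id[target["Id"]] = target

    def arns(self):
        return [t["Arn"] for t in self._by_id.values()]


# Placeholder ARN for illustration only.
SF_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:example"

rule = RuleTargets()
rule.put_targets([{"Id": "1", "Arn": SF_ARN}])                 # deploy script write
rule.put_targets([{"Id": "weekday-pipeline", "Arn": SF_ARN}])  # CFN template write
executions_per_cron_fire = len(rule.arns())  # both targets point at the same SF

# The error-rate gate from the incident: 322 errors out of 905 symbols.
error_rate_pct = round(100 * 322 / 905, 1)
gate_failed = error_rate_pct > 5.0  # assumes the gate compares with a strict '>'
```

Had both writers used the same Id, the second `put_targets` would have overwritten the first and the rule would have kept a single target.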

This PR makes EventBridge rules + targets CFN-canonical (deploy
scripts no longer call put-rule or put-targets), migrates the
enable_standalone_scanner=true flag onto the CFN saturday target
(was script-only — re-deploying CFN would have silently reverted
L1995 Phase 3), and adds two substrate gates:

  - TestDeployScriptsHaveNoEventBridgeWrites: both deploy scripts
    must not contain executable `aws events put-rule` or
    `aws events put-targets` lines. Closes the dual-write path that
    enabled the bug.
  - TestCFNTargetUniqueness: each cron-triggered AWS::Events::Rule
    in the orchestration CFN has exactly 1 entry under Targets:.
    PR #317's existing tests validated input contents but not target
    count; this is the gap that let the defect through CI.
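
A uniqueness gate of this shape might look like the sketch below. It assumes the template has already been parsed into a plain dict (loading CFN YAML with short-form intrinsic tags is out of scope here) and treats any AWS::Events::Rule with a ScheduleExpression as cron-triggered; the function name and the sample logical IDs are hypothetical.

```python
def rules_with_bad_target_count(template, expected=1):
    """Map logical ID -> target count for scheduled rules whose count != expected."""
    bad = {}
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::Events::Rule":
            continue
        props = resource.get("Properties", {})
        if "ScheduleExpression" not in props:
            continue  # only cron/rate-triggered rules are in scope
        count = len(props.get("Targets", []))
        if count != expected:
            bad[logical_id] = count
    return bad


# Hypothetical template fragment reproducing the duplicate-target shape.
sample_template = {
    "Resources": {
        "ExampleWeekdayRule": {
            "Type": "AWS::Events::Rule",
            "Properties": {
                "ScheduleExpression": "cron(0 12 ? * MON-FRI *)",
                "Targets": [{"Id": "1"}, {"Id": "weekday-pipeline"}],
            },
        },
        "ExampleSaturdayRule": {
            "Type": "AWS::Events::Rule",
            "Properties": {
                "ScheduleExpression": "cron(0 12 ? * SAT *)",
                "Targets": [{"Id": "saturday-pipeline"}],
            },
        },
    }
}
violations = rules_with_bad_target_count(sample_template)
```

A content-only test would pass on `ExampleWeekdayRule` as long as each target's input looked right; the count check is what flags it.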

Operational state at PR-open time:
  - Duplicate targets removed manually via `aws events remove-targets`
    (Id=weekday-pipeline + Id=saturday-pipeline) so Wed 5/27 + Sat 5/30
    fire correctly without waiting for this PR to merge.
  - Today's failed weekday SF redriven via operator-launched execution
    (pipeline_role=recovery).
  - EB IAM role bootstrap (trust policy + create-role) kept in the
    saturday deploy script so a fresh region/account can still
    bootstrap via that script alone. Inline policy remains codified
    in alpha-engine repo's infrastructure/iam/.

Follow-up (filed at wind-down):
  - P1: SF-side MutualExclusionGuard (DynamoDB conditional PUT) so
    any future duplicate trigger (operator double-paste, EB bug,
    cross-region replay) hard-fails before any SSM SendCommand.
  - P1: alpha-engine-lib producer-side universe-writer lock
    (S3 conditional PUT) — covers manual `python -m builders.daily_append`
    runs that don't go through SF.

Test: `pytest tests/test_deploy_step_function_eventbridge_input.py`
(11 passed) + full suite (1539 passed, 1 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 merged commit 0024738 into main on May 26, 2026
1 check passed
cipher813 deleted the fix/eb-target-single-sot-uniqueness-chokepoint branch on May 26, 2026 13:57
cipher813 added a commit that referenced this pull request May 27, 2026
…oint module (#328)

Closes the remaining 6 findings from the 2026-05-27 wider L302 audit
(P0 retrospective on PR #317's content-vs-uniqueness CI gap that
caused the 2026-05-26 dup-EB-target trading-day miss). PR #322 closed
the EventBridge target instance; this closes the meta-pattern across
the rest of this repo's CI: tests pin WHAT was put, not HOW MANY were
put or whether anything ELSE was put.

New module tests/test_sf_payload_uniqueness.py with 29 tests across
6 classes — one per audit finding:

| Class | L302 Finding | What it closes |
|---|---|---|
| TestSaturdaySFPayloadFieldSetsClosed | F2 eval_judge_wiring + F4 aggregate_costs + cross-cutting | Lambda Payload field-set drift across all 14 Saturday SF Lambda calls |
| TestWeekdaySFPayloadFieldSetsClosed | (extension) | Lambda Payload field-set drift across 6 weekday SF Lambda calls |
| TestSFRoleInvokeFunctionStatementCount | F3 iam_lambda_grants | exactly 1 lambda:InvokeFunction Statement in alpha-engine-step-functions-role.json (catches stale overlapping ARN statements from pre-2026 refactors) |
| TestWeekdaySSMFlowDoctorOrdering | F5 ssm_pipefail_wiring | FLOW_DOCTOR_ENABLED=1 in first 3 commands (closes 2026-05-11 ordering-incident recurrence path) |
| TestEODSFTopLevelFieldsClosed | F6 eod_substrate_check_wiring | top-level $.field namespace closed across input + intermediate ResultPath fields (catches silent collisions) |
| TestSaturdaySFSpotStateCount | F7 friday_shell_run_wiring | exactly 8 spot-launching states (catches orphaned legacy state from incomplete refactor) |

Shape per surface: pin a closed registry of expected keys/states,
fail loud when actual diverges. Mirrors PR #322's
TestCFNTargetUniqueness pattern; same chokepoint shape applied to
6 more surfaces.
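
The closed-registry shape described above can be sketched in a few lines. The helper names, the example state, and its Payload keys are all hypothetical; the point is only that the test pins the full expected set and reports both missing and unexpected members.

```python
def registry_drift(actual, expected):
    """Compare an observed key set with the pinned registry."""
    actual, expected = set(actual), set(expected)
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }


def assert_closed(surface, actual, expected):
    """Fail loudly when the observed set diverges in either direction."""
    drift = registry_drift(actual, expected)
    if drift["missing"] or drift["unexpected"]:
        raise AssertionError(f"{surface}: drift from pinned registry: {drift}")


# Hypothetical Lambda Task state and its pinned Payload field set.
example_state = {
    "Type": "Task",
    "Parameters": {
        "FunctionName": "example-fn",
        "Payload": {"run_date.$": "$.run_date", "pipeline_role.$": "$.pipeline_role"},
    },
}
EXPECTED_PAYLOAD_KEYS = {"run_date.$", "pipeline_role.$"}
assert_closed("example-fn Payload", example_state["Parameters"]["Payload"], EXPECTED_PAYLOAD_KEYS)
```

Because the registry is closed, adding a field to the Payload without updating the pinned set fails just as loudly as removing one.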

Suite: 1567 → 1596 passed (+29 net).

Composes with #322, [[reference-eventbridge-target-uniqueness-invariant]],
[[feedback-mocked-tests-dont-validate-external-api-contract]],
[[feedback-audit-findings-become-roadmap-followups]] and the L302
P0-retrospective entry in ROADMAP.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 27, 2026
…332)

* feat(sf): L274 SF MutualExclusionGuard via DynamoDB conditional PUT

Closes, at the SF entry point, the same architectural hole that the
L286a/b producer-side universe_writer_lock covers for manual
daily_append invocations.
That lock guards the data writer; this mutex guards the SF entry
point. Together they cover both halves of single-writer-per-
cadence-cell.

Trigger of record: 2026-05-26 duplicate-EventBridge-target
incident (PR #322 closed the SPECIFIC CFN-vs-deploy-script
shape via CI; this mutex closes the broader class — operator
double-paste, EventBridge internal retry-on-throttle, cross-
region replay coincidence, any future shape the CI chokepoint
misses).

Design:
- CFN adds ExecutionMutexTable (PAY_PER_REQUEST, mutex_key PK
  only, intentionally no TTL — minute-bucket key encodes its
  own staleness window).
- IAM adds dynamodb:PutItem to alpha-engine-step-functions-role
  scoped to the mutex table ARN (no DeleteItem — design has no
  release path).
- Each SF (Saturday, weekday, EOD) inserts a 3-state mutex chain
  before its existing first state: CheckMutexRole (Choice over
  $.pipeline_role allowlist {daily, weekly, eod, shell-run}) →
  AcquireMutex (DynamoDB.PutItem with ConditionExpression
  attribute_not_exists) → MutexConflict (Fail). The
  ConditionalCheckFailedException Catch routes to MutexConflict;
  the States.ALL Catch fails OPEN to the former first state
  per [[feedback_no_silent_fails]] secondary-observability
  carve-out (DDB outage must not block trading).
- mutex_key = '{state-machine-name}#{pipeline_role}#{YYYY-MM-DDTHH:MM}'
  (UTC minute bucket from $$.Execution.StartTime).
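
The key format and the conditional-PUT semantics above can be modeled without AWS. The sketch below is an assumption-laden illustration: `FakeMutexTable` stands in for the DynamoDB table, `acquire` mirrors the Catch routing in prose, and the state-machine name `example-sf` is a placeholder.

```python
from datetime import datetime, timezone


def mutex_key(state_machine, pipeline_role, start_time):
    """Build '{state-machine-name}#{pipeline_role}#{YYYY-MM-DDTHH:MM}' in UTC."""
    # Execution start times look like "2026-05-26T12:29:03.123Z".
    ts = datetime.fromisoformat(start_time.replace("Z", "+00:00"))
    bucket = ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M")
    return f"{state_machine}#{pipeline_role}#{bucket}"


class ConditionalCheckFailed(Exception):
    """Stand-in for the conditional-write failure raised on a duplicate key."""


class FakeMutexTable:
    """In-memory stand-in for a PutItem guarded by attribute_not_exists."""

    def __init__(self):
        self._keys = set()

    def put_if_absent(self, key):
        if key in self._keys:
            raise ConditionalCheckFailed(key)
        self._keys.add(key)


def acquire(table, key):
    """Return the next step: 'proceed' or 'MutexConflict'."""
    try:
        table.put_if_absent(key)
    except ConditionalCheckFailed:
        return "MutexConflict"  # duplicate trigger: hard-fail
    except Exception:
        return "proceed"  # any other error fails open
    return "proceed"


table = FakeMutexTable()
first = acquire(table, mutex_key("example-sf", "daily", "2026-05-26T12:29:03.123Z"))
second = acquire(table, mutex_key("example-sf", "daily", "2026-05-26T12:29:41.500Z"))
```

Under this model, two executions that start within the same UTC minute share a key, so the second one conflicts. Two executions whose start times fall on either side of a minute boundary would get different keys.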

Deploy ordering (CRITICAL):
1. CFN apply (creates DynamoDB table) — REQUIRED FIRST.
2. infrastructure/iam/apply.sh alpha-engine-step-functions-role
   (grants dynamodb:PutItem on the table).
3. infrastructure/deploy_step_function.sh +
   deploy_step_function_daily.sh + update_eod_pipeline_sf.sh
   (uploads new SF JSONs that reference the table + IAM grant).

Out of order = the next cron-fired cadence SF hits AccessDenied or
ResourceNotFoundException at AcquireMutex. (The fail-open States.ALL
Catch means cadence runs still proceed, but the CW alarm noise is
avoidable.)

Tests: 48 new in test_sf_mutex_wiring.py covering states
present + wiring + allowlist + Catch behavior + CFN table
schema + IAM grant scope. Three pre-existing tests updated to
account for the InitializeInput.Next reroute
(test_sf_friday_shell_run_wiring, test_sf_morning_enrich_split_
wiring, test_sf_payload_uniqueness EOD field registry).
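
A wiring check in the spirit of these tests could look like the sketch below. It is not the contents of test_sf_mutex_wiring.py: the helper name, the `FormerFirstState` placeholder, the table name, the Choice Default, and the exact `DynamoDB.ConditionalCheckFailedException` error string are assumptions for illustration.

```python
def mutex_wiring_problems(asl, former_first):
    """Return a list of wiring problems; an empty list means the chain looks right."""
    states = asl.get("States", {})
    expected_types = {"CheckMutexRole": "Choice", "AcquireMutex": "Task", "MutexConflict": "Fail"}
    problems = [
        f"{name}: expected Type {typ}, got {states.get(name, {}).get('Type')}"
        for name, typ in expected_types.items()
        if states.get(name, {}).get("Type") != typ
    ]
    if problems:
        return problems

    acquire = states["AcquireMutex"]
    if acquire.get("Next") != former_first:
        problems.append("AcquireMutex.Next must be the former first state")

    # Flatten the Catch list into {error name: next state}.
    routes = {}
    for catcher in acquire.get("Catch", []):
        for err in catcher.get("ErrorEquals", []):
            routes[err] = catcher.get("Next")
    conflict = [nxt for err, nxt in routes.items() if err.endswith("ConditionalCheckFailedException")]
    if conflict != ["MutexConflict"]:
        problems.append("conflict Catch must route to MutexConflict")
    if routes.get("States.ALL") != former_first:
        problems.append("States.ALL Catch must fail open to the former first state")
    return problems


# Minimal hypothetical state machine containing only the mutex chain.
SAMPLE_ASL = {
    "StartAt": "CheckMutexRole",
    "States": {
        "CheckMutexRole": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.pipeline_role", "StringEquals": "daily", "Next": "AcquireMutex"}],
            "Default": "FormerFirstState",  # assumption: other roles bypass the mutex
        },
        "AcquireMutex": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:putItem",
            "Parameters": {
                "TableName": "example-mutex-table",
                "Item": {"mutex_key": {"S.$": "$.mutex_key"}},
                "ConditionExpression": "attribute_not_exists(mutex_key)",
            },
            "ResultPath": None,
            "Catch": [
                {"ErrorEquals": ["DynamoDB.ConditionalCheckFailedException"], "Next": "MutexConflict"},
                {"ErrorEquals": ["States.ALL"], "Next": "FormerFirstState"},
            ],
            "Next": "FormerFirstState",
        },
        "MutexConflict": {"Type": "Fail", "Error": "MutexConflict"},
        "FormerFirstState": {"Type": "Pass", "End": True},
    },
}
sample_problems = mutex_wiring_problems(SAMPLE_ASL, "FormerFirstState")
```

Checking both Catch routes separately matters here: the conflict route must hard-fail, while the catch-all must fall through so a table outage cannot block a cadence run.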

Suite: 1626 → 1674 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger drift-check after operator-applied IAM grant

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>