feat(eventbridge): Option-D pipeline_role tags on all 3 SF cron rules #317

Merged
cipher813 merged 2 commits into main from feat/pipeline-role-eventbridge-cron on May 25, 2026

Conversation

@cipher813
Owner

Summary

Tags every cron-triggered SF execution with a pipeline_role input field per the Option-D execution-picker plan (Brian-approved 2026-05-25 evening; lib substrate in alpha-engine-lib#75; daemon companion in alpha-engine#214). Page 25 / Slack / CLI consumers filter by role so smoke / recovery / operator-replay executions never displace the canonical cadence run as "most recent."

  • SaturdayTrigger (cron 09:00 UTC SAT) → pipeline_role="weekly"
  • FridayShellRunTrigger (cron 20:45 UTC FRI, shipped DISABLED) → pipeline_role="shell-run"
  • WeekdayTrigger (cron 12:45 UTC MON-FRI) → pipeline_role="daily"

Dual source-of-truth coverage

Edits land in BOTH the CFN orchestration template AND the live-EventBridge-applied deploy scripts so neither path goes stale:

Source Saturday Weekday Friday shell-run
infrastructure/cloudformation/alpha-engine-orchestration.yaml
infrastructure/deploy_step_function.sh
infrastructure/deploy_step_function_daily.sh

Naming convention

| pipeline_role | Triggered by | Page 25 default |
|---|---|---|
| "weekly" | SaturdayTrigger EventBridge cron | shown |
| "daily" | WeekdayTrigger EventBridge cron | shown |
| "eod" | daemon._trigger_eod_pipeline | shown |
| "shell-run" | FridayShellRunTrigger (DISABLED) | hidden by default |
| "smoke" | operator smoke / debug | hidden by default |
| "recovery" | operator fix-and-rerun | hidden by default |
| "backfill" | operator historical backfill | hidden by default |
| "operator-replay" | operator ad-hoc replay (catch-all) | hidden by default |

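How a consumer might apply the naming convention can be sketched as follows. This is a minimal illustration, not the Page 25 / Slack / CLI implementation: the `latest_canonical` helper, the execution record shape (`name`, `start`, `input`), and the sample data are all assumptions made up for the example. Only the role names and their shown/hidden defaults come from the naming convention above.

```python
from datetime import datetime, timezone

# Roles shown by default, per the naming convention in this PR.
CANONICAL_ROLES = frozenset({"weekly", "daily", "eod"})


def latest_canonical(executions, roles=CANONICAL_ROLES):
    """Return the most recently started execution with a canonical role.

    `executions` is a hypothetical list of dicts with `start` (datetime)
    and `input` (the parsed execution input). Returns None if nothing matches.
    """
    candidates = [
        e for e in executions
        if e.get("input", {}).get("pipeline_role") in roles
    ]
    return max(candidates, key=lambda e: e["start"], default=None)


# Hypothetical sample: a smoke run starts after the weekday cron run.
executions = [
    {
        "name": "weekday-cron",
        "start": datetime(2026, 5, 26, 12, 45, tzinfo=timezone.utc),
        "input": {"pipeline_role": "daily"},
    },
    {
        "name": "operator-smoke",
        "start": datetime(2026, 5, 26, 14, 0, tzinfo=timezone.utc),
        "input": {"pipeline_role": "smoke"},
    },
]

picked = latest_canonical(executions)
```

With this filter the later smoke run does not displace the cron run as "most recent". An execution with no `pipeline_role` at all is treated as non-canonical here, which is a design choice of the sketch rather than something the PR specifies.
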
Test plan

  • 5 new chokepoint tests in test_deploy_step_function_eventbridge_input.py — assert the right pipeline_role on each trigger across all three sources of truth. Drift in either the CFN OR a deploy script fails CI loudly with a named remediation path.
  • Full data suite: 1532 passed, 1 skipped, 7 warnings in 8.15s
  • Verified SF JSON InitializeInput uses States.JsonMerge — the new pipeline_role field is preserved through to all downstream states without breaking any JSONPath reference.
  • After all three merge: run infrastructure/deploy_step_function.sh + infrastructure/deploy_step_function_daily.sh to update the live EventBridge rules. Verify via aws events list-targets-by-rule --rule alpha-engine-saturday --region us-east-1.
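
The shape of the chokepoint check described in the test plan can be sketched roughly as below. This is an assumption-laden illustration, not the contents of test_deploy_step_function_eventbridge_input.py: `check_pipeline_roles` and the `target_inputs` mapping (trigger name to the raw JSON `Input` string extracted from one source) are hypothetical, and how each source is parsed is left out.

```python
import json

# Expected role per cron trigger, as listed in the Summary.
EXPECTED_ROLES = {
    "SaturdayTrigger": "weekly",
    "FridayShellRunTrigger": "shell-run",
    "WeekdayTrigger": "daily",
}


def check_pipeline_roles(source_name, target_inputs):
    """Return drift messages for one source; an empty list means no drift.

    `target_inputs` maps trigger name -> raw JSON Input string. Triggers the
    source does not define are skipped, since each deploy script only covers
    some of the rules.
    """
    problems = []
    for trigger, expected in EXPECTED_ROLES.items():
        raw = target_inputs.get(trigger)
        if raw is None:
            continue
        actual = json.loads(raw).get("pipeline_role")
        if actual != expected:
            problems.append(
                f"{source_name}: {trigger} has pipeline_role={actual!r}, "
                f"expected {expected!r}"
            )
    return problems


# Hypothetical extracted inputs: one correct, one drifted.
ok = check_pipeline_roles(
    "deploy_step_function_daily.sh",
    {"WeekdayTrigger": '{"pipeline_role": "daily"}'},
)
drifted = check_pipeline_roles(
    "alpha-engine-orchestration.yaml",
    {"SaturdayTrigger": '{"pipeline_role": "daily"}'},
)
```

Naming the source in each message is what lets a failure point at the file that needs fixing.
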

🤖 Generated with Claude Code

cipher813 and others added 2 commits May 25, 2026 15:17
Tags every cron-triggered SF execution with a ``pipeline_role`` input
field per the Option-D execution-picker plan (Brian-approved 2026-05-25
evening, lib substrate in alpha-engine-lib#75). Page 25 / Slack / CLI
consumers filter by role so smoke / recovery / operator-replay
executions never displace the canonical cadence run as "most recent."

- SaturdayTrigger (cron 09:00 UTC SAT)     → pipeline_role="weekly"
- FridayShellRunTrigger (cron 20:45 UTC FRI, shipped DISABLED)
                                            → pipeline_role="shell-run"
- WeekdayTrigger (cron 12:45 UTC MON-FRI)   → pipeline_role="daily"

Edits land in BOTH source-of-truth paths:

1. CFN orchestration template (``alpha-engine-orchestration.yaml``) —
   the fresh-region/account bootstrap path; re-apply re-stamps the
   live rule from this YAML.
2. ``deploy_step_function.sh`` + ``deploy_step_function_daily.sh`` —
   the operator-applied path; running these scripts updates the live
   rule directly. Both paths now carry pipeline_role to prevent
   drift between CFN and live state.

Naming convention for ad-hoc operator launches (documented in PR body
+ pipeline-reporting-revamp plan doc follow-up):

  pipeline_role      Triggered by                          Page 25 default
  -----------------  ------------------------------------  -----------------
  "weekly"           SaturdayTrigger EventBridge cron      shown
  "daily"            WeekdayTrigger EventBridge cron       shown
  "eod"              daemon._trigger_eod_pipeline          shown
  "shell-run"        FridayShellRunTrigger (DISABLED)      hidden by default
  "smoke"            operator smoke / debug                hidden by default
  "recovery"         operator fix-and-rerun                hidden by default
  "backfill"         operator historical backfill          hidden by default
  "operator-replay"  operator ad-hoc replay (catch-all)    hidden by default

Chokepoint tests added (TestEventBridgeInput / TestWeekdayEventBridgeInput
/ TestOrchestrationCFNPipelineRoles): assert the right pipeline_role
value for each trigger across all three sources of truth. Drift in
either the CFN OR a deploy script fails CI loudly with a named
remediation path.

Companion edit in alpha-engine repo:
``executor/daemon.py::_trigger_eod_pipeline`` adds
``"pipeline_role": "eod"`` to its start_execution input dict — ships
in a separate PR for that repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 33c3753 into main May 25, 2026
1 check passed
@cipher813 cipher813 deleted the feat/pipeline-role-eventbridge-cron branch May 25, 2026 22:23
cipher813 added a commit that referenced this pull request May 26, 2026
…add target-uniqueness chokepoint test (closes 2026-05-26 duplicate-target incident) (#322)

Root cause of the 2026-05-26 weekday-pipeline failure:

PR #317 (33c3753, 2026-05-25 evening) added pipeline_role tagging to
both the CFN template AND the deploy scripts, with different target
IDs (Id=1 from scripts, Id=<rule>-pipeline from CFN). EventBridge
couldn't dedupe — different IDs meant different targets — so the
alpha-engine-weekday rule shipped with TWO targets pointing at the
SAME state machine. Every weekday cron firing fanned to two parallel
SF executions; both ran MorningEnrich on the trading instance; both
opened ArcticDB and reached daily_append's update_batch/write_batch;
the C++ engine emitted 321 unique-symbol E_NON_INCREASING_INDEX_VERSION
(code 5090) races; the 5%-threshold gate hard-failed both runs at
35.6% (n_err=322 of 905). Saturday rule had the identical defect
(first exposure Sat 5/30).
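
Why two writers with different target IDs produce two targets can be shown with a small model. This sketch only imitates the behaviour described above (a target is updated in place when its Id matches and added otherwise); `put_targets` here is a local stand-in, not the AWS API, and the ARN is a placeholder.

```python
def put_targets(rule_targets, new_targets):
    """Model an upsert keyed by target Id: same Id replaces, new Id appends."""
    by_id = {t["Id"]: t for t in rule_targets}
    for target in new_targets:
        by_id[target["Id"]] = target
    return list(by_id.values())


# Placeholder ARN, not the real state machine.
SM_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:example"

targets = []
# The deploy script writes Id=1 ...
targets = put_targets(targets, [{"Id": "1", "Arn": SM_ARN}])
# ... and the CFN template writes Id=weekday-pipeline for the same rule.
targets = put_targets(targets, [{"Id": "weekday-pipeline", "Arn": SM_ARN}])

# Re-running either writer is idempotent, so the duplicate never goes away on its own.
targets = put_targets(targets, [{"Id": "1", "Arn": SM_ARN}])

distinct_arns = {t["Arn"] for t in targets}
```

The rule ends up with two targets and one distinct ARN, which is the fan-out described above: one cron firing, two executions of the same state machine.
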

This PR makes EventBridge rules + targets CFN-canonical (deploy
scripts no longer call put-rule or put-targets), migrates the
enable_standalone_scanner=true flag onto the CFN saturday target
(was script-only — re-deploying CFN would have silently reverted
L1995 Phase 3), and adds two substrate gates:

  - TestDeployScriptsHaveNoEventBridgeWrites: both deploy scripts
    must not contain executable `aws events put-rule` or
    `aws events put-targets` lines. Closes the dual-write path that
    enabled the bug.
  - TestCFNTargetUniqueness: each cron-triggered AWS::Events::Rule
    in the orchestration CFN has exactly 1 entry under Targets:.
    PR #317's existing tests validated input contents but not target
    count; this is the gap that let the defect through CI.
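
The two gates can be sketched as below. This is an illustration of the idea, not the code of TestDeployScriptsHaveNoEventBridgeWrites or TestCFNTargetUniqueness: both helper names are hypothetical, the template is assumed to be already parsed into a dict (loading CFN YAML with intrinsic-function tags is out of scope), "cron-triggered" is approximated as "has a ScheduleExpression", and comments are stripped with a naive split on `#`.

```python
import re


def rules_with_wrong_target_count(template):
    """Return {rule_name: target_count} for scheduled rules without exactly 1 target."""
    bad = {}
    for name, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::Events::Rule":
            continue
        props = resource.get("Properties", {})
        if "ScheduleExpression" not in props:
            continue  # assumption: only scheduled rules are in scope
        count = len(props.get("Targets", []))
        if count != 1:
            bad[name] = count
    return bad


EB_WRITE = re.compile(r"\baws\s+events\s+(put-rule|put-targets)\b")


def executable_eventbridge_writes(script_text):
    """Return script lines that call put-rule/put-targets outside a comment."""
    hits = []
    for line in script_text.splitlines():
        code = line.split("#", 1)[0]  # naive: ignores '#' inside quoted strings
        if EB_WRITE.search(code):
            hits.append(line.strip())
    return hits


# Hypothetical template fragment with a duplicated target.
template = {
    "Resources": {
        "WeekdayTrigger": {
            "Type": "AWS::Events::Rule",
            "Properties": {
                "ScheduleExpression": "cron(45 12 ? * MON-FRI *)",
                "Targets": [{"Id": "1"}, {"Id": "weekday-pipeline"}],
            },
        },
        "SaturdayTrigger": {
            "Type": "AWS::Events::Rule",
            "Properties": {
                "ScheduleExpression": "cron(0 9 ? * SAT *)",
                "Targets": [{"Id": "saturday-pipeline"}],
            },
        },
    }
}

script = """\
# aws events put-targets --rule alpha-engine-weekday   (retired)
aws events put-rule --name alpha-engine-weekday
echo done
"""

bad_rules = rules_with_wrong_target_count(template)
writes = executable_eventbridge_writes(script)
```

The first check counts targets rather than inspecting their contents, which is the property the earlier tests did not cover.
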

Operational state at PR-open time:
  - Duplicate targets removed manually via `aws events remove-targets`
    (Id=weekday-pipeline + Id=saturday-pipeline) so Wed 5/27 + Sat 5/30
    fire correctly without waiting for this PR to merge.
  - Today's failed weekday SF redriven via operator-launched execution
    (pipeline_role=recovery).
  - EB IAM role bootstrap (trust policy + create-role) kept in the
    saturday deploy script so a fresh region/account can still
    bootstrap via that script alone. Inline policy remains codified
    in alpha-engine repo's infrastructure/iam/.

Follow-up (filed at wind-down):
  - P1: SF-side MutualExclusionGuard (DynamoDB conditional PUT) so
    any future duplicate trigger (operator double-paste, EB bug,
    cross-region replay) hard-fails before any SSM SendCommand.
  - P1: alpha-engine-lib producer-side universe-writer lock
    (S3 conditional PUT) — covers manual `python -m builders.daily_append`
    runs that don't go through SF.
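
The conditional-PUT idea behind both follow-ups can be modelled in memory. This is only a sketch of the semantics: `InMemoryLockTable`, `acquire_run_lock`, and the `role#date` key format are hypothetical, and no AWS call is made. In DynamoDB the same effect is usually obtained with a PutItem whose ConditionExpression requires that the key attribute does not already exist.

```python
class ConditionalCheckFailed(Exception):
    """Raised when the put-if-absent condition does not hold."""


class InMemoryLockTable:
    """Stand-in for a table that supports an atomic put-if-absent."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, owner):
        if key in self._items:
            raise ConditionalCheckFailed(f"{key} already held by {self._items[key]}")
        self._items[key] = owner


def acquire_run_lock(table, pipeline_role, run_date, execution_id):
    """Return True if this execution won the lock for (role, date), else False."""
    key = f"{pipeline_role}#{run_date}"  # hypothetical key format
    try:
        table.put_if_absent(key, execution_id)
    except ConditionalCheckFailed:
        return False
    return True


table = InMemoryLockTable()
first = acquire_run_lock(table, "daily", "2026-05-26", "exec-a")
second = acquire_run_lock(table, "daily", "2026-05-26", "exec-b")   # duplicate trigger
recovery = acquire_run_lock(table, "recovery", "2026-05-26", "exec-c")
```

Only the first execution for a given role and date proceeds; the duplicate fails before doing any work. Keying on the role lets an operator recovery run coexist with the cron run, which is a choice made for this example and may not match the eventual guard.
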

Test: `pytest tests/test_deploy_step_function_eventbridge_input.py`
(11 passed) + full suite (1539 passed, 1 skipped).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 27, 2026
…oint module (#328)

Closes the remaining 6 findings from the 2026-05-27 wider L302 audit
(P0 retrospective on PR #317's content-vs-uniqueness CI gap that
caused the 2026-05-26 dup-EB-target trading-day miss). PR #322 closed
the EventBridge target instance; this closes the meta-pattern across
the rest of this repo's CI: tests pin WHAT was put, not HOW MANY were
put or whether anything ELSE was put.

New module tests/test_sf_payload_uniqueness.py with 29 tests across
6 classes — one per audit finding:

| Class | L302 Finding | What it closes |
|---|---|---|
| TestSaturdaySFPayloadFieldSetsClosed | F2 eval_judge_wiring + F4 aggregate_costs + cross-cutting | Lambda Payload field-set drift across all 14 Saturday SF Lambda calls |
| TestWeekdaySFPayloadFieldSetsClosed | (extension) | Lambda Payload field-set drift across 6 weekday SF Lambda calls |
| TestSFRoleInvokeFunctionStatementCount | F3 iam_lambda_grants | exactly 1 lambda:InvokeFunction Statement in alpha-engine-step-functions-role.json (catches stale overlapping ARN statements from pre-2026 refactors) |
| TestWeekdaySSMFlowDoctorOrdering | F5 ssm_pipefail_wiring | FLOW_DOCTOR_ENABLED=1 in first 3 commands (closes 2026-05-11 ordering-incident recurrence path) |
| TestEODSFTopLevelFieldsClosed | F6 eod_substrate_check_wiring | top-level $.field namespace closed across input + intermediate ResultPath fields (catches silent collisions) |
| TestSaturdaySFSpotStateCount | F7 friday_shell_run_wiring | exactly 8 spot-launching states (catches orphaned legacy state from incomplete refactor) |

Shape per surface: pin a closed registry of expected keys/states,
fail loud when actual diverges. Mirrors PR #322's
TestCFNTargetUniqueness pattern; same chokepoint shape applied to
6 more surfaces.
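
The closed-registry shape can be sketched generically as below. The helper names and the sample field names are hypothetical and are not taken from tests/test_sf_payload_uniqueness.py; the point is only that both missing and unexpected entries fail, not just wrong values.

```python
def diff_closed_registry(expected, actual):
    """Compare a pinned set of keys/states against what is actually present."""
    expected, actual = set(expected), set(actual)
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }


def assert_closed(surface, expected, actual):
    """Fail loudly, naming the surface, when the actual set diverges."""
    diff = diff_closed_registry(expected, actual)
    if diff["missing"] or diff["unexpected"]:
        raise AssertionError(
            f"{surface}: missing={diff['missing']} unexpected={diff['unexpected']}"
        )


# Hypothetical payload fields for one Lambda call.
EXPECTED_FIELDS = {"run_date", "pipeline_role"}

# Passes: the field set matches exactly.
assert_closed("ExampleLambda payload", EXPECTED_FIELDS, {"run_date", "pipeline_role"})

# An extra field is reported even though every expected field is present.
drift = diff_closed_registry(EXPECTED_FIELDS, {"run_date", "pipeline_role", "debug"})
```

A content-only assertion would pass in the drift case because every expected field still has the right value; the closed registry flags the extra field.
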

Suite: 1567 → 1596 passed (+29 net).

Composes with #322, [[reference-eventbridge-target-uniqueness-invariant]],
[[feedback-mocked-tests-dont-validate-external-api-contract]],
[[feedback-audit-findings-become-roadmap-followups]] and the L302
P0-retrospective entry in ROADMAP.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>