fix: suppress SFN for dry-run pipelines and deduplicate JOB_COMPLETED #67

Merged
dwsmith1983 merged 4 commits into main from fix/dryrun-suppress-sfn on Mar 13, 2026

Conversation

@dwsmith1983
Owner

Summary

  • Dry-run SFN suppression: handleRerunRequest and handleJobFailure lacked cfg.DryRun checks, allowing rerun requests and job failure retries to start real SFN executions for dry-run pipelines. Added guards in both handlers, defense-in-depth in startSFNWithName, and watchdog reconciliation skip to prevent orphaned trigger locks.
  • Duplicate JOB_COMPLETED: handleCheckJob in the orchestrator published JOB_COMPLETED when polling detected success, but the stream-router also published it when the JOB# record arrived via DynamoDB stream — causing duplicate Slack alerts for polled jobs (Glue/EMR). Removed the orchestrator emission; stream-router is now the single canonical source.
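The dry-run guard from the first bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code: only `cfg.DryRun`, `handleRerunRequest`, and `startSFNWithName` come from the description, while `PipelineConfig`, `Starter`, `recordingStarter`, and `ErrDryRun` are hypothetical names introduced so the example runs without AWS.

```go
package main

import (
	"errors"
	"fmt"
)

// PipelineConfig is a hypothetical stand-in for the pipeline config;
// only the DryRun flag is taken from the PR description.
type PipelineConfig struct {
	Name   string
	DryRun bool
}

// ErrDryRun is a hypothetical sentinel for a suppressed execution.
var ErrDryRun = errors.New("dry-run pipeline: SFN execution suppressed")

// Starter abstracts the Step Functions client (assumption for testability).
type Starter interface {
	Start(execName string) error
}

// recordingStarter records execution names instead of calling AWS.
type recordingStarter struct{ started []string }

func (r *recordingStarter) Start(execName string) error {
	r.started = append(r.started, execName)
	return nil
}

// startSFNWithName is the defense-in-depth layer: even if a caller forgets
// its own check, a dry-run pipeline never reaches the client.
func startSFNWithName(cfg PipelineConfig, s Starter, execName string) error {
	if cfg.DryRun {
		return ErrDryRun
	}
	return s.Start(execName)
}

// handleRerunRequest carries the handler-level guard: a rerun request for a
// dry-run pipeline is treated as a no-op rather than an error.
func handleRerunRequest(cfg PipelineConfig, s Starter, execName string) error {
	if cfg.DryRun {
		return nil
	}
	return startSFNWithName(cfg, s, execName)
}

func main() {
	s := &recordingStarter{}
	_ = handleRerunRequest(PipelineConfig{Name: "observe-only", DryRun: true}, s, "rerun-1")
	_ = handleRerunRequest(PipelineConfig{Name: "live"}, s, "rerun-2")
	fmt.Println(s.started) // only the live pipeline started an execution
}
```

Checking in both the handler and the start helper is redundant by design: the handler check keeps the normal path quiet, and the helper check catches any future caller that skips it.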

…ilure paths

handleRerunRequest and handleJobFailure did not check cfg.DryRun before
calling startSFNWithName, allowing rerun requests and job failure retries
to start real Step Function executions for observation-only pipelines.

Added dry-run guards in both handlers, defense-in-depth in
startSFNWithName, and watchdog reconciliation skip to prevent orphaned
trigger locks.

handleCheckJob published JOB_COMPLETED directly when polling detected
success, but the stream-router's handleJobSuccess also published it when
the JOB# record arrived via DynamoDB stream. This caused duplicate Slack
alerts for polled jobs (Glue/EMR) while sync jobs only got one.

The stream-router is now the single canonical source for JOB_COMPLETED
across all job types (sync and polled).
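A rough sketch of the single-publisher arrangement, under assumptions: `Bus` and `JobStore` are hypothetical in-memory stand-ins for the event bus and the DynamoDB table with its stream, and the `"success"` status value is invented for illustration. Only `handleCheckJob`, `handleJobSuccess`, `JOB_COMPLETED`, and the `JOB#` key prefix come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a simplified, hypothetical event shape.
type Event struct {
	Type  string
	JobID string
}

// Bus is a hypothetical in-memory event bus that records what was published.
type Bus struct{ events []Event }

func (b *Bus) Publish(e Event) { b.events = append(b.events, e) }

// JobStore stands in for the table plus its stream: every write is
// forwarded to a subscriber, the way a stream record would be.
type JobStore struct {
	onWrite func(key, status string)
}

func (s *JobStore) Put(key, status string) {
	if s.onWrite != nil {
		s.onWrite(key, status)
	}
}

// handleCheckJob (orchestrator side) only records the polled outcome.
// It deliberately publishes nothing.
func handleCheckJob(store *JobStore, jobID string, succeeded bool) {
	if succeeded {
		store.Put("JOB#"+jobID, "success")
	}
}

// handleJobSuccess (stream-router side) is the only JOB_COMPLETED publisher.
func handleJobSuccess(bus *Bus, key, status string) {
	if strings.HasPrefix(key, "JOB#") && status == "success" {
		bus.Publish(Event{Type: "JOB_COMPLETED", JobID: strings.TrimPrefix(key, "JOB#")})
	}
}

func main() {
	bus := &Bus{}
	store := &JobStore{}
	store.onWrite = func(key, status string) { handleJobSuccess(bus, key, status) }

	handleCheckJob(store, "glue-123", true) // polled job detected as successful
	fmt.Println(len(bus.events), bus.events[0].Type)
}
```

Because sync and polled jobs both end with a `JOB#` record write, routing the event through the stream handler gives both paths exactly one emission.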

@github-actions bot added the tests (Test changes), lambda (Lambda handlers), and docs (Documentation) labels on Mar 12, 2026

Add DRY_RUN_COMPLETED terminal event after WOULD_TRIGGER + SLA_PROJECTION
to close the observation loop for each evaluation period. Carries SLA
verdict (met/breach/n/a) so operators see each period resolve.

Add cfg.DryRun guards to all seven watchdog functions: scheduleSLAAlerts,
detectMissedSchedules, detectMissedInclusionSchedules, checkTriggerDeadlines,
detectMissingPostRunSensors, detectRelativeSLABreaches, and
detectStaleTriggers. Without these, dry-run pipelines received real
SLA_WARNING/SLA_BREACH alerts via EventBridge Scheduler.
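The watchdog guard could look roughly like this for one of the seven functions. The `Pipeline` type and the `scheduleFunc` callback are hypothetical; the callback replaces the real EventBridge Scheduler call so the sketch stays runnable.

```go
package main

import "fmt"

// Pipeline is a hypothetical view of a pipeline as the watchdog sees it.
type Pipeline struct {
	ID     string
	DryRun bool
}

// scheduleFunc stands in for creating a scheduler entry (assumption).
type scheduleFunc func(alertType, pipelineID string)

// scheduleSLAAlerts skips dry-run pipelines before scheduling anything,
// so observation-only pipelines never receive real SLA alerts.
func scheduleSLAAlerts(pipelines []Pipeline, schedule scheduleFunc) {
	for _, p := range pipelines {
		if p.DryRun {
			continue
		}
		schedule("SLA_WARNING", p.ID)
		schedule("SLA_BREACH", p.ID)
	}
}

func main() {
	var scheduled []string
	record := func(alertType, id string) {
		scheduled = append(scheduled, alertType+":"+id)
	}

	scheduleSLAAlerts([]Pipeline{
		{ID: "observe-only", DryRun: true},
		{ID: "live"},
	}, record)

	fmt.Println(scheduled) // entries exist for "live" only
}
```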

Harden triggeredAt parse in late-data path to warn and return on bad data
instead of silently producing garbage durations.

@dwsmith1983 self-assigned this on Mar 13, 2026
@github-actions bot added the types (Public types, pkg/types) label on Mar 13, 2026

Add v0.8.0 events (SENSOR_DEADLINE_EXPIRED, IRREGULAR_SCHEDULE_MISSED,
RELATIVE_SLA_WARNING, RELATIVE_SLA_BREACH) and all DRY_RUN_* events to
the alert rule so they route to SQS and reach Slack via alert-dispatcher.
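For illustration only, the additions to the alert rule could be expressed as an EventBridge event pattern like the one built below. How the repository actually defines the rule is not shown in this PR, and whether every dry-run event shares the `DRY_RUN_` prefix is an assumption, so the prefix matcher here is a guess at one compact way to cover "all DRY_RUN_* events". `alertEventPattern` is a hypothetical helper, and a real rule would also keep its pre-existing event types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// alertEventPattern returns only the newly added matchers: four exact
// detail-type names from the commit message plus a prefix matcher for
// dry-run events (the prefix approach is an assumption).
func alertEventPattern() map[string][]any {
	return map[string][]any{
		"detail-type": {
			"SENSOR_DEADLINE_EXPIRED",
			"IRREGULAR_SCHEDULE_MISSED",
			"RELATIVE_SLA_WARNING",
			"RELATIVE_SLA_BREACH",
			map[string]string{"prefix": "DRY_RUN_"},
		},
	}
}

func main() {
	b, err := json.MarshalIndent(alertEventPattern(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b)) // JSON suitable for a rule's event pattern
}
```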

@github-actions bot added the deploy (Deployment and ASL) label on Mar 13, 2026
@dwsmith1983 merged commit be5b24d into main on Mar 13, 2026
6 checks passed
@dwsmith1983 deleted the fix/dryrun-suppress-sfn branch on March 13, 2026 at 13:41