
refactor: v0.9.2–0.9.3 restructuring #68

Merged
dwsmith1983 merged 17 commits into main from refactor/restructuring
Mar 14, 2026

dwsmith1983 (Owner) commented Mar 14, 2026

Summary

  • Split large watchdog module (1079 lines) into 4 focused files (~200 lines each) plus a 34-line entry point
  • Split stream router into focused domain files for orchestration, post-run, rerun, drift detection
  • Fix batch item failure reporting, baseline namespacing, rerun ordering, epoch normalization
  • Fix trigger deadline timezone resolution
  • Fix SLA-met suppression when pipeline never ran
  • Fix validation mode matching to be case-insensitive
  • Harden IAM policies and EventBridge bus access
  • Replace shell execution with direct exec in command trigger
  • Add SSRF protection to trigger HTTP clients via dial-time IP validation
  • Detect EventBridge PutEvents partial failures instead of discarding output
  • Add dry-run rerun/retry observability with full evaluation logging
  • Extract shared drift detection, HTTP client construction, SLA schedule loop

EvaluateRules now uses strings.ToUpper(mode) so "any", "Any",
and "ANY" all route to the ANY branch. Previously lowercase
variants fell through to the default ALL case.
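
A minimal sketch of the case-insensitive mode matching described above. The function name `evaluate` and its signature are hypothetical stand-ins, not the actual EvaluateRules API; only the use of strings.ToUpper and the ANY/default-ALL routing come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// evaluate is a hypothetical stand-in for EvaluateRules. It combines
// per-rule results under a mode that is matched case-insensitively.
func evaluate(mode string, results []bool) bool {
	switch strings.ToUpper(mode) {
	case "ANY":
		// ANY: one passing rule is enough.
		for _, ok := range results {
			if ok {
				return true
			}
		}
		return false
	default:
		// ALL (and any unrecognized mode): every rule must pass.
		for _, ok := range results {
			if !ok {
				return false
			}
		}
		return true
	}
}

func main() {
	results := []bool{false, true}
	for _, mode := range []string{"any", "Any", "ANY", "all"} {
		fmt.Printf("%s -> %v\n", mode, evaluate(mode, results))
	}
}
```

Normalizing once at the top of the switch keeps the case labels as plain upper-case literals, so a new mode only needs one spelling.
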
- lambda_trigger_arns default [] with precondition (SEC-1)
- Slack plaintext token deprecation warning via check block (SEC-2)
- New variables for trigger IAM scoping: glue_job_arns, emr_cluster_arns,
  emr_serverless_app_arns, sfn_trigger_arns — all default [] (SEC-4)
- EventBridge bus policy restricts PutEvents to Lambda roles (SEC-5)

…s (BUG-5, CQ-5)

BUG-5: handleSLACancel now checks for trigger existence before
publishing SLA verdict. Pipelines that were never triggered no
longer emit false SLA_MET events.

CQ-5: Replaced _ = publishEvent(...) with error-logged calls
in SLA reconcile path.

closeSensorTriggerWindow now reads timezone from cfg.Schedule.Timezone
(the schedule's own timezone) instead of cfg.SLA.Timezone. Falls back
to SLA timezone if schedule timezone is not set. Prevents incorrect
deadline calculation when schedule and SLA use different timezones.
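
The fallback order can be sketched roughly as follows. `pickTimezone` is a hypothetical helper name; the real code reads cfg.Schedule.Timezone and cfg.SLA.Timezone directly, and what happens when both are empty is not stated in the commit, so this sketch simply returns the empty string in that case.

```go
package main

import (
	"fmt"
	"time"
)

// pickTimezone is a hypothetical helper: it prefers the schedule's own
// timezone and falls back to the SLA timezone when the schedule has none.
func pickTimezone(scheduleTZ, slaTZ string) string {
	if scheduleTZ != "" {
		return scheduleTZ
	}
	return slaTZ
}

func main() {
	// Schedule and SLA disagree: the schedule's zone wins.
	name := pickTimezone("America/New_York", "UTC")
	loc, err := time.LoadLocation(name)
	if err != nil {
		fmt.Println("load location:", err)
		return
	}
	// A 09:00 deadline is interpreted in the schedule's zone.
	deadline := time.Date(2026, time.March, 14, 9, 0, 0, 0, loc)
	fmt.Println(name, deadline.UTC().Format(time.RFC3339))
}
```
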
…RY-1)

New ExtractFloatOk distinguishes absent keys from actual zero values.
New DetectDrift consolidates 3 identical drift comparison sites into
one shared function. Transitions like 5000→0 now correctly trigger
drift detection instead of being silently skipped.
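
A rough illustration of why the presence flag matters. The signatures below are assumptions made for this sketch (the real ExtractFloatOk and DetectDrift may take different arguments), and the absolute-difference threshold is a placeholder comparison, not necessarily the one the shared function uses.

```go
package main

import (
	"fmt"
	"math"
)

// extractFloatOk reports whether key is present and numeric, so a stored
// zero can be told apart from a missing key. Signature is hypothetical.
func extractFloatOk(data map[string]any, key string) (float64, bool) {
	v, present := data[key]
	if !present {
		return 0, false
	}
	switch n := v.(type) {
	case float64:
		return n, true
	case int:
		return float64(n), true
	default:
		return 0, false
	}
}

// detectDrift flags drift when both values are present and differ by more
// than threshold. The threshold rule is an assumption for illustration.
func detectDrift(baseline, current map[string]any, key string, threshold float64) bool {
	prev, okPrev := extractFloatOk(baseline, key)
	curr, okCurr := extractFloatOk(current, key)
	if !okPrev || !okCurr {
		return false
	}
	return math.Abs(curr-prev) > threshold
}

func main() {
	baseline := map[string]any{"count": 5000.0}
	current := map[string]any{"count": 0.0}
	// A genuine drop to zero is compared, not skipped.
	fmt.Println(detectDrift(baseline, current, "count", 0))
	// A missing key is not treated as zero.
	fmt.Println(detectDrift(baseline, map[string]any{}, "count", 0))
}
```
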
…poch normalization (BUG-2,3,4,9,10)

BUG-4: HandleStreamEvent returns DynamoDBEventResponse with failed
record EventIDs for Lambda partial batch retry.

BUG-10: Namespace postrun baseline by rule key to prevent field
collision between rules with the same field name. Clean break —
existing flat baselines self-heal on next pipeline completion.

BUG-2: RemapPerPeriodSensors collects additions in staging map
to avoid nondeterministic map mutation during range iteration.
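
The staging-map pattern in general form, as a sketch. The Go specification leaves it unspecified whether an entry added during a range loop is visited by that loop, which is the nondeterminism being avoided. `addBaseAliases` and its key format are hypothetical; they only illustrate the technique and do not reproduce what RemapPerPeriodSensors actually remaps.

```go
package main

import (
	"fmt"
	"strings"
)

// addBaseAliases is a hypothetical example: for every key ending in suffix,
// it adds an alias under the key without the suffix. New entries are first
// collected in a staging map and merged only after iteration has finished.
func addBaseAliases(sensors map[string]string, suffix string) {
	staged := make(map[string]string)
	for key, val := range sensors {
		if strings.HasSuffix(key, suffix) {
			staged[strings.TrimSuffix(key, suffix)] = val
		}
	}
	// Safe: the range above is over, so writes cannot affect it.
	for key, val := range staged {
		sensors[key] = val
	}
}

func main() {
	sensors := map[string]string{"rows#2026-03-14": "42", "other": "x"}
	addBaseAliases(sensors, "#2026-03-14")
	fmt.Println(len(sensors), sensors["rows"])
}
```
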

BUG-3: Reorder handleRerunRequest to lock-before-write, preventing
orphaned rerun records when lock reset fails.

BUG-9: Normalize updatedAt epoch timestamps < 1e12 to milliseconds
for consistent rerun freshness comparison.
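
The normalization rule for BUG-9 can be sketched like this. The function and constant names are hypothetical; the 1e12 cutoff is the one stated above, on the assumption that smaller values are epoch seconds.

```go
package main

import "fmt"

// millisThreshold is the cutoff from the commit message: values below it
// are assumed to be epoch seconds rather than milliseconds.
const millisThreshold int64 = 1_000_000_000_000

// normalizeEpochMillis is a hypothetical helper that converts second-based
// timestamps to milliseconds and leaves millisecond values untouched.
func normalizeEpochMillis(ts int64) int64 {
	if ts < millisThreshold {
		return ts * 1000
	}
	return ts
}

func main() {
	seconds := int64(1773459960)
	millis := int64(1773459960000)
	// Both forms compare equal after normalization.
	fmt.Println(normalizeEpochMillis(seconds) == normalizeEpochMillis(millis))
}
```
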
resolveHTTPClient(timeoutSec) replaces identical 7-line blocks
in ExecuteHTTP and ExecuteAirflow.

createSLASchedules() replaces duplicated warning/breach schedule
creation loops in scheduleSLAAlerts (watchdog) and handleSLASchedule
(sla-monitor). onConflictSkip parameter handles the differing error
behavior between the two callers.
Eliminates shell interpretation entirely. No pipes, redirects, or
variable expansion. strings.Fields splits the command into argv.
Prevents command injection via crafted pipeline configs.
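
A minimal sketch of shell-free execution, assuming a helper named `buildCommand` (hypothetical; the trigger's real function names are not shown here). Shell metacharacters in the input end up as literal argv entries because no shell ever parses the string.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// buildCommand is a hypothetical helper: it splits the command on
// whitespace and passes the pieces straight to exec, with no shell involved.
func buildCommand(command string) (*exec.Cmd, error) {
	argv := strings.Fields(command)
	if len(argv) == 0 {
		return nil, errors.New("empty command")
	}
	return exec.Command(argv[0], argv[1:]...), nil
}

func main() {
	cmd, err := buildCommand("echo hello; rm -rf /tmp/x")
	if err != nil {
		fmt.Println("build command:", err)
		return
	}
	// ";" is just part of an argument here, not a command separator.
	// Calling cmd.Run() would execute echo with these literal arguments.
	fmt.Printf("%q\n", cmd.Args)
}
```

One trade-off of strings.Fields is that it has no quoting rules, so an argument containing spaces cannot be expressed in a single command string.
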
The watchdog split is a pure refactor with no logic changes. Functions are grouped by domain:
- watchdog.go: HandleWatchdog entry point only (34 lines)
- watchdog_stale.go: stale trigger detection + reconciliation
- watchdog_missed.go: missed schedule detection (cron + inclusion)
- watchdog_sla.go: SLA alerting + trigger deadlines
- watchdog_postrun.go: post-run sensor monitoring + relative SLA

Add 4 new EventDetailType constants for dry-run rerun/retry
observability: DRY_RUN_WOULD_RERUN, DRY_RUN_RERUN_REJECTED,
DRY_RUN_WOULD_RETRY, DRY_RUN_RETRY_EXHAUSTED.

Replace the 5-line early returns in handleRerunRequest and
handleJobFailure with self-contained evaluation blocks that run all
checks (calendar exclusion, rerun limits, circuit breaker, retry
budget) and publish observation events instead of executing side
effects.

New functions: handleDryRunRerunRequest, handleDryRunJobFailure.
Tests: 2 updated + 6 new covering all decision branches (would-rerun,
calendar-rejected, limit-exceeded, circuit-breaker-reject, no-job-history,
would-retry, retry-exhausted, calendar-excluded).

Add DRY_RUN_WOULD_RERUN, DRY_RUN_RERUN_REJECTED, DRY_RUN_WOULD_RETRY,
and DRY_RUN_RETRY_EXHAUSTED to the EventBridge alert-events pattern.

publishEvent now checks FailedEntryCount on PutEventsOutput. AWS
EventBridge can return FailedEntryCount > 0 with error == nil for
partial failures — these were previously silently discarded.
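
The partial-failure check might look roughly like the sketch below. To keep it self-contained, `putEventsOutput` and `resultEntry` are simplified local stand-ins for the SDK's PutEventsOutput and its result entries (the real types use pointer fields), and `checkPutEvents` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// resultEntry and putEventsOutput are simplified stand-ins for the AWS SDK
// types, defined locally so the sketch runs without the SDK.
type resultEntry struct {
	ErrorCode    string
	ErrorMessage string
}

type putEventsOutput struct {
	FailedEntryCount int32
	Entries          []resultEntry
}

// checkPutEvents is a hypothetical helper: it surfaces a partial failure
// as an error even when the API call itself returned err == nil.
func checkPutEvents(out *putEventsOutput, err error) error {
	if err != nil {
		return fmt.Errorf("put events: %w", err)
	}
	if out == nil || out.FailedEntryCount == 0 {
		return nil
	}
	var reasons []string
	for i, e := range out.Entries {
		if e.ErrorCode != "" {
			reasons = append(reasons, fmt.Sprintf("entry %d: %s: %s", i, e.ErrorCode, e.ErrorMessage))
		}
	}
	return fmt.Errorf("put events: %d entries failed: %s",
		out.FailedEntryCount, strings.Join(reasons, "; "))
}

func main() {
	out := &putEventsOutput{
		FailedEntryCount: 1,
		Entries:          []resultEntry{{}, {ErrorCode: "InternalFailure", ErrorMessage: "boom"}},
	}
	fmt.Println(checkPutEvents(out, nil))
}
```
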
Custom http.Transport with dial-time IP validation rejects connections
to private, loopback, link-local, and multicast addresses. Catches all
bypass vectors including DNS rebinding and HTTP redirects. Protects
HTTP, Airflow, and Databricks triggers against targeting internal
endpoints (AWS IMDS, ECS metadata, VPC services).
github-actions bot added the tests, lambda, deploy, docs, triggers, and types labels Mar 14, 2026
dwsmith1983 changed the title from "refactor: v0.9.2–0.9.3 restructuring and repo analysis remediation" to "refactor: v0.9.2–0.9.3 restructuring" Mar 14, 2026
dwsmith1983 self-assigned this Mar 14, 2026
dwsmith1983 merged commit f127a26 into main Mar 14, 2026
6 checks passed
dwsmith1983 deleted the refactor/restructuring branch March 14, 2026 03:46