ci: drop fragile event-warnings gate; rely on Flux reconcile status #1678

Merged
devantler merged 1 commit into main from ci/drop-event-warnings-gate
Jun 2, 2026

@devantler
Contributor

Why

The check-event-warnings composite action sampled Warning events after a settle and failed the deploy if any fired since a marker. Two problems:

Fragile (false positives). It records the marker before the settle and flags any Warning since then — so a one-shot warning emitted during the settle that self-heals is still flagged, contradicting its own docstring. Concretely, the #1661 spread-pods policy (and any controller config change) restarts the Flux controllers; their /readyz returns connection refused for a few seconds during the restart, and that tripped the gate even though the controllers were healthy seconds later.

Redundant. ksail workload reconcile already "triggers reconciliation and waits for completion, tracking the OCIRepository and each Kustomization individually" with a timeout. And every Flux Kustomization in this repo is wait: true:

variables → infrastructure-controllers → infrastructure → apps

So each one only reports Ready once Flux's own health checks pass on everything it applied: Deployments, StatefulSets, and HelmReleases. That's why a stalled HelmRelease (e.g. the recent OpenBao RetriesExceeded) surfaces as infrastructure-controllers HealthCheckFailed and fails the reconcile. Flux's Ready/Stalled conditions are the authoritative, heuristic-free signal, and they also catch pre-existing unhealthy state that the event marker never could.
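
To make the "Ready/Stalled conditions as the signal" idea concrete, here is a minimal Python sketch of how a gate could read those conditions from Kustomization objects that have already been fetched as JSON. It is not how ksail implements its wait; the function names `kustomization_verdict` and `gate` are hypothetical, and the input is assumed to follow the standard Kubernetes `status.conditions` layout.

```python
from typing import Any


def kustomization_verdict(obj: dict[str, Any]) -> tuple[bool, str]:
    """Return (healthy, detail) for one Kustomization object (hypothetical helper)."""
    name = obj.get("metadata", {}).get("name", "<unnamed>")
    conditions = obj.get("status", {}).get("conditions", [])
    by_type = {c.get("type"): c for c in conditions}

    # A Stalled condition means the controller has given up making progress.
    stalled = by_type.get("Stalled")
    if stalled and stalled.get("status") == "True":
        return False, f"{name}: Stalled ({stalled.get('reason', 'unknown')})"

    # With wait: true, Ready is only True once health checks have passed.
    ready = by_type.get("Ready")
    if ready is None:
        return False, f"{name}: no Ready condition reported yet"
    if ready.get("status") != "True":
        return False, f"{name}: not Ready ({ready.get('reason', 'unknown')})"
    return True, f"{name}: Ready"


def gate(kustomizations: list[dict[str, Any]]) -> list[str]:
    """Return failure details; an empty list means the gate passes."""
    verdicts = (kustomization_verdict(k) for k in kustomizations)
    return [detail for ok, detail in verdicts if not ok]
```

Because the verdict is read from the current conditions rather than from events emitted in a time window, a Kustomization that was already unhealthy before the deploy started is reported just like one that broke during it.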

What

  • Delete .github/actions/check-event-warnings/ and its three call sites: ci.yaml (system-test + merge_group prod deploy) and cd.yaml (tag deploy).
  • The ksail workload reconcile step is now the deploy gate. The existing 🩺 Diagnose Flux on failure steps are unchanged and still fire on its failure (steps.reconcile.outcome/conclusion == 'failure').

No new gate code is added — the trustworthy signal (Flux reconcile status, via ksail's per-Kustomization wait) already existed; this just stops masking it with a fragile event heuristic. If we ever want a stronger post-reconcile assertion, the right home is ksail (reconcile/wait), not bespoke CI bash.

Validation

  • No remaining references to the action; both workflows parse as YAML.
  • git status: 1 deletion + 2 workflow edits; diffs keep the diagnose steps intact and correctly wired to the reconcile step.
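
The "no remaining references" check above could be reproduced with a short script along these lines. This is only a sketch of one way to do it, not the command actually used for this PR; `find_references` is a hypothetical helper, and it assumes workflow and action files live under `.github/` with a `.yaml` or `.yml` extension.

```python
from pathlib import Path

# Name of the removed composite action, as it appears in the PR description.
ACTION_REF = "check-event-warnings"


def find_references(repo_root: Path, needle: str = ACTION_REF) -> list[str]:
    """Return 'path:line' entries for every YAML line under .github/ that mentions needle."""
    hits: list[str] = []
    github_dir = repo_root / ".github"
    if not github_dir.is_dir():
        return hits
    for path in sorted(github_dir.rglob("*")):
        if not path.is_file() or path.suffix not in {".yaml", ".yml"}:
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle in line:
                hits.append(f"{path.relative_to(repo_root).as_posix()}:{lineno}")
    return hits
```

An empty result from `find_references(Path("."))` at the repository root would correspond to the first validation bullet.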

Supersedes #1677 (which fixed the gate's marker timing) — removing the gate makes that fix moot.
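
For readers who did not follow the marker discussion, the false positive described under Why can be illustrated with a small Python sketch. It models events as timestamps on an arbitrary clock and is not the removed action's code; `WarningEvent`, `flagged_since`, and `flagged_still_firing` are hypothetical names, and the second function is only one possible stricter heuristic, not a description of what #1677 changed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarningEvent:
    timestamp: float  # seconds on an arbitrary clock
    reason: str


def flagged_since(events: list[WarningEvent], marker: float) -> list[WarningEvent]:
    """Behaviour described in the PR: flag every Warning emitted at or after the marker."""
    return [e for e in events if e.timestamp >= marker]


def flagged_still_firing(
    events: list[WarningEvent], sample_time: float, recent_window: float
) -> list[WarningEvent]:
    """Hypothetical alternative: flag only Warnings seen shortly before sampling."""
    cutoff = sample_time - recent_window
    return [e for e in events if cutoff <= e.timestamp <= sample_time]


# Marker recorded at t=0, settle lasts 60s, events are sampled at t=60.
blip = WarningEvent(timestamp=3.0, reason="Unhealthy")         # self-heals within seconds
persistent = WarningEvent(timestamp=58.0, reason="Unhealthy")  # still firing at sample time
```

With the marker taken before the settle, `flagged_since([blip], 0.0)` still returns the blip, which is the false positive. Any such windowing remains a heuristic either way, which is the argument for relying on Flux's conditions instead.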

The check-event-warnings composite action sampled Warning events after a
settle and failed the deploy if any fired since a marker. It was a heuristic
layered on top of reconciliation and produced false positives: a one-shot
warning emitted DURING the settle window — e.g. a Flux controller's /readyz
blipping "connection refused" while it restarts to pick up the #1661
spread-policy mutation — was flagged even though it self-healed in seconds.

It was also redundant. `ksail workload reconcile` already triggers and WAITS
for completion, tracking the OCIRepository and each Kustomization individually
with a timeout. Every Flux Kustomization here (variables ->
infrastructure-controllers -> infrastructure -> apps) is `wait: true`, so a
Kustomization only reports Ready once Flux's own health checks pass on
everything it applied — Deployments, StatefulSets and HelmReleases alike.
That is why a stalled HelmRelease (e.g. OpenBao RetriesExceeded) surfaces as
`infrastructure-controllers HealthCheckFailed` and fails the reconcile. Flux's
Ready/Stalled conditions are the authoritative, heuristic-free signal, and
they catch pre-existing unhealth the event marker never could.

Remove the action and its three call sites (ci.yaml system-test + merge_group,
cd.yaml). The reconcile step is now the deploy gate; the existing "Diagnose
Flux on failure" steps still fire on its failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR removes the check-event-warnings deploy gate from CI/CD and relies on Flux reconciliation status (via ksail workload reconcile / ksail-cluster’s reconcile) as the authoritative deployment health signal.

Changes:

  • Removed the check-event-warnings composite action implementation.
  • Dropped all workflow call sites in CI (system-test, merge_group prod deploy) and CD (tag-based prod deploy).
  • Kept existing “Diagnose Flux on failure” steps intact and still tied to reconcile failure conditions.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| .github/workflows/ci.yaml | Removes the post-reconcile event-warning gate from system-test and prod deploy flows. |
| .github/workflows/cd.yaml | Removes the post-reconcile event-warning gate from tag-based prod deploy flow. |
| .github/actions/check-event-warnings/action.yaml | Deletes the composite action that previously sampled Warning events and failed the deploy. |

@devantler devantler added this pull request to the merge queue Jun 2, 2026
Merged via the queue into main with commit f0832b5 Jun 2, 2026
10 checks passed
@devantler devantler deleted the ci/drop-event-warnings-gate branch June 2, 2026 06:15
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 2, 2026
@botantler
Contributor

botantler Bot commented Jun 2, 2026

🎉 This PR is included in version 1.15.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@botantler botantler Bot added the released label Jun 2, 2026