ci: drop fragile event-warnings gate; rely on Flux reconcile status#1678
Merged
Conversation
The check-event-warnings composite action sampled Warning events after a settle and failed the deploy if any fired since a marker. It was a heuristic layered on top of reconciliation and produced false positives: a one-shot warning emitted DURING the settle window — e.g. a Flux controller's /readyz blipping "connection refused" while it restarts to pick up the #1661 spread-policy mutation — was flagged even though it self-healed in seconds. It was also redundant. `ksail workload reconcile` already triggers and WAITS for completion, tracking the OCIRepository and each Kustomization individually with a timeout. Every Flux Kustomization here (variables -> infrastructure-controllers -> infrastructure -> apps) is `wait: true`, so a Kustomization only reports Ready once Flux's own health checks pass on everything it applied — Deployments, StatefulSets and HelmReleases alike. That is why a stalled HelmRelease (e.g. OpenBao RetriesExceeded) surfaces as `infrastructure-controllers HealthCheckFailed` and fails the reconcile. Flux's Ready/Stalled conditions are the authoritative, heuristic-free signal, and they catch pre-existing unhealth the event marker never could. Remove the action and its three call sites (ci.yaml system-test + merge_group, cd.yaml). The reconcile step is now the deploy gate; the existing "Diagnose Flux on failure" steps still fire on its failure. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR removes the check-event-warnings deploy gate from CI/CD and relies on Flux reconciliation status (via ksail workload reconcile / ksail-cluster’s reconcile) as the authoritative deployment health signal.
Changes:
- Removed the
check-event-warningscomposite action implementation. - Dropped all workflow call sites in CI (
system-test,merge_groupprod deploy) and CD (tag-based prod deploy). - Kept existing “Diagnose Flux on failure” steps intact and still tied to reconcile failure conditions.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| .github/workflows/ci.yaml | Removes the post-reconcile event-warning gate from system-test and prod deploy flows. |
| .github/workflows/cd.yaml | Removes the post-reconcile event-warning gate from tag-based prod deploy flow. |
| .github/actions/check-event-warnings/action.yaml | Deletes the composite action that previously sampled Warning events and failed the deploy. |
Contributor
|
🎉 This PR is included in version 1.15.0 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
The
check-event-warningscomposite action sampled Warning events after a settle and failed the deploy if any fired since a marker. Two problems:Fragile (false positives). It records the marker before the settle and flags any Warning since then — so a one-shot warning emitted during the settle that self-heals is still flagged, contradicting its own docstring. Concretely, the #1661
spread-podspolicy (and any controller config change) restarts the Flux controllers; their/readyzreturnsconnection refusedfor a few seconds during the restart, and that tripped the gate even though the controllers were healthy seconds later.Redundant.
ksail workload reconcilealready "triggers reconciliation and waits for completion, tracking the OCIRepository and each Kustomization individually" with a timeout. And every Flux Kustomization in this repo —— only reports
Readyonce Flux's own health checks pass on everything it applied: Deployments, StatefulSets, and HelmReleases. That's why a stalled HelmRelease (e.g. the recent OpenBaoRetriesExceeded) surfaces asinfrastructure-controllers HealthCheckFailedand fails the reconcile. Flux'sReady/Stalledconditions are the authoritative, heuristic-free signal — and they also catch pre-existing unhealth the event marker never could.What
.github/actions/check-event-warnings/and its three call sites:ci.yaml(system-test + merge_group prod deploy) andcd.yaml(tag deploy).ksail workload reconcilestep is now the deploy gate. The existing🩺 Diagnose Flux on failuresteps are unchanged and still fire on its failure (steps.reconcile.outcome/conclusion == 'failure').No new gate code is added — the trustworthy signal (Flux reconcile status, via ksail's per-Kustomization wait) already existed; this just stops masking it with a fragile event heuristic. If we ever want a stronger post-reconcile assertion, the right home is
ksail(reconcile/wait), not bespoke CI bash.Validation
git status: 1 deletion + 2 workflow edits; diffs keep the diagnose steps intact and correctly wired to thereconcilestep.Supersedes #1677 (which fixed the gate's marker timing) — removing the gate makes that fix moot.