feat(k8s): surface pod Warning events on waitForPodPhases timeout #371
Open
remidebette wants to merge 1 commit into
When a workflow pod never reaches Running for a reason that lives on the K8s Event resource (FailedScheduling "Too many pods" / "Insufficient cpu" / untolerated taints, FailedMount, ...) rather than on the pod object, the hook previously surfaced only a generic timeout. The ephemeral pod is usually pruned before an operator can `kubectl describe` it, so the diagnostic was lost.

waitForPodPhases now makes a best-effort fetch of the pod's most recent Warning events in its catch path and appends up to the 3 newest to the thrown error, e.g.:

    Pod foo is unhealthy with phase status Pending: backoff timeout; events: [FailedScheduling] 0/3 nodes are available: 3 Too many pods.

The fetch never throws: it must not shadow the original failure, and reading events needs `events: list`, which is intentionally NOT added to requiredPermissions (doing so would hard-fail prepareJob for existing least-privilege deployments). A 403 is swallowed and simply yields no extra detail.

Third piece of the k8s error-surfacing work after actions#341 and actions#364; complements actions#336 (container waiting reasons, a different resource).

Implements actions#366

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@nikola-jokic What would you think of this direction, which builds on #336?
Summary

When a workflow pod never reaches `Running` for a reason that lives on the K8s Event resource (`FailedScheduling`: Too many pods / Insufficient cpu, untolerated taints, `FailedMount`, …) rather than on the pod object, `waitForPodPhases` surfaced only a generic timeout. The ephemeral pod is usually pruned before an operator can `kubectl describe` it, so the diagnostic was lost.

It now makes a best-effort fetch of the pod's most recent `Warning` events in its catch path and appends up to the 3 newest to the thrown error, for example: `Pod foo is unhealthy with phase status Pending: backoff timeout; events: [FailedScheduling] 0/3 nodes are available: 3 Too many pods.`

The lookup never throws: it must not shadow the original failure.
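For reviewers who want the shape of the logic without opening the diff, here is a minimal sketch of the behaviour described above. It is not the code in this PR. `PodEvent`, `eventTimeMs`, `formatWarningEvents` and `describeWarningEvents` are hypothetical names, `lastTimestamp` as the primary sort key and `; ` as the separator between multiple events are assumptions, and the event lookup is injected as a plain function so the sketch does not depend on any client library.

```typescript
// Minimal shape of the Event fields used here (assumption: the real objects
// come from the Kubernetes client and carry many more fields).
interface PodEvent {
  type?: string
  reason?: string
  message?: string
  lastTimestamp?: string | Date
  eventTime?: string | Date
}

// Hypothetical helper: sort key in milliseconds. Assumes lastTimestamp is
// preferred and eventTime is the fallback; missing or unparsable values sort last.
function eventTimeMs(e: PodEvent): number {
  const t = e.lastTimestamp ?? e.eventTime
  if (!t) return 0
  const ms = t instanceof Date ? t.getTime() : Date.parse(t)
  return Number.isNaN(ms) ? 0 : ms
}

// Hypothetical helper: keeps Warning events, newest first, at most `limit`,
// and renders them as a suffix for the error message ('' when none).
function formatWarningEvents(events: PodEvent[], limit = 3): string {
  const newest = events
    .filter(e => e.type === 'Warning')
    .sort((a, b) => eventTimeMs(b) - eventTimeMs(a))
    .slice(0, limit)
  if (newest.length === 0) return ''
  const parts = newest.map(e =>
    `[${e.reason ?? 'Unknown'}] ${e.message ?? ''}`.trim()
  )
  return `; events: ${parts.join('; ')}` // separator is an assumption
}

// Hypothetical wrapper: best-effort lookup that never throws, so a failed
// lookup (for example a 403 from a Role without events access) adds no detail.
async function describeWarningEvents(
  listEvents: () => Promise<PodEvent[]>
): Promise<string> {
  try {
    return formatWarningEvents(await listEvents())
  } catch {
    return ''
  }
}
```

In a catch path, the returned suffix would simply be concatenated onto the original error message before rethrowing, so an empty string leaves the message unchanged.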
Reading events needs `events: list`. I intentionally did not add it to `requiredPermissions`: `isAuthPermissionsOK()` hard-requires every entry, so adding it would make `prepareJob` fail for every existing least-privilege deployment whose Role lacks events. A 403 here is therefore swallowed and simply yields no extra detail (non-breaking).

Trade-off: to actually benefit, the runner's Role needs `events: get,list`, which is a companion change in the actions-runner-controller chart (separate repo). Happy to instead add it to `requiredPermissions` if you'd prefer the contract be explicit and coordinate the chart update.

Scope
Third piece of the k8s error-surfacing work after #341 and #364 (both merged). Complements #336, which reads `containerStatuses[].state.waiting.{reason,message}`, a different resource (the image-pull case). No overlap with #370.

Tests
New `pod-events-test.ts` (5 cases): events appended; correct namespace + `fieldSelector`; top-3 newest-first (incl. the `eventTime` fallback); no-events → no suffix; and a failed event lookup never shadows the original error. All unit suites pass; `lint` / `format-check` / `tsc + ncc` clean. (The 5 cluster-dependent integration suites are unchanged and fail only for lack of a live cluster, same as `main`.)

Closes #366
🤖 Generated with Claude Code