feat(k8s): surface pod Warning events on waitForPodPhases timeout #371

Open
remidebette wants to merge 1 commit into actions:main from instadeepai:surface-pod-events

Conversation

@remidebette
Contributor

Summary

When a workflow pod never reaches Running for a reason that lives on the K8s Event resource (FailedScheduling: Too many pods / Insufficient cpu, untolerated taints, FailedMount, …) rather than on the pod object, waitForPodPhases surfaced only a generic timeout. The ephemeral pod is usually pruned before an operator can kubectl describe it, so the diagnostic was lost.

It now makes a best-effort fetch of the pod's most recent Warning events in its catch path and appends up to the 3 newest to the thrown error:

Pod foo is unhealthy with phase status Pending: backoff timeout; events: [FailedScheduling] 0/3 nodes are available: 3 Too many pods.

The lookup never throws — it must not shadow the original failure.
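The selection and formatting step described above can be sketched roughly as follows. This is an illustration, not the code in this PR: the `PodEvent` shape, the `formatWarningEvents` and `eventMillis` names, the string timestamps, and the `; ` separator between multiple events are all assumptions made for the example.

```typescript
// Hypothetical, trimmed-down event shape: only the fields this sketch needs.
// Timestamps are ISO 8601 strings here purely for illustration.
interface PodEvent {
  type?: string
  reason?: string
  message?: string
  lastTimestamp?: string
  eventTime?: string
}

// Ordering key: lastTimestamp, falling back to eventTime when it is absent.
// Events with no parseable timestamp sort last.
function eventMillis(e: PodEvent): number {
  const raw = e.lastTimestamp ?? e.eventTime
  const ms = raw ? Date.parse(raw) : NaN
  return Number.isNaN(ms) ? 0 : ms
}

// Build the error suffix from up to `limit` newest Warning events.
// Returns an empty string when there is nothing to report.
function formatWarningEvents(events: PodEvent[], limit = 3): string {
  const newest = events
    .filter(e => e.type === 'Warning')
    .sort((a, b) => eventMillis(b) - eventMillis(a))
    .slice(0, limit)
    .map(e => `[${e.reason ?? 'Unknown'}] ${e.message ?? ''}`.trim())
  return newest.length > 0 ? `; events: ${newest.join('; ')}` : ''
}
```

Returning an empty string for the no-events case lets the caller concatenate the suffix unconditionally, so the original message is unchanged when there is nothing to add.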

⚠️ Permissions — needs a maintainer call

Reading events needs events: list. I intentionally did not add it to requiredPermissions: isAuthPermissionsOK() hard-requires every entry, so adding it would make prepareJob fail for every existing least-privilege deployment whose Role lacks events. A 403 here is therefore swallowed and simply yields no extra detail (non-breaking).
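The "swallow and carry on" behaviour can be illustrated with a minimal sketch. The `EventLister` type and `withEventDetail` function are hypothetical names for this example only; in the real hook the lister would wrap the Kubernetes client call, which is deliberately left out here.

```typescript
// Hypothetical lister: resolves to already-formatted event lines,
// or rejects (for example with a 403 when the Role lacks events: list).
type EventLister = () => Promise<string[]>

// Enrich the original error with event detail when available.
// Any lookup failure is swallowed so it cannot shadow the original error.
async function withEventDetail(
  original: Error,
  listEvents: EventLister
): Promise<Error> {
  let detail: string[] = []
  try {
    detail = await listEvents()
  } catch {
    // Best effort only: a failed lookup simply yields no extra detail.
  }
  if (detail.length === 0) {
    return original
  }
  return new Error(`${original.message}; events: ${detail.join('; ')}`)
}
```

With this shape, a deployment whose Role has no access to events gets exactly the error it got before, which is what makes the change non-breaking.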

Trade-off: to actually benefit, the runner's Role needs events: get,list — a companion change in the actions-runner-controller chart (separate repo). Happy to instead add it to requiredPermissions if you'd prefer the contract be explicit and coordinate the chart update.

Scope

Third piece of the k8s error-surfacing work after #341 and #364 (both merged). Complements #336, which reads containerStatuses[].state.waiting.{reason,message} — a different resource (the image-pull case). No overlap with #370.

Tests

New pod-events-test.ts (5 cases): events appended; correct namespace + fieldSelector; top-3 newest-first (incl. the eventTime fallback); no-events → no suffix; and a failed event lookup never shadows the original error. All unit suites pass; lint / format-check / tsc + ncc clean. (The 5 cluster-dependent integration suites are unchanged and fail only for lack of a live cluster — same as main.)

Closes #366

🤖 Generated with Claude Code

When a workflow pod never reaches Running for a reason that lives on the
K8s Event resource (FailedScheduling "Too many pods" / "Insufficient cpu"
/ untolerated taints, FailedMount, ...) rather than on the pod object,
the hook previously surfaced only a generic timeout. The ephemeral pod is
usually pruned before an operator can `kubectl describe` it, so the
diagnostic was lost.

waitForPodPhases now makes a best-effort fetch of the pod's most recent
Warning events in its catch path and appends up to the 3 newest to the
thrown error, e.g.:

  Pod foo is unhealthy with phase status Pending: backoff timeout; events: [FailedScheduling] 0/3 nodes are available: 3 Too many pods.

The fetch never throws: it must not shadow the original failure, and
reading events needs `events: list` which is intentionally NOT added to
requiredPermissions (doing so would hard-fail prepareJob for existing
least-privilege deployments). A 403 is swallowed and simply yields no
extra detail.

Third piece of the k8s error-surfacing work after actions#341 and actions#364;
complements actions#336 (container waiting reasons, a different resource).

Implements actions#366

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
remidebette requested a review from nikola-jokic as a code owner on June 5, 2026 14:32
@remidebette
Contributor Author

@nikola-jokic What do you think of this direction, which builds on #336?
Specifically, it would require additional Kubernetes permissions for ARC.


Development

Successfully merging this pull request may close these issues.

Surface pod Events (FailedScheduling, etc.) on waitForPodPhases timeout
