feat(k8s): surface pod Warning events on waitForPodPhases timeout by remidebette · Pull Request #371 · actions/runner-container-hooks

remidebette · 2026-06-05T14:32:41Z

Summary

When a workflow pod never reaches Running for a reason that lives on the K8s Event resource (FailedScheduling: Too many pods / Insufficient cpu, untolerated taints, FailedMount, …) rather than on the pod object, waitForPodPhases surfaced only a generic timeout. The ephemeral pod is usually pruned before an operator can kubectl describe it, so the diagnostic was lost.

waitForPodPhases now makes a best-effort fetch of the pod's most recent Warning events in its catch path and appends up to the 3 newest to the thrown error:

Pod foo is unhealthy with phase status Pending: backoff timeout; events: [FailedScheduling] 0/3 nodes are available: 3 Too many pods.
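
For readers skimming the description, here is a minimal sketch of how such a suffix could be built. It is not the code in this PR: the `PodEvent` shape, the `formatWarningEvents` name, the `lastTimestamp`-then-`eventTime` ordering, and the ` | ` separator between events are all assumptions made for illustration.

```typescript
// Hypothetical minimal shape of a Kubernetes Event; only the fields this sketch reads.
interface PodEvent {
  type?: string
  reason?: string
  message?: string
  lastTimestamp?: string // ISO 8601
  eventTime?: string // ISO 8601, assumed fallback when lastTimestamp is absent
}

// Milliseconds since epoch, used for ordering; events without a usable timestamp sort last.
function eventMillis(e: PodEvent): number {
  const raw = e.lastTimestamp ?? e.eventTime
  const ms = raw ? Date.parse(raw) : NaN
  return Number.isNaN(ms) ? 0 : ms
}

// Returns '' when there is nothing to add, otherwise '; events: [Reason] message | ...'.
function formatWarningEvents(events: PodEvent[], limit = 3): string {
  const newest = events
    .filter(e => e.type === 'Warning') // filter() copies, so the caller's array is not reordered
    .sort((a, b) => eventMillis(b) - eventMillis(a)) // newest first
    .slice(0, limit)
    .map(e => `[${e.reason ?? 'Unknown'}] ${e.message ?? ''}`.trim())
  return newest.length > 0 ? `; events: ${newest.join(' | ')}` : ''
}
```

Returning an empty string for the no-events case keeps the original error message byte-for-byte unchanged when there is nothing to report.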

The lookup never throws — it must not shadow the original failure.

⚠️ Permissions — needs a maintainer call

Reading events needs events: list. I intentionally did not add it to requiredPermissions: isAuthPermissionsOK() hard-requires every entry, so adding it would make prepareJob fail for every existing least-privilege deployment whose Role lacks events. A 403 here is therefore swallowed and simply yields no extra detail (non-breaking).
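
The "swallow and carry on" behaviour can be sketched as below. This is an illustration under assumptions, not the PR's implementation: `appendEventDetail` and the injected `lookup` callback are hypothetical names, and the real hook may structure its catch path differently.

```typescript
// Best-effort enrichment: any failure inside lookup() (for example a 403 because the
// Role lacks `events: list`) is ignored so that the original error is never shadowed.
async function appendEventDetail(
  original: Error,
  lookup: () => Promise<string> // hypothetical: resolves to a suffix such as '; events: ...'
): Promise<Error> {
  let suffix = ''
  try {
    suffix = await lookup()
  } catch {
    // Intentionally empty: the lookup is diagnostic only.
  }
  // With no extra detail, hand back the very same error object.
  return suffix ? new Error(`${original.message}${suffix}`) : original
}
```

Injecting the lookup as a callback keeps the sketch independent of any particular Kubernetes client and makes the "lookup fails" path easy to exercise in a unit test.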

Trade-off: to actually benefit, the runner's Role needs events: get,list — a companion change in the actions-runner-controller chart (separate repo). Happy to instead add it to requiredPermissions if you'd prefer the contract be explicit and coordinate the chart update.

Scope

Third piece of the k8s error-surfacing work after #341 and #364 (both merged). Complements #336, which reads containerStatuses[].state.waiting.{reason,message} — a different resource (the image-pull case). No overlap with #370.

Tests

New pod-events-test.ts (5 cases): events appended; correct namespace + fieldSelector; top-3 newest-first (incl. the eventTime fallback); no-events → no suffix; and a failed event lookup never shadows the original error. All unit suites pass; lint / format-check / tsc + ncc clean. (The 5 cluster-dependent integration suites are unchanged and fail only for lack of a live cluster — same as main.)
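
For context on the fieldSelector case: the description does not spell out the selector string, so the following is only a guess at its general shape. `involvedObject.kind`, `involvedObject.name`, and `type` are field selectors that the Kubernetes API supports for core/v1 Events; the `podWarningEventSelector` helper name is hypothetical.

```typescript
// Builds a field selector that narrows an Events list call to Warning events for one pod.
// The namespace is assumed to be passed separately to the namespaced list call.
function podWarningEventSelector(podName: string): string {
  return [
    'involvedObject.kind=Pod',
    `involvedObject.name=${podName}`,
    'type=Warning'
  ].join(',')
}
```

Filtering server-side this way keeps the response small, which matters because the lookup runs on an already-failing path.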

Closes #366

🤖 Generated with Claude Code

When a workflow pod never reaches Running for a reason that lives on the K8s Event resource (FailedScheduling "Too many pods" / "Insufficient cpu" / untolerated taints, FailedMount, ...) rather than on the pod object, the hook previously surfaced only a generic timeout. The ephemeral pod is usually pruned before an operator can `kubectl describe` it, so the diagnostic was lost.

waitForPodPhases now makes a best-effort fetch of the pod's most recent Warning events in its catch path and appends up to the 3 newest to the thrown error, e.g.:

Pod foo is unhealthy with phase status Pending: backoff timeout; events: [FailedScheduling] 0/3 nodes are available: 3 Too many pods.

The fetch never throws: it must not shadow the original failure, and reading events needs `events: list`, which is intentionally NOT added to requiredPermissions (doing so would hard-fail prepareJob for existing least-privilege deployments). A 403 is swallowed and simply yields no extra detail.

Third piece of the k8s error-surfacing work after actions#341 and actions#364; complements actions#336 (container waiting reasons, a different resource).

Implements actions#366

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

remidebette · 2026-06-05T14:34:23Z

@nikola-jokic What would you think of this direction, which builds on #336?
Specifically, this would require more Kubernetes permissions for ARC.

remidebette requested a review from nikola-jokic as a code owner June 5, 2026 14:32

remidebette wants to merge 1 commit into actions:main from instadeepai:surface-pod-events
