KubernetesExecutor: surface pod log tail for generic Failed-state pods #66795

@1fanwang

Apache Airflow version

main (3.x)

Description

When a KubernetesExecutor-managed worker pod terminates with exit code 1 and reason "Error", and the container status carries no message (the common case for Python import errors at task-runner startup), AirflowKubernetesScheduler._get_pod_failure_reason() returns only a string like:

Pod base reason: Error

This is the entire payload that reaches the scheduler log and the info field of the task's event_buffer entry. Operators then have to chase the pod logs out-of-band (kubectl, audit logs, a log aggregation pipeline) to find the actual traceback.

Use case / motivation

For "generic" failures (no container.status.message), optionally append the last N lines of the pod's logs to the failure reason string. Two new opt-in config keys on [kubernetes_executor]:

  • failure_pod_log_lines (int, default 0 = disabled, recommended 50-100)
  • failure_log_read_timeout (int seconds, default 5)

When failure_pod_log_lines > 0 and the failure is "generic", call CoreV1Api.read_namespaced_pod_log(..., tail_lines=N, _request_timeout=T) and append the result. Wrap in try/except so a read-log failure never propagates out of the failure handler.
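
A minimal sketch of the proposed behavior, to make the control flow concrete. The helper name append_pod_log_tail, its parameters, the "base" container default, and the format of the appended text are all hypothetical and not taken from the Airflow codebase. The only real API used is CoreV1Api.read_namespaced_pod_log with its tail_lines and _request_timeout arguments. The client is passed in, so the function works with any object exposing that method.

```python
from __future__ import annotations

import logging
from typing import Any

log = logging.getLogger(__name__)


def append_pod_log_tail(
    kube_client: Any,
    failure_reason: str,
    pod_name: str,
    namespace: str,
    container_message: str | None,
    failure_pod_log_lines: int = 0,
    failure_log_read_timeout: int = 5,
    container: str = "base",  # assumption: name of the task container
) -> str:
    """Return failure_reason, optionally extended with the pod's log tail."""
    # Opt-in only (0 disables), and only for "generic" failures where the
    # container status carries no message of its own.
    if failure_pod_log_lines <= 0 or container_message:
        return failure_reason

    try:
        tail = kube_client.read_namespaced_pod_log(
            name=pod_name,
            namespace=namespace,
            container=container,
            tail_lines=failure_pod_log_lines,
            _request_timeout=failure_log_read_timeout,
        )
    except Exception:
        # A failed log read must never propagate out of the failure handler.
        log.warning("Could not read log tail for pod %s", pod_name, exc_info=True)
        return failure_reason

    if not tail:
        return failure_reason

    # Hypothetical output format; the real one is up to the implementation.
    return (
        f"{failure_reason}\n"
        f"Last {failure_pod_log_lines} log lines:\n{tail.rstrip()}"
    )
```

With failure_pod_log_lines left at 0, the function returns the reason string unchanged and makes no API call, so existing deployments would see no change in behavior or API-server load.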

Related issues

I have not found a tracking issue for this; happy to be pointed at one if it exists.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
