Apache Airflow version
main (3.x)
Description
When a KubernetesExecutor-managed worker pod terminates with exit code 1, reason: "Error" and no message set on the container status (the common case for Python import errors at task-runner startup), AirflowKubernetesScheduler._get_pod_failure_reason() returns a string like:
This is the entire payload that makes it into the scheduler log and the task event_buffer info field. Operators then have to chase the pod logs out-of-band (kubectl, audit logs, log aggregation pipeline) to find the actual traceback.
Use case / motivation
For "generic" failures (no container.status.message), optionally append the last N lines of the pod's logs to the failure reason string. Two new opt-in config keys on [kubernetes_executor]:
failure_pod_log_lines (int, default 0 = disabled, recommended 50-100)
failure_log_read_timeout (int seconds, default 5)
When failure_pod_log_lines > 0 and the failure is "generic", call CoreV1Api.read_namespaced_pod_log(..., tail_lines=N, _request_timeout=T) and append the result. Wrap in try/except so a read-log failure never propagates out of the failure handler.
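A minimal sketch of what the log-tail step could look like, not the actual Airflow implementation. The helper name `append_pod_log_tail` and its parameters are hypothetical. The client is assumed to be a `kubernetes.client.CoreV1Api` instance (anything exposing `read_namespaced_pod_log` works), and the caller is assumed to have already decided that the failure is "generic".

```python
from __future__ import annotations

import logging
from typing import Any

log = logging.getLogger(__name__)


def append_pod_log_tail(
    reason: str,
    kube_client: Any,  # assumed: kubernetes.client.CoreV1Api or a compatible object
    pod_name: str,
    namespace: str,
    tail_lines: int = 0,  # would come from [kubernetes_executor] failure_pod_log_lines
    read_timeout: int = 5,  # would come from [kubernetes_executor] failure_log_read_timeout
) -> str:
    """Return ``reason`` with the last ``tail_lines`` of pod logs appended, if readable."""
    if tail_lines <= 0:
        # Feature disabled (the proposed default): leave the reason untouched.
        return reason
    try:
        tail = kube_client.read_namespaced_pod_log(
            name=pod_name,
            namespace=namespace,
            tail_lines=tail_lines,
            _request_timeout=read_timeout,
        )
    except Exception:
        # A failed log read must never propagate out of the failure handler.
        log.warning(
            "Could not read logs for pod %s/%s", namespace, pod_name, exc_info=True
        )
        return reason
    if not tail:
        return reason
    return f"{reason}\nLast {tail_lines} log lines:\n{tail.rstrip()}"
```

With `tail_lines=0` the function is a no-op, so existing deployments would see no change until they opt in. The broad `except Exception` is deliberate here: the log read is best-effort, and the original failure reason is more important than the reason the log read failed.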
Related issues
I have not found a tracking issue for this; happy to be pointed at one if it exists.
Are you willing to submit a PR?
Code of Conduct