Description
Apache Airflow Provider(s)
cncf-kubernetes
Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes>=8.3.2
Apache Airflow version
3.x (also affects 2.x)
Operating System
Linux (any Kubernetes distribution)
Deployment
Astronomer
Deployment details
Kubernetes cluster with Karpenter or similar node autoscaling. Airflow running on Kubernetes with KubernetesPodOperator tasks. Nodes are dynamically created and daemonsets (calico-node, aws-node, falco, etc.) are scheduled on new nodes.
What happened
Kubernetes Pod Operator tasks are failing with "Pod Not Found" (404) errors when pods are preempted by higher-priority daemonset pods on newly created nodes.
Error logs:
[2025-12-02, 11:00:52 UTC] {standard_task_runner.py:110} ERROR - Failed to execute job 267126 for task fresh_squeeze ((404)
Reason: Not Found
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"fresh-squeeze-vqfekgln\" not found","reason":"NotFound","details":{"name":"fresh-squeeze-vqfekgln","kind":"pods"},"code":404}
Root cause:
- New Kubernetes node is created (e.g., via Karpenter autoscaling)
- Kubernetes Pod Operator task pod gets scheduled on the new node immediately
- Daemonset pods (calico-node, aws-node, falco, etc.) begin scheduling on the same node
- Daemonsets have higher priority than task pods
- Task pod gets preempted by daemonsets
- When the operator tries to read the pod status, it receives a 404 error because the pod no longer exists (illustrated in the sketch after this list)
- Task fails even though this is a transient infrastructure issue
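For illustration, this is roughly how the 404 surfaces at the Kubernetes client level. A minimal sketch, assuming in-cluster credentials; the pod and namespace names are placeholders borrowed from the logs above, and this is not the operator's actual code path:

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
core_v1 = client.CoreV1Api()

try:
    # Placeholder pod/namespace names taken from the logs and events above.
    core_v1.read_namespaced_pod(name="fresh-squeeze-vqfekgln", namespace="frigid-comet-7713")
except ApiException as exc:
    if exc.status == 404:
        # This is the error the operator currently turns into a hard task failure,
        # even though the pod was simply preempted and already deleted.
        print("Pod not found - likely preempted and garbage-collected")
    else:
        raise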
Kubernetes events show:
k get events --field-selector reason=Preempted -A
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
frigid-comet-7713 4m33s Normal Preempted pod/reltio-y1f5wtrb Preempted by pod 38606146-2e79-4ee3-824c-abb1ea96740b on node ip-10-105-68-155.us-west-2.compute.internal
What you think should happen instead
The Kubernetes Pod Operator should automatically retry when encountering 404 errors that occur due to pod preemption. This should be handled internally by the operator, independent of user-configured task retries, since:
- Pod preemption is a transient infrastructure issue, not a task logic failure
- Users may have retries=0 configured in their task definitions
- The pod will typically be rescheduled successfully on a subsequent attempt once daemonsets have stabilized on the node
The operator should:
- Detect 404 errors when reading pod status
- Automatically retry the pod read operation (e.g., 3 attempts with exponential backoff)
- Only fail the task if all internal retries are exhausted
- Provide clear error messages indicating pod preemption as a possible cause (see the event-lookup sketch after this list)
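For the last point, a hedged sketch of how the operator could confirm preemption after a 404 by querying pod events; the helper name is hypothetical and this is not existing provider code:

from kubernetes import client


def pod_was_preempted(core_v1: client.CoreV1Api, name: str, namespace: str) -> bool:
    # Hypothetical helper: look for a Preempted event on the missing pod so the
    # failure message can mention preemption as a likely cause.
    events = core_v1.list_namespaced_event(
        namespace,
        field_selector=f"involvedObject.name={name},reason=Preempted",
    )
    return len(events.items) > 0


# Illustrative use in the operator's 404 handling (not actual provider code):
# if exc.status == 404 and pod_was_preempted(core_v1, pod.metadata.name, pod.metadata.namespace):
#     message = f"Pod {pod.metadata.name} not found; it appears to have been preempted on its node."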
How to reproduce
Prerequisites:
- Kubernetes cluster with node autoscaling (Karpenter, Cluster Autoscaler, etc.)
- Multiple daemonsets configured (calico-node, aws-node, falco, etc.)
- Airflow deployment using KubernetesPodOperator
Steps to reproduce:
- Create a DAG with a KubernetesPodOperator task:
from datetime import datetime

from airflow import DAG  # on Airflow 3 this can also be: from airflow.sdk import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="test_pod_preemption",
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    task = KubernetesPodOperator(
        task_id="test_task",
        name="test-pod",
        image="busybox",
        cmds=["echo", "hello"],
        retries=0,  # Note: even with retries=0, internal retries should handle 404s
    )
- Trigger the DAG during a period of node scaling (or force node creation by scaling up workload)
- Observe that task pods get scheduled on newly created nodes
- Check Kubernetes events for pod preemption:
  kubectl get events --field-selector reason=Preempted -A
- The task will fail with a 404 error when the operator tries to read the preempted pod
Note: This is more likely to occur when:
- Node autoscaling creates new nodes frequently
- There are many daemonsets that must run on every node
- Task pods have lower priority than daemonsets
- Node resources are tight (c6a.xlarge, c6a.2xlarge, etc.)
Anything else
Frequency: This occurs intermittently when new nodes are created and task pods are scheduled before daemonsets stabilize. The frequency depends on:
- Node creation rate (higher with aggressive autoscaling)
- Number of daemonsets
- Node size (smaller nodes more likely to have resource contention)
Impact: Tasks fail unnecessarily due to transient infrastructure issues, requiring manual retries or higher task-level retry configurations.
Proposed Solution:
Add internal retry logic in PodManager.read_pod() and AsyncKubernetesHook.get_pod() methods to automatically retry on 404 errors:
- Retry up to 3 times with exponential backoff (2s, 4s, 8s)
- Only retry on 404 ApiException errors
- Raise PodNotFoundException after all retries are exhausted, with a clear error message
- Works independently of user-configured task retries (a sketch follows this list)
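A minimal sketch of this retry logic using tenacity (which Airflow already depends on); PodNotFoundException and read_pod_with_retry are hypothetical names, not the provider's actual implementation:

from kubernetes.client.rest import ApiException
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential


class PodNotFoundException(Exception):  # hypothetical exception, as proposed above
    """Pod still missing after all internal retries."""


def _is_transient_404(exc: BaseException) -> bool:
    return isinstance(exc, ApiException) and exc.status == 404


@retry(
    retry=retry_if_exception(_is_transient_404),
    wait=wait_exponential(multiplier=2),  # waits of 2s, 4s, 8s between attempts
    stop=stop_after_attempt(4),           # initial attempt + 3 retries
    reraise=True,
)
def _read_pod_once(core_v1, name: str, namespace: str):
    return core_v1.read_namespaced_pod(name=name, namespace=namespace)


def read_pod_with_retry(core_v1, name: str, namespace: str):
    # Only 404s are retried; once attempts are exhausted, raise a dedicated
    # exception with a message that points at preemption as a likely cause.
    try:
        return _read_pod_once(core_v1, name, namespace)
    except ApiException as exc:
        if exc.status == 404:
            raise PodNotFoundException(
                f"Pod {name!r} in namespace {namespace!r} not found after retries; "
                "it may have been preempted (e.g. by daemonset pods on a new node)."
            ) from exc
        raise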
Related Context:
This issue was identified in production environments where Karpenter creates nodes dynamically. The pattern is:
- Node comes online
- Task pods schedule immediately
- Daemonsets begin scheduling shortly after
- Task pods get preempted due to higher daemonset priority
- Operator fails with 404 when trying to read preempted pod
Workaround:
Currently, users must configure high retry counts in their task definitions, but this doesn't help if retries=0 is set, and it's not ideal to retry the entire task execution when only the pod read operation needs retrying. A typical task-level workaround looks like the sketch below.
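For reference, the workaround at the task level looks roughly like this (the retry values are illustrative); note that it re-runs the whole pod rather than just retrying the failed pod read:

from datetime import timedelta

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Inside a DAG context, as in the reproduction DAG above.
task = KubernetesPodOperator(
    task_id="test_task",
    name="test-pod",
    image="busybox",
    cmds=["echo", "hello"],
    retries=3,                          # retry the whole task, not just the pod read
    retry_delay=timedelta(seconds=30),  # illustrative values
    retry_exponential_backoff=True,
)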
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct