
Kubernetes Pod Operator fails with 404 errors when pods are preempted by daemonsets #59626

@hkc-8010

Description

Apache Airflow Provider(s)

cncf-kubernetes

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes>=8.3.2

Apache Airflow version

3.x (also affects 2.x)

Operating System

Linux (any Kubernetes distribution)

Deployment

Astronomer

Deployment details

Kubernetes cluster with Karpenter or similar node autoscaling. Airflow running on Kubernetes with KubernetesPodOperator tasks. Nodes are dynamically created and daemonsets (calico-node, aws-node, falco, etc.) are scheduled on new nodes.

What happened

Kubernetes Pod Operator tasks are failing with "Pod Not Found" (404) errors when pods are preempted by higher-priority daemonset pods on newly created nodes.

Error logs:

[2025-12-02, 11:00:52 UTC] {standard_task_runner.py:110} ERROR - Failed to execute job 267126 for task fresh_squeeze ((404)
Reason: Not Found
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"fresh-squeeze-vqfekgln\" not found","reason":"NotFound","details":{"name":"fresh-squeeze-vqfekgln","kind":"pods"},"code":404}

Root cause:

  1. New Kubernetes node is created (e.g., via Karpenter autoscaling)
  2. Kubernetes Pod Operator task pod gets scheduled on the new node immediately
  3. Daemonset pods (calico-node, aws-node, falco, etc.) begin scheduling on the same node
  4. Daemonsets have higher priority than task pods
  5. Task pod gets preempted by daemonsets
  6. When the operator tries to read the pod status, it receives a 404 error because the pod no longer exists
  7. Task fails even though this is a transient infrastructure issue

Kubernetes events show:

k get events --field-selector reason=Preempted -A
NAMESPACE           LAST SEEN   TYPE     REASON      OBJECT                MESSAGE
frigid-comet-7713   4m33s       Normal   Preempted   pod/reltio-y1f5wtrb   Preempted by pod 38606146-2e79-4ee3-824c-abb1ea96740b on node ip-10-105-68-155.us-west-2.compute.internal
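
To confirm the priority gap described above, the pods on the affected node can be listed together with their priorities. A minimal sketch using the official kubernetes Python client (the node name is taken from the events output above; kubeconfig access is assumed):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

# Node name from the Preempted event above; replace with the affected node.
node = "ip-10-105-68-155.us-west-2.compute.internal"
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node}")
for pod in pods.items:
    # Daemonset pods (calico-node, aws-node, falco, ...) typically run with a
    # high priority class such as system-node-critical, well above an
    # unprioritized task pod.
    print(pod.metadata.namespace, pod.metadata.name,
          pod.spec.priority_class_name, pod.spec.priority)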

What you think should happen instead

The Kubernetes Pod Operator should automatically retry when encountering 404 errors that occur due to pod preemption. This should be handled internally by the operator, independent of user-configured task retries, since:

  1. Pod preemption is a transient infrastructure issue, not a task logic failure
  2. Users may have retries=0 configured in their task definitions
  3. The pod will typically be rescheduled successfully on a subsequent attempt once daemonsets have stabilized on the node

The operator should:

  • Detect 404 errors when reading pod status
  • Automatically retry the pod read operation (e.g., 3 attempts with exponential backoff)
  • Only fail the task if all internal retries are exhausted
  • Provide clear error messages indicating pod preemption as a possible cause
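
For the last point, one way to attribute a 404 to preemption is to look up recent events for the missing pod. A rough sketch with the kubernetes Python client (the namespace and pod name are placeholders taken from the events output above; this is not existing provider behavior):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace, pod_name = "frigid-comet-7713", "reltio-y1f5wtrb"  # placeholders

# After a 404 on the pod read, check for a Preempted event so the failure
# message can point at preemption instead of a generic "pod not found".
events = v1.list_namespaced_event(
    namespace,
    field_selector=f"involvedObject.name={pod_name},reason=Preempted",
)
if events.items:
    print(f"Pod {pod_name} was preempted: {events.items[0].message}")
else:
    print(f"No Preempted event found for pod {pod_name}")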

How to reproduce

Prerequisites:

  • Kubernetes cluster with node autoscaling (Karpenter, Cluster Autoscaler, etc.)
  • Multiple daemonsets configured (calico-node, aws-node, falco, etc.)
  • Airflow deployment using KubernetesPodOperator

Steps to reproduce:

  1. Create a DAG with a KubernetesPodOperator task:
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="test_pod_preemption",
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    task = KubernetesPodOperator(
        task_id="test_task",
        name="test-pod",
        image="busybox",
        cmds=["echo", "hello"],
        retries=0,  # Note: even with retries=0, internal retries should handle 404s
    )
  2. Trigger the DAG during a period of node scaling (or force node creation by scaling up the workload)

  3. Observe that task pods get scheduled on newly created nodes

  4. Check Kubernetes events for pod preemption:

kubectl get events --field-selector reason=Preempted -A

  5. The task will fail with a 404 error when the operator tries to read the preempted pod

Note: This is more likely to occur when:

  • Node autoscaling creates new nodes frequently
  • There are many daemonsets that must run on every node
  • Task pods have lower priority than daemonsets
  • Node resources are tight (e.g., on smaller instance types such as c6a.xlarge or c6a.2xlarge)

Anything else

Frequency: This occurs intermittently when new nodes are created and task pods are scheduled before daemonsets stabilize. The frequency depends on:

  • Node creation rate (higher with aggressive autoscaling)
  • Number of daemonsets
  • Node size (smaller nodes are more likely to have resource contention)

Impact: Tasks fail unnecessarily due to transient infrastructure issues, requiring manual retries or higher task-level retry configurations.

Proposed Solution:
Add internal retry logic in PodManager.read_pod() and AsyncKubernetesHook.get_pod() methods to automatically retry on 404 errors:

  • Retry up to 3 times with exponential backoff (2s, 4s, 8s)
  • Only retry on 404 ApiException errors
  • Raise PodNotFoundException after all retries exhausted with clear error message
  • Works independently of user-configured task retries
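
A minimal sketch of what such a retry wrapper could look like (the 3-retry count, the 2s/4s/8s backoff, and PodNotFoundException are taken from this proposal, not from the current provider code):

import time

from kubernetes.client.rest import ApiException


class PodNotFoundException(Exception):
    """Proposed: raised once all internal 404 retries are exhausted."""


def read_pod_with_retry(core_v1, name, namespace, retries=3, base_delay=2.0):
    """Read a pod, retrying 404s that may be caused by transient preemption."""
    for attempt in range(retries + 1):
        try:
            return core_v1.read_namespaced_pod(name=name, namespace=namespace)
        except ApiException as exc:
            if exc.status != 404:
                raise  # only 404s are treated as potentially transient
            if attempt == retries:
                raise PodNotFoundException(
                    f"Pod {namespace}/{name} not found after {retries} retries; "
                    "it may have been preempted (e.g. by daemonset pods on a new node)."
                ) from exc
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s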

Related Context:
This issue was identified in production environments where Karpenter creates nodes dynamically. The pattern is:

  1. Node comes online
  2. Task pods schedule immediately
  3. Daemonsets begin scheduling shortly after
  4. Task pods get preempted due to higher daemonset priority
  5. Operator fails with 404 when trying to read preempted pod

Workaround:
Currently, users must configure high retry counts in their task definitions. This does not help when retries=0 is set, and it is not ideal to re-run the entire task when only the pod read operation needs retrying.
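
A sketch of that workaround at the task level (the retry values are illustrative):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="pod_preemption_workaround",
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    task = KubernetesPodOperator(
        task_id="test_task",
        name="test-pod",
        image="busybox",
        cmds=["echo", "hello"],
        retries=3,                         # re-run the whole task if the pod is preempted
        retry_delay=timedelta(minutes=1),  # give daemonsets time to stabilize on the new node
    )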

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
