
Kubernetes Pod Operator fails with 404 errors when pods are preempted by daemonsets #59626

@hkc-8010

Description

Apache Airflow Provider(s)

cncf-kubernetes

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes>=8.3.2

Apache Airflow version

3.x (also affects 2.x)

Operating System

Linux (any Kubernetes distribution)

Deployment

Astronomer

Deployment details

Kubernetes cluster with Karpenter or similar node autoscaling. Airflow running on Kubernetes with KubernetesPodOperator tasks. Nodes are dynamically created and daemonsets (calico-node, aws-node, falco, etc.) are scheduled on new nodes.

What happened

Kubernetes Pod Operator tasks are failing with "Pod Not Found" (404) errors when pods are preempted by higher-priority daemonset pods on newly created nodes.

Error logs:

[2025-12-02, 11:00:52 UTC] {standard_task_runner.py:110} ERROR - Failed to execute job 267126 for task fresh_squeeze ((404)
Reason: Not Found
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"fresh-squeeze-vqfekgln\" not found","reason":"NotFound","details":{"name":"fresh-squeeze-vqfekgln","kind":"pods"},"code":404}

Root cause:

  1. New Kubernetes node is created (e.g., via Karpenter autoscaling)
  2. Kubernetes Pod Operator task pod gets scheduled on the new node immediately
  3. Daemonset pods (calico-node, aws-node, falco, etc.) begin scheduling on the same node
  4. Daemonsets have higher priority than task pods
  5. Task pod gets preempted by daemonsets
  6. When the operator tries to read the pod status, it receives a 404 error because the pod no longer exists
  7. Task fails even though this is a transient infrastructure issue

Kubernetes events show:

k get events --field-selector reason=Preempted -A
NAMESPACE           LAST SEEN   TYPE     REASON      OBJECT                MESSAGE
frigid-comet-7713   4m33s       Normal   Preempted   pod/reltio-y1f5wtrb   Preempted by pod 38606146-2e79-4ee3-824c-abb1ea96740b on node ip-10-105-68-155.us-west-2.compute.internal
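
To confirm the priority gap described above, the pods on the affected node can be listed together with their priorities. A minimal sketch using the official kubernetes Python client (the node name is taken from the events output above; kubeconfig access is assumed):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

# Node name from the Preempted event above; replace with the affected node.
node = "ip-10-105-68-155.us-west-2.compute.internal"
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node}")
for pod in pods.items:
    # Daemonset pods (calico-node, aws-node, falco, ...) typically run with a
    # high priority class such as system-node-critical, well above an
    # unprioritized task pod.
    print(pod.metadata.namespace, pod.metadata.name,
          pod.spec.priority_class_name, pod.spec.priority)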

What you think should happen instead

The Kubernetes Pod Operator should automatically retry when encountering 404 errors that occur due to pod preemption. This should be handled internally by the operator, independent of user-configured task retries, since:

  1. Pod preemption is a transient infrastructure issue, not a task logic failure
  2. Users may have retries=0 configured in their task definitions
  3. The pod will typically be rescheduled successfully on a subsequent attempt once daemonsets have stabilized on the node

The operator should:

  • Detect 404 errors when reading pod status
  • Automatically retry the pod read operation (e.g., 3 attempts with exponential backoff)
  • Only fail the task if all internal retries are exhausted
  • Provide clear error messages indicating pod preemption as a possible cause
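
For the last point, one way to attribute a 404 to preemption is to look up recent events for the missing pod. A rough sketch with the kubernetes Python client (the namespace and pod name are placeholders taken from the events output above; this is not existing provider behavior):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace, pod_name = "frigid-comet-7713", "reltio-y1f5wtrb"  # placeholders

# After a 404 on the pod read, check for a Preempted event so the failure
# message can point at preemption instead of a generic "pod not found".
events = v1.list_namespaced_event(
    namespace,
    field_selector=f"involvedObject.name={pod_name},reason=Preempted",
)
if events.items:
    print(f"Pod {pod_name} was preempted: {events.items[0].message}")
else:
    print(f"No Preempted event found for pod {pod_name}")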

How to reproduce

Prerequisites:

  • Kubernetes cluster with node autoscaling (Karpenter, Cluster Autoscaler, etc.)
  • Multiple daemonsets configured (calico-node, aws-node, falco, etc.)
  • Airflow deployment using KubernetesPodOperator

Steps to reproduce:

  1. Create a DAG with a KubernetesPodOperator task:
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="test_pod_preemption",
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    task = KubernetesPodOperator(
        task_id="test_task",
        name="test-pod",
        image="busybox",
        cmds=["echo", "hello"],
        retries=0,  # Note: even with retries=0, internal retries should handle 404s
    )
  2. Trigger the DAG during a period of node scaling (or force node creation by scaling up the workload)

  3. Observe that task pods get scheduled on newly created nodes

  4. Check Kubernetes events for pod preemption:

kubectl get events --field-selector reason=Preempted -A

  5. The task will fail with a 404 error when the operator tries to read the preempted pod

Note: This is more likely to occur when:

  • Node autoscaling creates new nodes frequently
  • There are many daemonsets that must run on every node
  • Task pods have lower priority than daemonsets
  • Node resources are tight (e.g., on smaller instance types such as c6a.xlarge or c6a.2xlarge)

Anything else

Frequency: This occurs intermittently when new nodes are created and task pods are scheduled before daemonsets stabilize. The frequency depends on:

  • Node creation rate (higher with aggressive autoscaling)
  • Number of daemonsets
  • Node size (smaller nodes are more likely to have resource contention)

Impact: Tasks fail unnecessarily due to transient infrastructure issues, requiring manual retries or higher task-level retry configurations.

Proposed Solution:
Add internal retry logic in PodManager.read_pod() and AsyncKubernetesHook.get_pod() methods to automatically retry on 404 errors:

  • Retry up to 3 times with exponential backoff (2s, 4s, 8s)
  • Only retry on 404 ApiException errors
  • Raise PodNotFoundException after all retries exhausted with clear error message
  • Works independently of user-configured task retries
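
A minimal sketch of what such a retry wrapper could look like (the 3-retry count, the 2s/4s/8s backoff, and PodNotFoundException are taken from this proposal, not from the current provider code):

import time

from kubernetes.client.rest import ApiException


class PodNotFoundException(Exception):
    """Proposed: raised once all internal 404 retries are exhausted."""


def read_pod_with_retry(core_v1, name, namespace, retries=3, base_delay=2.0):
    """Read a pod, retrying 404s that may be caused by transient preemption."""
    for attempt in range(retries + 1):
        try:
            return core_v1.read_namespaced_pod(name=name, namespace=namespace)
        except ApiException as exc:
            if exc.status != 404:
                raise  # only 404s are treated as potentially transient
            if attempt == retries:
                raise PodNotFoundException(
                    f"Pod {namespace}/{name} not found after {retries} retries; "
                    "it may have been preempted (e.g. by daemonset pods on a new node)."
                ) from exc
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s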

Related Context:
This issue was identified in production environments where Karpenter creates nodes dynamically. The pattern is:

  1. Node comes online
  2. Task pods schedule immediately
  3. Daemonsets begin scheduling shortly after
  4. Task pods get preempted due to higher daemonset priority
  5. Operator fails with 404 when trying to read preempted pod

Workaround:
Currently, users must configure high retry counts in their task definitions. This does not help when retries=0 is set, and it is not ideal to re-run the entire task when only the pod read operation needs retrying.
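
A sketch of that workaround at the task level (the retry values are illustrative):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="pod_preemption_workaround",
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    task = KubernetesPodOperator(
        task_id="test_task",
        name="test-pod",
        image="busybox",
        cmds=["echo", "hello"],
        retries=3,                         # re-run the whole task if the pod is preempted
        retry_delay=timedelta(minutes=1),  # give daemonsets time to stabilize on the new node
    )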

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
