Fix KubernetesJobOperator failing when pods are deleted after job completion #63569

Merged
jscheffl merged 2 commits into apache:main from ShubhamGondane:fix-k8s-job-operator-pod-not-found
Mar 14, 2026

@ShubhamGondane
Contributor

When a Kubernetes Job completes but its pod is deleted before Airflow fetches the logs (e.g., by the cluster autoscaler), execute_complete fails with an unhandled ApiException(404). Task retries also fail repeatedly, since they re-enter execute_complete and hit the same error.

This fix catches the 404 and skips log retrieval for deleted pods, resolving both the initial failure and the retry loop.
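
For readers who want the shape of the change without opening the diff, here is a minimal sketch of the pattern described above. It is not the merged code: `collect_logs`, `get_pod`, and `read_logs` are hypothetical names, and the `ApiException` class is a local stand-in for `kubernetes.client.rest.ApiException` (only its `status` attribute is assumed), so the snippet runs without the Kubernetes client installed.

```python
class ApiException(Exception):
    """Local stand-in for kubernetes.client.rest.ApiException.

    Assumption: only the `status` attribute matters for this sketch.
    """

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


def collect_logs(pod_names, get_pod, read_logs):
    """Fetch logs for each pod, skipping pods that no longer exist.

    `get_pod` and `read_logs` are hypothetical callables supplied by the
    caller; they are not the operator's real hook methods.
    """
    collected = {}
    skipped = []
    for name in pod_names:
        try:
            pod = get_pod(name)
        except ApiException as exc:
            if exc.status == 404:
                # Pod was deleted after the job finished: skip its logs.
                skipped.append(name)
                continue
            # Any other API error is still a real failure.
            raise
        collected[name] = read_logs(pod)
    return collected, skipped
```

Only a 404 is swallowed; other API errors still propagate, so a genuine failure such as a permissions problem would not be hidden by the skip.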

Note: whether retries of deferred tasks should re-run execute instead of execute_complete is a broader SDK-level concern and out of scope here.

closes: #56693

Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code

Generated-by: Claude Code following the guidelines

…pletion

Handle 404 ApiException in execute_complete to skip log retrieval gracefully instead of failing the task when pods are cleaned up before Airflow fetches logs.
Member

@XD-DENG XD-DENG left a comment

Nit on test coverage: Could you add a test case with multiple pods (e.g., pod_names=["pod-1", "pod-2"]) where get_pod returns a 404 for one pod but succeeds for the other? This would exercise the continue behavior in the loop and confirm that logs are still written for surviving pods while the deleted one is skipped. That's the more realistic scenario for jobs with parallelism > 1.

Member

@XD-DENG XD-DENG left a comment

Overall a nice fix.

@ShubhamGondane
Contributor Author

Added a multi-pod test with two pods where one returns 404 and the other succeeds. It verifies that the loop continues past the deleted pod and still writes logs for the surviving one.
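
A test of that kind might look roughly like the sketch below. This is an illustration, not the test added in the PR: `write_logs_for` is a hypothetical function under test, `ApiException` is a local stand-in for the Kubernetes client exception, and the mocks come from the standard library's `unittest.mock`.

```python
from unittest import mock


class ApiException(Exception):
    """Local stand-in for kubernetes.client.rest.ApiException (assumption)."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


def write_logs_for(pod_names, get_pod, write_logs):
    """Hypothetical function under test: write logs, skipping deleted pods."""
    for name in pod_names:
        try:
            pod = get_pod(name)
        except ApiException as exc:
            if exc.status != 404:
                raise
            continue  # pod is gone, move on to the next one
        write_logs(pod)


def test_deleted_pod_is_skipped_and_survivor_is_logged():
    surviving_pod = mock.sentinel.pod_2
    # First lookup raises a 404, second lookup returns the surviving pod.
    get_pod = mock.Mock(side_effect=[ApiException(404), surviving_pod])
    write_logs = mock.Mock()

    write_logs_for(["pod-1", "pod-2"], get_pod, write_logs)

    # Both pods were looked up, but only the surviving one had logs written.
    assert get_pod.call_count == 2
    write_logs.assert_called_once_with(surviving_pod)
```

Putting the 404 first in `side_effect` matters: it shows that the loop keeps going after the deleted pod rather than merely tolerating a failure on the last iteration.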

Contributor

@eladkal eladkal left a comment

LGTM

@eladkal eladkal requested a review from XD-DENG March 14, 2026 08:48
@jscheffl jscheffl merged commit 9c8114f into apache:main Mar 14, 2026
109 checks passed

Labels

area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Development

Successfully merging this pull request may close these issues.

KubernetesJobOperator does not recover when pods are deleted on completion

4 participants