Fix monitoring-pod leak in KubernetesJobOperator #67333
Open
jykae wants to merge 1 commit into
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
* fix(providers/cncf/kubernetes): clean up monitoring pods in KubernetesJobOperator

  KubernetesJobOperator inherited from KubernetesPodOperator but overrode execute() without calling post_complete_action(), so the monitoring / log-streaming pods discovered via get_pods() were never deleted. These pods have no ownerReferences to the V1Job, so ttl_seconds_after_finished and the Foreground cascade in on_kill don't reap them either.

  - execute() and execute_complete() now wrap their work in try/finally and call post_complete_action() for each pod in self.pods. on_finish_action (delete_pod / delete_succeeded_pod / keep_pod) is now honoured.
  - on_kill() additionally calls pod_manager.delete_pod() for each monitoring pod (the Job's foreground cascade doesn't reach them).
  - Per-pod cleanup errors are logged but never mask the in-flight exception, so Job-level failures keep propagating.
  - execute_complete() resolves monitoring pods once and shares the lookup between the log-retrieval path and the cleanup path.
  - Added unit tests, a bugfix newsfragment, and an operators.rst section documenting the cleanup contract.

* Address code review feedback: remove dead PodNotFoundException check, drop unused import, relax pod-deletion ordering in test, fix trailing comma

* Potential fix for pull request finding

  In _cleanup_monitoring_pods, remote_pod is resolved via find_pod(), which is designed to locate a single matching pod by task-instance labels and can invoke duplicate-pod resolution logic (process_duplicate_label_pods). For KubernetesJobOperator with parallelism > 1, this lookup can return the wrong pod (or trigger duplicate-handling side effects), so post_complete_action() may receive a mismatched remote_pod. Consider using the already-discovered pod’s name/namespace to refresh state (e.g. via hook.get_pod) or just pass remote_pod=pod when you already have the V1Pod object from get_pods().

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Use isinstance(exc, TaskDeferred) instead of brittle string comparison

* Potential fix for pull request finding

  The new unit tests add several mock.MagicMock() instances (pods, jobs, TI, etc.) without spec/autospec, and some patch() usages also create non-spec'd mocks by default. Using autospec=True on patches and create_autospec(...)/MagicMock(spec=...) for key Kubernetes objects helps catch typos/attribute mismatches in these tests and aligns with Airflow’s test hardening guidance.

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Address PR review comments: fix trigger pod_names, on_kill logging, and test assertions

  - triggers/job.py: Always include pod_names/pod_namespace in trigger event regardless of get_logs setting, so execute_complete() can reliably clean up monitoring pods even when get_logs=False
  - operators/job.py: Log unexpected ApiException in on_kill() instead of suppressing all ApiExceptions; remove unused `suppress` import
  - tests/test_job.py: Rewrite test_execute_respects_keep_pod and test_execute_deletes_pod_default to keep process_pod_deletion real and assert on pod_manager.delete_pod; stub hook.get_pod for remote_pod resolution
  - tests/test_job.py: Add regression test for get_logs=False deferrable path

* Fix orphaned test_on_kill_deletes_monitoring_pods method body after accidental deletion of method signature
* Make pod resolution best-effort in execute_complete
* Address remaining KubernetesJobOperator review comments
* Finalize review-comment fixes for KubernetesJobOperator
* Fix remaining KubernetesJobOperator review comments
* Update KubernetesJobOperator docs for action semantics
* Improve KubernetesJobOperator newsfragment readability

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Ville Jyrkkä <vjyrkka@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Fix monitoring-pod leak in KubernetesJobOperator.

closes: #67332

KubernetesJobOperator inherits from KubernetesPodOperator but overrode execute() without ever invoking the parent's pod-cleanup path, so the "monitoring" pods discovered via get_pods() (used to stream logs and XCom while the Job runs) were never deleted. These pods are created by Airflow, not by the V1Job controller, so they have no ownerReferences: neither ttl_seconds_after_finished nor the foreground cascade on on_kill() reaped them. Every task run leaked one pod per Job.
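
To make the leak concrete, here is a minimal sketch of how such pods could be spotted in a pod listing. It assumes the pods are available as plain dicts shaped like the items of `kubectl get pods -o json`. The function name and the sample pod names are hypothetical, and the label `airflow_kpo_in_cluster=True` is the one mentioned in the verification steps of this PR.

```python
from __future__ import annotations

from typing import Any


def find_unowned_monitoring_pods(pods: list[dict[str, Any]]) -> list[str]:
    """Return names of pods that carry the monitoring label but have no ownerReferences."""
    leaked = []
    for pod in pods:
        meta = pod.get("metadata", {})
        labels = meta.get("labels") or {}
        owners = meta.get("ownerReferences") or []
        # Label values are strings in Kubernetes, hence the "True" comparison.
        if labels.get("airflow_kpo_in_cluster") == "True" and not owners:
            leaked.append(meta.get("name", "<unnamed>"))
    return leaked


# Hypothetical sample data: one pod owned by a Job, one unowned monitoring pod.
pods = [
    {
        "metadata": {
            "name": "job-abc-xyz",
            "labels": {"job-name": "job-abc"},
            "ownerReferences": [{"kind": "Job", "name": "job-abc"}],
        }
    },
    {"metadata": {"name": "monitor-1", "labels": {"airflow_kpo_in_cluster": "True"}}},
]

print(find_unowned_monitoring_pods(pods))
```

Only the second pod is reported, because nothing owns it and so neither the TTL controller nor a cascading delete of the Job would remove it.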

This PR makes pod cleanup symmetric with KubernetesPodOperator and honours on_finish_action / on_kill_action for the discovered pods.

Changes

operators/job.py

- execute() and execute_complete() now wrap their work in try/finally and call post_complete_action() for every pod returned by get_pods(). The inherited on_finish_action (delete_pod / delete_succeeded_pod / delete_active_pod / keep_pod) is now respected, matching KubernetesPodOperator semantics.
- on_kill() additionally calls pod_manager.delete_pod() for each monitoring pod, gated by on_kill_action. The Job's foreground cascade does not reach these pods because they have no ownerReferences. Unexpected ApiExceptions are logged instead of silently suppressed.
- execute_complete() resolves monitoring pods once and shares the lookup between the log-retrieval and cleanup paths. Resolution is best-effort: failures in the deferrable resume path no longer break cleanup.
- Per-pod cleanup errors are logged but never mask the in-flight exception, so Job-level failures continue to propagate unchanged.
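
The try/finally pattern described in the bullets above can be sketched in isolation. This is not the operator's code: `run_with_cleanup`, `work`, and `cleanup_one` are hypothetical stand-ins for the operator's body and for a per-pod call such as post_complete_action(). The sketch only illustrates that a cleanup failure is logged while the original exception keeps propagating.

```python
import logging

log = logging.getLogger(__name__)


def run_with_cleanup(work, pods, cleanup_one):
    """Run work(), then call cleanup_one(pod) for every pod, even if work() raised."""
    try:
        return work()
    finally:
        for pod in pods:
            try:
                cleanup_one(pod)
            except Exception:
                # Log and continue: swallowing the cleanup error here means the
                # exception (if any) already propagating from work() is not replaced.
                log.exception("Cleanup failed for pod %s", pod)


# Usage: the job fails AND cleanup of one pod fails.
cleaned = []


def failing_work():
    raise RuntimeError("job failed")


def flaky_cleanup(pod):
    if pod == "pod-a":
        raise ValueError("api error")
    cleaned.append(pod)


caught = None
try:
    run_with_cleanup(failing_work, ["pod-a", "pod-b"], flaky_cleanup)
except RuntimeError as exc:
    caught = str(exc)
```

The caller still sees the Job-level RuntimeError, and the remaining pod is cleaned up even though cleanup of the first one failed.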

triggers/job.py

- The trigger event now always includes pod_names / pod_namespace, regardless of get_logs. This guarantees execute_complete() can reliably clean up monitoring pods even when log streaming is disabled.
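
As an illustration of that contract, the sketch below builds an event and resolves the pods to clean from it. The helper names and every event key other than pod_names / pod_namespace are assumptions made for this example, not the trigger's actual payload.

```python
from __future__ import annotations


def build_event(status: str, pod_names: list[str], pod_namespace: str, get_logs: bool) -> dict:
    """Build a trigger-event-like dict (hypothetical shape)."""
    event = {"status": status, "get_logs": get_logs}
    # Included unconditionally, so cleanup after resume does not depend on get_logs.
    event["pod_names"] = list(pod_names)
    event["pod_namespace"] = pod_namespace
    return event


def pods_to_clean(event: dict) -> list[tuple[str, str]]:
    """Best-effort resolution: return (namespace, name) pairs, or [] if fields are missing."""
    names = event.get("pod_names") or []
    namespace = event.get("pod_namespace")
    if not namespace:
        return []
    return [(namespace, name) for name in names]


event = build_event("success", ["p1", "p2"], "default", get_logs=False)
print(pods_to_clean(event))
```

Returning an empty list for an incomplete event mirrors the best-effort resolution described for execute_complete(): missing data skips cleanup instead of raising.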

docs/operators.rst

- New section documenting the cleanup contract: which pods are affected, the meaning of each on_finish_action value for monitoring pods, and the on_kill_action behaviour.

Tests

- Coverage for each on_finish_action value (delete_pod, delete_succeeded_pod, delete_active_pod, keep_pod) on both success and failure paths.
- Coverage for on_kill_action (delete_pod / keep_pod).
- Regression test for the deferrable get_logs=False path.
- Mocks use spec / autospec to catch attribute typos against the real kubernetes client surface.

Backwards compatibility

Default on_finish_action is unchanged (delete_pod), so existing deployments will start reclaiming the leaked monitoring pods automatically. Users who relied on monitoring pods surviving the task (e.g. for offline log inspection) can opt in explicitly by passing on_finish_action="keep_pod".

How to verify

1. Run a task using KubernetesJobOperator with default settings.
2. After it finishes, both the V1Job's child pod and the monitoring pod (label airflow_kpo_in_cluster=True, no ownerReferences) should be gone.
3. Re-run with on_finish_action="delete_succeeded_pod" and a failing command: the monitoring pod should remain for forensics.
4. Re-run with on_finish_action="keep_pod": both pods should remain.
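
The expected outcomes of the steps above can be summarised as a small decision function. This is a sketch of the documented behaviour, not the operator's implementation. `should_delete` is a hypothetical name, and delete_active_pod is deliberately left out because its semantics are not spelled out in this description.

```python
def should_delete(on_finish_action: str, pod_succeeded: bool) -> bool:
    """Return True if a monitoring pod is expected to be deleted after the task."""
    if on_finish_action == "delete_pod":
        return True  # always reclaimed (the default)
    if on_finish_action == "delete_succeeded_pod":
        return pod_succeeded  # failed pods are kept for inspection
    if on_finish_action == "keep_pod":
        return False  # never reclaimed
    raise ValueError(f"action not modelled in this sketch: {on_finish_action!r}")


# Expected outcome for each (action, pod_succeeded) pair in the verification steps.
for action, succeeded in [
    ("delete_pod", True),
    ("delete_succeeded_pod", False),
    ("keep_pod", True),
]:
    print(action, succeeded, "->", "deleted" if should_delete(action, succeeded) else "kept")
```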
Was generative AI tooling used to co-author this PR?
Generated-by: GitHub Copilot (Claude Opus 4.7) following the guidelines