Fix KubernetesJobTrigger hang for parallelism > completions case (#64867)#65058

Open
holmuk wants to merge 1 commit into apache:main from holmuk:bugfix/kubernetes-job-task-competition

Conversation

@holmuk
Contributor

@holmuk holmuk commented Apr 11, 2026

Closes #64867


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    Cursor

This PR resolves the hanging Running state issue in KubernetesJobOperator / KubernetesJobTrigger for deferrable=True / do_xcom_push=True.

Problem description

The trigger waits for container completion for every pod name from a precomputed snapshot (pod_names) before checking the final Job status. That snapshot is built from pod discovery tied to parallelism, not to actual successful completions.

Example (parallelism=2, completions=1):

  • Airflow creates a Job
  • Kubernetes starts 2 pods
  • One pod succeeds
  • Job becomes Complete (completions=1 reached)
  • The second pod may never reach the expected terminal state
  • KubernetesJobTrigger keeps waiting on the second pod and does not reach Job-status evaluation, so the task can remain Running/Deferred forever.
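The two completion conditions can be contrasted in a small simulation. This is purely illustrative: the dictionaries below stand in for pod phases and Job conditions; none of the names are the provider's real objects.

```python
# Hypothetical snapshot for parallelism=2, completions=1: one pod succeeded,
# the second never reaches a terminal phase even though the Job is Complete.
pod_phases = {"job-pod-1": "Succeeded", "job-pod-2": "Running"}
job_conditions = {"Complete": True, "Failed": False}

TERMINAL_PHASES = {"Succeeded", "Failed"}

# Old pod-first condition: every snapshot pod must be terminal -> never True here.
pod_first_done = all(phase in TERMINAL_PHASES for phase in pod_phases.values())

# Job-first condition proposed by this PR: rely on the Job's terminal status.
job_first_done = job_conditions["Complete"] or job_conditions["Failed"]
```

With the pod-first condition the trigger waits forever on `job-pod-2`; with the job-first condition it can finalize as soon as the Job reports `Complete`.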

Proposed fix: Task completion should be driven by Job terminal status (Complete / Failed), which already reflects completions:

  • Make Job status the primary completion condition.
  • Collect XCom/logs only as best-effort from pods that actually finished and are still readable.
  • Do not block task finalization on missing/non-terminal pods from the initial snapshot.

What does this PR do?

Updates logic for KubernetesJobTrigger:

  • The waiting flow is now job-first: completion is driven by final Job status, not by requiring all pods from the initial snapshot to finish.
  • XCom collection is now best-effort: results are collected only from pods that are available and successfully processed.
  • 404 for missing/deleted pods is handled as skip instead of failing the trigger.
  • The previous unbounded pod-first waits were removed: container waits are now bounded and periodically re-check whether the Job has already completed.
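The bounded, job-first wait described above can be sketched as follows. This is a minimal asyncio sketch, not the provider's implementation; `job_is_done` and `container_is_terminal` are hypothetical stand-ins for the real status checks.

```python
import asyncio

async def wait_container_or_job_done(pod_name, job_is_done, container_is_terminal,
                                     poll_interval=0.01):
    """Wait for a pod's container to finish, but stop as soon as the Job is done."""
    while True:
        if await container_is_terminal(pod_name):
            return "container_done"
        if await job_is_done():
            return "job_done"  # stop waiting on this pod; the Job already completed
        await asyncio.sleep(poll_interval)

async def _demo():
    async def job_is_done():
        return True   # the Job reached Complete
    async def container_is_terminal(pod_name):
        return False  # this pod never terminates
    return await wait_container_or_job_done("job-pod-2", job_is_done, container_is_terminal)

outcome = asyncio.run(_demo())
```

Because the Job status is re-checked on every tick, a pod that never terminates can no longer block task finalization.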

Regression tests for #64867

  • Trigger regression tests (triggers/test_job.py)

    • test_run_completes_when_job_is_done_even_if_some_snapshot_pods_never_complete: Verifies that the trigger does not hang when a pod from the initial snapshot never reaches a terminal state after the Job is already complete.

    • test_run_skips_deleted_snapshot_pod_and_completes_when_job_is_done: Verifies that the trigger handles stale snapshot pods gracefully by skipping 404 Not Found pods and still finishing successfully with the available XCom results.

    • test_run_collects_later_pod_xcom_best_effort_after_job_done: Verifies post-completion best-effort behavior: once the Job is complete, the trigger continues processing the remaining snapshot pods, skips per-pod extraction failures, and still returns XCom from the pods that can be read.

  • Operator regression test (operators/test_job.py)

    • test_execute_complete_supports_partial_xcom_results: Verifies that execute_complete correctly handles partial xcom_result payloads (fewer XCom entries than the initial pod snapshot), which is expected in parallelism > completions scenarios.

Additional tests for new code

  • test_wait_until_container_state_or_job_done_does_not_restart_wait_task: Copilot pointed out that a naive implementation of the waiting loop may not work as expected on slow clusters because the wait coroutine is constantly recreated and retried. The test verifies that wait_method is not recreated on every polling tick on a slow cluster.

Behavior change

  • Task finalization is now Job-driven.
  • xcom_result may be partial (fewer entries than initial pod_names) and this is expected.
  • Missing pods (404) do not fail task completion.

Risks

  • With very small poll_interval values, the new bounded wait loop may generate extra timeout/cancel/retry iterations while waiting for pod container states. This does not fail the task by itself (it is expected retry behavior), but it can increase polling overhead and log noise until the Job reaches a terminal state.
  • Best-effort post-job XCom no longer fails the task on per-pod extraction errors (e.g. RBAC/network). To keep this observable, the trigger now emits warnings and a summary with counters (succeeded, skipped_missing, timed_out, failed_other).
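The best-effort collection with outcome counters can be sketched in a few lines. The function name and the error type used to stand in for a 404 are illustrative; the outcome labels follow the PR description.

```python
from collections import Counter

def collect_best_effort(pod_names, extract):
    """Collect XCom per pod; tally failures in counters instead of raising."""
    results, outcomes = {}, Counter()
    for name in pod_names:
        try:
            results[name] = extract(name)
            outcomes["succeeded"] += 1
        except KeyError:  # stands in for a 404 "pod not found" from the API
            outcomes["skipped_missing"] += 1
    return results, outcomes

xcoms = {"pod-1": {"answer": 42}}  # pod-2 was already deleted
results, summary = collect_best_effort(["pod-1", "pod-2"], lambda p: xcoms[p])
```

The summary counters keep the skipped pods observable in the logs even though they no longer fail the task.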

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a deferrable KubernetesJobOperator / KubernetesJobTrigger hang when Kubernetes parallelism > completions by making trigger completion primarily driven by the Job’s terminal state (Complete/Failed) rather than waiting for every pod from an initial “snapshot” to reach a terminal state. It also adds regression tests to cover the reported scenario (#64867).

Changes:

  • Reworks KubernetesJobTrigger.run() to wait for Job completion concurrently and collect XCom from pods on a best-effort basis (skipping missing/deleted pods).
  • Adds regression tests to ensure the trigger doesn’t hang when some snapshot pods never complete or are deleted.
  • Adds an operator regression test verifying execute_complete tolerates partial XCom results.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Changed files:

  • providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/triggers/job.py: Changes trigger control flow to be job-first and makes XCom extraction best-effort without blocking task finalization.
  • providers/cncf/kubernetes/tests/unit/cncf/kubernetes/triggers/test_job.py: Adds async regression tests for parallelism > completions pod snapshot edge cases and updates the job polling assertion.
  • providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_job.py: Adds a regression test ensuring execute_complete handles partial XCom payload lists.

@holmuk force-pushed the bugfix/kubernetes-job-task-competition branch from 475be15 to 55ab20b on April 11, 2026, 18:36
Contributor

@jscheffl jscheffl left a comment


Looks good to me but I am not really an expert with K8s Jobs, so I have a hard time judging details of the fix. Looking for a second maintainer review.

Contributor

@Nataneljpwd Nataneljpwd left a comment


Seems a little too complex; some improvements can be made to simplify it.

Comment on lines +169 to +172
if not job_task.done():
    job_task.cancel()
    with suppress(asyncio.CancelledError):
        await job_task
Contributor


This looks a little weird: we retry in the finally block no matter what, even if job_task threw an exception or hasn't finished yet. But once an API request has been sent, it cannot be cancelled, so I can see a case where the request was sent, the Job was created, the task was cancelled, and then we retry creating the Job, either failing on the unique-name constraint or running the Job twice.

Retrying in the finally block generally looks weird; I would suggest either handling the exception as intended or putting the try/except only around the XCom collection.

Contributor Author


job_task doesn't create a new Job in k8s; it only checks the status of an existing Job. In finally, if job_task is still running, we cancel it to avoid leaving a background coroutine alive. The second await job_task only waits for the cancellation to complete.
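The cancel-then-await pattern under discussion can be exercised in isolation. This is a self-contained asyncio sketch, not the trigger's actual code; `watch_job_status` stands in for the long-running status poll.

```python
import asyncio
from contextlib import suppress

async def watch_job_status():
    await asyncio.sleep(3600)  # stands in for a long-running status poll

async def main():
    job_task = asyncio.create_task(watch_job_status())
    await asyncio.sleep(0)     # let the task start
    if not job_task.done():
        job_task.cancel()
        with suppress(asyncio.CancelledError):
            await job_task     # waits only for the cancellation to complete
    return job_task.cancelled()

cancelled = asyncio.run(main())
```

The second await does not restart any work: it drains the task's pending CancelledError so no background coroutine outlives the trigger.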


if wait_task in done:
    try:
        await wait_task
Contributor


Why do we await a task which is done?

Contributor Author


done() means the task has finished, but it doesn't raise the task's exception or return its result. We await the task to retrieve the final result or exception.
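A standalone illustration of this point (not the trigger's code): after asyncio.wait reports a task as done, its exception is still stored on the task and only surfaces when the task is awaited.

```python
import asyncio

async def failing():
    raise ValueError("boom")

async def main():
    task = asyncio.create_task(failing())
    done, _ = await asyncio.wait({task})
    assert task.done()   # finished, but its exception has not been raised yet
    try:
        await task       # awaiting a done task re-raises the stored exception
    except ValueError as exc:
        return str(exc)

msg = asyncio.run(main())
```

Skipping the await would silently swallow the exception (and log a "Task exception was never retrieved" warning).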

Comment on lines +358 to +369
async with semaphore:
    try:
        return await asyncio.wait_for(
            self._extract_xcom_for_pod_best_effort(pod_name=pod_name),
            timeout=per_pod_timeout,
        )
    except asyncio.TimeoutError:
        self.log.warning(
            "Timed out extracting XCom from pod '%s' after job completion; skipping.",
            pod_name,
        )
        return PodXComAttempt(pod_name=pod_name, outcome="timeout")
Contributor


Why is a semaphore used here?

Contributor Author


To limit the number of concurrent XCom extraction tasks. Otherwise we could put a heavy load on the API server when the number of pods is high.
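The semaphore's effect can be shown with a self-contained sketch (illustrative names; the concurrency cap of 2 is arbitrary): many extractions are launched at once, but only a bounded number hold the semaphore concurrently.

```python
import asyncio

async def extract_xcom(pod_name, sem, in_flight, peak):
    async with sem:                  # only `sem`'s limit may run this section at once
        in_flight[0] += 1
        peak[0] = max(peak[0], in_flight[0])
        await asyncio.sleep(0.01)    # stands in for the API call
        in_flight[0] -= 1
        return pod_name

async def main():
    sem = asyncio.Semaphore(2)       # cap concurrent extractions at 2
    in_flight, peak = [0], [0]
    pods = [f"pod-{i}" for i in range(6)]
    results = await asyncio.gather(*(extract_xcom(p, sem, in_flight, peak) for p in pods))
    return results, peak[0]

results, peak_concurrency = asyncio.run(main())
```

All six pods are processed, but the API never sees more than two in-flight reads at a time.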

    job_task=job_task,
)
if completion_outcome == "job_done":
    post_job_pod_names = self.pod_names[pod_index:]
Contributor


Just use the pod_name variable

Contributor Author


pod_name is the pod on which we observed the job_done status, and we are going to extract XCom from this pod and all the pods that follow it in the snapshot. That's why pod_name alone isn't enough.

@holmuk force-pushed the bugfix/kubernetes-job-task-competition branch from 55ab20b to b8fb3ea on April 13, 2026, 08:09

Labels

area:providers, provider:cncf-kubernetes (Kubernetes (k8s) provider related issues)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KubernetesJobOperator task stuck in Running state when parallelism > completions

4 participants