Fix KubernetesJobTrigger hang for parallelism > completions case (#64867)#65058

Open
holmuk wants to merge 1 commit into apache:main from holmuk:bugfix/kubernetes-job-task-competition

Conversation

@holmuk
Contributor

@holmuk holmuk commented Apr 11, 2026

Closes #64867


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    Cursor

This PR resolves the hanging Running state issue in KubernetesJobOperator / KubernetesJobTrigger for deferrable=True / do_xcom_push=True.

Problem description

The trigger waits for container completion for every pod name from a precomputed snapshot (pod_names) before checking the final Job status. That snapshot is built from pod discovery tied to parallelism, not to actual successful completions.

Example (parallelism=2, completions=1):

  • Airflow creates a Job
  • Kubernetes starts 2 pods
  • One pod succeeds
  • Job becomes Complete (completions=1 reached)
  • The second pod may never reach the expected terminal state
  • KubernetesJobTrigger keeps waiting on the second pod and does not reach Job-status evaluation, so the task can remain Running/Deferred forever.
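The two completion conditions can be contrasted in a small simulation. This is purely illustrative: the dictionaries below stand in for pod phases and Job conditions; none of the names are the provider's real objects.

```python
# Hypothetical snapshot for parallelism=2, completions=1: one pod succeeded,
# the second never reaches a terminal phase even though the Job is Complete.
pod_phases = {"job-pod-1": "Succeeded", "job-pod-2": "Running"}
job_conditions = {"Complete": True, "Failed": False}

TERMINAL_PHASES = {"Succeeded", "Failed"}

# Old pod-first condition: every snapshot pod must be terminal -> never True here.
pod_first_done = all(phase in TERMINAL_PHASES for phase in pod_phases.values())

# Job-first condition proposed by this PR: rely on the Job's terminal status.
job_first_done = job_conditions["Complete"] or job_conditions["Failed"]
```

With the pod-first condition the trigger waits forever on `job-pod-2`; with the job-first condition it can finalize as soon as the Job reports `Complete`.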

Proposed fix: Task completion should be driven by Job terminal status (Complete / Failed), which already reflects completions:

  • Make Job status the primary completion condition.
  • Collect XCom/logs only as best-effort from pods that actually finished and are still readable.
  • Do not block task finalization on missing/non-terminal pods from the initial snapshot.

What does this PR do?

Updates logic for KubernetesJobTrigger:

  • The waiting flow is now job-first: completion is driven by final Job status, not by requiring all pods from the initial snapshot to finish.
  • XCom collection is now best-effort: results are collected only from pods that are available and successfully processed.
  • 404 for missing/deleted pods is handled as skip instead of failing the trigger.
  • The previous unbounded pod-first waits were removed: container waits are now bounded and periodically re-check whether the Job has already completed.
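The bounded, job-first wait described above can be sketched as follows. This is a minimal asyncio sketch, not the provider's implementation; `job_is_done` and `container_is_terminal` are hypothetical stand-ins for the real status checks.

```python
import asyncio

async def wait_container_or_job_done(pod_name, job_is_done, container_is_terminal,
                                     poll_interval=0.01):
    """Wait for a pod's container to finish, but stop as soon as the Job is done."""
    while True:
        if await container_is_terminal(pod_name):
            return "container_done"
        if await job_is_done():
            return "job_done"  # stop waiting on this pod; the Job already completed
        await asyncio.sleep(poll_interval)

async def _demo():
    async def job_is_done():
        return True   # the Job reached Complete
    async def container_is_terminal(pod_name):
        return False  # this pod never terminates
    return await wait_container_or_job_done("job-pod-2", job_is_done, container_is_terminal)

outcome = asyncio.run(_demo())
```

Because the Job status is re-checked on every tick, a pod that never terminates can no longer block task finalization.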

Regression tests for #64867

  • Trigger regression tests (triggers/test_job.py)

    • test_run_completes_when_job_is_done_even_if_some_snapshot_pods_never_complete: Verifies that the trigger does not hang when a pod from the initial snapshot never reaches a terminal state after the Job is already complete.

    • test_run_skips_deleted_snapshot_pod_and_completes_when_job_is_done: Verifies that the trigger handles stale snapshot pods gracefully by skipping 404 Not Found pods and still finishing successfully with the available XCom results.

    • test_run_collects_later_pod_xcom_best_effort_after_job_done: Verifies post-completion best-effort behavior: once the Job is complete, the trigger continues processing the remaining snapshot pods, skips per-pod extraction failures, and still returns XCom from the pods that can be read.

  • Operator regression test (operators/test_job.py)

    • test_execute_complete_supports_partial_xcom_results: Verifies that execute_complete correctly handles partial xcom_result payloads (fewer XCom entries than the initial pod snapshot), which is expected in parallelism > completions scenarios.

Additional tests for new code

  • test_wait_until_container_state_or_job_done_does_not_restart_wait_task: Copilot pointed out that a naive implementation of the waiting loop may not work as expected on slow clusters because the wait coroutine is constantly recreated and retried. The test verifies that wait_method is not recreated on every polling tick on a slow cluster.

Behavior change

  • Task finalization is now Job-driven.
  • xcom_result may be partial (fewer entries than initial pod_names) and this is expected.
  • Missing pods (404) do not fail task completion.

Risks

  • With very small poll_interval values, the new bounded wait loop may generate extra timeout/cancel/retry iterations while waiting for pod container states. This does not fail the task by itself (it is expected retry behavior), but it can increase polling overhead and log noise until the Job reaches a terminal state.
  • Best-effort post-job XCom no longer fails the task on per-pod extraction errors (e.g. RBAC/network). To keep this observable, the trigger now emits warnings and a summary with counters (succeeded, skipped_missing, timed_out, failed_other).
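The best-effort collection with outcome counters can be sketched in a few lines. The function name and the error type used to stand in for a 404 are illustrative; the outcome labels follow the PR description.

```python
from collections import Counter

def collect_best_effort(pod_names, extract):
    """Collect XCom per pod; tally failures in counters instead of raising."""
    results, outcomes = {}, Counter()
    for name in pod_names:
        try:
            results[name] = extract(name)
            outcomes["succeeded"] += 1
        except KeyError:  # stands in for a 404 "pod not found" from the API
            outcomes["skipped_missing"] += 1
    return results, outcomes

xcoms = {"pod-1": {"answer": 42}}  # pod-2 was already deleted
results, summary = collect_best_effort(["pod-1", "pod-2"], lambda p: xcoms[p])
```

The summary counters keep the skipped pods observable in the logs even though they no longer fail the task.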

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a deferrable KubernetesJobOperator / KubernetesJobTrigger hang when Kubernetes parallelism > completions by making trigger completion primarily driven by the Job’s terminal state (Complete/Failed) rather than waiting for every pod from an initial “snapshot” to reach a terminal state. It also adds regression tests to cover the reported scenario (#64867).

Changes:

  • Reworks KubernetesJobTrigger.run() to wait for Job completion concurrently and collect XCom from pods on a best-effort basis (skipping missing/deleted pods).
  • Adds regression tests to ensure the trigger doesn’t hang when some snapshot pods never complete or are deleted.
  • Adds an operator regression test verifying execute_complete tolerates partial XCom results.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Changed files:

  • providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/triggers/job.py: Changes trigger control flow to be job-first and makes XCom extraction best-effort without blocking task finalization.
  • providers/cncf/kubernetes/tests/unit/cncf/kubernetes/triggers/test_job.py: Adds async regression tests for parallelism > completions pod snapshot edge cases and updates the job polling assertion.
  • providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_job.py: Adds a regression test ensuring execute_complete handles partial XCom payload lists.

@holmuk force-pushed the bugfix/kubernetes-job-task-competition branch from 475be15 to 55ab20b on April 11, 2026, 18:36
Contributor

@jscheffl jscheffl left a comment


Looks good to me but I am not really an expert with K8s Jobs, so I have a hard time judging details of the fix. Looking for a second maintainer review.

Contributor

@Nataneljpwd Nataneljpwd left a comment


Seems a little too complex; some improvements can be made to simplify it.

Comment on lines +169 to +172
if not job_task.done():
    job_task.cancel()
    with suppress(asyncio.CancelledError):
        await job_task
Contributor


This looks a little weird: we retry in the finally block no matter what, even if job_task threw an exception or hasn't finished yet. But once an API request has been sent, it cannot be cancelled, so I can see a case where the request was sent, the Job was created, the task was cancelled, and then we retry creating the Job, either failing on the unique-name constraint or running the Job twice.

Retrying in the finally block generally looks weird; I would suggest either handling the exception as intended or putting the try/except only around the XCom collection.

Contributor Author


job_task doesn't create a new Job in k8s; it only checks the status of an existing Job. In finally, if job_task is still running, we cancel it to avoid leaving a background coroutine alive. The second await job_task only waits for the cancellation to complete.
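The cancel-then-await pattern under discussion can be exercised in isolation. This is a self-contained asyncio sketch, not the trigger's actual code; `watch_job_status` stands in for the long-running status poll.

```python
import asyncio
from contextlib import suppress

async def watch_job_status():
    await asyncio.sleep(3600)  # stands in for a long-running status poll

async def main():
    job_task = asyncio.create_task(watch_job_status())
    await asyncio.sleep(0)     # let the task start
    if not job_task.done():
        job_task.cancel()
        with suppress(asyncio.CancelledError):
            await job_task     # waits only for the cancellation to complete
    return job_task.cancelled()

cancelled = asyncio.run(main())
```

The second await does not restart any work: it drains the task's pending CancelledError so no background coroutine outlives the trigger.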


if wait_task in done:
    try:
        await wait_task
Contributor


Why do we await a task which is done?

Contributor Author


done() means the task has finished, but it doesn't raise the task's exception or return its result. We await the task to retrieve the final result or exception.
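A standalone illustration of this point (not the trigger's code): after asyncio.wait reports a task as done, its exception is still stored on the task and only surfaces when the task is awaited.

```python
import asyncio

async def failing():
    raise ValueError("boom")

async def main():
    task = asyncio.create_task(failing())
    done, _ = await asyncio.wait({task})
    assert task.done()   # finished, but its exception has not been raised yet
    try:
        await task       # awaiting a done task re-raises the stored exception
    except ValueError as exc:
        return str(exc)

msg = asyncio.run(main())
```

Skipping the await would silently swallow the exception (and log a "Task exception was never retrieved" warning).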

Comment on lines +358 to +369
async with semaphore:
    try:
        return await asyncio.wait_for(
            self._extract_xcom_for_pod_best_effort(pod_name=pod_name),
            timeout=per_pod_timeout,
        )
    except asyncio.TimeoutError:
        self.log.warning(
            "Timed out extracting XCom from pod '%s' after job completion; skipping.",
            pod_name,
        )
        return PodXComAttempt(pod_name=pod_name, outcome="timeout")
Contributor


Why is a semaphore used here?

Contributor Author


To limit the number of concurrent XCom extraction tasks. Otherwise we could put a heavy load on the API server when the number of pods is high.
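The semaphore's effect can be shown with a self-contained sketch (illustrative names; the concurrency cap of 2 is arbitrary): many extractions are launched at once, but only a bounded number hold the semaphore concurrently.

```python
import asyncio

async def extract_xcom(pod_name, sem, in_flight, peak):
    async with sem:                  # only `sem`'s limit may run this section at once
        in_flight[0] += 1
        peak[0] = max(peak[0], in_flight[0])
        await asyncio.sleep(0.01)    # stands in for the API call
        in_flight[0] -= 1
        return pod_name

async def main():
    sem = asyncio.Semaphore(2)       # cap concurrent extractions at 2
    in_flight, peak = [0], [0]
    pods = [f"pod-{i}" for i in range(6)]
    results = await asyncio.gather(*(extract_xcom(p, sem, in_flight, peak) for p in pods))
    return results, peak[0]

results, peak_concurrency = asyncio.run(main())
```

All six pods are processed, but the API never sees more than two in-flight reads at a time.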

    job_task=job_task,
)
if completion_outcome == "job_done":
    post_job_pod_names = self.pod_names[pod_index:]
Contributor


Just use the pod_name variable

Contributor Author


pod_name is the pod on which we observed the job_done status, and we are going to extract XCom from this pod and all the pods that follow it in the snapshot. That's why pod_name alone isn't enough.

@holmuk force-pushed the bugfix/kubernetes-job-task-competition branch from 55ab20b to b8fb3ea on April 13, 2026, 08:09

Labels

area:providers, provider:cncf-kubernetes (Kubernetes (k8s) provider related issues)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KubernetesJobOperator task stuck in Running state when parallelism > completions

4 participants