KubernetesExecutor: scope periodic completed-pod adoption to dead schedulers by potiuk · Pull Request #66400 · apache/airflow

potiuk · 2026-05-05T09:22:50Z

Summary

Fix for #66396. PR #61839
(cncf-kubernetes 10.15.0) added a periodic call to _adopt_completed_pods
from inside KubernetesExecutor.sync(), gated by
[scheduler] orphaned_tasks_check_interval (default 300 s). The query
selects every Succeeded pod whose airflow-worker label is not the
current scheduler's label and PATCHes it with the current scheduler's label
so its KubernetesJobWatcher will see the change and DELETE the pod.

With multi-scheduler deployments that caused thrashing — every interval
tick each scheduler relabeled every other scheduler's completed pods,
fighting over ownership and burning kube-API + watcher cycles. See
#66396 for the full
walkthrough and a real user report on the mailing list (3 schedulers,
Airflow 3.2.1, 1000+ pods/min, tasks stalling in scheduled / queued,
delete_worker_pods=False not helping).

What changed

New helper _alive_other_scheduler_job_ids queries the Job table
for SchedulerJobs whose state == RUNNING and whose
latest_heartbeat is within
[scheduler] scheduler_health_check_threshold — matching the
alive-scheduler definition already used by
SchedulerJobRunner.adopt_or_reset_orphaned_tasks. Returns an empty
set on any DB error so the caller falls back to the pre-k8s executor - ensure pods cleaned up #61839
"exclude self only" selector — a transient DB issue must not break
completed-pod cleanup.
_adopt_completed_pods now builds the label selector to exclude
self + every alive sibling using K8s set-based syntax:
airflow-worker notin (a,b,c). With one scheduler the selector is
identical to the pre-fix equality form, so single-scheduler
deployments see no behavior change.

Why this preserves the original #61839 goal

#61839 closed #57553:
"completed pods leaking after scheduler restart". The fix here keeps
that working — when a scheduler dies, its Job row's
latest_heartbeat ages out of the threshold, the helper stops
including its ID in the alive set, and the surviving schedulers will
adopt and clean up its completed pods on the next interval tick. The
only behavior removed is the steady-state cross-scheduler relabeling
that caused the thrash.

Test plan

New test_adopt_completed_pods_excludes_alive_siblings: with
_alive_other_scheduler_job_ids mocked to {7, 9} and
scheduler_job_id = "5", the selector emitted is
kubernetes_executor=True,airflow-worker notin (5,7,9),airflow_executor_done!=True.
New test_adopt_completed_pods_single_scheduler_unchanged: with
the helper returning an empty set, the selector is identical to
the pre-fix string
(kubernetes_executor=True,airflow-worker!=modified,airflow_executor_done!=True).
Existing test_adopt_completed_pods and
test_adopt_completed_pods_api_exception still pass unchanged
(the helper's fallback returns an empty set when
scheduler_job_id is the tests' non-numeric string).
Smoke test in CI on the AMD pipeline.
(Manual) cluster-mode verification on a 3-scheduler
KubernetesExecutor deployment — drop PATCH traffic should be
visible immediately on the kube API server metrics. (Anyone
affected by KubernetesExecutor: multi-scheduler completed-pod thrash (10.15.0+) #66396 can apply this patch and report back; happy
to coordinate.)

closes: #66396

Was generative AI tooling used to co-author this PR?

Yes — Claude Code (Opus 4.7)

Generated-by: Claude Code (Opus 4.7) following the guidelines

…edulers PR apache#61839 (cncf-kubernetes 10.15.0) added a periodic call to `_adopt_completed_pods` from inside `KubernetesExecutor.sync()`, gated by `[scheduler] orphaned_tasks_check_interval` (default 300 s). The query selects every Succeeded pod whose `airflow-worker` label is not the current scheduler's label and PATCHes it with the current scheduler's label so its KubernetesJobWatcher will see the change and DELETE the pod. With multi-scheduler deployments that caused thrashing — every `orphaned_tasks_check_interval` each scheduler iterated over every Succeeded pod that did not carry its own label and PATCHed it. Schedulers fought each other: * Scheduler A relabels every Succeeded pod owned by B and C → A's watcher DELETEs them. * Scheduler B does the same a few seconds later → relabels A's freshly patched pods to B → B's watcher takes over. * Scheduler C the same. At steady state with high pod churn this manifested as heavy PATCH /api/v1/namespaces/.../pods/... traffic, expensive `_list_pods` calls on every interval tick (apache#35599 already documents this is 15-30 s with 500 pods), and tasks stalling in `scheduled` / `queued` because every scheduler loop was burning seconds inside `_list_pods` and `patch_namespaced_pod` instead of doing useful scheduling. Setting `delete_worker_pods=False` did NOT help — the periodic adoption code path doesn't gate on it; it goes through the watcher's delete. Fix: scope the periodic adoption to pods owned by no-longer-alive schedulers. New helper `_alive_other_scheduler_job_ids` queries the `Job` table for SchedulerJobs whose `state == RUNNING` and whose `latest_heartbeat` is within `[scheduler] scheduler_health_check_threshold` (matching the alive-scheduler definition already used by `SchedulerJobRunner.adopt_or_reset_orphaned_tasks`). The label selector in `_adopt_completed_pods` is then built to exclude self + every alive sibling using K8s set-based syntax `airflow-worker notin (a,b,c)`: * Single-scheduler deployment: no behavior change. Helper returns empty set, selector falls back to the original equality form `airflow-worker!=<self_label>`. * Multi-scheduler deployment: each scheduler only adopts pods whose owning scheduler is gone — preserving the original goal of apache#61839 (cleanup after a scheduler restart) without the thrash. If the DB query fails, the helper returns an empty set so the caller falls back to the pre-apache#61839 "exclude self only" selector — a transient DB issue must not break completed-pod cleanup. Two new unit tests cover the multi-scheduler set-based selector and confirm the single-scheduler equality form is unchanged. Existing `test_adopt_completed_pods` and `test_adopt_completed_pods_api_exception` keep their original assertions because the new helper falls back to an empty set when `scheduler_job_id` is the test's non-numeric string. Closes: apache#66396

vincbeck

Approving but I am not a Kubernetes expert. Ideally a second approval would be appreciated :)

@jscheffl

Address @jscheffl's first review point on apache#66400: previously the `_alive_other_scheduler_job_ids` helper called `.all()` to materialize the SQLAlchemy scalar result into a `list[int]` before passing it to `set(...)`. Switch to a set-comprehension over the scalar cursor so no intermediate list is built. The functional behavior is identical; this just keeps the in-flight memory footprint flat regardless of how many sibling schedulers are alive at the moment of the query. In practice the query returns `int`s of a tens-of-bytes-per-row class for a handful of rows in any realistic Airflow HA deployment, so the saving is small — but the cleaner pattern matches the review feedback exactly.

potiuk · 2026-05-05T18:30:42Z

Pushed 69070f0e9c to address point (1) — _alive_other_scheduler_job_ids now feeds the scalar cursor straight into a set comprehension, no intermediate list[int] materialised.

On the other concerns:

FOR UPDATE … SKIP LOCKED — I don't think it fits here. The query is a pure read of "who is alive right now" and is racing against every other scheduler's heartbeat write on its own Job row. With FOR UPDATE we'd block on those heartbeats; with SKIP LOCKED we'd routinely miss siblings whose row is in the middle of a heartbeat update and falsely classify them as dead — which would relabel their pods and bring the thrashing back. So adding the lock would actively make the selector wrong rather than safer.

The race you're describing — two healthy schedulers both adopting the same dead-scheduler pod at the same kube-API tick — is real, but it's at the K8s layer, not at the DB. Worst case: scheduler A patches the pod with airflow-worker=A, scheduler B patches it with airflow-worker=B, last write wins, the winning scheduler's watcher deletes the pod. One redundant patch_namespaced_pod call per pod in that exact-collision window — not a correctness problem and orders of magnitude smaller than the per-tick thrash this PR fixes. Truly serialising adoption between alive schedulers would need K8s-level coordination (e.g. a Lease-based leader election around _adopt_completed_pods), which is well beyond the scope of fixing the regression #61839 introduced. Happy to file a follow-up issue if you want to pursue that.

Limit / batching — the K8s pod-list label selector goes in the URL query string. With a generous 100 alive schedulers (typical HA runs 2–5; >10 is already unusual) the resulting notin (…) clause is ~1.5 KB; the kube-apiserver URL-length ceiling is in the hundreds-of-KB range and metav1.LabelSelector itself doesn't cap MatchExpressions value cardinality. So at any operationally plausible scheduler count we're 100×–1000× under the limit and don't need to batch the K8s side. If you'd still like a defensive cap with a logged warning above some threshold (say 64 alive siblings) I can add that — it'd be a sanity guard rather than a correctness fix.

Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

…s/executors/kubernetes_executor.py Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>

boring-cyborg Bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels May 5, 2026

potiuk marked this pull request as ready for review May 5, 2026 09:23

potiuk requested review from hussein-awala, jedcunningham and jscheffl as code owners May 5, 2026 09:23

vincbeck approved these changes May 5, 2026

View reviewed changes

jscheffl reviewed May 5, 2026

View reviewed changes

Comment thread ...iders/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py Outdated

jscheffl approved these changes May 5, 2026

View reviewed changes

Comment thread ...iders/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py Outdated

Update providers/cncf/kubernetes/src/airflow/providers/cncf/kubernete…

8f35386

…s/executors/kubernetes_executor.py Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>

potiuk merged commit 8464275 into apache:main May 5, 2026
7 checks passed

potiuk deleted the fix-k8s-multi-scheduler-thrash branch May 5, 2026 20:19

jscheffl mentioned this pull request May 19, 2026

Status of testing Providers that were prepared on May 19, 2026 #67213

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KubernetesExecutor: scope periodic completed-pod adoption to dead schedulers#66400

KubernetesExecutor: scope periodic completed-pod adoption to dead schedulers#66400
potiuk merged 3 commits into
apache:mainfrom
potiuk:fix-k8s-multi-scheduler-thrash

potiuk commented May 5, 2026

Uh oh!

vincbeck left a comment

Uh oh!

Uh oh!

potiuk commented May 5, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

potiuk commented May 5, 2026

Summary

What changed

Why this preserves the original #61839 goal

Test plan

Was generative AI tooling used to co-author this PR?

Uh oh!

vincbeck left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

potiuk commented May 5, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants