Skip to content

KubernetesExecutor: scope periodic completed-pod adoption to dead schedulers#66400

Merged
potiuk merged 3 commits into
apache:mainfrom
potiuk:fix-k8s-multi-scheduler-thrash
May 5, 2026
Merged

KubernetesExecutor: scope periodic completed-pod adoption to dead schedulers#66400
potiuk merged 3 commits into
apache:mainfrom
potiuk:fix-k8s-multi-scheduler-thrash

Conversation

@potiuk
Copy link
Copy Markdown
Member

@potiuk potiuk commented May 5, 2026

Summary

Fix for #66396. PR #61839
(cncf-kubernetes 10.15.0) added a periodic call to _adopt_completed_pods
from inside KubernetesExecutor.sync(), gated by
[scheduler] orphaned_tasks_check_interval (default 300 s). The query
selects every Succeeded pod whose airflow-worker label is not the
current scheduler's label and PATCHes it with the current scheduler's label
so its KubernetesJobWatcher will see the change and DELETE the pod.

With multi-scheduler deployments that caused thrashing — every interval
tick each scheduler relabeled every other scheduler's completed pods,
fighting over ownership and burning kube-API + watcher cycles. See
#66396 for the full
walkthrough and a real user report on the mailing list (3 schedulers,
Airflow 3.2.1, 1000+ pods/min, tasks stalling in scheduled / queued,
delete_worker_pods=False not helping).

What changed

  • New helper _alive_other_scheduler_job_ids queries the Job table
    for SchedulerJobs whose state == RUNNING and whose
    latest_heartbeat is within
    [scheduler] scheduler_health_check_threshold — matching the
    alive-scheduler definition already used by
    SchedulerJobRunner.adopt_or_reset_orphaned_tasks. Returns an empty
    set on any DB error so the caller falls back to the pre-k8s executor - ensure pods cleaned up #61839
    "exclude self only" selector — a transient DB issue must not break
    completed-pod cleanup.
  • _adopt_completed_pods now builds the label selector to exclude
    self + every alive sibling using K8s set-based syntax:
    airflow-worker notin (a,b,c). With one scheduler the selector is
    identical to the pre-fix equality form, so single-scheduler
    deployments see no behavior change.

Why this preserves the original #61839 goal

#61839 closed #57553:
"completed pods leaking after scheduler restart". The fix here keeps
that working — when a scheduler dies, its Job row's
latest_heartbeat ages out of the threshold, the helper stops
including its ID in the alive set, and the surviving schedulers will
adopt and clean up its completed pods on the next interval tick. The
only behavior removed is the steady-state cross-scheduler relabeling
that caused the thrash.

Test plan

  • New test_adopt_completed_pods_excludes_alive_siblings: with
    _alive_other_scheduler_job_ids mocked to {7, 9} and
    scheduler_job_id = "5", the selector emitted is
    kubernetes_executor=True,airflow-worker notin (5,7,9),airflow_executor_done!=True.
  • New test_adopt_completed_pods_single_scheduler_unchanged: with
    the helper returning an empty set, the selector is identical to
    the pre-fix string
    (kubernetes_executor=True,airflow-worker!=modified,airflow_executor_done!=True).
  • Existing test_adopt_completed_pods and
    test_adopt_completed_pods_api_exception still pass unchanged
    (the helper's fallback returns an empty set when
    scheduler_job_id is the tests' non-numeric string).
  • Smoke test in CI on the AMD pipeline.
  • (Manual) cluster-mode verification on a 3-scheduler
    KubernetesExecutor deployment — drop PATCH traffic should be
    visible immediately on the kube API server metrics. (Anyone
    affected by KubernetesExecutor: multi-scheduler completed-pod thrash (10.15.0+) #66396 can apply this patch and report back; happy
    to coordinate.)

closes: #66396


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.7)

Generated-by: Claude Code (Opus 4.7) following the guidelines

…edulers

PR apache#61839 (cncf-kubernetes 10.15.0) added a periodic call to
`_adopt_completed_pods` from inside `KubernetesExecutor.sync()`, gated by
`[scheduler] orphaned_tasks_check_interval` (default 300 s). The query
selects every Succeeded pod whose `airflow-worker` label is not the current
scheduler's label and PATCHes it with the current scheduler's label so its
KubernetesJobWatcher will see the change and DELETE the pod.

With multi-scheduler deployments that caused thrashing — every
`orphaned_tasks_check_interval` each scheduler iterated over every Succeeded
pod that did not carry its own label and PATCHed it. Schedulers fought each
other:

  * Scheduler A relabels every Succeeded pod owned by B and C → A's watcher
    DELETEs them.
  * Scheduler B does the same a few seconds later → relabels A's freshly
    patched pods to B → B's watcher takes over.
  * Scheduler C the same.

At steady state with high pod churn this manifested as heavy
PATCH /api/v1/namespaces/.../pods/... traffic, expensive `_list_pods` calls
on every interval tick (apache#35599 already documents this is 15-30 s with 500
pods), and tasks stalling in `scheduled` / `queued` because every scheduler
loop was burning seconds inside `_list_pods` and `patch_namespaced_pod`
instead of doing useful scheduling. Setting `delete_worker_pods=False` did
NOT help — the periodic adoption code path doesn't gate on it; it goes
through the watcher's delete.

Fix: scope the periodic adoption to pods owned by no-longer-alive
schedulers. New helper `_alive_other_scheduler_job_ids` queries the
`Job` table for SchedulerJobs whose `state == RUNNING` and whose
`latest_heartbeat` is within `[scheduler] scheduler_health_check_threshold`
(matching the alive-scheduler definition already used by
`SchedulerJobRunner.adopt_or_reset_orphaned_tasks`). The label selector
in `_adopt_completed_pods` is then built to exclude self + every alive
sibling using K8s set-based syntax `airflow-worker notin (a,b,c)`:

  * Single-scheduler deployment: no behavior change. Helper returns empty
    set, selector falls back to the original equality form
    `airflow-worker!=<self_label>`.
  * Multi-scheduler deployment: each scheduler only adopts pods whose
    owning scheduler is gone — preserving the original goal of apache#61839
    (cleanup after a scheduler restart) without the thrash.

If the DB query fails, the helper returns an empty set so the caller
falls back to the pre-apache#61839 "exclude self only" selector — a transient
DB issue must not break completed-pod cleanup.

Two new unit tests cover the multi-scheduler set-based selector and
confirm the single-scheduler equality form is unchanged. Existing
`test_adopt_completed_pods` and `test_adopt_completed_pods_api_exception`
keep their original assertions because the new helper falls back to an
empty set when `scheduler_job_id` is the test's non-numeric string.

Closes: apache#66396
@boring-cyborg boring-cyborg Bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels May 5, 2026
@potiuk potiuk marked this pull request as ready for review May 5, 2026 09:23
Copy link
Copy Markdown
Contributor

@vincbeck vincbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving but I am not a Kubernetes expert. Ideally a second approval would be appreciated :)

Address @jscheffl's first review point on apache#66400: previously the
`_alive_other_scheduler_job_ids` helper called `.all()` to materialize
the SQLAlchemy scalar result into a `list[int]` before passing it to
`set(...)`. Switch to a set-comprehension over the scalar cursor so
no intermediate list is built.

The functional behavior is identical; this just keeps the in-flight
memory footprint flat regardless of how many sibling schedulers are
alive at the moment of the query. In practice the query returns
`int`s of a tens-of-bytes-per-row class for a handful of rows in any
realistic Airflow HA deployment, so the saving is small — but the
cleaner pattern matches the review feedback exactly.
@potiuk
Copy link
Copy Markdown
Member Author

potiuk commented May 5, 2026

Pushed 69070f0e9c to address point (1) — _alive_other_scheduler_job_ids now feeds the scalar cursor straight into a set comprehension, no intermediate list[int] materialised.

On the other concerns:

FOR UPDATE … SKIP LOCKED — I don't think it fits here. The query is a pure read of "who is alive right now" and is racing against every other scheduler's heartbeat write on its own Job row. With FOR UPDATE we'd block on those heartbeats; with SKIP LOCKED we'd routinely miss siblings whose row is in the middle of a heartbeat update and falsely classify them as dead — which would relabel their pods and bring the thrashing back. So adding the lock would actively make the selector wrong rather than safer.

The race you're describing — two healthy schedulers both adopting the same dead-scheduler pod at the same kube-API tick — is real, but it's at the K8s layer, not at the DB. Worst case: scheduler A patches the pod with airflow-worker=A, scheduler B patches it with airflow-worker=B, last write wins, the winning scheduler's watcher deletes the pod. One redundant patch_namespaced_pod call per pod in that exact-collision window — not a correctness problem and orders of magnitude smaller than the per-tick thrash this PR fixes. Truly serialising adoption between alive schedulers would need K8s-level coordination (e.g. a Lease-based leader election around _adopt_completed_pods), which is well beyond the scope of fixing the regression #61839 introduced. Happy to file a follow-up issue if you want to pursue that.

Limit / batching — the K8s pod-list label selector goes in the URL query string. With a generous 100 alive schedulers (typical HA runs 2–5; >10 is already unusual) the resulting notin (…) clause is ~1.5 KB; the kube-apiserver URL-length ceiling is in the hundreds-of-KB range and metav1.LabelSelector itself doesn't cap MatchExpressions value cardinality. So at any operationally plausible scheduler count we're 100×–1000× under the limit and don't need to batch the K8s side. If you'd still like a defensive cap with a logged warning above some threshold (say 64 alive siblings) I can add that — it'd be a sanity guard rather than a correctness fix.


Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

…s/executors/kubernetes_executor.py

Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
@potiuk potiuk merged commit 8464275 into apache:main May 5, 2026
7 checks passed
@potiuk potiuk deleted the fix-k8s-multi-scheduler-thrash branch May 5, 2026 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KubernetesExecutor: multi-scheduler completed-pod thrash (10.15.0+) Completed Kubernetes Pods not cleared up

3 participants