KubernetesExecutor: scope periodic completed-pod adoption to dead schedulers#66400
Conversation
…edulers PR apache#61839 (cncf-kubernetes 10.15.0) added a periodic call to `_adopt_completed_pods` from inside `KubernetesExecutor.sync()`, gated by `[scheduler] orphaned_tasks_check_interval` (default 300 s). The query selects every Succeeded pod whose `airflow-worker` label is not the current scheduler's label and PATCHes it with the current scheduler's label so its KubernetesJobWatcher will see the change and DELETE the pod. With multi-scheduler deployments that caused thrashing — every `orphaned_tasks_check_interval` each scheduler iterated over every Succeeded pod that did not carry its own label and PATCHed it. Schedulers fought each other: * Scheduler A relabels every Succeeded pod owned by B and C → A's watcher DELETEs them. * Scheduler B does the same a few seconds later → relabels A's freshly patched pods to B → B's watcher takes over. * Scheduler C the same. At steady state with high pod churn this manifested as heavy PATCH /api/v1/namespaces/.../pods/... traffic, expensive `_list_pods` calls on every interval tick (apache#35599 already documents this is 15-30 s with 500 pods), and tasks stalling in `scheduled` / `queued` because every scheduler loop was burning seconds inside `_list_pods` and `patch_namespaced_pod` instead of doing useful scheduling. Setting `delete_worker_pods=False` did NOT help — the periodic adoption code path doesn't gate on it; it goes through the watcher's delete. Fix: scope the periodic adoption to pods owned by no-longer-alive schedulers. New helper `_alive_other_scheduler_job_ids` queries the `Job` table for SchedulerJobs whose `state == RUNNING` and whose `latest_heartbeat` is within `[scheduler] scheduler_health_check_threshold` (matching the alive-scheduler definition already used by `SchedulerJobRunner.adopt_or_reset_orphaned_tasks`). The label selector in `_adopt_completed_pods` is then built to exclude self + every alive sibling using K8s set-based syntax `airflow-worker notin (a,b,c)`: * Single-scheduler deployment: no behavior change. Helper returns empty set, selector falls back to the original equality form `airflow-worker!=<self_label>`. * Multi-scheduler deployment: each scheduler only adopts pods whose owning scheduler is gone — preserving the original goal of apache#61839 (cleanup after a scheduler restart) without the thrash. If the DB query fails, the helper returns an empty set so the caller falls back to the pre-apache#61839 "exclude self only" selector — a transient DB issue must not break completed-pod cleanup. Two new unit tests cover the multi-scheduler set-based selector and confirm the single-scheduler equality form is unchanged. Existing `test_adopt_completed_pods` and `test_adopt_completed_pods_api_exception` keep their original assertions because the new helper falls back to an empty set when `scheduler_job_id` is the test's non-numeric string. Closes: apache#66396
vincbeck
left a comment
There was a problem hiding this comment.
Approving but I am not a Kubernetes expert. Ideally a second approval would be appreciated :)
Address @jscheffl's first review point on apache#66400: previously the `_alive_other_scheduler_job_ids` helper called `.all()` to materialize the SQLAlchemy scalar result into a `list[int]` before passing it to `set(...)`. Switch to a set-comprehension over the scalar cursor so no intermediate list is built. The functional behavior is identical; this just keeps the in-flight memory footprint flat regardless of how many sibling schedulers are alive at the moment of the query. In practice the query returns `int`s of a tens-of-bytes-per-row class for a handful of rows in any realistic Airflow HA deployment, so the saving is small — but the cleaner pattern matches the review feedback exactly.
|
Pushed On the other concerns:
The race you're describing — two healthy schedulers both adopting the same dead-scheduler pod at the same kube-API tick — is real, but it's at the K8s layer, not at the DB. Worst case: scheduler A patches the pod with Limit / batching — the K8s pod-list label selector goes in the URL query string. With a generous 100 alive schedulers (typical HA runs 2–5; >10 is already unusual) the resulting Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting |
…s/executors/kubernetes_executor.py Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
Summary
Fix for #66396. PR #61839
(cncf-kubernetes 10.15.0) added a periodic call to
_adopt_completed_podsfrom inside
KubernetesExecutor.sync(), gated by[scheduler] orphaned_tasks_check_interval(default 300 s). The queryselects every Succeeded pod whose
airflow-workerlabel is not thecurrent scheduler's label and PATCHes it with the current scheduler's label
so its
KubernetesJobWatcherwill see the change and DELETE the pod.With multi-scheduler deployments that caused thrashing — every interval
tick each scheduler relabeled every other scheduler's completed pods,
fighting over ownership and burning kube-API + watcher cycles. See
#66396 for the full
walkthrough and a real user report on the mailing list (3 schedulers,
Airflow 3.2.1, 1000+ pods/min, tasks stalling in
scheduled/queued,delete_worker_pods=Falsenot helping).What changed
_alive_other_scheduler_job_idsqueries theJobtablefor
SchedulerJobs whosestate == RUNNINGand whoselatest_heartbeatis within[scheduler] scheduler_health_check_threshold— matching thealive-scheduler definition already used by
SchedulerJobRunner.adopt_or_reset_orphaned_tasks. Returns an emptyset on any DB error so the caller falls back to the pre-k8s executor - ensure pods cleaned up #61839
"exclude self only" selector — a transient DB issue must not break
completed-pod cleanup.
_adopt_completed_podsnow builds the label selector to excludeself+ every alive sibling using K8s set-based syntax:airflow-worker notin (a,b,c). With one scheduler the selector isidentical to the pre-fix equality form, so single-scheduler
deployments see no behavior change.
Why this preserves the original #61839 goal
#61839 closed #57553:
"completed pods leaking after scheduler restart". The fix here keeps
that working — when a scheduler dies, its
Jobrow'slatest_heartbeatages out of the threshold, the helper stopsincluding its ID in the alive set, and the surviving schedulers will
adopt and clean up its completed pods on the next interval tick. The
only behavior removed is the steady-state cross-scheduler relabeling
that caused the thrash.
Test plan
test_adopt_completed_pods_excludes_alive_siblings: with_alive_other_scheduler_job_idsmocked to{7, 9}andscheduler_job_id = "5", the selector emitted iskubernetes_executor=True,airflow-worker notin (5,7,9),airflow_executor_done!=True.test_adopt_completed_pods_single_scheduler_unchanged: withthe helper returning an empty set, the selector is identical to
the pre-fix string
(
kubernetes_executor=True,airflow-worker!=modified,airflow_executor_done!=True).test_adopt_completed_podsandtest_adopt_completed_pods_api_exceptionstill pass unchanged(the helper's fallback returns an empty set when
scheduler_job_idis the tests' non-numeric string).KubernetesExecutor deployment — drop PATCH traffic should be
visible immediately on the kube API server metrics. (Anyone
affected by KubernetesExecutor: multi-scheduler completed-pod thrash (10.15.0+) #66396 can apply this patch and report back; happy
to coordinate.)
closes: #66396
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4.7) following the guidelines