k8s tests: wait for push task in executor before killing scheduler #67067
Merged
`test_integration_run_dag_with_scheduler_failure` is intermittently flaky on ARM CI: the scheduler is killed immediately after `start_job_in_kubernetes`, so the new scheduler pod (after `kubectl rollout status` returns successfully) sometimes has to handle the very first scheduling step itself before the 40s `monitor_task` timeout expires. `push` stays in `queued` for the full 40s and the test fails with `assert 'queued' == 'success'`.

Two adjustments:

1. Before killing the scheduler, wait until the `push` task instance has reached a `queued`-or-later state. That way the original scheduler has already handed the task to the executor and the post-restart scheduler only needs to drive the downstream dependency for `puller`, not pick up `push` from scratch.
2. Bump the post-restart `monitor_task` timeout from 40s to 120s. The previous "fail fast if failing" budget races with scheduler-loop warm-up under load; 120s is still fast for a successful run and gives a clear margin for the legitimate cases.

This is a residual flake left after apache#46502: the rollout-status wait fixed the worst of it, but the race between "pod is running" and "scheduler loop is actually scheduling" remained. Reopens the spirit of apache#45145.
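The first adjustment is a bounded polling wait. The following is only a minimal sketch of that idea, not the code from this PR: the function name `wait_for_task_state`, its parameters, and the `QUEUED_OR_LATER` set are assumptions made for illustration, and the state lookup is passed in as a callable so the sketch does not guess at how the test queries Airflow. The helper the PR actually adds is called `wait_until_task_in_executor`; its signature is not shown on this page.

```python
import time
from typing import Callable, Collection, Optional

# Assumption for this sketch: which states count as "queued or later".
QUEUED_OR_LATER = frozenset({"queued", "running", "success"})


def wait_for_task_state(
    get_state: Callable[[], Optional[str]],
    accepted: Collection[str] = QUEUED_OR_LATER,
    timeout: float = 60.0,
    poll_interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns an accepted state.

    Raises TimeoutError, naming the last observed state, if the deadline
    passes first. clock and sleep are injectable so the loop can be
    exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        state = get_state()  # hypothetical lookup of the task instance state
        if state in accepted:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"task still in state {state!r} after {timeout}s")
        sleep(poll_interval)


if __name__ == "__main__":
    # Simulated state progression; a real test would query the cluster instead.
    observed = iter([None, "scheduled", "queued"])
    print(wait_for_task_state(lambda: next(observed), sleep=lambda _: None))
```

Raising on timeout, rather than returning the last state, keeps a stuck hand-off from being mistaken for a scheduler-restart problem later in the test.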
jscheffl approved these changes May 17, 2026
Member
Author
I'd love to get this one merged, and would love it in 3.2.2 if it's not too late. cc @vatsrahul1001 (3.2.2 RM)
Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting
Contributor
Backport successfully created: v3-2-test
Note: As of Merging PRs targeted for Airflow 3.X In matter of doubt please ask in #release-management Slack channel.
vatsrahul1001 added a commit that referenced this pull request May 18, 2026
…scheduler (#67067) (#67068)
(cherry picked from commit 246c19f)
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: Rahul Vats <43964496+vatsrahul1001@users.noreply.github.com>
vatsrahul1001 added a commit that referenced this pull request May 20, 2026
…scheduler (#67067) (#67068)
vatsrahul1001 added a commit that referenced this pull request May 20, 2026
…scheduler (#67067) (#67068)
vatsrahul1001 added a commit that referenced this pull request May 21, 2026
…scheduler (#67067) (#67068)
`tests/kubernetes_tests/test_other_executors.py::TestCeleryAndLocalExecutor::test_integration_run_dag_with_scheduler_failure` is occasionally flaky on ARM CI: the scheduler is killed immediately after `start_job_in_kubernetes`, so the post-restart scheduler sometimes has to handle the very first scheduling step itself before the 40-second `monitor_task` timeout expires. `push` stays in `queued` for the full 40 s and the test fails with `assert 'queued' == 'success'`.

Two adjustments:

1. Before killing the scheduler, wait until the `push` task instance has reached a `queued`-or-later state via a new `wait_until_task_in_executor` helper. By then the original scheduler has already handed the task to the executor, so the post-restart scheduler only needs to drive the downstream dependency for `puller`, not pick up `push` from scratch.
2. Bump the post-restart `monitor_task` timeouts from 40 s to 120 s. The previous "fail fast if failing" budget races with scheduler-loop warm-up under load; 120 s still fails quickly enough on a real bug but gives headroom for the legitimate cases.

This is the residual flake left after #46502: the `kubectl rollout status` wait fixed the worst of it, but the race between "pod is running" and "scheduler loop is actually scheduling" remained. Reopens the spirit of #45145.
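The ordering described above can be summarised in a short sketch. This is an illustration only, not the test from the PR: every step is passed in as a callable with a hypothetical signature, and the assumption that both `push` and `puller` are monitored with the same 120 s budget is inferred from the description rather than taken from the diff.

```python
from typing import Callable

# From the description: post-restart budget raised from 40 s to 120 s.
POST_RESTART_TIMEOUT_S = 120


def run_scheduler_failure_scenario(
    start_job: Callable[[], None],
    wait_until_push_handed_off: Callable[[], None],
    kill_scheduler: Callable[[], None],
    wait_for_rollout: Callable[[], None],
    monitor_task: Callable[[str, str, int], str],
) -> None:
    """Run the steps in the order the PR description lays out.

    All parameters are hypothetical stand-ins: monitor_task is assumed to
    take (task_id, expected_state, timeout_s) and return the final state.
    """
    start_job()
    # New step: do not kill the scheduler until push is queued or later.
    wait_until_push_handed_off()
    kill_scheduler()
    # Stand-in for waiting on `kubectl rollout status` to succeed.
    wait_for_rollout()
    for task_id in ("push", "puller"):
        final_state = monitor_task(task_id, "success", POST_RESTART_TIMEOUT_S)
        assert final_state == "success", f"{task_id} ended in {final_state!r}"
```

Passing the steps in as callables keeps the sketch independent of the real test utilities, whose signatures are not shown on this page.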
Observed failing CI job: https://github.com/apache/airflow/actions/runs/25973707142/job/76353251911
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Opus 4.7 (1M context) following the guidelines