k8s tests: wait for push task in executor before killing scheduler by potiuk · Pull Request #67067 · apache/airflow

potiuk · 2026-05-17T18:15:28Z

tests/kubernetes_tests/test_other_executors.py::TestCeleryAndLocalExecutor::test_integration_run_dag_with_scheduler_failure
is occasionally flaky on ARM CI — the scheduler is killed immediately after
start_job_in_kubernetes, so the post-restart scheduler sometimes has to
handle the very first scheduling step itself before the 40-second
monitor_task timeout expires. push stays in queued for the full 40s
and the test fails with assert 'queued' == 'success'.

Two adjustments:

1. Before killing the scheduler, wait until the push task instance has
   reached a queued-or-later state via a new wait_until_task_in_executor
   helper. The original scheduler has already handed the task to the
   executor, so the post-restart scheduler only needs to drive the
   downstream dependency for puller, not pick up push from scratch.
2. Bump the post-restart monitor_task timeouts from 40 s to 120 s. The
   previous "fail fast if failing" budget races with scheduler-loop warm-up
   under load; 120 s is still quick on a real bug but gives headroom for
   the legitimate cases.
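The first adjustment is a poll-until-state wait. The sketch below illustrates that pattern only; it is not the helper added in this PR. The name `wait_until_queued_or_later`, the `get_state` callback, the `QUEUED_OR_LATER` set, and the default timeout and poll interval are all assumptions made for illustration. The real `wait_until_task_in_executor` presumably reads the task instance state from the running Airflow deployment, and its signature is not shown here.

```python
import time
from typing import Callable, Optional

# Assumption for this sketch: states treated as "queued or later".
# The set used by the real helper may differ.
QUEUED_OR_LATER = frozenset({"queued", "running", "success"})


def wait_until_queued_or_later(
    get_state: Callable[[], Optional[str]],
    timeout: float = 60.0,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns a queued-or-later state.

    get_state is a hypothetical callback that returns the current task
    instance state, or None if the task instance does not exist yet.
    Raises TimeoutError if the deadline passes first.
    """
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state in QUEUED_OR_LATER:
            # The task has been handed to the executor, so it should be
            # safe to kill the scheduler from this point on.
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"task still in state {state!r} after {timeout}s"
            )
        sleep(poll_interval)
```

In a test, a call like this would sit between starting the DAG run and killing the scheduler pod, with `get_state` wired to whatever the test already uses to read task state. Injecting `clock` and `sleep` is a design choice of the sketch so that the wait can be exercised without real delays.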

This is the residual flake left after #46502 — the kubectl rollout status
wait fixed the worst of it, but the race between "pod is running" and
"scheduler loop is actually scheduling" remained. Reopens the spirit of
#45145.

Observed failing CI job: https://github.com/apache/airflow/actions/runs/25973707142/job/76353251911

Was generative AI tooling used to co-author this PR?

Yes — Claude Opus 4.7 (1M context)

Generated-by: Claude Opus 4.7 (1M context) following the guidelines

test_integration_run_dag_with_scheduler_failure is intermittently flaky on ARM CI: the scheduler is killed immediately after start_job_in_kubernetes, so the new scheduler pod (after `kubectl rollout status` returns successfully) sometimes has to handle the very first scheduling step itself before the 40s monitor_task timeout expires. push stays in `queued` for the full 40s and the test fails with `assert 'queued' == 'success'`.

Two adjustments:

1. Before killing the scheduler, wait until the `push` task instance has reached a `queued`-or-later state. That way the original scheduler has already handed the task to the executor and the post-restart scheduler only needs to drive the downstream dependency for `puller`, not pick up `push` from scratch.
2. Bump the post-restart monitor_task timeout from 40s to 120s. The previous "fail fast if failing" budget races with scheduler-loop warm-up under load; 120s is still fast for a successful run and gives a clear margin for the legitimate cases.

This is a residual flake left after apache#46502: the rollout-status wait fixed the worst of it, but the race between "pod is running" and "scheduler loop is actually scheduling" remained. Reopens the spirit of apache#45145.

potiuk · 2026-05-17T19:45:00Z

I'd love to get this one merged — and would love it in 3.2.2 if it's not too late. cc @vatsrahul1001 (3.2.2 RM)

Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

github-actions · 2026-05-17T19:51:55Z

Backport successfully created: v3-2-test

Note: As of "Merging PRs targeted for Airflow 3.X", the committer who merges the PR is responsible for backporting PRs that are bug fixes (generally speaking) to the maintenance branches.

If in doubt, please ask in the #release-management Slack channel.

Status	Branch	Result
✅	v3-2-test

…scheduler (#67067) (#67068)

(cherry picked from commit 246c19f)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: Rahul Vats <43964496+vatsrahul1001@users.noreply.github.com>

potiuk requested review from ashb, gopidesupavan and jason810496 as code owners May 17, 2026 18:15

boring-cyborg Bot added the area:kubernetes-tests label May 17, 2026

potiuk added the backport-to-v3-2-test label (Mark PR with this label to backport to v3-2-test branch) May 17, 2026

potiuk added this to the Airflow 3.2.2 milestone May 17, 2026

jscheffl approved these changes May 17, 2026


potiuk merged commit 246c19f into apache:main May 17, 2026
92 checks passed

potiuk deleted the fix-k8s-scheduler-failure-test-race branch May 17, 2026 19:50
