[v3-2-test] k8s tests: wait for push task in executor before killing scheduler (#67067) by github-actions[bot] · Pull Request #67068 · apache/airflow

github-actions · 2026-05-17T19:51:52Z

test_integration_run_dag_with_scheduler_failure is intermittently flaky
on ARM CI: the scheduler is killed immediately after start_job_in_kubernetes,
so the new scheduler pod (after `kubectl rollout status` returns successfully)
sometimes has to handle the very first scheduling step itself before the
40s monitor_task timeout expires. push stays in `queued` for the full 40s
and the test fails with `assert 'queued' == 'success'`.

Two adjustments:

1. Before killing the scheduler, wait until the `push` task instance has
   reached a `queued`-or-later state. That way the original scheduler has
   already handed the task to the executor and the post-restart scheduler
   only needs to drive the downstream dependency for `puller`, not pick up
   `push` from scratch.
2. Bump the post-restart monitor_task timeout from 40s to 120s. The
   previous "fail fast if failing" budget races with scheduler-loop warm-up
   under load; 120s is still fast for a successful run and gives a clear
   margin for the legitimate cases.
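
The first adjustment amounts to a polling gate in front of the scheduler kill. The block below is a minimal sketch of such a gate, not the code from this PR: `wait_for_state_at_least`, `STATE_ORDER`, and the injected `get_state` callable are hypothetical names, and the state ordering is an assumption made for illustration.

```python
import time
from typing import Callable, Optional, Sequence

# Assumed happy-path ordering of task-instance states, for this sketch only.
# States outside this tuple (None, "failed", ...) never satisfy the gate here.
STATE_ORDER: Sequence[str] = ("scheduled", "queued", "running", "success")


def wait_for_state_at_least(
    get_state: Callable[[], Optional[str]],
    minimum: str,
    timeout_s: float,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state() until it reports `minimum` or a later state.

    Returns the first state that satisfies the gate, or raises TimeoutError
    once `timeout_s` seconds have elapsed without reaching it.
    """
    threshold = STATE_ORDER.index(minimum)
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in STATE_ORDER and STATE_ORDER.index(state) >= threshold:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"state {state!r} did not reach {minimum!r} within {timeout_s}s"
            )
        sleep(poll_interval_s)
```

In a test, `get_state` would wrap whatever call returns the current state of the `push` task instance. The gate would run with `minimum="queued"` before the scheduler is killed, and the same helper with `minimum="success"` and `timeout_s=120` would play the role of the post-restart wait.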

This is a residual flake left after #46502 — the rollout-status wait fixed
the worst of it, but the race between "pod is running" and "scheduler loop
is actually scheduling" remained.
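
The remaining race can be pictured with a toy timing model. This is an illustrative sketch only: the function name, its parameters, and the numbers in the usage note are invented for the example and are not measurements from CI. It assumes the monitor's clock starts once the rollout reports the pod as ready, while scheduling only begins after an additional warm-up period.

```python
def observed_state(timeout_s: float, loop_warmup_s: float, task_run_s: float) -> str:
    """Toy model: the state a monitor sees when its timeout expires.

    timeout_s     -- how long the monitor waits after the pod is reported ready
    loop_warmup_s -- hypothetical delay before the scheduler loop starts scheduling
    task_run_s    -- hypothetical time the task needs once it has been picked up
    """
    if loop_warmup_s > timeout_s:
        # The scheduler never got around to the task before the monitor gave up.
        return "queued"
    if loop_warmup_s + task_run_s > timeout_s:
        # The task was picked up but had not finished yet.
        return "running"
    return "success"
```

With a made-up warm-up of 45s and a 10s task, `observed_state(40, 45, 10)` yields `"queued"`, which matches the shape of the reported failure, while `observed_state(120, 45, 10)` yields `"success"`.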

Reopens the spirit of #45145.
(cherry picked from commit 246c19f)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>


…scheduler (#67067) (#67068) Same message as the pull request description, with trailers: (cherry picked from commit 246c19f) Co-authored-by: Jarek Potiuk <jarek@potiuk.com> Co-authored-by: Rahul Vats <43964496+vatsrahul1001@users.noreply.github.com>

boring-cyborg Bot added the area:kubernetes-tests label May 17, 2026

github-actions Bot mentioned this pull request May 17, 2026

k8s tests: wait for push task in executor before killing scheduler #67067

Merged

1 task

vatsrahul1001 approved these changes May 18, 2026


vatsrahul1001 marked this pull request as ready for review May 18, 2026 06:00

vatsrahul1001 requested review from ashb, gopidesupavan, jason810496 and potiuk as code owners May 18, 2026 06:00

Merge branch 'v3-2-test' into backport-246c19f-v3-2-test

0ef37f9

vatsrahul1001 added this to the Airflow 3.2.2 milestone May 18, 2026

vatsrahul1001 added the changelog:skip label (changes that should be skipped from the changelog: CI, tests, etc.) May 18, 2026

jason810496 approved these changes May 18, 2026


vatsrahul1001 merged commit 6db9402 into v3-2-test May 18, 2026
88 checks passed

vatsrahul1001 deleted the backport-246c19f-v3-2-test branch May 18, 2026 09:39

vatsrahul1001 merged 2 commits into v3-2-test from backport-246c19f-v3-2-test


3 participants
