[v3-2-test] k8s tests: wait for push task in executor before killing scheduler (#67067) #67068

Merged
vatsrahul1001 merged 2 commits into v3-2-test from backport-246c19f-v3-2-test
May 18, 2026
Conversation

@github-actions
Contributor

`test_integration_run_dag_with_scheduler_failure` is intermittently flaky
on ARM CI: the scheduler is killed immediately after `start_job_in_kubernetes`,
so the new scheduler pod (after `kubectl rollout status` returns successfully)
sometimes has to handle the very first scheduling step itself before the
40s `monitor_task` timeout expires. `push` stays in `queued` for the full 40s
and the test fails with `assert 'queued' == 'success'`.

Two adjustments:

  1. Before killing the scheduler, wait until the `push` task instance has
    reached a `queued`-or-later state. That way the original scheduler has
    already handed the task to the executor and the post-restart scheduler
    only needs to drive the downstream dependency for `puller`, not pick up
    `push` from scratch.

  2. Bump the post-restart `monitor_task` timeout from 40s to 120s. The
    previous "fail fast if failing" budget races with scheduler-loop warm-up
    under load; 120s is still fast for a successful run and leaves a clear
    margin for runs that are slow but healthy.
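
The first adjustment amounts to a poll-with-deadline loop. The sketch below is only an illustration of that idea, not the code from this PR: `wait_for_state`, `QUEUED_OR_LATER`, and the `get_state` callable are hypothetical names, and the set of states is an assumed subset rather than the full list Airflow uses.

```python
import time
from typing import Callable, Collection, Optional

# Assumed, illustrative subset of task states that count as "queued or later",
# i.e. the scheduler has already handed the task to the executor.
QUEUED_OR_LATER = frozenset({"queued", "running", "success", "failed"})


def wait_for_state(
    get_state: Callable[[], Optional[str]],
    accepted: Collection[str],
    timeout: float,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns a state in `accepted`.

    `get_state` is a hypothetical callable; in a real test it would query
    the task instance state, for example through the API server.
    Raises TimeoutError if no accepted state is seen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state in accepted:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"state {state!r} not in {sorted(accepted)} after {timeout}s"
            )
        sleep(poll_interval)
```

In a test shaped like the one described here, such a helper would be called for `push` before the scheduler pod is deleted, and the longer 120s budget would apply only to the post-restart wait.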

This is a residual flake left after #46502: the rollout-status wait fixed
the worst of it, but the race between "pod is running" and "scheduler loop
is actually scheduling" remained.

Reopens the spirit of #45145.
(cherry picked from commit 246c19f)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>

@vatsrahul1001 vatsrahul1001 marked this pull request as ready for review May 18, 2026 06:00
@vatsrahul1001 vatsrahul1001 added this to the Airflow 3.2.2 milestone May 18, 2026
@vatsrahul1001 vatsrahul1001 added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label May 18, 2026
@vatsrahul1001 vatsrahul1001 merged commit 6db9402 into v3-2-test May 18, 2026
88 checks passed
@vatsrahul1001 vatsrahul1001 deleted the backport-246c19f-v3-2-test branch May 18, 2026 09:39
vatsrahul1001 added a commit that referenced this pull request May 20, 2026
…scheduler (#67067) (#67068)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: Rahul Vats <43964496+vatsrahul1001@users.noreply.github.com>
vatsrahul1001 added a commit that referenced this pull request May 20, 2026
vatsrahul1001 added a commit that referenced this pull request May 21, 2026
Labels

area:kubernetes-tests changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..)
