k8s tests: wait for push task in executor before killing scheduler#67067

Merged
potiuk merged 1 commit into
apache:mainfrom
potiuk:fix-k8s-scheduler-failure-test-race
May 17, 2026

Conversation

@potiuk
Member

@potiuk potiuk commented May 17, 2026

tests/kubernetes_tests/test_other_executors.py::TestCeleryAndLocalExecutor::test_integration_run_dag_with_scheduler_failure
is occasionally flaky on ARM CI. The scheduler is killed immediately after
start_job_in_kubernetes, so the post-restart scheduler sometimes has to
handle the very first scheduling step itself, and it does not always finish
before the 40-second monitor_task timeout expires. When that happens, push
stays in queued for the full 40 s and the test fails with
assert 'queued' == 'success'.

Two adjustments:

  1. Before killing the scheduler, wait until the push task instance has
    reached a queued-or-later state via a new wait_until_task_in_executor
    helper. The original scheduler has already handed the task to the
    executor, so the post-restart scheduler only needs to drive the
    downstream dependency for puller, not pick up push from scratch.
  2. Raise the post-restart monitor_task timeouts from 40 s to 120 s. The
    previous "fail fast if failing" budget races with scheduler-loop warm-up
    under load; 120 s still fails reasonably quickly on a real bug but leaves
    headroom for legitimate slow restarts.
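
For readers who want a feel for the first adjustment, here is a minimal sketch of what a polling helper like wait_until_task_in_executor could look like. It is not the code from this PR: the `get_state` callable, the `IN_EXECUTOR_STATES` set, the default timeout, and the injectable `clock`/`sleep` parameters are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional

# Assumed set of "queued-or-later" states; the real helper may use a different set.
IN_EXECUTOR_STATES = frozenset({"queued", "running", "success", "failed", "up_for_retry"})


def wait_until_task_in_executor(
    get_state: Callable[[], Optional[str]],
    timeout: float = 60.0,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns a queued-or-later state.

    get_state is a hypothetical callable returning the current state of the
    task instance (for example by querying the Airflow REST API), or None if
    the task instance does not exist yet.
    """
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state in IN_EXECUTOR_STATES:
            # The scheduler has already handed the task to the executor.
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"task still in state {state!r} after {timeout}s; "
                "it never reached the executor"
            )
        sleep(poll_interval)
```

In the test flow described above, a call like this would sit between start_job_in_kubernetes and the step that kills the scheduler, so the restart only happens once push is already with the executor.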

This is the residual flake left after #46502 — the kubectl rollout status
wait fixed the worst of it, but the race between "pod is running" and
"scheduler loop is actually scheduling" remained. Reopens the spirit of
#45145.

Observed failing CI job: https://github.com/apache/airflow/actions/runs/25973707142/job/76353251911


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Opus 4.7 (1M context)

Generated-by: Claude Opus 4.7 (1M context) following the guidelines

test_integration_run_dag_with_scheduler_failure is intermittently flaky
on ARM CI: the scheduler is killed immediately after start_job_in_kubernetes,
so the new scheduler pod (after `kubectl rollout status` returns successfully)
sometimes has to handle the very first scheduling step itself before the
40s monitor_task timeout expires. push stays in `queued` for the full 40s
and the test fails with `assert 'queued' == 'success'`.

Two adjustments:

1. Before killing the scheduler, wait until the `push` task instance has
   reached a `queued`-or-later state. That way the original scheduler has
   already handed the task to the executor and the post-restart scheduler
   only needs to drive the downstream dependency for `puller`, not pick up
   `push` from scratch.

2. Bump the post-restart monitor_task timeout from 40s to 120s. The
   previous "fail fast if failing" budget races with scheduler-loop warm-up
   under load; 120s is still fast for a successful run and gives a clear
   margin for the legitimate cases.

This is a residual flake left after apache#46502 — the rollout-status wait fixed
the worst of it, but the race between "pod is running" and "scheduler loop
is actually scheduling" remained.

Reopens the spirit of apache#45145.
@potiuk potiuk added the backport-to-v3-2-test label May 17, 2026
@potiuk potiuk added this to the Airflow 3.2.2 milestone May 17, 2026
@potiuk
Member Author

potiuk commented May 17, 2026

I'd love to get this one merged — and would love it in 3.2.2 if it's not too late. cc @vatsrahul1001 (3.2.2 RM)


Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

@potiuk potiuk merged commit 246c19f into apache:main May 17, 2026
92 checks passed
@potiuk potiuk deleted the fix-k8s-scheduler-failure-test-race branch May 17, 2026 19:50
@github-actions
Contributor

Backport successfully created: v3-2-test

Note: As of "Merging PRs targeted for Airflow 3.X", the committer who merges the PR is responsible for backporting the PRs that are bug fixes (generally speaking) to the maintenance branches.

In case of doubt, please ask in the #release-management Slack channel.

Status Branch Result
v3-2-test PR Link

vatsrahul1001 added a commit that referenced this pull request May 18, 2026
…scheduler (#67067) (#67068)

test_integration_run_dag_with_scheduler_failure is intermittently flaky
on ARM CI: the scheduler is killed immediately after start_job_in_kubernetes,
so the new scheduler pod (after `kubectl rollout status` returns successfully)
sometimes has to handle the very first scheduling step itself before the
40s monitor_task timeout expires. push stays in `queued` for the full 40s
and the test fails with `assert 'queued' == 'success'`.

Two adjustments:

1. Before killing the scheduler, wait until the `push` task instance has
   reached a `queued`-or-later state. That way the original scheduler has
   already handed the task to the executor and the post-restart scheduler
   only needs to drive the downstream dependency for `puller`, not pick up
   `push` from scratch.

2. Bump the post-restart monitor_task timeout from 40s to 120s. The
   previous "fail fast if failing" budget races with scheduler-loop warm-up
   under load; 120s is still fast for a successful run and gives a clear
   margin for the legitimate cases.

This is a residual flake left after #46502 — the rollout-status wait fixed
the worst of it, but the race between "pod is running" and "scheduler loop
is actually scheduling" remained.

Reopens the spirit of #45145.
(cherry picked from commit 246c19f)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: Rahul Vats <43964496+vatsrahul1001@users.noreply.github.com>
vatsrahul1001 added a commit that referenced this pull request May 20, 2026
…scheduler (#67067) (#67068)

vatsrahul1001 added a commit that referenced this pull request May 20, 2026
…scheduler (#67067) (#67068)

vatsrahul1001 added a commit that referenced this pull request May 21, 2026
…scheduler (#67067) (#67068)


Labels

area:kubernetes-tests, backport-to-v3-2-test

2 participants