[v3-2-test] k8s tests: wait for push task in executor before killing scheduler (#67067) #67068
Merged
…scheduler (#67067)

test_integration_run_dag_with_scheduler_failure is intermittently flaky on ARM CI: the scheduler is killed immediately after start_job_in_kubernetes, so the new scheduler pod (after `kubectl rollout status` returns successfully) sometimes has to handle the very first scheduling step itself before the 40s monitor_task timeout expires. `push` stays in `queued` for the full 40s and the test fails with `assert 'queued' == 'success'`.

Two adjustments:

1. Before killing the scheduler, wait until the `push` task instance has reached a `queued`-or-later state. That way the original scheduler has already handed the task to the executor and the post-restart scheduler only needs to drive the downstream dependency for `puller`, not pick up `push` from scratch.
2. Bump the post-restart monitor_task timeout from 40s to 120s. The previous "fail fast if failing" budget races with scheduler-loop warm-up under load; 120s is still fast for a successful run and gives a clear margin for the legitimate cases.

This is a residual flake left after #46502: the rollout-status wait fixed the worst of it, but the race between "pod is running" and "scheduler loop is actually scheduling" remained.

Reopens the spirit of #45145.

(cherry picked from commit 246c19f)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
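The first adjustment described above amounts to polling the task-instance state until it has left the pre-queue states, and only then killing the scheduler. A minimal sketch of such a wait follows. It is not the code from this PR: `wait_for_state_in`, `QUEUED_OR_LATER`, the `get_state` callable, and the poll interval are all hypothetical names chosen for illustration, and the set of states counted as "queued or later" is an assumption.

```python
import time
from typing import Callable, Iterable, Optional

# Assumption: which states count as "queued or later"; the real test may use another set.
QUEUED_OR_LATER = frozenset({"queued", "running", "success"})


def wait_for_state_in(
    get_state: Callable[[], Optional[str]],
    accepted: Iterable[str] = QUEUED_OR_LATER,
    timeout: float = 120.0,
    poll_interval: float = 1.0,
) -> str:
    """Poll get_state() until it returns an accepted state or the timeout elapses.

    get_state is a hypothetical callable that returns the current state of the
    `push` task instance (for example by querying the API server).
    """
    accepted_set = frozenset(accepted)
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state in accepted_set:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"state still {state!r} after {timeout}s; "
                f"expected one of {sorted(accepted_set)}"
            )
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Simulated state sequence standing in for real task-instance lookups.
    simulated = iter([None, "scheduled", "queued"])
    print(wait_for_state_in(lambda: next(simulated), poll_interval=0.0))
```

In a test shaped like the one described, the scheduler pod would be killed only after this call returns, so the restarted scheduler does not have to queue `push` itself. The second adjustment corresponds to passing a larger `timeout` to the post-restart wait.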
vatsrahul1001
approved these changes
May 18, 2026
jason810496
approved these changes
May 18, 2026
vatsrahul1001
added a commit
that referenced
this pull request
May 20, 2026
…scheduler (#67067) (#67068)

test_integration_run_dag_with_scheduler_failure is intermittently flaky on ARM CI: the scheduler is killed immediately after start_job_in_kubernetes, so the new scheduler pod (after `kubectl rollout status` returns successfully) sometimes has to handle the very first scheduling step itself before the 40s monitor_task timeout expires. `push` stays in `queued` for the full 40s and the test fails with `assert 'queued' == 'success'`.

Two adjustments:

1. Before killing the scheduler, wait until the `push` task instance has reached a `queued`-or-later state. That way the original scheduler has already handed the task to the executor and the post-restart scheduler only needs to drive the downstream dependency for `puller`, not pick up `push` from scratch.
2. Bump the post-restart monitor_task timeout from 40s to 120s. The previous "fail fast if failing" budget races with scheduler-loop warm-up under load; 120s is still fast for a successful run and gives a clear margin for the legitimate cases.

This is a residual flake left after #46502: the rollout-status wait fixed the worst of it, but the race between "pod is running" and "scheduler loop is actually scheduling" remained.

Reopens the spirit of #45145.

(cherry picked from commit 246c19f)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: Rahul Vats <43964496+vatsrahul1001@users.noreply.github.com>
vatsrahul1001
added a commit
that referenced
this pull request
May 20, 2026
…scheduler (#67067) (#67068)
vatsrahul1001
added a commit
that referenced
this pull request
May 21, 2026
…scheduler (#67067) (#67068)