Fix flaky test_chaos_rechunk #9317

Draft
crusaderky wants to merge 2 commits into dask:main from crusaderky:test_chaos_rechunk
Conversation

crusaderky commented Jul 2, 2026

Collaborator

This test frequently hangs, e.g. https://github.com/dask/distributed/actions/runs/28594679995/job/84786875121?pr=9315

WIP unreviewed clanker output; please disregard for now

github-actions Bot commented Jul 2, 2026

Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

   19 files   -     21     19 suites   - 21   7h 8m 50s ⏱️ - 7h 6m 35s
  100 tests  -  4 055    100 ✅  -  3 870    0 💤  -   180  0 ❌  - 5 
1 900 runs   - 77 520  1 800 ✅  - 73 412  100 💤  - 4 103  0 ❌  - 5 

Results for commit b1bef38. ± Comparison against base commit 2d43f7f.

This pull request removes 4155 and adds 100 tests. Note that renamed tests count towards both.
distributed.cli.tests.test_dask_scheduler ‑ test_dashboard
distributed.cli.tests.test_dask_scheduler ‑ test_dashboard_allowlist
distributed.cli.tests.test_dask_scheduler ‑ test_dashboard_non_standard_ports
distributed.cli.tests.test_dask_scheduler ‑ test_dashboard_port_zero
distributed.cli.tests.test_dask_scheduler ‑ test_defaults
distributed.cli.tests.test_dask_scheduler ‑ test_hostport
distributed.cli.tests.test_dask_scheduler ‑ test_idle_timeout
distributed.cli.tests.test_dask_scheduler ‑ test_interface
distributed.cli.tests.test_dask_scheduler ‑ test_multiple_protocols
distributed.cli.tests.test_dask_scheduler ‑ test_multiple_workers
…
distributed.tests.test_stress ‑ test_chaos_rechunk[1-100]
distributed.tests.test_stress ‑ test_chaos_rechunk[10-100]
distributed.tests.test_stress ‑ test_chaos_rechunk[100-100]
distributed.tests.test_stress ‑ test_chaos_rechunk[11-100]
distributed.tests.test_stress ‑ test_chaos_rechunk[12-100]
distributed.tests.test_stress ‑ test_chaos_rechunk[13-100]
distributed.tests.test_stress ‑ test_chaos_rechunk[14-100]
distributed.tests.test_stress ‑ test_chaos_rechunk[15-100]
distributed.tests.test_stress ‑ test_chaos_rechunk[16-100]
distributed.tests.test_stress ‑ test_chaos_rechunk[17-100]
…

♻️ This comment has been updated with latest results.

test_chaos_rechunk failed frequently on slow CI hosts with a Nanny stuck
in Status.closing at teardown. Root cause: WorkerProcess.kill waits
timeout*0.8 for graceful shutdown, then SIGKILLs the worker but joins it
with only whatever remains of the original budget (~1s or less), which on
an overloaded machine is not enough even for a SIGKILLed process to be
reaped. The resulting asyncio.TimeoutError aborted Nanny.close midway,
leaving the nanny in Status.closing forever and deadlocking any further
close() call on self.finished().
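
The budget problem described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not the code in `distributed`: `FakeProcess`, `kill_with_leftover_budget`, `kill_with_fresh_budget`, and all the timings are hypothetical names and values chosen for illustration. It contrasts joining a killed process with the leftovers of the overall deadline against joining it with a timeout of its own.

```python
import asyncio
import time


class FakeProcess:
    """Hypothetical stand-in for a worker process (not a distributed class)."""

    def __init__(self, reap_delay: float) -> None:
        # Seconds between the simulated SIGKILL and the process being reaped.
        self.reap_delay = reap_delay
        self._reaped = asyncio.Event()

    def kill(self) -> None:
        # Simulate SIGKILL on a loaded host: the exit is only observed later.
        asyncio.get_running_loop().call_later(self.reap_delay, self._reaped.set)

    async def join(self, timeout: float) -> None:
        # Raises asyncio.TimeoutError if the process is not reaped in time.
        await asyncio.wait_for(self._reaped.wait(), timeout)


async def kill_with_leftover_budget(proc: FakeProcess, timeout: float) -> None:
    deadline = time.monotonic() + timeout
    await asyncio.sleep(timeout * 0.8)  # graceful shutdown that never completes
    proc.kill()
    # Only what is left of the original budget (about 20%) is available here.
    await proc.join(max(0.0, deadline - time.monotonic()))


async def kill_with_fresh_budget(
    proc: FakeProcess, timeout: float, join_timeout: float
) -> None:
    await asyncio.sleep(timeout * 0.8)  # same graceful phase as above
    proc.kill()
    # The join gets its own timeout, independent of the graceful phase.
    await proc.join(join_timeout)


async def demo() -> tuple:
    leftover_timed_out = False
    try:
        await kill_with_leftover_budget(FakeProcess(reap_delay=0.3), timeout=0.5)
    except asyncio.TimeoutError:
        leftover_timed_out = True

    await kill_with_fresh_budget(
        FakeProcess(reap_delay=0.3), timeout=0.5, join_timeout=2.0
    )
    fresh_joined = True  # reached only if the join above did not time out
    return leftover_timed_out, fresh_joined
```

With these made-up numbers the leftover budget is roughly 0.1s against a 0.3s reap delay, so the first variant times out while the second one joins cleanly.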

- WorkerProcess.kill: join the killed process with a fresh timeout instead
  of the leftovers of the graceful-shutdown budget
- Nanny.close: always reach a terminal status and complete Server.close,
  even if killing the worker process fails
- test_chaos_rechunk: raise gen_cluster timeout; on loaded CI hosts
  cluster startup alone can eat most of the 30s default while the test
  body by design runs for 10+ seconds

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@crusaderky crusaderky mentioned this pull request Jul 3, 2026
4 tasks