
Scheduler shuts down after 20 minutes of inactivity - tasks not executed #5921

@lojohnson

Description


Reported by Coiled user forman for his recent clusters 121956 and 121459:

Hello! I'm currently trying to use Coiled to upscale an ND image (a 2+D dask array), but I constantly fail.
My source is read via xr.open_zarr() from S3 and is around 2k^2 pixels. I upscale it by a factor of 50 to around 100k^2 pixels. My x/y chunk size is 2048. I'm using dask.array.map_blocks(f, ...) with f being a custom function that uses numba. The output is written to S3 using xr.Dataset.to_zarr().
When I execute my job using local dask threading (or even processes) and the local filesystem, it runs without any problems; the result is as expected and finishes within seconds.
If I run the same on Coiled using S3 (from my local JNB), the cluster shuts down after 20 minutes of inactivity: the scheduler doesn't receive a single task. I cannot tell what is happening. My local CPU is at 3%, RAM at 8 GB, and there is practically no network traffic.
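For scale, the numbers in the report imply a few thousand map_blocks tasks on the output grid, so the scheduler receiving zero tasks is clearly abnormal. A quick back-of-the-envelope check (assuming a 2,000-pixel source side, the stated 50x upscale factor, and 2048-pixel output chunks; all values here are inferred from the report, not measured):

```python
import math

src_side = 2_000          # assumed source side length ("2k^2 pixels")
factor = 50               # upscale factor from the report
chunk = 2048              # x/y chunk size from the report

out_side = src_side * factor               # ~100k pixels per side
chunks_per_side = math.ceil(out_side / chunk)
total_blocks = chunks_per_side ** 2        # one map_blocks task per output chunk

print(out_side, chunks_per_side, total_blocks)  # 100000 49 2401
```

So on the order of a couple of thousand tasks should have reached the scheduler for the map_blocks stage alone.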

The user worked around the issue by configuring his next cluster with 4 workers and scheduler_options={"idle_timeout": "2 hours"}.
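For reference, the workaround would look roughly like the following (a sketch only; the coiled.Cluster call is assumed from the report and requires a Coiled account, so it is shown commented out):

```python
# Raising the scheduler's idle timeout so the cluster is not torn down
# while the (apparently stalled) graph submission is in flight.
scheduler_options = {"idle_timeout": "2 hours"}

# import coiled
# cluster = coiled.Cluster(n_workers=4, scheduler_options=scheduler_options)
```

Note this only masks the symptom (the idle shutdown); it does not explain why no tasks reached the scheduler in the first place.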

Odd behavior that may have led to the bad state of the cluster was observed in the worker and scheduler logs. Full scheduler logs are attached. For cluster 121956, worker coiled-dask-forman-121956-worker-35c5f23472, the logs show:
Mar 09 13:12:09 ip-10-13-8-190 cloud-init[1860]: distributed.batched - ERROR - Error in batched write
Mar 09 13:12:09 ip-10-13-8-190 cloud-init[1860]: Traceback (most recent call last):
Mar 09 13:12:09 ip-10-13-8-190 cloud-init[1860]: BufferError: Existing exports of data: object cannot be re-sized
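That BufferError is the standard CPython error raised when a resizable buffer (e.g. a bytearray) is resized while a memoryview still exports its memory. distributed's batched send path buffers outgoing messages, so the traceback is consistent with a live export of the send buffer at resize time. A minimal standalone reproduction of the error itself (not of distributed's internals):

```python
buf = bytearray(b"payload")
view = memoryview(buf)      # an "export" of the buffer's memory

try:
    buf.extend(b" more")    # resizing while the view is still alive
except BufferError as exc:
    # CPython raises: "Existing exports of data: object cannot be re-sized"
    print(type(exc).__name__, exc)

view.release()              # once released, resizing succeeds again
buf.extend(b" more")
```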

The scheduler for the same cluster shows that it removed this worker and then ran into a stream of "Unexpected worker completed task" errors referencing the same removed worker:

Mar 09 13:12:05 ip-10-13-11-186 cloud-init[1528]: distributed.core - INFO - Event loop was unresponsive in Scheduler for 30.75s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

Mar 09 13:12:10 ip-10-13-11-186 cloud-init[1528]: distributed.scheduler - INFO - Remove worker <WorkerState 'tls://10.13.8.140:44147', name: coiled-dask-forman-121956-worker-35c5f23472, status: running, memory: 952, processing: 7056>

Mar 09 13:12:12 ip-10-13-11-186 cloud-init[1528]: distributed.scheduler - INFO - Unexpected worker completed task. Expected: <WorkerState 'tls://10.13.15.152:42935', name: coiled-dask-forman-121956-worker-1f11f7b175, status: running, memory: 806, processing: 8479>, Got: <WorkerState 'tls://10.13.8.140:44147', name: coiled-dask-forman-121956-worker-35c5f23472, status: running, memory: 0, processing: 0>, Key: ('block-info-_upscale_numpy_array-2ba5457e4d03cf22addd23421859e823', 44, 25)

Possibly related to #5675

Scheduler logs:
forman-scheduler-121956-logs.zip

Task graph of the stuck cluster:
forman-cluster121956- graph

Labels: bug (Something is broken), stability (Issue or feature related to cluster stability, e.g. deadlock)