Reported by a Coiled user, forman, for his recent clusters with IDs 121956 and 121459:
Hello! I'm currently trying to use Coiled for upscaling an ND image (= 2+D dask array) but I constantly fail.
My source is read via xr.open_zarr() from S3 and is around 2k^2 pixels. I upscale it by factor of 50 to around 100k^2 pixels. My x/y chunk size is 2048. I'm using dask.array.map_blocks(f, ...) with f being a custom function that uses numba. The output is written to S3 using xr.Dataset.to_zarr().
When I execute my job using Dask local threading (or even processes) and the local filesystem, it runs without any problems; the result is as expected and it finishes within seconds.
If I run the same on Coiled using S3 (from my local Jupyter notebook), the cluster shuts down after 20 minutes of inactivity - the scheduler doesn't receive a single task. I cannot tell what is happening. My local CPU is at 3%, RAM at 8 GB, and there is practically no network traffic.
The user moved past this issue by configuring his next cluster to use 4 workers and including scheduler_options={"idle_timeout": "2 hours"}.
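Assuming the standard coiled.Cluster API, that workaround would look something like the sketch below (parameter names per Coiled's documented interface; the idle_timeout value is forwarded to the Dask scheduler, which otherwise tears the cluster down after its default idle period):

```python
import coiled
from dask.distributed import Client

# Workaround reported by the user: fewer workers plus a long idle
# timeout, so the cluster is not shut down while the (apparently
# stuck) graph submission is still in flight.
cluster = coiled.Cluster(
    n_workers=4,
    scheduler_options={"idle_timeout": "2 hours"},
)
client = Client(cluster)
```

This only works around the symptom (the 20-minute shutdown); it does not explain why the scheduler never received any tasks.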
Odd behavior was observed in the worker and scheduler logs that may have led to the bad state of the cluster. Full scheduler logs are attached. For cluster 121956, worker coiled-dask-forman-121956-worker-35c5f23472, the logs show:
Mar 09 13:12:09 ip-10-13-8-190 cloud-init[1860]: distributed.batched - ERROR - Error in batched write
Mar 09 13:12:09 ip-10-13-8-190 cloud-init[1860]: Traceback (most recent call last):
Mar 09 13:12:09 ip-10-13-8-190 cloud-init[1860]: BufferError: Existing exports of data: object cannot be re-sized
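The BufferError itself is standard CPython buffer-protocol behavior: a bytearray cannot be resized while a memoryview export of it is still alive, which is presumably what the batched write tripped over on its outgoing buffer. A minimal reproduction of the same error:

```python
# A bytearray cannot be resized while a memoryview export is alive.
buf = bytearray(b"payload")
view = memoryview(buf)  # an "existing export of data"
try:
    buf.extend(b" more")  # resize attempt raises BufferError
except BufferError as err:
    print(err)  # Existing exports of data: object cannot be re-sized
finally:
    view.release()
```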
The scheduler for the same cluster shows that it removed this worker, and then ran into a stream of "Unexpected worker completed task" errors referencing this same removed worker:
Mar 09 13:12:05 ip-10-13-11-186 cloud-init[1528]: distributed.core - INFO - Event loop was unresponsive in Scheduler for 30.75s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Mar 09 13:12:10 ip-10-13-11-186 cloud-init[1528]: distributed.scheduler - INFO - Remove worker <WorkerState 'tls://10.13.8.140:44147', name: coiled-dask-forman-121956-worker-35c5f23472, status: running, memory: 952, processing: 7056>
Mar 09 13:12:12 ip-10-13-11-186 cloud-init[1528]: distributed.scheduler - INFO - Unexpected worker completed task. Expected: <WorkerState 'tls://10.13.15.152:42935', name: coiled-dask-forman-121956-worker-1f11f7b175, status: running, memory: 806, processing: 8479>, Got: <WorkerState 'tls://10.13.8.140:44147', name: coiled-dask-forman-121956-worker-35c5f23472, status: running, memory: 0, processing: 0>, Key: ('block-info-_upscale_numpy_array-2ba5457e4d03cf22addd23421859e823', 44, 25)
Possibly related to #5675
Scheduler logs:
forman-scheduler-121956-logs.zip
Task graph of stuck cluster
