
Scheduler shuts down after 20 minutes of inactivity - tasks not executed #5921

@lojohnson

Description


Reported by Coiled user forman for his recent clusters 121956 and 121459:

Hello! I'm currently trying to use Coiled to upscale an ND image (a 2+D dask array), but I constantly fail.
My source is read via xr.open_zarr() from S3 and is around 2k^2 pixels. I upscale it by a factor of 50 to around 100k^2 pixels. My x/y chunk size is 2048. I'm using dask.array.map_blocks(f, ...) with f being a custom function that uses numba. The output is written to S3 using xr.Dataset.to_zarr().
When I execute my job using local dask threading (or even processes) and the local filesystem, it runs without any problems; the result is as expected and finishes within seconds.
If I run the same on Coiled using S3 (from my local JNB), the cluster shuts down after 20 minutes of inactivity: the scheduler doesn't receive a single task. I cannot tell what is happening. My local CPU is at 3%, RAM at 8 GB, and there is practically no network traffic.
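For scale, the numbers in the report imply a few thousand map_blocks tasks on the output grid, so the scheduler receiving zero tasks is clearly abnormal. A quick back-of-the-envelope check (assuming a 2,000-pixel source side, the stated 50x upscale factor, and 2048-pixel output chunks; all values here are inferred from the report, not measured):

```python
import math

src_side = 2_000          # assumed source side length ("2k^2 pixels")
factor = 50               # upscale factor from the report
chunk = 2048              # x/y chunk size from the report

out_side = src_side * factor               # ~100k pixels per side
chunks_per_side = math.ceil(out_side / chunk)
total_blocks = chunks_per_side ** 2        # one map_blocks task per output chunk

print(out_side, chunks_per_side, total_blocks)  # 100000 49 2401
```

So on the order of a couple of thousand tasks should have reached the scheduler for the map_blocks stage alone.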

The user worked around the issue by configuring his next cluster with 4 workers and scheduler_options={"idle_timeout": "2 hours"}.
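For reference, the workaround would look roughly like the following (a sketch only; the coiled.Cluster call is assumed from the report and requires a Coiled account, so it is shown commented out):

```python
# Raising the scheduler's idle timeout so the cluster is not torn down
# while the (apparently stalled) graph submission is in flight.
scheduler_options = {"idle_timeout": "2 hours"}

# import coiled
# cluster = coiled.Cluster(n_workers=4, scheduler_options=scheduler_options)
```

Note this only masks the symptom (the idle shutdown); it does not explain why no tasks reached the scheduler in the first place.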

Odd behavior that may have led to the bad state of the cluster was observed in the worker and scheduler logs. Full scheduler logs are attached. For cluster 121956, worker coiled-dask-forman-121956-worker-35c5f23472, the logs show:
Mar 09 13:12:09 ip-10-13-8-190 cloud-init[1860]: distributed.batched - ERROR - Error in batched write
Mar 09 13:12:09 ip-10-13-8-190 cloud-init[1860]: Traceback (most recent call last):
Mar 09 13:12:09 ip-10-13-8-190 cloud-init[1860]: BufferError: Existing exports of data: object cannot be re-sized
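That BufferError is the standard CPython error raised when a resizable buffer (e.g. a bytearray) is resized while a memoryview still exports its memory. distributed's batched send path buffers outgoing messages, so the traceback is consistent with a live export of the send buffer at resize time. A minimal standalone reproduction of the error itself (not of distributed's internals):

```python
buf = bytearray(b"payload")
view = memoryview(buf)      # an "export" of the buffer's memory

try:
    buf.extend(b" more")    # resizing while the view is still alive
except BufferError as exc:
    # CPython raises: "Existing exports of data: object cannot be re-sized"
    print(type(exc).__name__, exc)

view.release()              # once released, resizing succeeds again
buf.extend(b" more")
```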

The scheduler for the same cluster shows that it removed this worker and then ran into a stream of "Unexpected worker completed task" errors referencing the same removed worker:

Mar 09 13:12:05 ip-10-13-11-186 cloud-init[1528]: distributed.core - INFO - Event loop was unresponsive in Scheduler for 30.75s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

Mar 09 13:12:10 ip-10-13-11-186 cloud-init[1528]: distributed.scheduler - INFO - Remove worker <WorkerState 'tls://10.13.8.140:44147', name: coiled-dask-forman-121956-worker-35c5f23472, status: running, memory: 952, processing: 7056>

Mar 09 13:12:12 ip-10-13-11-186 cloud-init[1528]: distributed.scheduler - INFO - Unexpected worker completed task. Expected: <WorkerState 'tls://10.13.15.152:42935', name: coiled-dask-forman-121956-worker-1f11f7b175, status: running, memory: 806, processing: 8479>, Got: <WorkerState 'tls://10.13.8.140:44147', name: coiled-dask-forman-121956-worker-35c5f23472, status: running, memory: 0, processing: 0>, Key: ('block-info-_upscale_numpy_array-2ba5457e4d03cf22addd23421859e823', 44, 25)

Possibly related to #5675

Scheduler logs:
forman-scheduler-121956-logs.zip

Task graph of the stuck cluster:
forman-cluster121956- graph

Labels: bug (Something is broken), stability (Issue or feature related to cluster stability, e.g. deadlock)