Losing all workers on cluster #1914

@BrendanMartin

Description

I'm not sure how much detail to put here, or if this is the right place, but I am attempting to run dask_searchcv on a Kubernetes cluster, and whenever I run the grid search all of my workers get killed immediately.
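For context, a minimal sketch of what I'm running is below (the scheduler address, data, and estimator are illustrative stand-ins, not my real code):

```python
from distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from dask_searchcv import GridSearchCV

# Connect to the scheduler service running in the Kubernetes cluster
# ("tcp://dask-scheduler:8786" is a placeholder address).
client = Client("tcp://dask-scheduler:8786")

# Stand-in data; the real dataset is different.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The search fans the fits out to the distributed workers; they die as
# soon as fit() starts.
search = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1.0, 10.0]})
search.fit(X, y)
print(search.best_score_)
```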

From the logs on the cluster I have:

```
distributed.scheduler - INFO - Register tcp://10.16.1.9:41940
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.16.1.9:41940
distributed.scheduler - INFO - Register tcp://10.16.0.9:46410
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.16.0.9:46410
distributed.scheduler - INFO - Register tcp://10.16.2.9:38646
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.16.2.9:38646
distributed.scheduler - INFO - Remove worker tcp://10.16.2.9:38646
distributed.scheduler - INFO - Remove worker tcp://10.16.1.9:41940
distributed.scheduler - INFO - Remove worker tcp://10.16.0.9:46410
distributed.scheduler - INFO - Lost all workers
```

I have a cluster of 3 CPUs and 12 GB of RAM, so 3 workers are spawned by default.

One thing I noticed (not sure if it's related): when I run the grid search and look at the workers, there's always one worker at 90%+ utilization while the others are only around 5%.
