Attempting autoscaling on GKE autoscaling node pool #3942

@chrisroat

Description

I have a graph that contains 300 tasks, each taking ~1 hour. If I start up a cluster with >300 nodes before submitting the graph, the work gets distributed nicely and the tasks run in parallel. If I cut it too close to 300 (especially if the nodes are preemptible), a few nodes end up with multiple tasks. That isn't too big a deal - I give it a few extra nodes, and sometimes it runs for 2 or 3 hours.

On the other hand, if I use a fresh autoscaling Dask cluster (initial size 0) and set it to scale from 0 to 350, many of my long tasks pile up on individual workers. This happens because K8s can take time to add nodes, and the graph starts processing as soon as 1 or 2 workers are up. The scheduler does not yet know about the coming workers, so a few workers get assigned many tasks. The autoscaling in this case stops at around ~30 nodes.

Is my best solution to avoid autoscaling with fresh clusters and graphs like this?

I'm using dask/distributed 2.19, and dask-gateway 0.7.1.
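One possible workaround for the pileup described above is to block until a minimum number of workers has joined before submitting the graph, using `Client.wait_for_workers` (available in distributed 2.19). The sketch below uses a `LocalCluster` so it is self-contained; with dask-gateway you would instead call `cluster.adapt(minimum=..., maximum=...)` or `cluster.scale(...)` on the `GatewayCluster` and then make the same `wait_for_workers` call. The worker counts here are illustrative placeholders, not a recommendation.

```python
from dask.distributed import Client, LocalCluster

# LocalCluster stands in for a GatewayCluster so the example runs anywhere;
# on dask-gateway you'd create the cluster via Gateway().new_cluster() and
# set a floor with cluster.adapt(minimum=50, maximum=350) (counts are
# hypothetical).
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# Block until the minimum worker count is available before submitting work,
# so the scheduler doesn't assign the whole graph to the first 1-2 workers.
client.wait_for_workers(4)

# Submit the graph only once the floor is reached.
futures = client.map(lambda x: x * 2, range(8))
results = client.gather(futures)

client.close()
cluster.close()
```

This does not fix the underlying scheduling behavior, but it avoids the worst case where the entire graph is assigned before autoscaling has caught up.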
