I have a graph containing 300 tasks that each take ~1 hour. If I start up a cluster with >300 nodes before submitting the graph, the work gets distributed nicely and the tasks run in parallel. If I cut it too close to 300 (especially if the nodes are preemptible), a few nodes end up with multiple tasks. This isn't too big a deal - I give it a few extra nodes, and sometimes it runs 2 or 3 hours.
On the other hand, if I use a fresh (initial size 0) autoscaling dask cluster and set it to scale from 0-350, a lot of my long tasks pile up on individual workers. This is because K8s can take time to add nodes, and the graph starts processing as soon as 1 or 2 workers are up. The scheduler doesn't yet know about the workers still coming, so a few workers get assigned many tasks. The autoscaling in this case stops at around 30 nodes.
Is avoiding autoscaling my best option for fresh clusters and graphs like this?
I'm using dask/distributed 2.19, and dask-gateway 0.7.1.
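For reference, the pre-scaling workflow described in the first paragraph looks roughly like the sketch below. It uses the public dask-gateway and distributed APIs (`GatewayCluster.scale`, `Client.wait_for_workers`); the function name and `build_graph` callable are hypothetical placeholders for my actual workload:

```python
def run_with_prescaled_cluster(build_graph, n_workers=300):
    """Scale a dask-gateway cluster up front and block until workers
    have registered before submitting the graph, so the scheduler can
    spread the long tasks evenly from the start.

    Sketch only: assumes a reachable dask-gateway deployment.
    """
    # Imported lazily so this module stays importable without dask-gateway.
    from dask_gateway import Gateway

    gateway = Gateway()
    cluster = gateway.new_cluster()
    cluster.scale(n_workers)  # fixed size instead of adapt(minimum=0, maximum=350)
    client = cluster.get_client()
    # Block until the requested workers have actually joined, so the
    # first few workers don't get assigned many tasks each.
    client.wait_for_workers(n_workers=n_workers)
    return client.compute(build_graph(), sync=True)
```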