Cluster.scale is not robust to multiple calls #2257

guillaumeeb · 2018-09-18T14:42:19Z

As experienced in dask/dask-jobqueue#112 and a related PR dask/dask-jobqueue#97, Cluster.scale behavior is unstable if called multiple times in a row.

I suspect part of this problem is due to how asynchronism is used here:

We retrieve the cluster's number of workers synchronously here https://github.com/dask/distributed/blob/master/distributed/deploy/cluster.py#L100, but we launch scale_up asynchronously, so something (here, another call to scale) could happen between the state retrieval and the actual scale_up.
Similarly, we get the workers to close synchronously, but stop them asynchronously.

If we want scale to run asynchronously, I propose just adding a _scale() method here (a coroutine?) to be called asynchronously from scale(). In _scale(), we would get the state and perform the modifications at the same time:

def _scale(self, n):
    with log_errors():
        if n >= len(self.scheduler.workers):
            self.scale_up(n)
        else:
            to_close = self.scheduler.workers_to_close(
                n=len(self.scheduler.workers) - n)
            logger.debug("Closing workers: %s", to_close)
            self.scheduler.retire_workers(workers=to_close)
            self.scale_down(to_close)
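
To illustrate the race described above, here is a minimal, self-contained sketch. It does not use the distributed API: `ToyCluster`, `scale_racy`, `scale_atomic` and `_start_worker` are hypothetical names made up for this example, and it runs on plain asyncio rather than the Tornado loop. The `asyncio.Lock` is just one possible way to make "read the state, then modify it" a single step; it is an assumption for the sketch, not necessarily how `_scale()` would end up being implemented.

```python
import asyncio


class ToyCluster:
    """Hypothetical stand-in for a cluster; not the distributed API."""

    def __init__(self):
        self.workers = []          # names of running workers
        self._next_id = 0
        self._lock = asyncio.Lock()

    async def _start_worker(self):
        # Yield to the event loop once, as a real worker start-up would.
        await asyncio.sleep(0)
        self.workers.append(f"w{self._next_id}")
        self._next_id += 1

    async def scale_racy(self, n):
        # The worker count is read first and the workers are started later,
        # so two overlapping calls both compute the same "missing" value.
        missing = n - len(self.workers)
        for _ in range(max(missing, 0)):
            await self._start_worker()

    async def scale_atomic(self, n):
        # Read the state and apply the change while holding one lock, so a
        # second call only runs after the first has finished.
        async with self._lock:
            missing = n - len(self.workers)
            for _ in range(max(missing, 0)):
                await self._start_worker()
            for _ in range(max(-missing, 0)):
                self.workers.pop()


async def demo():
    # Call scale(2) twice "in a row" on each variant.
    racy = ToyCluster()
    await asyncio.gather(racy.scale_racy(2), racy.scale_racy(2))

    safe = ToyCluster()
    await asyncio.gather(safe.scale_atomic(2), safe.scale_atomic(2))

    return len(racy.workers), len(safe.workers)


if __name__ == "__main__":
    print(asyncio.run(demo()))  # (4, 2)
```

With the racy variant, two calls asking for 2 workers end up with 4, because both calls read the worker count before either has started anything. With the state read and the modification grouped together, the second call sees the result of the first and does nothing.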

@jhamman @mrocklin any opinion, advice?

mrocklin · 2018-09-18T16:25:43Z

FWIW I suspect that we'll want to just fully rewrite the distributed/deploy/cluster.py system. Some of my thoughts are here: #2235

GenevieveBuckley · 2021-10-18T09:30:54Z

Closing this issue in favour of #2235

guillaumeeb mentioned this issue Sep 18, 2018

more adaptive scaling fixes dask/dask-jobqueue#97

Merged

This was referenced Oct 6, 2018

Implementing an Ersatz of ClusterManager to fix jobqueue issues linked to upstream deploy.Cluster limitations dask/dask-jobqueue#170

Closed

Fix scale edge cases dask/dask-jobqueue#171

Merged

ian-r-rose mentioned this issue Dec 30, 2018

Increase testing dask/dask-labextension#39

Merged

GenevieveBuckley closed this as completed Oct 18, 2021
