
Adaptive: recommend close workers when any are idle #2330

Merged
10 commits merged into dask:master on Apr 24, 2019

Conversation

@delgadom (Contributor) commented Nov 1, 2018

Adding this as a placeholder fix for #2329. No tests or doc changes added yet.

@delgadom delgadom changed the title Recommend close workers adaptive Adaptive: close workers when any workers are idle Nov 1, 2018
@delgadom delgadom changed the title Adaptive: close workers when any workers are idle Adaptive: recommend close workers when any are idle Nov 1, 2018
@delgadom (Contributor, Author) commented Nov 1, 2018

hmm. all these broken tests are a bit daunting for a newcomer. any recommended starting points?

@mrocklin (Member) commented Nov 1, 2018

I've created a trivial PR to see if they're due to something unrelated: #2331

num_workers = len(self.scheduler.workers)

if tasks_processing <= num_workers:
    return False
Review comment (Member):

I think I can convince myself that the tasks_processing < num_workers check is redundant given the all(ws.processing ...) check.

Review comment (Contributor, Author):

It's definitely sufficient to prompt the correct behavior when the cluster has idle workers, but once those have been spun down, the current logic doesn't distinguish between a resource-constrained cluster with waiting tasks and one with ntasks == nworkers. This was confirmed by my experiments, where the cluster successfully scaled down to (ntasks) and then shot back up to adaptive.maximum a few heartbeats later, and continued oscillating between these two until all tasks completed.

I don't see a way to do this with all(ws.processing ...) but if there is then the performance benefit would be significant.
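A toy sketch (not code from the PR, worker names and counts made up) of the distinction being made here: an "every worker is busy" style check cannot tell a backlogged cluster apart from one where tasks and workers are exactly matched, whereas counting the processing tasks can.

# Two cluster states, represented as {worker: number_of_processing_tasks}.
state_backlog = {"w1": 3, "w2": 2, "w3": 4}    # 9 tasks on 3 workers: keep scaling up
state_matched = {"w1": 1, "w2": 1, "w3": 1}    # 3 tasks on 3 workers: do not scale up

def every_worker_busy(state):
    # Analogue of the all(ws.processing ...) check: True for BOTH states above,
    # so on its own it cannot prevent the oscillation described in this thread.
    return all(n > 0 for n in state.values())

def has_backlog(state):
    # Analogue of the tasks_processing > num_workers check: only True when
    # there are more running tasks than workers to run them.
    return sum(state.values()) > len(state)

assert every_worker_busy(state_backlog) and every_worker_busy(state_matched)
assert has_backlog(state_backlog) and not has_backlog(state_matched)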

@delgadom (Contributor, Author) commented Nov 1, 2018

I think the errors are unrelated (I tried pytest and pytest distributed/deploy before making changes and faced a wall of errors). But I'll try digging in at some point - just couldn't figure it out right now

@mrocklin (Member) commented Nov 1, 2018

Sorry for the delay in responding here. I've been a bit saturated. I wonder if @guillaumeeb has any interest in engaging here.

@delgadom (Contributor, Author) commented Nov 1, 2018

no worries! I appreciate all the feedback @mrocklin. hope my pushback is constructive :) Dask is becoming increasingly central to everything I do - it's a really great project.

@guillaumeeb (Member):

Yes, I do have some interest in this. I've been following the discussion, and @delgadom's solution seems good.

I will try to look more closely today, but I fear we are missing some information in the Scheduler to do this more elegantly.

It would be nice to reproduce this in a small test, or in a script that is easier to run.

@guillaumeeb (Member) left a review comment:

Given my still limited knowledge of the Scheduler's internals, and what I know of Adaptive, I think the proposed solution is good for the time being.

It seems to me that scheduler.total_occupancy doesn't contain enough information for estimating the need for CPUs. A more efficient solution could be to improve the Scheduler, but I'm not familiar enough with this class yet. Is there a way to get the list of tasks? Or a counterpart to scheduler.total_occupancy, like scheduler.total_processing_tasks? Would that be feasible? If I understand correctly, @delgadom already tried iterating over all the tasks and this was not performant enough?

We should still try to add a test, even if we don't run it for every PR (I believe we can easily exclude some tests from CI?). It should be doable by slightly modifying the test proposed in the issue, just by being more deterministic about the task durations: launch, say, 7 tasks running for 20 seconds and 3 tasks running for 30 seconds, then wait 21 seconds and check the recommendations and the cluster state.

num_workers = len(self.scheduler.workers)

if tasks_processing <= num_workers:
    return False
Review comment (Member):

I think that all of the tasks_processing counting and checks would be better placed in needs_cpu(). The wrong part of the code seems to be there: it only takes scheduler.total_occupancy into account, which seems to me to be the sum of the durations of the processing tasks. It should also take into account the number of tasks remaining, as pointed out by @delgadom.

Review comment (Contributor, Author):

OK. If we move it into needs_cpu, this resolves the problem. That's how I implemented it as a first cut, and I've been running this on a subclassed version of Adaptive on our cluster and it works beautifully. I spun up 10k 15-minute jobs with an adaptive cluster of between 0 and 1,000 nodes, and the cluster scaled up and then scaled back down as workers completed.

If we go with an implementation where recommendations() suggests scaling down based on workers_to_close's recommendation only, I suppose this could involve a race condition where we scale down idle workers before the scheduler has time to transfer tasks to them. So maybe only placing this logic in needs_cpu is the right way to go?

I'll modify the PR to only include a change in needs_cpu. I'll try to come up with a less interactive example/test soon, but won't get to implementing real tests until next week.
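For reference, a minimal sketch of what such a check inside needs_cpu could look like, built only from the attributes quoted in this diff (self.scheduler.workers and the per-worker .processing mapping); this is an illustration, not necessarily the exact code that ends up in the PR.

import logging

logger = logging.getLogger(__name__)


def needs_cpu(self):
    # Sketch of an Adaptive.needs_cpu variant: in addition to whatever
    # occupancy-based logic already exists, ask whether there are more
    # tasks in flight than workers to run them.
    num_workers = len(self.scheduler.workers)
    tasks_processing = sum(
        len(ws.processing) for ws in self.scheduler.workers.values()
    )

    if tasks_processing > num_workers:
        logger.info(
            "pending tasks exceed number of workers "
            "[%d tasks / %d workers]",
            tasks_processing, num_workers,
        )
        return True

    return False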

Review comment (Member):

My comment was just about moving the specific check of tasks running vs. available cores.

But if this part is needed and sufficient, as it seems to be, then I'm very happy to go with only it!

Thanks @delgadom for working on this.

@delgadom (Contributor, Author) commented Nov 4, 2018

Just made an update that improves performance significantly for large numbers of workers by looping through workers and returning as soon as the cumulative number of tasks exceeds the number of workers. Running a job with 100 workers and 10,000 tasks I got a 10-25% performance hit with the latest update vs a 65-150% performance hit for the previous commit.
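A minimal sketch of that early-exit idea (function and variable names are assumed here, not necessarily those in the commit): the loop stops as soon as the running total exceeds the worker count, so a heavily backlogged cluster only has to inspect a handful of workers.

def tasks_exceed_workers(scheduler):
    # Early-exit variant of the task count: return as soon as the cumulative
    # number of processing tasks exceeds the number of workers.
    num_workers = len(scheduler.workers)
    tasks_processing = 0
    for ws in scheduler.workers.values():
        tasks_processing += len(ws.processing)
        if tasks_processing > num_workers:
            return True   # no need to count the remaining workers' tasks
    return False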

The following tests the current version on master (needs_cpu_0) vs the previous implementation (needs_cpu_1) vs the latest commit (needs_cpu_2):

[profile plot comparing needs_cpu_0, needs_cpu_1, and needs_cpu_2]

I'm swamped through Tuesday, but I should be able to spend some time implementing tests and a good working example later next week. In the meantime, we're using this fix pretty heavily and will continue to tweak and test it, but it works really well so far.

@guillaumeeb (Member) left a review comment:

Some details I did not see before.

else:
    return False

num_workers = len(self.scheduler.workers)
Review comment (Member):

Shouldn't you use total_cores for the number of tasks comparison? Do you have one thread per worker in your setup?

Review comment (Contributor, Author):

hmm - I'm not sure I follow, but I am on a dask_kubernetes setup with 1:1 threads to workers, so maybe my use case isn't universal. I think I see, though... in many cases the number of dask processes/threads which can take on tasks is greater than the number of workers? Is total_cores the number of workers * (threads per worker or procs per worker)?

Review comment (Member):

I am on a dask_kubernetes setup with 1:1 threads to workers, so maybe my use case isn't universal
in many cases the number of dask processes/threads which can take on tasks is greater than the number of workers

Yes!

I think in most cases having several threads per worker is better, except for GIL-bound tasks. Dask-kubernetes proposes 2 threads by default, which is low, but it depends on the VMs you use:

https://github.com/dask/dask-kubernetes/blob/master/dask_kubernetes/kubernetes.yaml#L24-L26

With dask-jobqueue, I often use 4, 8 or even 24 cores per worker.

Is total_cores the number of workers * (threads per worker)?

Yes!

Review comment (Contributor, Author):

Ok thanks! Just so I understand before modifying the code again... should I take this to mean that dask uses the term "core" to refer specifically to the number of tasks that can be handled simultaneously by a worker, regardless of the hardware? Or are you doing something clever with the relationship between physical cores and CPU needs that I'm not following? Thanks again for the helping hand in this :)

Review comment (Member):

should I take this to mean that dask uses the term "core" to refer specifically to the number of tasks that can be handled simultaneously by a worker

Yes, see https://github.com/dask/distributed/blob/master/distributed/cli/dask_worker.py#L62-L63, https://github.com/dask/distributed/blob/master/distributed/cli/dask_worker.py#L217.

I like it better when it is called nthreads, like in dask-worker options.

Review comment (Contributor, Author):

OK, this makes a lot of sense. So I should count the number of available threads or processes, not the number of workers.

Review comment (Member):

A Worker = 1 process = several threads, on the scheduler side. So just using the already computed total_cores should be fine.
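A sketch of that adjustment, assuming the per-worker thread count is exposed as ws.ncores (the attribute name is an assumption here; the CLI option is called nthreads, as noted above): compare the task count against the total thread count rather than the number of worker processes.

def tasks_exceed_cores(scheduler):
    # Each worker process can run one task per thread, so the relevant
    # capacity is the total thread count, not the number of workers.
    total_cores = sum(ws.ncores for ws in scheduler.workers.values())
    tasks_processing = 0
    for ws in scheduler.workers.values():
        tasks_processing += len(ws.processing)
        if tasks_processing > total_cores:
            return True   # early exit, as in the previous sketch
    return False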

if tasks_processing > num_workers:
    logger.info(
        "pending tasks exceed number of workers "
        "[%d tasks / %d workers]",
Review comment (Member):

I think this message, or probably the one above, should be changed.

@mrocklin (Member):

Checking in. What is the status of this?

@guillaumeeb (Member):

The current PR only compares running tasks to the number of worker processes, not the real number of available cores. So we're waiting for an update from @delgadom, who is probably busy elsewhere, like many of us 🙂!

@guillaumeeb (Member):

@delgadom, do you need help going forward here? If you're too busy with other things, maybe I can try to advance this PR?

@delgadom (Contributor, Author) commented Dec 7, 2018

Sorry for the delay, both. Last couple months did indeed get very busy. Will try to spend some time wrapping this up this weekend.

@delgadom (Contributor, Author) commented Dec 7, 2018

@guillaumeeb I just made (I think) the change that you suggested. I'm not totally sure how to test this, though, and I do think it needs a pretty thorough set of tests. Do you have a test that you think would be a good model to start with? I'll probably need help getting this across the finish line on the testing front.

@delgadom (Contributor, Author) commented Dec 8, 2018

Hmmm not sure what happened in this test

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
c = <Client: scheduler='tcp://127.0.0.1:35398' processes=2 cores=3>
s = <Scheduler: "tcp://127.0.0.1:35398" processes: 2 cores: 3>
a = <Worker: tcp://127.0.0.1:40977, running, stored: 10, running: 0/1, ready: 0, comm: 0, waiting: 0>
b = <Worker: tcp://127.0.0.1:45859, running, stored: 0, running: 0/2, ready: 0, comm: 0, waiting: 0>
    @pytest.mark.skipif(sys.version_info[0] < 3, reason="intermittent failure")
    @gen_cluster(client=True)
    def test_dont_hold_on_to_large_messages(c, s, a, b):
        np = pytest.importorskip('numpy')
        da = pytest.importorskip('dask.array')
        x = np.random.random(1000000)
        xr = weakref.ref(x)
    
        d = da.from_array(x, chunks=(100000,))
        d = d.persist()
        del x
    
        start = time()
        while xr() is not None:
            if time() > start + 5:
                # Help diagnosing
                from types import FrameType
                x = xr()
                if x is not None:
                    del x
                    rc = sys.getrefcount(xr())
                    refs = gc.get_referrers(xr())
                    print("refs to x:", rc, refs, gc.isenabled())
                    frames = [r for r in refs if isinstance(r, FrameType)]
                    for i, f in enumerate(frames):
                        print("frames #%d:" % i,
                              f.f_code.co_name, f.f_code.co_filename, sorted(f.f_locals))
>               pytest.fail("array should have been destroyed")
E               Failed: array should have been destroyed
distributed/tests/test_batched.py:270: Failed

@guillaumeeb (Member):

For the test, it looks like there's already something just waiting for you!

@pytest.mark.xfail(reason="we currently only judge occupancy, not ntasks")
@gen_test(timeout=30)
def test_no_more_workers_than_tasks():
    loop = IOLoop.current()
    cluster = yield LocalCluster(0, scheduler_port=0, silence_logs=False,
                                 processes=False, diagnostics_port=None,
                                 loop=loop, asynchronous=True)
    yield cluster._start()
    try:
        adapt = Adaptive(cluster.scheduler, cluster, minimum=0, maximum=4,
                         interval='10 ms')
        client = yield Client(cluster, asynchronous=True, loop=loop)
        cluster.scheduler.task_duration['slowinc'] = 1000
        yield client.submit(slowinc, 1, delay=0.100)
        assert len(cluster.scheduler.workers) <= 1
    finally:
        yield client._close()
        yield cluster._close()

Not sure it will work out of the box, but it is a good start. It points to everything you need: slowinc, how to specify a default function duration directly in the scheduler, how to launch a LocalCluster...

You can also add a more complex one, something like the following (a rough sketch is given after the list):

  1. Start a Cluster.
  2. Make it adaptive, with a low interval, a low starting cost, and 10 processes max.
  3. Make slowinc default to something relatively long, maybe 0.5 seconds, in the scheduler.
  4. Submit 5 slowinc tasks of 0.5 seconds and 5 slowinc tasks lasting 1 second.
  5. Wait for 0.1 seconds, and check that 10 processes are started in the LocalCluster.
  6. Wait for 0.5 seconds, and check that there are only 5 processes left.
  7. Loop until all tasks end, waiting 0.1 seconds each time and always checking that there are only 5 processes running in the LocalCluster.
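A rough sketch of such a test, modeled on the existing test quoted above. The test name, import paths, timings, adaptive keyword arguments, and how deterministically the cluster scales down are all assumptions that would need tuning against the real implementation.

from tornado import gen
from tornado.ioloop import IOLoop

from distributed import Client, wait
from distributed.deploy import Adaptive, LocalCluster
from distributed.utils_test import gen_test, slowinc


@gen_test(timeout=60)
def test_scale_down_to_number_of_tasks():
    loop = IOLoop.current()
    cluster = yield LocalCluster(0, scheduler_port=0, silence_logs=False,
                                 processes=False, diagnostics_port=None,
                                 loop=loop, asynchronous=True)
    yield cluster._start()
    try:
        adapt = Adaptive(cluster.scheduler, cluster, minimum=0, maximum=10,
                         interval='10 ms')
        client = yield Client(cluster, asynchronous=True, loop=loop)

        # Give the scheduler a duration estimate for slowinc so that
        # adaptive scaling has something to work with.
        cluster.scheduler.task_duration['slowinc'] = 0.5

        short = client.map(slowinc, range(5), delay=0.5)
        longer = client.map(slowinc, range(5, 10), delay=1.0)

        yield gen.sleep(0.1)
        assert len(cluster.scheduler.workers) <= 10

        # Once the short tasks finish, only five tasks remain, so the
        # cluster should not hold on to more than five workers.
        yield wait(short)
        yield gen.sleep(0.5)
        assert len(cluster.scheduler.workers) <= 5

        yield wait(longer)
    finally:
        yield client._close()
        yield cluster._close()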

@guillaumeeb (Member):

Hmmm not sure what happened in this test

No trace of Adaptive here, so probably not linked to your PR.

@mrocklin (Member) commented Dec 9, 2018

That failure is indeed unrelated. Maybe (?) fixed here: #2404

@guillaumeeb (Member):

@delgadom still up for this one?

@SimonBoothroyd:

The changes proposed in this PR seem to fix an issue I was seeing when using the dask-jobqueue library, whereby the queue clusters in adaptive mode were continuously requesting new workers even if they did not have any tasks for them to perform.

It would be great to see this fix merged in! Are there any remaining issues holding this back?

@martindurant (Member):

Looks like this PR will pass after linting - will recommend merging. @guillaumeeb , were you thinking of contributing a test along the lines you described?

@mrocklin (Member):

@guillaumeeb , were you thinking of contributing a test along the lines you described?

The test that @guillaumeeb is pointing to already exists in the code. You just need to remove the xfail.

@guillaumeeb (Member):

Thanks @martindurant, sorry, did not find the time to take over this one...

The simple test already there is probably enough indeed.

@martindurant (Member):

OK, good to go - thank you.

@martindurant (Member):

Will merge end of day, if no further comments.

@martindurant martindurant merged commit 0c8918b into dask:master Apr 24, 2019
@delgadom (Contributor, Author) commented Apr 25, 2019

Just saw this last flurry of activity. Thank you so much for pushing this across the finish line @martindurant and @guillaumeeb

muammar added a commit to muammar/distributed that referenced this pull request May 8, 2019
* upstream/master:
  Fix deserialization of bytes chunks larger than 64MB (dask#2637)
  bump version to 1.27.1
  Updated logging module doc links from docs.python.org/2 to docs.python.org/3. (dask#2635)
  Adaptive: recommend close workers when any are idle (dask#2330)