# Dask Schedulers

## Notebook Objectives
* **Performance comparison** of different Dask schedulers.
* **References** for further reading.

## Performance comparison of different Dask schedulers

To compare the different schedulers, let's go back to the DataFrame example where we read the NYC Taxi Trips dataset, compute the mean tip amount for each passenger count, and take the maximum of those means.

In [5]:
from dask.distributed import Client

client = Client(n_workers=4)
client

Connection method: Cluster object, Cluster type: LocalCluster
Dashboard: http://127.0.0.1:8787/status

Status: running, Using processes: True
Workers: 4, Total threads: 12, Total memory: 16.00 GiB

Comm: tcp://127.0.0.1:49654, Started: Just now

Comm: tcp://127.0.0.1:49666, Total threads: 3, Memory: 4.00 GiB
Dashboard: http://127.0.0.1:49668/status, Nanny: tcp://127.0.0.1:49658
Local directory: /Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/dask-worker-space/worker-r84rhk5j

Comm: tcp://127.0.0.1:49667, Total threads: 3, Memory: 4.00 GiB
Dashboard: http://127.0.0.1:49670/status, Nanny: tcp://127.0.0.1:49659
Local directory: /Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/dask-worker-space/worker-c0savz3z

Comm: tcp://127.0.0.1:49662, Total threads: 3, Memory: 4.00 GiB
Dashboard: http://127.0.0.1:49664/status, Nanny: tcp://127.0.0.1:49657
Local directory: /Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/dask-worker-space/worker-zhy6e5hn

Comm: tcp://127.0.0.1:49660, Total threads: 3, Memory: 4.00 GiB
Dashboard: http://127.0.0.1:49661/status, Nanny: tcp://127.0.0.1:49656
Local directory: /Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/dask-worker-space/worker-kl2ig26m


In [6]:
import dask.dataframe as dd

df = dd.read_csv("data/yellow_tripdata_2019-*.csv",
                 dtype={'RatecodeID': 'float64',
                        'VendorID': 'float64',
                        'passenger_count': 'float64',
                        'payment_type': 'float64'
                       })

max_tip_amount = df.groupby("passenger_count").tip_amount.mean().max()
max_tip_amount

dd.Scalar<series-..., dtype=float64>

In [7]:
%%time

max_tip_amount.compute()

CPU times: user 1min 14s, sys: 4.03 s, total: 1min 18s
Wall time: 2min 12s


7.377822222222222

Let's try this computation using different schedulers and look at the results. We are selecting the scheduler _inline_ while calling `compute`.

In [4]:
import time

for sch in ['threading', 'processes', 'sync']:
    t0 = time.time()
    amount = max_tip_amount.compute(scheduler=sch)
    t1 = time.time()
    print("Scheduler:", sch, ", Compute time:", t1 - t0, ", Result:", amount)

Scheduler: threading , Compute time: 105.8745608329773 , Result: 7.377822222222222
Scheduler: processes , Compute time: 384.7675268650055 , Result: 7.377822222222222
Scheduler: sync , Compute time: 176.0339961051941 , Result: 7.377822222222222


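We can see that the results are the same, but the compute time varies: in this run the threaded scheduler took about 106 s, the single-threaded (`sync`) scheduler about 176 s, and the multiprocessing scheduler about 385 s. This is because each scheduler works differently and is best suited to specific situations. The multiprocessing scheduler, for example, has to move data between processes, which is a likely cause of its slower time here.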
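The same kind of comparison can be tried without the taxi dataset. The following is a minimal sketch, not the notebook's benchmark: `slow_square` and its 0.05-second sleep are hypothetical stand-ins for real work. Because `time.sleep` releases the GIL, the threaded scheduler should usually finish sooner than the single-threaded one, although exact timings depend on the machine.

```python
import time

import dask


@dask.delayed
def slow_square(x):
    # Hypothetical stand-in for real work; sleeping releases the GIL,
    # so threads can overlap these calls.
    time.sleep(0.05)
    return x * x


# Build a small task graph: square eight numbers, then sum the squares.
total = dask.delayed(sum)([slow_square(i) for i in range(8)])

results = {}
timings = {}
for sch in ["threads", "synchronous"]:
    t0 = time.perf_counter()
    results[sch] = total.compute(scheduler=sch)
    timings[sch] = time.perf_counter() - t0
    print(f"Scheduler: {sch}, compute time: {timings[sch]:.2f} s, result: {results[sch]}")
```

The `"processes"` scheduler is left out of this sketch because, when run as a plain script rather than in a notebook, it may need the code wrapped in an `if __name__ == "__main__":` guard.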

In most cases, we recommend using the distributed scheduler:

```
from dask.distributed import Client
client = Client()
```

Note that only the distributed scheduler supports all the dashboards, modern scheduling improvements, and other features.

The distributed scheduler:

  * works well on a single machine for numeric workloads that release the GIL (for example, NumPy and pandas code), which the threaded scheduler also handles well
  * is recommended for workloads that hold the GIL (such as `dask.bag` and custom code wrapped with `dask.delayed`), even on a single machine
  * is more intelligent and provides better diagnostics than the processes scheduler
  * is required for scaling out work across a cluster
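When a single-machine scheduler is still preferred for a particular block of code, passing `scheduler=` to every `compute` call is not the only option. As a small sketch, `dask.config.set` can be used as a context manager to change the default scheduler temporarily. The `add` function and the tiny task graph below are hypothetical and only serve to make the example runnable.

```python
import dask


@dask.delayed
def add(a, b):
    # Hypothetical toy task used only for illustration.
    return a + b


total = add(add(1, 2), add(3, 4))

# Inside this block, compute() uses the single-threaded scheduler by default.
with dask.config.set(scheduler="synchronous"):
    active = dask.config.get("scheduler")
    result = total.compute()

print(active, result)
```

The single-threaded scheduler is a convenient choice here because it runs every task in the calling thread, which makes debugging with tools such as `pdb` simpler.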
 

Finally, let's close the cluster!

In [8]:
client.close()
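Outside a notebook, one way to avoid forgetting the cleanup is to use the client as a context manager, which closes it when the block exits. This is a sketch under assumptions: `square` is a hypothetical task, and `processes=False`, `n_workers=1`, `threads_per_worker=2`, and `dashboard_address=None` are choices made here to keep the example lightweight, not the settings used above.

```python
from dask.distributed import Client


def square(x):
    # Hypothetical toy task used only for illustration.
    return x * x


# processes=False keeps the workers in threads of this process, so the
# sketch also runs as a plain script; dashboard_address=None disables
# the dashboard for this short-lived cluster.
with Client(processes=False, n_workers=1, threads_per_worker=2,
            dashboard_address=None) as client:
    futures = client.map(square, range(5))
    squares = client.gather(futures)

print(squares)
print(client.status)
```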

## References

* [Scheduling Documentation](https://docs.dask.org/en/latest/scheduling.html)
* [Dask Tutorial - Distributed](https://tutorial.dask.org/05_distributed.html)