<img src="../images/dask_horizontal.svg" align="right" width="30%">

# Distributed

## Learning Objectives 

- Use single-machine Dask schedulers
- Deploy a local Dask Distributed Cluster and access the diagnostics dashboard


## Prerequisites


| Concepts | Importance | Notes |
| --- | --- | --- |
| Familiarity with Python | Necessary | |
| Familiarity with Dask Fundamentals | Necessary | |


- **Time to learn**: *25-35 minutes*


## Dask Schedulers

As we have seen so far, Dask lets you construct graphs of tasks with dependencies, and it can also create graphs automatically from functional, NumPy, or Xarray syntax on data collections. None of this would be very useful without a way to execute these graphs in a parallel and memory-aware way. So far we have been calling `thing.compute()` or `dask.compute(thing)` without worrying about what this entails. Now we will discuss the options available for that execution, in particular the distributed scheduler, which comes with additional functionality.

Dask comes with four available schedulers:

- "threads" (aka "threading"): the threaded scheduler, backed by a thread pool
- "processes": a scheduler backed by a process pool
- "single-threaded" (aka "sync"): a synchronous scheduler, good for debugging
- distributed: a distributed scheduler for executing graphs on multiple machines, see below.

To select one of these for a computation, specify it when you ask for a result, e.g.,
```python
myvalue.compute(scheduler="single-threaded")  # for debugging
```

You can also set a default scheduler either temporarily
```python
with dask.config.set(scheduler='processes'):
    # set temporarily for this block only
    # all compute calls within this block will use the specified scheduler
    myvalue.compute()
    anothervalue.compute()
```

Or globally
```python
# set until further notice
dask.config.set(scheduler='processes')
```
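
The three ways of choosing a scheduler shown above can be compared in a small, self-contained sketch. The `inc` and `add` functions and the tiny task graph are illustrative assumptions, not part of the Sea Surface Temperature workflow. The last step restores the previous setting so the rest of the notebook is unaffected.

```python
import dask


@dask.delayed
def inc(x):
    # Hypothetical toy task: add one
    return x + 1


@dask.delayed
def add(x, y):
    # Hypothetical toy task: add two numbers
    return x + y


# Build a small lazy task graph; nothing runs yet
total = add(inc(1), inc(2))

# 1. Choose the scheduler for a single call
per_call = total.compute(scheduler="synchronous")

# 2. Choose the scheduler for every compute inside the block
with dask.config.set(scheduler="threads"):
    in_block = total.compute()

# 3. Change the default, then restore the previous setting
previous = dask.config.get("scheduler", None)
dask.config.set(scheduler="synchronous")
as_default = total.compute()
dask.config.set(scheduler=previous)

print(per_call, in_block, as_default)
```

All three calls execute the same graph, so they should return the same value; only where and how the tasks run differs.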

Let's try out a few schedulers on the Sea Surface Temperature data.

In [None]:
import pathlib

import dask
import xarray as xr

In [None]:
data_dir = pathlib.Path("data/")
files = sorted(data_dir.glob("tos_Omon_CESM2*"))
files

In [None]:
dset = xr.open_mfdataset(
    sorted(files),
    concat_dim='ensemble_member',
    combine="nested",
    parallel=True,
    data_vars=['tos'],
    engine="netcdf4",
    chunks={'time': 90},
)
# Add coordinate labels for the newly created `ensemble_member` dimension
dset["ensemble_member"] = ['r11i1p1f1', 'r7i1p1f1', 'r8i1p1f1', 'r9i1p1f1']
dset

In [None]:
# Compute anomaly
gb = dset.tos.groupby('time.month')
tos_anom = gb - gb.mean(dim='time')
tos_anom

In [None]:
# each of the following gives the same results (you can check!)
# any surprises?
import time

for sch in ['threading', 'processes', 'sync']:
    t0 = time.time()
    r = tos_anom.compute(scheduler=sch)
    t1 = time.time()
    print(f"{sch:>10}, {t1 - t0:0.4f} s; {r.min().data, r.max().data, r.mean().data}")

In [None]:
dask.visualize(tos_anom)

### Some Questions to Consider:

- How much speedup is possible for this task? (Hint: look at the graph.)
- Given how many cores are on this machine, how much faster could the parallel schedulers be than the single-threaded scheduler?
- How much faster was using threads over a single thread? Why does this differ from the optimal speedup?
- Why is the multiprocessing scheduler so much slower here?
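
One way to reason about the first two questions is Amdahl's law, which bounds the speedup from `n` workers when only a fraction `p` of the work can run in parallel: speedup = 1 / ((1 - p) + p / n). The sketch below is a back-of-the-envelope aid only. The parallel fraction of 0.9 is a made-up value, not a measurement of this computation, and real speedups are lower still because of scheduling overhead and data movement.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Upper bound on speedup when only part of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Assumed value for illustration only: 90% of the runtime parallelizes
p = 0.9
for n in (1, 2, 4, 8):
    print(f"{n} workers: at most {amdahl_speedup(p, n):.2f}x faster")
```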

The `threaded` scheduler is a fine choice for working with large datasets out-of-core on a single machine, as long as the functions being used release the [Python Global Interpreter Lock (GIL)](https://wiki.python.org/moin/GlobalInterpreterLock) most of the time. NumPy and pandas release the GIL in most places, so the `threaded` scheduler is the default for `dask.array` and `dask.dataframe`. The distributed scheduler, perhaps with `processes=False`, will also work well for these workloads on a single machine.

For workloads that do hold the GIL, as is common with `dask.bag` and custom code wrapped with `dask.delayed`, we recommend using the distributed scheduler, even on a single machine. Generally speaking, it's more intelligent and provides better diagnostics than the `processes` scheduler.

<div class="admonition alert alert-info">
    <p class="admonition-title" style="font-weight:bold">What Is the Python Global Interpreter Lock (GIL)?</p>
    <q>The Python Global Interpreter Lock or GIL, in simple words, is a mutex (or a lock) that allows only one thread to hold the control of the Python interpreter.</q>
    <br>
    See <a href="https://realpython.com/python-gil/">this blog post</a> for more details on Python GIL.
</div>
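
A rough way to see the effect of the GIL is to time a pure-Python function, which holds the GIL, under the threaded and single-threaded schedulers. In this sketch `count_up` is a hypothetical stand-in for GIL-holding custom code wrapped with `dask.delayed`. Timings vary by machine and Python build, so none are quoted here.

```python
import time

import dask


def count_up(n):
    # Pure-Python loop: holds the GIL while it runs
    total = 0
    for i in range(n):
        total += i
    return total


# Four independent tasks that could, in principle, run in parallel
tasks = [dask.delayed(count_up)(200_000) for _ in range(4)]

timings = {}
results = {}
for sch in ["threads", "synchronous"]:
    t0 = time.perf_counter()
    results[sch] = dask.compute(*tasks, scheduler=sch)
    timings[sch] = time.perf_counter() - t0
    print(f"{sch:>12}: {timings[sch]:0.4f} s")
```

If the two timings come out similar, the threads spent most of their time waiting on each other for the GIL rather than running in parallel.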



See the [Dask scheduling documentation](https://docs.dask.org/en/latest/scheduling.html) for additional details on choosing a scheduler.

For scaling out work across a cluster, the distributed scheduler is required.

## Making a cluster

### Simple method

The `dask.distributed` system is composed of a single centralized scheduler and one or more worker processes. [Deploying](https://docs.dask.org/en/latest/setup.html) a remote Dask cluster involves some additional effort, but working locally just involves creating a `LocalCluster` object and connecting it to a `Client` object, which lets you interact with the "cluster" (local threads or processes on your machine). For more information see [here](https://docs.dask.org/en/latest/setup/single-distributed.html).

<img src="../images/Distributed Overview (Light).png">

Note that `LocalCluster()` takes many optional [arguments](https://distributed.dask.org/en/latest/local-cluster.html#api) to configure the number of processes and threads, memory limits, and other options.

In [None]:
from dask.distributed import Client, LocalCluster

# Set up a local cluster.
# By default, the number of workers and threads per worker
# is chosen based on the number of CPU cores available

cluster = LocalCluster()
client = Client(cluster)
client

**Note:**

This code

```python
cluster = LocalCluster()
client = Client(cluster)
```

is equivalent to 

```python
client = Client()
```

If you aren't using JupyterLab with the `dask-labextension`, be sure to click the `Dashboard` link above to open the diagnostics dashboard.
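
As a sketch of the optional `LocalCluster` arguments mentioned above, the following starts a small cluster with explicit settings and reads back the dashboard address. The specific values (two workers, one thread each, a 1 GiB memory limit per worker) are arbitrary choices for illustration. `processes=False` keeps the workers as threads in the current process, and `dashboard_address=":0"` asks for any free port. The names `demo_cluster` and `demo_client` are chosen so they do not clash with the `cluster` and `client` created earlier.

```python
from dask.distributed import Client, LocalCluster

# Illustrative settings only; tune these for your own machine
with LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    memory_limit="1GiB",
    processes=False,
    dashboard_address=":0",
) as demo_cluster, Client(demo_cluster) as demo_client:
    # Address of the diagnostics dashboard for this cluster
    dashboard_link = demo_client.dashboard_link
    # Number of workers the cluster object has started
    n_workers_seen = len(demo_cluster.workers)

print(dashboard_link, n_workers_seen)
```

Using both objects as context managers closes the client and the cluster automatically when the block ends.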



## Distributed Dask clusters for HPC and Cloud environments

Dask can be deployed on distributed infrastructure, such as an HPC system or a cloud computing system. There is a growing ecosystem of Dask deployment projects that facilitate easy deployment and scaling of Dask clusters on a wide variety of computing systems.

### HPC

#### Dask Jobqueue (https://jobqueue.dask.org/)

- `dask_jobqueue.PBSCluster`
- `dask_jobqueue.SLURMCluster`
- `dask_jobqueue.LSFCluster`
- etc.

#### Dask MPI (https://mpi.dask.org/)

- `dask_mpi.initialize`

### Cloud

#### Dask Kubernetes (https://kubernetes.dask.org/)

- `dask_kubernetes.KubeCluster`

#### Dask Cloud Provider (https://cloudprovider.dask.org)

- `dask_cloudprovider.FargateCluster`
- `dask_cloudprovider.ECSCluster`

#### Dask Gateway (https://gateway.dask.org/)

- `dask_gateway.GatewayCluster`


## Executing with the distributed client

Consider a calculation such as the one we used before, where we computed the anomaly for each ensemble member:

In [None]:
tos_anom

By default, creating a `Client` makes it the default scheduler. Any calls to `.compute` will use the cluster your `client` is attached to, unless you specify otherwise, as above.
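
A minimal sketch of this behaviour, using a small in-process cluster and a toy `dask.delayed` task rather than `tos_anom`. The `square` function, the cluster settings, and the `demo_` names are assumptions for illustration, chosen so they do not clash with the `cluster` and `client` created earlier.

```python
import dask
from dask.distributed import Client, LocalCluster


@dask.delayed
def square(x):
    # Hypothetical toy task
    return x * x


task = square(7)

with LocalCluster(
    n_workers=1, threads_per_worker=2, processes=False, dashboard_address=":0"
) as demo_cluster, Client(demo_cluster) as demo_client:
    # No scheduler given: the active client's cluster runs the task
    on_cluster = task.compute()
    # An explicit scheduler still overrides the client for this call
    overridden = task.compute(scheduler="synchronous")

print(on_cluster, overridden)
```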


When you run a computation, its tasks appear in the dashboard as the cluster processes them and, eventually, the result is printed as the output of the cell. Note that the kernel is blocked while waiting for the result.

You can also see a simplified version of the graph being executed on the Graph pane of the dashboard, as long as the calculation is in flight.


Let's return to the anomaly computation from before, and see what happens on the dashboard (you may wish to have both the notebook and dashboard side-by-side). How does this perform compared to before?

In [None]:
%time tos_anom.compute()

In this particular case, this should be as fast as or faster than the best case above, threading. Why do you suppose this is? Start your reading [here](https://distributed.dask.org/en/latest/index.html#architecture), and note in particular that the distributed scheduler was a complete rewrite with more intelligence around sharing intermediate results and deciding which tasks run on which worker. This results in better performance in *some* cases, but it still has larger latency and overhead than the threaded scheduler, so there will be rare cases where it performs worse. Fortunately, the dashboard now gives us a lot more [diagnostic information](https://distributed.dask.org/en/latest/diagnosing-performance.html). Look at the Profile page of the dashboard: what takes the biggest fraction of CPU time for the computation we just performed?

In [None]:
client.close()
cluster.close()

In [None]:
%load_ext watermark
%watermark --time --python --updated --iversion

---

## Learn More

If all you want to do is execute computations created using delayed, or run calculations based on the higher-level data collections, then that is about all you need to know to scale your work up to cluster scale. However, there is more detail to know about the distributed scheduler that will help with efficient usage. See the [Dask tutorial on advanced features of Distributed](https://tutorial.dask.org/06_distributed_advanced.html).

## Resources and references

* Reference
    *  [Dask Docs](https://dask.org/)
    *  [Dask Blog](https://blog.dask.org/)
    *  [Xarray Docs](https://xarray.pydata.org/)
  
*  Ask for help
    *   [`dask`](http://stackoverflow.com/questions/tagged/dask) tag on Stack Overflow, for usage questions
    *   [GitHub discussions (dask)](https://github.com/dask/dask/discussions): for general, non-bug discussion and usage questions
    *   [GitHub issues (dask)](https://github.com/dask/dask/issues/new): for bug reports and feature requests
    *   [GitHub discussions (xarray)](https://github.com/pydata/xarray/discussions): for general, non-bug discussion and usage questions
    *   [GitHub issues (xarray)](https://github.com/pydata/xarray/issues/new): for bug reports and feature requests
    
* Pieces of this notebook are adapted from the following sources
  * https://github.com/dask/dask-tutorial/blob/main/05_distributed.ipynb
  * https://github.com/xarray-contrib/xarray-tutorial/blob/master/scipy-tutorial/05_intro_to_dask.ipynb
  
  
  
 <div class="admonition alert alert-success">
     <p class="title" style="font-weight:bold">Previous: <a href="./10-dask-and-xarray.ipynb">Dask and Xarray</a></p>
    
</div>