Dask
====
Parallel Python  
Fast and Easy
-----------------

`dask` is a library that spreads computing tasks and data across multiple workers.  

It supports both local and distributed parallelism.

Tasks can be split using:
- Futures
- **Delayed**

And data using:
- **Array**
- DataFrame
- Bag
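
As a minimal sketch of the **Delayed** interface listed above: the helper functions `inc` and `add` are hypothetical and only serve as an illustration. Calling a delayed function records a task instead of running it, and `compute` executes the resulting graph.

```python
import dask


@dask.delayed
def inc(x):
    # Hypothetical helper: the body only runs once compute() is called.
    return x + 1


@dask.delayed
def add(a, b):
    # Hypothetical helper combining two earlier results.
    return a + b


# These calls are lazy: they only build a task graph.
a = inc(1)
b = inc(2)
total = add(a, b)

# Execute the graph. inc(1) and inc(2) do not depend on each other,
# so the scheduler is free to run them in parallel.
result = total.compute()
print(result)
```

Without a `Client`, `compute` falls back to a local scheduler, so this sketch runs on a plain laptop.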

Why dask?
---------

1- **Familiarity**:

    Dask aims to make scaling computation as easy as possible, so it mostly follows the APIs of other popular projects:
    
    - dask.array ~= numpy, Xarray  
    - dask.dataframe ~= pandas  
    - dask futures ~= concurrent.futures, multiprocessing  

2- **Scalability**:

    Dask can be used on a local machine, on a cluster and in the cloud.  
    It coordinates work across as many resources as you need.
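
To illustrate the familiarity point, here is a small sketch comparing the same reduction in numpy and in `dask.array`. The array contents and the chunk shape are arbitrary choices made for the example.

```python
import numpy as np
import dask.array as da

# NumPy: eager, the whole array lives in memory.
x_np = np.arange(12).reshape(3, 4)
mean_np = x_np.mean(axis=0)

# Dask: the same method names, but the array is split into chunks
# and nothing is evaluated until compute() is called.
x_da = da.from_array(x_np, chunks=(3, 2))
mean_da = x_da.mean(axis=0)          # still lazy
mean_da_result = mean_da.compute()   # now a regular NumPy array

print(mean_np)
print(mean_da_result)
```

Both versions should print the same column means; the only extra step on the dask side is the final `compute`.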

When to use dask?
-----------------

- When datasets get too large to fit in memory on a single machine.  
Use dask only when needed. On a single machine, numpy, numba, cupy, jax, etc. can usually do a better job!

But when the data no longer fits in local memory, dask is worth a look.
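
As a rough sketch of what "does not fit in memory" looks like in practice, the array below would need about 80 GB if it were materialized with numpy. The shape and chunk size are assumptions chosen for illustration. Dask only describes the array, then creates the few chunks that the requested result actually needs.

```python
import dask.array as da

# 100_000 x 100_000 float64 values: about 80 GB if fully materialized.
x = da.ones((100_000, 100_000), chunks=(1_000, 1_000))
print(x.nbytes / 1e9)   # size of the full array, in GB
print(x.numblocks)      # number of chunks along each axis

# Only the four 8 MB chunks covering this corner are ever created.
corner_sum = x[:2_000, :2_000].sum().compute()
print(corner_sum)
```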

Useful links
------------
- [Documentation](https://docs.dask.org/en/stable/)
- [Dask-Cookbook](https://projectpythia.org/dask-cookbook/notebooks/00-dask-overview.html)
- [GitHub](https://github.com/dask/dask/)


First let's install dask and related modules.

In [None]:
pip install dask[array,distributed,diagnostics] graphviz sparse scipy matplotlib

When using dask, you submit your workload from the client. The scheduler then splits that workload among the workers:


![Dask client](https://tutorial.dask.org/_images/distributed-overview.png)
[_Image from the dask-cookbook of Project Pythia_](https://projectpythia.org/dask-cookbook/notebooks/00-dask-overview.html)


Using dask starts with creating a `Client`.

Once created, the client is used automatically by all subsequent dask computations.  
We can also dispatch work to the client directly, much like with a multiprocessing pool.

In [None]:
from dask.distributed import Client

client = Client()
client

Now let's use that client.  
We can ask it to send a job to workers.

In [None]:
def mul(a, b):
    return a * b

future1 = client.submit(mul, 11, 10)
future2 = client.submit(mul, 4, 4)
future1

The task has been submitted and is waiting for a worker to execute it. The future's status changes from `pending` to `finished` once it is done.

In [None]:
future1

Once the task has finished, call the `result` method to get the output. If the task is still running, `result` blocks until it is done.

In [None]:
future1.result(), future2.result()
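
Futures can also be passed as arguments to later tasks, so intermediate results stay on the workers. The sketch below assumes an in-process client (`processes=False`, with the dashboard disabled) to keep it lightweight; `add` is a hypothetical helper introduced for the example.

```python
from dask.distributed import Client


def mul(a, b):
    return a * b


def add(a, b):
    # Hypothetical helper used to combine two earlier results.
    return a + b


# Assumption: an in-process client with no dashboard, to keep the sketch small.
with Client(processes=False, dashboard_address=None) as client:
    f1 = client.submit(mul, 11, 10)
    f2 = client.submit(mul, 4, 4)
    # f3 depends on f1 and f2; the scheduler waits for both before running it.
    f3 = client.submit(add, f1, f2)
    total = f3.result()

print(total)
```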

To run multiple tasks at once, we can submit them all in a loop or use `map`:

In [None]:
def double(x):
    return x * 2

N = 10
futures = client.map(double, range(N))
futures

In [None]:
[future.result() for future in futures]
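
As an alternative to calling `result` on each future, `client.gather` collects a list of futures in one call, and `as_completed` yields futures as they finish. This is a sketch assuming an in-process client (`processes=False`, with the dashboard disabled).

```python
from dask.distributed import Client, as_completed


def double(x):
    return x * 2


# Assumption: an in-process client with no dashboard, to keep the sketch small.
with Client(processes=False, dashboard_address=None) as client:
    futures = client.map(double, range(10))

    # gather returns the results in the same order as the futures.
    results = client.gather(futures)

    # as_completed yields futures in the order they finish,
    # which is not necessarily the submission order.
    finished = sorted(f.result() for f in as_completed(futures))

print(results)
```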

## Using a cluster

Dask supports multiple cluster types:
- Local
  - threads
  - processes
- Distributed
  - HPC (dask_jobqueue)
    - PBS
    - Slurm
    - ...
  - Cloud (dask_cloudprovider)
    - Azure
    - AWS
    - ...
  - Coiled
 
### Only the configuration step changes!
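
For comparison with the cloud and HPC cells that follow, here is a sketch of the same pattern with an explicit `LocalCluster`. The worker counts are arbitrary, and `processes=False` is an assumption made to keep the workers inside the current process.

```python
from dask.distributed import Client, LocalCluster

# Configuration step: this is the only part that depends on where you run.
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,          # assumption: workers run in this process
    dashboard_address=None,   # no dashboard for this small sketch
)
client = Client(cluster)

# Work step: identical whatever the cluster type.
squares = client.gather(client.map(lambda x: x ** 2, range(5)))
print(squares)

client.close()
cluster.close()
```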

In [None]:
# Amazon
try:
    from dask_cloudprovider.aws import FargateCluster

    cluster = FargateCluster(
        # Cluster manager specific config kwargs
    )
    client = Client(cluster)
    ...
    cluster.close()
except Exception:
    # dask_cloudprovider is not installed or AWS is not configured: skip.
    pass

In [None]:
# SLURM
try:
    from dask_jobqueue.slurm import SLURMRunner

    with SLURMRunner() as runner:
        with Client(runner) as client:
            # Wait for all the workers to be ready before continuing.
            client.wait_for_workers(runner.n_workers)
            main()  # placeholder for your own workload
except Exception:
    # Not inside a Slurm job, or dask_jobqueue is not installed: skip.
    pass

On the Alliance servers, you should use `SLURMRunner`, not `SLURMCluster`.  
- `SLURMRunner`: creates one worker for each task of a single Slurm job. Good for clusters that prefer fewer, larger jobs.
- `SLURMCluster`: submits a separate Slurm job for each worker, and each of those jobs has to go through the waiting queue.
  Good for clusters that prefer many small jobs.