## Dask

Dask is a flexible library for parallel computing in Python.

Dask is composed of two parts:

- Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
- “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.

https://docs.dask.org/en/latest/
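
To make the task-scheduling idea concrete, here is a minimal sketch of a task graph written in the dict style that the Dask documentation describes, where each key maps to a literal value or to a tuple of a function and its arguments. The `toy_get` helper is a hypothetical, single-threaded stand-in written for illustration only. It is not Dask's scheduler, which also handles parallelism, memory, and distribution.

```python
from operator import add, mul

# A task graph in dict form: keys map either to literal values or to
# task tuples of the form (function, arg1, arg2, ...).
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),      # depends on x and y
    "result": (mul, "sum", 10),  # depends on sum
}


def toy_get(graph, key, cache=None):
    """Evaluate `key` by recursively computing its dependencies first.

    Hypothetical helper for illustration; not part of Dask.
    """
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    value = graph[key]
    if isinstance(value, tuple) and value and callable(value[0]):
        func, *args = value
        # Arguments that name another key are computed first;
        # anything else is passed through as a literal.
        resolved = [
            toy_get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        value = func(*resolved)
    cache[key] = value
    return value


answer = toy_get(graph, "result")
print(answer)  # (1 + 2) * 10 = 30
```

A real scheduler can run independent tasks (here, none beyond the literals) at the same time, because the graph makes the dependencies explicit.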

In [None]:
# Start a cluster from the Dask JupyterLab extension, then drag it into the notebook to create a client

In [None]:
import dask.array as da

In [None]:
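# Two lazy 10000 x 10000 x 10 random arrays, each split into 1000 x 1000 x 5 chunks
x = da.random.random((10000, 10000, 10), chunks=(1000, 1000, 5))
y = da.random.random((10000, 10000, 10), chunks=(1000, 1000, 5))
# Build the task graph: elementwise arcsin/arccos, then sum over the last two axes
z = (da.arcsin(x) + da.arccos(y)).sum(axis=(1, 2))
# z.visualize(rankdir="LR")  # optional: render the task graph (needs graphviz)
z.compute()  # run the graph and return a NumPy array of shape (10000,)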
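
The chunking in the cell above determines how many tasks the scheduler sees and how much memory each one needs. The following sketch works through that arithmetic in plain Python. It assumes 8-byte `float64` elements, which is the dtype `da.random.random` produces, and the variable names are chosen here for illustration.

```python
from math import ceil, prod

shape = (10000, 10000, 10)   # array shape from the cell above
chunks = (1000, 1000, 5)     # chunk shape from the cell above
itemsize = 8                 # bytes per float64 element (assumed dtype)

# Number of chunks along each axis, then in total.
chunks_per_axis = tuple(ceil(s / c) for s, c in zip(shape, chunks))
n_chunks = prod(chunks_per_axis)

# Memory footprint of one chunk and of one whole array.
chunk_bytes = prod(chunks) * itemsize
array_bytes = prod(shape) * itemsize

print(chunks_per_axis)           # (10, 10, 2)
print(n_chunks)                  # 200
print(chunk_bytes / 1e6, "MB")   # 40.0 MB
print(array_bytes / 1e9, "GB")   # 8.0 GB
```

So `x` and `y` are each about 8 GB in total, but a worker only needs to hold a few 40 MB chunks at a time.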

## Dask-jobqueue

Dask-jobqueue lets you scale Dask out on batch-scheduled HPC clusters, e.g. Slurm or PBS.

In [None]:
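from dask_jobqueue import SLURMCluster

# Each Slurm job gets 12 cores and 61 GB of memory, with a 10-minute walltime.
# Note: newer dask-jobqueue releases rename `project` to `account` and
# `job_extra` to `job_extra_directives`.
cluster = SLURMCluster(cores=12,
                       memory="61GB",
                       project='csstaff',
                       walltime='0:10:00',
                       job_extra=['-C gpu', '--reservation=interact_gpu'])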

In [None]:
cluster

In [None]:
print(cluster.job_script())

In [None]:
cluster.scale(4)  # request 4 Dask workers from the batch scheduler

In [None]:
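# Importing dask.distributed registers its defaults in Dask's configuration
import dask.distributed
# Inspect the active configuration dictionary
dask.config.config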