# Dask

- Suitable for both CPU-bound and memory-bound problems
- Distributes storage and compute
- Efficiently uses multiple CPUs on a single node or across multiple nodes
- Handles datasets that are too large to fit in memory


#### Dask in the Python Ecosystem

<img src="https://raw.githubusercontent.com/dmbala/python-bigData/main/Figures/dask-eco.jpeg" width=500 height=400>

#### Dask API for scikit-learn: distributed task execution

<img src="https://raw.githubusercontent.com/dmbala/python-bigData/main/Figures/DaskDistributedJob.png" width=500 height=200>

## Dask Collections
- dask.bag: an unordered collection that allows repeated items, effectively a distributed replacement for Python iterators; built from text/binary files or from arbitrary Delayed sequences
- dask.array: distributed, chunked arrays with a NumPy-like interface, well suited to scaling large matrix operations
- dask.dataframe: distributed pandas-like dataframes for efficient handling of tabular, organized data
- dask_ml: distributed wrappers around scikit-learn-like machine-learning tools
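
As a rough illustration of the collections listed above, the sketch below builds a small chunked array and a small bag. The shapes, chunk sizes, and variable names are arbitrary choices for this example, and the bag is computed with the threaded scheduler only to keep the sketch lightweight.

```python
import dask.array as da
import dask.bag as db

# A 1000 x 1000 array of ones, split into 250 x 250 chunks (a 4 x 4 grid of blocks).
x = da.ones((1000, 1000), chunks=(250, 250))

# Building the expression is lazy: this only records tasks, nothing runs yet.
total = (x + x.T).sum()

# compute() executes the task graph and returns a concrete value.
result = total.compute()

# A bag built from an in-memory sequence, split into 2 partitions.
b = db.from_sequence(range(10), npartitions=2)

# Keep the even numbers and square them (threaded scheduler chosen for this sketch).
evens_squared = (
    b.filter(lambda n: n % 2 == 0)
     .map(lambda n: n * n)
     .compute(scheduler="threads")
)

print(result)
print(evens_squared)
```

Both collections stay lazy until compute() is called, which is what lets Dask plan the work chunk by chunk instead of loading everything at once.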

In [1]:
# Importing dask array and dataframe
import dask
import dask.array as da
import dask.dataframe as dd
dask.__version__

'2022.02.0'

## Dask delayed and compute
- delayed: wraps a function call so that it is recorded as a task in a task graph instead of running immediately
- compute: executes the tasks in the graph according to the selected scheduler
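
The two steps above can be sketched with a pair of toy functions. The names inc and add are hypothetical helpers invented for this example, not part of Dask.

```python
from dask import delayed


@delayed
def inc(n):
    """Hypothetical helper: return n + 1."""
    return n + 1


@delayed
def add(a, b):
    """Hypothetical helper: return a + b."""
    return a + b


# Calling delayed functions only records tasks; no work happens here.
a = inc(1)
b = inc(2)
c = add(a, b)

# compute() walks the graph: the two inc tasks are independent of each other,
# and add runs once both results are available.
value = c.compute()
print(value)
```

Because inc(1) and inc(2) do not depend on each other, the scheduler is free to run them in parallel.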

## Dask Scheduler
- Threads - the default for dask.array, dask.dataframe, and dask.delayed; selected by calling compute() or compute(scheduler='threads'). Uses multiple threads in the same process.
- Processes - uses a pool of child processes; selected by calling compute(scheduler='processes'). Each process has its own Python interpreter, so startup takes longer than with threads.
- Single thread - no parallelism; selected by calling compute(scheduler='single-threaded'). Useful for debugging.
- Distributed - uses a pool of worker processes along with a scheduler process. It can be used on a single machine or scaled out to many machines.
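
A minimal sketch of switching between the single-machine schedulers listed above, assuming a small array whose size and chunking are arbitrary. The same graph is computed under different schedulers and yields the same value.

```python
import dask
import dask.array as da

# One million integers in ten chunks; the sum is one reduction task graph.
x = da.arange(1_000_000, chunks=100_000)
total = x.sum()

# Choose the scheduler per call.
threaded = total.compute(scheduler="threads")
serial = total.compute(scheduler="single-threaded")

# Or choose it for a whole block of code through the Dask configuration.
with dask.config.set(scheduler="single-threaded"):
    scoped = total.compute()

print(threaded, serial, scoped)
```

The process-based scheduler is selected the same way with scheduler="processes". In a standalone script that call typically belongs under an `if __name__ == "__main__":` guard, so it is left out of this sketch.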

## Dask Distributed Cluster

In [2]:
from dask.distributed import Client, LocalCluster
client = Client(n_workers=6, threads_per_worker=4, memory_limit='4GB')
client 

Client
- Connection method: Cluster object
- Cluster type: distributed.LocalCluster
- Dashboard: http://127.0.0.1:8787/status

LocalCluster
- Dashboard: http://127.0.0.1:8787/status
- Workers: 6
- Total threads: 24
- Total memory: 22.35 GiB
- Status: running
- Using processes: True

Scheduler
- Comm: tcp://127.0.0.1:40581
- Dashboard: http://127.0.0.1:8787/status
- Workers: 6
- Total threads: 24
- Total memory: 22.35 GiB
- Started: Just now

Workers (each with 4 threads and 3.73 GiB of memory)

| Comm | Dashboard | Nanny | Local directory |
|---|---|---|---|
| tcp://127.0.0.1:34243 | http://127.0.0.1:39453/status | tcp://127.0.0.1:37707 | /content/dask-worker-space/worker-m3o_fvk7 |
| tcp://127.0.0.1:38465 | http://127.0.0.1:33835/status | tcp://127.0.0.1:45159 | /content/dask-worker-space/worker-ehzag_15 |
| tcp://127.0.0.1:35389 | http://127.0.0.1:36405/status | tcp://127.0.0.1:33343 | /content/dask-worker-space/worker-w_cdbmav |
| tcp://127.0.0.1:37547 | http://127.0.0.1:43157/status | tcp://127.0.0.1:34041 | /content/dask-worker-space/worker-yjoprg7g |
| tcp://127.0.0.1:34581 | http://127.0.0.1:33247/status | tcp://127.0.0.1:41605 | /content/dask-worker-space/worker-qe_nnt7p |
| tcp://127.0.0.1:46065 | http://127.0.0.1:36099/status | tcp://127.0.0.1:42835 | /content/dask-worker-space/worker-kdwjyz90 |


In [3]:
client.shutdown()
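
For completeness, here is a small sketch of sending work to a distributed cluster and closing it automatically. It assumes a much smaller cluster than the one above (2 thread-based workers, dashboard disabled) so that it stays cheap to run. The function square is a hypothetical helper for this example.

```python
from dask.distributed import Client, LocalCluster


def square(n):
    """Hypothetical helper: return n squared."""
    return n * n


# processes=False runs the workers as threads inside this process, and
# dashboard_address=None disables the dashboard; both are choices made
# only to keep this sketch small.
with LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
) as cluster:
    with Client(cluster) as client:
        # map() submits one task per input and returns futures immediately.
        futures = client.map(square, range(5))

        # gather() blocks until the results are available.
        squares = client.gather(futures)

        # Futures can be passed to later tasks without fetching them first.
        total = client.submit(sum, futures).result()

print(squares, total)
```

Using the cluster and client as context managers closes both on exit, which serves the same purpose as calling client.shutdown() by hand.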