<h1 style='font-size:60px'>
    What is Dask?
<img src='logo.png' align='left' height='120' width='120' style='float: left; margin-right: 40px; margin-top: 1px;'/>
</h1>

<font size='4'> Henry Wilde | 
<i class='fa fa-github' aria-hidden='false'></i>
<i class='fa fa-twitter' aria-hidden='false'></i> @daffidwilde </font>
<hr>

### Dask is a parallel computing framework built around task graphs.

<br>
We can divide Dask into two main components:

* **High-level** data collections (`dask.array`, `dask.bag` and `dask.dataframe`) which mirror `np.ndarray`, `list` and `pd.DataFrame`


* **Low-level** task schedulers for executing task graphs

<br>

<img align='center' src='collections-schedulers.png'>
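
To make "task graph" concrete, here is a minimal sketch in plain Python. It assumes the dictionary style of graph, where each key maps to either a value or a tuple of `(function, *arguments)`. The `toy_get` function is a hypothetical stand-in written for this illustration; it is not Dask's scheduler, only a single-threaded imitation of what a scheduler has to do.

```python
from operator import add, mul

# Each key maps to a literal value or to a task tuple: (function, *arguments).
# String arguments that are also keys refer to the results of other tasks.
graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", 10),    # w = z * 10
}


def toy_get(graph, key):
    """Hypothetical scheduler: resolve `key` by recursively running its dependencies."""
    value = graph[key]
    if isinstance(value, tuple) and value and callable(value[0]):
        func, *args = value
        # Only string arguments are treated as references to other keys here.
        resolved = [
            toy_get(graph, arg) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        return func(*resolved)
    return value


result = toy_get(graph, "w")
print(result)  # 30
```

The high-level collections build graphs like this one for us, and the schedulers decide how and where the tasks are run.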

# Parallelising code with `dask.delayed`
---

### Normal code

In [None]:
from time import sleep

def inc(x):
    sleep(0.5)
    return x + 1

def add(x, y):
    sleep(1)
    return x + y

def mul(x, y):
    sleep(0.5)
    return x * y

In [None]:
%%time

results = []
a = inc(1)
for i in range(3): 
    b = mul(i, a)
    c = add(a, b)
    results.append(c)

total = sum(results)
print(total)

### Parallelised code


We can use `dask.delayed` either by wrapping it around a function or by using a decorator at the point of definition.

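In either case, calling the function no longer runs it straight away. The call is recorded as a task in a graph and is only executed when we ask for the result with `.compute()`.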
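
As a small sketch of the wrapping form, the snippet below applies `delayed` at call time rather than at definition. The function `double` is a hypothetical helper made up for this illustration.

```python
from dask import delayed


def double(x):
    # Hypothetical helper used only for this illustration.
    return 2 * x


# delayed(func)(*args) records the call instead of running it.
a = delayed(double)(1)
b = delayed(double)(2)
lazy_total = delayed(sum)([a, b])

print(type(lazy_total).__name__)  # Delayed
print(lazy_total.compute())       # 6
```

Wrapping at call time is handy when the function comes from another library and cannot be decorated where it is defined.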

In [None]:
from dask import delayed

@delayed
def inc(x):
    sleep(0.5)
    return x + 1

@delayed
def add(x, y):
    sleep(1)
    return x + y

@delayed
def mul(x, y):
    sleep(0.5)
    return x * y

In [None]:
%%time

results = []
a = inc(1)
for i in range(3):
    b = mul(i, a)
    c = add(a, b)
    results.append(c)
    
total = delayed(sum)(results)

In [None]:
total

In [None]:
total.visualize(rankdir='LR')

In [None]:
%time total.compute()

# Similarity between `dask.dataframe` and `pandas.DataFrame`
---

### In pandas, we might have something like this:

In [None]:
import pandas as pd
from glob import iglob

dt = {'CRSElapsedTime': 'float64', 'TailNum': 'object'}

In [None]:
dfs = [pd.read_csv(csv, dtype=dt) for csv in iglob('nycflights/*.csv')]
df = pd.concat(dfs)

In [None]:
df.head()

In [None]:
df.groupby(df.Origin).DepDelay.mean()

### The process is much the same in Dask:

In [None]:
import dask.dataframe as dd

In [None]:
%%time
ddf = dd.read_csv('nycflights/*.csv', dtype=dt)

In [None]:
ddf.visualize()

In [None]:
ddf.head()

In [None]:
mean = ddf.groupby(ddf.Origin).DepDelay.mean()
mean

In [None]:
mean.visualize(rankdir='LR')

In [None]:
mean.compute()

# Schedulers
---


There are four schedulers currently implemented in Dask:

- **Threaded:**
  - Useful for numeric code such as `numpy` and `pandas`, where the GIL is released


- **Multiprocessing:**
  - Good for pure-Python code that holds the GIL and so needs multiple interpreter processes to run in parallel


- **Synchronous:**
  - Runs everything in a single thread, which helps with debugging and profiling


- **Distributed:**
  - For working with a cluster of machines on larger tasks
  - Also usable on a single machine as an alternative to the multiprocessing scheduler, with better diagnostic tools
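
As a rough sketch of how a scheduler can be chosen, the snippet below runs one small graph with two of the local schedulers. The function `square` is a hypothetical helper made up for this illustration. The multiprocessing scheduler is selected in the same way with `scheduler="processes"`, while the distributed scheduler lives in the separate `distributed` package and is used through a `Client`, so neither is run here.

```python
import dask
from dask import delayed


@delayed
def square(x):
    # Hypothetical helper used only for this illustration.
    return x * x


total_squares = delayed(sum)([square(i) for i in range(4)])

# Choose a scheduler for a single call...
threaded = total_squares.compute(scheduler="threads")
serial = total_squares.compute(scheduler="synchronous")

# ...or set a default for everything inside a block.
with dask.config.set(scheduler="synchronous"):
    default = total_squares.compute()

print(threaded, serial, default)  # 14 14 14
```

The result is the same whichever scheduler runs the graph. Only how the tasks are executed changes.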