<h1 style='font-size:60px'>
    What is Dask?
<img src='logo.svg' align='left' height='120' width='120' style='float: left; margin-right: 40px; margin-top: 1px;'/>
</h1>

<font size='4'> Henry Wilde | 
<i class='fa fa-github' aria-hidden='false'></i>
<i class='fa fa-twitter' aria-hidden='false'></i> @daffidwilde </font>
<hr>

## Dask is a parallel computing framework built around task graphs
## Documentation available at [docs.dask.org](https://docs.dask.org)

<br>
<br>

<img align='center' width='75%' src='dask-overview.svg'>

<br>
<br>

# Parallelising serial code with `dask.delayed`
---

In [None]:
from time import sleep


def inc(x):
    sleep(0.5)
    return x + 1


def add(x, y):
    sleep(1)
    return x + y


def mul(x, y):
    sleep(0.5)
    return x * y

In [None]:
%%time

results = []
a = inc(1)
for i in range(3):
    b = mul(i, a)
    c = add(a, b)
    results.append(c)

total = sum(results)
total

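We can use `dask.delayed` either by wrapping it around an existing function or by applying it as a decorator where the function is defined.

In either case, calling the function no longer runs it. The call returns a `Delayed` object that records the task, and the work happens only when we ask for the result with `.compute()`.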
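
As a minimal sketch of the wrapping form, the snippet below wraps a small function called `double`. That function is a hypothetical stand-in and is not part of this notebook. It shows that the call returns a `Delayed` object and that the value only appears after `.compute()`.

```python
from dask import delayed


def double(x):
    # Hypothetical helper used only for this illustration.
    return 2 * x


# Wrap an existing function instead of decorating it at definition time.
lazy_double = delayed(double)

# Calling the wrapped function records a task; nothing is executed yet.
result = lazy_double(21)
print(type(result).__name__)  # Delayed

# The function body runs only when we ask for the value.
value = result.compute()
print(value)  # 42
```

Wrapping is handy when the function comes from another module and cannot be decorated where it is defined.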

In [None]:
from dask import delayed


@delayed
def inc(x):
    sleep(0.5)
    return x + 1


@delayed
def add(x, y):
    sleep(1)
    return x + y


@delayed
def mul(x, y):
    sleep(0.5)
    return x * y

In [None]:
%%time

results = []
a = inc(1)
for i in range(3):
    b = mul(i, a)
    c = add(a, b)
    results.append(c)

total = sum(results)
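
The cell above only builds the task graph, which is why it returns almost immediately. A minimal, self-contained sketch of actually running such a graph follows. It redefines the three functions with shortened sleeps (an assumption made to keep the example quick) and times the call to `.compute()`.

```python
from time import perf_counter, sleep

from dask import delayed


@delayed
def inc(x):
    sleep(0.05)  # shortened from the 0.5 s used in the notebook
    return x + 1


@delayed
def add(x, y):
    sleep(0.1)  # shortened from 1 s
    return x + y


@delayed
def mul(x, y):
    sleep(0.05)  # shortened from 0.5 s
    return x * y


# Build the same graph as above; every call just records a task.
results = []
a = inc(1)
for i in range(3):
    b = mul(i, a)
    c = add(a, b)
    results.append(c)

total = sum(results)  # still a Delayed object

# Execute the graph; independent tasks may run concurrently.
start = perf_counter()
value = total.compute()
elapsed = perf_counter() - start

print(value)  # 12
print(f"computed in {elapsed:.2f} s")
```

The three loop iterations depend only on `a`, so the default threaded scheduler is free to run them side by side. How much time that saves depends on the machine. Calling `total.visualize()` renders the graph, assuming the optional graphviz dependency is installed.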

# Similarity between `dask.dataframe` and `pandas.DataFrame`
---

### In pandas, we might have something like this:

In [None]:
%%time
import glob
import pandas as pd


dtypes = {
    "CRSElapsedTime": "float",
    "TailNum": "category",
    "UniqueCarrier": "category",
    "Origin": "category",
    "Dest": "category",
}

dfs = (pd.read_csv(csv, dtype=dtypes) for csv in glob.iglob("nycflights/*.csv"))
df = pd.concat(dfs)

In [None]:
# Total in-memory size of the DataFrame in MiB
df.memory_usage(deep=True).sum() / (1024**2)

In [None]:
df.head()

In [None]:
%%time
df.groupby("Origin")["DepDelay"].mean().sort_values()

### The process is much the same in Dask:


In [None]:
%%time
import dask.dataframe as dd


ddf = dd.read_csv("nycflights/*.csv", dtype=dtypes)

In [None]:
ddf.head()

In [None]:
mean = ddf.groupby("Origin")["DepDelay"].mean()
mean

In [None]:
mean.visualize()

In [None]:
%%time
mean.compute().sort_values()

# Schedulers
---


There are four schedulers currently implemented in Dask:

- **Threaded:**
  - Useful for numeric code such as `numpy` and `pandas`, where the GIL is released

- **Multiprocessing:**
  - Good for pure-Python code that holds the GIL and so needs multiple interpreters to run in parallel

- **Synchronous:**
  - Runs everything in a single thread, which helps with debugging and profiling

- **Distributed:**
  - For working with a cluster of machines on larger tasks
  - Can also run locally, as an alternative to the schedulers above with better diagnostic tools
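
As a rough sketch of how a scheduler is chosen, the snippet below runs one small graph on the threaded and synchronous schedulers, using either the `scheduler=` keyword of `.compute()` or `dask.config.set`. The `square` function is a hypothetical example and not part of this notebook.

```python
import dask
from dask import delayed


@delayed
def square(x):
    # Hypothetical task used only for this illustration.
    return x * x


# A small graph: five independent tasks feeding one sum.
total = delayed(sum)([square(i) for i in range(5)])

# Pick a scheduler for a single call with the scheduler keyword.
threaded = total.compute(scheduler="threads")
serial = total.compute(scheduler="synchronous")  # single thread, easy to debug

# Or set it for a block of code with the config context manager.
with dask.config.set(scheduler="synchronous"):
    configured = total.compute()

# "processes" selects the multiprocessing scheduler in the same way.
print(threaded, serial, configured)  # 30 30 30
```

The result does not depend on the scheduler; only how and where the tasks run changes.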

In [None]:
import dask_ml.datasets
import dask_ml.cluster
from distributed import Client

client = Client(n_workers=8, memory_limit="4GiB")
client

In [None]:
n_clusters = 3

X, _ = dask_ml.datasets.make_blobs(
    n_samples=10_000_000,  # integer sizes rather than float literals
    chunks=100_000,
    n_features=2,
    centers=n_clusters,
    random_state=0,
)

# persist() returns a new collection, so rebind X to keep the data in memory
X = X.persist()
X

In [None]:
%%time
kmeans = dask_ml.cluster.KMeans(n_clusters=n_clusters, random_state=0).fit(X)
kmeans.labels_

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline


skip = 1000
_, ax = plt.subplots(dpi=300)

plot = ax.scatter(X[::skip, 0], X[::skip, 1], marker=".", c=kmeans.labels_[::skip])

In [None]:
client.shutdown()