<img src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg" align="right" width="30%"/>

<center>
    <h1>
Dask Tutorial for PyHEP 2022
    </h1>
</center>

Dask is a pure Python library for parallel and distributed computing designed to scale up workflows in the PyData ecosystem.

The two main components of Dask:

- The **collection libraries** (sometimes called "Dask core"), listed here in alphabetical order:
  - `dask.array`: chunked NumPy
  - `dask.bag`: partitioned Python iterables
  - `dask.dataframe`: partitioned Pandas
  - `dask.delayed`: custom algorithms
- The **execution engines** (task schedulers)
  - The distributed engine is its own project (`distributed`, sometimes called "Dask Distributed")
 

<div style="text-align: center;">
  <img src="https://docs.dask.org/en/stable/_images/dask-overview.svg" align="center" width="70%"/>
</div>

There are also a number of other projects in the Dask ecosystem that leverage both upstream components.

First Example (using `dask.delayed`)
------------------------------------

We'll start with a simple `dask.delayed` example that covers _a lot_ of how Dask works:

In [None]:
from dask.delayed import delayed

def inc(x):
    return x + 1

inc = delayed(inc)

In [None]:
inc(7)

Notice that this just creates a `Delayed` object.

In [None]:
eight = inc(7)

We have to ask Dask to compute the result by calling `compute()`:

In [None]:
eight.compute()

We can start to construct a more complex task graph by chaining function calls:

In [None]:
@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

In [None]:
five = add(inc(1), inc(2))

In [None]:
five.compute()

We can inspect the complete task graph to see how Dask computes the result of the collection:

In [None]:
delayed_task_graph = five.dask.to_dict()

In [None]:
for i, (k, v) in enumerate(delayed_task_graph.items()):
    if i != 0:
        print("\n")
    print("The key (label) of a task:   ", k)
    print("The task itself (Lisp S-exp):", v)
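The printed entries can be read as a plain dictionary that maps keys to tasks, where a task is a tuple of a callable followed by its arguments, and an argument may itself be another key. As a minimal sketch, a graph of that shape can be written by hand and passed to a scheduler's `get` function. The key names `"x"`, `"y"` and `"z"` are made up for illustration, and this assumes the installed Dask version still accepts the classic dict-of-tuples graph format.

```python
from operator import add

from dask.threaded import get  # the threaded scheduler's entry point


def inc(x):
    return x + 1


# A hand-written task graph: keys map to literal data or to task tuples.
# The key names here are hypothetical and chosen only for illustration.
dsk = {
    "x": 1,                # a literal value
    "y": (inc, "x"),       # call inc on the value stored under "x"
    "z": (add, "y", 10),   # call add on the result of "y" and the literal 10
}

result = get(dsk, "z")
print(result)
```

In everyday use the collections build these graphs for you, so writing one by hand is mainly useful for understanding what `compute()` hands to the scheduler.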

There is a much better way to inspect the graph: `visualize()`.

In [None]:
five.visualize()

Second Example
--------------

Let's take a look at an example that illustrates something closer to a real workflow: reading and operating on files to produce a histogram. Our example will have three steps:

1. Load an uproot TTree by file and tree name
2. Calculate something from information in the file
3. Histogram the calculation

We'll look at the workflow while leveraging Dask, and compare it to a workflow without Dask.

In [None]:
import uproot
import awkward as ak
import hist
import time
from skhep_testdata import data_path
paths = [data_path("uproot-Zmumu.root")] * 5

In [None]:
@delayed
def read_tree(file_name, tree_name):
    time.sleep(1)  # simulate the latency of reading a larger file
    return uproot.open(file_name)[tree_name]

@delayed
def calculation(tree):
    arrs = tree.arrays()
    return abs(arrs.E1 - arrs.E2)
    
@delayed
def histo(data, bins, range):
    h = hist.Hist(hist.axis.Regular(bins=bins, start=range[0], stop=range[1], name="abs(E1-E2)"))
    h.fill(data)
    return h

In [None]:
histos = []
for p in paths:
    tree = read_tree(p, "events")
    calc = calculation(tree)
    h = histo(calc, 20, (0, 200))
    histos.append(h)

In [None]:
histos

In [None]:
sum(histos).visualize()

In [None]:
sum(histos).compute()

In [None]:
%%timeit
sum(histos).compute()

Now without Dask:

In [None]:
def s_read_tree(file_name, tree_name):
    time.sleep(1)  # simulate the latency of reading a larger file
    return uproot.open(file_name)[tree_name]

def s_calculation(tree):
    arrs = tree.arrays()
    return abs(arrs.E1 - arrs.E2)
    
def s_histo(data, bins, range):
    h = hist.Hist(hist.axis.Regular(bins=bins, start=range[0], stop=range[1], name="abs(E1-E2)"))
    h.fill(data)
    return h

In [None]:
%%timeit
s_histos = []
for p in paths:
    tree = s_read_tree(p, "events")
    calc = s_calculation(tree)
    h = s_histo(calc, 20, (0, 200))
    s_histos.append(h)
sum(s_histos)
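One reason to expect a difference between the two timings is that the sleeps in independent tasks can overlap when the tasks run on separate threads. The sketch below isolates that effect without any ROOT files. The function `slow_inc` is a hypothetical stand-in for a slow read, and `num_workers=4` is set explicitly (an assumption for this example) so that the thread pool is large enough regardless of how many cores the machine has.

```python
import time

import dask
from dask import delayed


def slow_inc(x):
    time.sleep(0.2)  # hypothetical stand-in for slow I/O
    return x + 1


# Plain Python: the four calls run one after another.
start = time.perf_counter()
sequential = [slow_inc(i) for i in range(4)]
sequential_elapsed = time.perf_counter() - start

# Dask: build four independent tasks, then run them on four threads.
tasks = [delayed(slow_inc)(i) for i in range(4)]
start = time.perf_counter()
parallel = dask.compute(*tasks, scheduler="threads", num_workers=4)
parallel_elapsed = time.perf_counter() - start

print(f"sequential: {sequential_elapsed:.2f} s, threaded: {parallel_elapsed:.2f} s")
```

The results are identical in both cases; only the wall-clock time should differ, because `time.sleep` releases the GIL and the waits overlap.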

dask.array
==========

While `dask.delayed` is incredibly flexible and can turn almost any Python function into a node in a task graph, the other collection libraries are designed to provide task graph creation as near drop-in replacements for existing PyData libraries. `dask.array` is meant to mirror the NumPy API. Arrays in `dask.array` are chunked, lazily evaluated NumPy arrays. The data nodes in a task graph are just NumPy arrays: Dask doesn't create a new array computation kernel library.

<center>
<img src="https://docs.dask.org/en/stable/_images/dask-array.svg" width="50%">
</center>

In [None]:
import numpy as np
import dask.array as da

In [None]:
a1 = np.ones((10,))

In [None]:
a1.sum()

In [None]:
a1[:5].sum() + a1[5:].sum()

In [None]:
a2 = da.ones((10,), chunks=5)

In [None]:
a2

In [None]:
a2.visualize()

Chaining together function calls with `dask.array` is very similar to what we did with `dask.delayed`: it simply builds up the task graph. However, now we get to use the ubiquitous NumPy API.
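As a small illustration of that chaining, the sketch below applies two familiar NumPy-style operations to a chunked array and only triggers work at `compute()`. The variable names are arbitrary.

```python
import dask.array as da

# Ten ones split into two chunks of five elements each.
x = da.ones((10,), chunks=5)

# Both operations are lazy: they only add nodes to the task graph.
total = (x + 1).sum()

# Nothing has been computed until this call.
result = total.compute()
print(result)
```

Each chunk is processed as an ordinary NumPy array, and the per-chunk sums are combined at the end, much like the manual `a1[:5].sum() + a1[5:].sum()` shown earlier.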

dask.dataframe
==============

The NumPy/`dask.array` relationship is mirrored for Pandas with `dask.dataframe`. DataFrames (and Series) in `dask.dataframe` are partitioned, lazily evaluated Pandas DataFrames (and Series). The data nodes in a task graph are Pandas objects.

<center>
<img src="https://docs.dask.org/en/stable/_images/dask-dataframe.svg" width="35%">
</center>

In [None]:
from dask.datasets import timeseries

In [None]:
ddf = timeseries()

In [None]:
ddf

distributed
===========

So far in this tutorial we've been using the default execution engine at each `.compute()` call. For the collections used above, this has been the threaded scheduler: Dask tries to use all available threads on the machine where it is running. This parallelizes computation on a single machine, such as a laptop.
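The scheduler can also be chosen explicitly, either per call with the `scheduler` keyword of `compute()` or for a block of code with `dask.config.set`. The sketch below runs the same small graph with the threaded scheduler and with the single-threaded synchronous scheduler, which is often convenient for debugging. The graph itself is a made-up example.

```python
import dask
from dask import delayed


@delayed
def inc(x):
    return x + 1


# A small illustrative graph: increment 0..3, then sum the results.
total = delayed(sum)([inc(i) for i in range(4)])

# Choose the scheduler per call.
threaded_result = total.compute(scheduler="threads")
sync_result = total.compute(scheduler="synchronous")

# Or choose it for every compute() inside a block.
with dask.config.set(scheduler="synchronous"):
    config_result = total.compute()

print(threaded_result, sync_result, config_result)
```

The result does not depend on the scheduler; only where and how the tasks run changes.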

<center>
<img src="images/distributed.png" width="60%"/>
</center>