Foundations
===========

We start by showing the internals of dask.  This is straightforward but somewhat tedious.

You can skip ahead if you don't care about how things work.

### We make functions

In [None]:
def inc(x):
    return x + 1

def add(x, y):
    return x + y

### Normally we call these functions in Python code

Python then executes our code in the order written.

In [None]:
a = 1
b = inc(a)

x = 10
y = inc(x)

z = add(b, y)
z

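If we want parallelism, this approach gets in the way: Python runs each line to completion before starting the next, even though `b` and `y` don't depend on each other.  We need to take control of execution away from Python.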
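
To make this concrete, here is a minimal sketch of wiring up the parallelism by hand with the standard library's `concurrent.futures` (this is not dask, and the variable names `future_b` and `future_y` are only illustrative).  The two `inc` calls are independent, so both can be submitted before waiting on either.

```python
from concurrent.futures import ThreadPoolExecutor

def inc(x):
    return x + 1

def add(x, y):
    return x + y

with ThreadPoolExecutor(max_workers=2) as pool:
    # The two inc calls are independent, so submit both before waiting on either
    future_b = pool.submit(inc, 1)
    future_y = pool.submit(inc, 10)
    # add needs both results, so it has to wait for them
    z = add(future_b.result(), future_y.result())

print(z)
```

This works, but we had to decide by hand which calls may overlap.  That bookkeeping is what the rest of this notebook hands over to a data structure.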

### We represent our computation as a data structure

We store the steps of the computation above as a Python dictionary.  We store function calls as tuples.

This looks a little strange, but it gives us the entire computation as a Python data structure that we can manipulate with *other* Python code.

In [None]:
dsk = {'a': 1, 
       'b': (inc, 'a'),
       
       'x': 10,
       'y': (inc, 'x'),
       
       'z': (add, 'b', 'y')}

### We use functions to execute these computations

The dask library contains functions to execute these dictionaries with multiple threads or multiple processes.
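
Before using them, it may help to see what executing such a dictionary involves.  The following is a minimal single-threaded sketch, not dask's implementation: `toy_get` is a hypothetical name, and it assumes that keys are strings and that a task is a tuple whose first element is callable.

```python
def inc(x):
    return x + 1

def add(x, y):
    return x + y

dsk = {'a': 1,
       'b': (inc, 'a'),
       'x': 10,
       'y': (inc, 'x'),
       'z': (add, 'b', 'y')}

def toy_get(graph, key):
    """Evaluate `key` in `graph` recursively, one task at a time."""
    value = graph[key]
    # A task is a tuple whose first element is callable
    if isinstance(value, tuple) and value and callable(value[0]):
        func, args = value[0], value[1:]
        # Arguments that name other keys are computed first;
        # anything else is passed through unchanged
        resolved = [toy_get(graph, arg)
                    if isinstance(arg, str) and arg in graph else arg
                    for arg in args]
        return func(*resolved)
    return value  # plain data, such as 1 or 10

result = toy_get(dsk, 'z')
print(result)  # 13
```

This sketch recomputes shared intermediate results and runs everything in one thread.  The schedulers below do the same kind of walk over the dictionary, but can run independent tasks at the same time.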

In [None]:
from dask.threaded import get
get(dsk, 'z')  # Execute in multiple threads

In [None]:
from dask.multiprocessing import get
get(dsk, 'z')  # Execute in multiple processes

### We can also analyze and visualize these graphs

In [None]:
# Requires pydot and graphviz to be installed
# It's fine to skip this cell if it doesn't work for you
from dask.dot import dot_graph
dot_graph(dsk)

That's it
----------

The rest of this tutorial is just fancy ways of constructing and executing these dictionaries of task graphs.  

Fundamentally dask is a way to represent computations as dictionaries, and then analyze and execute them.
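
As a small illustration of the "analyze" part, the sketch below works out which keys each task reads from, using nothing but plain Python.  The helper `direct_dependencies` is a hypothetical name for this example, not a dask function, and it assumes string keys as in the graph above.

```python
def inc(x):
    return x + 1

def add(x, y):
    return x + y

dsk = {'a': 1,
       'b': (inc, 'a'),
       'x': 10,
       'y': (inc, 'x'),
       'z': (add, 'b', 'y')}

def direct_dependencies(graph, key):
    """Return the set of keys that the task stored under `key` reads from."""
    value = graph[key]
    if isinstance(value, tuple) and value and callable(value[0]):
        return {arg for arg in value[1:]
                if isinstance(arg, str) and arg in graph}
    return set()  # plain data depends on nothing

deps = {key: direct_dependencies(dsk, key) for key in dsk}

# Keys without dependencies can be computed right away
ready = sorted(key for key, d in deps.items() if not d)

print(deps['z'])
print(ready)
```

Information like this is what lets a scheduler see that `b` and `y` are independent of each other and can be computed in parallel.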

Exercise - `read_csv`
------------------------

There are three CSV files in your `data` directory.  Let's count the total number of rows across all of these CSV files.  In normal Python we might do the following.

In [None]:
import pandas as pd

import os
filenames = [os.path.join('data', 'accounts.%d.csv' % i) for i in [0, 1, 2]]
filenames

In [None]:
%%time 

a = pd.read_csv(filenames[0])
b = pd.read_csv(filenames[1])
c = pd.read_csv(filenames[2])

na = len(a)
nb = len(b)
nc = len(c)

total = sum([na, nb, nc])
total

### Exercise: Construct a dask graph/dictionary for this computation

Just as we turned code that looks like 

```python
y = f(x)
```

into dictionaries like 

```python
{'y': (f, 'x')}
```

we can transform the above calls to `pd.read_csv`, `len`, and `sum` into a dictionary of tuples.

In [None]:
dsk = {'a': ...,
       'b': ...,
      }

In [None]:
# Solution
%load solutions/Foundations-01.py

### Execute your dask graph

Use the threaded scheduler and the multiprocessing scheduler.  

How well does each perform?

In [None]:
from dask.threaded import get
%time get(dsk, 'total')

In [None]:
from dask.multiprocessing import get
%time get(dsk, 'total')

Exercise
--------

Compute the sum of the `amount` field across all three CSV files.  In normal sequential code we might execute the following:

In [None]:
sums = list()
for fn in filenames:
    df = pd.read_csv(fn)
    sums.append(df.amount.sum())
total = sum(sums)
total

Now create a dask graph for the same computation.  Attribute access (e.g. `.amount`) and method calls (e.g. `.sum()`) are not plain function calls, so they take a little extra care when writing this code as an explicit dictionary.

We suggest building and using a small function to compute the sum of the amount of a dataframe and using this function in your dask graph.

In [None]:
def amount_sum(df):
    return df.amount.sum()

In [None]:
dsk = dict()

for ...

In [None]:
# Solution
%load solutions/Foundations-02.py

In [None]:
get(dsk, 'total')

Conclusion
------------

Dask graphs represent computations as dictionaries of tuples.  

The `get` functions execute these dictionaries in parallel.

We've made a few of these dictionaries by hand.  It's straightforward but perhaps tiresome.
In the next sections we'll play with systems that generate these dictionaries for us.
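
As a small taste of what generating a graph means, here is a minimal sketch that builds one in a loop.  `build_graph` and the `x-0`, `y-0` key names are hypothetical choices for this example.  It relies on our understanding that the dask graph specification lets a list of keys appear as a task argument, standing for the list of their results.

```python
def inc(x):
    return x + 1

def build_graph(values):
    """Build a graph that increments each value and sums the results."""
    graph = {}
    result_keys = []
    for i, value in enumerate(values):
        graph['x-%d' % i] = value                  # plain data
        graph['y-%d' % i] = (inc, 'x-%d' % i)      # one task per input
        result_keys.append('y-%d' % i)
    # A list of keys as an argument stands for a list of their results
    graph['total'] = (sum, result_keys)
    return graph

generated = build_graph([1, 10, 100])
print(sorted(generated))
```

If dask is installed, `get(generated, 'total')` with one of the schedulers used above should execute this graph just like the hand-written ones.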