Imperative Programming
=================

Many problems don't fit cleanly into `ndarray` or `DataFrame` abstractions.  How can we use dask to parallelize more custom workloads?

We can always fall back to creating dictionaries manually:

    dsk = {'load-1': (load, filename1), 'clean-1': (clean, 'load-1'), ...,
           'load-2': (load, filename2), 'clean-2': (clean, 'load-2'), ...,
           ...}
    
Manual dictionary creation, though, is tedious, prone to programmer error, and feels foreign to many developers.
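
To make the dictionary format concrete, here is a minimal sketch of how a graph like the one above could be evaluated serially. The `toy_get` function and the bodies of `load` and `clean` are hypothetical stand-ins invented for illustration, and so are the file names. This is not how dask's schedulers are implemented; they also handle caching and parallel execution.

```python
def load(filename):
    # Hypothetical stand-in: pretend the file holds some messy text.
    return "  Data From %s  " % filename

def clean(text):
    # Hypothetical stand-in for a cleaning step.
    return text.strip().lower()

def toy_get(dsk, key):
    """Serially evaluate `key` in a dask-style graph dictionary (toy version)."""
    def evaluate(arg):
        # A list is evaluated element by element.
        if isinstance(arg, list):
            return [evaluate(a) for a in arg]
        # A tuple starting with a callable is a task: (func, arg1, arg2, ...).
        if isinstance(arg, tuple) and arg and callable(arg[0]):
            func, *args = arg
            return func(*[evaluate(a) for a in args])
        # A string that names a key in the graph refers to that key's result.
        if isinstance(arg, str) and arg in dsk:
            return evaluate(dsk[arg])
        # Anything else is a literal value.
        return arg
    return evaluate(dsk[key])

dsk = {'load-1': (load, 'file1.txt'), 'clean-1': (clean, 'load-1'),
       'load-2': (load, 'file2.txt'), 'clean-2': (clean, 'load-2')}

results = [toy_get(dsk, 'clean-1'), toy_get(dsk, 'clean-2')]
print(results)
```

Every key and every dependency is spelled out by hand as a string, so a single mistyped key silently changes the meaning of the graph.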

The dask `do` function lets you construct custom dask graphs in an ordinary Python coding style, without building the dictionary explicitly.

### Custom graphs with `do`

The `do` function delays a function evaluation, producing a lazily evaluated result.  To use it, wrap the function in a `do` call:

*  Before:  

        result = f(a, b, c=10)
*  After:  

        result = do(f)(a, b, c=10)
        
The result of calling a `do`-wrapped function is a lazy `Value` object, which we can pass to further `do` calls or eventually evaluate by calling `.compute()`:

    >>> result.compute()
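
The idea of delaying a call can be sketched in a few lines of plain Python. The names `toy_do` and `ToyValue` below are hypothetical and exist only for this illustration. The sketch assumes nothing about dask's actual implementation; it only shows how a wrapper can record a call instead of running it, and how the recorded calls accumulate into a graph.

```python
import itertools

_counter = itertools.count()

class ToyValue:
    """Hypothetical lazy result: a key plus the graph needed to compute it."""
    def __init__(self, key, graph):
        self.key = key
        self.graph = graph

    def compute(self):
        func, args, kwargs = self.graph[self.key]
        # Evaluate any lazy arguments first, then call the function.
        args = [a.compute() if isinstance(a, ToyValue) else a for a in args]
        kwargs = {k: (v.compute() if isinstance(v, ToyValue) else v)
                  for k, v in kwargs.items()}
        return func(*args, **kwargs)

def toy_do(func):
    """Wrap `func` so that calling it records a task instead of running it."""
    def delayed_call(*args, **kwargs):
        key = "%s-%d" % (func.__name__, next(_counter))
        graph = {}
        # Merge the graphs of any lazy arguments into the new graph.
        for arg in list(args) + list(kwargs.values()):
            if isinstance(arg, ToyValue):
                graph.update(arg.graph)
        graph[key] = (func, args, kwargs)
        return ToyValue(key, graph)
    return delayed_call

def f(a, b, c=0):
    return a + b + c

partial = toy_do(f)(1, 2, c=10)  # nothing is computed yet
total = toy_do(f)(partial, 5)    # a lazy value used as an argument
answer = total.compute()         # the work happens here
print(answer, sorted(total.graph))
```

Here `total.graph` holds two tasks, one for each wrapped call, even though the code reads like ordinary function calls.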

### A Familiar Example

To explore this abstraction, we revisit our examples from the [Foundations Notebook](02-Foundations.ipynb).

In [None]:
def inc(x):
    return x + 1

def add(x, y):
    return x + y

a = 1
b = inc(a)

x = 10
y = inc(x)

z = add(b, y)
z

Originally, we parallelized this by constructing a dask graph explicitly:

In [None]:
dsk = {'a': 1, 
       'b': (inc, 'a'),
       
       'x': 10,
       'y': (inc, 'x'),
       
       'z': (add, 'b', 'y')}

Now we can also use the `do` function to construct the same dask graph with ordinary-looking Python code:

In [None]:
from dask import do

a = 1
b = do(inc)(a)

x = 10
y = do(inc)(x)

z = do(add)(b, y)
z

In [None]:
z.compute()

These `Value` objects build up the dask graph as they go.  The resulting graph uses automatically generated keys, so it is harder to read than the hand-written dictionary, but it works just as well for normal execution.

In [None]:
z.dask

Exercise
---------

Consider our first exercise: reading three CSV files with `pd.read_csv` and then measuring their total length.

In [None]:
import pandas as pd
import os
filenames = [os.path.join('data', 'accounts.%d.csv' % i) for i in [0, 1, 2]]
filenames

In [None]:
%%time

a = pd.read_csv(filenames[0])
b = pd.read_csv(filenames[1])
c = pd.read_csv(filenames[2])

na = len(a)
nb = len(b)
nc = len(c)

total = sum([na, nb, nc])
total

In the first notebook, we constructed a dask graph from this computation and then executed it in parallel using multiple processes to get a speedup:

In [None]:
# %load solutions/Foundations-01.py
dsk = {'a': (pd.read_csv, filenames[0]),
       'b': (pd.read_csv, filenames[1]),
       'c': (pd.read_csv, filenames[2]),
       'na': (len, 'a'),
       'nb': (len, 'b'),
       'nc': (len, 'c'),
       'total': (sum, ['na', 'nb', 'nc'])}

In [None]:
from dask.multiprocessing import get
%time  get(dsk, 'total')

Your task is to recreate this graph by applying the `do` function to the original Python code.

In [None]:
a = do(pd.read_csv)(filenames[0])
...

total = ...

%time total.compute(get=get) # use multiprocessing get function in call to compute