<center><img src="http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg"></center>

<center>
<h1><font size="+3">GSFC Python Bootcamp</font></h1>
</center>

---

<center><h1> <font color="red">Introduction to Dask</font></h1></center>

## <font color="red">Reference Documents</font>

- <a href="https://docs.dask.org/en/latest/why.html">Why Dask?</a>
- <a href="https://www.manning.com/books/data-science-with-python-and-dask">Data Science with Python and Dask</a>

![fig_dask](https://miro.medium.com/max/1000/1*D6mSsdWECFLn6wJne4VTjg.png)


### What is Dask?

- A flexible library for parallel computing in Python that makes it easy to build intuitive workflows for ingesting and analyzing large, distributed datasets. 
- A native parallel analytics tool designed to integrate seamlessly with Numpy, Pandas, and Scikit-Learn. 

Dask consists of several different components and APIs, which can be categorized into three layers: the scheduler, low-level APIs, and high-level APIs.

![fig_layers](https://dpzbhybb2pdcj.cloudfront.net/daniel/HighResolutionFigures/figure_1-1.png)
Image Source: 
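To make the layering concrete, here is a minimal sketch of what the scheduler layer consumes: a task graph written as a plain dictionary. The keys `'x'`, `'y'`, `'z'` and `'w'` are made up for illustration. The high-level collections used later in this notebook build graphs like this for you.

```python
from operator import add, mul

from dask.threaded import get  # the threaded scheduler's entry point

# A task graph is a dict: each key maps to a value or to a tuple
# (function, arg1, arg2, ...), where string arguments refer to other keys.
dsk = {
    'x': 1,
    'y': 2,
    'z': (add, 'x', 'y'),   # z = x + y
    'w': (mul, 'z', 10),    # w = z * 10
}

z_value = get(dsk, 'z')
w_value = get(dsk, 'w')
print(z_value, w_value)
```

In everyday use you rarely write such dictionaries by hand; `dask.array`, `dask.dataframe` and `dask.delayed` generate them.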

**Advantages of Using Dask**

- Fully implemented in Python and natively scales NumPy, Pandas, and scikit-learn.
- Can be used effectively to work with both medium datasets on a single machine and large datasets on a cluster.
- Can be used as a general framework for parallelizing most Python objects.
- Has a very low configuration and maintenance overhead.





In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import dask
import dask.array as da
import dask.dataframe as dd
from dask.diagnostics import ProgressBar 

## Dask Array

- Dask arrays coordinate many NumPy arrays, arranged into chunks within a grid.
- They support a large subset of the NumPy API.

![fig_array](https://miro.medium.com/max/1388/1*JfQnXJ5_R104bPyE8_XhwQ.png)

**Create a Dask Array**

- Create a 20000x20000 array of random numbers, represented as many NumPy arrays of size 1000x1000 (or smaller if the array cannot be divided evenly).
- There are 400 (20x20) NumPy arrays of size 1000x1000.

In [None]:
x = da.random.random((20000, 20000), chunks=(1000, 1000))
x
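The chunk layout can be checked without computing anything, because creating a Dask array only builds a task graph. The sketch below uses a separate name, `demo`, so it does not overwrite `x`.

```python
import dask.array as da

# Lazy: no random numbers are generated here, only the task graph is built.
demo = da.random.random((20000, 20000), chunks=(1000, 1000))

print(demo.numblocks)    # number of chunks along each axis
print(demo.npartitions)  # total number of chunks
print(demo.chunksize)    # shape of the largest chunk
print(demo.nbytes / 1e9) # size in GB if the whole array were materialized
```

With 20000 elements per axis and 1000 elements per chunk, there are 20 chunks along each axis.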

We can use Numpy syntax:

In [None]:
y = x + x.T
y.shape

In [None]:
z = y[::2, 5000:].mean(axis=1)
z

Call the `compute()` method if you want your result as a NumPy array.

In [None]:
w = z.compute()
print(type(w), w.shape )
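The same lazy-then-compute pattern is easier to follow on a tiny array. This is a small sketch with made-up sizes: the expression stays a Dask object until `compute()` is called.

```python
import dask.array as da

a = da.arange(10, chunks=5)      # two chunks of five elements each
lazy_total = (a ** 2).sum()      # still a Dask array, nothing has run yet

print(type(lazy_total))          # a dask.array object, not a number
value = lazy_total.compute()     # executes the graph and returns a NumPy scalar
print(value)
```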

**Persist Data in Memory**

- If you have the available RAM for your dataset then you can persist data in memory.
- This allows future computations to be much faster.

In [None]:
%time y.sum().compute()

In [None]:
y = y.persist()

In [None]:
%time y[0, 0].compute()

In [None]:
%time y.sum().compute()
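A small sketch of what `persist` does, using a made-up 4x4 array: the result is still a Dask array, but its chunks have already been computed and are held in memory, so later operations start from those chunks instead of re-running the whole graph.

```python
import dask.array as da

small = da.ones((4, 4), chunks=(2, 2))
expr = small + small.T            # lazy expression

kept = expr.persist()             # computes the chunks and keeps them in memory
print(type(kept))                 # still a Dask array, not a NumPy array

# Graph sizes before and after: the persisted graph only holds finished chunks.
print(len(dict(expr.__dask_graph__())), len(dict(kept.__dask_graph__())))

total = kept.sum().compute()      # starts from the stored chunks
print(total)
```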

**NumPy versus Dask**

In [None]:
%%time 
x = np.random.normal(10, 0.1, size=(20000, 20000)) 
y = x.mean(axis=0)[::100] 
#y

In [None]:
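%%time
x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))
x = x.persist()              # generate the chunks and keep them in memory
y = x.mean(axis=0)[::100]    # lazy: this only builds the task graph
#y.compute()                 # uncomment to run the reduction and include it in the timing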

## Dask DataFrames

- Coordinate many Pandas dataframes, partitioned along an index. 
- Support a large subset of the Pandas API.

![fig_df](https://pythondata.com/wp-content/uploads/2016/11/Screen-Shot-2016-11-24-at-6.52.24-PM-168x300.png)

In [None]:
url = "http://www.cpc.ncep.noaa.gov/products/precip/CWlink/daily_ao_index/monthly.ao.index.b50.current.ascii"

In [None]:
df = dd.read_table(url, sep=r'\s+',
                   parse_dates={'Dates': [0, 1]}, header=None)
df

In [None]:
df.dtypes

In [None]:
df.head()

**Use Standard Pandas Operations**

In [None]:
df.compute()

In [None]:
df.columns = ['Dates', 'AO']
df.compute()

In [None]:
df.AO.compute()

In [None]:
df.compute().plot(x="Dates", y="AO")

In [None]:
df2 = df[df.AO > 0]
df2.compute().plot(x="Dates", y="AO")

In [None]:
df3 = df[df.Dates > '1995-01-01'].AO.sum()
print(df3.compute())

## Parallelize Code with `dask.delayed`

- We want to parallelize a simple for-loop.

### Simple Code

Consider the following functions:

In [None]:
import time
import random

def inc(x):
    time.sleep(0.2)
    return x + 1

def double(x):
    time.sleep(0.2)
    return 2 * x

def add(x,y):
    time.sleep(0.2)
    return x + y

Let us use the above functions within two for-loops:

In [None]:
%%timeit -r 1

n = 20
data = [i+1 for i in range(n)]

out = []
for x in data:
    y = inc(x)
    z = double(y)
    out.append(z)
    
total = 0
for z in out:
    total = add(total, z)

total

We can wrap the functions `inc`, `double` and `add` with `dask.delayed` so that calling them records a task instead of running it immediately:

In [None]:
inc = dask.delayed(inc)
double = dask.delayed(double)
add = dask.delayed(add)

We use the `visualize` method (which relies on the `graphviz` package) to get a visual representation of the operations being performed.

In [None]:
x = inc(1)
y = inc(2)
z = add(x, y)
z.visualize(rankdir='LR')
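`dask.delayed` can also be applied with decorator syntax. The sketch below uses the hypothetical names `inc_d` and `add_d` (without the `sleep` calls) to show that calling a delayed function returns a `Delayed` object, and that the value only appears after `compute()`.

```python
import dask
from dask.delayed import Delayed

@dask.delayed
def inc_d(x):
    return x + 1

@dask.delayed
def add_d(x, y):
    return x + y

a = inc_d(1)          # Delayed, not 2
b = inc_d(2)          # Delayed, not 3
c = add_d(a, b)       # Delayed that depends on a and b

print(type(c))
result = c.compute()  # runs inc_d twice (possibly in parallel), then add_d
print(result)
```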

We can now revisit the original code by using the wrapped functions defined above.

In [None]:
n = 20
data = [i+1 for i in range(n)]

out = []
for x in data:
    y = inc(x)
    z = double(y)
    out.append(z)
    
total = 0
for z in out:
    total = add(total, z)

total

- Note that `total` has not been calculated yet; it is a lazy `Delayed` object that only describes the work.
- We need to call `compute` to get the answer.

In [None]:
%%timeit -r 1
dask.compute(total)

We can also get the visual representation of the sequence of operations.

In [None]:
total.visualize()
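In the loop above, each `add` takes the previous `total` as input, so the 20 additions form a chain that must run one after another, however many workers are available. One possible alternative, sketched here with the hypothetical names `inc_d` and `double_d` and without the `sleep` calls, is to hand the whole list to a single delayed `sum`:

```python
import dask

@dask.delayed
def inc_d(x):
    return x + 1

@dask.delayed
def double_d(x):
    return 2 * x

data = list(range(1, 21))

# The 20 inc/double pairs are independent of each other and can run in parallel.
out = [double_d(inc_d(x)) for x in data]

# One reduction task that receives all 20 results at once.
total_d = dask.delayed(sum)(out)

result = total_d.compute()
print(result)
```

The answer is the same as with the chained `add` calls; only the shape of the task graph changes.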

### Example with DataFrame

Build a Pandas DataFrame with 100K rows and two columns of random integers between 0 and 999.

In [None]:
df = pd.DataFrame({'X':np.random.randint(1000, size=100000),
                   'Y':np.random.randint(1000, size=100000)})
df

Write a function that computes the sum of the squares of the two columns, `X**2 + Y**2`, for each row of the DataFrame.

In [None]:
def add_squares(df):
    return df.X**2+df.Y**2

Measure the time it takes to call the function:

In [None]:
%%timeit
df['add_squares'] = df.apply(add_squares,axis=1)

In [None]:
df
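Part of the cost measured above comes from `apply(..., axis=1)`, which calls the function once per row. Because `add_squares` only uses column arithmetic, it can also be called on the whole DataFrame at once. The sketch below, on a smaller made-up DataFrame named `small`, checks that both forms give the same values.

```python
import numpy as np
import pandas as pd

def add_squares(df):
    return df.X**2 + df.Y**2

rng = np.random.default_rng(0)
small = pd.DataFrame({"X": rng.integers(1000, size=1000),
                      "Y": rng.integers(1000, size=1000)})

row_wise = small.apply(add_squares, axis=1)  # one Python call per row
vectorized = add_squares(small)              # one call on whole columns

same = bool((row_wise == vectorized).all())
print(same)
```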

**Parallelize using Dask `map_partitions`**

- We construct a Dask DataFrame from the Pandas DataFrame using the `from_pandas` function and specify the number of partitions (`npartitions`) to break it into.
- We will break it into 4 partitions (one per core, assuming 4 available cores).

In [None]:
ddf = dd.from_pandas(df, npartitions=4)

We will apply the `add_squares` function to each of these partitions:

In [None]:
%%timeit
# compute() returns a pandas Series, so keep it in its own variable
z = ddf.map_partitions(add_squares,
                       meta=(None, 'int64')).compute()

In [None]:
def myfunc(x, y):
    return y * (x**2 + 1)

In [None]:
%%timeit
df1 = df.apply(lambda row: myfunc(row.X, row.Y), axis=1)

In [None]:
import multiprocessing
ddf = dd.from_pandas(df, npartitions=4*multiprocessing.cpu_count())
ddf

In [None]:
%%timeit
ddfz = ddf.map_partitions(lambda data: 
                              data.apply(lambda row: myfunc(row.X, row.Y), axis=1)).compute(scheduler='processes')
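The `scheduler` argument chooses how tasks are executed. Row-wise `apply` with a Python lambda holds the interpreter lock, which is a plausible reason for using `'processes'` in the cell above instead of the thread-based scheduler. The sketch below, with a hypothetical `square` function, shows that the same graph can be run with different schedulers: `'synchronous'` runs everything in the calling thread, which is convenient for debugging.

```python
import dask

@dask.delayed
def square(n):
    return n * n

total_sq = dask.delayed(sum)([square(i) for i in range(5)])

r_threads = total_sq.compute(scheduler="threads")      # thread pool
r_sync = total_sq.compute(scheduler="synchronous")     # single thread, no parallelism
print(r_threads, r_sync)
```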