# Big Data

### Dask Arrays

* David Booker-Earley
* 6/2/2020
<!-- * Checkpoint 4, Module 37 -->

---

In [1]:
import dask.array as da
import numpy as np

In [2]:
%%time
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z.compute()

Wall time: 908 ms


array([0.99760363, 1.00569824, 1.00193466, ..., 1.00602198, 1.00437257,
       0.99416104])

In [3]:
%%time
x = np.random.random((10000, 10000))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)

Wall time: 5.68 s


In [6]:
5680 / 908

6.255506607929515

* Implementing the code with a Dask array took $908$ milliseconds (ms) of wall time.
* Using NumPy took $5.68$ seconds (s), which is equivalent to $5680$ ms.
* The NumPy solution therefore took about $6.3$ times as long as the Dask solution ($5680 / 908 \approx 6.26$). Both figures come from a single `%%time` run, so the ratio is approximate.
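Because a single wall-time reading can vary from run to run, a repeated measurement gives a steadier comparison. The following is a minimal sketch using only the standard library; `time_runs` and `speedup` are hypothetical helper names introduced here for illustration, and the default of 7 repeats simply mirrors what `%%timeit` reports elsewhere in this notebook.

```python
import statistics
import time


def time_runs(func, repeats=7):
    """Call func() `repeats` times; return (mean, stdev) of wall time in seconds.

    Hypothetical helper, not part of Dask or NumPy. Needs repeats >= 2
    because the sample standard deviation is undefined for one value.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


def speedup(slow_seconds, fast_seconds):
    """Return how many times longer the slow run took than the fast run."""
    return slow_seconds / fast_seconds


# Ratio of the two single-run wall times reported above (5.68 s vs 908 ms).
print(round(speedup(5.68, 0.908), 2))  # 6.26
```

To time the cells above, wrap each one in a function (for example, one that builds the Dask array and calls `z.compute()`) and pass it to `time_runs`.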

## 1. How long does it take to run when setting chunks=(250, 250)?

In [4]:
%%timeit
x = da.random.random((10000, 10000), chunks=(250, 250))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z.compute()

2.64 s ± 81.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


* With `chunks=(250, 250)`, the code took approximately $2.64$ seconds per loop (mean of 7 runs).

## 2. How long does it take to run with chunks=(500, 500)? Why does this one or the previous one run more quickly?

In [5]:
%%timeit
x = da.random.random((10000, 10000), chunks=(500, 500))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z.compute()

1.05 s ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


* With `chunks=(500, 500)`, the code took approximately $1.05$ seconds per loop (mean of 7 runs).

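This variation ran more quickly than the `chunks=(250, 250)` code.
* `chunks=(500, 500)` doubled each chunk dimension and cut the runtime to about $40\%$ of the previous value ($1.05 / 2.64 \approx 0.40$).
* A $500 \times 500$ chunk holds $250000$ elements, four times the $62500$ elements of a $250 \times 250$ chunk, so the $10000 \times 10000$ array splits into $400$ chunks instead of $1600$.
* The total amount of arithmetic is the same in both cases, so the speedup most likely comes from reduced overhead: Dask has to create and schedule tasks for each chunk, and fewer, larger chunks mean fewer tasks to manage.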
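The effect of chunk size on the number of chunks can be checked with plain Python arithmetic. This is a rough sketch, not a Dask API: `chunk_stats` is a hypothetical helper, and the 8-byte item size assumes the `float64` values that `da.random.random` produces.

```python
import math


def chunk_stats(shape, chunks, itemsize=8):
    """Return (number of chunks, elements per full chunk, bytes per full chunk).

    Hypothetical helper for illustration. `itemsize=8` assumes float64 data.
    Chunks along an edge may be smaller when a dimension is not evenly divisible.
    """
    chunks_per_axis = [math.ceil(n / c) for n, c in zip(shape, chunks)]
    n_chunks = math.prod(chunks_per_axis)
    elements = math.prod(chunks)
    return n_chunks, elements, elements * itemsize


shape = (10000, 10000)
for chunks in [(250, 250), (500, 500), (1000, 1000)]:
    n_chunks, elements, nbytes = chunk_stats(shape, chunks)
    print(chunks, n_chunks, elements, nbytes)

# (250, 250) 1600 62500 500000
# (500, 500) 400 250000 2000000
# (1000, 1000) 100 1000000 8000000
```

Each doubling of the chunk edge length cuts the chunk count by a factor of four, which is consistent with the timings above getting shorter as the chunks grow from $250$ to $500$ to $1000$ per side.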