# Dask Array

Depending on the focus of your work, Dask Array is likely to be the first Dask interface you use after DataFrame, or perhaps the very first one (e.g., if you work primarily with NumPy).

Dask Array implements a subset of the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores. We coordinate these blocked algorithms using Dask graphs.

Dask arrays coordinate many NumPy arrays arranged into a grid. These NumPy arrays may live on disk or on other machines.

<img src="images/dask-array-black-text.svg">
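The grid of NumPy arrays described above can be inspected directly. The following is a minimal sketch, assuming a small 4x6 array of ones with 2x3 chunks (sizes chosen only for illustration); it looks at the grid layout and pulls out a single chunk.

```python
import numpy as np
import dask.array as da

# A small 4x6 array split into a 2x2 grid of 2x3 chunks
# (shape and chunk sizes are illustrative assumptions)
x = da.ones((4, 6), chunks=(2, 3))

print(x.numblocks)  # number of chunks along each axis
print(x.chunks)     # chunk sizes along each axis

# Select one chunk of the grid; computing it yields a plain NumPy array
block = x.blocks[0, 1].compute()
print(type(block), block.shape)
```

Each element of the grid is an ordinary NumPy array, which is why NumPy operations can be applied chunk by chunk.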

## Dask Arrays

- Dask arrays are chunked, n-dimensional arrays
- You can think of a Dask array as a collection of NumPy `ndarray`s
- Dask arrays implement a large subset of the NumPy API using blocked algorithms
- For many purposes Dask arrays can serve as drop-in replacements for NumPy arrays

In [None]:
from dask.distributed import Client

client = Client(n_workers=2, threads_per_worker=1, memory_limit='1GB')
client

In [None]:
import numpy as np
import dask.array as da

*These examples courtesy of Dask contributor James Bourbeau*

In [None]:
a_np = np.arange(1, 50, 3)
a_np

In [None]:
a_da = da.arange(1, 50, 3, chunks=5)
a_da

In [None]:
print(a_da.dtype)
print(a_da.shape)

In [None]:
print(a_da.chunks)
print(a_da.chunksize)

In [None]:
a_da.visualize()

In [None]:
(a_da ** 2).visualize()

In [None]:
(a_da ** 2).compute()

In [None]:
type((a_da ** 2).compute())

Dask arrays support a large portion of the NumPy interface:

- Arithmetic and scalar mathematics: `+`, `*`, `exp`, `log`, ...

- Reductions along axes: `sum()`, `mean()`, `std()`, `sum(axis=0)`, ...

- Tensor contractions / dot products / matrix multiply: `tensordot`

- Axis reordering / transpose: `transpose`

- Slicing: `x[:100, 500:100:-2]`

- Fancy indexing along single axes with lists or NumPy arrays: `x[:, [10, 1, 5]]`

- Array protocols like `__array__` and `__array_ufunc__`

- Some linear algebra: `svd`, `qr`, `solve`, `solve_triangular`, `lstsq`, ...

- ...

See the [Dask array API docs](http://docs.dask.org/en/latest/array-api.html) for full details about what portion of the NumPy API is implemented for Dask arrays.
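A few of the operations listed above can be tried on a tiny array. This is only a sketch: the 3x4 input and its 2x2 chunking are assumptions made for illustration, not part of the original examples.

```python
import numpy as np
import dask.array as da

# Wrap a small NumPy array as a Dask array with 2x2 chunks (illustrative sizes)
x = da.from_array(np.arange(12).reshape(3, 4), chunks=(2, 2))

# Reduction along an axis (lazy until .compute() is called)
col_sums = x.sum(axis=0)

# Fancy indexing along a single axis with a list
picked = x[:, [3, 0]]

# NumPy ufuncs dispatch to Dask through __array_ufunc__,
# so the result is still a lazy Dask array
y = np.sqrt(x)

print(col_sums.compute())
print(picked.compute())
print(type(y))
```

Nothing is computed until `.compute()` is called; each line before that only extends the task graph.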

### Blocked Algorithms

Dask arrays are implemented using _blocked algorithms_. These algorithms break up a computation on a large array into many computations on smaller pieces of the array. This reduces the memory load (amount of RAM) of computations and allows for working with larger-than-memory datasets in parallel.
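To make the idea concrete, here is a blocked sum written by hand with plain NumPy. It is a simplified sketch of the pattern, not Dask's actual implementation; the array contents and the block length of 5 are assumptions chosen for illustration.

```python
import numpy as np

# A small array standing in for a "large" one
x = np.arange(20, dtype=np.float64)
chunk_size = 5  # assumed block length

# Step 1: split the array into blocks
blocks = [x[i:i + chunk_size] for i in range(0, len(x), chunk_size)]

# Step 2: reduce each block independently (these steps could run in parallel)
partial_sums = [block.sum() for block in blocks]

# Step 3: combine the partial results into the final answer
blocked_total = sum(partial_sums)

print(len(blocks))
print(blocked_total, x.sum())
```

Only one block needs to be in memory at a time during step 2, which is what makes the approach work for arrays that do not fit in RAM.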

In [None]:
x = da.random.random(20, chunks=5)
x

In [None]:
result = x.sum()
result

In [None]:
result.visualize()

In [None]:
result.compute()

Because Dask supports a large portion of the NumPy API, you can build up more complex computations from the familiar NumPy operations you're used to.

In [None]:
x = da.random.random(size=(15, 15), chunks=(10, 5))
x

In [None]:
result = (x + x.T).sum()
result

In [None]:
result.compute()

We can even perform computations on arrays that do not fit in our workers' memory (two workers limited to 1 GB each)!

In [None]:
x = da.random.random(size=(20_000, 20_000), chunks=(2_000, 2_000))
x

In [None]:
result = (x + x.T).sum()
result

In [None]:
x.nbytes / 1e9    # Size of array in gigabytes
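The size reported above follows from the shape and element size. The arithmetic below is a back-of-the-envelope sketch, assuming 8-byte `float64` elements (the usual default for random floats).

```python
# Memory arithmetic for the 20,000 x 20,000 array with 2,000 x 2,000 chunks
shape = (20_000, 20_000)
chunk_shape = (2_000, 2_000)
itemsize = 8  # bytes per element, assuming float64

total_gb = shape[0] * shape[1] * itemsize / 1e9              # whole array, in GB
chunk_mb = chunk_shape[0] * chunk_shape[1] * itemsize / 1e6  # one chunk, in MB
n_chunks = (shape[0] // chunk_shape[0]) * (shape[1] // chunk_shape[1])

print(total_gb, chunk_mb, n_chunks)
```

The full array is larger than the combined 2 GB limit of the two workers, while each individual chunk is small enough to process comfortably.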

In [None]:
result.compute()

In [None]:
client.close()