# Dask Arrays

<img src="images/dask-array-black-text.svg" 
     align="right"
     alt="Dask arrays are blocked numpy arrays">
     
Dask arrays coordinate many NumPy arrays, arranged into chunks within a grid.  They support a large subset of the NumPy API.

## Start Dask Client for Dashboard

Starting the Dask Client is optional.  It provides a dashboard that 
is useful for gaining insight into the computation.

In [None]:
from dask.distributed import Client, progress
client = Client(processes=False, threads_per_worker=4, n_workers=1)
client

If running from Binder you can access the dashboard here:

-  [Dask Diagnostic Dashboard](../proxy/8787/status)

We recommend having it open on one side of your screen while using your notebook on the other side.  It can take some effort to arrange your windows, but seeing them both at the same time is very useful when learning.

## Create Random array

This creates a 10000x10000 array of random numbers, represented as many NumPy arrays of size 1000x1000 (or smaller if the array cannot be divided evenly). In this case there are 100 (10x10) NumPy arrays of size 1000x1000.

In [None]:
import dask.array as da
x = da.random.random((10000, 10000), chunks=(1000, 1000))
x
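The chunk arithmetic described above can be sketched in plain Python. This is only an illustration of how an axis splits into blocks, not Dask's own implementation; the helper names `chunk_sizes` and `chunk_grid` are hypothetical. On a real Dask array, the `.chunks` attribute reports the same kind of per-axis tuple.

```python
import math


def chunk_sizes(length, chunk):
    """Split one axis of `length` elements into blocks of at most `chunk`."""
    full, remainder = divmod(length, chunk)
    # The last block is smaller when the axis does not divide evenly.
    return (chunk,) * full + ((remainder,) if remainder else ())


def chunk_grid(shape, chunks):
    """Per-axis block sizes for an array of `shape` split by `chunks`."""
    return tuple(chunk_sizes(n, c) for n, c in zip(shape, chunks))


# The array from this section: divides evenly into a 10x10 grid.
even = chunk_grid((10000, 10000), (1000, 1000))
n_blocks_even = math.prod(len(axis) for axis in even)

# A made-up shape that does not divide evenly along the first axis.
uneven = chunk_grid((10500, 10000), (1000, 1000))
n_blocks_uneven = math.prod(len(axis) for axis in uneven)

print(n_blocks_even)     # 100
print(uneven[0][-1])     # 500, the smaller trailing block
print(n_blocks_uneven)   # 110
```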

Use NumPy syntax as usual.  These operations are lazy: they build up a task graph, and nothing is computed yet.

In [None]:
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z

Call `.compute()` when you want your result as a NumPy array.

If you started `Client()` above then you may want to watch the status page during computation.

In [None]:
z.compute()
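To convince yourself that the lazy expression and `.compute()` give the same answer as eager NumPy, here is a minimal sketch on a deliberately tiny array. The 8x8 shape, the 4x4 chunks, and the variable names are assumptions chosen only to keep the check fast.

```python
import numpy as np
import dask.array as da

# Small in-memory data so the NumPy reference is cheap to compute.
rng = np.random.default_rng(0)
small = rng.random((8, 8))

# Wrap it as a Dask array made of four 4x4 chunks.
x_small = da.from_array(small, chunks=(4, 4))

# Same style of expression as above; this only builds a task graph.
z_lazy = (x_small + x_small.T)[::2, 4:].mean(axis=1)
is_lazy = isinstance(z_lazy, da.Array)

# .compute() runs the graph and returns a NumPy array.
z_np = z_lazy.compute()

# Reference result computed eagerly with NumPy.
expected = (small + small.T)[::2, 4:].mean(axis=1)
matches = np.allclose(z_np, expected)

print(is_lazy, type(z_np).__name__, matches)
```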

## Persist data in memory

If you have the available RAM for your dataset then you can persist data in memory.  

This allows future computations to be much faster.

In [None]:
y = y.persist()

In [None]:
%time y[0, 0].compute()

In [None]:
%time y.sum().compute()
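One way to see what `persist` does, besides timing, is to compare the size of the task graph before and after. The sketch below assumes a tiny 8x8 array and uses `__dask_graph__()`, part of Dask's collection protocol, to count graph entries. After persisting, the graph typically holds little more than the already computed chunks, so later computations start from data in memory.

```python
import numpy as np
import dask.array as da

# Tiny example array: four 4x4 chunks.
x_small = da.random.random((8, 8), chunks=(4, 4))
y_lazy = x_small + x_small.T

# Number of entries in the task graph while everything is still lazy.
tasks_before = len(y_lazy.__dask_graph__())

# persist() computes the chunks and keeps them in memory,
# returning a new Dask array backed by those results.
y_mem = y_lazy.persist()
tasks_after = len(y_mem.__dask_graph__())

print(tasks_before, tasks_after)

# The persisted data is fixed: repeated computes return the same values.
first = y_mem.compute()
second = y_mem.compute()
stable = np.array_equal(first, second)
```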

## Access Data

You can interact with on-disk array stores such as HDF5.

### First, make a toy HDF5 file

First we make a fake dataset with NumPy.  

Typically you already have a file like this.

In [None]:
import numpy as np

x = np.random.random((10000, 10000))  # make fake dataset 

import h5py

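# Open in append mode explicitly: recent h5py versions default to read-only
with h5py.File('myfile.hdf5', mode='a') as f:
    # Use the shape and dtype of the NumPy data we are about to write
    dset = f.require_dataset('x', shape=x.shape, dtype=x.dtype, chunks=(100, 100))
    dset[:] = x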

### Read data from HDF5 file

We use `da.from_array` to load data from any object that supports NumPy slicing.

In [None]:
f = h5py.File('myfile.hdf5', mode='r')  # read-only; keep the file open while x is in use
dset = f['x']                           # existing on-disk dataset, not loaded into memory
x = da.from_array(dset, chunks=(1000, 1000))

In [None]:
x[:5, :5].compute()
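To illustrate what "any object that supports NumPy slicing" means, here is a minimal sketch that wraps a made-up store in a Dask array. `LoggingStore` is a hypothetical class, not part of Dask or h5py; it exposes `shape`, `dtype`, and `ndim`, answers slice requests from an in-memory NumPy array, and records each request so you can inspect which slices were asked for.

```python
import numpy as np
import dask.array as da


class LoggingStore:
    """Hypothetical array-like store that records every slice request."""

    def __init__(self, data):
        self._data = data
        # Attributes that da.from_array expects on an array-like input.
        self.shape = data.shape
        self.dtype = data.dtype
        self.ndim = data.ndim
        self.reads = []

    def __getitem__(self, index):
        # Dask asks for a tuple of slices; remember it and serve the data.
        self.reads.append(index)
        return self._data[index]


store = LoggingStore(np.arange(36.0).reshape(6, 6))

# Wrap the store lazily: a 2x2 grid of 3x3 chunks, nothing read yet.
arr = da.from_array(store, chunks=(3, 3))

# Reading a small corner triggers slice requests against the store.
corner = arr[:2, :2].compute()

print(corner)
print(store.reads)
```

An h5py dataset plays the same role as `LoggingStore` here, except that its slices are read from disk.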