# Dask Arrays from file to GPU

This is an example from [martindurant's repo](https://github.com/martindurant/dask-tutorial-scipy-2018). There you can find many more examples. [Here](https://www.youtube.com/watch?v=mqdglv9GnM8) you can find the fantastic presentation for which that repository was made.

We are going to use CuPy as the backend for Dask arrays so that the mean of daily weather measurements over a month is computed on a GPU.

In [None]:
import os
import h5py
import dask.array as da
import matplotlib.pyplot as plt

We are working here with global weather data in hdf5 format (Hierarchical Data Format). Each hdf5 file takes about 576MB on `SCRATCH` and holds one day of measurements.

In [None]:
filenames = ['2014-01-01.hdf5', '2014-01-02.hdf5', '2014-01-03.hdf5',
             '2014-01-04.hdf5', '2014-01-05.hdf5', '2014-01-06.hdf5',
             '2014-01-07.hdf5', '2014-01-08.hdf5', '2014-01-09.hdf5',
             '2014-01-10.hdf5', '2014-01-11.hdf5', '2014-01-12.hdf5',
             '2014-01-13.hdf5', '2014-01-14.hdf5', '2014-01-15.hdf5',
             '2014-01-16.hdf5', '2014-01-17.hdf5', '2014-01-18.hdf5',
             '2014-01-19.hdf5', '2014-01-20.hdf5', '2014-01-21.hdf5',
             '2014-01-22.hdf5', '2014-01-23.hdf5', '2014-01-24.hdf5',
             '2014-01-25.hdf5', '2014-01-26.hdf5', '2014-01-27.hdf5',
             '2014-01-28.hdf5', '2014-01-29.hdf5', '2014-01-30.hdf5',
             '2014-01-31.hdf5']

Let's open the hdf5 files.

In [None]:
datadir = '/scratch/snx3000/class256/weather-big'
dsets = [
    h5py.File(os.path.join(datadir, filename), mode='r')['/t2m']
    for filename in filenames
]

The code in the previous cell doesn't load the contents of the hdf5 files into memory. It just creates a list of hdf5 dataset objects which are <ins>*connected*</ins> to the files on disk.

In [None]:
dsets[0].shape

Let's plot the values for one of the days.

In [None]:
plt.imshow(dsets[0][::8, ::8], cmap='RdBu_r')  # Take every 8th element along rows and columns (with the `::8`) to plot faster
plt.axis('off')
plt.show()
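
The `::8` step keeps every eighth row and every eighth column, so the image has roughly 1/64 of the original points. Here is a minimal NumPy sketch of the same slicing on a small stand-in array (the name `grid` and its shape are made up for illustration and are not the real dataset):

```python
import numpy as np

# Hypothetical stand-in for one day's grid; the real dataset is much larger.
grid = np.arange(24 * 40).reshape(24, 40)

# Keep every 8th row and every 8th column, starting at index 0.
coarse = grid[::8, ::8]

print(grid.shape)    # (24, 40)
print(coarse.shape)  # (3, 5)
```

An h5py dataset accepts the same slice syntax and returns a NumPy array.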

Often we need to do calculations that depend on all the days. For instance, let's say that we need to calculate the average values over the month and plot them.

If we are not used to dealing with large datasets, we would probably create a numpy array with all the data and compute the mean over the axis that represents the days. However, with data that doesn't fit in memory, that is not possible: we would get a `MemoryError` exception (probably not on Piz Daint, because the data is not big enough). We would then have to come up with a way to compute the averages array by array, and significantly modify our scripts and workflows.
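
To make that manual approach concrete, here is a rough sketch of averaging day by day while holding only one day in memory at a time. The helper name `running_mean` and the small random arrays are assumptions for illustration; in the real workflow each item would be read from an hdf5 dataset.

```python
import numpy as np

def running_mean(daily_arrays):
    """Average equally shaped arrays one at a time.

    Only the accumulator and the current array are held in memory.
    """
    total = None
    count = 0
    for day in daily_arrays:
        day = np.asarray(day, dtype=np.float64)
        if total is None:
            total = np.zeros_like(day)
        total += day
        count += 1
    if count == 0:
        raise ValueError("need at least one array")
    return total / count

# Hypothetical stand-in data: 5 "days" of a 4x6 grid.
rng = np.random.default_rng(0)
days = [rng.normal(size=(4, 6)) for _ in range(5)]

manual = running_mean(iter(days))         # one day in memory at a time
reference = np.stack(days).mean(axis=0)   # needs all days in memory at once
print(np.allclose(manual, reference))
```

This works, but every new reduction (a standard deviation, a maximum, a sliding window) would need its own hand-written loop.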

This is where Dask arrays are useful.

## Connecting the hdf5 files to a Dask array

We first create a list of Dask arrays and stack them into a single one. This way, from the point of view of the programmer, <ins>it feels like there is only a single hdf5 file on disk</ins>.

In [None]:
arrays = [da.from_array(dset, chunks=(5760, 11520))
          for dset in dsets]
len(arrays)

In [None]:
x = da.stack(arrays)   # stack all the arrays into a single one
x
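
To see what stacking does to shapes and chunks without touching the real files, here is a small sketch with in-memory NumPy arrays standing in for the hdf5 datasets (the names `days` and `x_small` and the 4x6 grid are assumptions for illustration):

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in data: 3 "days", each a 4x6 grid filled with 0, 1, 2.
days = [np.full((4, 6), fill_value=float(i)) for i in range(3)]

# One Dask array per day, each stored as a single chunk here.
arrays_small = [da.from_array(day, chunks=(4, 6)) for day in days]

# Stacking adds a new leading axis that indexes the day.
x_small = da.stack(arrays_small)

print(x_small.shape)      # (3, 4, 6)
print(x_small.chunksize)  # (1, 4, 6)
```

Each day remains its own chunk along the new first axis, so a reduction over that axis can process the days one chunk at a time.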

In [None]:
x_cupy = x.to_backend('cupy')
x_cupy

Here we have only declared the arrays `x` and `x_cupy`. No data has been loaded into memory.

In [None]:
x_cupy[0]

At this point, nothing is loaded in memory. We have declared the Dask array `x`, which is connected to the hdf5 files on disk. When we compute something, Dask will fetch the data from disk on demand.
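
The difference between declaring and computing can be checked on a small NumPy-backed example that runs without a GPU. The stand-in data and the names `lazy_mean` and `result` are assumptions for illustration:

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in data: 3 "days", each a 4x6 grid filled with 0, 1, 2.
days = [np.full((4, 6), fill_value=float(i)) for i in range(3)]
x_small = da.stack([da.from_array(day, chunks=(4, 6)) for day in days])

# Building the expression only records tasks; no arithmetic runs yet.
lazy_mean = x_small.mean(axis=0)
print(type(lazy_mean).__name__)  # Array (a Dask array, not NumPy)

# compute() executes the task graph and returns a concrete array.
result = lazy_mean.compute()
print(type(result).__name__)     # ndarray
print(result[0, 0])              # 1.0
```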

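Note that when plotting a numpy-backed Dask array we don't need to call the `compute` method, because matplotlib *undaskifies* the array. In general that happens with functions that expect a numpy array. CuPy arrays, however, are not converted to numpy implicitly, so in the next cells we call `compute` explicitly and copy the result to the host before plotting.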

In [None]:
mean = x_cupy.mean(axis=0)

In [None]:
mean.visualize()

In [None]:
from cupy_timer import cupy_timer

In [None]:
with cupy_timer():
    x_cupy_mean = mean.compute()

In [None]:
type(x_cupy_mean)

In [None]:
x_cupy_mean.device

In [None]:
x_numpy_mean = x_cupy_mean.get()  # cupy -> numpy
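
The `.get()` call copies the result from GPU memory into a NumPy array on the host. If the same plotting code should also accept a NumPy-backed result, a small helper can hide the difference. The sketch below is an assumption for illustration (`to_host` is a hypothetical name, not part of CuPy or Dask), and it is exercised with a NumPy array so that it runs without a GPU:

```python
import numpy as np

def to_host(array):
    """Return a NumPy array for either a CuPy or a NumPy input.

    CuPy arrays expose a .get() method that copies the data to host memory;
    anything without such a method is passed through np.asarray.
    """
    get = getattr(array, "get", None)
    if callable(get):
        return get()
    return np.asarray(array)

host = to_host(np.arange(6).reshape(2, 3))  # NumPy input passes through
print(type(host).__name__)  # ndarray
```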

In [None]:
# plot the mean
plt.imshow(x_numpy_mean, cmap='RdBu_r')
plt.axis('off')
plt.show()