### Introduction

In this notebook I will be reading the book ['Python and HDF5, Unlocking Scientific Data'](http://shop.oreilly.com/product/0636920030249.do) and implementing the code given in the book, either as is, modified to suit my understanding, or extended to try something new. The notebook will be filled with a lot of code accompanied by comments mentioning the important concepts and intuitions. To finish the book faster and get as much content as possible into the notebook, I won't give the background of the HDF5 file format or other high level theoretical details, which can be found in the book or on other pages online, and will focus instead on the practical aspects of coding in Python and using HDF5 for storing scientific data.

The HDF5 web site can be found [here](https://support.hdfgroup.org/HDF5/)

In [11]:
import numpy as np
import h5py
import os

Let's look at an example of data collected from weather stations. Suppose we have 10 weather stations numbered from 1 to 10, and on 1-Jan-2017 each of them records a temperature in Fahrenheit and a wind speed in mph. We assume the readings are integers and not floating point numbers.

In [12]:
date = 1
month = 'Jan'
year = 2017

station_ids = range(1, 11, 1)
np.random.seed(0)
temperatures = np.asarray([np.random.randint(60, 80) for _ in station_ids])
wind_speeds = np.asarray([np.random.randint(0, 10) for _ in station_ids])

with h5py.File('weather.hdf5', 'w') as f:
    for station_id, temperature, wind_speed in zip(station_ids, temperatures, wind_speeds):        
        temperature_key = '/' + str(station_id) + '/temperature'
        f[temperature_key] = temperature
        f[temperature_key].attrs['date'] = date
        f[temperature_key].attrs['month'] = month
        f[temperature_key].attrs['year'] = year
        wind_speed_key = '/' + str(station_id) + '/wind_speed'
        f[wind_speed_key] = wind_speed
        f[wind_speed_key].attrs['date'] = date
        f[wind_speed_key].attrs['month'] = month
        f[wind_speed_key].attrs['year'] = year
        


Let us now open this file, retrieve the readings of station 7 and confirm they are what we wrote.

In [13]:
with h5py.File('weather.hdf5', 'r') as f:
    station7_temperature = f['/7/temperature']
    station7_wind_speed = f['/7/wind_speed']
    # Index with an empty tuple to read a scalar dataset
    # (the older `.value` property was removed in h5py 3.0)
    assert station7_temperature[()] == temperatures[6], 'Value not same as the one written'
    assert station7_wind_speed[()] == wind_speeds[6], 'Value not same as the one written'
    temperature_node_attrs = dict(station7_temperature.attrs.items())
    print('Temperature at station 7 is', station7_temperature[()], ',wind speed recorded is',
          station7_wind_speed[()], ', the date these measurements were taken is',
          '%d-%s-%d' % (temperature_node_attrs['date'], temperature_node_attrs['month'], temperature_node_attrs['year']))

Temperature at station 7 is 69 ,wind speed recorded is 7 , the date these measurements were taken is 1-Jan-2017


An HDF5 file is not loaded into memory in its entirety; only the data that is actually requested is read. In the above case the weather file may hold a lot of data, but only the necessary information about station 7 was read into memory when requested.
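To make the point about selective reads concrete, here is a minimal sketch that first lists the datasets in a file and then reads a single scalar from it. The file name `weather_sketch.hdf5`, the three-station layout and the helper `collect` are assumptions made for illustration; they mirror the weather example but are not part of it.

```python
import os
import tempfile

import h5py

# Hypothetical file, written to a temporary directory so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'weather_sketch.hdf5')

with h5py.File(path, 'w') as f:
    for station_id in range(1, 4):
        f['/%d/temperature' % station_id] = 60 + station_id
        f['/%d/wind_speed' % station_id] = station_id

dataset_names = []

def collect(name, obj):
    # visititems calls this once for every group and dataset below the root.
    if isinstance(obj, h5py.Dataset):
        dataset_names.append(name)

with h5py.File(path, 'r') as f:
    # Walking the hierarchy touches names and metadata, not the stored values.
    f.visititems(collect)
    # Only this one scalar value is requested from the file.
    station2_temperature = f['/2/temperature'][()]

print(sorted(dataset_names))
print('Temperature at station 2 is', station2_temperature)
```

`visititems` is handy for exploring a file whose layout is not known in advance, since the values themselves are only read when a dataset is indexed.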

Let's look at another example, where we create a dataset (we are yet to see what a dataset is).


In [14]:
with h5py.File('BigArrayFile.hdf5', 'w') as f:
    dataset = f.create_dataset('big', shape = (1024, 1024), dtype = 'float32')
    
stats = os.stat('BigArrayFile.hdf5')
print('Size of the file BigArrayFile.hdf5 is',stats.st_size, 'bytes')

Size of the file BigArrayFile.hdf5 is 1400 bytes


As we see above, we created an HDF5 file and in it a dataset called big of shape $1024 \times 1024$ and type float32. Yet, the size of the file on disk is only 1400 bytes. Let us set the element at index (2, 2) to the value 2.0.

In [15]:
with h5py.File('BigArrayFile.hdf5', 'r+') as f:
    dataset = f['big']
    dataset[2, 2] = 2.0
    
stats = os.stat('BigArrayFile.hdf5')
print('Size of the file BigArrayFile.hdf5 is',stats.st_size, 'bytes')

Size of the file BigArrayFile.hdf5 is 4195704 bytes


As we see above, once we wrote a single element of the dataset, space for the entire dataset was allocated on disk. The number of elements times 4 bytes per element (for float32) gives $1024 \times 1024 \times 4 = 4194304$ bytes; the size we see above is close to this number, the remainder being the few bytes HDF5 itself needs for metadata. Also, an interesting point to note is that a dataset can be large (too large to load entirely in memory), but only the parts accessed will be loaded in memory.
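The size arithmetic can be checked in code. The sketch below recomputes the expected number of data bytes from the shape and the dtype and compares it with the size of a freshly written file. The file name `size_sketch.hdf5` is an assumption, and the metadata overhead it reports may differ between HDF5 versions.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'size_sketch.hdf5')
shape = (1024, 1024)
dtype = np.dtype('float32')

# Number of elements times bytes per element: 1024 * 1024 * 4.
expected_data_bytes = shape[0] * shape[1] * dtype.itemsize

with h5py.File(path, 'w') as f:
    dset = f.create_dataset('big', shape=shape, dtype=dtype)
    dset[2, 2] = 2.0  # the first write triggers allocation of the dataset

file_bytes = os.stat(path).st_size
overhead_bytes = file_bytes - expected_data_bytes

print('Expected data bytes:', expected_data_bytes)
print('File size on disk  :', file_bytes)
print('Metadata overhead  :', overhead_bytes)
```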
 
HDF5 also supports compression of the data. Let's create a dataset of the same size, $1024 \times 1024$, but this time using gzip compression.

In [16]:
with h5py.File('BigCompressedArrayFile.hdf5', 'w') as f:
    dataset = f.create_dataset('big', shape = (1024, 1024), dtype = 'float32', compression = 'gzip')
    dataset[2, 2] = 2.0
    
stats = os.stat('BigCompressedArrayFile.hdf5')
print('Size of the file BigCompressedArrayFile.hdf5 is',stats.st_size, 'bytes')

Size of the file BigCompressedArrayFile.hdf5 is 4075 bytes


As we see above, the file containing a dataset of the same shape, but compressed, takes much less space on disk. There is however a tradeoff between space on disk and the CPU time required to compress and decompress the contents. We will talk more about this later in the notebook.
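How much space compression saves depends heavily on the data. The sketch below, which uses made-up arrays and the hypothetical helper `compressed_size`, writes an all-zero array and a random array of the same shape with gzip and compares the resulting file sizes. The almost-empty dataset above compresses so well precisely because nearly every value is the same.

```python
import os
import tempfile

import h5py
import numpy as np

tmpdir = tempfile.mkdtemp()

def compressed_size(name, data):
    # Hypothetical helper: write `data` with gzip and return the file size in bytes.
    path = os.path.join(tmpdir, name)
    with h5py.File(path, 'w') as f:
        f.create_dataset('arr', data=data, compression='gzip')
    return os.stat(path).st_size

rng = np.random.default_rng(0)
constant = np.zeros((512, 512), dtype='float32')   # highly repetitive
noisy = rng.random((512, 512), dtype=np.float32)   # close to incompressible

size_constant = compressed_size('constant.hdf5', constant)
size_noisy = compressed_size('noisy.hdf5', noisy)

print('gzip, all zeros  :', size_constant, 'bytes')
print('gzip, random data:', size_noisy, 'bytes')
```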

We have used ``h5py.File`` to open the file. The second parameter is the ``mode`` argument, which can be one of

* r  : read only, the file must exist; fails when the file is not present
* r+ : read/write, the file must exist; fails when the file is not present
* w  : write, creates a new file, truncating an existing file
* w- : write, same as w except that it does not truncate an existing file; the operation fails instead
* a  : read/write, creates a new file if one does not exist (unlike r+, which fails in that case)
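The behaviour of the modes listed above can be sketched as follows. The file name `modes_sketch.hdf5` and the `results` dictionary are assumptions for illustration. Failures are caught as `OSError`, on the assumption that the more specific exceptions raised by recent h5py releases derive from it.

```python
import os
import tempfile

import h5py

# Hypothetical file in a fresh temporary directory, so it does not exist yet.
path = os.path.join(tempfile.mkdtemp(), 'modes_sketch.hdf5')
results = {}

# 'r' on a missing file fails.
try:
    h5py.File(path, 'r').close()
    results['r on missing file'] = 'opened'
except OSError:
    results['r on missing file'] = 'failed'

# 'a' creates the file when it is missing.
with h5py.File(path, 'a') as f:
    f['x'] = 1

# 'w-' refuses to touch an existing file.
try:
    h5py.File(path, 'w-').close()
    results['w- on existing file'] = 'opened'
except OSError:
    results['w- on existing file'] = 'failed'

# 'r+' opens the existing file for reading and writing, keeping its contents.
with h5py.File(path, 'r+') as f:
    results['x present after r+'] = 'x' in f

# 'w' truncates the existing file.
with h5py.File(path, 'w') as f:
    results['x present after w'] = 'x' in f

for description, outcome in results.items():
    print(description, '->', outcome)
```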

---

#### Drivers

HDF5 drivers map the logical bytes of an HDF5 file onto low level storage. Following are some important drivers

* SEC2: Default driver, uses POSIX file system read/write functions on a single file
* STDIO: Uses stdio.h to perform buffered reads and writes on a single file
* CORE: Performs reads/writes in memory, with an option to create a backing file on the file system.
* FAMILY: Partitions a large file into multiple smaller files
* MPIO: Parallel HDF5, which allows multiple processes to read/write the same HDF5 file in parallel. We will see this type of driver in more detail later.
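As a small illustration of choosing a driver, the sketch below opens a file with the CORE driver and disables the backing file, so everything lives in memory and nothing should be written to disk. The file name `in_memory_sketch.hdf5` is an assumption; a name is still passed even though no file is expected to be created.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical path; with backing_store=False no file should appear here.
path = os.path.join(tempfile.mkdtemp(), 'in_memory_sketch.hdf5')

with h5py.File(path, 'w', driver='core', backing_store=False) as f:
    f['arr'] = np.arange(5)
    total = int(f['arr'][:].sum())
    driver_name = f.driver

exists_on_disk = os.path.exists(path)

print('Driver in use :', driver_name)
print('Sum of arr    :', total)
print('File on disk? :', exists_on_disk)
```

This is convenient for scratch data and for tests, where the HDF5 API is wanted but persistence is not.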

---

### Datasets

In this section we will look at datasets in HDF5.

Datasets are like NumPy arrays on disk. A dataset has a name, shape and type, and can be sliced like an in-memory NumPy array, except that the entire dataset need not be loaded into memory; only the accessed slices are read from disk as needed.

Let us now create a NumPy array and create a dataset out of it in an HDF5 file. Let's get some attributes of the dataset and the NumPy array and compare them.

In [17]:
arr = np.array([[1, 2, 3, 4, 5]])
print('Numpy array type is', arr.dtype, 'shape is', arr.shape, 'type of arr is', type(arr))

with h5py.File('DatasetTest.hdf5', 'w') as f:
    f['arr'] = arr

# Opening the file again, in read mode
with h5py.File('DatasetTest.hdf5', 'r') as f:
    a = f['arr']
    print('Dataset type is', a.dtype, 'shape is', a.shape, 'type of a is', type(a))

Numpy array type is int64 shape is (1, 5) type of arr is <class 'numpy.ndarray'>
Dataset type is int64 shape is (1, 5) type of a is <class 'h5py._hl.dataset.Dataset'>




We can see the similarity between the NumPy and h5py APIs.

Let's now read, modify and slice the dataset.

In [18]:
with h5py.File('DatasetTest.hdf5', 'a') as f:
    #1
    arr = f['arr'][:]
    print('1. Arr type is', arr.dtype, 'shape is', arr.shape, 'type of a is', type(arr))
    
    #2
    arr[:] = 10
    print('2. Arr now is', arr[:], 'Dataset is', f['arr'][:])
    
    #3
    arr = f['arr'][0, 2:4]
    print('3. Arr type is', arr.dtype, 'shape is', arr.shape, 'type is', type(arr), 'contents are', arr)
    
    #4
    f['arr'][:] = 10
    print('4. Dataset now is', f['arr'][:])

1. Arr type is int64 shape is (1, 5) type of a is <class 'numpy.ndarray'>
2. Arr now is [[10 10 10 10 10]] Dataset is [[1 2 3 4 5]]
3. Arr type is int64 shape is (2,) type is <class 'numpy.ndarray'> contents are [3 4]
4. Dataset now is [[10 10 10 10 10]]



We have 4 outputs above, and following are the observations

1. We can read the entire dataset into memory using `:` on the dataset object and get its contents as a NumPy array. Be careful when doing this on a large dataset: the entire dataset seldom needs to be loaded in memory, and in the worst case it may be much larger than the available memory. `...` can be used instead of `:` to reference the entire dataset.
2. Once in memory, this NumPy array is just a copy, and potentially a stale one, of the dataset in the file. There is no further synchronization between the array and the dataset in the file. As we see, any changes made to the contents of the array are local to the array and are not reflected when we read the contents of the file back.
3. This is how we will access an HDF5 dataset in practice. We slice just the part we need to read, and only this slice is loaded into memory by the HDF5 library.
4. Here we modify the entire dataset in the file. We would seldom do this in a real application, especially for large datasets; usually only a slice of the dataset is updated.
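The last observation mentions updating only a slice. A minimal sketch of that, using a made-up $4 \times 5$ array and the hypothetical file name `slice_update_sketch.hdf5`, could look like this:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'slice_update_sketch.hdf5')

with h5py.File(path, 'w') as f:
    dset = f.create_dataset('arr', data=np.arange(20).reshape(4, 5))

    # Write to two elements of row 1 only; the rest of the dataset is untouched.
    dset[1, 2:4] = [-1, -2]

    updated_row = dset[1, :]     # read back just the modified row
    untouched_row = dset[0, :]   # a row we never wrote to
    everything = dset[...]       # `...` reads the whole dataset, like `:`

print('Updated row  :', updated_row)
print('Untouched row:', untouched_row)
print('Full shape   :', everything.shape)
```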

---

We will now look at how we can control the datatypes of the data that is read from and written to the underlying HDF5 file.

Suppose we have a NumPy array with the double precision datatype float64, but we are OK with storing the data as float32 to save disk space, which also makes the IO faster.

Let us write the same double precision array to two files, one storing it as double precision and one as single precision floating point, and compare the sizes of the files.


In [19]:
arr = np.random.rand(100, 100)

with h5py.File('DoublePrecision.hdf5', 'w') as f:
    dset = f.create_dataset('arr', data = arr, dtype = 'float64')
    
with h5py.File('SinglePrecision.hdf5', 'w') as f:
    dset = f.create_dataset('arr', data = arr, dtype = 'float32')
    
dblstat = os.stat('DoublePrecision.hdf5')
singlestat = os.stat('SinglePrecision.hdf5')

print('Size of file holding float64 array is', dblstat.st_size, 
      ', Size of file holding float32 is', singlestat.st_size)

Size of file holding float64 array is 82144 , Size of file holding float32 is 42144



As we see above, the size of the file is roughly halved by changing the datatype. This is a good optimization provided that the change from double to single precision doesn't impact the application using the data. Similar choices can be made for integer datatypes, choosing between 8, 16 and 32 bits depending on the expected range of values the data can take. This is no different from choosing the datatype of a variable in a programming language.
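Whether the narrower type is acceptable depends on how much precision or range the application needs. The sketch below, which uses a made-up random array, measures the rounding error introduced by a float64 to float32 conversion and prints the ranges of a few integer types using NumPy's `finfo` and `iinfo`.

```python
import numpy as np

rng = np.random.default_rng(0)
arr64 = rng.random((100, 100))      # float64 values in [0, 1)
arr32 = arr64.astype('float32')     # what would be stored on disk

# Largest absolute difference introduced by rounding to single precision.
max_abs_error = np.abs(arr64 - arr32.astype('float64')).max()
float32_eps = np.finfo('float32').eps

print('Max rounding error:', max_abs_error)
print('float32 epsilon   :', float32_eps)

# Representable ranges of the integer types mentioned above.
for name in ('int8', 'int16', 'int32'):
    info = np.iinfo(name)
    print(name, 'range:', info.min, 'to', info.max)
```

If the error shown is below the noise level of the measurements being stored, single precision is usually a safe choice.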

Let's say we have written the float64 numpy array as a float32 array. This means when we read the value back, the datatype of the numpy array will be float32 which might not be desirable and we would want to seamlessly read the value into a float64 numpy array. There are a couple of ways we can do that as we see below.


In [20]:
with h5py.File('SinglePrecision.hdf5', 'r') as f:
    #1
    ds = f['arr']
    arr = ds[:]
    #2
    arr1 = np.empty(shape = ds.shape, dtype = 'float64')
    ds.read_direct(arr1)
    #3
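    # astype returns a wrapper that converts on read (h5py 3.x;
    # older releases used it as a context manager instead)
    arr2 = ds.astype('float64')[:]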

print('dtype of arr is', arr.dtype, 'dtype of arr1 is', arr1.dtype, 'dtype of arr2 is', arr2.dtype)
print('Equality check of arr and arr1 gives',(arr == arr1).all())
print('Equality check of arr1 and arr2 gives',(arr2 == arr1).all())

dtype of arr is float32 dtype of arr1 is float64 dtype of arr2 is float64
Equality check of arr and arr1 gives True
Equality check of arr1 and arr2 gives True



As we see from the above output, by default the array is read as float32, and the datatype of the original array written is lost. If we want to read the data back as a float64 array, the first approach (given in #2 above) is to initialize an empty NumPy array of the desired target type and pass it to the ``read_direct`` method of the HDF5 dataset instance. An alternative (and cleaner in my opinion) is given in #3.
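``read_direct`` also accepts source and destination selections, which allows part of a dataset to be read straight into part of an existing array while converting the type. The following is a sketch using a made-up $3 \times 4$ dataset and the hypothetical file name `read_direct_sketch.hdf5`; the selections are built with `np.s_`.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'read_direct_sketch.hdf5')

with h5py.File(path, 'w') as f:
    data = np.arange(12, dtype='float64').reshape(3, 4)
    f.create_dataset('arr', data=data, dtype='float32')  # stored as float32

with h5py.File(path, 'r') as f:
    ds = f['arr']
    out = np.zeros((2, 4), dtype='float64')
    # Read row 1 of the dataset into row 0 of `out`, converting to float64.
    ds.read_direct(out, source_sel=np.s_[1, :], dest_sel=np.s_[0, :])

print('dtype of out:', out.dtype)
print(out)
```

Because the destination array is allocated once and filled in place, this pattern avoids creating an intermediate float32 array.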

---

Let us see the default values of an empty dataset and see how to use a different default value

In [28]:
with h5py.File('FillValue.hdf5', 'w') as f:
    dseti = f.create_dataset('empty_ints', shape = (1, 2), dtype = 'int16')
    dsetf = f.create_dataset('empty_floats', shape = (1, 2), dtype = 'float32')
    arri = dseti[:]
    arrf = dsetf[:]
    print('1. Using Defaults: arri, ', arri, 'arrf, ', arrf)
    dseti = f.create_dataset('fillv_ints', shape = (1, 2), dtype = 'int16', fillvalue = -1)
    dsetf = f.create_dataset('fillv_floats', shape = (1, 2), dtype = 'float32', fillvalue = float('nan'))
    arri = dseti[:]
    arrf = dsetf[:]
    print('2. Using FillValues: arri, ', arri, 'arrf, ', arrf)
    

1. Using Defaults: arri,  [[0 0]] arrf,  [[ 0.  0.]]
2. Using FillValues: arri,  [[-1 -1]] arrf,  [[ nan  nan]]



As we see above, the default value for int as well as float is 0. The domain of the application might have 0 as a valid value, and thus we want a different value which clearly marks the value at that location in the array as absent rather than valid. In such a case we can use the ``fillvalue`` parameter to set a fill value other than 0.

Suppose our application domain requires non-negative values for int fields and non-NaN values for floating point fields; it then makes sense to use a negative value as the fill value for integers and NaN as the fill value for floats.
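Once a distinctive fill value is in place, it can be used to tell written locations from unwritten ones. The sketch below assumes a small made-up dataset called `readings` in the hypothetical file `fill_mask_sketch.hdf5`, where 0.0 is a legitimate reading and NaN marks a missing one.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'fill_mask_sketch.hdf5')

with h5py.File(path, 'w') as f:
    dset = f.create_dataset('readings', shape=(5,), dtype='float32',
                            fillvalue=float('nan'))
    dset[1] = 3.5
    dset[3] = 0.0   # a valid reading that happens to be zero

    values = dset[:]
    stored_fill = dset.fillvalue   # the fill value recorded with the dataset

# NaN never compares equal to itself, so use isnan rather than ==.
missing = np.isnan(values)

print('Values      :', values)
print('Missing mask:', missing)
print('Fill value  :', stored_fill)
```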

---

### Reading and writing Data

Let us create a dataset of size $100 \times 1000$.

In [36]:
arr = np.random.rand(100, 1000)
with h5py.File('Slicing.hdf5', 'w') as f:
    f['arr1'] = arr
    f['arr2'] = arr
    dset = f['arr1']
    #1
    print(dset)
    slice = dset[]
    

<HDF5 dataset "arr1": shape (100, 1000), type "<f8">



As we see above, we have two datasets of size $100 \times 1000$
If we want a slice of it of ``dset[10:20, ]