### Introduction

In this notebook I work through the book ['Python and HDF5, Unlocking Scientific Data'](http://shop.oreilly.com/product/0636920030249.do) and implement the code it presents, either as is, modified to suit my understanding, or extended to try something new. The notebook is mostly code, accompanied by comments that point out important concepts and intuitions. To finish the book faster and fit as much content as possible into the notebook, I won't cover the background of the HDF5 file format or other high-level theory, which can be found in the book or online. Instead, I focus on the practical side of coding in Python and using HDF5 to store scientific data.

The HDF5 web site can be found [here](https://support.hdfgroup.org/HDF5/).

In [83]:
import numpy as np
import h5py
import os

Let's look at an example of data collected from weather stations. Suppose we have 10 weather stations, numbered 1 to 10, and that on 1-Jan-2017 each of them records the temperature in Fahrenheit and the wind speed in mph. We assume the readings are integers rather than floating point numbers.

In [46]:
date = 1
month = 'Jan'
year = 2017

station_ids = range(1, 11, 1)
np.random.seed(0)
temperatures = np.asarray([np.random.randint(60, 80) for _ in station_ids])
wind_speeds = np.asarray([np.random.randint(0, 10) for _ in station_ids])

with h5py.File('weather.hdf5', 'w') as f:
    for station_id, temperature, wind_speed in zip(station_ids, temperatures, wind_speeds):
        # Each station gets its own group holding one scalar dataset per measurement
        temperature_key = '/' + str(station_id) + '/temperature'
        f[temperature_key] = temperature
        # The measurement date is stored as attributes on the dataset
        f[temperature_key].attrs['date'] = date
        f[temperature_key].attrs['month'] = month
        f[temperature_key].attrs['year'] = year
        wind_speed_key = '/' + str(station_id) + '/wind_speed'
        f[wind_speed_key] = wind_speed
        f[wind_speed_key].attrs['date'] = date
        f[wind_speed_key].attrs['month'] = month
        f[wind_speed_key].attrs['year'] = year


Let us now open this file, retrieve the readings of station 7, and confirm they match what we wrote.

In [78]:
with h5py.File('weather.hdf5', 'r') as f:
    # Indexing with an empty tuple reads a scalar dataset;
    # the older .value accessor was removed in h5py 3.0
    station7_temperature = f['/7/temperature'][()]
    station7_wind_speed = f['/7/wind_speed'][()]
    assert station7_temperature == temperatures[6], 'Value not same as the one written'
    assert station7_wind_speed == wind_speeds[6], 'Value not same as the one written'
    temperature_node_attrs = dict(f['/7/temperature'].attrs.items())
    print('Temperature at station 7 is', station7_temperature, ',wind speed recorded is'
          ,station7_wind_speed, ', the date these measurements were taken is',
          '%d-%s-%d'%(temperature_node_attrs['date'], temperature_node_attrs['month'], temperature_node_attrs['year']))

Temperature at station 7 is 69 ,wind speed recorded is 7 , the date these measurements were taken is 1-Jan-2017


An HDF5 file is not loaded into memory in its entirety; only the data that is actually requested is read. In the example above the weather file could hold a lot of data, but only the readings for station 7 were read into memory when we asked for them.
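The same idea applies within a single dataset. The sketch below is only an illustration: the file name `partial_read_demo.hdf5`, the dataset name `readings` and the temporary directory are made up for this example. Opening the dataset gives a handle, and slicing the handle is what triggers a read of the requested elements.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file created in a temporary directory, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), 'partial_read_demo.hdf5')

with h5py.File(path, 'w') as f:
    f['readings'] = np.arange(1000000, dtype='int32')

with h5py.File(path, 'r') as f:
    dataset = f['readings']        # a handle to the dataset, not the data itself
    dataset_shape = dataset.shape  # metadata is available without reading the values
    first_ten = dataset[:10]       # slicing returns the requested elements as a NumPy array

print('Dataset shape is', dataset_shape)
print('First ten readings are', first_ten)
```

The slice remains usable after the file is closed because it is an ordinary in-memory NumPy array, whereas the dataset handle is only valid while the file is open.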

Let's look at another example, where we create a dataset (we are yet to see what a dataset is).


In [90]:
with h5py.File('BigArrayFile.hdf5', 'w') as f:
    dataset = f.create_dataset('big', shape = (1024, 1024), dtype = 'float32')
    
stats = os.stat('BigArrayFile.hdf5')
print('Size of the file BigArrayFile.hdf5 is',stats.st_size, 'bytes')

Size of the file BigArrayFile.hdf5 is 1400 bytes


As we see above, we created an HDF5 file containing a dataset called big of shape $1024 \times 1024$ and type float32. Yet the size of the file on disk is only 1400 bytes. Let us set the element at index (2, 2) to 2.0.

In [104]:
with h5py.File('BigArrayFile.hdf5', 'r+') as f:
    dataset = f['big']
    dataset[2, 2] = 2.0
    
stats = os.stat('BigArrayFile.hdf5')
print('Size of the file BigArrayFile.hdf5 is',stats.st_size, 'bytes')

Size of the file BigArrayFile.hdf5 is 4195704 bytes


As we see above, once we wrote a single element, space for the entire dataset was allocated on disk. The number of elements times 4 bytes per element (for float32) gives $1024 \times 1024 \times 4 = 4194304$ bytes. The size we see above is very close to this number; the difference is the few bytes HDF5 itself needs for metadata. Another interesting point is that a dataset can be very large (too large to load into memory all at once), yet only the parts we access are loaded into memory.
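The size estimate can be checked with a few lines of arithmetic. In this sketch the observed size is simply copied from the output printed above; the rest follows from the shape and the dtype.

```python
import numpy as np

shape = (1024, 1024)
itemsize = np.dtype('float32').itemsize  # 4 bytes per float32 element
expected_bytes = shape[0] * shape[1] * itemsize

observed_bytes = 4195704                 # file size copied from the output above
overhead_bytes = observed_bytes - expected_bytes

print(expected_bytes, overhead_bytes)    # prints: 4194304 1400
```

The 1400 bytes left over match the size of the file before any element was written.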
 
HDF5 also supports compression of the data. Let's create a dataset of the same shape, $1024 \times 1024$, but this time with gzip compression.

In [100]:
with h5py.File('BigCompressedArrayFile.hdf5', 'w') as f:
    dataset = f.create_dataset('big', shape = (1024, 1024), dtype = 'float32', compression = 'gzip')
    dataset[2, 2] = 2.0
    
stats = os.stat('BigCompressedArrayFile.hdf5')
print('Size of the file BigCompressedArrayFile.hdf5 is',stats.st_size, 'bytes')

Size of the file BigCompressedArrayFile.hdf5 is 4075 bytes


As we see above, the file holding a compressed dataset of the same shape takes up far less space on disk. There is, however, a tradeoff between space on disk and the CPU time required to compress and decompress the contents. We will talk more about this later in the notebook.
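As a rough way to see this tradeoff, the sketch below writes the same array without compression and with gzip at levels 1 and 9 (set through `compression_opts`), then reports the file size and the time taken. The file names, the temporary directory and the synthetic, highly repetitive data are assumptions made for this illustration; the actual sizes and timings depend on the data and the machine.

```python
import os
import tempfile
import time

import h5py
import numpy as np

# Synthetic, highly repetitive data chosen only so that it compresses well
data = np.tile(np.arange(1024, dtype='float32'), (1024, 1))

tmp_dir = tempfile.mkdtemp()  # hypothetical scratch location for this sketch
settings = [
    ('none', {}),
    ('gzip-1', {'compression': 'gzip', 'compression_opts': 1}),
    ('gzip-9', {'compression': 'gzip', 'compression_opts': 9}),
]

results = {}
for label, kwargs in settings:
    path = os.path.join(tmp_dir, label + '.hdf5')
    start = time.perf_counter()
    with h5py.File(path, 'w') as f:
        f.create_dataset('big', data=data, **kwargs)
    elapsed = time.perf_counter() - start
    results[label] = (os.stat(path).st_size, elapsed)
    print('%-7s size: %8d bytes, write time: %.4f s' % (label, results[label][0], elapsed))
```

A single timed write is noisy, so this only hints at the cost; repeating the measurement several times would give a fairer comparison.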

We have used ``h5py.File`` to open the file. Its second parameter is the ``mode`` argument, which can be one of

* r  : read only; the file must exist, fails otherwise
* r+ : read/write; the file must exist, fails otherwise
* w  : write; creates a new file, truncating an existing one
* w- : write; like w, except that it fails if the file already exists instead of truncating it
* a  : read/write; creates the file if it does not exist (unlike r+, which fails in that case)
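The modes listed above can be exercised in a short sketch. The file name `modes_demo.hdf5`, the dataset name `x` and the `outcomes` dictionary are hypothetical and exist only for this illustration. Failures are caught as `OSError`, which is the base class of the file errors h5py raises here as far as I know.

```python
import os
import tempfile

import h5py

# Hypothetical path in a fresh temporary directory, so the file does not exist yet
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.hdf5')
outcomes = {}

# 'r' requires an existing file
try:
    with h5py.File(path, 'r'):
        outcomes['r on missing file'] = 'opened'
except OSError:
    outcomes['r on missing file'] = 'failed'

# 'w-' creates the file because it does not exist yet
with h5py.File(path, 'w-') as f:
    f['x'] = 1

# 'w-' refuses to touch a file that already exists
try:
    with h5py.File(path, 'w-'):
        outcomes['w- on existing file'] = 'opened'
except OSError:
    outcomes['w- on existing file'] = 'failed'

# 'a' opens the existing file read/write and keeps its contents
with h5py.File(path, 'a') as f:
    outcomes['a keeps'] = list(f.keys())

# 'w' truncates the existing file, so nothing is left in it
with h5py.File(path, 'w') as f:
    outcomes['w keeps'] = list(f.keys())

for step, outcome in outcomes.items():
    print(step, '->', outcome)
```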

---

TODO: Give some introduction on the type of drivers available
