# HDF5 File Creation and Conventions Documentation

#### Based on my experience with using the __[h5py](http://docs.h5py.org/en/stable/)__ library to create forcing files for MOHID, this notebook documents the recommended way of creating HDF5 files with a tree structure, compression variables and metadata attributes for datasets.

# Creating HDF5 files

#### The code that follows assumes that the `h5py` python library has been imported:
    
```python
import h5py
```

# Files

#### Create an empty HDF5 file. Creating an instance of a `File` object gives the HDF5 file a `root` or `/` Group and returns a `File` object, which can be assigned to a variable. The code that follows assumes that the `foo` file object is open.

```python
foo = h5py.File('foo.hdf5', 'w')
```
#### The first argument that `File` takes is the path and name of the `.hdf5` file that will be created, updated or read. The second argument is the mode with which to access the file. From the __[h5py documentation](http://docs.h5py.org/en/stable/high/file.html)__:

| Mode | What it does |
| --- | --- |
| 'r' | Read only, file must exist |
| 'r+' | Read/write, file must exist |
| 'w' | Create file, truncate if exists |
| 'w-' or 'x' | Create file, fail if exists |
| 'a' | Read/write if exists, create otherwise (default in h5py 2.x; h5py 3.0 and later default to 'r') |
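
#### As a minimal sketch of how the modes in the table behave, the example below creates a file with `'w'`, reopens it with `'r'`, and then shows that `'w-'` refuses to overwrite it. The file name `modes_demo.hdf5` and the temporary directory are assumptions made for illustration only; they are not part of the MOHID workflow.

```python
import os
import tempfile

import h5py

# Work in a throwaway directory so that no existing file is touched.
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.hdf5')

# 'w' creates the file (or truncates an existing one).
# The `with` block closes the file when it exits.
with h5py.File(path, 'w') as f:
    f.create_group('Results')

# 'r' opens the existing file read-only.
with h5py.File(path, 'r') as f:
    read_only_keys = list(f.keys())

# 'w-' fails because the file already exists.
try:
    h5py.File(path, 'w-')
    overwrite_refused = False
except OSError:
    overwrite_refused = True

print(read_only_keys, overwrite_refused)
```

#### Passing the mode explicitly, rather than relying on the default, keeps the behaviour the same across h5py versions.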

# About the Tree Structure

#### The way MOHID reads HDF5 files requires the datasets to be contained systematically in an arbitrary-arity tree (__[from CPSC110](https://eyqs.ca/assets/documents/UBCCPSC110.txt)__: a tree whose nodes have an arbitrary number of children).

#### The tree structure is comprised of:
#### - Groups: The containers that create the 'nodes'/'branches' of the tree that hold the datasets ('children'/'leaves')
#### - Datasets: Homogeneous collections (i.e. all elements are of one type, such as `float64`, `int`, and so on) of data. These are the 'children'/'leaves' of the tree.

#### Computer scientists draw their trees upside down
#### Consider the example:
                            foo.hdf5 (root or /)
                                |
                                |
               +------------------------------------+
               |                                    |
               |                                    |
             time (Group)                        Results (Group)
              - time_00001 (Dataset)                 |
              - time_00002 (Dataset)                 |
                                                     |
                                         +-----------------------+
                                         |                       |
                                         |                       |
                                      bar (Group)            baz (Group)
                                       - bar_00001 (Dataset)  - baz_00001 (Dataset)
                                       - bar_00002 (Dataset)  - baz_00002 (Dataset)

#### We can have nested Groups, such as `bar` and `baz`, which are children of the `Results` Group
#### The tree structure terminates with Datasets. (You could terminate with another Group, but that would just be an empty container.)
#### All children at a certain level must have unique names. For instance, we cannot have two Datasets under `time` called `time_00001`, or two Groups under `Results` called `bar`.
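
#### To make the diagram concrete, here is a small sketch that builds a reduced version of the example tree and walks it with `visititems()`, recording whether each node is a Group or a Dataset. The array contents and the file name `tree_demo.hdf5` are placeholders chosen for illustration, not values MOHID expects.

```python
import os
import tempfile

import h5py
import numpy

path = os.path.join(tempfile.mkdtemp(), 'tree_demo.hdf5')
found = []


def record(name, obj):
    """Store each node's path and kind; returning None continues the walk."""
    kind = 'Group' if isinstance(obj, h5py.Group) else 'Dataset'
    found.append((name, kind))


with h5py.File(path, 'w') as f:
    # Intermediate Groups in each path are created automatically.
    f.create_dataset('time/time_00001', data=numpy.zeros(6))
    f.create_dataset('Results/bar/bar_00001', data=numpy.ones((2, 2)))
    f.create_dataset('Results/baz/baz_00001', data=numpy.ones((2, 2)))
    f.visititems(record)

for name, kind in found:
    print(f'/{name} ({kind})')
```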

# Creating Groups

#### Group objects are created using the method `Group.create_group()`, where `Group` is a Group object. This method works on a newly created file object like `foo` because the file was created by default with a `root` or `/` Group. The `Group.create_group()` method takes a `name` argument, which is an identifier for the group, and returns a Group object that can be assigned to a variable. You can read more about Groups __[here](http://docs.h5py.org/en/stable/high/group.html#creating-groups)__

#### Create a Group on `root` or `/`
##### This creates "/Results" explicitly

```python
Results = foo.create_group('Results')
```


#### Create a nested Group by creating a new child group on an existing group
##### This creates "/Results/bar" explicitly

```python
bar = Results.create_group('bar')
```


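#### Define intermediate Groups implicitly. For instance, instead of (not in addition to) the two examples above, simply do:
##### This creates "/Results/bar" and, implicitly, its parent "/Results". Creating a Group whose name already exists raises an error.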

```python
bar = foo.create_group('Results/bar')
```
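
#### The sketch below illustrates the unique-name rule: a second `create_group()` call with the same path is refused, while `require_group()` returns the existing Group. It uses a temporary file named `groups_demo.hdf5`, which is an assumption for illustration only.

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), 'groups_demo.hdf5')

with h5py.File(path, 'w') as f:
    # Creates /Results implicitly and /Results/bar explicitly.
    bar = f.create_group('Results/bar')

    # Membership tests accept the same slash-separated paths.
    has_results = 'Results' in f
    has_bar = 'Results/bar' in f

    # A second Group with the same name is not allowed.
    try:
        f.create_group('Results/bar')
        duplicate_refused = False
    except ValueError:
        duplicate_refused = True

    # require_group() opens the Group if it exists, or creates it otherwise.
    same_bar = f.require_group('Results/bar')
    bar_path = same_bar.name

print(has_results, has_bar, duplicate_refused, bar_path)
```

#### `require_group()` is convenient when a script may be run more than once against the same open file.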

# Creating Datasets

#### Dataset objects are created using the method `Group.create_dataset()`, where `Group` is a Group object.
#### The `Group.create_dataset()` method takes the positional argument `name`, a string used to name the dataset, and, in the usage shown here, the keyword argument `data`, which can be a NumPy `ndarray` or a `list` (h5py requires either `data` or `shape`). For MOHID, we use NumPy `ndarray`s with `float64` for consistency, and because this is what I saw when reverse engineering Shihan's files.
#### The `Group.create_dataset()` method also accepts a multitude of other optional keyword arguments, some of which I use:
#### - `shape` is a `tuple` that describes the dimensions of `data`
#### - `chunks` is a `tuple` that describes the dimensions of the chunks we want to store `data` in. When reverse engineering Shihan's files, I saw that `chunks` was the same as `shape`, so I left it as is
#### - `compression` is a `str`. I use `'gzip'` for the reasons described __[here](http://docs.h5py.org/en/stable/high/dataset.html#lossless-compression-filters)__, and because 1) it produces an HDF5 file of a size comparable to that produced by Shihan's Matlab scripts and 2) it works with MOHID
#### - `compression_opts` is, for `'gzip'`, an `int` from 0 to 9 that sets the compression level. The default value is 4.
#### You can read more about datasets __[here](http://docs.h5py.org/en/stable/high/dataset.html)__

#### Suppose the data is held in a NumPy array, `dataarray`, defined as follows:
```python
import numpy
dataarray = numpy.ones([100,100]) # a 2D array of shape (100,100) with 1 everywhere
dataarray = dataarray.astype('float64') # convert all values to float64 
```

#### Create a Dataset on the `bar` Group using data from `dataarray`
```python
bar_00001 = bar.create_dataset(
    'bar_00001',
    shape = (100,100),
    data = dataarray,
    chunks = (100,100),
    compression = 'gzip',
    compression_opts = 1,
)
```
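
#### As a quick check, the sketch below creates the same kind of Dataset in a temporary file and reads back the properties that the keyword arguments control. The file name `dataset_demo.hdf5` is an assumption for illustration; the attribute names (`shape`, `dtype`, `chunks`, `compression`, `compression_opts`) are standard properties of an h5py `Dataset`.

```python
import os
import tempfile

import h5py
import numpy

path = os.path.join(tempfile.mkdtemp(), 'dataset_demo.hdf5')
dataarray = numpy.ones([100, 100], dtype='float64')

with h5py.File(path, 'w') as f:
    dset = f.create_dataset(
        'Results/bar/bar_00001',
        data=dataarray,
        chunks=(100, 100),
        compression='gzip',
        compression_opts=1,
    )
    # Collect the storage settings while the file is still open.
    summary = {
        'shape': dset.shape,
        'dtype': str(dset.dtype),
        'chunks': dset.chunks,
        'compression': dset.compression,
        'compression_opts': dset.compression_opts,
    }

# gzip is lossless, so the values read back match what was written.
with h5py.File(path, 'r') as f:
    roundtrip_ok = numpy.array_equal(f['Results/bar/bar_00001'][:], dataarray)

print(summary, roundtrip_ok)
```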

# Metadata

#### Metadata is assigned to a `Dataset` object to give MOHID vital information, such as the FillValue, Maximum and Minimum of a dataset. It is also assigned to a dataset for human convenience, such as the units of the data contained within the dataset

#### The `attrs` attribute of a `Dataset` object is a `dict`-like object that contains its metadata. It is updated using a Python `dict`.

#### For instance:
```python
metadata = {
    # `numpy` was imported above as `import numpy`
    'FillValue' : numpy.array([0.]),
    'Maximum' : numpy.array([5.]),
    'Minimum' : numpy.array([-5.]),
    'Units' : b'm/s'
    }
```
#### Note: Not all Datasets have a `FillValue` key, such as Time

#### `FillValue` is a flag used by MOHID to mask land values. I inherited this from Shihan's Matlab scripts. Notice the use of a NumPy array with a float value.
#### `Maximum` is the maximum value of the dataset. I inherited this from Shihan's Matlab scripts. Notice the use of a NumPy array with a float value.
#### `Minimum` is the minimum value of the dataset. I inherited this from Shihan's Matlab scripts. Notice the use of a NumPy array with a float value.
#### `Units` gives the units of the data contained within the dataset, for human convenience. I inherited this from Shihan's Matlab scripts. It is a byte string (`bytes`), as in the example above.

# Writing Metadata to a Dataset

#### You can use a `dict` to add metadata attributes to a named dataset
```python
bar_00001.attrs.update(metadata)
```
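
#### The sketch below writes the same kind of metadata to a Dataset in a temporary file, then reopens the file and reads the attributes back. The file name `attrs_demo.hdf5` and the small array are placeholders. Depending on the h5py version, a string attribute may come back as `bytes` or as `str`, so the example normalises it before use.

```python
import os
import tempfile

import h5py
import numpy

path = os.path.join(tempfile.mkdtemp(), 'attrs_demo.hdf5')

metadata = {
    'FillValue': numpy.array([0.]),
    'Maximum': numpy.array([5.]),
    'Minimum': numpy.array([-5.]),
    'Units': b'm/s',
}

with h5py.File(path, 'w') as f:
    dset = f.create_dataset('Results/bar/bar_00001', data=numpy.ones((4, 4)))
    dset.attrs.update(metadata)

with h5py.File(path, 'r') as f:
    attrs = f['Results/bar/bar_00001'].attrs
    attr_names = sorted(attrs.keys())
    maximum = float(attrs['Maximum'][0])
    minimum = float(attrs['Minimum'][0])
    units = attrs['Units']
    # Normalise to str, whichever type this h5py version returns.
    if isinstance(units, bytes):
        units = units.decode('ascii')

print(attr_names, maximum, minimum, units)
```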

# Closing a File

#### To flush the data to disk, the `h5py.File` object must be closed when you are done writing to it
```python
foo.close()
```
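
#### An alternative sketch, not part of the original workflow: opening the file in a `with` block closes it automatically when the block exits, even if an error occurs part-way through writing. The file name `close_demo.hdf5` and the data values are assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy

path = os.path.join(tempfile.mkdtemp(), 'close_demo.hdf5')

with h5py.File(path, 'w') as f:
    f.create_dataset('time/time_00001', data=numpy.arange(6, dtype='float64'))

# Leaving the `with` block has closed the file, so its identifier is no longer valid.
file_is_closed = not f.id.valid

# Reopening the file confirms that the data reached the disk.
with h5py.File(path, 'r') as f:
    values = f['time/time_00001'][:].tolist()

print(file_is_closed, values)
```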