# HDF5 File Creation and Conventions Documentation

#### Based on my experience with using the __[h5py](http://docs.h5py.org/en/stable/)__ library to create forcing files for MOHID, this notebook documents the recommended way of creating HDF5 files with a tree structure, compression variables and metadata attributes for datasets.

# Creating HDF5 files

#### The code that follows assumes that the `h5py` python library has been imported:
    
```python
import h5py
```

# Files

#### Create an empty HDF5 file. Creating an instance of a `File` object gives the HDF5 file a `root` or `/` Group and returns a `File` object which can be assigned to a variable. The code that follows assumes that the `foo` file object is open.

```python
foo = h5py.File('foo.hdf5', 'w')
```
#### The first argument that `File` takes is the path and name of the `.hdf5` file that will be created, updated or read. The second argument is the mode with which to access the file. From the __[h5py documentation](http://docs.h5py.org/en/stable/high/file.html)__:
| Mode | What it does |
| --- | --- |
| 'r' | Read only, file must exist |
| 'r+' | Read/write, file must exist |
| 'w' | Create file, truncate if exists |
| 'w-' or 'x' | Create file, fail if exists |
| 'a' | Read/write if exists, create otherwise (the default before h5py 3.0; from 3.0 onward the default is 'r') |
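
#### A minimal sketch of how three of these modes behave, assuming a throwaway temporary directory and a hypothetical file name `modes_demo.hdf5` (neither is part of the MOHID workflow):

```python
import os
import tempfile

import h5py

# Work in a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'modes_demo.hdf5')  # hypothetical file name

    # 'w' creates the file (or truncates an existing one).
    with h5py.File(path, 'w') as f:
        f.create_group('Results')

    # 'r' opens it read-only; the Group written above is visible.
    with h5py.File(path, 'r') as f:
        groups_seen = list(f.keys())

    # 'w-' refuses to overwrite a file that already exists.
    try:
        h5py.File(path, 'w-')
        exclusive_create_failed = False
    except OSError:
        exclusive_create_failed = True

print(groups_seen, exclusive_create_failed)
```

#### Using `'w-'` instead of `'w'` is a cheap guard against accidentally truncating a forcing file that took a long time to generate.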

# About the Tree Structure

#### The way MOHID reads in the HDF5 files requires the datasets to be contained systematically in an arbitrary-arity tree (__[from CPSC110](https://eyqs.ca/assets/documents/UBCCPSC110.txt)__: a tree whose nodes have an arbitrary number of children)

#### The tree structure is comprised of:
#### - Groups: The containers that create the 'nodes'/'branches' of the tree that hold the datasets ('children'/'leaves')
#### - Datasets: Homogeneous collections (i.e. all elements are of one type, such as `float64`, `int`, and so on) of data. These are the 'children'/'leaves' of the tree.

#### Computer scientists draw their trees upside down
#### Consider the example:
                            foo.hdf5 (root or / Group)
                                |
                                |
               +------------------------------------+
               |                                    |
               |                                    |
             Time (Group)                        Results (Group)
              - Time_00001 (Dataset)                 |
              - Time_00002 (Dataset)                 |
                                                     |
                                         +-----------------------+
                                         |                       |
                                         |                       |
                                      bar (Group)            baz (Group)
                                       - bar_00001 (Dataset)  - baz_00001 (Dataset)
                                       - bar_00002 (Dataset)  - baz_00002 (Dataset)

#### We can have nested Groups, such as `bar` and `baz`, which are children of the `Results` Group
#### The tree structure terminates with Datasets (you could terminate with another Group, but that would just be an empty container)
#### All children at a certain level must have unique names. For instance, we cannot have two Datasets under `Time` called `Time_00001` or two Groups under `Results` called `bar`

### Notes: 
#### 1) ALL the input HDF5 files for MOHID have a `Time` Group and a `Results` Group under which the required Groups are created. This is not configurable without going into the source code. The rest of the Groups can be customised, as I explain in the section 'The HDF5 structures for various input files' below.
#### 2) MOHID reads in the data sequentially by referring to the numerical reference. By its convention, all numerical references begin with `'_00001'`. They must be recorded chronologically, for instance, MOHID will crash if the timestamp in `Time_00001` occurs after `Time_00002`
#### 3) The name of every Dataset under a Group must be the name of that Group followed by its numerical reference, as seen in the diagram above
#### 4) Since Datasets do not contain information about the timestamp they pertain to, you must ensure that the numerical reference ascribed to a Dataset is the same as the numerical reference ascribed to its timestamp, e.g. a `bar_01100` will be assigned `Time_01100` by MOHID. A single file should therefore only contain Groups whose Datasets share the same timestamps. For instance, since WaveWatch3 outputs are twice as frequent as SalishSeaCast outputs and are output at different timestamps, they cannot be stored in the same HDF5 forcing file.
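
#### The naming and ordering rules in notes 2) to 4) can be sketched in plain Python. The helpers `numbered_name` and `is_chronological` are hypothetical names introduced here for illustration, and treating equal timestamps as invalid is an assumption, since the notes only say that an earlier-numbered timestamp must not occur after a later-numbered one:

```python
from datetime import datetime


def numbered_name(group_name, index):
    """Return the Dataset name for a Group and a 1-based numerical reference."""
    if index < 1:
        raise ValueError('numerical references start at 1 (_00001)')
    # Zero-pad the reference to five digits, e.g. ('bar', 1) -> 'bar_00001'.
    return '{}_{:05d}'.format(group_name, index)


def is_chronological(timestamps):
    """Return True if every timestamp is strictly later than the previous one."""
    return all(a < b for a, b in zip(timestamps, timestamps[1:]))


timestamps = [datetime(2019, 1, 1, 0), datetime(2019, 1, 1, 1)]

# A Dataset and its timestamp share the same numerical reference.
pairs = [
    (numbered_name('Time', i), numbered_name('bar', i))
    for i in range(1, len(timestamps) + 1)
]
print(pairs, is_chronological(timestamps))
```

#### Running a check like `is_chronological` before writing the file is cheaper than finding the problem through a MOHID crash.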

# The HDF5 structures for various input files

#### The names given to the `Groups` are defined in the input file blocks in the `.dat` configuration files

#### For instance, the `Hydrodynamic.dat` file we have been using contains the blocks:
```dat
! Excerpt from Hydrodynamic.dat
<begin_waterlevel>
NAME                      : water level
UNITS                     : m
DIMENSION                 : 2D
DEFAULTVALUE              : 0
INITIALIZATION_METHOD     : hdf
FILE_IN_TIME              : hdf
FILENAME                  : ./water_levels.hdf5
<end_waterlevel>

<begin_velocity_u>
NAME                      : velocity U
UNITS                     : m/s
DIMENSION                 : 3D
DEFAULTVALUE              : 0
INITIALIZATION_METHOD     : hdf
FILE_IN_TIME              : hdf
FILENAME                  : ./currents.hdf5
<end_velocity_u>

<begin_velocity_v>
NAME                      : velocity V
UNITS                     : m/s
DIMENSION                 : 3D
DEFAULTVALUE              : 0
INITIALIZATION_METHOD     : hdf
FILE_IN_TIME              : hdf
FILENAME                  : ./currents.hdf5
<end_velocity_v>

<begin_velocity_w>
NAME                      : velocity W
UNITS                     : m/s
DIMENSION                 : 3D
DEFAULTVALUE              : 0
INITIALIZATION_METHOD     : hdf
FILE_IN_TIME              : hdf
FILENAME                  : ./currents.hdf5
<end_velocity_w>
```

#### The `NAME` attribute gives the name of the Group that the quantity will be read from under the `Results` group of the input `.hdf5` file.
#### The `FILENAME` attribute will be the name of the `.hdf5` file MOHID will look for to find that Group. These will be symlinked to by MOHID-cmd. This means that you must include the filename as recorded in the `.dat` file in the `.yaml` file you use to set off the run under the `forcing` block, for instance:

```yaml
# .yaml
forcing:
  currents.hdf5: path of file containing 'velocity U', 'velocity V' and 'velocity W' Groups
  water_levels.hdf5: path of file containing the 'water level' Group
```
### Note:
#### 1) These two file paths can even refer to the same file. As long as the group exists under `Results`, it will be read.


#### If we decide to change the `FILENAME` in the `begin_velocity_w` block to say `./vertical_velocities.hdf5`, we then change the `forcing` block in the `.yaml` file to:

```yaml
# .yaml
forcing:
  currents.hdf5: path of file containing 'velocity U' and 'velocity V' Groups
  vertical_velocities.hdf5: path of file containing the 'velocity W' Group
  water_levels.hdf5: path of file containing the 'water level' Group
```

#### This means that we can set the tree structure to be what we want it to be, and can be flexible about which input variables we group together. I used the tree structures I inherited from Shihan, but you can make your own.

###  For instance, we can have:

                         currents.hdf5 (root or / Group)
                                |
                                |
               +------------------------------------+
               |                                    |
               |                                    |
             Time (Group)                        Results (Group)
              - Time_00001 (Dataset)                 |
              - Time_00002 (Dataset)                 |
                                                     |
                             +-------------------------------------------------------------------------+
                             |                                   |                                     |
                             |                                   |                                     |
                      Velocity U (Group)                 Velocity V (Group)                       Velocity W (Group)
                       - Velocity U_00001 (Dataset)       - Velocity V_00001 (Dataset)             - Velocity W_00001 (Dataset)
                       - Velocity U_00002 (Dataset)       - Velocity V_00002 (Dataset)             - Velocity W_00002 (Dataset)

### Or even:

                             currents.hdf5 (root or / Group)
                                |
                                |
               +------------------------------------+
               |                                    |
               |                                    |
             Time (Group)                        Results (Group)
              - Time_00001 (Dataset)                 |
              - Time_00002 (Dataset)                 |
                                                     |
                                     +-----------------------------------+
                                     |                                   |                                     
                                     |                                   |                                     
                              Velocity U (Group)                 Velocity V (Group)
                               - Velocity U_00001 (Dataset)       - Velocity V_00001 (Dataset)
                               - Velocity U_00002 (Dataset)       - Velocity V_00002 (Dataset)
                               
#### and

                        vertical_velocities.hdf5 (root or / Group)
                                |
                                |
               +------------------------------------+
               |                                    |
               |                                    |
             Time (Group)                        Results (Group)
              - Time_00001 (Dataset)                 |
              - Time_00002 (Dataset)                 |
                                                     |
                                                   +---+
                                                     |                                                              
                                                     |                                                           
                                              Velocity W (Group)                           
                                               - Velocity W_00001 (Dataset)     
                                               - Velocity W_00002 (Dataset)

### It depends on how you set it up, and as long as the rules are adhered to, MOHID will accept it

#### For the examples that follow, I shall be referring to:

                    foo.hdf5 (root or / Group)
                                |
                                |
               +------------------------------------+
               |                                    |
               |                                    |
             Time (Group)                        Results (Group)
              - Time_00001 (Dataset)                 |
              - Time_00002 (Dataset)                 |
                                                     |
                                         +-----------------------+
                                         |                       |
                                         |                       |
                                      bar (Group)            baz (Group)
                                       - bar_00001 (Dataset)  - baz_00001 (Dataset)
                                       - bar_00002 (Dataset)  - baz_00002 (Dataset)

# Creating Groups

#### Group objects are created using the method `Group.create_group()`, where `Group` is a Group object. This method works on a newly created file object like `foo` because the file was created by default with a `root` or `/` Group. The `Group.create_group()` method takes a `name` argument, which is an identifier for the group, and returns a Group object that can be assigned to a variable. You can read more about Groups __[here](http://docs.h5py.org/en/stable/high/group.html#creating-groups)__

#### Create a Group on `root` or `/`
##### This creates "/Results" explicitly

```python
Results = foo.create_group('Results')
```


#### Create a nested Group by creating a new child group on an existing group
##### This creates "/Results/bar" explicitly

```python
bar = Results.create_group('bar')
```


#### Implicitly define a Group, for instance, instead of the above two examples simply do:
##### This creates "/Results/bar" implicitly

```python
bar = foo.create_group('Results/bar')
```
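
#### A small sketch to confirm that the implicit form also creates the intermediate `Results` Group. It assumes an in-memory file (`driver='core'` with `backing_store=False`, so nothing is written to disk) and a hypothetical file name `tree_demo.hdf5`:

```python
import h5py

# In-memory file: handy for experimenting with tree layouts without touching disk.
with h5py.File('tree_demo.hdf5', 'w', driver='core', backing_store=False) as f:
    f.create_group('Results/bar')     # implicitly creates /Results as well
    f['Results'].create_group('baz')  # explicit child of the existing Group
    f.create_group('Time')

    # visit() calls the function once per object name in the tree;
    # list.append returns None, so the traversal continues to the end.
    names = []
    f.visit(names.append)

print(sorted(names))
```

#### Listing the tree this way is a quick check that a file matches the layout MOHID expects before any data is written.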

# Creating Datasets

#### Dataset objects are created using the method `Group.create_dataset()`, where `Group` is a Group object.
#### The `Group.create_dataset()` method takes the positional argument `name`, a string used to name the dataset, and, as used here, the keyword argument `data`, which can be a NumPy `ndarray` or a `list` (h5py needs either `data` or `shape` in order to size the dataset). For MOHID, we use NumPy `ndarray`s with `float64` for consistency, and because this is what I saw when reverse engineering Shihan's files.
#### The `Group.create_dataset()` method also accepts a multitude of other optional keyword arguments, some of which I use:
#### - `shape` is a `tuple` that describes the dimensions of `data`
#### - `chunks` is a `tuple` that describes the dimensions of the chunks we want to store `data` in. When reverse engineering Shihan's files, I saw that `chunks` was the same as `shape`, so I left it as is
#### - `compression` is a `str`. I use `'gzip'` for the reasons described __[here](http://docs.h5py.org/en/stable/high/dataset.html#lossless-compression-filters)__, and because 1) it produces an HDF5 file of a size comparable to that produced by Shihan's MATLAB scripts and 2) it works with MOHID
#### - `compression_opts` is an `int` from 0 to 9 that sets the gzip compression level. The default value is 4.

#### You can read more about datasets __[here](http://docs.h5py.org/en/stable/high/dataset.html)__

#### Suppose `data` is a NumPy array, defined as follows:
```python
import numpy
data = numpy.ones([100,100]) # a 2D array of shape (100,100) with 1 everywhere
data = data.astype('float64') # convert all values to float64 
```

#### Create a Dataset on the `bar` Group using data from `data`
```python
bar_00001 = bar.create_dataset(
    'bar_00001',
    shape = (100,100),
    data = data,
    chunks = (100,100),
    compression = 'gzip',
    compression_opts = 1,
)
```
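
#### To check that the settings were applied, the Dataset's properties can be read back. This is a self-contained sketch that assumes an in-memory file with the hypothetical name `inspect_demo.hdf5`; passing a path-like name to `create_dataset()` creates the intermediate Groups:

```python
import h5py
import numpy

data = numpy.ones((100, 100), dtype='float64')

with h5py.File('inspect_demo.hdf5', 'w', driver='core', backing_store=False) as f:
    dset = f.create_dataset(
        'Results/bar/bar_00001',  # intermediate Groups are created as needed
        data=data,
        chunks=(100, 100),
        compression='gzip',
        compression_opts=1,
    )

    # Collect the storage settings while the file is still open.
    summary = {
        'shape': dset.shape,
        'dtype': str(dset.dtype),
        'chunks': dset.chunks,
        'compression': dset.compression,
        'compression_opts': dset.compression_opts,
    }
    # gzip is lossless, so the values read back equal the values written.
    round_trip_ok = bool(numpy.array_equal(dset[...], data))

print(summary, round_trip_ok)
```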

# Metadata

#### Metadata is assigned to a `Dataset` object to give MOHID vital information. The FillValue, Maximum, Minimum and Units of a dataset are the only ones I have encountered so far that MOHID requires when it reads in a Dataset.

#### The `attrs` attribute of a `Dataset` object is a `dict`-like object that contains its metadata. It can be updated using a Python `dict`.

#### For instance:
```python
# numpy was imported above with `import numpy`
metadata = {
    'FillValue' : numpy.array([0.]),
    'Maximum' : numpy.array([5.]),
    'Minimum' : numpy.array([-5.]),
    'Units' : b'm/s'
    }
```
#### Note: Not all Datasets have a `FillValue` key, such as Time

#### `FillValue` is a flag used by MOHID to mask land values. I inherited this from Shihan's MATLAB scripts. Notice the use of a NumPy array with a float value.
#### `Maximum` is a flag used by MOHID. I inherited this from Shihan's MATLAB scripts. Notice the use of a NumPy array with a float value.
#### `Minimum` is a flag used by MOHID. I inherited this from Shihan's MATLAB scripts. Notice the use of a NumPy array with a float value.
#### `Units` are assigned to the dataset. I inherited this from Shihan's MATLAB scripts. It is a byte string (`bytes`, written as `b'm/s'` above). I suppose that this may be for human convenience, but MOHID uses it as well, for instance in `MohidWater/ModuleHydrodynamicFile.F90`

# Writing Metadata to a Dataset

#### You can use a `dict` to add metadata attributes to a named dataset
```python
bar_00001.attrs.update(metadata)
```
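
#### The attributes can be read back through the same `attrs` object. A self-contained sketch, assuming an in-memory file with the hypothetical name `attrs_demo.hdf5` and the example values used above:

```python
import h5py
import numpy

metadata = {
    'FillValue': numpy.array([0.]),
    'Maximum': numpy.array([5.]),
    'Minimum': numpy.array([-5.]),
    'Units': b'm/s',
}

with h5py.File('attrs_demo.hdf5', 'w', driver='core', backing_store=False) as f:
    dset = f.create_dataset('Results/bar/bar_00001', data=numpy.ones((100, 100)))
    dset.attrs.update(metadata)

    # attrs behaves like a dict: keys are attribute names, values come back as NumPy objects.
    attr_names = sorted(dset.attrs.keys())
    maximum = float(dset.attrs['Maximum'][0])
    minimum = float(dset.attrs['Minimum'][0])

print(attr_names, maximum, minimum)
```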

# Closing a File

#### To flush the data to disk, the `h5py.File` object must be closed when you are done writing to it
```python
foo.close()
```
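
#### An alternative to calling `close()` by hand is to open the file in a `with` block, which closes it even if an exception is raised part-way through. The sketch below combines the steps from this notebook under that pattern. `write_results_group` is a hypothetical helper, not part of h5py or MOHID, and it writes only the `Results` side of the tree: the layout of the `Time` Datasets is not covered in this notebook, so the output is not a complete MOHID input file.

```python
import os
import tempfile

import h5py
import numpy


def write_results_group(path, group_name, fields, metadata):
    """Write each array in `fields` as a numbered Dataset under /Results/<group_name>.

    Hypothetical helper for illustration only.
    """
    # The context manager closes (and therefore flushes) the file on exit.
    with h5py.File(path, 'w') as f:
        group = f.create_group('Results/' + group_name)
        for index, field in enumerate(fields, start=1):
            field = numpy.asarray(field, dtype='float64')
            dset = group.create_dataset(
                '{}_{:05d}'.format(group_name, index),  # e.g. bar_00001
                data=field,
                chunks=field.shape,  # chunks equal to shape, as described above
                compression='gzip',
                compression_opts=1,
            )
            dset.attrs.update(metadata)


# Example values only; real files need the metadata appropriate to each variable.
metadata = {
    'FillValue': numpy.array([0.]),
    'Maximum': numpy.array([5.]),
    'Minimum': numpy.array([-5.]),
    'Units': b'm/s',
}
fields = [numpy.ones((100, 100)), numpy.zeros((100, 100))]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'foo.hdf5')
    write_results_group(path, 'bar', fields, metadata)

    # Re-open read-only to confirm what ended up on disk.
    with h5py.File(path, 'r') as f:
        written = sorted(f['Results/bar'].keys())

print(written)
```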