## Accessing HDF files in Python

### HDF file structure

HDF files have a hierarchical structure that consists of a directory and a collection of data objects. Every data object has a pointer to the data object location and information on its datatype. Different HDF formats allow different types of data objects.<br>
Related data objects can be grouped into datasets, much as subdirectories are used to organise files within a computer directory. <br><br>
While the netCDF4 format is based on HDF5, HDF files usually have a much more complex structure. You can think of HDF as groups of datasets. Hence, when you are reading an HDF file you first retrieve the list of groups, then the list of datasets for a particular group. You can then access each dataset in a similar manner to a netCDF file.<br><br>
The HDF formats are the older HDF4 and the current HDF5; there are also HDF4-EOS and HDF5-EOS, which are used by NASA.<br><br>
Since HDF files can be so complex, and it might not be immediately obvious which format you're dealing with, it is useful to open them with a viewer before accessing them. HDFView is specific to HDF and freely available; you can also use Panoply, which also opens netCDF files and is already available on the VDI.
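
If a viewer is not at hand, the first bytes of a file give a rough hint of the base format. The sketch below is an illustration only: the function name `guess_hdf_format` is made up, it checks the signature at the very start of the file only (an HDF5 file with a user block stores it further in), and it cannot tell the EOS variants from plain HDF4 or HDF5. A netCDF4 file will be reported as HDF5, since it is one.

```python
# Leading bytes that identify each base format
HDF4_MAGIC = b"\x0e\x03\x13\x01"
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"


def guess_hdf_format(path):
    """Return 'HDF4', 'HDF5' or 'unknown' based on the first bytes of a file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(HDF5_MAGIC):
        return "HDF5"
    if head.startswith(HDF4_MAGIC):
        return "HDF4"
    return "unknown"


# Usage, with any path you have access to:
# print(guess_hdf_format("/path/to/some_file.hdf"))
```

h5py also offers `h5py.is_hdf5(path)` for the HDF5 case.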

There are a few Python modules that can read HDF data. Here's a list with links to the available documentation:
+ [pyhdf](http://fhs.github.io/pyhdf/) - a Python interface to the HDF4 library, which can also be used for HDF4-EOS data
+ [h5py](https://docs.h5py.org/en/stable/) - for HDF5 files
+ xarray - currently works only with HDF4 and HDF5, not with HDF-EOS
+ [rasterio](https://rasterio.readthedocs.io/en/latest/) - reads all raster data, can be used on its own or in conjunction with xarray
+ [pynio](https://www.pyngl.ucar.edu/NioFormats.shtml#HDF) - can read and write HDF data that uses the SDS (Scientific Data Set) interface, but can only read HDF Vdata and the SWATH and GRID data groups in HDF-EOS. POINT data groups are ignored
+ [pandas PyTables](https://pandas.pydata.org/docs/user_guide/io.html#hdf5-pytables) - limited to the specific dataset structures that pandas understands
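
Not every environment has all of these installed. A quick way to check, sketched here with the standard library only; the helper name `available_modules` is made up, and the list uses the import names (PyNIO is imported as `Nio`, PyTables as `tables`):

```python
from importlib.util import find_spec

# Import name of each reader listed above
READERS = ["pyhdf", "h5py", "xarray", "rasterio", "Nio", "tables"]


def available_modules(names):
    """Map each top-level module name to True if it can be found here."""
    return {name: find_spec(name) is not None for name in names}


for name, found in available_modules(READERS).items():
    print(f"{name:10s} {'installed' if found else 'missing'}")
```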

In [1]:
import xarray as xr
import rasterio as rio
import numpy as np
import pyhdf
import h5py
import netCDF4
import pandas as pd

### HDF4

While HDF4 has been replaced by HDF5 for most new satellite products, NASA still uses a modified version (HDF4-EOS) for the MODIS products. You might also encounter it when using older datasets.<br>
The HDF4 format and library support the following eight basic objects:
 1. Scientific dataset (SDS), a multidimensional array with dimension scales
 2. 8-bit raster image (RIS8), a 2-dimensional array of 8-bit pixels
 3. 24-bit raster image (RIS24), a 2-dimensional array of 24-bit pixels
 4. General raster image (GR), a 2-dimensional array of multi-component pixels
 5. 8-bit color lookup table (palette), a 256 by 3 array of 8-bit integers
 6. Table (Vdata), a sequence of records
 7. Annotation, a stream of text that can be attached to any object
 8. VGroup, a structure for grouping objects <br><br>
 
It is likely you will deal only with the first one, but it's good to know that HDF4 allows different and more complex objects compared to HDF5. While some Python modules might be able to deal with a relatively simple HDF4 file, more complex files need a module that can handle that complexity.

#### HDF4-EOS

NASA adapted both HDF4 and HDF5 to contain additional geolocated data types (point, grid, swath) from the Earth Observing System (EOS). HDF4-EOS, the format adapted from HDF4, is the one currently used for MODIS data products (1).

HDF4 files can be opened in Python using any of the following modules:
  + rasterio
  + xarray
  + pyhdf
  + [pynio](https://www.pyngl.ucar.edu/NioFormats.shtml#HDF), which is a better option than pyhdf if you have datasets with the same name in different groups. PyNIO can read and write HDF data that uses the SDS (Scientific Data Set) interface, but can only read HDF Vdata and the SWATH and GRID data groups in HDF-EOS.

Which one to choose depends on the operations you want to perform on the data, and on whether you want to access other file formats in the same way. In some cases they are fully equivalent, and you can choose the one you are most familiar with.

Let's start with rasterio, which is a wrapper around GDAL: you can use rasterio to open any kind of raster data, provided that the underlying GDAL has been compiled with support for that file format.
MODIS data usually comes as HDF4, so I'm using some files we have in project ua8 as an example.

In [None]:
trmm = '/g/data/ua8/Precipitation/TRMM/3B42/hdf/1998/3B42.19980101.00.7.HDF'
# HDF4-EOS
modis = '/g/data/ua8/tmp/hdf_data/MOD09GA.A2016189.h09v05.006.2016191073856.hdf'

If you try to open the file using the xarray.open_dataset() function it will fail, because xarray doesn't recognise the HDF4-EOS format. rasterio can open it and list its sub-datasets:

In [None]:
with rio.open(modis) as dataset:
    print(dataset)
    hdf4_meta = dataset.meta
    for name in dataset.subdatasets:
        print(name)

In [None]:
# read all the variables on the 500m 2D grid from the file
modis_ds = rio.open(modis)
# You need to loop over the sds, open them and read the data as a list of numpy arrays
sds_list = []
# while looping we can also access the sds metadata and store it in a dictionary
modis_meta = {}
for sds in modis_ds.subdatasets:
    # selecting all variables on 500m_2D grid
    if 'Grid_500m_2D' in sds:
        with rio.open(sds) as subd:
            sds_list.append(subd.read(1))
            # using only variable name for meta dictionary key
            key = sds.split(":")[-1]
            modis_meta[key] = subd.profile
# stack the list of arrays in one
modis_data = np.stack(sds_list)
print(modis_data.shape, "\n")
print(modis_meta['sur_refl_b01_1'])
# Close the file (the object opened above is modis_ds)
modis_ds.close()
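
The dictionary key in the cell above is the last field of the sub-dataset name. The whole name can be split in the same spirit. This is a minimal sketch that assumes the layout DRIVER:TYPE:path:GRID:VARIABLE, which is what GDAL uses for HDF4-EOS grids; the function name `parse_subdataset_name` and the example string are made up for illustration, so compare them with the names printed for your own file.

```python
def parse_subdataset_name(name):
    """Split an HDF4-EOS sub-dataset name into its fields.

    Assumes the layout DRIVER:TYPE:path:GRID:VARIABLE.
    """
    # The path may contain ':' (e.g. a Windows drive letter), so take two
    # fields from the left, two from the right, and keep the rest as the path
    driver, sds_type, rest = name.split(":", 2)
    path, grid, variable = rest.rsplit(":", 2)
    return {
        "driver": driver,
        "type": sds_type,
        "path": path.strip('"'),  # GDAL itself quotes the path
        "grid": grid,
        "variable": variable,
    }


# Hypothetical name, for illustration only
example = "HDF4_EOS:EOS_GRID:/data/example.hdf:MODIS_Grid_500m_2D:sur_refl_b01_1"
print(parse_subdataset_name(example))
```

Keeping the grid name alongside the variable name helps when the same variable name appears on more than one grid.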

Remember you can always use dir() to find out all the methods and attributes of a Python object:
  >  print( dir( modis_ds ) )
  
will give you a list of all the operations you can do on the rasterio file object.<br>
We could get the data but we can't access the information on the variables, i.e. which bands correspond to which variable. If you open the same file with Panoply or HDFView you will see that such information is available in the file itself.

We could recover the variable information for this file, but this is not always the case: if we try to read the TRMM file instead, we can get the data values but the information on the variables is lost.<br>
Let's see an example, this time using rasterio with xarray.<br>
xarray provides the function *xarray.open_rasterio()*, which uses *rasterio* to open the files. <br> 
NB. This function is experimental, so it might not work with all files and its behaviour might change in the future; later xarray releases deprecate it in favour of *rioxarray.open_rasterio()*.

In [None]:
# Use rasterio to get the subdatasets names
with rio.open(trmm) as trmm_ds:
    sds = trmm_ds.subdatasets

print(sds[0])
trmm_xr = xr.open_rasterio(sds[0])
trmm_xr

To load all the sub-datasets we need to use a loop. After saving the single arrays in a list, we concatenate them along the 'band' dimension. <br>
Again, we can't access the actual metadata that would help us work out what each band represents.

In [None]:
data = []
for s in sds:
    data.append(xr.open_rasterio(s))
trmm_xr = xr.concat(data, dim='band') 
trmm_xr
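
If you know the variable names from another source (a viewer, or the sub-dataset names), you can attach them to the 'band' dimension yourself. A small sketch with synthetic arrays standing in for the output of open_rasterio(); the variable names are placeholders, not read from the TRMM file:

```python
import numpy as np
import xarray as xr

# Placeholder variable names, one per sub-dataset
names = ["precipitation", "relativeError", "satPrecipitationSource"]

# Synthetic single-band arrays, shaped like what open_rasterio() returns
arrays = [
    xr.DataArray(
        np.full((1, 2, 3), i, dtype="float32"),
        dims=("band", "y", "x"),
        coords={"band": [1]},
    )
    for i in range(len(names))
]

stacked = xr.concat(arrays, dim="band")
# Every sub-dataset had band index 1, so replace the repeated index with names
stacked = stacked.assign_coords(band=names)

# Variables can now be selected by name
print(stacked.sel(band="relativeError"))
```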

In [None]:
# Alternatively you could use gdal rather than rasterio to load the subdatasets.
# Remember rasterio is a gdal wrapper!

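# the bindings live in the osgeo package; a bare "import gdal" fails with recent GDAL releases
from osgeo import gdal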

g = gdal.Open(trmm)
subdatasets = g.GetSubDatasets()
# To close the gdal file
g = None

subdatasets

Rasterio can do the job with or without xarray, but it's a bit clunky compared to pyhdf, which gives direct access to the datasets and their attributes.

In [None]:
#pyhdf.HDF.HDF(modis)

In [None]:
from pyhdf.SD import SD, SDC
mod = SD(modis, SDC.READ)
print(dir(mod))

In [None]:
for attr in mod.attributes():
    print(attr)

In [None]:
for key,ds in mod.datasets().items():
    print(f"{key}:  {ds}")
print(type(ds))

In [None]:
ds_obj = mod.select('sur_refl_b01_1') # select  a ds

data = ds_obj.get() # get ds data
print(data)
# get attributes
print(ds_obj.attributes())
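
The attribute dictionary is where you usually find the fill value of a variable. A minimal sketch of using it to mask missing data, assuming the dictionary has a `_FillValue` key; the array and attribute values below are hypothetical stand-ins for ds_obj.get() and ds_obj.attributes(), and `mask_fill` is a made-up helper name:

```python
import numpy as np


def mask_fill(data, attrs):
    """Mask the fill value listed in an attribute dictionary, if there is one."""
    fill = attrs.get("_FillValue")
    if fill is None:
        return np.ma.masked_array(data)
    return np.ma.masked_equal(data, fill)


# Hypothetical values, for illustration only
data = np.array([[120, -28672], [340, 560]], dtype="int16")
attrs = {"_FillValue": -28672, "units": "reflectance"}

masked = mask_fill(data, attrs)
print(masked)
print(masked.mean())
```

Scale factors and offsets are often stored in the same dictionary, but products differ in how they should be applied, so check the product documentation before using them.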

### HDF5

HDF5 files are organized in a hierarchical structure, with two primary structures: groups and datasets.

+ HDF5 group: a grouping structure containing instances of zero or more groups or datasets, together with supporting metadata.
+ HDF5 dataset: a multidimensional array of data elements, together with supporting metadata.

Working with groups and group members is similar in many ways to working with directories and files in UNIX. As with UNIX directories and files, objects in an HDF5 file are often described by giving their full (or absolute) path names.

+ / signifies the root group.
+ /foo signifies a member of the root group called foo.
+ /foo/zoo signifies a member of the group foo, which in turn is a member of the root group.

Any HDF5 group or dataset may have an associated attribute list. An HDF5 attribute is a user-defined HDF5 structure that provides extra information about an HDF5 object.

#### HDF5 Groups

An HDF5 group is a structure containing zero or more HDF5 objects. A group has two parts:

+ A group header, which contains a group name and a list of group attributes.
+ A group symbol table, which is a list of the HDF5 objects that belong to the group.
#### HDF5 Datasets

A dataset is stored in a file in two parts: a header and a data array.
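
To make the group and path ideas concrete, here is a small sketch using h5py. It builds a tiny file in memory (nothing is written to disk) and then walks it, printing the absolute path of every object. The group and dataset names are hypothetical and chosen only for illustration.

```python
import h5py
import numpy as np

found = []


def visit(name, obj):
    """Record the absolute path, kind and shape of each object in the file."""
    kind = "group" if isinstance(obj, h5py.Group) else "dataset"
    found.append((f"/{name}", kind, getattr(obj, "shape", None)))


# driver="core" with backing_store=False keeps the file in memory only
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    grid = f.create_group("Grid")
    grid.attrs["description"] = "hypothetical example group"
    grid.create_dataset("precipitation", data=np.zeros((1, 4, 2)))
    grid.create_dataset("lat", data=np.linspace(-1.0, 1.0, 2))
    # visititems calls visit(name, object) for every member, recursively
    f.visititems(visit)

for path, kind, shape in found:
    print(path, kind, shape)
```

The same `visititems` call works on a file opened for reading, which is a handy way to discover group names without a viewer.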

In [None]:
# HDF5 example
imerg = '/g/data/ua8/tmp/hdf_data/3B-MO.MS.MRG.3IMERG.20000601-S000000-E235959.06.V06B.HDF5'

xarray's open_dataset() works with HDF5 files, though you need to specify the *h5netcdf* engine.<br>
Let's see what happens:

In [None]:
ds = xr.open_dataset(imerg, engine='h5netcdf')
ds

Xarray opened the file and can read the global attributes, but there are no variables or dimensions visible yet!<br>
This is because of the structured HDF format: the top level is a group, and we need to tell xarray we're dealing with groups. To do so, we pass the group name to open_dataset(). <br>
If you haven't checked the file structure beforehand with HDFView or Panoply, you can always use one of the other modules to retrieve the group names.

In [None]:
with rio.open(imerg) as f:
    for sds in f.subdatasets:
        print(sds)

We can now see that the main group is *Grid*, and we can pass this information to xarray:

In [None]:
xr.open_dataset(imerg, engine='h5netcdf', group='Grid')

If you have a file with sub-groups you can access them by using a path-like string: <br>
> xr.open_dataset(file_path, engine='h5netcdf', group='/Group1/subgroup1')

In the example above you can use either "Grid" or "/Grid".

An alternative approach is to use netCDF4.Dataset to open the file in diskless, non-persistent mode, and then pass the group to xarray:

In [None]:
nc = netCDF4.Dataset(imerg, diskless=True, persist=False)
group = nc.groups.get('Grid')
#print(nc.groups)
xds = xr.open_dataset(xr.backends.NetCDF4DataStore(group))
xds

In [None]:
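imerg_ds = rio.open(imerg)
print(imerg_ds.meta)
sds = imerg_ds.subdatasets
# the sub-dataset names are all we need from the parent file
imerg_ds.close()
print(len(sds))
# print the units of each sub-dataset
for s in sds:
    with rio.open(s) as subd:
        print(subd.units)
# open the first sub-dataset with xarray
xr.open_rasterio(sds[0])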

In [None]:
f = h5py.File(imerg,'r')
groups = [ x for x in f.keys() ]
print(groups)
gridMembers = [ x for x in f['Grid'] ]
print(gridMembers)

# Read the precipitation data, taking the first element along the leading dimension:

precip = f['Grid/precipitation'][0, :, :]

In [None]:
print(type(precip))
precip.shape

#### HDF5-EOS

### Pandas

In [None]:
# pandas can do this only with data that has the table / dataset structure it understands
store = pd.io.pytables.HDFStore(imerg, 'r')
print(store.info())
# list the objects pandas can see in the file
for k in store.keys():
    print(k)
# read_hdf() needs a key unless the store holds exactly one pandas object
#some = pd.read_hdf(store)
#some
store.close()
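
For comparison, this is what a file with the structure pandas expects looks like in practice. The sketch writes a small, made-up DataFrame through PyTables to a temporary file and reads it back; it assumes the `tables` package is installed, and the column names and key are arbitrary.

```python
import os
import tempfile

import pandas as pd

# Made-up values, for illustration only
df = pd.DataFrame({"lat": [-35.3, -33.9], "precip": [1.2, 0.4]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.h5")
    # Written through PyTables, so the file has the layout pandas expects
    df.to_hdf(path, key="obs", mode="w")

    with pd.HDFStore(path, mode="r") as store:
        keys = store.keys()

    # The key can be omitted here because the file holds a single pandas object
    roundtrip = pd.read_hdf(path)

print(keys)
print(roundtrip.equals(df))
```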

Finally, some resources: 
 + https://www.earthdatascience.org/courses/use-data-open-source-python/hierarchical-data-formats-hdf/open-MODIS-hdf4-files-python/
 + https://www.youtube.com/watch?v=xuWB_byi-6Q
 + https://disc.gsfc.nasa.gov/information/howto?title=How%20to%20Read%20IMERG%20Data%20Using%20Python
 + https://support.hdfgroup.org/HDF5/doc/H5.intro.html
 + https://portal.hdfgroup.org/display/support/Documentation
 + https://moonbooks.org/Articles/How-to-read-a-MODIS-HDF-file-using-python-/
 + Example of MODIS pre-processing with gdal: http://www.loicdutrieux.net/pyLandsat/modisPreProcess.html

 (1) This information was adapted from the first resource listed above.