<center>
<table>
  <tr>
    <td><img src="http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://github.com/astg606/py_materials/blob/master/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<center><h1> <font color="red">Reading OMI HDF5 Files Using h5py</font></h1></center>

## <font color="red">Primary References/Resources</font>

- [Ozone Monitoring Instrument (OMI)](https://www.earthdata.nasa.gov/learn/find-data/near-real-time/omi)
- [h5py Quick Start Guide](https://docs.h5py.org/en/stable/quick.html)
- [OMNO2d File Specification](https://docserver.gesdisc.eosdis.nasa.gov/repository/Mission/OMI/3.3_ScienceDataProductDocumentation/3.3.2_ProductRequirements_Designs/OMNO2d_FileSpec_V003.pdf)

### Import Statements

In [None]:
import pprint
import os

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import hvplot.xarray
from cartopy import crs as ccrs
import cartopy.feature as cfeature
import cartopy.io.shapereader as shapereader
from cartopy.mpl.ticker import LongitudeFormatter, LatitudeFormatter

In [None]:
import numpy as np
import xarray as xr
import h5py

In [None]:
# Toggles off alphabetical sorting
pprint.sorted = lambda x, key=None:x

## <font color="red">What is OMI? </font>

- The Ozone Monitoring Instrument (OMI) aboard NASA's Aura satellite (launched in 2004) measures ozone from Earth's surface to top-of-atmosphere. 
- OMI also measures sulfur dioxide (SO2), aerosols, and cloud top pressure. 
- Near real-time (NRT) OMI data are available through LANCE generally within three hours after a satellite observation.

## <font color="red"> Accessing a Sample HDF5 Data File</font>

Directory where the OMI files are located:

In [None]:
data_dir = "/Users/jkouatch/myTasks/PythonTraining/ASTG606/Materials/sat_data/OMI_Data/"

Full path to the file name:

In [None]:
file_name = os.path.join(data_dir, "OMI-Aura_L3-OMTO3e_2022m0709_v003-2022m0711t031807.he5")

### <font color="blue"> Opening the File</font>

Opening file for reading:

In [None]:
fid = h5py.File(file_name, 'r')
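
This notebook keeps `fid` open so that later cells can reuse it. For short, self-contained reads, a context manager is a common alternative because the file is closed automatically when the block ends. The sketch below is only an illustration: it writes a tiny stand-in file (the path `demo.he5` and its two groups are hypothetical, not the OMI granule) and reads it back.

```python
import os
import tempfile

import h5py

# Hypothetical stand-in file so the sketch runs without the OMI granule
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.he5")

with h5py.File(demo_path, "w") as writer:
    writer.create_group("HDFEOS")
    writer.create_group("HDFEOS INFORMATION")

# The file is closed automatically when the with-block exits
with h5py.File(demo_path, "r") as reader:
    top_level = list(reader.keys())

print(top_level)
```

Anything needed after the block (here, `top_level`) has to be copied out of the file before it closes.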

#### File Hierarchy

File -->  Group -->  Sub-group -->  Dataset

The `visit()` method walks the file recursively and calls the function we give it with the name (path) of every group and dataset. Passing Python's `print()` function prints the full hierarchy.

In [None]:
fid.visit(print)

You can even incorporate `lambda` or use predefined functions to retrieve more information.

In [None]:
fid.visit(lambda x: print(x, fid[x], "\n"))

In [None]:
# Retrieve hierarchy and corresponding objects
def print_more(name):
    print(name, fid[name], "\n")
    
fid.visit(print_more)

In addition to the type of each object, the output shows the path and number of members for each group. For each dataset, it shows the name, shape, and data type instead.
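
`h5py` also provides `visititems()`, whose callback receives both the name and the object, so there is no need to index the file again inside the callback. The sketch below uses a small in-memory file as a stand-in for the OMI granule; the group and dataset names are made up for illustration.

```python
import h5py
import numpy as np

summary = []

def describe(name, obj):
    """Callback for visititems(): receives each object's path and the object itself."""
    if isinstance(obj, h5py.Group):
        summary.append((name, "group", len(obj)))
    elif isinstance(obj, h5py.Dataset):
        summary.append((name, "dataset", obj.shape))

# In-memory stand-in file (nothing is written to disk)
with h5py.File("demo_hierarchy.h5", "w", driver="core", backing_store=False) as demo:
    fields = demo.create_group("HDFEOS/GRIDS/Demo Grid/Data Fields")
    fields.create_dataset("ColumnAmountO3", data=np.zeros((4, 8), dtype="f4"))
    demo.visititems(describe)

for entry in summary:
    print(entry)
```

Checking the object type with `isinstance()` makes it easy to treat groups and datasets differently while walking the hierarchy.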

### <font color="blue"> Data Extraction </font>

#### Accessing Groups / Subgroups

- Groups in `h5py` behave like Python dictionaries, so we can access their contents with the usual dictionary methods.

We can access group and subgroup keys:


In [None]:
fid_keys = list(fid.keys())
print(fid_keys)

group and subgroup values:

In [None]:
fid_values = list(fid.values())
print(fid_values)

group and subgroup items,

In [None]:
fid_items = dict(fid.items())
print(fid_items)

or we can access the objects within them directly.

In [None]:
print(fid['HDFEOS']['ADDITIONAL'])

In [None]:
print(dict(fid['HDFEOS']['ADDITIONAL']))
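
Chained indexing is not the only option: `h5py` also accepts a slash-separated path in a single lookup, and the dictionary-style `in` and `get()` help when a group may be absent. A minimal sketch on an in-memory stand-in file (the file name is hypothetical and the layout only mimics the OMI file):

```python
import h5py

# In-memory stand-in file (nothing is written to disk)
with h5py.File("demo_paths.h5", "w", driver="core", backing_store=False) as demo:
    demo.create_group("HDFEOS/ADDITIONAL/FILE_ATTRIBUTES")

    chained_name = demo["HDFEOS"]["ADDITIONAL"].name   # one key at a time
    path_name = demo["HDFEOS/ADDITIONAL"].name         # full path in one lookup

    has_grids = "HDFEOS/GRIDS" in demo                 # membership test
    missing = demo.get("HDFEOS/GRIDS")                 # None instead of a KeyError

print(chained_name, path_name, has_grids, missing)
```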

#### Other Information

Let's use the `/HDFEOS/GRIDS/` sub-group as an example.

In [None]:
sample_group = fid['HDFEOS']['GRIDS']

You can access group names (includes path),

In [None]:
sample_group.name

the parent group of a subgroup, 

In [None]:
sample_group.parent

and the file to which the group belongs.

In [None]:
sample_group.file

In addition, we can access the **attributes** through the `attrs` property, which follows a dictionary-like interface.

In [None]:
sample_group_attrs = dict(sample_group.attrs)

In [None]:
sample_group_attrs

Unfortunately, this group doesn't have any attributes.

#### Important Application

At first glance, it appears that most of the groups and sub-groups in the file are irrelevant. When looking at the hierarchy, they either lead to the data itself or to other empty sub-groups and datasets.

In reality, they may hold crucial information stored as attributes. Luckily, we can take advantage of the `visit()` function to get our "invisible" metadata.

In [None]:
def print_attrs(name):
    print(name, "\n\tAttributes:", fid[name].attrs.keys(), '\n')

By using a pre-defined function, we can access the attribute keys of every single object in the `HDF5` file.

In [None]:
fid.visit(print_attrs)
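
If the attributes are needed later rather than just printed, the same walk can collect them into a dictionary keyed by object path. This is a sketch on an in-memory stand-in file; the attribute names mimic the OMI file, but the values and the `collect_attrs` helper are made up for illustration.

```python
import h5py
import numpy as np

all_attrs = {}

def collect_attrs(name, obj):
    """Store the attributes of every object that has at least one."""
    if len(obj.attrs) > 0:
        all_attrs[name] = dict(obj.attrs)

# In-memory stand-in file (nothing is written to disk)
with h5py.File("demo_attrs.h5", "w", driver="core", backing_store=False) as demo:
    grid = demo.create_group("HDFEOS/GRIDS/Demo Grid")
    grid.attrs["GridSpacing"] = np.bytes_(b"(0.25,0.25)")
    field = grid.create_dataset("Data Fields/Demo", data=np.arange(6).reshape(2, 3))
    field.attrs["ScaleFactor"] = 1.0
    demo.visititems(collect_attrs)

for path, attrs in all_attrs.items():
    print(path, list(attrs))
```

Objects without attributes are skipped, which keeps the result focused on the "invisible" metadata.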

Besides **file-level attributes** and **coordinate metadata**, we can also read **dataset attributes** this way, since datasets expose the same `attrs` property.

### <font color="blue">Accessing Top-level Metadata</font>

#### File-level Attributes

From displaying all attributes above, we can see that file-level attributes are stored as attributes within the `HDFEOS/ADDITIONAL/FILE_ATTRIBUTES/` sub-group.

Since attributes have a dictionary-like interface in `h5py`, it's simple to obtain them.

In [None]:
file_attrs = dict( fid['HDFEOS']['ADDITIONAL']['FILE_ATTRIBUTES'].attrs )

file_attrs

`h5py` returns the attribute values as `NumPy` data types: `numpy.ndarray` for numeric and array values and `numpy.bytes_` for strings and characters. Tuples and dictionaries are stored as strings in this file, so they arrive as `numpy.bytes_` as well.

While we could leave them that way, converting them to built-in Python types makes them easier to read and use. The `isinstance()` function lets us pick the right conversion for each value.

In [None]:
for key, item in file_attrs.items():
    if isinstance(item, np.ndarray):   # Converts np arrays to a list to, if applicable, an int or float
        item = list(item)
        
        if len(item) == 1:
            item = item[0]
    elif isinstance(item, np.bytes_):   # Converts np bytes to an np string to a Python string
        item = str(item.astype('str'))
        
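        # Strings such as "(1, 2)" encode tuples or dicts; parse them if applicable.
        # Caution: eval() executes arbitrary code, so only use it on trusted files
        # (ast.literal_eval() is a safer alternative for plain literals).
        if item[0] == '(' or item[0] == '{':
            item = eval(item)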
            
    file_attrs[key] = item   # Updates any changes to the key value

In [None]:
pprint.pprint(file_attrs)
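
Since `eval()` runs whatever expression the string contains, one possible alternative is `ast.literal_eval()`, which only accepts Python literals. The sketch below is a hedged variant of the conversion above, not the notebook's own method; the helper name `to_python` and the sample values are hypothetical, and it works on a plain dictionary so it can be tried without the OMI file.

```python
import ast

import numpy as np

def to_python(value):
    """Convert one attribute value from a NumPy type to a built-in Python type."""
    if isinstance(value, np.ndarray):
        items = value.tolist()
        # Unwrap single-element arrays into a scalar
        if isinstance(items, list) and len(items) == 1:
            return items[0]
        return items
    if isinstance(value, bytes):   # np.bytes_ is a subclass of bytes
        text = value.decode("utf-8", errors="replace")
        if text.startswith(("(", "{")):
            try:
                return ast.literal_eval(text)   # parses literals only, never runs code
            except (ValueError, SyntaxError):
                return text                     # not a valid literal: keep the string
        return text
    if isinstance(value, np.generic):
        return value.item()
    return value

# Made-up attribute values shaped like the ones h5py returns
demo_attrs = {
    "GridSpan": np.bytes_(b"(-180.0,180.0,-90.0,90.0)"),
    "NumberOfLongitudesInGrid": np.array([1440], dtype=np.int32),
    "Comment": np.bytes_(b"(not a tuple"),
}
converted = {key: to_python(value) for key, value in demo_attrs.items()}
print(converted)
```

Building a new dictionary instead of updating the original in place also keeps the raw `h5py` values available for comparison.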

#### Coordinates and Plotting Information

Our plotting-related metadata seems to be stored as attributes in the `HDFEOS/GRIDS/OMI Column Amount O3` sub-group. We can try to access them the same way as file attributes.

In [None]:
plot_attrs = dict( fid['HDFEOS']['GRIDS']['OMI Column Amount O3'].attrs )

In [None]:
plot_attrs

Using the same data type conversion method, we can get more convenient data types.

In [None]:
for key, item in plot_attrs.items():
    if isinstance(item, np.ndarray):   # Converts np arrays to a list to, if applicable, an int or float
        item = list(item)
        
        if len(item) == 1:
            item = item[0]
    elif isinstance(item, np.bytes_):   # Converts np bytes to an np string to a Python string
        item = str(item.astype('str'))
        
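        # Strings such as "(1, 2)" encode tuples or dicts; parse them if applicable.
        # Caution: eval() executes arbitrary code, so only use it on trusted files
        # (ast.literal_eval() is a safer alternative for plain literals).
        if item[0] == '(' or item[0] == '{':
            item = eval(item)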
            
    plot_attrs[key] = item   # Updates any changes to the key value

In [None]:
plot_attrs

These attributes give us all the information we need to construct the coordinates needed for `xarray` datasets.

First, we want to identify our coordinate boundaries.

In [None]:
lonW = plot_attrs['GridSpan'][0]
lonE = plot_attrs['GridSpan'][1]
latS = plot_attrs['GridSpan'][2]
latN = plot_attrs['GridSpan'][3]

Next, we just need to obtain the number of lats and lons in the grid (our dimension sizes), which is also readily available.

In [None]:
lon_size = plot_attrs['NumberOfLongitudesInGrid']
lat_size = plot_attrs['NumberOfLatitudesInGrid']

Finally, using NumPy's `linspace()` function, we can now create our coordinates!

In [None]:
lons = np.linspace(lonW, lonE, lon_size)
lats = np.linspace(latS, latN, lat_size)

In [None]:
print('Longitudes:\n', lons)
print('Latitudes:\n', lats)
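
Note that `np.linspace(lonW, lonE, lon_size)` places the first and last points exactly on the grid boundaries. If `GridSpan` describes the outer edges of the grid cells rather than the cell centers (an assumption worth checking against the file specification), the cell-center coordinates would be offset by half a grid spacing. A minimal sketch with illustrative values for a 0.25-degree global grid; the numbers are typed in here, not read from the file:

```python
import numpy as np

# Illustrative values (assumed, not read from the file)
lon_w, lon_e, lat_s, lat_n = -180.0, 180.0, -90.0, 90.0
n_lon, n_lat = 1440, 720

# Grid spacing implied by the span and the number of cells
dlon = (lon_e - lon_w) / n_lon
dlat = (lat_n - lat_s) / n_lat

# Cell centers sit half a spacing inside the outer edges
lon_centers = lon_w + dlon * (np.arange(n_lon) + 0.5)
lat_centers = lat_s + dlat * (np.arange(n_lat) + 0.5)

print(lon_centers[:3], lon_centers[-1])
print(lat_centers[:3], lat_centers[-1])
```

With these values the spacing is exactly 0.25 degrees, whereas `np.linspace` over the same span with 1440 points gives a spacing of 360/1439 degrees.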

### <font color="blue">Accessing Data Fields and Datasets</font>

#### Data Fields

Looking back at the file layout, we can see that the data appears to be within the subgroup `/HDFEOS/GRIDS/OMI Column Amount O3/Data Fields/`.

In [None]:
data_group = fid['HDFEOS']['GRIDS']['OMI Column Amount O3']['Data Fields']

We can take advantage of the `visit()` function once again and get some descriptive information and attributes of each dataset within the sub-group.

In [None]:
def print_data_info(name):
    print('Name:', name, 
          '\n\tInfo:', data_group[name],
          '\n\tAttrs:', data_group[name].attrs.keys(), '\n')

In [None]:
data_group.visit(print_data_info)

#### Datasets

Given our previous knowledge of reading attributes, accessing important keys such as missing and fill values, scale factors, and offset values will be straightforward.

Let's use the `SolarZenithAngle` dataset as our sample.

In [None]:
sample_ds = data_group['SolarZenithAngle']

Let's now examine the attributes more closely.

In [None]:
sample_ds_attrs = dict(sample_ds.attrs)

In [None]:
sample_ds_attrs

Time for our signature data type conversion.

In [None]:
for key, item in sample_ds_attrs.items():
    if isinstance(item, np.ndarray):   # Converts np arrays to a list to, if applicable, an int or float
        item = list(item)
        
        if len(item) == 1:
            item = item[0]
    elif isinstance(item, np.bytes_):   # Converts np bytes to an np string to a Python string
        item = str(item.astype('str'))
        
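        # Strings such as "(1, 2)" encode tuples or dicts; parse them if applicable.
        # Caution: eval() executes arbitrary code, so only use it on trusted files
        # (ast.literal_eval() is a safer alternative for plain literals).
        if item[0] == '(' or item[0] == '{':
            item = eval(item)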
            
    sample_ds_attrs[key] = item   # Updates any changes to the key value

In [None]:
sample_ds_attrs

Now, we can extract our targeted attributes.

In [None]:
# Default values (also a reset if testing different datasets/variables)
fill = None
scale = 1
offset = 0

In [None]:
for key, value in sample_ds_attrs.items():
    if key == '_FillValue':
        fill = value  
    if key == 'ScaleFactor':
        scale = value
    if key == 'Offset':
        offset = value
# data = data * scale + offset
    
print('Fill Value:', fill)
print('Scale Factor:', scale)
print('Offset:', offset)

The last thing we need to do is access our actual **data**. `h5py` makes this really simple: indexing the dataset object with `[()]` reads all of it into a `NumPy` array.

In [None]:
sample_data = sample_ds[()]

In [None]:
sample_data
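
Once the array is in memory, the fill value, scale factor, and offset extracted earlier can be applied to it. The sketch below shows the idea on a small synthetic array; the values of `raw`, `fill`, `scale`, and `offset` are hypothetical, not taken from the OMI file.

```python
import numpy as np

# Synthetic stand-in for a dataset read with ds[()]
raw = np.array([[10, -9999], [30, 40]], dtype=np.int16)
fill, scale, offset = -9999, 0.1, 5.0   # hypothetical attribute values

# Work on a float copy so that NaN can mark the missing cells
data = raw.astype("float64")
data[raw == fill] = np.nan

# Restore physical values: data * scale + offset
data = data * scale + offset

print(data)
```

Masking before scaling matters: the fill value is defined for the stored values, not for the rescaled ones.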

### <font color="blue">Accessing Dimensions</font>

The last thing to access in our HDF5 file is dataset **dimensions**, known as **dimension scales** in `h5py`.

We can access a dataset's dimensions through its `dims` property, which we can convert to a list of dimension objects.

In [None]:
sample_ds_dims = list(sample_ds.dims)

In [None]:
sample_ds_dims

Each dimension object represents one axis of the dataset. Normally, one would be able to access the dimension label and the scales (themselves `HDF5` datasets) attached to each axis. For our OMI satellite data file, the dimension objects are empty.

In [None]:
len(sample_ds_dims[0])

In [None]:
sample_ds_dims[0].label   # would return dimension label

In [None]:
dict(sample_ds_dims[0].items())   # would return label and scales associated with this axis

In [None]:
#sample_ds_dims[0][0]   # would return scale data
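
For comparison, here is roughly what these calls return for a file that does define dimension scales. This is a sketch on an in-memory file; the dataset names `lat` and `field` are hypothetical.

```python
import h5py
import numpy as np

# In-memory stand-in file (nothing is written to disk)
with h5py.File("demo_dims.h5", "w", driver="core", backing_store=False) as demo:
    lat = demo.create_dataset("lat", data=np.linspace(-89.5, 89.5, 4))
    field = demo.create_dataset("field", data=np.zeros((4, 6)))

    lat.make_scale("lat")              # mark the dataset as a dimension scale
    field.dims[0].attach_scale(lat)    # attach it to the first axis
    field.dims[0].label = "lat"        # give the axis a label

    label = field.dims[0].label        # axis label
    n_scales = len(field.dims[0])      # number of scales attached to axis 0
    n_unscaled = len(field.dims[1])    # axis 1 has no scale attached
    scale_values = field.dims[0][0][()]   # data of the first attached scale

print(label, n_scales, n_unscaled, scale_values)
```

The OMI file has no labels or attached scales, so there the label is an empty string and the length is zero.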

Instead, we can try to match the dataset shape to our plotting attributes describing lon and lat size to assign our `xarray` dimension names.

In [None]:
sample_ds.shape

In [None]:
print(lon_size, lat_size)

In [None]:
if sample_ds.shape[0] == lon_size:
    sample_ds_dims = ['lon', 'lat']
elif sample_ds.shape[0] == lat_size:
    sample_ds_dims = ['lat', 'lon']

Getting the order right is important when we initialize our `xarray` DataArrays.

Now that we've gotten all the information we need, we can close our file reader.

In [None]:
fid.close()

## <font color="red">Conversion to Xarray DataArrays and Datasets</font>

Now that we've been able to get all of the necessary information to create an `xarray` dataset, we can start!

In [None]:
def get_fid(filename):
    '''
       Receive an HDF5 file name, open it and 
       return the file identifier object.
    
       Input Parameters: 
         - filename (str): file name
       Returned value:
         - fid: h5py file identifier object
    '''
    fid = h5py.File(filename, 'r')
    return fid

In [None]:
def get_data_group(fid):
    '''
       Use the file identifier to extract a datafield subgroup.

       Input Parameters: 
         - fid: h5py file identifier object
       Returned value:
         - data_group: the data field group (contents) of the file
    '''
    # contents of our parent group
    parent_contents = dict(fid['HDFEOS']['GRIDS']) 
    # our subparent group object
    subparent = list(parent_contents.values())[0]
    # contents of our subparent group
    subparent_contents = dict(subparent)   
    # our data group object
    data_group = list(subparent_contents.values())[0]   
    
    return dict(data_group)

In [None]:
def convert_dict_dtype(sample_dict):
    '''
       Converts attribute dictionary from Numpy data types 
       to general Python data types

       Input Parameters: 
         - sample_dict: A dictionary of attributes
       Returned value:
         - sample_dict: A dictionary of attributes
    '''
    for key, item in sample_dict.items():
        if isinstance(item, np.ndarray):   # Converts np arrays to a list to, if applicable, an int or float
            item = list(item)
        
            if len(item) == 1:
                item = item[0]
        elif isinstance(item, np.bytes_):   # Converts np bytes to an np string to a Python string
            item = str(item.astype('str'))
        
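            # Strings such as "(1, 2)" encode tuples or dicts; parse them if applicable.
            # Caution: eval() executes arbitrary code, so only use it on trusted files
            # (ast.literal_eval() is a safer alternative for plain literals).
            if item[0] == '(' or item[0] == '{':
                item = eval(item)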
            
        sample_dict[key] = item   # Updates any changes to the key value
        
    return sample_dict

In [None]:
def get_fid_attrs(fid):
    """
       Use the file identifier to return the file-level attributes 
       (merged with the plotting attributes) in the proper data type.
       
       Input Parameters:
         - fid: file identifier
       Returned value:
         - fid_attrs: a dictionary.
    """
    fid_attrs = dict( fid['HDFEOS']['ADDITIONAL']['FILE_ATTRIBUTES'].attrs )
    fid_attrs = convert_dict_dtype(fid_attrs)
    
    fid_attrs.update(get_plot_attrs(fid))
    
    return fid_attrs

In [None]:
def get_plot_attrs(fid):
    """
       Use the file identifier to return the plotting attributes.
       
       Input Parameters:
          - fid: h5py file identifier
       Returned value:
          - plot_attrs: a dictionary
    """
    parent_contents = dict(fid['HDFEOS']['GRIDS'])
    subgroup = list(parent_contents.values())[0]
    
    plot_attrs = dict(subgroup.attrs)
    plot_attrs = convert_dict_dtype(plot_attrs)
    
    return plot_attrs

In [None]:
def get_ds_attrs(ds):
    """
       Given a dataset identifier, return the dataset attributes.
       
       Input Parameters:
          - ds: dataset identifier
       Returned value:
          - ds_attrs: a dictionary
    """
    ds_attrs = dict(ds.attrs)
    ds_attrs = convert_dict_dtype(ds_attrs)
    
    return ds_attrs

In [None]:
def get_fill(ds_attrs):
    """
       Return the fill value of a given dataset object.
       
       Input Parameters: 
          - ds_attrs: A dictionary of dataset attributes
       Returned Value: 
          - value: Either an integer, floating point, or 'None'
    """
    return ds_attrs.get('_FillValue')

def get_scale(ds_attrs):
    """
       Return the scale factor of a given dataset object.
       
       Input Parameters: 
          - ds_attrs: A dictionary of dataset attributes
       Returned Value: 
          - value: Either an integer or floating point (1 if not set)
    """
    # Default to 1 so that scaling is a no-op when the attribute is missing
    return ds_attrs.get('ScaleFactor', 1)

def get_offset(ds_attrs):
    """
       Return the offset value of a given dataset object.
       
       Input Parameters: 
          - ds_attrs: A dictionary of dataset attributes
       Returned Value: 
          - value: Either an integer or floating point (0 if not set)
    """
    # Default to 0 so that the offset is a no-op when the attribute is missing
    return ds_attrs.get('Offset', 0)


In [None]:
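def restore_data(ds):
    '''
       Restore the dataset data using the dataset attributes.
       
       Input Parameters:
          - ds: h5py dataset identifier
       Returned Value:
          - data: numpy array
    '''
    ds_attrs = get_ds_attrs(ds)
    
    fill = get_fill(ds_attrs)
    scale = get_scale(ds_attrs)
    offset = get_offset(ds_attrs)
    
    data = ds[()]
    
    # Replace fill values with NaN (skipped when no fill value is defined)
    if fill is not None:
        data = np.where(data != fill, data, np.nan)
    
    # Missing attributes fall back to neutral values
    scale = 1 if scale is None else scale
    offset = 0 if offset is None else offset
    data = data * scale + offset
    
    return data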

In [None]:
def get_coords(fid):
    '''
       Return the file coordinates given its identifier object.
       
       Input Parameters:
          - fid: h5py file identifier
       Returned value:
          - dictionary of latitudes and longitudes as NumPy arrays.
    '''
    plot_attrs = get_plot_attrs(fid)
    
    lonW = plot_attrs['GridSpan'][0]
    lonE = plot_attrs['GridSpan'][1]
    latS = plot_attrs['GridSpan'][2]
    latN = plot_attrs['GridSpan'][3]
    
    lon_size = plot_attrs['NumberOfLongitudesInGrid']
    lat_size = plot_attrs['NumberOfLatitudesInGrid']
    
    lons = np.linspace(lonW, lonE, lon_size)
    lats = np.linspace(latS, latN, lat_size)
    
    return {'lons': lons, 'lats': lats}

In [None]:
def get_ds_dims(ds, coords):
    '''
       Get dataset dimension names given dataset and coordinates
       
       Input Parameters:
          - ds: a h5py dataset
          - coords: a dictionary of coordinates
       Returned Value:
          - ds_dims: a dictionary
   '''
    dims = ds.dims
    ds_dims = {}
    
    for i in range(len(dims)):
        if dims[i].label == '':
            if ds.shape[i] == coords['lons'].size:
                ds_dims['lon'] = ds.shape[i]
            elif ds.shape[i] == coords['lats'].size:
                ds_dims['lat'] = ds.shape[i]
        else:
            ds_dims[dims[i].label] = ds.shape[i]
    
    return ds_dims

In [None]:
def check_coords(dims, coords): 
    '''
       Rearrange coordinates order to match dimensions
       shapes for a dataset.
       
       Input Parameters:
          - dims: a dictionary of dimensions
          - coords: a dictionary of coordinates
       Returned Value:
          - coords: a dictionary
   '''
    # Swap the two coordinates when the first one does not match the first dimension
    if list(dims.values())[0] != list(coords.values())[0].size:
        coords = {list(coords.keys())[1]: list(coords.values())[1], 
                  list(coords.keys())[0]: list(coords.values())[0]}
    return coords

In [None]:
def read_file(filename):
    '''
       Given an OMI HDF5 file name, convert the data into 
       an Xarray Dataset.
       
       Input Parameters:
          - filename (str): HDF5 file name containing OMI data
       Returned Value:
          - xr_ds: an Xarray Dataset
    '''
    xr_ds = xr.Dataset()
    
    fid = get_fid(filename)
    
    data_group = get_data_group(fid)
    fid_attrs = get_fid_attrs(fid)   
    fid_coords = get_coords(fid)
    
    for name, hdf_ds in data_group.items():
        data = restore_data(hdf_ds)       
        ds_attrs = get_ds_attrs(hdf_ds)
        
        ds_dims = get_ds_dims(hdf_ds, fid_coords)
        ds_coords = check_coords(ds_dims, fid_coords)
    
        xr_ds[name] = xr.DataArray(data, dims = list(ds_dims.keys()), coords = list(ds_coords.values()))
        xr_ds[name].attrs = ds_attrs
        
        
    xr_ds.attrs = fid_attrs    
       
    fid.close()    
    return xr_ds

In [None]:
file_ds = read_file(file_name)

In [None]:
file_ds

## <font color="red">Plotting Our Data</font>

Dataset size in memory (MB)

In [None]:
file_MB = file_ds.nbytes / 1000000

file_MB

Example variable

In [None]:
var = file_ds['RadiativeCloudFraction']

Basic `matplotlib` plot

In [None]:
var.plot()

Basic `hvPlot` plot

In [None]:
var.hvplot()

More intermediate `hvPlot` plots

In [None]:
var.hvplot.quadmesh('lon', 'lat', projection = ccrs.PlateCarree(), geo = True, ylim = (-60, 80),
                    project = True, cmap = 'blues', rasterize = True, coastline = True)

In [None]:
var.hvplot.contour('lon', 'lat', projection = ccrs.PlateCarree(), ylim = (-60, 80),
                   cmap = 'reds', coastline = True, geo = True, levels = 9)