<center>
<table>
  <tr>
    <td><img src="http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://github.com/astg606/py_materials/blob/master/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<center><h1><font color="red" size="+3">Reading Scientific Data Format Files</font></h1></center>

# <font color='red'> Useful References </font>

* <A HREF="http://pyhogs.github.io/intro_netcdf4.html">Create and read netCDF files</A>
* <A HREF="https://unidata.github.io/netcdf4-python/">netCDF4 module</A>
* <a href="https://annefou.github.io/metos_python/07-LargeFiles/">Handling very large files in Python</a>
* <A HREF="https://www.pythonforthelab.com/blog/how-to-use-hdf5-files-in-python/">How to use HDF5 files in Python</A>
* <a href="https://confluence.slac.stanford.edu/pages/viewpage.action?pageId=99485805">How to access HDF5 data from Python</a>
* [Welcome to pyhdf’s documentation!](http://fhs.github.io/pyhdf/)

## <font color="red"> Scientific Data</font>

- Store a variety of data types, including single-point observations, time series, regularly spaced grids, and satellite or radar images.
- Include metadata.
- Measurements at specific time, location, condition
   - Physics: temperature, pressure
   - Chemistry: reaction speed
   - Biology: type (species, cell types, nucleotides)
   - Economics: price
   - Algorithmics: program time and space
   - Networking: network activity
   - Robotics: movements
     
### Requirements

+ **Compact storage**: compression
+ **Fast I/O**: parallel, partial, random access
+ **Portability**: transporting data between computers
+ **Tools for manipulating data**: reorganizing, aggregating, subsetting, converting, visualizing
+ **Easy API in many languages**: C, C++, Fortran, Java, Matlab, Perl, Python, R, ...

We need to use four guiding principles (known as the [FAIR Principles](https://www.nature.com/articles/sdata201618)) for the proper creation, storage and manipulation of scientific data:

1. Data must be **F**indable
2. Data must be **A**ccessible
3. Data must be **I**nteroperable
4. Data must be **R**eusable.


> In order for the scientific community to get the most value out of the available data, it is vital that storage formats are optimal for sharing, archiving and reuse. Adequate description of the data (stored in the form of metadata) is also key for turning data into information.

## <font color="red"> Data Formats of Interest </font>

+ **Network Common Data Format** (netCDF)
+ **Hierarchical Data Format** (HDF)
  - HDF4
  - HDF5
  
We will learn how to access data in files using the above formats. 

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import os
import pprint
import datetime
import numpy as np

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm

In [None]:
import cartopy
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from cartopy.mpl.ticker import LongitudeFormatter, LatitudeFormatter
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER

## <font color='red'> netCDF</font>
### <font color='blue'> What is netCDF?</font>

#### Overview

* NetCDF is an interface to a library of data access functions for storing and retrieving data in the form of arrays.
* NetCDF is an abstraction that supports a view of data as a collection of self-describing, portable objects that can be accessed through a simple interface.
* All operations to access and manipulate data in a netCDF dataset must use only the set of functions provided by the interface.
* Array values may be accessed directly, without knowing details of how the data are stored.
* NetCDF supports efficient access to small subsets of large datasets.
* Stores data in an array-oriented dataset which contains dimensions, variables, and attributes.
* The dataset file is divided into two parts: 
   - The file header contains all information (metadata) about dimensions, attributes, and variables except for the variable data itself.
   - The array data section contains arrays of variable values (raw data).

#### Features

- Self-contained, platform independent, binary
- Dimensions
   - Contain name and size
   - At most one dimension can be unlimited (the record dimension) in the classic format; netCDF-4 files may have several
   - Measurands e.g. time, latitude, longitude, etc.
- Variables
   - Array of values with same type
   - Contain name, datatype, shape
   - Coordinate variable: one dimensional variable with same name as dimension
- Attributes
   - Metadata
   - Used for variables and file (global attributes)
- Conventions
   - Standards for specific use case
      - The names of variables and dimensions should be meaningful and conform to any relevant conventions.
      - Attribute settings need to follow relevant conventions.
   - Compare files from different sources.
   - e.g. Climate and Forecast (CF), Cooperative Ocean/Atmosphere Research Data Service (COARDS)


#### Portability

* The netCDF library is supported for various Linux/UNIX operating systems as well as MS Windows.
* APIs written for Fortran 77/90, C, C++, Java, etc.


### <font color='blue'> What is netCDF4 Python?</font>

* Python interface to the netCDF version 4 library.
* **Can read and write files in both the new netCDF 4 and the netCDF 3 formats**.
* Can create files that are readable by HDF5 utilities.
* Relies on NumPy arrays.

---

In [None]:
import netCDF4 as nc4

#### Open the File

In [None]:
#data_dir = "/Users/jkouatch/myTasks/PythonTraining/ASTG606/Materials/sat_data/VIIRS_Data/"
data_dir = "/tljh-data/sat_data/VIIRS_Data"

In [None]:
nc_file = os.path.join(data_dir, "VNP14IMG_NRT.A2018064.1200.001.nc")

In [None]:
ncfid = nc4.Dataset(nc_file,'r')

Quick overview of the file content:

In [None]:
ncfid

#### Check the Content of the file

List all the variable information:

In [None]:
print(ncfid.variables)

List all the dimension information:

In [None]:
for dim in ncfid.dimensions.values():
     print(dim, dim.isunlimited())

Get the list of dimension name and retrieve info for each dimension:

In [None]:
for name in ncfid.dimensions.keys():
    try:
        dim = ncfid.variables[name]
        print(name, dim.dtype, dim.size)
    except KeyError:
        # This dimension has no coordinate variable of the same name.
        print(name)

In [None]:
for name, dim in ncfid.dimensions.items():
    print(name, dim.size, dim.isunlimited())

#### <font color='blue'>Printing File Attributes</font>

Get the global file attributes:

In [None]:
for att in ncfid.ncattrs():
    print(f"{att}: \n\t {ncfid.getncattr(att)}")
    #print("{:>15}: {}".format(att, ncfid.getncattr(att)))

In [None]:
for att in ncfid.ncattrs():
    print(f"{att:>15}: {getattr(ncfid, att)}")

Global attributes as a dictionary:

In [None]:
print(ncfid.__dict__)

#### <font color='blue'>Printing Variable Information</font>

List variable information but exclude dimensions:

In [None]:
for name in ncfid.variables.keys():
    if (name not in ncfid.dimensions.keys()):
        data = ncfid.variables[name]
        try:
            print(f"{name}: \n\t Unit: {data.units} \n\t shape: {data.shape} \n\t Type: {data.dtype} \n\t Dim: {data.dimensions}")
        except AttributeError:  # the variable has no `units` attribute
            print(f"{name}: \n\t Type: {data.dtype}")

In [None]:
for name, var in ncfid.variables.items():
    if (name not in ncfid.dimensions.keys()):
        try:
            print(name, var.units, var.shape, var.dtype, var.dimensions)
        except AttributeError:  # the variable has no `units` attribute
            print(name, var.dtype)
       

You can write a function to print the attributes of a variable:

In [None]:
def print_ncattr(fid, key):
    """
        Prints the data type and the attributes of one netCDF variable

        Parameters: 
            * fid:  netCDF file identifier
            * key:  unicode (a valid netCDF4.Dataset.variables key)
    """
    try:
        print('{}  -->'.format(key))
        print("\t {:>15}: {}".format("type", fid.variables[key].dtype))
        for attr in fid.variables[key].ncattrs():
            print('\t {:>15}: {}'.format(attr, fid.variables[key].getncattr(attr)))
    except KeyError:
        print("\t WARNING: {} is not a variable in this file".format(key))

In [None]:
print(print_ncattr.__doc__)

In [None]:
for name in ncfid.variables.keys():
    print_ncattr(ncfid, name)

In [None]:
print_ncattr(ncfid, "FP_T5")

In [None]:
for name, var in ncfid.variables.items():
    print('{}  -->'.format(name))
    print("\t {:>15}: {}".format("type", var.dtype))
    for attr in var.ncattrs():
        print('\t {:>15}: {}'.format(attr, var.getncattr(attr)))

List the groups (if any):

In [None]:
print(ncfid.groups)

In [None]:
def walk_group_tree(top):
    """
       Python generator that walks the group tree of a netCDF dataset.
    """
    values = top.groups.values()
    yield values
    for value in top.groups.values():
        for children in walk_group_tree(value):
            yield children

List of the created groups in the dataset:

In [None]:
for children in walk_group_tree(ncfid):
    for child in children:
        print(child)

#### <font color='blue'>Close the file</font>

In [None]:
ncfid.close()

#### <font color="blue"> Important Notes</font>

The command:

```python
   field = ncfid.variables[var_name]
```
extracts from the netCDF file all the information (attributes, dimensions, values, etc.) related to `var_name`. The object `field` is a handle to the variable stored in the file rather than an in-memory copy, so if the file was opened in a writable mode, assigning to `field` changes the file.

To access the values, we recommend the command:

```python
   vals = ncfid.variables[var_name][:]
``` 
However, it loads into memory all the values of the variable `var_name`. To potentially avoid crashing your computer, it is better to load small parts at a time by replacing `[:]` with slices, for example, `[0:2]`, `[2:4]`, etc.

#### <font color="blue"> APPLICATION:</font> Plot of I05 brightness temperature of fire pixel@1


[Source code](https://hdfeos.org/zoo/LAADS/VNP14IMG_NRT.A2018064.1200.001.nc.py)

[Plot](https://hdfeos.org/zoo/LAADS/VNP14IMG_NRT.A2018064.1200.001.nc.py.png)

In [None]:
with nc4.Dataset(nc_file,'r') as ncfid:
    lons = ncfid.variables['FP_longitude'][:] # longitude grid points
    lats = ncfid.variables['FP_latitude'][:]  # latitude grid points
    var = ncfid.variables['FP_T5']
    data = var[:]
    units = var.units
    long_name = var.long_name

In [None]:
print(f"Shape of lons: {np.shape(lons)}")
print(f"Shape of lats: {np.shape(lats)}")
print(f"Shape of data: {np.shape(data)}")
print(f"Units:         {units}")
print(f"Long name:     {long_name}")

In [None]:
fig = plt.figure(figsize=(9, 5))

# Get the central latitude and longitude
lat_m = lats[lats.shape[0]//2]
lon_m = lons[lons.shape[0]//2]

# Orthographic projection.
map_projection = ccrs.Orthographic(central_longitude=lon_m,
                                   central_latitude=lat_m,
                                   globe=None)

data_transform = ccrs.PlateCarree()

ax = fig.add_subplot(1, 1, 1, projection=map_projection)

# Remove the following to see zoom-in view.
ax.set_global()

# Plot on map.
p = plt.scatter(lons, lats, c=data, s=1, cmap=plt.cm.jet,
                transform=data_transform)
# Put grids.
gl = ax.gridlines()

# Put coast lines.
ax.coastlines()

# Adjust colorbar size and location using fraction and pad.
cb = plt.colorbar(p, fraction=0.022, pad=0.01)
cb.set_label(units, fontsize=8)

# Put title.
basename = os.path.basename(nc_file)
plt.title('{0}\n{1}'.format(basename, long_name), fontsize=8);

---

## <font color="red"> HDF4</font>

### <font color="blue"> What is HDF4? </font>

- HDF4 is an older hierarchical data format as compared to HDF5.
- HDF4 files are self-describing.
- It supports annotated multidimensional arrays (also called scientific data sets), raster images, tables, etc.
- One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.
- Users can create their own grouping structures called "vgroups."
- Does not support files larger than 2 GB.
- HDF4 is still the primary data format used for the MODIS data products published by NASA.

### <font color="blue"> What is pyhdf? </font>

- Python interface to HDF4.
- Implements the SD (scientific dataset), VS (Vdata) and V (Vgroup) APIs.
- SD datasets are read/written through numpy arrays. 
- It supports both Python 2 and Python 3.

In [None]:
from pyhdf.SD import SD
from pyhdf.SD import SDS
from pyhdf.SD import SDC
from pyhdf.SD import SDim
from pyhdf.SD import SDAttr

#### pyHDF Import Context

- The `SD` (Scientific Data) class is used for file and top-level info access and implements the HDF SD interface.
- The `SDS` (Scientific Dataset) class is used for dataset objects.
- The `SDC` (Scientific Data Constants) class holds the constants that define file opening modes and data types.
- The `SDim` (Scientific Data Dimensions) class is used for dimension objects.
- The `SDAttr` (Scientific Data Attributes) class is used for attribute objects.

In [None]:
hdf4_file_name = os.path.join(data_dir, "VNP09_NRT.A2018064.0524.001.hdf")

#### Open File

In [None]:
hdf = SD(hdf4_file_name, SDC.READ)

#### Information on the file

- `info()` returns two numbers: the number of datasets in the file and the number of global (file-level) attributes.

In [None]:
hdf.info()

#### Obtain the file attributes

In [None]:
global_attrs = hdf.attributes()
pprint.pprint(global_attrs)

In [None]:
for i in range(len(global_attrs)):
    print(hdf.attr(i).info())

#### List all available SDS datasets

In [None]:
pprint.pprint(hdf.datasets())

In [None]:
datasets_dict = hdf.datasets()

for index, name in enumerate(datasets_dict.keys()):
    print(f"{index}: --> {name}")

#### Data Extraction

- The `select()` method from the `SD` class allows us to extract a dataset (object) given its name or index number.

In [None]:
field_name = "375m Surface Reflectance Band I1"

In [None]:
field_ds = hdf.select(field_name) # selects a dataset

We can get information on the dataset:

In [None]:
field_ds.info()

To retrieve the data as a NumPy array:

In [None]:
field_data = field_ds.get()

In [None]:
field_data.shape

In [None]:
field_data_slice = field_data[:,0].astype(np.double)

We can also get the data directly from the dataset:

In [None]:
field_data = field_ds[:,:].astype(np.double)
field_data_slice = field_ds[:,0].astype(np.double)

From the `SDS` class, we can also access the dimension names and sizes using the `dimensions()` function.

In [None]:
field_ds.dimensions()

In [None]:
field_attrs = field_ds.attributes(full=1)
field_attrs

In [None]:
field_attrs["FILL_VALUES"]

In [None]:
fill_values = set()
for a in field_attrs["FILL_VALUES"][0].split(","):
    b = a.split("=")[1]
    fill_values.add(float(b))
    
fill_values
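
The parsing in the cell above relies on the attribute text being a comma-separated list of `label=value` pairs. A standalone sketch of the same logic is shown below; the helper name `parse_fill_values` and the example string are hypothetical, and the labels and numbers are made up rather than taken from the file.

```python
def parse_fill_values(text):
    """Extract the numeric fill values from a 'label=value,label=value' string."""
    values = set()
    for item in text.split(","):
        # Keep only what follows the first '=' sign.
        _, _, number = item.partition("=")
        values.add(float(number))
    return values


# Hypothetical attribute text; the labels and numbers are made up.
example_text = "NA=-28672,L1B_MISSING=-100,CLOUD=-100"
parsed = parse_fill_values(example_text)
print(sorted(parsed))
```

Using a set removes duplicates when several labels share the same numeric value. An entry without an `=` sign would raise a `ValueError` here, which is a reasonable signal that the attribute does not follow the assumed layout.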

In [None]:
long_name = field_name
data = field_data

scale_factor = field_attrs["Scale"][0]        
add_offset = field_attrs["Offset"][0]

In [None]:
print(f"Shape of data:  {np.shape(data)}")
print(f"Min/Max Values: {np.nanmin(data)}/{np.nanmax(data)}")
print(f"Long Name:      {long_name}")
print(f"Scale:          {scale_factor}")
print(f"Offset:         {add_offset}")
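# No unit was read for this dataset; look for a "units" attribute (it may be absent).
print(f"Units:          {field_attrs.get('units', ('not provided',))[0]}")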

In [None]:
for _FillValue in fill_values:
    data[data == _FillValue] = np.nan

data -= add_offset
data *= scale_factor
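
The restoration step can be tried on a synthetic array. The sketch below mirrors the order of operations used in the cell above, `(stored - offset) * scale`; the function name `restore` and all numbers are assumptions for illustration, and the product's user guide remains the reference for which packing convention applies.

```python
import numpy as np


def restore(raw, fill_values, scale, offset):
    """Return physical values as (raw - offset) * scale, with fills set to NaN."""
    # Work on a floating-point copy so NaN can be stored and `raw` is untouched.
    data = np.asarray(raw, dtype=np.float64).copy()
    for fill in fill_values:
        data[data == fill] = np.nan
    return (data - offset) * scale


# Synthetic stored integers; -28672 plays the role of a fill value.
raw_counts = np.array([[1000, 2000], [-28672, 4000]], dtype=np.int16)
restored = restore(raw_counts, fill_values={-28672.0}, scale=1.0e-4, offset=0.0)

print(restored)
print(np.nanmin(restored), np.nanmax(restored))
```

Converting to a floating-point type before masking matters because integer arrays cannot hold `NaN`.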

In [None]:
print(f"Min/Max Values: {np.nanmin(data)}/{np.nanmax(data)}")

In [None]:
hdf.end()

In [None]:
fig, ax = plt.subplots()
ax.imshow(data);

In [None]:
fig, ax = plt.subplots()
im = ax.imshow(data, interpolation='nearest')
ax.set_title(long_name)
plt.colorbar(im);

## <font color='red'>HDF5</font>

### <font color='blue'> What is HDF5?</font>

* HDF5 is a file format and library for storing scientific data.  
* It supports files larger than 2 GB and parallel I/O.
* It uses a "file directory" like structure that allows you to organize data within the file in many different structured ways, as you might do with files on your computer. 
* The HDF5 format allows for embedding of metadata making it self-describing.
* An HDF5 file is a container for two kinds of objects: 
   1. **Datasets**: Array-like collections of data.
   2. **Groups**: Folder-like containers that hold datasets and other groups.
* Each group or dataset can have an associated attribute list to provide extra information related to the object.
   
![hdf5](https://miro.medium.com/max/1400/0*_vh8GIkBQNOg42uv.jpg)
Image Source: [https://www.neonscience.org/about-hdf5](https://www.neonscience.org/about-hdf5)
   
- HDF5 is conceptually related to HDF4 but incompatible: the HDF5 library cannot directly read HDF4 files or work with the HDF4 library.

### <font color='blue'> What is h5py?</font>

* h5py is a Python interface to the HDF5 library.
* Provides an easy-to-use high-level interface, which allows you to store huge amounts of numerical data.
* Easily manipulate that data from NumPy. 
* Use straightforward NumPy and Python metaphors, like dictionary and NumPy array syntax. 
* Within h5py, HDF5 groups work like dictionaries, and datasets work like NumPy arrays.
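
The dictionary and array analogy can be illustrated with a small file written to a temporary directory. This is a sketch only: the file name, group names, dataset name, and attribute values are all made up.

```python
import os
import tempfile

import h5py
import numpy as np

tmp_dir = tempfile.mkdtemp()
demo_h5 = os.path.join(tmp_dir, "demo_groups.h5")

with h5py.File(demo_h5, "w") as f:
    # Intermediate groups along the path are created automatically.
    grid = f.create_group("grids/demo_grid")
    dset = grid.create_dataset("radiance", data=np.arange(12).reshape(3, 4))
    dset.attrs["units"] = "arbitrary"             # attribute on a dataset
    grid.attrs["description"] = "synthetic grid"  # attribute on a group

with h5py.File(demo_h5, "r") as f:
    top_level = list(f.keys())                 # dictionary-style listing
    dset = f["grids/demo_grid/radiance"]       # dictionary-style lookup by path
    first_row = dset[0, :]                     # NumPy-style slicing reads a subset
    demo_shape = dset.shape
    demo_units = dset.attrs["units"]

print(top_level, demo_shape, first_row, demo_units)
```

Values must be copied out (as with `first_row`) before the `with` block ends, because dataset handles become invalid once the file is closed.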

In [None]:
import h5py

In [None]:
hdf5_file_name = os.path.join(data_dir, "VNP46A1.A2020302.h07v07.001.2020303075447.h5")

Basic content of the file:

In [None]:
with h5py.File(hdf5_file_name, 'r') as hdfid:
    for var in hdfid.keys():
        print(f"{var}")

Check if we have `Dataset` or `Group`:

```python
   isGroup   = isinstance(item, h5py.Group)
   isDataset = isinstance(item, h5py.Dataset)
```

In [None]:
with h5py.File(hdf5_file_name, 'r') as hdfid:
    for var in hdfid.keys():
        obj = hdfid[var]
        if isinstance(obj, h5py.Group):
            print(f"{var:>25}: --> Group")
        elif isinstance(obj, h5py.Dataset):
            print(f"{var:>25}: --> Dataset")
        else:
            print(f"{var:>25}: --> unknown type")

#### Get specific information about an item

```python
  item.id     
  item.ref     
  item.parent  
  item.file   
  item.name 
```

In [None]:
with h5py.File(hdf5_file_name, 'r') as hdfid:
     for var in hdfid.keys():
         obj = hdfid[var]
         print(f"{obj.name:>20} --> {obj.parent}")

#### Identify the datasets

```python
  isinstance(obj, h5py.Dataset)

  ds.dtype     
  ds.shape     
  ds[()]       # all values (ds.value was removed in h5py 3.0)
```

In [None]:
with h5py.File(hdf5_file_name, 'r') as hdfid:
    for var in hdfid.keys():
        obj = hdfid[var]
        if isinstance(obj, h5py.Dataset):
            print(f"{var}: \n\t Type: {obj.dtype} \t Shape: {obj.shape}")

**Get item attributes for File or Group (if attributes available)**

```python
item.attrs  # for example: <Attributes of HDF5 object at 230141696>
item.attrs.keys() # for example: ['start.seconds', 'start.nanoseconds']
item.attrs.values() # for example: [1297608424, 627075857]
len(item.attrs)
```

In [None]:
with h5py.File(hdf5_file_name, 'r') as hdfid:
    mylist = list(hdfid.keys())
    for var in mylist:
        obj = hdfid[var]
        print(obj.attrs.keys(), len(obj.attrs))

**List the names of all objects (groups and datasets):** Use `visit`

In [None]:
with h5py.File(hdf5_file_name, 'r') as hdfid:
     hdfid.visit(print)

**List the names of the datasets and the corresponding objects**

In [None]:
def my_func(name):
    print(f"{name}: \n\t {hdfid[name]}")

with h5py.File(hdf5_file_name, 'r') as hdfid:
    hdfid.visit(my_func)

**List all objects (groups and datasets) and their attributes:** Use `visititems`

In [None]:
def print_all(name, obj):
    print(f"{name}: \n\t {dict(obj.attrs)}")

with h5py.File(hdf5_file_name, 'r') as hdfid:
     hdfid.visititems(print_all)

**List each item and determine if it is a group or a dataset**

In [None]:
def print_all_2(name, obj):
    if isinstance(obj, h5py.Group):
        print(f"{name:>25}: --> Group")
    elif isinstance(obj, h5py.Dataset):
        print(f"{name:>25}: --> Dataset")
    else:
        print(f"{name:>25}: --> unknown type")

with h5py.File(hdf5_file_name, 'r') as hdfid:
     hdfid.visititems(print_all_2)
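
Instead of printing, the callback passed to `visititems` can collect results for later use. The sketch below assumes a synthetic file with made-up object names and records the shape of every dataset while skipping groups.

```python
import os
import tempfile

import h5py
import numpy as np

tmp_dir = tempfile.mkdtemp()
demo_visit = os.path.join(tmp_dir, "demo_visit.h5")

with h5py.File(demo_visit, "w") as f:
    f.create_dataset("a/b/values", data=np.zeros((2, 3)))
    f.create_dataset("a/flags", data=np.zeros(5, dtype="u1"))
    f.create_group("empty_group")

dataset_shapes = {}


def collect_datasets(name, obj):
    """Record the shape of every dataset; groups are skipped."""
    if isinstance(obj, h5py.Dataset):
        dataset_shapes[name] = obj.shape
    # Returning None tells visititems to keep walking the tree.


with h5py.File(demo_visit, "r") as f:
    f.visititems(collect_datasets)

print(dataset_shapes)
```

The names passed to the callback are relative to the object on which `visititems` was called, so they can be used directly as keys, for example `f[name]`.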

#### <font color="blue">APPLICATION:</font> Read and restore the values of the `Brightness Temperature of band M12`

In [None]:

GRID_NAME = 'VNP_Grid_DNB'
DATAFIELD_NAME = 'BrightnessTemperature_M12'

with h5py.File(hdf5_file_name, mode='r') as f:        
    name = f'/HDFEOS/GRIDS/{GRID_NAME}/Data Fields/{DATAFIELD_NAME}'
    data = f[name][:].astype(np.float64)
    # Read attributes.
    scale_factor = f[name].attrs['scale_factor'][0]
    add_offset = f[name].attrs['add_offset'][0]
    units = f[name].attrs['units'].decode()
    _FillValue = f[name].attrs['_FillValue'][0]
    long_name = f[name].attrs['long_name'].decode()
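
The cell above assumes that numeric attributes arrive as one-element arrays and that text attributes arrive as bytes. Depending on how a file was written and on the h5py version, an attribute may instead be a plain scalar or already a `str`. A defensive sketch, using two hypothetical helpers `attr_to_text` and `attr_to_scalar`, could look like this:

```python
import numpy as np


def attr_to_text(value):
    """Return an attribute value as a Python str, whatever form it arrives in."""
    if isinstance(value, np.ndarray):
        value = value.flat[0]          # take the first element of an array
    if isinstance(value, bytes):       # np.bytes_ is a subclass of bytes
        return value.decode("utf-8")
    return str(value)


def attr_to_scalar(value):
    """Return a numeric attribute as a Python float, whether scalar or array."""
    return float(np.asarray(value).reshape(-1)[0])


print(attr_to_text(b"Kelvins"), attr_to_text("Kelvins"))
print(attr_to_scalar(np.array([0.0025])), attr_to_scalar(203.0))
```

The sample values are illustrative only and are not read from the file.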

In [None]:
print(f"Shape of data: {np.shape(data)}")
print(f"Scale:         {scale_factor}")
print(f"Offset:        {add_offset}")
print(f"Fill Value:    {_FillValue}")
print(f"Units:         {units}")
print(f"Long name:     {long_name}")

In [None]:
print(f"Minimum: {data.min()}")
print(f"Maximum: {data.max()}")
print(f"Mean:    {data.mean()}")
print(f"STD:     {data.std()}")

Restoration:

In [None]:
data[data == _FillValue] = np.nan
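# CF-style packing (attributes named scale_factor/add_offset):
# physical value = stored value * scale_factor + add_offset
data *= scale_factor
data += add_offset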

In [None]:
# The array now contains NaN, so use the NaN-aware statistics.
print(f"Minimum: {np.nanmin(data)}")
print(f"Maximum: {np.nanmax(data)}")
print(f"Mean:    {np.nanmean(data)}")
print(f"STD:     {np.nanstd(data)}")

In [None]:
fig, ax = plt.subplots()
im = ax.imshow(data, interpolation='nearest',cmap=cm.hot)
ax.set_title(long_name)
plt.colorbar(im);