# Exploring the GEDI HDF5 file structure

**author**: Stefanie Lumnitz, stefanie.lumnitz@esa.int

**goal**: 
1. discover how to work with the HDF5 file format
2. explore best practices for HDF5 files
3. explore GEDI 2A and 2B data structures & content

**content**:
* The HDF5 file structure
* Performance & Scalability
    * HDF5 Best Practices
    * Scalability through the HDF5 API and functionality

## The HDF5 file structure

GEDI data is distributed as `.h5` files, i.e. in version 5 of the Hierarchical Data Format (HDF5). An HDF5 file is essentially a container that holds multidimensional arrays of scientific data together with their attributes. Imagine a whole Unix directory tree, with directories and values, stored inside a single file. HDF5 files are commonly opened and analysed in Python with one of the following libraries:
* [h5py](https://www.h5py.org/)
    * can store and work with scalars and numpy arrays
    * more Pythonic and easier to use
* PyTables
    * stores python objects and classes as attributes
    * can be much faster for queries on tables, as conditions are evaluated in-kernel instead of loading the full array into memory first
    * native support for compression and chunking of datasets during write operations

For more information on HDF5 check out Tom Kooij's tutorial [HDF5 take 2](https://www.youtube.com/watch?v=ofLFhQ9yxCw).

In [1]:
import h5py
import os

In [2]:
file_path_1B = os.path.join("/home","stef","Testbed","00_data","GEDI", "GEDI01_B_2019122150008_O02186_T04733_02_003_01.h5")

with h5py.File(file_path_1B, 'r') as f_1B: # open file in read mode
    keys = list(f_1B)
    
keys

['BEAM0000',
 'BEAM0001',
 'BEAM0010',
 'BEAM0011',
 'BEAM0101',
 'BEAM0110',
 'BEAM1000',
 'BEAM1011',
 'METADATA']

In [3]:
file_path_2B = "/home/stef/Testbed/00_data/GEDI/GEDI02_B_2019113083317_O02042_T04038_02_001_01.h5"
f_2B = h5py.File(file_path_2B, 'r')
list(f_2B)

['BEAM0000',
 'BEAM0001',
 'BEAM0010',
 'BEAM0011',
 'BEAM0101',
 'BEAM0110',
 'BEAM1000',
 'BEAM1011',
 'METADATA']

In [4]:
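# Careful: without parentheses this only returns the bound method, the file stays open.
# That is convenient here because later cells still read from f_2B; call f_2B.close() when done.
f_2B.close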

<bound method File.close of <HDF5 file "GEDI02_B_2019113083317_O02042_T04038_02_001_01.h5" (mode r)>>

In [5]:
# !h5ls {file_path_2B}

The hierarchical HDF5 format is built from *groups*, folder-like containers that hold datasets and other groups, and *datasets*, array-like collections of data. Object names resemble Unix file paths. GEDI data is organized in eight groups representing ground beams (`BEAM0000`, `BEAM0001`, `BEAM0010`, `BEAM0011`, ...) plus one `METADATA` group. A description of groups and datasets can also be found [here for Level 1B Products](file:///tmp/mozilla_stef0/gedi_l1b_product_data_dictionary_P003_v1.html) and [here for Level 2B Products](file:///tmp/mozilla_stef0/gedi_l2b_dictionary_P001_v1.html).
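To make the distinction between groups and datasets concrete, here is a minimal sketch that builds a toy file with a GEDI-like layout and lists its contents. The file name `toy_gedi.h5`, the values and the reduced set of groups are made up for illustration; only the names `BEAM0000`, `METADATA`, `cover` and `geolocation/lat_lowestmode` are taken from this notebook.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical toy file, not a real GEDI product
toy_path = os.path.join(tempfile.mkdtemp(), "toy_gedi.h5")

with h5py.File(toy_path, "w") as f:
    beam = f.create_group("BEAM0000")              # group: folder-like container
    beam.create_dataset("cover", data=np.array([0.1, 0.5, 0.9]))  # dataset: array-like
    geo = beam.create_group("geolocation")         # groups can be nested
    geo.create_dataset("lat_lowestmode", data=np.array([10.0, 10.1, 10.2]))
    f.create_group("METADATA")

with h5py.File(toy_path, "r") as f:
    top_level = list(f)                            # names of the root members
    is_group = isinstance(f["BEAM0000"], h5py.Group)
    is_dataset = isinstance(f["BEAM0000/cover"], h5py.Dataset)
    cover_values = f["BEAM0000/cover"][:]          # read the dataset into a numpy array

print(top_level, is_group, is_dataset, cover_values)
```

Groups behave like dictionaries, while datasets behave like numpy arrays that are only read from disk when they are indexed.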

Let's investigate the BEAM0000 group.

In [6]:
# list group members (aka dictionary keys) and attached objects
# list(f_2B['BEAM0000'].items())

# list dictionary keys (member names) only
# list(f_2B['BEAM0000'])

In [7]:
def printname(name):
    """Print the name of an object in a GEDI file.
    
    Note: callback used in combination with f.visit(printname)
    """
    print(name)

In [8]:
f_2B['BEAM0000'].visit(printname)

algorithmrun_flag
ancillary
ancillary/dz
ancillary/l2a_alg_count
ancillary/maxheight_cuttoff
ancillary/phony_dim_52
ancillary/rg_eg_constraint_center_buffer
ancillary/rg_eg_mpfit_max_func_evals
ancillary/rg_eg_mpfit_maxiters
ancillary/rg_eg_mpfit_tolerance
ancillary/signal_search_buff
ancillary/tx_noise_stddev_multiplier
beam
channel
cover
cover_z
fhd_normal
geolocation
geolocation/degrade_flag
geolocation/delta_time
geolocation/digital_elevation_model
geolocation/elev_highestreturn
geolocation/elev_lowestmode
geolocation/elevation_bin0
geolocation/elevation_bin0_error
geolocation/elevation_lastbin
geolocation/elevation_lastbin_error
geolocation/height_bin0
geolocation/height_lastbin
geolocation/lat_highestreturn
geolocation/lat_lowestmode
geolocation/latitude_bin0
geolocation/latitude_bin0_error
geolocation/latitude_lastbin
geolocation/latitude_lastbin_error
geolocation/local_beam_azimuth
geolocation/local_beam_elevation
geolocation/lon_highestreturn
geolocation/lon_lowestmode
geoloca

We can also create a list of all objects in the file:

In [9]:
f_2B_obj = []
f_2B.visit(f_2B_obj.append)
# f_2B_obj
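
`visit` only passes names to the callback. If the objects themselves are needed, for example to keep datasets and skip groups, h5py also offers `visititems`. The following is a minimal sketch on a made-up toy file; the helper name `collect_datasets` and the file `visit_demo.h5` are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def collect_datasets(group):
    """Return {relative path: (shape, dtype)} for every dataset below `group`."""
    found = {}

    def visitor(name, obj):
        # visititems passes the relative name and the object itself
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, str(obj.dtype))
        # returning None tells h5py to keep walking the tree

    group.visititems(visitor)
    return found


# Toy file for demonstration only
demo_path = os.path.join(tempfile.mkdtemp(), "visit_demo.h5")
with h5py.File(demo_path, "w") as f:
    # intermediate groups are created automatically
    f.create_dataset("BEAM0000/cover", data=np.zeros(4))
    f.create_dataset("BEAM0000/geolocation/delta_time", data=np.arange(4.0))

with h5py.File(demo_path, "r") as f:
    datasets = collect_datasets(f["BEAM0000"])

for path, (shape, dtype) in datasets.items():
    print(path, shape, dtype)
```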

As we can see, the `f_2B['BEAM0000']` group consists of several subgroups. What separates the HDF5 file system from a normal Unix path is that attributes, little pieces of metadata, can be attached directly to groups, subgroups and datasets. Here is an example, investigating the attributes attached to the `cover` dataset:

In [10]:
!h5ls -vlr {file_path_2B}/BEAM0001/cover

Opened "/home/stef/Testbed/00_data/GEDI/GEDI02_B_2019113083317_O02042_T04038_02_001_01.h5" with sec2 driver.
BEAM0001/cover           Dataset {343295/343295}
    Attribute: DIMENSION_LIST {1}
        Type:      variable length of
                   object reference
        Data:  (DATASET-1:1286593865)
    Attribute: _FillValue scalar
        Type:      native double
        Data:  -9999
    Attribute: coordinates scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "geolocation/delta_time geolocation/lat_lowestmode geolocation/lon_lowestmode"
    Attribute: description scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "Total canopy cover, defined as the percent of the ground covered by the vertical projection of canopy material"
    Attribute: long_name scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "Total cover"
    Attribute: units scalar
        Type:      variable-length nul

In [11]:
dset = f_2B['BEAM0000/cover']

In [12]:
list(dset.attrs)

['DIMENSION_LIST',
 '_FillValue',
 'coordinates',
 'description',
 'long_name',
 'units',
 'valid_range']

In [13]:
dset.attrs['description']

'Total canopy cover, defined as the percent of the ground covered by the vertical projection of canopy material'

In [14]:
dset.attrs['coordinates']

'geolocation/delta_time geolocation/lat_lowestmode geolocation/lon_lowestmode'

The `coordinates` attribute, for example, provides the path names (dictionary keys) of the spatial coordinates used with the layer.
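
As a sketch of how these attributes could be used together, the example below resolves the `coordinates` paths relative to the beam group and masks out `_FillValue` entries. It runs on a made-up toy file (`attrs_demo.h5`, with invented values and a shortened `coordinates` string) and assumes the paths in the attribute are relative to the beam group, as the listing of `BEAM0000` suggests.

```python
import os
import tempfile

import h5py
import numpy as np

attr_path = os.path.join(tempfile.mkdtemp(), "attrs_demo.h5")

# Build a toy beam group that mimics some of the attributes shown above
with h5py.File(attr_path, "w") as f:
    beam = f.create_group("BEAM0000")
    cover = beam.create_dataset("cover", data=np.array([0.2, -9999.0, 0.7]))
    cover.attrs["_FillValue"] = -9999.0
    cover.attrs["coordinates"] = "geolocation/lat_lowestmode geolocation/lon_lowestmode"
    beam.create_dataset("geolocation/lat_lowestmode", data=np.array([1.0, 2.0, 3.0]))
    beam.create_dataset("geolocation/lon_lowestmode", data=np.array([4.0, 5.0, 6.0]))

with h5py.File(attr_path, "r") as f:
    beam = f["BEAM0000"]
    cover = beam["cover"]
    coord_attr = cover.attrs["coordinates"]
    if isinstance(coord_attr, bytes):              # older h5py versions may return bytes
        coord_attr = coord_attr.decode()
    # The attribute is a space-separated list of paths (assumed relative to the beam group)
    coords = {path.split("/")[-1]: beam[path][:] for path in coord_attr.split()}
    values = cover[:]
    valid = values != cover.attrs["_FillValue"]    # boolean mask that drops fill values

lat_valid = coords["lat_lowestmode"][valid]
print(values[valid], lat_valid)
```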

# Performance & Scalability

## Performance through Best Practices

* access subgroups with a single path: `f['group/subgroup']` is efficient, chained lookups like `f['group']['subgroup']` are inefficient
* use standard Python membership tests: `if 'name' in group`, NOT `if 'name' in group.keys()`
* don't forget to close the HDF5 file
* or open it with a `with h5py.File(file_path_1B, 'r') as f_1B:` statement, which closes the file automatically
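
The practices listed above can be sketched on a small toy file as follows. The file name `best_practice_demo.h5` and its content are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

bp_path = os.path.join(tempfile.mkdtemp(), "best_practice_demo.h5")

with h5py.File(bp_path, "w") as f:
    f.create_dataset("BEAM0000/geolocation/delta_time", data=np.arange(3.0))

# The with-statement closes the file even if an exception is raised inside it
with h5py.File(bp_path, "r") as f:
    # One lookup with a full path ...
    direct = f["BEAM0000/geolocation/delta_time"][:]
    # ... instead of chained lookups that each create an intermediate object
    chained = f["BEAM0000"]["geolocation"]["delta_time"][:]

    # Membership test on the group itself, no list of keys is built
    has_cover = "cover" in f["BEAM0000"]
    has_geolocation = "geolocation" in f["BEAM0000"]

file_is_closed = not bool(f)    # a closed h5py file object evaluates to False
print(direct, has_cover, has_geolocation, file_is_closed)
```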


## Scalability through the HDF5 API & Functionality

**core driver**: the core driver stores your file entirely in memory, which makes I/O operations very fast. Beware that there is a limit to how much data fits into memory, so only use this for quick data exploration of small files. You can set the driver to core using:

In [15]:
# f = h5py.File(file_path_1B, driver="core")
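
A minimal, runnable sketch of the core driver on a toy file is given below. The file names and values are made up; `backing_store=False` is an additional h5py option for scratch files that should not be written to disk.

```python
import os
import tempfile

import h5py
import numpy as np

core_path = os.path.join(tempfile.mkdtemp(), "core_demo.h5")
with h5py.File(core_path, "w") as f:
    f.create_dataset("cover", data=np.linspace(0.0, 1.0, 5))

# Read mode with the core driver: the whole file is loaded into memory on open
with h5py.File(core_path, "r", driver="core") as f:
    driver_name = f.driver
    mean_cover = float(f["cover"][:].mean())

# A purely in-memory scratch file: with backing_store=False nothing is saved on close
scratch_path = os.path.join(tempfile.mkdtemp(), "scratch.h5")
with h5py.File(scratch_path, "w", driver="core", backing_store=False) as scratch:
    scratch["tmp"] = np.arange(10)
    scratch_sum = int(scratch["tmp"][:].sum())

print(driver_name, mean_cover, scratch_sum)
```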

**chunking**: It is possible to chunk HDF5 datasets and to work with chunked files. This is worth doing if the file is reused many times, as it can speed up I/O operations.

In [16]:
!h5ls -vlr {file_path_2B}/BEAM0001/cover

Opened "/home/stef/Testbed/00_data/GEDI/GEDI02_B_2019113083317_O02042_T04038_02_001_01.h5" with sec2 driver.
BEAM0001/cover           Dataset {343295/343295}
    Attribute: DIMENSION_LIST {1}
        Type:      variable length of
                   object reference
        Data:  (DATASET-1:1286593865)
    Attribute: _FillValue scalar
        Type:      native double
        Data:  -9999
    Attribute: coordinates scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "geolocation/delta_time geolocation/lat_lowestmode geolocation/lon_lowestmode"
    Attribute: description scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "Total canopy cover, defined as the percent of the ground covered by the vertical projection of canopy material"
    Attribute: long_name scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "Total cover"
    Attribute: units scalar
        Type:      variable-length nul

The arrays are already chunked: the complete output of the command above contains the line `Chunks:    {14200} 56800 bytes`, indicating that there is not much to optimize here.

(Other questions to consider: Do the chunks fit the data access patterns? Are the chunks optimized for spatial subsetting? Because we are doing the I/O operations on a machine with 2 CPUs and 4 GB RAM, we need to be aware of the chunk cache. The fact that the kernel does not crash when opening the file suggests that the chunking performs well.)
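
The chunk layout can also be inspected from h5py. The sketch below recreates a dataset with the same length and chunk shape as reported by `h5ls` in a made-up toy file. The `float32` dtype is an assumption, inferred from 56800 bytes / 14200 values = 4 bytes per value, and the chunk cache size passed as `rdcc_nbytes` is an arbitrary example value.

```python
import os
import tempfile

import h5py

chunk_path = os.path.join(tempfile.mkdtemp(), "chunk_demo.h5")
n_shots = 343295                      # dataset length shown in the h5ls output

with h5py.File(chunk_path, "w") as f:
    # float32 is an assumption: 14200 values * 4 bytes = 56800 bytes per chunk
    f.create_dataset("cover", shape=(n_shots,), dtype="float32", chunks=(14200,))

# rdcc_nbytes sets the raw chunk cache size per dataset (4 MiB here, arbitrary)
with h5py.File(chunk_path, "r", rdcc_nbytes=4 * 1024**2) as f:
    dset = f["cover"]
    chunk_shape = dset.chunks                        # None for contiguous datasets
    chunk_bytes = chunk_shape[0] * dset.dtype.itemsize
    # A read aligned with the chunk boundaries touches exactly one chunk
    first_chunk = dset[0:chunk_shape[0]]

print(chunk_shape, chunk_bytes, first_chunk.shape)
```

Reads that follow the chunk boundaries avoid loading chunks that are only partially needed, which is why the chunk shape should match the typical access pattern.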

**compression**: In HDF5, uncompressed datasets are faster to read and write. The `.zarr` data format, for example, is highly optimized for Blosc compression and is faster to read or write when compressed.
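
What compression buys in return is a smaller file. The following sketch writes the same toy array once uncompressed and once with the gzip filter and compares the file sizes. The data is artificially repetitive, so real GEDI data will compress far less; file names and the compression level are arbitrary choices.

```python
import os
import tempfile

import h5py
import numpy as np

tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "raw.h5")
gzip_path = os.path.join(tmp_dir, "gzip.h5")

# Highly repetitive toy data, so it compresses extremely well
data = np.zeros(1_000_000, dtype="float32")

with h5py.File(raw_path, "w") as f:
    f.create_dataset("cover", data=data, chunks=(10_000,))

with h5py.File(gzip_path, "w") as f:
    # compression filters work on chunked datasets
    f.create_dataset("cover", data=data, chunks=(10_000,),
                     compression="gzip", compression_opts=4)

with h5py.File(gzip_path, "r") as f:
    used_filter = f["cover"].compression                   # name of the filter in use
    roundtrip_ok = bool((f["cover"][:] == data).all())     # decompression is transparent

raw_size = os.path.getsize(raw_path)
gzip_size = os.path.getsize(gzip_path)
print(used_filter, raw_size, gzip_size, roundtrip_ok)
```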

**reading from disk**: Note that using `pd.HDFStore.select` with a `where` query allows you to read file content directly from disk instead of loading it into memory first. This process is slower but allows you to read larger datasets when memory is the bottleneck.
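
The same idea can be sketched with h5py alone. This is not the pandas approach named above but a swapped-in alternative: the dataset is processed block by block, so only one block is held in memory at a time. The helper `count_above`, the block size and the toy file are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

block_path = os.path.join(tempfile.mkdtemp(), "block_demo.h5")
with h5py.File(block_path, "w") as f:
    f.create_dataset("cover", data=np.linspace(0.0, 1.0, 100_000), chunks=(10_000,))


def count_above(dset, threshold, block_size=10_000):
    """Count values above `threshold`, holding only one block in memory at a time."""
    total = 0
    for start in range(0, dset.shape[0], block_size):
        block = dset[start:start + block_size]    # only this slice is read from disk
        total += int((block > threshold).sum())
    return total


with h5py.File(block_path, "r") as f:
    n_dense = count_above(f["cover"], threshold=0.5)

print(n_dense)
```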

**parallel HDF5**: (Note: if I/O operations are the bottleneck, parallel processing will not give you any speed-up here; it only helps with CPU-intensive processing.) To make use of parallel processing with HDF5, check that `Threadsafety` is `no` and `Parallel HDF5` is `yes` in your local configuration. You can check your configuration using:

In [17]:
!h5cc -showconfig

	    SUMMARY OF THE HDF5 CONFIGURATION

General Information:
-------------------
                   HDF5 Version: 1.10.4
                  Configured on: Wed Dec 19 18:26:52 UTC 2018
                  Configured by: root@3dad7c19-81ba-4672-4f33-547177f88490
                    Host system: x86_64-conda_cos6-linux-gnu
              Uname information: Linux 3dad7c19-81ba-4672-4f33-547177f88490 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
                       Byte sex: little-endian
             Installation point: /home/stef/miniconda3/envs/gedi

Compiling Options:
------------------
                     Build Mode: production
              Debugging Symbols: no
                        Asserts: no
                      Profiling: no
             Optimization Level: high

Linking Options:
----------------
                      Libraries: static, shared
  Statically Linked Executables: 
                        LDFLAGS: -Wl,-O2 -Wl,--sort-com
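
A related check can be done from within Python. This sketch reports whether the installed h5py was built with MPI support and which HDF5 version it is linked against. Note that this describes the h5py build, which may differ from the HDF5 installation that `h5cc` reports on.

```python
import h5py

config = h5py.get_config()
mpi_enabled = bool(config.mpi)              # True only if h5py was built against parallel HDF5
hdf5_version = h5py.version.hdf5_version    # HDF5 library version h5py is linked against

print("HDF5 version:", hdf5_version)
print("Parallel (MPI) build:", mpi_enabled)
# In an MPI build, a file would be opened with driver="mpio" and an mpi4py
# communicator; that part is omitted here because it needs an MPI launcher.
```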

More information on how to change your local settings can be found in Tom Kooij's tutorial [HDF5 take 2](https://www.youtube.com/watch?v=ofLFhQ9yxCw), presented at SciPy 2017, from minute 2:35:00 onwards. All of the above is mainly of interest for recurring and very large data processing tasks.