
Load datasets stored in .h5 file format
=======================================

This example demonstrates how to load data from a stored .h5 file and how to build a 
data generator from it.

First, we create a small temporary dataset with `Dataset1`, comprising 5 source cases and using the CSM (cross-spectral matrix) as the input feature.

In [4]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # reduce tensorflow log output for doc purposes (must be set before importing tensorflow)
import tensorflow as tf
from acoupipe.datasets.dataset1 import Dataset1

# training dataset
d1 = Dataset1(features=["csm"])

# save to .h5 file
d1.save_h5(split="training", size=5, name="/tmp/dataset.h5")

100%|██████████| 5/5 [00:03<00:00,  1.47it/s]


The AcouPipe toolbox provides the `LoadH5Dataset` class to load datasets stored in HDF5 format.
Each individual sample (source case) can be accessed through the `h5f` attribute of the class. For example, to extract the input feature ('csm' in this case) of the first sample:


In [5]:
from acoupipe.loader import LoadH5Dataset

dataset_h5 = LoadH5Dataset(name="/tmp/dataset.h5")

s1 = dataset_h5.h5f['1']['csm'][:] # we use [:] to copy the data from file into the variable s1

Exception occurred in traits notification handler for object: <acoupipe.loader.LoadH5Dataset object at 0x7f557c6e7c70>, trait: basename, old value: None, new value: dataset
Traceback (most recent call last):
  File "/home/kujawski/mambaforge/envs/py39/lib/python3.9/site-packages/traits/trait_notifiers.py", line 524, in _dispatch_change_event
    self.dispatch(handler, *args)
  File "/home/kujawski/mambaforge/envs/py39/lib/python3.9/site-packages/traits/trait_notifiers.py", line 486, in dispatch
    handler(*args)
  File "/home/kujawski/mambaforge/envs/py39/lib/python3.9/site-packages/acoupipe/loader.py", line 95, in load_data
    self.load_metadata()
  File "/home/kujawski/mambaforge/envs/py39/lib/python3.9/site-packages/acoupipe/loader.py", line 105, in load_metadata
    int_indices = list(map(int,indices))
ValueError: invalid literal for int() with base 10: '<HDF5 group "'


In [3]:
dataset_h5.h5f.keys()


<KeysViewHDF5 ['1', '2', '3', '4', '5', '<HDF5 group "', 'metadata']>
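The key listing above mixes numeric sample indices with entries that are not samples, such as 'metadata'. The following is a minimal sketch of selecting only the sample keys, assuming that sample groups are exactly those whose names are plain decimal integers. The helper name `sample_keys` is hypothetical and not part of AcouPipe.

```python
def sample_keys(keys):
    """Return the keys that look like integer sample indices, in numeric order.

    Hypothetical helper, not part of AcouPipe. It assumes that sample groups
    are named with plain decimal integers ('1', '2', ...).
    """
    # str.isdecimal() is False for names such as 'metadata' or '<HDF5 group "'
    numeric = [k for k in keys if k.isdecimal()]
    # sort by integer value so that '10' comes after '2'
    return sorted(numeric, key=int)


# keys copied from the listing above
keys = ['1', '2', '3', '4', '5', '<HDF5 group "', 'metadata']
print(sample_keys(keys))
```

With an open file, the same helper could be applied to `dataset_h5.h5f.keys()`. Sorting by integer value rather than by string avoids the lexicographic order in which '10' would precede '2'.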

## Building a TensorFlow/Keras Dataset 

With the loader defined above, a Python generator can be created and consumed by the TensorFlow Dataset API (`tf.data`). Here, each element of the dataset comprises the location, the squared sound pressure, and the CSM.

In [None]:

data_generator = dataset_h5.get_dataset_generator(
            features=['loc','p2','csm'], # the desired features to return from the file
            )

# provide the signature of the features
output_signature = {
            'loc' : tf.TensorSpec(shape=(3,None), dtype=tf.float32),
            'p2' : tf.TensorSpec(shape=(None,None), dtype=tf.float32),
            'csm':  tf.TensorSpec(shape=(None,64,64,None), dtype=tf.float32),
            }

dataset = tf.data.Dataset.from_generator(
            generator=data_generator,
            output_signature=output_signature
            )

data = next(iter(dataset))
print(data['loc'])
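
`tf.data.Dataset.from_generator` expects a callable that returns a fresh iterator each time it is called, yielding elements that match `output_signature`. The plain-Python sketch below illustrates that pattern with an in-memory stand-in for the file. The function `make_generator` and the toy `samples` mapping are hypothetical, the values are made up, and this is not a description of how `get_dataset_generator` is implemented in AcouPipe.

```python
def make_generator(samples, features):
    """Build a zero-argument callable that yields one feature dict per sample.

    Hypothetical stand-in for a file-backed generator: `samples` maps a sample
    key to a dict of feature arrays (plain nested lists here).
    """
    def generator():
        # iterate over the samples in numeric order of their keys
        for key in sorted(samples, key=int):
            sample = samples[key]
            # return only the requested features
            yield {name: sample[name] for name in features}
    return generator


# toy data with made-up values: two samples with two features each
samples = {
    "2": {"loc": [[0.1], [0.2], [0.5]], "p2": [[1.0]]},
    "1": {"loc": [[0.0], [0.3], [0.5]], "p2": [[2.0]]},
}

gen = make_generator(samples, features=["loc"])
first = next(iter(gen()))
print(first)
```

Because `gen` is a callable rather than a one-shot iterator, it can be restarted from the first sample, which is what lets a dataset built from such a callable be iterated more than once.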