Importing module 
=======

In [None]:
import hdf5Lib

Initialising class
=======

Let's first specify the paths to the example files. For this example we have two folders:
* /examples/single_file/file.hdf5
* /examples/split_files/subfile_??.hdf5

The former is a single file that contains all of the data, which in this example is simply a numpy array holding the integers from 0 to 499, i.e. np.arange(500). The latter folder contains 50 files, each holding one fiftieth of the data (10 integers), i.e.:
* /examples/split_files/subfile_00.hdf5 --> np.arange(0,10)
* /examples/split_files/subfile_01.hdf5 --> np.arange(10,20)
* ...
* /examples/split_files/subfile_49.hdf5 --> np.arange(490,500)
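
The layout above can be reproduced with plain Python, independently of hdf5Lib. This is only an illustrative sketch: the names N_TOTAL, N_FILES, CHUNK and subfile_range are made up for this example and are not part of the library.

```python
# Illustrative only: reproduce the layout of the example data with plain Python.
N_TOTAL = 500                # integers 0..499 stored in the single file
N_FILES = 50                 # number of subfiles
CHUNK = N_TOTAL // N_FILES   # 10 values per subfile


def subfile_range(i):
    """Return the (start, stop) bounds of the values held in subfile i."""
    return i * CHUNK, (i + 1) * CHUNK


bounds = [subfile_range(i) for i in range(N_FILES)]
print(bounds[0], bounds[1], bounds[-1])  # (0, 10) (10, 20) (490, 500)

# Concatenating all chunks in order recovers the contents of the single file.
joined = [v for start, stop in bounds for v in range(start, stop)]
assert joined == list(range(N_TOTAL))
```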

In [None]:
# Path to single file example
single_file_path    = '../examples/single_file/file.hdf5'

# Alternative ways of specifying paths to the split files 
split_file_path_v1 = '../examples/split_files/subfile_%.2d.hdf5'
split_file_path_v2 = ['../examples/split_files/subfile_%.2d.hdf5'%i for i in range(50)]

When we initialise the class, we should specify whether the files have been split (is_split). If we are using string formatting (i.e. split_file_path_v1), we must also specify the number of files the data has been split into, which in this case is 50.

In [None]:
single_file   = hdf5Lib.Read(single_file_path)
split_file_v1 = hdf5Lib.Read(split_file_path_v1, number_files=50)
split_file_v2 = hdf5Lib.Read(split_file_path_v2)

Note that either way of defining the path to the split files works:
* Formatted string: one needs to specify the number of subfiles
* List of strings: each entry should be the path to an individual file

In [None]:
assert split_file_v1._file_list == split_file_v2._file_list
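
To see why both ways of specifying the paths give the same file list, here is a minimal sketch of how the two styles could be normalised. The helper expand_paths is hypothetical and written only for this illustration; it is not necessarily how hdf5Lib builds _file_list internally.

```python
def expand_paths(path, number_files=None):
    """Hypothetical helper: normalise either path style to a list of file paths."""
    if isinstance(path, str):
        if number_files is None:
            # A plain string with no file count is treated as a single file.
            return [path]
        # A format string is expanded once per subfile index.
        return [path % i for i in range(number_files)]
    # Anything else is assumed to already be a sequence of paths.
    return list(path)


template = '../examples/split_files/subfile_%.2d.hdf5'
from_template = expand_paths(template, number_files=50)
from_list = expand_paths([template % i for i in range(50)])

assert from_template == from_list
print(from_template[0])
```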

Exploring hdf5 file
=======
We can now look at the contents of the hdf5 files. Let's check which groups the single hdf5 file contains:

In [None]:
single_file.print_entries()

Similarly, we can check whether 'group_a' contains any groups and/or datasets:

In [None]:
single_file.print_entries('group_a')

What about attributes? Let's see those in 'group_a/dataset_1'

In [None]:
single_file.print_attributes('group_a/dataset_1')

So what are their values?

In [None]:
print('hubbleParameter:', single_file.get_attribute('group_a/dataset_1', 'hubbleParameter'))
print('pi:', single_file.get_attribute('group_a/dataset_1', 'pi'))

The split files work in exactly the same way, except that only one of the subfiles (by default the first) is read when retrieving this information.

In [None]:
split_file_v1.print_entries()

In [None]:
split_file_v1.print_entries('group_a')

In [None]:
split_file_v1.print_attributes('group_a/dataset_1')

In [None]:
print('hubbleParameter:', split_file_v1.get_attribute('group_a/dataset_1', 'hubbleParameter'))
print('pi:', split_file_v1.get_attribute('group_a/dataset_1', 'pi'))

Loading data
=======
Data is loaded by indexing the class instance with the path to a dataset, which calls its __getitem__ method. Data can be loaded sequentially or in parallel, as determined by the parallel flag at class initialisation. By default, parallel loading is enabled, except when there is only a single hdf5 file (parallel loading is not implemented for that case).

In [None]:
# Check if opened in parallel
print('Single file opened in parallel mode?', single_file._parallel)
print('Split files opened in parallel mode?', split_file_v1._parallel)

In [None]:
# Load data from single file
data_single_file = single_file['group_a/dataset_1']

In [None]:
# Load data from split file
data_split_file = split_file_v1['group_a/dataset_1']

Note that the class joins all the data that was split over multiple files into a single array, as if it had been loaded from a single file.

In [None]:
assert (data_single_file == data_split_file).all()
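
To illustrate the joining step without hdf5Lib, here is a minimal sketch that loads chunks either sequentially or in parallel and concatenates them in file order. Everything here is an assumption made for illustration: load_chunk is a stand-in that fabricates the values a subfile would hold rather than reading from disk, and the use of a thread pool is just one possible approach, not necessarily the one the library takes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def load_chunk(i):
    """Stand-in for reading one subfile: returns the values subfile i would hold."""
    return list(range(i * 10, (i + 1) * 10))


def load_all(number_files, parallel=True):
    """Load every chunk and join them into one flat list, in file order."""
    if parallel:
        with ThreadPoolExecutor() as pool:
            # Executor.map returns results in input order, so the chunks
            # stay sorted by subfile index even if they finish out of order.
            chunks = list(pool.map(load_chunk, range(number_files)))
    else:
        chunks = [load_chunk(i) for i in range(number_files)]
    return list(chain.from_iterable(chunks))


# Both modes give the same result as reading one unsplit file.
assert load_all(50, parallel=True) == load_all(50, parallel=False) == list(range(500))
```

Preserving the file order matters here: if chunks were appended in completion order instead, the joined array could differ from the single-file version.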

Finally, before the load methods are called, the code checks whether the requested dataset has already been loaded (i.e. whether it is in self._data). This avoids spending time loading the same data over and over again.

In [None]:
split_file_v1._data
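
The caching behaviour described above can be sketched as follows. This is a simplified stand-in, not the library's code: the class CachedReader, its loader argument and the n_loads counter are hypothetical, and storing the cache as a dictionary keyed by dataset path is an assumption.

```python
class CachedReader:
    """Hypothetical stand-in for a reader that caches datasets by name."""

    def __init__(self, loader):
        self._loader = loader  # function that actually fetches a dataset
        self._data = {}        # cache: dataset path -> loaded data (assumed to be a dict)
        self.n_loads = 0       # counts how often the loader was really called

    def __getitem__(self, name):
        # Only call the loader the first time a dataset is requested.
        if name not in self._data:
            self._data[name] = self._loader(name)
            self.n_loads += 1
        return self._data[name]


reader = CachedReader(lambda name: list(range(500)))
first = reader['group_a/dataset_1']
second = reader['group_a/dataset_1']

assert first is second        # the cached object is returned, not a fresh copy
assert reader.n_loads == 1    # the loader ran only once
```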