The Hierarchical Data Format (HDF5) is a binary, self-describing format that supports regular strided and random access. There are three main options for working with HDF5 in Python:
- h5py - an unopinionated reflection of the HDF5 library
- pytables - an opinionated version, adding extra features and conventions
- pandas.HDFStore - a commonly used format among Pandas users.
All of these libraries create and read HDF5 files. Unfortunately, some of them have special conventions that only their own library understands, so an HDF5 file created by one of these libraries may not be well understood by the others.
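As a minimal sketch of what "self-describing" and "strided and random access" mean in practice, the following uses plain h5py. The file name `example.hdf5` and the data path `data/x` are hypothetical, chosen only for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")

# Write a small 2-D array; h5py creates the intermediate group "data".
with h5py.File(path, "w") as f:
    f.create_dataset("data/x", data=np.arange(20).reshape(4, 5))

# Reopen read-only. Slicing reads only the requested region from disk.
with h5py.File(path, "r") as f:
    dset = f["data/x"]
    shape = dset.shape        # shape and dtype are stored in the file itself
    strided = dset[::2, 1:4]  # regular strided access: every other row
    element = dset[3, 4]      # random access to a single element
```

The slices come back as ordinary NumPy arrays, so downstream code does not need to know the data came from disk.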
If given an explicit object (not a string URI), such as a pandas.HDFStore, the odo project can intelligently decide what to do. If given a string, odo defaults to using the vanilla h5py solution, the least opinionated of the three.
You can request a particular format by including a protocol in the URI.
Each library has limitations.
- h5py does not like datetimes.
- PyTables does not like variable-length strings.
- Pandas does not like non-tabular data (like ndarrays) and, if users don't select the format='table' keyword argument, creates HDF5 files that are not well understood by other libraries.
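The datetime limitation is often sidestepped by storing timestamps as integers. The sketch below shows that round trip with NumPy alone. It is an illustrative workaround under the assumption that nanosecond precision is wanted, not something this documentation prescribes.

```python
import numpy as np

# Example timestamps (made up for this illustration).
stamps = np.array(
    ["2015-01-01T00:00:00", "2015-01-02T12:30:00"],
    dtype="datetime64[ns]",
)

# Reinterpret as 64-bit integers (nanoseconds since the Unix epoch).
# An int64 array is a dtype that plain HDF5 datasets can hold.
as_int = stamps.view("int64")

# After reading the integers back, reinterpret them as datetimes again.
restored = as_int.view("datetime64[ns]")
```

The unit (`ns` here) is not recorded by the integers themselves, so it would need to be kept alongside the data, for example in a dataset attribute.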
Our support for PyTables is admittedly weak. We would love contributions here.
A URI to an HDF5 dataset includes a filename and a datapath within that file. Optionally, it can also include a protocol.
Examples of HDF5 uris:
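As a rough illustration of how such a URI decomposes, the sketch below assumes the layout `[protocol://]filename::datapath`. The `::` separator, the `hdfstore` protocol name, the file name and the data path are all assumptions made for this example, and `parse_hdf5_uri` is a hypothetical helper, not part of odo.

```python
from typing import NamedTuple, Optional


class HDF5URI(NamedTuple):
    protocol: Optional[str]  # library-selecting prefix, or None if absent
    filename: str
    datapath: str


def parse_hdf5_uri(uri: str) -> HDF5URI:
    """Split a URI into (protocol, filename, datapath).

    Assumes the layout [protocol://]filename::datapath.
    """
    protocol = None
    if "://" in uri:
        protocol, uri = uri.split("://", 1)
    filename, sep, datapath = uri.partition("::")
    if not sep or not filename or not datapath:
        raise ValueError(f"expected 'filename::datapath', got {uri!r}")
    return HDF5URI(protocol, filename, datapath)


# Hypothetical URIs, with and without a protocol.
plain = parse_hdf5_uri("myfile.hdf5::/data/path")
prefixed = parse_hdf5_uri("hdfstore://myfile.hdf5::/data/path")
```

A URI without a protocol yields `protocol=None`, which matches the idea of falling back to a default library.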
The default paths in and out of HDF5 files include sequences of Pandas DataFrames and sequences of NumPy ndarrays:

    h5py.Dataset <-> chunks(np.ndarray)
    tables.Table <-> chunks(pd.DataFrame)
    pandas.AppendableFrameTable <-> chunks(pd.DataFrame)
    pandas.FrameFixed <-> DataFrame
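Here `chunks(...)` denotes a sequence of blocks rather than one in-memory object. The sketch below shows the general idea with a hypothetical `iter_chunks` helper; it is not odo's implementation. It accepts anything that supports `len()` and slicing, which includes a NumPy array (used here) and an `h5py.Dataset`.

```python
import numpy as np


def iter_chunks(dataset, chunksize):
    """Yield consecutive blocks of `dataset` as NumPy arrays."""
    for start in range(0, len(dataset), chunksize):
        # Only this slice is materialized at any one time.
        yield np.asarray(dataset[start:start + chunksize])


# Stand-in for an on-disk dataset.
data = np.arange(10)

blocks = list(iter_chunks(data, chunksize=4))

# Concatenating the blocks recovers the original array.
rebuilt = np.concatenate(blocks)
```

Processing block by block keeps memory use bounded by the chunk size, which matters when the dataset is larger than RAM.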