Efficient HDF5 chunk iteration via HDF5 1.14, h5py 3.8, and H5Dchunk_iter #286

Closed · mkitti opened this issue Jan 25, 2023 · 17 comments · Fixed by #331

@mkitti (Contributor) commented Jan 25, 2023

Currently, kerchunk iterates over HDF5 chunks by looping over a linear index passed to get_chunk_info. This uses H5Dget_chunk_info from the HDF5 C API.

kerchunk/kerchunk/hdf.py, lines 523–524 at ff16c05:

```python
for index in range(num_chunks):
    blob = dsid.get_chunk_info(index)
```

Support for H5Dchunk_iter was recently released as part of HDF5 1.14.0:
https://docs.hdfgroup.org/hdf5/v1_14/group___h5_d.html#gac482c2386aa3aea4c44730a627a7adb8
This method iterates through all the chunks contained within a dataset, visiting each chunk once.

H5Dchunk_iter was incorporated into h5py 3.8.0 as h5py.h5d.DatasetID.chunk_iter() when used with HDF5 1.14:
h5py/h5py#2202
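
For orientation, here is a minimal sketch of the new call (assuming h5py >= 3.8 built against libhdf5 >= 1.14; the file and dataset names are placeholders):

```python
import h5py

with h5py.File("example.h5", "r") as f:
    dsid = f["some_dataset"].id  # low-level h5py.h5d.DatasetID

    def visit(blob):
        # blob is an h5py.h5d.StoreInfo, the same record that
        # get_chunk_info(index) returns: logical chunk offset, byte
        # offset in the file, stored size, and filter mask.
        print(blob.chunk_offset, blob.byte_offset, blob.size, blob.filter_mask)

    # Visits every written chunk exactly once in a single traversal.
    dsid.chunk_iter(visit)
```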

Support for H5Dchunk_iter is expected in HDF5 1.12.3 and 1.10.9.

@ajelenak implemented a test that checks the equivalence of that call with the current iteration method used here:
https://github.com/h5py/h5py/blob/d2e84badfa5e4d8095bcc5d3db81f8548c340919/h5py/tests/test_dataset.py#L1800-L1826
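
One way to express that equivalence is the following sketch (not the h5py test itself; `dsid` is assumed to be a low-level h5py.h5d.DatasetID for a chunked dataset):

```python
def check_equivalence(dsid):
    """Compare the chunk records produced by the two APIs field by field."""
    def as_tuple(blob):
        # Both APIs yield h5py.h5d.StoreInfo records.
        return (blob.chunk_offset, blob.byte_offset, blob.size, blob.filter_mask)

    via_info = [as_tuple(dsid.get_chunk_info(i))
                for i in range(dsid.get_num_chunks())]

    via_iter = []
    dsid.chunk_iter(lambda blob: via_iter.append(as_tuple(blob)))

    assert sorted(via_info) == sorted(via_iter)
```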

Kerchunk should take advantage of H5Dchunk_iter so that chunk iteration scales linearly, rather than quadratically, with the number of chunks.

@martindurant (Member)

@mkitti, thanks for the info. Do you know how to implement this on our end? I assume we should wait a while until the new version of HDF5 becomes standard (and maybe maintain the old behaviour for a while anyway).

@mkitti (Contributor, Author) commented Jan 31, 2023

You could switch implementations depending on h5py.version.hdf5_version_tuple.
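
A hypothetical sketch of that switch (the helper name and shape are illustrative, not kerchunk's actual code):

```python
import h5py

def iter_chunks(dsid, visit):
    """Call `visit` on the StoreInfo record of every written chunk."""
    if h5py.version.hdf5_version_tuple >= (1, 14, 0):
        # Single traversal of the chunk index: O(n) in the number of chunks.
        dsid.chunk_iter(visit)
    else:
        # One index lookup per chunk; each lookup re-searches the index,
        # so the whole loop is roughly O(n^2).
        for index in range(dsid.get_num_chunks()):
            visit(dsid.get_chunk_info(index))
```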

@mkitti (Contributor, Author) commented Feb 1, 2023

PyTables just merged an implementation based on H5Dchunk_iter:

PyTables/PyTables#997

@martindurant (Member)

We should probably wait until at least HDF5 1.14 is on conda-forge (https://anaconda.org/conda-forge/hdf5/files). @ajelenak, will you have any appetite to implement faster iteration?
@mkitti, do you have evidence that the iteration in the current kerchunk.hdf module is particularly slow?

@mkitti (Contributor, Author) commented Feb 2, 2023

@ajelenak added H5Dchunk_iter to h5py in h5py/h5py#2202

HDF5 1.14 is due to hit conda-forge on February 5th according to this pull request:
conda-forge/hdf5-feedstock#188

My experience is mainly low-level, using the C API directly. I did some concrete benchmarks for the Julia interface to HDF5:
JuliaIO/HDF5.jl#1031 (comment)

Essentially, when we are processing 16,384 chunks, retrieving chunk information via dsid.get_chunk_info (H5Dget_chunk_info) takes on the order of 10 seconds, while retrieving it via H5Dchunk_iter takes less than 0.1 seconds. H5Dget_chunk_info may be faster for very few chunks, but there we're talking tens of microseconds in either case.

@mkitti (Contributor, Author) commented Feb 2, 2023

Here's a table summary of my Julia benchmarks:

| Number of chunks | H5Dchunk_iter (s) | H5Dget_chunk_info (s) | H5Dget_chunk_info / H5Dchunk_iter |
| ---: | ---: | ---: | ---: |
| 4 | 0.000040029 | 0.000027406 | 0.7 |
| 16 | 0.000053692 | 0.000077566 | 1.4 |
| 64 | 0.000194085 | 0.000456561 | 2.4 |
| 256 | 0.000717275 | 0.004541661 | 6 |
| 1024 | 0.003004693 | 0.048780859 | 16 |
| 4096 | 0.011674653 | 0.662931619 | 57 |
| 16384 | 0.064214971 | 13.353451558 | 208 |

I would be happy to attempt a pull request if we determine a path to proceed here.

@martindurant (Member)

Since we do extra work for each chunk, and Python generally has higher overheads, I bet the difference is nowhere near as dramatic, but your point is taken!

@mkitti (Contributor, Author) commented Feb 3, 2023

It's the 13 seconds that bothers me the most here. I think that far exceeds any overhead you might see from Python.

Basically, H5Dchunk_iter scales much better: it traverses the chunk index once, whereas each get_chunk_info call searches the index again, so the old approach grows roughly quadratically. This gets noticeable when we are talking about ~10K chunks.

I have an interest in seeing HDF5, Zarr, and others scaling into that number of chunks.

@mkitti (Contributor, Author) commented Feb 6, 2023

HDF5 1.14 is now available via conda-forge

@martindurant (Member)

Thanks for letting us know

@ajelenak (Collaborator) commented Feb 6, 2023

@mkitti Can you share your test file somehow?

@mkitti (Contributor, Author) commented Feb 6, 2023

@ajelenak I created it via Julia:
JuliaIO/HDF5.jl#1031 (comment)

@ajelenak (Collaborator) commented Feb 8, 2023

Here are benchmark results with an HDF5 file created according to the code in JuliaIO/HDF5.jl#1031 (comment). chunk_info is the old method and chunk_iter is the new one.

  • Python 3.11.0
  • h5py-3.8.0
  • libhdf5-1.14.0
  • h5 file is loaded in memory
  • Each chunk location method is run 10 times and the best (quickest) time is used (see the sketch below)
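
A sketch of a harness along those lines (not the actual benchmark script; `dsid` stands for the dataset's low-level h5py.h5d.DatasetID):

```python
import timeit

def run_chunk_info(dsid):
    # Old method: one index lookup per chunk.
    for index in range(dsid.get_num_chunks()):
        dsid.get_chunk_info(index)

def run_chunk_iter(dsid):
    # New method: single traversal with a no-op callback.
    dsid.chunk_iter(lambda blob: None)

def best_of_10(func, dsid):
    # Best (quickest) of 10 single runs, matching the methodology above.
    return min(timeit.repeat(lambda: func(dsid), number=1, repeat=10))
```

The results: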
| Number of chunks | chunk_info (s) | chunk_iter (s) | chunk_info / chunk_iter |
| ---: | ---: | ---: | ---: |
| 4 | 1.31e-05 | 4.861e-06 | 2.69 |
| 16 | 4.5497e-05 | 1.2756e-05 | 3.57 |
| 64 | 0.000231134 | 4.3405e-05 | 5.33 |
| 256 | 0.002054627 | 0.000166648 | 12.33 |
| 1024 | 0.026394162 | 0.000703572 | 37.51 |
| 4096 | 0.378013247 | 0.003634743 | 104.00 |
| 16384 | 9.026964752 | 0.013217899 | 682.93 |

@martindurant (Member)

@ajelenak, are you likely to have the time to implement this?

@ajelenak (Collaborator) commented Feb 8, 2023

Sure, I will add it to my to-do list. Should the new method kick in when a suitable libhdf5 version is detected, or should it be a user option? I prefer the former.

@martindurant (Member)

I think it's fine to use it when available without needing a new option.

@mkitti (Contributor, Author) commented May 3, 2023

I gave this a shot in #331. With many chunks, the results are quite remarkable.

Time for SingleHdf5ToZarr.translate():

| Number of chunks | Before #331 (get_chunk_info) | After #331 (chunk_iter) | Speedup |
| ---: | ---: | ---: | ---: |
| 16,384 | 13 s | 0.131 s | 99x |
| 32,768 | 74 s | 0.214 s | 346x |
| 65,536 | 393 s | 0.472 s | 832x |
