Description
Demo code that triggers the bug:
import h5py
import hdf5plugin
from kerchunk.hdf import SingleHdf5ToZarr
import numpy as np
import ujson
with h5py.File("temp.h5", mode="a") as f:
    # writing any small chunk is enough to reproduce
    f.require_dataset("a", (4,), np.int32, **hdf5plugin.Blosc()).write_direct(np.arange(4, dtype=np.int32))
with open("temp.h5", "rb") as f:
    print(ujson.dumps(SingleHdf5ToZarr(f, None).translate(), indent=4))

You will see that "a/0" stores the raw buffer of range(4), while "a/.zarray" still lists the Blosc compression settings under filters. If you then read the translated result through zarr, you get Blosc's -1 error code (in my test, switching to zstd works fine).
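The mismatch can be inspected directly with h5py's low-level API. Below is a minimal sketch (my own illustration, not kerchunk code) that assumes h5py >= 3.0 for `DatasetID.get_chunk_info` and uses the built-in gzip filter instead of Blosc so that hdf5plugin is not required; the file name `inspect_demo.h5` is arbitrary:

```python
import h5py
import numpy as np

# Write a tiny compressed dataset (gzip ships with h5py, so no
# hdf5plugin dependency is needed for this illustration).
with h5py.File("inspect_demo.h5", "w") as f:
    f.create_dataset("a", data=np.arange(4, dtype=np.int32),
                     chunks=(4,), compression="gzip")

with h5py.File("inspect_demo.h5", "r") as f:
    dset = f["a"]
    layout = dset.id.get_create_plist().get_layout()
    info = dset.id.get_chunk_info(0)  # per-chunk storage metadata
    raw_nbytes = int(np.prod(dset.chunks)) * dset.dtype.itemsize
    print("layout == CHUNKED:", layout == h5py.h5d.CHUNKED)
    print("stored bytes:", info.size, "raw bytes:", raw_nbytes)
    print("filter_mask:", info.filter_mask)  # nonzero bits = filters skipped
```

The key point is that the stored size and the filter mask are per-chunk properties, while the layout and the filter pipeline queried by kerchunk are per-dataset.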
In hdf.py around line 301, the library already makes an attempt to detect and handle this problem, but the detection condition is not sufficient:
if isinstance(h5obj, h5py.Dataset):
    lggr.debug(f"HDF5 dataset: {h5obj.name}")
    lggr.debug(f"HDF5 compression: {h5obj.compression}")
    if h5obj.id.get_create_plist().get_layout() == h5py.h5d.COMPACT:
        # Only do if h5obj.nbytes < self.inline??
        kwargs["data"] = h5obj[:]
        kwargs["filters"] = []
    else:
        kwargs["filters"] = self._decode_filters(h5obj)

For a dataset configured with compression, h5obj.id.get_create_plist().get_layout() will most likely still return 2 (CHUNKED) rather than 0 (COMPACT), so this branch never fires. My suggestion is to test each chunk directly: if the stored chunk size equals (or even exceeds) the raw binary size, simply save the raw binary data in the translated result and discard all filters for that chunk.
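The per-chunk test suggested above could look roughly like the following. This is a hypothetical helper, not kerchunk's actual code; `stored_nbytes` would come from h5py's per-chunk `StoreInfo.size` and `raw_nbytes` from chunk shape times itemsize. It also checks the chunk's filter mask, since (as I understand the HDF5 filter pipeline) an optional filter that fails to shrink a chunk is skipped and recorded there:

```python
def chunk_stored_raw(stored_nbytes: int, raw_nbytes: int,
                     filter_mask: int = 0) -> bool:
    """Heuristic: decide whether a chunk bypassed its filter pipeline.

    HDF5 filters may be optional; when an optional filter fails (e.g.
    Blosc cannot shrink a tiny chunk), the library writes the raw bytes
    and flips that filter's bit in the chunk's filter_mask.
    """
    if filter_mask != 0:
        # HDF5 itself recorded that one or more filters were skipped.
        return True
    # Fallback size check: a usefully filtered chunk should be strictly
    # smaller than its uncompressed size; equal or larger suggests the
    # raw bytes were stored.
    return stored_nbytes >= raw_nbytes
```

A translator could call this per chunk and, when it returns True, inline the raw bytes into the translated result instead of referencing the codec. Note that zarr v2 declares filters per array in .zarray, so a per-chunk exception has to be expressed by inlining the data rather than by varying the filters.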