Bug: hdf5 may skip filters for small chunks and store their raw binary #573

Demo code that triggers the bug:
```Python
import h5py
import hdf5plugin
from kerchunk.hdf import SingleHdf5ToZarr
import numpy as np
import ujson

with h5py.File("temp.h5", mode="a") as f:
    # any small chunk will do
    f.require_dataset("a", (4,), np.int32, **hdf5plugin.Blosc()).write_direct(np.arange(4, dtype=np.int32))
with open("temp.h5", "rb") as f:
    print(ujson.dumps(SingleHdf5ToZarr(f, None).translate(), indent=4))
```
You will see that "a\/0" holds the raw buffer of range(4), while "a\/.zarray" still lists the compression settings under filters. If you then read the h5 file through zarr, you get Blosc's -1 error code (in my test, using zstd works fine).

Around line 301 of hdf.py, we can see that the library already makes some attempt to detect and handle this problem, but the detection condition does not look sufficient:
```Python
            if isinstance(h5obj, h5py.Dataset):
                lggr.debug(f"HDF5 dataset: {h5obj.name}")
                lggr.debug(f"HDF5 compression: {h5obj.compression}")
                if h5obj.id.get_create_plist().get_layout() == h5py.h5d.COMPACT:
                    # Only do if h5obj.nbytes < self.inline??
                    kwargs["data"] = h5obj[:]
                    kwargs["filters"] = []
                else:
                    kwargs["filters"] = self._decode_filters(h5obj)
```
For a dataset configured with compression, `h5obj.id.get_create_plist().get_layout()` will most likely still return 2 (CHUNKED) instead of 0 (COMPACT), so the branch above is not taken. My suggestion is to test each chunk directly: if the stored chunk size is equal to (or even larger than) the raw binary size, simply store the raw binary data in the translated result and discard all filters.
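A rough sketch of the per-chunk test suggested above, not a patch for kerchunk. It assumes h5py's low-level `get_num_chunks()` and `get_chunk_info()` on the dataset id (these need a recent HDF5 build) and a fixed-size dtype. The helper names `chunk_looks_unfiltered` and `unfiltered_chunk_offsets` are made up for illustration. Besides the size comparison, it also looks at the chunk's `filter_mask`, which HDF5 sets to a non-zero value when at least one filter was skipped for that chunk.

```python
from math import prod


def chunk_looks_unfiltered(stored_size, raw_size, filter_mask=0):
    """Heuristic: True if a stored chunk probably bypassed the filter pipeline."""
    # A non-zero filter_mask means HDF5 skipped at least one filter for this
    # chunk; the size comparison is the check proposed in this issue.
    return filter_mask != 0 or stored_size >= raw_size


def unfiltered_chunk_offsets(dset):
    """Return the chunk offsets of an h5py.Dataset whose chunks look unfiltered."""
    if dset.chunks is None:  # contiguous or compact layout: no chunks to inspect
        return []
    # HDF5 stores every chunk at the full chunk shape, including edge chunks.
    raw_size = prod(dset.chunks) * dset.dtype.itemsize
    dsid = dset.id
    offsets = []
    for i in range(dsid.get_num_chunks()):
        # StoreInfo fields: chunk_offset, filter_mask, byte_offset, size
        info = dsid.get_chunk_info(i)
        if chunk_looks_unfiltered(info.size, raw_size, info.filter_mask):
            offsets.append(info.chunk_offset)
    return offsets
```

If the diagnosis is right, calling `unfiltered_chunk_offsets(f["a"])` on the demo file opened with h5py should flag its only chunk, though I have not confirmed that. A non-zero mask only says that some filter was skipped, so a multi-filter pipeline would need the individual bits checked before dropping every filter.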