Bug: hdf5 may skip filters for small chunks and store their raw binary #573

@victor-zou

Demo code that triggers the bug:

import h5py
import hdf5plugin
from kerchunk.hdf import SingleHdf5ToZarr
import numpy as np
import ujson

with h5py.File("temp.h5", mode="a") as f:
    # any small chunk will do
    f.require_dataset("a", (4,), np.int32, **hdf5plugin.Blosc()).write_direct(np.arange(4, dtype=np.int32))
with open("temp.h5", "rb") as f:
    print(ujson.dumps(SingleHdf5ToZarr(f, None).translate(), indent=4))

You will see that "a/0" holds the raw buffer of range(4), while "a/.zarray" still lists the compression settings in filters. If you then read the h5 file through zarr, you get Blosc's -1 error code (in my test, using zstd works fine).

In hdf.py around line 301, we can see that the library makes some attempt to detect and handle this problem, but the detection condition does not look sufficient:

            if isinstance(h5obj, h5py.Dataset):
                lggr.debug(f"HDF5 dataset: {h5obj.name}")
                lggr.debug(f"HDF5 compression: {h5obj.compression}")
                if h5obj.id.get_create_plist().get_layout() == h5py.h5d.COMPACT:
                    # Only do if h5obj.nbytes < self.inline??
                    kwargs["data"] = h5obj[:]
                    kwargs["filters"] = []
                else:
                    kwargs["filters"] = self._decode_filters(h5obj)

For a dataset configured with compression, h5obj.id.get_create_plist().get_layout() will probably still return 2 (CHUNKED) instead of 0 (COMPACT). So my suggestion is to test each chunk directly: if the stored chunk size is equal to (or even larger than) the raw binary size, simply save the raw binary data in the translated result and discard all filters.
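
A minimal sketch of the per-chunk test suggested above. The helper name chunk_looks_unfiltered and its parameters are hypothetical and not part of kerchunk. It compares the stored size of a chunk with the size of the uncompressed chunk, and optionally takes the HDF5 per-chunk filter mask, on the assumption that a nonzero mask means at least one filter was skipped for that chunk.

```python
from math import prod


def chunk_looks_unfiltered(stored_nbytes, chunk_shape, itemsize, filter_mask=0):
    """Guess whether a chunk was written without its filters applied.

    Hypothetical helper for illustration only, not kerchunk API.
    stored_nbytes: bytes the chunk occupies in the file
    chunk_shape:   shape of one chunk, e.g. (4,)
    itemsize:      bytes per element, e.g. 4 for int32
    filter_mask:   per-chunk filter mask (assumed: nonzero = a filter was skipped)
    """
    raw_nbytes = prod(chunk_shape) * itemsize
    if filter_mask != 0:
        # HDF5 recorded that at least one filter was not applied to this chunk.
        return True
    # Size heuristic from the suggestion: stored size equal to (or larger than)
    # the raw size indicates that the data was probably stored as-is.
    return stored_nbytes >= raw_nbytes


# The dataset from the demo: a single chunk of four int32 values (16 raw bytes).
looks_raw = chunk_looks_unfiltered(stored_nbytes=16, chunk_shape=(4,), itemsize=4)
print(looks_raw)
```

With h5py, the stored size and filter mask of a chunk should be available from h5obj.id.get_chunk_info(i) as its size and filter_mask fields. The size comparison alone is only a heuristic, since a filter that really ran can also produce output at least as large as its input, so the filter mask is probably the more reliable signal where it is available.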
