
Add support for H5Z_FILTER_SHUFFLE for GZIP/ZLIB compressed data #11

Closed · 1 of 2 tasks

pbranson opened this issue Nov 25, 2020 · 21 comments

@pbranson commented Nov 25, 2020

I have started testing out this approach on some of the Australian Integrated Marine Observing System datasets that are stored in NetCDF format on S3 (e.g. s3://imos-data/IMOS/SRS/OC/gridded/aqua/P1D/2014/08/A.P1D.20140801T053000Z.aust.K_490.nc).

This dataset utilises gzip compression with the H5Z_FILTER_SHUFFLE filter applied. I have looked over the upstream code dependencies and would appreciate some advice on the pathway forward.

Currently the numcodecs gzip and zlib codecs do not support a shuffle option. The code for the shuffle is pretty straightforward, but it will likely be very slow in Python if not compiled: https://github.com/HDFGroup/hsds/blob/03890edfa735cc77da3bc06f6cf5de5bd40d1e23/hsds/util/storUtil.py#L43
Note that numcodecs uses Cython for compiled code rather than numba.
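
For reference, the transform itself fits in a few lines of pure Python (a sketch of what the HSDS code linked above does; the function name and signature here are mine):

```python
# Byte shuffle: for itemsize-byte elements, group the i-th byte of every
# element together so the compressor sees longer, more uniform runs.
def byte_shuffle(data: bytes, itemsize: int) -> bytes:
    n = len(data) // itemsize  # number of elements
    return bytes(
        data[i * itemsize + j] for j in range(itemsize) for i in range(n)
    )

# Three 4-byte elements: all first bytes, then all second bytes, and so on.
assert byte_shuffle(bytes(range(12)), 4) == bytes(
    [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
)
```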

I am keen to help get this sorted; I have suggested one possible way forward in this issue's task list.

I'd appreciate advice from some of the more experienced devs here (@martindurant, @ajelenak, and @rabernat): do you think this is a reasonable way forward?

Thanks for everyone's effort on this!

@martindurant (Member)

That sounds like a totally reasonable plan. I don't think that numcodecs requires you to use Cython, but it may be convenient to keep to the same coding style. Whether this should be an option on the existing codec or a separate one is an open question.

As discussed in #9, SZIP from libaec would be worth adding, as well as perhaps the fletcher32 checksum (although I don't know how the checksums are stored).

PS: I had no idea that HDF (or HSDS) used numba internally. Probably numcodecs is not interested in taking numba as a dependency, but the question could be asked (I am personally on the Zarr team, but have not contributed to numcodecs).

@rabernat (Contributor)

Sounds like a great plan, Paul! I'll be happy to help out on the numcodecs side.

@ajelenak (Collaborator)

The Blosc library does have the C code for byte and bit shuffling, so that could be used as a starting point. I thought about doing this when working on the HDF5-Zarr translator but did not have enough time on that project to actually do it.

Another thing: the Zarr spec supports filters, so @pbranson's first option is probably the best approach. The shuffle would be declared as a filter and be available to any compression method.
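
For illustration, something like this (a hedged sketch; Shuffle is the hypothetical codec discussed in this thread, not an existing numcodecs class at the time of writing, and one possible implementation is sketched in a later comment):

```python
# A shuffle declared as a Zarr filter composes with any compressor.
import zarr
from numcodecs import GZip

z = zarr.open(
    "example.zarr",
    mode="w",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f4",
    filters=[Shuffle(elementsize=4)],  # hypothetical shuffle codec
    compressor=GZip(level=6),
)
```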

@pbranson (Author)

Thanks for the advice and the suggestions about the filters and reusing the blosc shuffle - there is plenty of hardware optimisation there that is worth leveraging.

I have dug through the blosc code and found where the byte shuffle is implemented:
blosc_internal_shuffle(typesize, blocksize, src, tmp);
https://github.com/Blosc/c-blosc/blob/9fae1c9acb659159321aca69aefcdbce663e2374/blosc/shuffle.c#L374

The shuffle.h header isn't exposed in numcodecs yet. I haven't done any Cython development work, but it seems that the headers can be exposed via a cdef extern block like this: https://github.com/zarr-developers/numcodecs/blob/master/numcodecs/blosc.pyx#L21

And the Zarr filters are just calls to numcodecs codecs, so I could make a Shuffle(Codec) that uses the blosc shuffle in a similar way to how blosc.pyx does it; then we don't need to change the gzip etc. codecs.
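
Roughly what I have in mind; a numpy-based sketch with the Blosc call swapped for a transpose so it runs anywhere (the class and parameter names are placeholders, not the final API):

```python
import numpy as np
from numcodecs.abc import Codec


class Shuffle(Codec):
    """Byte shuffle as a standalone codec, usable ahead of any compressor."""

    codec_id = "shuffle"

    def __init__(self, elementsize=4):
        self.elementsize = elementsize

    def encode(self, buf):
        # View the buffer as (n_elements, elementsize) and transpose, so
        # the i-th byte of every element becomes contiguous.
        arr = np.frombuffer(buf, dtype=np.uint8)
        return arr.reshape(-1, self.elementsize).T.tobytes()

    def decode(self, buf, out=None):
        # Invert the transpose. numcodecs codecs conventionally honour an
        # optional pre-allocated `out` buffer; omitted here for brevity.
        arr = np.frombuffer(buf, dtype=np.uint8)
        return arr.reshape(self.elementsize, -1).T.tobytes()
```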

Have updated the checklist with a revised plan and will see about raising the issue at numcodecs.

@martindurant (Member)

I think the suggestion may have been that you can already use the blosc compressor, as wrapped by numcodecs, to do gzip/shuffle without any extra code: numcodecs.blosc.Blosc("gzip", compression_level, shuffle, blocksize), where shuffle is one of:

  • 0: NOSHUFFLE
  • 1: SHUFFLE
  • 2: BITSHUFFLE
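
For example (hedged; note blosc's name for its DEFLATE-compatible codec is "zlib", and whether the output matches an HDF5 gzip+shuffle chunk byte-for-byte is another matter):

```python
from numcodecs import Blosc

# The shuffle flags are exposed as class constants: Blosc.NOSHUFFLE (0),
# Blosc.SHUFFLE (1), Blosc.BITSHUFFLE (2).
codec = Blosc(cname="zlib", clevel=5, shuffle=Blosc.SHUFFLE, blocksize=0)
roundtripped = codec.decode(codec.encode(b"\x00\x01\x02\x03" * 256))
```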

@pbranson (Author) commented Nov 27, 2020 via email

@martindurant (Member)

> Will test the Blosc gzip approach in this library.

OK, let us know! Note that, if you are running any tests with Dask and multiple threads, you may want to set the number of blosc threads to 1, e.g., blosc.set_nthreads(1).

Looking through the blosc documentation now, I don't see a way to turn OFF the blosc-specific header in the compression/decompression functions currently available to Python.

@ajelenak (Collaborator)

Decompressing HDF5 dataset chunks using Blosc gzip and shuffle did not work for me exactly because of that 16-byte Blosc header. That's why I suggested shuffle as a Zarr filter.

@martindurant (Member)

Annoying, right?
We could propose that blosc expose deeper functionality to Python...

@martindurant (Member)

@pbranson, any luck with this?

@ajelenak: it does occur to me that it is probably very easy to create the blosc header for each chunk to be decompressed. That could be a super-simple numcodecs filter, or a custom part of a zarr store class. Even fsspec could do this, but it's a less obvious fit there.
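
Something like this, perhaps; all the field values are my reading of the c-blosc v1 chunk format and should be checked against the Blosc source:

```python
import struct

def make_blosc_header(nbytes, blocksize, cbytes, typesize, shuffle=True):
    # c-blosc v1 chunk header, 16 bytes: version, versionlz, flags,
    # typesize, then three little-endian uint32s (nbytes, blocksize,
    # cbytes). Flag bit 0 marks byte shuffle (assumption to verify).
    flags = 0x01 if shuffle else 0x00
    return struct.pack(
        "<BBBBIII",
        2,            # format version (assumption)
        1,            # compressor-format version (assumption)
        flags,
        typesize,     # element size in bytes
        nbytes,       # uncompressed chunk size
        blocksize,    # internal block size
        cbytes + 16,  # compressed size, including this header
    )
```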

@ajelenak (Collaborator) commented Dec 8, 2020

I am not familiar with the Blosc header, but it should be possible to do something like that. I think a proper shuffle filter is still the better long-term approach.

@pbranson (Author) commented Dec 9, 2020

After understanding that blosc adds a header, I stopped pursuing that approach.

I have made a start on setting up the numcodecs Cython and classes, but I am getting swamped with a pre-Christmas work crunch.

I need to test whether the full blosc context/init is required to use the shuffle functions; I am not sure that it is, which would be convenient.

I can see how adding a blosc header filter could also be an option, but would that still require a memory copy operation to make the data contiguous prior to sending it to blosc-c? If so, it is possibly faster to do the shuffle directly.

Hopefully I'll get back to it tomorrow.

@martindurant (Member)

> would that still require a memory copy operation

Yes, I suppose so, unless we somehow read the bytes data into a buffer with an initial offset of two bytes, and then fill in the header after reading. Of course, the point is that the compressed data should be much smaller than the final data buffer, so maybe a copy isn't so bad.
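
Taking @ajelenak's 16-byte figure, the idea might look like this (a sketch only; make_blosc_header is the hypothetical helper from a few comments up):

```python
import io

# Stand-ins for the real storage read: a raw chunk of known compressed size.
chunk_cbytes = 1024
f = io.BytesIO(bytes(chunk_cbytes))

# Read the raw chunk into the tail of one buffer, then fill the synthetic
# header in at the front, avoiding a second copy before handing to blosc.
buf = bytearray(16 + chunk_cbytes)
f.readinto(memoryview(buf)[16:])
# buf[:16] = make_blosc_header(...)  # hypothetical helper sketched earlier
```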

@pbranson (Author)

OK, I have provided an update on my progress here: zarr-developers/numcodecs#260

Having trouble with incompatible pointer types (I think).

> it does occur to me that it probably is very easy to create the blosc header for each chunk to be decompressed

Possibly that is an easier way to go. Is it the format documented here: https://blosc-doc.readthedocs.io/en/latest/frame_format.html?highlight=chunk%20format#description-for-different-fields-in-header ?

@martindurant (Member)

Blosc 2 is not in use anywhere, but its documentation has completely taken over from Blosc v1's! I'm pretty sure it's two bytes, giving the compression type, shuffle, and size. I swear I saw the spec recently.

@pbranson (Author) commented Dec 10, 2020 via email

@pbranson (Author)

OK, I have finally got back to this: I have implemented the shuffle with numba and raised a PR to get some feedback from the numcodecs devs.
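
For reference, a minimal sketch of what the numba-jitted shuffle looks like (the code in the PR may differ in detail):

```python
import numba
import numpy as np

@numba.njit
def shuffle(src, itemsize):
    # src is a uint8 array; byte j of element i moves to position j*n + i.
    n = src.size // itemsize
    dst = np.empty_like(src)
    for i in range(n):
        for j in range(itemsize):
            dst[j * n + i] = src[i * itemsize + j]
    return dst

@numba.njit
def unshuffle(src, itemsize):
    # Exact inverse of shuffle.
    n = src.size // itemsize
    dst = np.empty_like(src)
    for i in range(n):
        for j in range(itemsize):
            dst[i * itemsize + j] = src[j * n + i]
    return dst
```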

I have updated my fork of fsspec_reference_maker to add support for shuffle and tested it out in this gist:
https://gist.github.com/pbranson/6edbe1e3c4214e6b826f4e82e5df55fb
Not terribly scientific, but a 2x speed-up over using OPeNDAP from my laptop on the uni network; there appeared to be some caching going on when I used %%timeit.

Also, it seems that these lines in the main branch have been commented out by mistake?
https://github.com/intake/fsspec-reference-maker/blob/main/fsspec_reference_maker/hdf.py#L101
This prevents writing the Zarr-related dictionary entries (.zgroup etc.) to the JSON.

@martindurant (Member)

> Also it seems that these lines in the main branch have been commented out by mistake?

To be checked, but I don't think we need those; everything is in the chunk store.

@martindurant (Member)

Does numcodecs support numba? Or are you suggesting this as a filter step somewhere else in the pipeline?

@pbranson (Author)

> Does numcodecs support numba?

Not at present, but I have asked about it in zarr-developers/numcodecs#273

It seems, according to zarr-developers/numcodecs#274, that they are moving towards a more pure-Python approach (for blosc at least), so maybe they would consider adding numba? Should I ping someone specifically in that PR?

> Or are you suggesting this as a filter step somewhere else in the pipeline?

Probably not; it seems logical for the shuffle to live in numcodecs, but maybe it should be Cython rather than numba.

@martindurant (Member)

I suppose the usual maintainers will notice, but I doubt that numba as a dependency is a great fit. We'll see. Cython would fit better for cases where existing Python bindings to some compiled code don't exist.
