
Add support for H5Z_FILTER_SHUFFLE for GZIP/ZLIB compressed data #11

Closed · 1 of 2 tasks

pbranson opened this issue Nov 25, 2020 · 21 comments

@pbranson commented Nov 25, 2020

I have started testing out this approach on some of the Australian Integrated Marine Observing System datasets that are stored in NetCDF format on S3 (e.g. s3://imos-data/IMOS/SRS/OC/gridded/aqua/P1D/2014/08/A.P1D.20140801T053000Z.aust.K_490.nc).

This dataset utilises gzip compression with the H5Z_FILTER_SHUFFLE filter applied. I have looked over the upstream code dependencies and would appreciate some advice on the pathway forward.

Currently the numcodecs gzip and zlib codecs do not support a shuffle option. The code for the shuffle is pretty straightforward, but it will likely be very slow in Python if not compiled: https://github.com/HDFGroup/hsds/blob/03890edfa735cc77da3bc06f6cf5de5bd40d1e23/hsds/util/storUtil.py#L43
Note that numcodecs uses Cython for compiled code rather than numba.
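
For reference, the transform itself fits in a few lines of pure Python (a sketch of what the HSDS code linked above does; the function name and signature here are mine):

```python
# Byte shuffle: for itemsize-byte elements, group the i-th byte of every
# element together so the compressor sees longer, more uniform runs.
def byte_shuffle(data: bytes, itemsize: int) -> bytes:
    n = len(data) // itemsize  # number of elements
    return bytes(
        data[i * itemsize + j] for j in range(itemsize) for i in range(n)
    )

# Three 4-byte elements: all first bytes, then all second bytes, and so on.
assert byte_shuffle(bytes(range(12)), 4) == bytes(
    [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
)
```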

I am keen to help get this sorted; I have suggested one possible way forward in this issue's task list.

I'd appreciate advice from some of the more experienced devs here (@martindurant, @ajelenak, and @rabernat): do you think this is a reasonable way forward?

Thanks for everyone's effort on this!

@martindurant (Member)

That sounds like a totally reasonable plan. I don't think that numcodecs requires you to use Cython, but it may be convenient to keep to the same coding style. Whether this should be an option on the existing codec or a separate one is an open question.

As discussed in #9, SZIP from libaec would be worth adding, as well as perhaps the fletcher32 checksum (although I don't know how the checksums are stored).

PS: I had no idea that HDF (or HSDS) used numba internally. Probably numcodecs is not interested in taking numba as a dependency, but the question could be asked (I am personally on the Zarr team, but have not contributed to numcodecs).

@rabernat (Contributor)

Sounds like a great plan, Paul! I'll be happy to help out on the numcodecs side.

@ajelenak (Collaborator)

The Blosc library does have the C code for byte and bit shuffling, so that could be used as a starting point. I thought about doing this when working on the HDF5-Zarr translator but did not have enough time on that project to actually do it.

Another thing: the Zarr spec supports filters, so @pbranson's first option is probably the best approach. The shuffle would be declared as a filter and be available to any compression method.
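
For illustration, something like this (a hedged sketch; Shuffle is the hypothetical codec discussed in this thread, not an existing numcodecs class at the time of writing, and one possible implementation is sketched in a later comment):

```python
# A shuffle declared as a Zarr filter composes with any compressor.
import zarr
from numcodecs import GZip

z = zarr.open(
    "example.zarr",
    mode="w",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f4",
    filters=[Shuffle(elementsize=4)],  # hypothetical shuffle codec
    compressor=GZip(level=6),
)
```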

@pbranson (Author)

Thanks for the advice and the suggestions about the filters and reusing the blosc shuffle - there is plenty of hardware optimisation there that is worth leveraging.

I have dug through the blosc code and found where the byte shuffle is implemented:
blosc_internal_shuffle(typesize, blocksize, src, tmp);
https://github.com/Blosc/c-blosc/blob/9fae1c9acb659159321aca69aefcdbce663e2374/blosc/shuffle.c#L374

The shuffle.h header isn't exposed in numcodecs yet. I haven't done any Cython development work, but it seems that the headers can be exposed via a cdef extern block like this: https://github.com/zarr-developers/numcodecs/blob/master/numcodecs/blosc.pyx#L21

And the Zarr filters are just calls to numcodecs codecs, so I could make a Shuffle(Codec) that uses the blosc shuffle in a similar way to how blosc.pyx does it; then we don't need to change the gzip etc. codecs.
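
Roughly what I have in mind; a numpy-based sketch with the Blosc call swapped for a transpose so it runs anywhere (the class and parameter names are placeholders, not the final API):

```python
import numpy as np
from numcodecs.abc import Codec


class Shuffle(Codec):
    """Byte shuffle as a standalone codec, usable ahead of any compressor."""

    codec_id = "shuffle"

    def __init__(self, elementsize=4):
        self.elementsize = elementsize

    def encode(self, buf):
        # View the buffer as (n_elements, elementsize) and transpose, so
        # the i-th byte of every element becomes contiguous.
        arr = np.frombuffer(buf, dtype=np.uint8)
        return arr.reshape(-1, self.elementsize).T.tobytes()

    def decode(self, buf, out=None):
        # Invert the transpose. numcodecs codecs conventionally honour an
        # optional pre-allocated `out` buffer; omitted here for brevity.
        arr = np.frombuffer(buf, dtype=np.uint8)
        return arr.reshape(self.elementsize, -1).T.tobytes()
```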

Have updated the checklist with a revised plan and will see about raising the issue at numcodecs.

@martindurant (Member)

I think the suggestion may have been that you can already use the blosc compressor, as wrapped by numcodecs, to do gzip/shuffle without any extra code: numcodecs.blosc.Blosc("gzip", compression_level, shuffle, blocksize), where shuffle is one of:

  • 0: NOSHUFFLE
  • 1: SHUFFLE
  • 2: BITSHUFFLE
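
For example (hedged; note blosc's name for its DEFLATE-compatible codec is "zlib", and whether the output matches an HDF5 gzip+shuffle chunk byte-for-byte is another matter):

```python
from numcodecs import Blosc

# The shuffle flags are exposed as class constants: Blosc.NOSHUFFLE (0),
# Blosc.SHUFFLE (1), Blosc.BITSHUFFLE (2).
codec = Blosc(cname="zlib", clevel=5, shuffle=Blosc.SHUFFLE, blocksize=0)
roundtripped = codec.decode(codec.encode(b"\x00\x01\x02\x03" * 256))
```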

@pbranson (Author) commented Nov 27, 2020 via email

@martindurant (Member)

> Will test the Blosc gzip approach in this library.

OK, let us know! Note that, if you are running any tests with Dask and multiple threads, you may want to set the number of blosc threads to 1, e.g., blosc.set_nthreads(1).

Looking through the blosc documentation now, I don't see a way to turn OFF the blosc-specific header in the compression/decompression functions currently available to Python.

@ajelenak (Collaborator)

Decompressing HDF5 dataset chunks using Blosc gzip and shuffle did not work for me exactly because of that 16-byte Blosc header. That's why I suggested shuffle as a Zarr filter.

@martindurant (Member)

Annoying, right?
We could propose that blosc expose deeper functionality to Python...

@martindurant (Member)

@pbranson, any luck with this?

@ajelenak: it does occur to me that it is probably very easy to create the blosc header for each chunk to be decompressed. That could be a super-simple numcodecs filter, or a custom part of a zarr store class. Even fsspec could do this, but it's a less obvious fit there.
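
Something like this, perhaps; all the field values are my reading of the c-blosc v1 chunk format and should be checked against the Blosc source:

```python
import struct

def make_blosc_header(nbytes, blocksize, cbytes, typesize, shuffle=True):
    # c-blosc v1 chunk header, 16 bytes: version, versionlz, flags,
    # typesize, then three little-endian uint32s (nbytes, blocksize,
    # cbytes). Flag bit 0 marks byte shuffle (assumption to verify).
    flags = 0x01 if shuffle else 0x00
    return struct.pack(
        "<BBBBIII",
        2,            # format version (assumption)
        1,            # compressor-format version (assumption)
        flags,
        typesize,     # element size in bytes
        nbytes,       # uncompressed chunk size
        blocksize,    # internal block size
        cbytes + 16,  # compressed size, including this header
    )
```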

@ajelenak (Collaborator) commented Dec 8, 2020

I am not familiar with the Blosc header, but it should be possible to do something like that. I think a proper shuffle filter is still the better long-term approach.

@pbranson (Author) commented Dec 9, 2020

After understanding that blosc adds a header, I stopped pursuing that approach.

I have made a start on setting up the numcodecs Cython and classes, but I am getting swamped with a pre-Christmas work crunch.

I need to test whether the full blosc context/init is required to use the shuffle functions; I am not sure that it is, which would be convenient.

I can see how adding a blosc header filter could also be an option, but would that still require a memory copy operation to make the data contiguous prior to sending it to blosc-c? If so, it is possibly faster to do the shuffle directly.

Hopefully I'll get back to it tomorrow.

@martindurant (Member)

> would that still require a memory copy operation

Yes, I suppose so, unless we somehow read the bytes data into a buffer with an initial offset of two bytes, and then fill in the header after reading. Of course, the point is that the compressed data should be much smaller than the final data buffer, so maybe a copy isn't so bad.
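
Taking @ajelenak's 16-byte figure, the idea might look like this (a sketch only; make_blosc_header is the hypothetical helper from a few comments up):

```python
import io

# Stand-ins for the real storage read: a raw chunk of known compressed size.
chunk_cbytes = 1024
f = io.BytesIO(bytes(chunk_cbytes))

# Read the raw chunk into the tail of one buffer, then fill the synthetic
# header in at the front, avoiding a second copy before handing to blosc.
buf = bytearray(16 + chunk_cbytes)
f.readinto(memoryview(buf)[16:])
# buf[:16] = make_blosc_header(...)  # hypothetical helper sketched earlier
```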

@pbranson (Author)

OK, I have provided an update on my progress here: zarr-developers/numcodecs#260

Having trouble with incompatible pointer types (I think).

> it does occur to me that it probably is very easy to create the blosc header for each chunk to be decompressed

Possibly that is an easier way to go. Is it the format documented here: https://blosc-doc.readthedocs.io/en/latest/frame_format.html?highlight=chunk%20format#description-for-different-fields-in-header ?

@martindurant (Member)

Blosc 2 is not in use anywhere, but its documentation has completely taken over from Blosc v1's! I'm pretty sure it's two bytes, giving the compression type, shuffle, and size. I swear I saw the spec recently.

@pbranson (Author) commented Dec 10, 2020 via email

@pbranson (Author)

OK, I have finally got back to this: I have implemented the shuffle with numba and raised a PR to get some feedback from the numcodecs devs.
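
For reference, a minimal sketch of what the numba-jitted shuffle looks like (the code in the PR may differ in detail):

```python
import numba
import numpy as np

@numba.njit
def shuffle(src, itemsize):
    # src is a uint8 array; byte j of element i moves to position j*n + i.
    n = src.size // itemsize
    dst = np.empty_like(src)
    for i in range(n):
        for j in range(itemsize):
            dst[j * n + i] = src[i * itemsize + j]
    return dst

@numba.njit
def unshuffle(src, itemsize):
    # Exact inverse of shuffle.
    n = src.size // itemsize
    dst = np.empty_like(src)
    for i in range(n):
        for j in range(itemsize):
            dst[i * itemsize + j] = src[j * n + i]
    return dst
```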

I have updated my fork of fsspec_reference_maker to add support for shuffle and tested it out in this gist:
https://gist.github.com/pbranson/6edbe1e3c4214e6b826f4e82e5df55fb
Not terribly scientific, but a 2x speed-up over using OPeNDAP from my laptop on the uni network; there appeared to be some caching going on when I used %%timeit.

Also, it seems that these lines in the main branch have been commented out by mistake?
https://github.com/intake/fsspec-reference-maker/blob/main/fsspec_reference_maker/hdf.py#L101
This prevents writing the Zarr-related dictionary entries (.zgroup etc.) to the JSON.

@martindurant (Member)

> Also it seems that these lines in the main branch have been commented out by mistake?

To be checked, but I don't think we need those; everything is in the chunk store.

@martindurant (Member)

Does numcodecs support numba? Or are you suggesting this as a filter step somewhere else in the pipeline?

@pbranson (Author)

> Does numcodecs support numba?

Not at present, but I have asked about it in zarr-developers/numcodecs#273

It seems, according to zarr-developers/numcodecs#274, that they are moving towards a more pure-Python approach (for blosc at least), so maybe they would consider adding numba? Should I ping someone specifically in that PR?

> Or are you suggesting this as a filter step somewhere else in the pipeline?

Probably not; it seems logical for the shuffle to live in numcodecs, but maybe it should be Cython rather than numba.

@martindurant (Member)

I suppose the usual maintainers will notice, but I doubt that numba as a dependency is a great fit. We'll see. Cython would fit better for cases where existing Python bindings to some compiled code don't exist.
