
Support for new compression schemes (BLOSC)


Lehman Garrison of the Flatiron Institute suggested that support for BLOSC be added to ASDF. He has implemented this for his own files but wonders whether it could become a standard feature.

First, a brief summary of what BLOSC is: it is described as a meta-compressor because it supports many different compression algorithms. Its primary goal is to work around the memory-bandwidth bottleneck by breaking large data sets into smaller blocks that fit into processor caches, where they can be compressed and decompressed in fast memory. As a result, it often moves data in and out of the processor faster than handling raw data would, since only the compressed data travels to and from slower main memory.

Furthermore, BLOSC supports a feature called "filters": data transformations that make compression more effective (examples include taking differences between adjacent elements, or gathering the bits or bytes of the data words together in blocks).
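To make the filter idea concrete, here is a minimal sketch using the python-blosc bindings (the `blosc` package); the choice of zstd and the test data are arbitrary:

```python
import blosc
import numpy as np

# Smoothly varying float64 values: neighboring elements share their
# high-order bytes, so grouping bytes by significance (the shuffle
# filter) exposes long runs that the codec compresses well.
data = np.linspace(0.0, 1.0, 1_000_000).tobytes()

plain = blosc.compress(data, typesize=8, cname='zstd', shuffle=blosc.NOSHUFFLE)
shuffled = blosc.compress(data, typesize=8, cname='zstd', shuffle=blosc.SHUFFLE)

print(len(plain), len(shuffled))   # the shuffled stream is usually much smaller
assert blosc.decompress(shuffled) == data
```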

Currently supported compression algorithms include FastLZ, LZ4, LZ4HC, Snappy, Zlib, and Zstd.

BLOSC itself is not language specific. The current implementation is in C, though Python bindings are available. The main person behind the project is Francesc Alted, who has long been involved in the scientific Python effort (he was an early user of numarray!).

BLOSC is also supported through the Python library numcodecs (which is used by the zarr project, a high-level Python implementation of chunked arrays).
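For reference, a round trip through the numcodecs Blosc codec looks roughly like this; the parameter choices are arbitrary:

```python
import numpy as np
from numcodecs import Blosc

# A Blosc codec with a zstd backend, compression level 5, and the
# byte-shuffle filter (often a good default for numeric arrays).
codec = Blosc(cname='zstd', clevel=5, shuffle=Blosc.SHUFFLE)

data = np.arange(1_000_000, dtype='float64')
compressed = codec.encode(data)    # compressed bytes
restored = np.frombuffer(codec.decode(compressed), dtype='float64')
assert (restored == data).all()

# The codec describes itself as a JSON-serializable dict, which is
# relevant to the storage question discussed below.
print(codec.get_config())          # {'id': 'blosc', 'cname': 'zstd', ...}
```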

The main issues for support were:

  1. Is this useful to support? Answer: it does look very useful.

  2. Can it be added to ASDF without making it explicitly dependent on Python? Answer: very likely, but this must be done with some care. Some of the numcodecs options rely specifically on Python, so they must be excluded (or at the very least made optional). Numcodecs does not store information about the compression scheme or its parameters in the binary it generates, so some mechanism must be found to save this information. Would we put it in the YAML or in the binary block? We concluded that the binary block would be best. Numcodecs represents this information in JSON, and one possibility is to put a small JSON header in the ASDF binary data block (a sketch follows this list).

  3. The BLOSC library is oriented towards optimizing the block size for the processor it runs on. This somewhat complicates interchange, since the optimum block size depends on the processor. Ideally there is a sweet spot where one block size works well for a wide range of processors (though that size may evolve over time), and the library allows the block size to be specified explicitly.
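As a minimal sketch of the JSON-header idea from item 2, one could length-prefix a numcodecs config dict in front of the compressed payload. The 4-byte little-endian prefix and the header layout here are illustrative assumptions, not anything the ASDF standard defines:

```python
import json
import struct

import numpy as np
from numcodecs import Blosc, get_codec

def pack_block(data, codec):
    """Prepend a length-prefixed JSON codec header to the compressed payload.
    The 4-byte prefix and header placement are illustrative only."""
    header = json.dumps(codec.get_config()).encode('utf-8')
    payload = codec.encode(data)
    return struct.pack('<I', len(header)) + header + payload

def unpack_block(buf):
    """Recover the codec parameters from the JSON header and decompress."""
    (hlen,) = struct.unpack_from('<I', buf, 0)
    config = json.loads(buf[4:4 + hlen].decode('utf-8'))
    codec = get_codec(config)   # numcodecs rebuilds the codec from its config
    return codec.decode(buf[4 + hlen:])

data = np.arange(10_000, dtype='int32')
block = pack_block(data, Blosc(cname='lz4', clevel=5, shuffle=Blosc.SHUFFLE))
out = np.frombuffer(unpack_block(block), dtype='int32')
assert (out == data).all()
```

A reader of such a block needs no knowledge of Python; it only has to parse the small JSON header and hand the parameters to a C-level BLOSC decompressor.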

In summary, we do believe we can support this in a language-agnostic way while also leveraging the existing Python libraries, perhaps including zarr for chunking support. All of this, of course, makes supporting a C/C++ version of ASDF more work.