Support for parallel processing #40

Closed
kanryu opened this issue Jan 9, 2019 · 7 comments

Comments

@kanryu

kanryu commented Jan 9, 2019

Do you plan to parallelize compression and decompression?

Currently I am interested in decompression.

Decompress_template switches between UNCOMPRESSED blocks and Huffman-coded blocks, and each of these could run on a separate thread. Each UNCOMPRESSED block could be divided across further threads if certain conditions are satisfied.

In addition, C++11 adds multithreading support, so you do not need to worry about the pthread problem.

@ebiggers
Owner

ebiggers commented Jan 9, 2019

DEFLATE (and zlib and gzip) streams aren't suitable for parallel decompression.

However, if you aren't locked into a data format that uses a single stream, you can easily parallelize at the application layer by dividing the data into chunks before compression, then compressing and/or decompressing the chunks in parallel. libdeflate already works fine for this; just make sure to allocate a separate libdeflate_compressor or libdeflate_decompressor for each concurrent thread.
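The chunking approach described above can be sketched as follows. This is a minimal illustration that uses Python's standard `zlib` module as a stand-in for libdeflate; the chunk size, the worker count, and all function names are assumptions made for the example. Each task creates its own compressor or decompressor object, mirroring the advice to allocate one per concurrent thread.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, chosen only for illustration


def split_chunks(data, size=CHUNK_SIZE):
    """Cut the input into independent fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def compress_chunk(chunk):
    # A fresh compressor per task, so no state is shared between threads.
    comp = zlib.compressobj(6)
    return comp.compress(chunk) + comp.flush()


def decompress_chunk(blob):
    # Likewise, a fresh decompressor per task.
    decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


def compress_parallel(data, workers=4):
    """Return one independent zlib stream per chunk, in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_chunk, split_chunks(data)))


def decompress_parallel(blobs, workers=4):
    """Decompress the chunks concurrently and join them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(decompress_chunk, blobs))
```

The result is a list of separate streams rather than one DEFLATE stream, so the application has to store the chunk boundaries itself.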

@Piezoid

Piezoid commented May 20, 2019

First, thank you for libdeflate.

We used your code as a base for experimenting with parallel decompression and found a way to achieve just that: https://github.com/Piezoid/pugz/

It's not yet production ready and we removed lots of features (compression, multiarch: only Linux/x86 with SSE3.1 is currently supported). This is a rather contrived implementation; I think it should be kept as a specialized library. Notably, only ASCII files are currently supported.

The asynchronous API is not yet stabilized. Any input on usage patterns would be appreciated.

@kanryu
Author

kanryu commented May 21, 2019

@Piezoid It is an interesting product. In the case of gzip, my understanding is that it performs parallel processing by exploiting the fact that one gz file contains multiple zlib chunks. Is that actually how it works?

What I was asking about is whether decompression can be accelerated by running Huffman decoding and LZ decoding in parallel within a single zlib chunk, but parallelizing gzip that way is worthwhile in itself.

@Piezoid

Piezoid commented May 21, 2019

What you describe, if I'm not mistaken, is similar to the bgzip file format. It uses the fact that a gzip file can contain multiple concatenated gzip "parts". This breaks the LZ77 dependency between two successive segments, which allows random access and parallelization. It is quite ubiquitous for compressing bioinformatics text file formats. It is retro-compatible with gzip tools but requires recompression.

Pugz aims at decompressing vanilla gzip files, with a single header/part/footer. In a gzip stream, there are multiple deflate blocks, but they only reset the Huffman tables. The LZ77 sliding window is not reset, and dependencies (what we call back-references) are carried from one block to the next.

Pugz solves this problem by doing a first pass that records the origins of back-references into the initially unknown sliding window. Then, after thread synchronization, the back-references are "translated" back to the correct characters using the end of the decompressed chunk coming from another thread.
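The concatenated-parts idea can be tried with Python's standard `gzip` module. This is only a sketch of the general principle, not of the actual bgzip format; the function name and member size are made up for the example, and keeping the members in a list stands in for whatever index a real format uses to locate them.

```python
import gzip


def make_members(data, member_size):
    """Compress each slice of the input as a complete, independent gzip member."""
    return [
        gzip.compress(data[i:i + member_size])
        for i in range(0, len(data), member_size)
    ]


data = bytes(range(256)) * 100          # 25600 bytes of sample input
members = make_members(data, 10000)     # hypothetical member size
blob = b"".join(members)                # a plain concatenation of gzip members

# An ordinary gzip reader sees the concatenation as one file.
whole = gzip.decompress(blob)

# Any single member can also be decoded without the ones before it.
middle = gzip.decompress(members[1])
```

Because no member refers back into the previous one, each can be handed to a different thread, at the cost of recompressing existing files into this layout.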
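The carried-over window can be observed with Python's standard `zlib` module. The sketch below is an illustration only and is not taken from pugz or libdeflate. It writes the same text twice into one raw DEFLATE stream and then tries to decode the second half by itself. With `Z_SYNC_FLUSH` the second half starts on a byte boundary but may still point back into the first half; with `Z_FULL_FLUSH` the compressor also forgets its history at the flush point.

```python
import zlib


def second_part_decodes_alone(flush_mode):
    """Return True if the second half of the stream decodes without the first."""
    text = " ".join(str(i) for i in range(400)).encode()

    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 selects raw DEFLATE
    part1 = comp.compress(text) + comp.flush(flush_mode)
    # Same text again, so the compressor is tempted to reference part1.
    part2 = comp.compress(text) + comp.flush(zlib.Z_FINISH)

    try:
        # Start a decoder at part2 with an empty sliding window.
        out = zlib.decompressobj(-15).decompress(part2)
    except zlib.error:
        # A back-reference pointed before the start of part2.
        return False
    return out == text


sync_ok = second_part_decodes_alone(zlib.Z_SYNC_FLUSH)
full_ok = second_part_decodes_alone(zlib.Z_FULL_FLUSH)
```

In this experiment only the full flush yields a second half that stands alone, which matches the point above that a block boundary by itself does not reset the LZ77 window.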
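A toy version of this two-pass idea is sketched below. It is not pugz's code and does not parse real DEFLATE data: the token format (an integer for a literal byte, a `(distance, length)` tuple for a back-reference), the placeholder representation, and both function names are assumptions made up for illustration.

```python
def decode_with_unknown_window(tokens, window_size):
    """First pass: decode a chunk whose preceding window is not known yet.

    Output items are either int byte values or ("w", i) placeholders that
    mean "byte i of the unknown preceding window".
    """
    out = []
    for tok in tokens:
        if isinstance(tok, int):
            out.append(tok)  # literal byte
            continue
        dist, length = tok
        for _ in range(length):
            src = len(out) - dist
            if src >= 0:
                # The source lies inside this chunk (it may itself be a placeholder).
                out.append(out[src])
            else:
                # The source lies in the unknown window: record where it came from.
                out.append(("w", window_size + src))
    return out


def resolve(symbols, prev_tail):
    """Second pass: substitute placeholders once the previous chunk's tail is known."""
    return bytes(s if isinstance(s, int) else prev_tail[s[1]] for s in symbols)


# The chunk copies 4 bytes from the window, adds "!", then copies 2 bytes again.
tokens = [(4, 4), ord("!"), (5, 2)]
symbols = decode_with_unknown_window(tokens, window_size=4)
result = resolve(symbols, prev_tail=b"abcd")
```

The first pass can run on every chunk at once, since it never needs the real window contents; only the cheap substitution step has to wait for the neighbouring chunk.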

@kanryu
Author

kanryu commented May 22, 2019

@Piezoid Is that applicable to a deflate block, such as a PNG image?

@Piezoid

Piezoid commented May 27, 2019

Yes, but we don't support binary data atm. It could be done in theory, but at a higher overhead (memory bandwidth).
Unless you have a few very large PNGs, I'm not sure this would bring performance gains.
You are welcome to open an issue on the pugz repository if you want to discuss the matter further.

@ebiggers
Owner

Closing since support for parallel processing is currently out of scope for libdeflate itself.

3 participants