PARQUET-521: Add Brotli compression codec. #344

Closed

@rdblue rdblue commented May 11, 2016

This adds a Brotli codec that shares code with the Snappy codec. Snappy
is "non-blocking", meaning that it always accepts more data and buffers
it without blocking. The first read is blocking, while the compression
is done off-heap. This strategy doesn't appear to have a noticeable
performance impact, but does get better compression for Brotli than
streaming buffers.

The non-blocking part of Parquet's Snappy codec has been refactored so
it can be used for both Brotli and Snappy, which both use direct byte
buffers and compress outside the JVM.

This currently depends on a snapshot release of parquet-format with
PARQUET-609.
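
The buffer-then-compress strategy described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `BufferingCompressorSketch` and its methods are hypothetical, it keeps data on the heap, and it uses `java.util.zip.Deflater` as a stand-in for the native Brotli or Snappy call, which would operate on direct byte buffers instead.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: accept all input without blocking, compress on first read.
final class BufferingCompressorSketch {
  private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
  private byte[] compressed; // null until the first read triggers compression
  private int readPos = 0;

  /** Always accepts input; the bytes are only copied into the buffer. */
  void setInput(byte[] data, int offset, int length) {
    if (compressed != null) {
      throw new IllegalStateException("input was already compressed");
    }
    pending.write(data, offset, length);
  }

  /** More input is welcome until the first read has happened. */
  boolean needsInput() {
    return compressed == null;
  }

  /** The first call compresses everything buffered so far; later calls drain the result. */
  int compress(byte[] out, int offset, int length) {
    if (compressed == null) {
      compressed = deflateAll(pending.toByteArray());
    }
    int n = Math.min(length, compressed.length - readPos);
    System.arraycopy(compressed, readPos, out, offset, n);
    readPos += n;
    return n;
  }

  boolean finished() {
    return compressed != null && readPos == compressed.length;
  }

  // Stand-in for the native codec: compress the whole buffer in one pass.
  private static byte[] deflateAll(byte[] input) {
    Deflater deflater = new Deflater();
    deflater.setInput(input);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    while (!deflater.finished()) {
      int n = deflater.deflate(chunk);
      out.write(chunk, 0, n);
    }
    deflater.end();
    return out.toByteArray();
  }

  // Helper for checking a round trip; the caller supplies the original length.
  static byte[] inflateAll(byte[] data, int originalLength) {
    Inflater inflater = new Inflater();
    inflater.setInput(data);
    byte[] result = new byte[originalLength];
    try {
      int n = 0;
      while (n < originalLength && !inflater.finished()) {
        int r = inflater.inflate(result, n, originalLength - n);
        if (r == 0 && inflater.needsInput()) {
          break; // truncated input; stop rather than spin
        }
        n += r;
      }
    } catch (DataFormatException e) {
      throw new IllegalStateException(e);
    } finally {
      inflater.end();
    }
    return result;
  }
}
```

Because the compressor sees the whole buffered input at once, it can find matches across the entire page, which is one plausible reason for the better Brotli ratio compared with streaming in small chunks.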

protected int getMaxCompressedLength(int numInputBytes) {
  // this is based on https://github.com/google/brotli/issues/274
  // that page is not very clear, so this is much more conservative
  return numInputBytes * 2;

seems like one of the linked issues pointed to this in the Brotli src: https://github.com/google/brotli/search?utf8=%E2%9C%93&q=max_out_size

const size_t max_out_size = 2 * bytes + 500;

Contributor Author

There's also a Brotli method that gives the max compressed length that we should get into the JNI bindings. This value has been working fine for all of my tests, but I'll update the value on the next pass.
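
As an illustration of the bound discussed in this thread, here is a minimal sketch that applies the `2 * bytes + 500` formula quoted above. The class and method names are hypothetical and this is not the code that was merged. It computes in `long` because `numInputBytes * 2` silently overflows a Java `int` once the input exceeds roughly 1 GiB.

```java
// Hypothetical helper applying the 2 * bytes + 500 bound quoted in the review.
final class BrotliBoundSketch {
  private BrotliBoundSketch() {}

  static int maxCompressedLength(int numInputBytes) {
    if (numInputBytes < 0) {
      throw new IllegalArgumentException("negative input length: " + numInputBytes);
    }
    // Compute in long so the doubling cannot wrap around to a negative int.
    long bound = 2L * numInputBytes + 500L;
    if (bound > Integer.MAX_VALUE) {
      throw new IllegalArgumentException("input too large for an int-sized buffer: " + numInputBytes);
    }
    return (int) bound;
  }
}
```

If the JNI bindings later expose the native max-size function mentioned above, that value would be the better source for the buffer size; this sketch only covers the interim formula.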

@rdblue rdblue force-pushed the PARQUET-521-add-brotli-compression branch from bea6774 to 28c52d1 Compare May 11, 2016 18:58
@rdblue rdblue force-pushed the PARQUET-521-add-brotli-compression branch from 28c52d1 to 5938fb5 Compare May 11, 2016 20:14

rdblue commented Oct 18, 2017

This is replaced by #430. Brotli is provided by https://github.com/rdblue/brotli-codec.

@rdblue rdblue closed this Oct 18, 2017