-
-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use LZ4 block compression for compatibility with parquet-cpp (#314) #315
Use LZ4 block compression for compatibility with parquet-cpp (#314) #315
Conversation
This removes LZ4 frame compression in favour of LZ4 block compression to ensure the LZ4 compression used is compatible with the paqrquet-cpp implementation. Because the lz4.block decompressor needs to know the size of the uncompressed data, this change also requires: - propagation of page_header.uncompressed_page_size through to decompress_data - the decompression functions for all compressors to handle the uncompressed_size argument Where a decompressor doesn't need to know the uncompressed data size, we simply ignore it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of minor questions
fastparquet/compression.py
Outdated
try: | ||
if kwargs['store_size'] == True: | ||
raise RuntimeError( | ||
'store_size cannot be True for LZ4 the compressor' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify the wording here, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can. I have to admit, I was in half a mind to just silently override that option if the user specified it, rather than raise an exception. Similarly for the write_content_size
parameter for zstd.
In both cases, this option, when True
means "store the uncompressed size with the compressed data". We force it to be False
, because the size is stored outside the compressed block in the metadata, so it would be redundant to store it in the compressed frame header.
In essence though, we meet the intent of specifying these to be True by storing the size with in the metadata. So, I'd be comfortable for just silently overriding the option the user has specified, rather than raising an exception. On the other hand, I am happy to leave the exception in.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree with you - this is not the typical compression situation, we save the "metadata" separately, so, effectively, the optional parameter is meaningless here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Um. When you say you agree with me, you mean you're happy to drop the exception raising?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:)
I mean that your point is good, and I'd be happy to drop the exceptions altogether, but I think this is a preference call that I'm happy to leave to you.
fastparquet/compression.py
Outdated
raise RuntimeError('write_content_size cannot be false for the ZSTD compressor') | ||
if kwargs['write_content_size'] == True: | ||
raise RuntimeError( | ||
'write_content_size cannot be false for the ZSTD compressor' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but 'write_content_size
is True
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, typo in the exception message.
@@ -470,12 +470,8 @@ def test_compression_lz4(tempdir): | |||
"y": { | |||
"type": "lz4", | |||
"args": { | |||
"compression_level": 5, | |||
"content_checksum": True, | |||
"block_size": 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these options now obsolete for the (single) block? This is just for curiosity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the block compressor doesn't have these options.
63ea969
to
2f7b35c
Compare
OK, I decided to remove the exception raising part. |
Thanks very much, @jonathanunderwood |
This removes LZ4 frame compression in favour of LZ4 block
compression to ensure the LZ4 compression used is compatible
with the paqrquet-cpp implementation.
Because the lz4.block decompressor needs to know the size
of the uncompressed data, this change also requires:
Where a decompressor doesn't need to know the uncompressed
data size, we simply ignore it.