Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use LZ4 block compression for compatibility with parquet-cpp (#314) #315

Conversation

jonathanunderwood
Copy link

This removes LZ4 frame compression in favour of LZ4 block
compression to ensure the LZ4 compression used is compatible
with the paqrquet-cpp implementation.

Because the lz4.block decompressor needs to know the size
of the uncompressed data, this change also requires:

- propagation of page_header.uncompressed_page_size
  through to decompress_data
- the decompression functions for all compressors
  to handle the uncompressed_size argument

Where a decompressor doesn't need to know the uncompressed
data size, we simply ignore it.

This removes LZ4 frame compression in favour of LZ4 block
compression to ensure the LZ4 compression used is compatible
with the paqrquet-cpp implementation.

Because the lz4.block decompressor needs to know the size
of the uncompressed data, this change also requires:

    - propagation of page_header.uncompressed_page_size
      through to decompress_data
    - the decompression functions for all compressors
      to handle the uncompressed_size argument

Where a decompressor doesn't need to know the uncompressed
data size, we simply ignore it.
Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor questions

try:
if kwargs['store_size'] == True:
raise RuntimeError(
'store_size cannot be True for LZ4 the compressor'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify the wording here, please?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can. I have to admit, I was in half a mind to just silently override that option if the user specified it, rather than raise an exception. Similarly for the write_content_size parameter for zstd.

In both cases, this option, when True means "store the uncompressed size with the compressed data". We force it to be False, because the size is stored outside the compressed block in the metadata, so it would be redundant to store it in the compressed frame header.

In essence though, we meet the intent of specifying these to be True by storing the size with in the metadata. So, I'd be comfortable for just silently overriding the option the user has specified, rather than raising an exception. On the other hand, I am happy to leave the exception in.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with you - this is not the typical compression situation, we save the "metadata" separately, so, effectively, the optional parameter is meaningless here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Um. When you say you agree with me, you mean you're happy to drop the exception raising?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)
I mean that your point is good, and I'd be happy to drop the exceptions altogether, but I think this is a preference call that I'm happy to leave to you.

raise RuntimeError('write_content_size cannot be false for the ZSTD compressor')
if kwargs['write_content_size'] == True:
raise RuntimeError(
'write_content_size cannot be false for the ZSTD compressor'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but 'write_content_size is True here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, typo in the exception message.

@@ -470,12 +470,8 @@ def test_compression_lz4(tempdir):
"y": {
"type": "lz4",
"args": {
"compression_level": 5,
"content_checksum": True,
"block_size": 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these options now obsolete for the (single) block? This is just for curiosity.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the block compressor doesn't have these options.

@jonathanunderwood jonathanunderwood force-pushed the pass_uncompressed_size_to_decompress branch from 63ea969 to 2f7b35c Compare March 11, 2018 18:34
@jonathanunderwood
Copy link
Author

OK, I decided to remove the exception raising part.

@martindurant
Copy link
Member

Thanks very much, @jonathanunderwood

@martindurant martindurant merged commit dbb6b22 into dask:master Mar 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants