Skip to content
This repository has been archived by the owner on May 10, 2024. It is now read-only.

PARQUET-456: Add GZip compressor #11

Closed
wants to merge 25 commits into from
Closed

Conversation

knizhnik
Copy link

@knizhnik knizhnik commented Jun 8, 2015

To be able to read files created using Spark I have to add Gzip compression support using zlib

@wesm
Copy link
Member

wesm commented Jan 22, 2016

I created a JIRA to track this issue

https://issues.apache.org/jira/browse/PARQUET-456

@julienledem
Copy link
Member

@knizhnik please prefix PR description with PARQUET-456:
PARQUET-456: Add GZip compressor

@knizhnik knizhnik changed the title Add GZip compressor PARQUET-456: Add GZip compressor Jan 28, 2016
@wesm
Copy link
Member

wesm commented Jan 28, 2016

This will have to be rebased -- pls let us know when there is a passing build.

@julienledem
Copy link
Member

thanks @knizhnik
Please check the output of the build to see what's wrong.

@wesm
Copy link
Member

wesm commented Jan 29, 2016

Can you please add zlib to the sandbox build toolchain? See thirdparty/ and feel free to reuse the CMake module for Snappy so that $ZLIB_HOME can be configured by the user

lz4-codec.cc
snappy-codec.cc
gzip-codec.cc
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The location of this file has changed, can you remove or update?

@wesm
Copy link
Member

wesm commented Jan 29, 2016

Please add zlib as a configurable part of the build toolchain as we should not require the user to install zlib at the system level.

  • Add a cmake module for Zlib to cmake_modules/. You can use FindSnappy.cmake as a starting point. You will need to edit the main CMakeLists.txt. For example, here is the configuration for Snappy
## Snappy
find_package(Snappy REQUIRED)
include_directories(SYSTEM ${SNAPPY_INCLUDE_DIR})
add_library(snappystatic STATIC IMPORTED)
set_target_properties(snappystatic PROPERTIES IMPORTED_LOCATION ${SNAPPY_STATIC_LIB})
  • Add to sandbox: thirdparty/download_thirdparty.sh, thirdparty/build_thirdparty.sh
  • Use the thirdparty-installed zlib in the Travis CI build. See ci/before_script_travis.sh
  • Add zlib to ./setup_build_env.sh

Thank you!

if [ -n "$F_ALL" -o -n "$F_ZLIB" ]; then
cd $TP_DIR/$ZLIB_BASEDIR/cmake_unofficial
CFLAGS=-fPIC cmake -DCMAKE_INSTALL_PREFIX:PATH=$PREFIX $ZLIB_DIR
make -j$PARALLEL install
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like zlib does have a CMakeLists.txt, but it's in the base directory of the tarball, so remove the /cmake_unofficial

@wesm
Copy link
Member

wesm commented Feb 1, 2016

@knizhnik I recommend doing a git merge --squash on this to combine all the commits then rebasing on upstream/master (where upstream is apache/parquet-cpp) to get a passing build. This conflicts with #32 but I can deal with resolving that merge conflict

@wesm
Copy link
Member

wesm commented Feb 1, 2016

@knizhnik I can take over from here if you like, please advise

@knizhnik
Copy link
Author

knizhnik commented Feb 1, 2016

Sorry, I do not have much experience in using git. This is the reason of my delay with the answer - I just wanted to wait for tomorrow and ask help of my colleague. I have already merged with upstream. Not sure if it is still possible to combine all my changes into single commit... So it will be great if you can take over, you will safe a lot of my time:)

@wesm
Copy link
Member

wesm commented Feb 1, 2016

No problem! I really appreciate the work you've done on this. I refactored some things that conflict with this in #32 so I'll resolve the conflict and post a new PR. I'll ping you if I need help with anything else.

@wesm
Copy link
Member

wesm commented Feb 3, 2016

I'll take a look at this within the next week or so; I also want to add unit tests for page decompression. Will circle back

@@ -51,6 +51,12 @@ ColumnReader::ColumnReader(const parquet::ColumnMetaData* metadata,
case CompressionCodec::SNAPPY:
decompressor_.reset(new SnappyCodec());
break;
case CompressionCodec::LZO:
decompressor_.reset(new Lz4Codec());
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this patch also add LZO support?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decompressor was already available, but never initialized here. I'm going to rebase this patch after #38 and add unit tests for the decompression code path using a test fixture (unless someone else wants to do that)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I tried enabling Lz4Codec this way on the current trunk and it crashed. I think more work is needed to enable lzo support.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted — thanks for trying it out so I know what to expect =)

@wesm
Copy link
Member

wesm commented Feb 11, 2016

I'm going to pick up this patch after the present patch queue is cleared, so later this week or early next week

@wesm
Copy link
Member

wesm commented Feb 13, 2016

@knizhnik thanks again for getting this going. still some work remaining with this but should be done in the next week or so

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants