Mirror of Apache Parquet

Parquet MR

Parquet-MR contains the Java implementation of the Parquet format. Parquet is a columnar storage format for Hadoop; it provides efficient storage and encoding of data. Parquet uses the record shredding and assembly algorithm described in the Dremel paper to represent nested structures.

You can find some details about the format and intended use cases in our Hadoop Summit 2013 presentation.

Building

Parquet-MR uses Maven to build and depends on both the thrift and protoc compilers.

Install Protobuf

To build and install the protobuf compiler, run:

wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure
make
sudo make install
sudo ldconfig

Install Thrift

To build and install the thrift compiler, run:

wget -nv http://archive.apache.org/dist/thrift/0.7.0/thrift-0.7.0.tar.gz
tar xzf thrift-0.7.0.tar.gz
cd thrift-0.7.0
chmod +x ./configure
./configure --disable-gen-erl --disable-gen-hs --without-ruby --without-haskell --without-erlang
sudo make install

Build Parquet with Maven

Once protobuf and thrift are available in your path, you can build the project by running:

LC_ALL=C mvn clean install

Features

Parquet is a very active project, and new features are being added quickly; below is the state as of June 2013.

Feature                           In trunk / In dev / Planned   Expected release
Type-specific encoding            YES                           1.0
Hive integration                  YES (28)                      1.0
Pig integration                   YES                           1.0
Cascading integration             YES                           1.0
Crunch integration                YES (CRUNCH-277)              1.0
Impala integration                YES (non-nested)              1.0
Java Map/Reduce API               YES                           1.0
Native Avro support               YES                           1.0
Native Thrift support             YES                           1.0
Complex structure support         YES                           1.0
Future-proofed versioning         YES                           1.0
RLE                               YES                           1.0
Bit Packing                       YES                           1.0
Adaptive dictionary encoding      YES                           1.0
Predicate pushdown                YES (68)                      1.0
Column stats                      YES                           2.0
Delta encoding                    YES                           2.0
Native Protocol Buffers support   YES                           1.0
Index pages                       YES                           2.0

Map/Reduce integration

Parquet-MR provides Hadoop Input and Output formats. To use one, you need to implement a WriteSupport or ReadSupport class, which handles the conversion of your objects to and from the Parquet schema.

We've implemented this for 2 popular data formats to provide a clean migration path as well:

Thrift

Thrift integration is provided by the parquet-thrift sub-project. If you are using Thrift through Scala, you may be using Twitter's Scrooge. If that's the case, not to worry -- we took care of the Scrooge/Apache Thrift glue for you in the parquet-scrooge sub-project.

Avro

Avro conversion is implemented via the parquet-avro sub-project.

Create your own objects

  • The ParquetOutputFormat can be provided a WriteSupport to write your own objects to an event-based RecordConsumer.
  • The ParquetInputFormat can be provided a ReadSupport to materialize your own objects by implementing a RecordMaterializer.
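
To make the "event-based" idea concrete, here is a minimal, self-contained Java sketch of the write side. Every type in it (SketchRecordConsumer, SketchWriteSupport, SketchUser and so on) is a hypothetical, simplified stand-in written for this example; the real Parquet-MR RecordConsumer and WriteSupport classes have more methods and different signatures, so treat this as an illustration of the pattern rather than the actual API.

```java
import java.util.ArrayList;
import java.util.List;

final class WriteSupportSketch {
  // Runs the write support against a logging consumer and returns the events.
  static List<String> eventsFor(SketchUser user) {
    SketchLoggingConsumer consumer = new SketchLoggingConsumer();
    new SketchUserWriteSupport().write(user, consumer);
    return consumer.events;
  }

  public static void main(String[] args) {
    for (String event : eventsFor(new SketchUser(7, "ada"))) {
      System.out.println(event);
    }
  }
}

// Hypothetical, simplified stand-ins for illustration only. The real
// Parquet-MR RecordConsumer and WriteSupport have more methods and
// different signatures.
interface SketchRecordConsumer {
  void startMessage();
  void startField(String name, int index);
  void addInteger(int value);
  void addString(String value);
  void endField(String name, int index);
  void endMessage();
}

interface SketchWriteSupport<T> {
  void write(T record, SketchRecordConsumer consumer);
}

// An example user-defined object with two fields.
final class SketchUser {
  final int id;
  final String name;

  SketchUser(int id, String name) {
    this.id = id;
    this.name = name;
  }
}

// Turns one SketchUser into a sequence of consumer events.
final class SketchUserWriteSupport implements SketchWriteSupport<SketchUser> {
  @Override
  public void write(SketchUser user, SketchRecordConsumer consumer) {
    consumer.startMessage();
    consumer.startField("id", 0);
    consumer.addInteger(user.id);
    consumer.endField("id", 0);
    consumer.startField("name", 1);
    consumer.addString(user.name);
    consumer.endField("name", 1);
    consumer.endMessage();
  }
}

// A consumer that just records each event it receives as a string.
final class SketchLoggingConsumer implements SketchRecordConsumer {
  final List<String> events = new ArrayList<>();

  @Override public void startMessage() { events.add("startMessage"); }
  @Override public void startField(String name, int index) { events.add("startField(" + name + ", " + index + ")"); }
  @Override public void addInteger(int value) { events.add("addInteger(" + value + ")"); }
  @Override public void addString(String value) { events.add("addString(" + value + ")"); }
  @Override public void endField(String name, int index) { events.add("endField(" + name + ", " + index + ")"); }
  @Override public void endMessage() { events.add("endMessage"); }
}
```

The point of the design is that the write support never builds an intermediate record: it walks your object once and emits one event per field boundary and value. A ReadSupport with a RecordMaterializer does the reverse, assembling your object from a similar stream of events.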

Apache Pig integration

A Loader and a Storer are provided to read and write Parquet files with Apache Pig.

Storing data into Parquet in Pig is simple:

-- options you might want to fiddle with
SET parquet.page.size 1048576 -- default. this is your min read/write unit.
SET parquet.block.size 134217728 -- default. your memory budget for buffering data
SET parquet.compression lzo -- or you can use none, gzip, snappy
STORE mydata into '/some/path' USING org.apache.parquet.pig.ParquetStorer;
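
The two size settings above are plain byte counts. As a quick sanity check, the small Java sketch below shows that they correspond to 1 MiB and 128 MiB. The class and method names are made up for this example, and the pages-per-block ratio is only a rough guide, since actual sizes on disk depend on encoding and compression.

```java
final class ParquetSizeSketch {
  // 1 MiB, the parquet.page.size value shown above.
  static final long PAGE_SIZE = 1024L * 1024L;

  // 128 MiB, the parquet.block.size value shown above.
  static final long BLOCK_SIZE = 128L * 1024L * 1024L;

  // Roughly how many page-sized units fit in one block-sized buffer.
  static long pagesPerBlock(long blockSize, long pageSize) {
    return blockSize / pageSize;
  }

  public static void main(String[] args) {
    System.out.println("page size (bytes):  " + PAGE_SIZE);
    System.out.println("block size (bytes): " + BLOCK_SIZE);
    System.out.println("pages per block:    " + pagesPerBlock(BLOCK_SIZE, PAGE_SIZE));
  }
}
```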

Reading in Pig is also simple:

mydata = LOAD '/some/path' USING org.apache.parquet.pig.ParquetLoader();

If the data was stored using Pig, things will "just work". If the data was stored using another method, you will need to provide the Pig schema equivalent to the data you stored (you can also write the schema to the file footer while writing it -- but that's pretty advanced). We will provide a basic automatic schema conversion soon.

Hive integration

Hive integration is provided via the parquet-hive sub-project.

Build

To run the unit tests: mvn test

To build the jars: mvn package

The build runs in Travis CI.

Add Parquet as a dependency in Maven

The current release is version 1.8.1.

  <dependencies>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-common</artifactId>
      <version>1.8.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-encoding</artifactId>
      <version>1.8.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-column</artifactId>
      <version>1.8.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-hadoop</artifactId>
      <version>1.8.1</version>
    </dependency>
  </dependencies>

How To Contribute

We prefer to receive contributions in the form of GitHub pull requests. Please send pull requests against the github.com/apache/parquet-mr repository. If you've previously forked Parquet from its old location, you will need to add a remote or update your origin remote to https://github.com/apache/parquet-mr.git

If you are looking for some ideas on what to contribute, check out JIRA issues for this project labeled "pick-me-up". Comment on the issue and/or contact dev@parquet.apache.org with your questions and ideas.

If you’d like to report a bug but don’t have time to fix it, you can still post it to our issue tracker, or email the mailing list dev@parquet.apache.org.

To contribute a patch:

  1. Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.
  2. Create a JIRA for your patch on the Parquet Project JIRA.
  3. Submit the patch as a GitHub pull request against the master branch. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. Prefix your pull request name with the JIRA name (ex: https://github.com/apache/parquet-mr/pull/240).
  4. Make sure that your code passes the unit tests. You can run the tests with mvn test in the root directory.
  5. Add new unit tests for your code.

We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some common issues that are not code structure related, but still important:

  • Use 2 spaces for indentation. Not tabs, not 4 spaces. The number of the spacing shall be 2.
  • Give your operators some room. Not a+b but a + b and not foo(int a,int b) but foo(int a, int b).
  • Generally speaking, stick to the Sun Java Code Conventions.
  • Make sure tests pass!
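
As a quick illustration of the formatting points above, here is a short Java sketch that uses 2-space indentation, spaces around binary operators, and a space after each comma. The class and its methods are hypothetical and exist only to show the layout.

```java
final class StyleSketch {
  // A space after the comma in the parameter list, spaces around "+".
  static int add(int a, int b) {
    return a + b;
  }

  // Each nesting level is indented by exactly 2 spaces.
  static int sumTo(int n) {
    int total = 0;
    for (int i = 1; i <= n; i++) {
      total = add(total, i);
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(sumTo(4));
  }
}
```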

Thank you for getting involved!

Authors and contributors

Code of Conduct

We hold ourselves and the Parquet developer community to two codes of conduct:

  1. The Apache Software Foundation Code of Conduct
  2. The Twitter OSS Code of Conduct

Discussions

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0