Latest commit 7f8e952 Jun 30, 2016 @piyushnarang piyushnarang committed with julienledem PARQUET-642: Improve performance of ByteBuffer based read / write paths
While trying out the newest Parquet version, we noticed that the changes introducing ByteBuffers (6b605a4 and 6b24a1d, mostly Avro plus a couple of ByteBuffer changes) slowed our jobs down a bit.

Read overhead: 4-6% (MB_Millis)
Write overhead: 6-10% (MB_Millis)

This seems to be due to the encoding / decoding of Strings in the [Binary class](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
[toStringUsingUTF8()](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L388) - for reads
[encodeUTF8()](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L236) - for writes
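
The read-side idea can be sketched as follows. This is not the actual `Binary` implementation; the class and method names are illustrative. The point is that for heap-backed ByteBuffers you can decode directly from the backing array with the JDK's optimized `new String(byte[], ..., UTF_8)` constructor instead of going through a general `CharsetDecoder`, which allocates intermediate buffers:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: branch on whether the buffer is heap-backed
// and use the cheap String constructor when possible.
public class Utf8Decode {
    public static String toStringUsingUTF8(ByteBuffer buffer) {
        if (buffer.hasArray()) {
            // Fast path: decode straight from the backing array, no extra copy.
            return new String(buffer.array(),
                    buffer.arrayOffset() + buffer.position(),
                    buffer.remaining(),
                    StandardCharsets.UTF_8);
        }
        // Fallback for direct buffers: copy the bytes out, then decode.
        // duplicate() keeps the caller's position/limit untouched.
        byte[] bytes = new byte[buffer.remaining()];
        buffer.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.wrap("héllo".getBytes(StandardCharsets.UTF_8));
        System.out.println(toStringUsingUTF8(heap));
    }
}
```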

With these changes we see around 5% improvement in MB_Millis while running the job on our Hadoop cluster.

Added some microbenchmark details to the jira.

Note that I've left the behavior the same for the Avro write path: it still uses CharSequence and the Charset-based encoders.
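
The write-side split described above can be sketched like this (again illustrative, not the actual `Binary.encodeUTF8()` code): plain Strings take `String.getBytes(UTF_8)`, which uses the JDK's internal fast encoder, while generic CharSequences, as on the Avro path, keep a Charset-based encode:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: fast-path encoding for String, Charset-based
// encoding for any other CharSequence (e.g. Avro's Utf8).
public class Utf8Encode {
    public static ByteBuffer encodeUTF8(CharSequence value) {
        if (value instanceof String) {
            // Fast path: String.getBytes(Charset) avoids CharsetEncoder setup.
            return ByteBuffer.wrap(((String) value).getBytes(StandardCharsets.UTF_8));
        }
        // Generic path: encode the CharSequence via the UTF-8 Charset.
        return StandardCharsets.UTF_8.encode(CharBuffer.wrap(value));
    }

    public static void main(String[] args) {
        ByteBuffer encoded = Utf8Encode.encodeUTF8("héllo");
        System.out.println(encoded.remaining() + " bytes");
    }
}
```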

Author: Piyush Narang <pnarang@twitter.com>

Closes #347 from piyushnarang/bytebuffer-encoding-fix-pr and squashes the following commits:

43c5bdd [Piyush Narang] Keep avro on char sequence
2d50c8c [Piyush Narang] Update Binary approach
9e58237 [Piyush Narang] Proof of concept fixes