Latest commit 7f8e952 · Jun 30, 2016
While trying out the newest Parquet version, we noticed that the changes to start using ByteBuffers (6b605a4 and 6b24a1d, mostly Avro but a couple of ByteBuffer changes) caused our jobs to slow down a bit:

- Read overhead: 4-6% (in MB_Millis)
- Write overhead: 6-10% (MB_Millis)

This seems to be due to the encoding/decoding of Strings in the [Binary class](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):

- [toStringUsingUTF8()](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L388) - for reads
- [encodeUTF8()](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L236) - for writes

With these changes we see around a 5% improvement in MB_Millis while running the job on our Hadoop cluster. I've added some microbenchmark details to the JIRA. Note that I've left the behavior the same for the Avro write path - it still uses CharSequence and the Charset-based encoders.

Author: Piyush Narang <pnarang@twitter.com>

Closes #347 from piyushnarang/bytebuffer-encoding-fix-pr and squashes the following commits:

- 43c5bdd [Piyush Narang] Keep avro on char sequence
- 2d50c8c [Piyush Narang] Update Binary approach
- 9e58237 [Piyush Narang] Proof of concept fixes
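To illustrate the kind of difference at play, the sketch below (my own minimal example, not the actual Parquet patch; the class and method names are hypothetical) contrasts the two JDK routes for UTF-8 encoding a `String`: a general `CharsetEncoder`, comparable in spirit to the Charset-based path the commit moves away from, versus `String.getBytes(StandardCharsets.UTF_8)`, which the JDK optimizes aggressively for ASCII-dominated strings. Both must produce identical bytes; only the cost differs.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8EncodeSketch {

    // General path: allocate a CharsetEncoder and encode via CharBuffer,
    // similar in spirit to the Charset-based encoders mentioned in the commit.
    static byte[] encodeWithEncoder(String s) {
        try {
            CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
            ByteBuffer buf = enc.encode(CharBuffer.wrap(s));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new RuntimeException(e);
        }
    }

    // Direct path: String.getBytes with a Charset constant; the JDK has a
    // heavily optimized fast path for this, especially for ASCII content.
    static byte[] encodeDirect(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sample = "parquet-mr Binary \u00e9\u4e2d";
        byte[] a = encodeWithEncoder(sample);
        byte[] b = encodeDirect(sample);
        // Both routes must agree byte-for-byte; only their cost differs.
        if (!Arrays.equals(a, b)) {
            throw new AssertionError("encodings differ");
        }
        System.out.println("identical bytes, length " + a.length);
    }
}
```

In a hot per-record loop, like the read/write paths in `Binary`, avoiding the per-call encoder allocation and `CharBuffer` wrapping is where savings of the reported few percent typically come from.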
Files in this directory:

- Binary.java
- Converter.java
- GroupConverter.java
- PrimitiveConverter.java
- RecordConsumer.java
- RecordMaterializer.java