[SPARK-14015][SQL] Support TimestampType in vectorized parquet reader #11882
Conversation
cc @nongli

Test build #53757 has finished for PR 11882 at commit
case INT96:
  if (column.dataType() == DataTypes.TimestampType) {
    for (int i = rowId; i < rowId + num; ++i) {
      Binary v = dictionary.decodeToBinary(dictionaryIds.getInt(i));
hm -- maybe we can do something a lot cheaper? At the very least maybe we can remove the creation of this Binary object, since we are turning it immediately into a Long.
sure, that sounds good
Added a TODO for converting the dictionary of binaries into longs to make it cheaper
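For reference, here is a minimal sketch of the conversion being discussed, assuming Parquet's usual INT96 timestamp layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day, both little-endian); the class and method names are illustrative, not Spark's actual internals:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative helper, not the PR's code: convert one 12-byte Parquet INT96
// timestamp into microseconds since the Unix epoch, i.e. the long value a
// TimestampType column vector would hold.
final class Int96Sketch {
  // 1970-01-01 expressed as a Julian day number.
  private static final int JULIAN_DAY_OF_EPOCH = 2440588;
  private static final long MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000;

  static long int96ToMicros(byte[] int96) {
    // Assumed layout: nanoseconds within the day (8 bytes), then the
    // Julian day (4 bytes), both little-endian.
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();
    int julianDay = buf.getInt();
    return (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosOfDay / 1000;
  }
}
```

The TODO above presumably amounts to running such a conversion once per dictionary entry when the dictionary is loaded, rather than once per row.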
Test build #53759 has finished for PR 11882 at commit
-  if (column.dataType() == DataTypes.LongType ||
+  if (column.dataType() == DataTypes.LongType || column.dataType() == DataTypes.TimestampType ||
       DecimalType.is64BitDecimalType(column.dataType())) {
     defColumn.readLongs(
hmm. How does this work? readLongs is expecting to read parquet int64 physical types. How is it able to read this other physical type?
Do we have tests where this type is not dictionary encoded?
Thanks, fixed. Also added tests in HadoopFsRelationTest that test both the dictionary encoded and non-encoded versions across all supported datatypes.
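As a rough illustration of that kind of test (a standalone sketch, not the actual HadoopFsRelationTest code; the paths and the parquet.enable.dictionary property are assumptions), one can write the same timestamp data once with Parquet dictionary encoding enabled and once with it disabled, then read both files back:

```java
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TimestampEncodingRoundTrip {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("timestamp-roundtrip")
        .getOrCreate();

    List<Timestamp> data = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      // Only a few distinct values, so Parquet will normally dictionary-encode them.
      data.add(new Timestamp(1458000000000L + (i % 4) * 1000L));
    }
    Dataset<Timestamp> ds = spark.createDataset(data, Encoders.TIMESTAMP());

    // Dictionary-encoded file (Parquet's default behavior).
    ds.write().mode("overwrite").parquet("/tmp/ts_dict");

    // Plain-encoded file: disable dictionary encoding for the second write.
    spark.sparkContext().hadoopConfiguration().set("parquet.enable.dictionary", "false");
    ds.write().mode("overwrite").parquet("/tmp/ts_plain");

    Dataset<Row> dict = spark.read().parquet("/tmp/ts_dict");
    Dataset<Row> plain = spark.read().parquet("/tmp/ts_plain");
    if (dict.count() != data.size() || plain.count() != data.size()) {
      throw new AssertionError("round trip lost rows");
    }
    spark.stop();
  }
}
```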
Test build #53829 has finished for PR 11882 at commit
originalTypes[i] = t.getOriginalType();
// TODO: Be extremely cautious in what is supported. Expand this.
I think we can remove this check now too.
sure, sounds good.
LGTM
Test build #53940 has finished for PR 11882 at commit
What changes were proposed in this pull request?
This PR adds support for TimestampType in the vectorized parquet reader.
How was this patch tested?
VectorizedColumnReader initially had a gating condition on primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT96 that made us fall back on parquet-mr for handling timestamps. This condition is now removed. ParquetHadoopFsRelationSuite (which tests all supported Hive types, including TimestampType) fails when the gating condition is removed ([WIP][SPARK-13994][SQL] Investigate types not supported by vectorized parquet reader #11808) and should now pass with this change. Similarly, the ParquetHiveCompatibilitySuite "SPARK-10177 timestamp" test, which fails when the gating condition is removed, should now pass as well. Tests were also added in HadoopFsRelationTest that exercise both the dictionary-encoded and non-encoded versions across all supported data types.
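As an end-to-end sanity check in the same spirit (a hypothetical snippet, not one of the suites named above; the spark.sql.parquet.enableVectorizedReader key is assumed to be the relevant toggle), the vectorized and parquet-mr code paths can be compared directly on the same timestamp data:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class VectorizedTimestampCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();
    String path = "/tmp/ts_dict";  // any Parquet data containing a timestamp column

    // Read with the vectorized reader and materialize the result.
    spark.conf().set("spark.sql.parquet.enableVectorizedReader", "true");
    Dataset<Row> vectorized = spark.read().parquet(path).cache();
    long vectorizedCount = vectorized.count();

    // Read the same data again through the parquet-mr fallback path.
    spark.conf().set("spark.sql.parquet.enableVectorizedReader", "false");
    Dataset<Row> fallback = spark.read().parquet(path).cache();
    long fallbackCount = fallback.count();

    // Both code paths should return exactly the same rows.
    if (vectorizedCount != fallbackCount
        || vectorized.except(fallback).count() != 0
        || fallback.except(vectorized).count() != 0) {
      throw new AssertionError("vectorized and parquet-mr reads disagree");
    }
    spark.stop();
  }
}
```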