[SPARK-17354][SQL] Partitioning by dates/timestamps should work with Parquet vectorized reader #14919

HyukjinKwon · 2016-09-01T13:14:25Z

What changes were proposed in this pull request?

This PR fixes ColumnVectorUtils.populate so that the Parquet vectorized reader can read tables partitioned by date or timestamp columns. The non-vectorized Parquet reader already handles this case correctly.

ColumnVectorUtils.populate is only called from VectorizedParquetRecordReader.java#L185.

When a partition column's type is explicitly specified as DateType or TimestampType (rather than inferred), the read fails with the exception below:

16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
    at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
...
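
The exception is consistent with how Spark SQL represents these types internally: a DateType value is an int (days since 1970-01-01) and a TimestampType value is a long (microseconds since the epoch), so a partition value that arrives in internal form cannot be cast to java.sql.Date. The following is a minimal plain-Java sketch of that failure mode and of one plausible way around it. It is not Spark's code: the class `PartitionValueSketch` and its methods are hypothetical names used only for illustration, and plain arrays stand in for column vectors.

```java
import java.sql.Date;
import java.time.LocalDate;
import java.util.Arrays;

// Hypothetical stand-in for the failure mode described above; this is not Spark's code.
public class PartitionValueSketch {

    // Buggy variant: assumes the partition value arrives as an external java.sql.Date.
    public static int[] populateDateBuggy(Object partitionValue, int capacity) {
        Date date = (Date) partitionValue; // throws ClassCastException when given an Integer
        int days = (int) date.toLocalDate().toEpochDay();
        int[] column = new int[capacity];
        Arrays.fill(column, days);
        return column;
    }

    // Fixed variant: treats the value as the internal form, days since 1970-01-01.
    public static int[] populateDateFixed(Object partitionValue, int capacity) {
        int days = (Integer) partitionValue;
        int[] column = new int[capacity];
        Arrays.fill(column, days);
        return column;
    }

    // Same idea for timestamps: the internal form is a long of microseconds since the epoch.
    public static long[] populateTimestampFixed(Object partitionValue, int capacity) {
        long micros = (Long) partitionValue;
        long[] column = new long[capacity];
        Arrays.fill(column, micros);
        return column;
    }

    public static void main(String[] args) {
        // Internal representation of the partition value date=2016-09-01.
        Object internalDate = Integer.valueOf((int) LocalDate.of(2016, 9, 1).toEpochDay());

        try {
            populateDateBuggy(internalDate, 4);
        } catch (ClassCastException e) {
            System.out.println("buggy path failed: " + e.getMessage());
        }

        int[] days = populateDateFixed(internalDate, 4);
        System.out.println("fixed path filled " + days.length
                + " rows with " + LocalDate.ofEpochDay(days[0]));
    }
}
```

Because a partition value is constant for every row in a batch, filling the whole column with a single value mirrors what a populate step needs to do. The sketch only changes how that single value is read.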

How was this patch tested?

Unit tests in SQLQuerySuite.

SparkQA · 2016-09-01T15:14:22Z

Test build #64780 has finished for PR 14919 at commit bef83fc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sameeragarwal · 2016-09-03T00:26:46Z

@HyukjinKwon can we add a more targeted unit test in one of the vectorized reader test suites that can explicitly test for partitioned columns?

HyukjinKwon · 2016-09-03T04:19:25Z

@sameeragarwal Thanks for your comment. Yep, I will.

HyukjinKwon · 2016-09-03T06:05:37Z

@sameeragarwal Could you take another look please?

SparkQA · 2016-09-03T07:59:44Z

Test build #64891 has finished for PR 14919 at commit 88f7d29.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-04T03:46:02Z

Test build #64909 has finished for PR 14919 at commit acf2a3d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-09-04T23:57:42Z

@davies Would you mind reviewing this, please?

sameeragarwal · 2016-09-06T16:18:45Z

LGTM, thanks @HyukjinKwon! cc @davies

HyukjinKwon · 2016-09-07T00:25:40Z

Thanks @sameeragarwal!

davies · 2016-09-09T21:22:47Z

LGTM, merging into master and 2.0 branch, thanks!

… Parquet vectorized reader

This PR fixes `ColumnVectorUtils.populate` so that Parquet vectorized reader can read partitioned table with dates/timestamps. This works fine with Parquet normal reader.

This is being only called within [VectorizedParquetRecordReader.java#L185](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L185).

When partition column types are explicitly given to `DateType` or `TimestampType` (rather than inferring the type of partition column), this fails with the exception below:

```
16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
    at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
...
```

Unit tests in `SQLQuerySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14919 from HyukjinKwon/SPARK-17354.

(cherry picked from commit f7d2143)
Signed-off-by: Davies Liu <davies.liu@gmail.com>


Partitioning by dates/timestamps works with Parquet vectorized reader

bef83fc

Add more unit tests about partition column types

88f7d29

HyukjinKwon mentioned this pull request Sep 3, 2016

[SPARK-17388][SQL] Support for inferring type date/timestamp/decimal for partition column #14947

Closed

Clean up the test in ParquetIOSuite

acf2a3d

asfgit closed this in f7d2143 Sep 9, 2016

HyukjinKwon deleted the SPARK-17354 branch January 2, 2018 03:44
