-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17354][SQL] Partitioning by dates/timestamps should work with Parquet vectorized reader #14919
Conversation
Test build #64780 has finished for PR 14919 at commit
|
@HyukjinKwon can we add a more targeted unit test in one of the vectorized reader test suites that can explicitly test for partitioned columns? |
@sameeragarwal Thanks for your comment. Yeap, I will. |
@sameeragarwal Could you take another look please? |
Test build #64891 has finished for PR 14919 at commit
|
Test build #64909 has finished for PR 14919 at commit
|
@davies Do you mind if I ask to review please? |
LGTM, thanks @HyukjinKwon! cc @davies |
Thanks @sameeragarwal ! |
LGTM, merging into master and 2.0 branch, thanks! |
… Parquet vectorized reader This PR fixes `ColumnVectorUtils.populate` so that Parquet vectorized reader can read partitioned table with dates/timestamps. This works fine with Parquet normal reader. This is being only called within [VectorizedParquetRecordReader.java#L185](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L185). When partition column types are explicitly given to `DateType` or `TimestampType` (rather than inferring the type of partition column), this fails with the exception below: ``` 16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6) java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362) ... ``` Unit tests in `SQLQuerySuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #14919 from HyukjinKwon/SPARK-17354. (cherry picked from commit f7d2143) Signed-off-by: Davies Liu <davies.liu@gmail.com>
… Parquet vectorized reader ## What changes were proposed in this pull request? This PR fixes `ColumnVectorUtils.populate` so that Parquet vectorized reader can read partitioned table with dates/timestamps. This works fine with Parquet normal reader. This is being only called within [VectorizedParquetRecordReader.java#L185](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L185). When partition column types are explicitly given to `DateType` or `TimestampType` (rather than inferring the type of partition column), this fails with the exception below: ``` 16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6) java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362) ... ``` ## How was this patch tested? Unit tests in `SQLQuerySuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes apache#14919 from HyukjinKwon/SPARK-17354.
What changes were proposed in this pull request?
This PR fixes
ColumnVectorUtils.populate
so that Parquet vectorized reader can read partitioned table with dates/timestamps. This works fine with Parquet normal reader.This is being only called within VectorizedParquetRecordReader.java#L185.
When partition column types are explicitly given to
DateType
orTimestampType
(rather than inferring the type of partition column), this fails with the exception below:How was this patch tested?
Unit tests in
SQLQuerySuite
.