
[SPARK-17354][SQL] Partitioning by dates/timestamps should work with Parquet vectorized reader #14919

Closed
wants to merge 3 commits into apache:master from HyukjinKwon:SPARK-17354

Conversation

@HyukjinKwon (Member) commented Sep 1, 2016

What changes were proposed in this pull request?

This PR fixes `ColumnVectorUtils.populate` so that the Parquet vectorized reader can read partitioned tables with date/timestamp partition columns. The normal (non-vectorized) Parquet reader already handles this case correctly.

`ColumnVectorUtils.populate` is only called from [VectorizedParquetRecordReader.java#L185](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L185).

When the partition column type is explicitly specified as `DateType` or `TimestampType` (rather than inferred), reading fails with the exception below:

```
16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
    at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
...
```

How was this patch tested?

Unit tests in SQLQuerySuite.

@SparkQA commented Sep 1, 2016

Test build #64780 has finished for PR 14919 at commit bef83fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal (Member) commented:

@HyukjinKwon can we add a more targeted unit test in one of the vectorized reader test suites that can explicitly test for partitioned columns?

@HyukjinKwon (Member, Author) commented:

@sameeragarwal Thanks for your comment. Yeap, I will.
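
For context, a more targeted test along those lines could look roughly like the sketch below. It is only illustrative: the suite name is invented, and it leans on Spark's internal test helpers (`QueryTest`, `SharedSQLContext`, `withTempPath`, `withSQLConf`, `checkAnswer`), so treat it as a sketch of the shape of such a test rather than the test that was eventually committed:

```scala
import java.sql.Date

import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSQLContext
import org.apache.spark.sql.types.{DateType, IntegerType, StructType}

// Hypothetical suite name; a real test would live in one of the existing
// Parquet / vectorized reader suites instead.
class PartitionedDateColumnReadSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  test("SPARK-17354: date partition column is readable by the vectorized reader") {
    withTempPath { dir =>
      val path = dir.getCanonicalPath

      // Partition the data by a date column.
      Seq((1, Date.valueOf("2016-09-01")))
        .toDF("value", "part")
        .write
        .partitionBy("part")
        .parquet(path)

      // Declare the partition column type explicitly so the vectorized reader
      // has to fill it in via ColumnVectorUtils.populate.
      val schema = new StructType()
        .add("value", IntegerType)
        .add("part", DateType)

      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true") {
        checkAnswer(
          spark.read.schema(schema).parquet(path),
          Row(1, Date.valueOf("2016-09-01")))
      }
    }
  }
}
```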

@HyukjinKwon (Member, Author) commented:

@sameeragarwal Could you take another look please?

@SparkQA commented Sep 3, 2016

Test build #64891 has finished for PR 14919 at commit 88f7d29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 4, 2016

Test build #64909 has finished for PR 14919 at commit acf2a3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author) commented Sep 4, 2016

@davies Do you mind if I ask you to review, please?

@sameeragarwal (Member) commented:

LGTM, thanks @HyukjinKwon! cc @davies

@HyukjinKwon (Member, Author) commented:

Thanks @sameeragarwal!

@davies (Contributor) commented Sep 9, 2016

LGTM, merging into master and 2.0 branch, thanks!

asfgit closed this in f7d2143 on Sep 9, 2016
asfgit pushed a commit that referenced this pull request Sep 9, 2016
[SPARK-17354][SQL] Partitioning by dates/timestamps should work with Parquet vectorized reader

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14919 from HyukjinKwon/SPARK-17354.

(cherry picked from commit f7d2143)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
[SPARK-17354][SQL] Partitioning by dates/timestamps should work with Parquet vectorized reader

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#14919 from HyukjinKwon/SPARK-17354.
HyukjinKwon deleted the SPARK-17354 branch on January 2, 2018.