@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes schema compatibility support in the normal (non-vectorized) Parquet reader. It does not fully solve the issue for the vectorized reader.

Currently, if the user-given schema differs from the Parquet schema, reading throws an exception even when the user-given schema is compatible with the Parquet schema.

For example, executing the code below:

import org.apache.spark.sql.types._

spark.conf.set("spark.sql.parquet.enableVectorizedReader", false.toString)
val path = "/tmp/abcd"
val data = (1 to 4).map(Tuple1(_))
spark.createDataFrame(data).toDF("a").write.parquet(path)
val schema = StructType(StructField("a", LongType, true) :: Nil)
spark.read.schema(schema).parquet(path).show()

throws an exception as below:

org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 
...

This PR lets the normal Parquet reader support this schema compatibility for numeric types.
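As a rough illustration of what "compatible" could mean for numeric types, the sketch below models lossless widening from the type stored in the file to the type requested by the user (for instance, an int column read as long, as in the example above). Every name here (`NumericKind`, `SchemaCompat`, `canRead`) is hypothetical, and the widening table is an assumption made for illustration; it is not this PR's implementation and not part of Spark's or Parquet's API.

```scala
// Hypothetical model of numeric schema compatibility. Not Spark code.
sealed trait NumericKind
case object IntKind extends NumericKind
case object LongKind extends NumericKind
case object FloatKind extends NumericKind
case object DoubleKind extends NumericKind

object SchemaCompat {
  // Assumed widening table: each file type maps to the requested types
  // that can hold all of its values without loss.
  private val widenings: Map[NumericKind, Set[NumericKind]] = Map(
    IntKind    -> Set(IntKind, LongKind, DoubleKind),
    LongKind   -> Set(LongKind),
    FloatKind  -> Set(FloatKind, DoubleKind),
    DoubleKind -> Set(DoubleKind)
  )

  // True when a column stored as `fileType` could be read as `requested`
  // under the assumed table above.
  def canRead(fileType: NumericKind, requested: NumericKind): Boolean =
    widenings.getOrElse(fileType, Set.empty[NumericKind]).contains(requested)
}
```

Under this assumed table, `SchemaCompat.canRead(IntKind, LongKind)` holds, while the narrowing direction `SchemaCompat.canRead(LongKind, IntKind)` does not.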

How was this patch tested?

Unit tests in ParquetIOSuite.

@HyukjinKwon
Member Author

HyukjinKwon commented Apr 10, 2017

@wgtmac and @sameeragarwal, I am sorry it took me a (long) while to reopen this. I was working on this together with the vectorized reader, but I realised that the PR was getting too big.

I had to go back to the original approach (which only deals with numeric types in the normal Parquet reader). Do you think it is okay to handle this as a separate PR?

@HyukjinKwon
Member Author

It is probably too late for 2.2. If anyone feels so, I will close this again for now.

@SparkQA

SparkQA commented Apr 10, 2017

Test build #75652 has finished for PR 17589 at commit 1d4d40c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 10, 2017

Test build #75653 has finished for PR 17589 at commit c6865bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 11, 2017

Test build #75705 has finished for PR 17589 at commit cbf8a22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 7, 2017

Test build #81527 has finished for PR 17589 at commit cbf8a22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Closing this. I will take another look and make a cleaner fix next time, or reopen if I see more interest in this.

@HyukjinKwon HyukjinKwon deleted the SPARK-16544 branch January 2, 2018 03:38