[SPARK-6315] [SQL] Also tries the case class string parser while reading Parquet schema #5034

liancheng · 2015-03-15T13:24:32Z

When writing Parquet files, Spark 1.1.x persists the schema string into Parquet metadata as the result of `StructType.toString`. Spark 1.2 replaced that format with a schema string in JSON format, but we still need to handle the old schema format when reading Parquet files.

SparkQA · 2015-03-15T13:28:10Z

Test build #28627 has started for PR 5034 at commit dfd0e0a.

This patch merges cleanly.

SparkQA · 2015-03-15T14:46:12Z

Test build #28627 has finished for PR 5034 at commit dfd0e0a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-15T14:46:16Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28627/

chenghao-intel · 2015-03-15T15:25:08Z

sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala

@@ -672,7 +672,11 @@ private[sql] object ParquetRelation2 {
        .getKeyValueMetaData
        .toMap
        .get(RowReadSupport.SPARK_METADATA_KEY)
-        .map(DataType.fromJson(_).asInstanceOf[StructType])
+        .map { serializedSchema =>
+          Try(DataType.fromJson(serializedSchema))


How about also logging a deprecation message here, to remind people to use the JSON format?

liancheng · Mar 15, 2015

Good point. Thanks!

liancheng · Mar 15, 2015

Hm, on second thought, we may not need a log here. Users don't need to care whether the schema is stored as JSON or as the old case class string when reading or writing Parquet files; Spark SQL handles this automatically.
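For readers following the thread, the "try JSON first, then the case class string" idea in the diff above can be sketched in isolation. This is only an illustration under stated assumptions: `Schema`, `parseJson`, and `parseCaseClassString` are hypothetical stand-ins, not Spark SQL APIs, and the real parsers are far stricter than these prefix checks.

```scala
import scala.util.Try

// Minimal sketch of a two-format schema reader.
// All names here are hypothetical stand-ins, not Spark SQL APIs.
object SchemaFallbackSketch {
  final case class Schema(format: String, raw: String)

  // Stand-in for the JSON parser: accepts only strings that look like a JSON object.
  def parseJson(s: String): Schema =
    if (s.trim.startsWith("{")) Schema("json", s)
    else throw new IllegalArgumentException(s"not JSON: $s")

  // Stand-in for the legacy parser: accepts only StructType.toString-style strings.
  def parseCaseClassString(s: String): Schema =
    if (s.trim.startsWith("StructType(")) Schema("case-class", s)
    else throw new IllegalArgumentException(s"not a case class string: $s")

  // Try the newer JSON format first; only if that throws, try the legacy format.
  def deserialize(s: String): Try[Schema] =
    Try(parseJson(s)).orElse(Try(parseCaseClassString(s)))
}
```

With this shape, a caller never has to know which format a given file carries; a string that matches neither format surfaces as a `Failure` that the caller can handle.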

marmbrus · 2015-03-16T22:27:23Z

Can we add a regression test that checks to make sure we can read old data so we avoid this in the future?

marmbrus · 2015-03-18T03:07:15Z

One other comment. It seems like we should not fail when we can't read the schema, but instead fall back on the schema that comes from Parquet. We always want to be able to access the data, even if we are missing Spark SQL-specific metadata.
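A minimal sketch of that fallback, assuming the Spark-specific schema string is optional and that a schema can always be derived from the Parquet file itself. The object and parameter names (`ParquetSchemaFallbackSketch`, `resolveSchema`, `parsers`, `fromParquet`) are hypothetical and only illustrate the control flow; this is not the code in the patch.

```scala
import scala.util.{Success, Try}

// Hypothetical illustration: never fail on unreadable Spark metadata.
object ParquetSchemaFallbackSketch {
  /**
   * @param serialized  the Spark-specific schema string from the file metadata, if any
   * @param parsers     candidate parsers, tried in order (e.g. JSON, then legacy)
   * @param fromParquet schema derived from the Parquet file itself; evaluated
   *                    only when the metadata is absent or no parser succeeds
   */
  def resolveSchema[A](
      serialized: Option[String],
      parsers: Seq[String => A],
      fromParquet: => A): A = {
    val parsed: Option[A] = serialized.flatMap { s =>
      // The iterator is lazy, so later parsers run only if earlier ones throw.
      parsers.iterator.map(p => Try(p(s))).collectFirst { case Success(a) => a }
    }
    parsed.getOrElse(fromParquet)
  }
}
```

Passing `fromParquet` by name keeps the potentially expensive conversion from the Parquet schema off the common path where the metadata parses cleanly.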

liancheng · 2015-03-18T09:07:48Z

Good point. Updating the code and adding the test case.

SparkQA · 2015-03-18T10:53:15Z

Test build #28793 has started for PR 5034 at commit 04f6fae.

This patch does not merge cleanly.

liancheng · 2015-03-18T10:57:16Z

Addressed comments and added test case. Thanks for the review!

SparkQA · 2015-03-18T10:58:04Z

Test build #28794 has started for PR 5034 at commit a182f58.

This patch merges cleanly.

SparkQA · 2015-03-18T12:15:55Z

Test build #28794 has finished for PR 5034 at commit a182f58.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-18T12:15:58Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28794/

SparkQA · 2015-03-18T12:20:15Z

Test build #28793 has finished for PR 5034 at commit 04f6fae.

This patch fails PySpark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-18T12:20:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28793/

liancheng · 2015-03-18T13:24:28Z

retest this please

SparkQA · 2015-03-18T13:28:15Z

Test build #28800 has started for PR 5034 at commit a182f58.

This patch merges cleanly.

SparkQA · 2015-03-18T14:45:08Z

Test build #28800 has finished for PR 5034 at commit a182f58.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-18T14:45:11Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28800/

yhuai · 2015-03-20T22:44:33Z

LGTM

liancheng · 2015-03-21T03:15:11Z

@yhuai @marmbrus Thanks for the review, I'm merging this into master and branch-1.3.

…ing Parquet schema

When writing Parquet files, Spark 1.1.x persists the schema string into Parquet metadata with the result of `StructType.toString`, which was then deprecated in Spark 1.2 by a schema string in JSON format. But we still need to take the old schema format into account while reading Parquet files.

Author: Cheng Lian <lian@databricks.com>

Closes #5034 from liancheng/spark-6315 and squashes the following commits:

a182f58 [Cheng Lian] Adds a regression test
b9c6dbe [Cheng Lian] Also tries the case class string parser while reading Parquet schema

(cherry picked from commit 937c1e5)
Signed-off-by: Cheng Lian <lian@databricks.com>

chenghao-intel reviewed Mar 15, 2015

liancheng added 2 commits March 18, 2015 18:55

Also tries the case class string parser while reading Parquet schema

b9c6dbe

Adds a regression test

a182f58

liancheng force-pushed the spark-6315 branch from 04f6fae to a182f58 on March 18, 2015 10:56

asfgit closed this in 937c1e5 Mar 21, 2015

liancheng deleted the spark-6315 branch March 21, 2015 03:25
