Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-6315] [SQL] Also tries the case class string parser while reading Parquet schema #5034

Closed
wants to merge 2 commits into from

Conversation

liancheng
Copy link
Contributor

When writing Parquet files, Spark 1.1.x persists the schema string into Parquet metadata with the result of StructType.toString, which was then deprecated in Spark 1.2 by a schema string in JSON format. But we still need to take the old schema format into account while reading Parquet files.

Review on Reviewable

@SparkQA
Copy link

SparkQA commented Mar 15, 2015

Test build #28627 has started for PR 5034 at commit dfd0e0a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 15, 2015

Test build #28627 has finished for PR 5034 at commit dfd0e0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28627/
Test PASSed.

@@ -672,7 +672,11 @@ private[sql] object ParquetRelation2 {
.getKeyValueMetaData
.toMap
.get(RowReadSupport.SPARK_METADATA_KEY)
.map(DataType.fromJson(_).asInstanceOf[StructType])
.map { serializedSchema =>
Try(DataType.fromJson(serializedSchema))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about output a log for the deprecation also? Just to remind people to use the Json style.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, after double thinking about this, we may not need a log here. Because the user doesn't need to care about using JSON or old style case class string when reading/writing Parquet files. Spark SQL handles this automatically.

@marmbrus
Copy link
Contributor

Can we add a regression test that checks to make sure we can read old data so we avoid this in the future?

@marmbrus
Copy link
Contributor

One other comment. It seems like we should not fail when we can't read the schema but instead fall back on using the schema that comes from parquet. We always want to be able to access the data, even if we are missing spark sql specific metadata.

@liancheng
Copy link
Contributor Author

Good point. Updating the code and adding the test case.

@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28793 has started for PR 5034 at commit 04f6fae.

  • This patch does not merge cleanly.

@liancheng
Copy link
Contributor Author

Addressed comments and added test case. Thanks for the review!

@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28794 has started for PR 5034 at commit a182f58.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28794 has finished for PR 5034 at commit a182f58.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28794/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28793 has finished for PR 5034 at commit 04f6fae.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28793/
Test FAILed.

@liancheng
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28800 has started for PR 5034 at commit a182f58.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28800 has finished for PR 5034 at commit a182f58.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28800/
Test PASSed.

@yhuai
Copy link
Contributor

yhuai commented Mar 20, 2015

LGTM

@liancheng
Copy link
Contributor Author

@yhuai @marmbrus Thanks for the review, I'm merging this into master and branch-1.3.

asfgit pushed a commit that referenced this pull request Mar 21, 2015
…ing Parquet schema

When writing Parquet files, Spark 1.1.x persists the schema string into Parquet metadata with the result of `StructType.toString`, which was then deprecated in Spark 1.2 by a schema string in JSON format. But we still need to take the old schema format into account while reading Parquet files.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5034)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #5034 from liancheng/spark-6315 and squashes the following commits:

a182f58 [Cheng Lian] Adds a regression test
b9c6dbe [Cheng Lian] Also tries the case class string parser while reading Parquet schema

(cherry picked from commit 937c1e5)
Signed-off-by: Cheng Lian <lian@databricks.com>
@asfgit asfgit closed this in 937c1e5 Mar 21, 2015
@liancheng liancheng deleted the spark-6315 branch March 21, 2015 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants