[SPARK-6315] [SQL] Also tries the case class string parser while reading Parquet schema #5034
Conversation
Test build #28627 has started for PR 5034 at commit

Test build #28627 has finished for PR 5034 at commit

Test PASSed.
```diff
@@ -672,7 +672,11 @@ private[sql] object ParquetRelation2 {
       .getKeyValueMetaData
       .toMap
       .get(RowReadSupport.SPARK_METADATA_KEY)
-      .map(DataType.fromJson(_).asInstanceOf[StructType])
+      .map { serializedSchema =>
+        Try(DataType.fromJson(serializedSchema))
```
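The pattern in this change can be sketched in plain Scala. Note that `parseJsonSchema` and `parseCaseClassSchema` below are hypothetical stand-ins for `DataType.fromJson` and the legacy case class string parser; `scala.util.Try` is the only real API used:

```scala
import scala.util.Try

object SchemaParsingSketch {
  // Hypothetical stand-in for DataType.fromJson: only accepts JSON-looking input.
  def parseJsonSchema(s: String): String =
    if (s.trim.startsWith("{")) "parsed-from-json" else sys.error(s"Not JSON: $s")

  // Hypothetical stand-in for the legacy parser that understands the
  // Spark 1.1.x `StructType.toString` format.
  def parseCaseClassSchema(s: String): String = "parsed-from-case-class-string"

  // Try the JSON parser first (Spark 1.2+ format); if it throws, fall back
  // to the old case class string parser (Spark 1.1.x format).
  def deserializeSchema(serialized: String): String =
    Try(parseJsonSchema(serialized)).getOrElse(parseCaseClassSchema(serialized))
}
```

`Try` catches the parse failure from the JSON path, so old files written by Spark 1.1.x are still handled instead of raising an exception.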
How about also logging a message for the deprecated format? Just to remind people to use the JSON style.
Good point. Thanks!
Hm, after thinking about this again, we may not need a log here, because the user doesn't need to care whether the JSON or the old case class string format is used when reading/writing Parquet files. Spark SQL handles this automatically.
Can we add a regression test that checks that we can read old data, so we avoid breaking this in the future?
One other comment: it seems like we should not fail when we can't read the schema, but instead fall back on using the schema that comes from Parquet. We always want to be able to access the data, even if we are missing Spark SQL specific metadata.
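The fallback suggested here could look roughly like the sketch below. All three schema sources are hypothetical stand-ins for the real ones (`DataType.fromJson`, the legacy case class string parser, and the schema recorded in the Parquet footer); only the `Option`/`Try` combinators are real:

```scala
import scala.util.Try

object SchemaResolutionSketch {
  // Hypothetical stand-ins for the real schema sources.
  def parseJsonSchema(s: String): String =
    if (s.trim.startsWith("{")) "json-schema" else sys.error("not JSON")
  def parseCaseClassSchema(s: String): String =
    if (s.startsWith("StructType")) "case-class-schema"
    else sys.error("not a case class string")
  def schemaFromParquetFooter: String = "parquet-native-schema"

  // Prefer the Spark SQL metadata when it is present and parseable
  // (JSON first, then the old case class format); otherwise fall back to
  // the schema Parquet itself records, so the data is always readable.
  def resolveSchema(sparkMetadata: Option[String]): String =
    sparkMetadata
      .flatMap { s =>
        Try(parseJsonSchema(s)).orElse(Try(parseCaseClassSchema(s))).toOption
      }
      .getOrElse(schemaFromParquetFooter)
}
```

With this shape, unreadable or missing Spark SQL metadata degrades gracefully to Parquet's own schema instead of failing the read.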
Good point. Updating the code and adding the test case.
Test build #28793 has started for PR 5034 at commit
Addressed comments and added test case. Thanks for the review!
Test build #28794 has started for PR 5034 at commit

Test build #28794 has finished for PR 5034 at commit

Test PASSed.

Test build #28793 has finished for PR 5034 at commit

Test FAILed.
retest this please
Test build #28800 has started for PR 5034 at commit

Test build #28800 has finished for PR 5034 at commit

Test PASSed.
LGTM
[SPARK-6315] [SQL] Also tries the case class string parser while reading Parquet schema

When writing Parquet files, Spark 1.1.x persists the schema string into Parquet metadata with the result of `StructType.toString`, which was then replaced in Spark 1.2 by a schema string in JSON format. But we still need to take the old schema format into account while reading Parquet files.

Author: Cheng Lian <lian@databricks.com>

Closes #5034 from liancheng/spark-6315 and squashes the following commits:

a182f58 [Cheng Lian] Adds a regression test
b9c6dbe [Cheng Lian] Also tries the case class string parser while reading Parquet schema

(cherry picked from commit 937c1e5)
Signed-off-by: Cheng Lian <lian@databricks.com>