[SPARK-13719][SQL] Parse JSON rows having an array type and a struct type in the same field #11752
Conversation
cc @yhuai (Since the JIRA is a pretty old one, I was not sure whether I should make a separate JIRA and PR, but I made this as a follow-up since it follows up the support for rows wrapped with an array.)
Just to make sure: I am doing this partly due to SPARK-13764, which deals with parse modes just like the CSV data source does. So the behaviour for failed records should be consistent.
Test build #53273 has finished for PR 11752 at commit
What's the new behavior here?
It sets `null`.
Except for the case above, all the other types are set to `null`.
Can you update the description to first explain the current behavior, and then the behavior after this change?
@rxin Sure!
BTW, shouldn't the behavior here depend on the parse mode, if we were to introduce those?
Yep, would you check this PR, #11756?
Oh, sorry, I misunderstood. Currently, the JSON data source works in a permissive way: only the case above throws an exception. For other types, it sets `null`.
```scala
def arrayAndStructRecords: RDD[String] =
  sqlContext.sparkContext.parallelize(
    """{"a": {"b": 1}}""" ::
    """{"a": []}""" :: Nil)
```
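The two records above are the interesting case: the same field `a` holds a struct in one row and an array in the other. As a rough illustration of the intended permissive behavior (in plain Python rather than Spark, since a Spark example would need a running session; the function name is just for this sketch), a value that does not match the expected struct type is replaced with null instead of aborting the job:

```python
import json

def parse_field_as_struct(line, field):
    """Return the field's value if it is a JSON object (struct-like), else None."""
    record = json.loads(line)
    value = record.get(field)
    return value if isinstance(value, dict) else None

rows = ['{"a": {"b": 1}}', '{"a": []}']
print([parse_field_as_struct(r, "a") for r in rows])  # [{'b': 1}, None]
```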
For this case, we will first throw an exception, catch it, and finally set the value to null, right?
Yes. It will set `null` in `failedRecord()`.
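The control flow described here (throw on a type mismatch, catch, then null out the record) can be sketched as follows. This is a hedged illustration, not Spark's actual code: `convert_struct`, `failed_record`, and `safe_convert` are hypothetical names standing in for the parser's conversion path.

```python
def convert_struct(value):
    # Conversion raises when the JSON value is not the expected struct type.
    if not isinstance(value, dict):
        raise TypeError(f"expected an object, got {type(value).__name__}")
    return value

def failed_record(line):
    # A record that fails conversion becomes null instead of failing the job.
    return None

def safe_convert(line, value):
    try:
        return convert_struct(value)
    except TypeError:
        return failed_record(line)
```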
@HyukjinKwon It would be good to have a new JIRA.
LGTM. I am going to merge it to master. It's a good idea to have a parse mode.
…type in the same field

## What changes were proposed in this pull request?

PR apache#2400 added support for parsing JSON rows wrapped with an array. However, this throws an exception when the given data contains array data and struct data in the same field, as below:

```json
{"a": {"b": 1}}
{"a": []}
```

and the schema is given as below:

```scala
val schema =
  StructType(
    StructField("a", StructType(
      StructField("b", StringType) :: Nil
    )) :: Nil)
```

- **Before**

```scala
sqlContext.read.schema(schema).json(path).show()
```

```
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
	at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
	...
```

- **After**

```scala
sqlContext.read.schema(schema).json(path).show()
```

```
+----+
|   a|
+----+
| [1]|
|null|
+----+
```

For other data types, the given values are converted to `null` in this case; only this case throws an exception. This PR applies the support for wrapped rows only at the top level.

## How was this patch tested?

Unit tests, and `./dev/run_tests` for code style checks.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#11752 from HyukjinKwon/SPARK-3308-follow-up.
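The fix's key idea, "wrapped rows are supported only at the top level", can be sketched in miniature (plain Python, not Spark's implementation; `to_rows` is a hypothetical name): a top-level JSON array is unwrapped into multiple rows, while an array nested inside a field is left alone for ordinary type checking, where it will be nulled out if a struct was expected.

```python
import json

def to_rows(line):
    parsed = json.loads(line)
    # Top level only: an array here wraps several rows, so unwrap it.
    # Nested arrays are untouched and handled by per-field type checks.
    return parsed if isinstance(parsed, list) else [parsed]

assert to_rows('[{"a": 1}, {"a": 2}]') == [{"a": 1}, {"a": 2}]
assert to_rows('{"a": []}') == [{"a": []}]  # nested array stays in place
```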