[SPARK-12057] [SQL] Prevent failure on corrupt JSON records #10043
Conversation
Return the failed record when a record cannot be parsed. This allows parsing files containing corrupt records of any form.
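The fallback this PR describes can be sketched outside Spark like this (a toy model: `parseObject`, `parseOrFail`, and the brace check are illustrative stand-ins, not Spark internals; `_corrupt_record` is the column name the change actually uses):

```scala
import scala.util.{Failure, Success, Try}

// Toy stand-in for a JSON object parser: only accepts strings that
// superficially look like a JSON object (illustrative check, not real parsing).
def parseObject(record: String): Try[Map[String, String]] = Try {
  val t = record.trim
  require(t.startsWith("{") && t.endsWith("}"), s"not a JSON object: $record")
  Map("raw" -> t) // placeholder for the actually parsed fields
}

// Instead of failing the whole job on a bad record, return a "failed record"
// row carrying the original text under the corrupt-record column.
def parseOrFail(record: String): Map[String, String] =
  parseObject(record) match {
    case Success(row) => row
    case Failure(_)   => Map("_corrupt_record" -> record)
  }
```

With this shape, `parseOrFail("""42""")` yields `Map("_corrupt_record" -> "42")` instead of throwing.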
+1
please file a JIRA and add it to the title of this PR. See how other patches are opened.
Done @andrewor14
ok to test
can you add a regression test that reproduces the issue you are trying to fix?
Test build #46948 has finished for PR 10043 at commit
Sure! Looking into it.
Test build #46982 has finished for PR 10043 at commit
test this please
Test build #47003 has finished for PR 10043 at commit
Seems there is a legitimate failure?
I'm trying to replicate the error I get on my machine within the test cases. Do you have any tips for writing tests? It takes 20 mins to test using the provided script.
```scala
s"Failed to parse record $record. Please make sure that each line of " +
  "the file (or each string in the RDD) is a valid JSON object or " +
  "an array of JSON objects.")
case _ => failedRecord(record)
```
For that place, we can still throw. But, then, we catch the exception at https://github.com/apache/spark/pull/10043/files#diff-8affe5ec7d691943a88e43eb30af656eR272.
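The throw-then-catch layering described here might look roughly like the following (a simplified sketch; `convertField` and `safeConvert` are hypothetical names, not the actual methods in the patch):

```scala
// Lower level: still throws on input it cannot handle.
def convertField(record: String): Map[String, Any] = {
  if (!record.trim.startsWith("{"))
    throw new RuntimeException(
      s"Failed to parse record $record. Please make sure that each line of " +
        "the file (or each string in the RDD) is a valid JSON object or " +
        "an array of JSON objects.")
  Map("value" -> record)
}

// Higher level: catch the exception and fall back to a failed record,
// so one corrupt line no longer aborts the whole scan.
def safeConvert(record: String): Map[String, Any] =
  try convertField(record)
  catch { case _: RuntimeException => Map("_corrupt_record" -> record) }
```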
```scala
sqlContext.sparkContext.parallelize(
  """{"dummy":"test"}""" ::
  """42""" ::
  """ ","ian":"test"}""" :: Nil)
```
Can you add a record like """[1,, 2, 3]""" (a top-level JSON array whose elements are not JSON objects)? I am wondering if we should have an option in
@simplyianm Those records will go to the corrupt record column if it is there. But right now, it is possible that the corrupt record column is not in the schema. For example, when you apply a schema to a JSON dataset. Or, the data does not trigger
@yhuai as an aside, I moderately prefer not to introduce flags merely for the sake of being conservative or flexible. It rarely achieves that goal; it just introduces complexity and rarely gets cleaned out, since you've just continued to promise a particular old behavior.
@srowen Yeah, I agree. My only concern is that users who originally saw exceptions (I do agree that for some exceptions, we should just catch them) will only find out about the problem after looking at the data. What do you think?
This PR makes the JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference.

Regarding the schema inference change, if we have something like

```
{"f1":1}
[1,2,3]
```

originally, we will get a DF without any column. After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`.

When merging this PR, please make sure that the author is simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

Author: Ian Macalinao <me@ian.pw>
Author: Yin Huai <yhuai@databricks.com>

Closes #10288 from yhuai/handleCorruptJson.

(cherry picked from commit 9d66c42)
Signed-off-by: Reynold Xin <rxin@databricks.com>
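The schema-inference outcome described above can be modeled outside Spark like this (a sketch; the brace check and regex-based field extraction are illustrative only, not how Spark actually infers schemas):

```scala
// Toy model of the inference outcome: parseable objects contribute their
// top-level field names; any other row contributes only `_corrupt_record`.
def inferColumns(records: Seq[String]): Set[String] =
  records.flatMap { r =>
    val t = r.trim
    if (t.startsWith("{") && t.endsWith("}"))
      // crude field-name extraction, for illustration only
      "\"([^\"]+)\"\\s*:".r.findAllMatchIn(t).map(_.group(1)).toList
    else List("_corrupt_record")
  }.toSet
```

Under this model, `inferColumns(Seq("""{"f1":1}""", "[1,2,3]"))` gives `Set("f1", "_corrupt_record")`, matching the example in the merge message.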