[SPARK-12057] [SQL] Prevent failure on corrupt JSON records #10043

Closed
wants to merge 4 commits

Conversation

@macalinao

Return a failed record when a record cannot be parsed. This allows parsing of files containing corrupt records of any form.
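For context, a minimal sketch of the failure mode this patch addresses, using the Spark 1.x `SQLContext` API; this is illustrative, not code from the patch, and the sample records mirror the ones added to the test suite later in this thread:

```scala
// Sketch of the behavior this PR targets. Before the fix, lines that are
// not JSON objects could fail the whole read instead of landing in the
// _corrupt_record column.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CorruptJsonSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("corrupt-json"))
    val sqlContext = new SQLContext(sc)

    // One well-formed object, one bare scalar, one truncated object.
    val records = sc.parallelize(
      """{"dummy":"test"}""" ::
      """42""" ::
      """ ","ian":"test"}""" :: Nil)

    // With the fix, the unparseable lines should end up in the
    // _corrupt_record column rather than throwing during parsing.
    sqlContext.read.json(records).show()
  }
}
```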
@CyrusRoshan

+1

@andrewor14
Contributor

please file a JIRA and add it to the title of this PR. See how other patches are opened.

@macalinao macalinao changed the title Prevent failure on corrupt JSON records [SPARK-12057] Prevent failure on corrupt JSON records Nov 30, 2015
@macalinao
Author

Done @andrewor14

@macalinao macalinao changed the title [SPARK-12057] Prevent failure on corrupt JSON records [SPARK-12057] [SQL] Prevent failure on corrupt JSON records Nov 30, 2015
@marmbrus
Contributor

marmbrus commented Dec 1, 2015

ok to test

@marmbrus
Contributor

marmbrus commented Dec 1, 2015

can you add a regression test that reproduces the issue you are trying to fix?

@SparkQA

SparkQA commented Dec 1, 2015

Test build #46948 has finished for PR 10043 at commit d461552.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@macalinao
Author

Sure! Looking into it.

@SparkQA

SparkQA commented Dec 1, 2015

Test build #46982 has finished for PR 10043 at commit 02a742b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Dec 1, 2015

test this please

@SparkQA

SparkQA commented Dec 1, 2015

Test build #47003 has finished for PR 10043 at commit 8fd677f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Dec 2, 2015

Seems there is a legitimate failure?

@macalinao
Author

I'm trying to replicate the error I get on my machine within the test cases. Do you have any tips for writing tests? It takes 20 minutes to test using the provided script.

s"Failed to parse record $record. Please make sure that each line of " +
"the file (or each string in the RDD) is a valid JSON object or " +
"an array of JSON objects.")
case _ => failedRecord(record)
Contributor

For that place, we can still throw. But then we catch the exception at https://github.com/apache/spark/pull/10043/files#diff-8affe5ec7d691943a88e43eb30af656eR272.
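A rough sketch of the throw-then-catch shape being described here; `failedRecord` appears in the diff above, while `parseLine` is a stand-in for the parser internals, not an actual method in the patch:

```scala
// The inner conversion may still throw; an outer loop catches the
// exception and routes the raw text to failedRecord.
import com.fasterxml.jackson.core.JsonProcessingException
import org.apache.spark.sql.Row

def parseJsonLines(lines: Iterator[String],
                   parseLine: String => Seq[Row],
                   failedRecord: String => Seq[Row]): Iterator[Row] = {
  lines.flatMap { record =>
    try {
      parseLine(record) // may still throw for records such as """42"""
    } catch {
      case _: JsonProcessingException => failedRecord(record)
    }
  }
}
```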

@marmbrus
Contributor

marmbrus commented Dec 2, 2015

`build/sbt ~sql/test-only *JsonSuite -- -z SPARK-12057` should run pretty quickly and will auto rerun as you change code.

```scala
sqlContext.sparkContext.parallelize(
  """{"dummy":"test"}""" ::
  """42""" ::
  """ ","ian":"test"}""" :: Nil)
```
Contributor

Can you add a record like `"""[1,, 2, 3]"""` (a top-level JSON array whose elements are not JSON objects)?
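Under that suggestion, the test data might look like the following (a sketch; the exact records in the final test may differ):

```scala
// The requested test data, extended with a top-level JSON array whose
// elements are not JSON objects.
val corruptRecords = sqlContext.sparkContext.parallelize(
  """{"dummy":"test"}""" ::
  """42""" ::
  """ ","ian":"test"}""" ::
  """[1,, 2, 3]""" :: Nil)
```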

@yhuai
Contributor

yhuai commented Dec 2, 2015

I am wondering if we should have an option in JSONOptions to control this behavior. For some cases, I feel it is still useful to see the exception (rather than finding that column values are just nulls), so that I know something is wrong (e.g. I tried to apply a bad schema to the JSON data).
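Sketched concretely, such a flag might be surfaced through the reader options; the option name below is invented for illustration and is not an actual Spark option, and `records` refers to the RDD from the earlier sketch:

```scala
// Hypothetical illustration of the flag being proposed here.
val strict = sqlContext.read
  .option("throwOnCorruptRecord", "true") // made-up name: keep the old throwing behavior
  .json(records)
```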

@macalinao
Author

very useful tip @marmbrus -- thanks!

@yhuai I believe that the current default behavior is to put everything into the corrupt record column; it is just that not all cases are handled.

@yhuai
Contributor

yhuai commented Dec 2, 2015

@simplyianm Those records will go to the corrupt record column if it is there. But right now, it is possible that the corrupt record column is not in the schema, for example when you apply your own schema to a JSON dataset, or when the data does not trigger a JsonParseException during schema inference. Also, since we are in the QA cycle of 1.6, I think introducing a flag and keeping the old behavior by default is the safer option. If a user hits the problem, he/she can enable the flag. Later, we can think about changing the default value of the conf.
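For the user-supplied-schema case mentioned above, the corrupt-record column exists only if the schema declares it; a sketch, again reusing the `records` RDD from earlier:

```scala
// With a user-supplied schema, corrupt records are only captured if the
// schema includes the corrupt-record column. The column name here follows
// the default value of spark.sql.columnNameOfCorruptRecord.
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(
  StructField("dummy", StringType, nullable = true) ::
  StructField("_corrupt_record", StringType, nullable = true) :: Nil)

val withCorruptColumn = sqlContext.read.schema(schema).json(records)
```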

@srowen
Member

srowen commented Dec 3, 2015

@yhuai as an aside, I moderately prefer not to introduce flags merely for the sake of being conservative or flexible. It rarely achieves that goal; it just introduces complexity and rarely gets cleaned out, since you have simply continued to promise a particular old behavior.

@yhuai
Contributor

yhuai commented Dec 4, 2015

@srowen Yeah, I agree. My only concern is that users who originally saw exceptions (I do agree that some exceptions we should just catch) will only find out about the problem after looking at the data. What do you think?

asfgit pushed a commit that referenced this pull request Dec 17, 2015
This PR makes JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference.

Regarding the schema inference change, if we have something like
```
{"f1":1}
[1,2,3]
```
originally, we would get a DF without any columns.
After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`.

When merging this PR, please make sure that the author is simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

Author: Ian Macalinao <me@ian.pw>
Author: Yin Huai <yhuai@databricks.com>

Closes #10288 from yhuai/handleCorruptJson.

(cherry picked from commit 9d66c42)
Signed-off-by: Reynold Xin <rxin@databricks.com>
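A sketch of the inference behavior the commit message describes, assuming the same two-line input:

```scala
// Mixed object and non-object input now infers both the object's field
// and _corrupt_record, instead of producing a DF without any columns.
val mixed = sqlContext.sparkContext.parallelize(
  """{"f1":1}""" :: """[1,2,3]""" :: Nil)

val df = sqlContext.read.json(mixed)
df.printSchema() // expected: f1 plus _corrupt_record
// For the second row, "[1,2,3]" becomes the value of _corrupt_record.
```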
@asfgit asfgit closed this in 9d66c42 Dec 17, 2015