[SPARK-12057] [SQL] Prevent failure on corrupt JSON records #10043

Closed
wants to merge 4 commits

Conversation

@macalinao

Return a failed record when a record cannot be parsed. This allows parsing of files containing corrupt records of any form.
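For context, a minimal sketch of the failure mode this patch addresses, using the Spark 1.x `SQLContext` API; this is illustrative, not code from the patch, and the sample records mirror the ones added to the test suite later in this thread:

```scala
// Sketch of the behavior this PR targets. Before the fix, lines that are
// not JSON objects could fail the whole read instead of landing in the
// _corrupt_record column.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CorruptJsonSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("corrupt-json"))
    val sqlContext = new SQLContext(sc)

    // One well-formed object, one bare scalar, one truncated object.
    val records = sc.parallelize(
      """{"dummy":"test"}""" ::
      """42""" ::
      """ ","ian":"test"}""" :: Nil)

    // With the fix, the unparseable lines should end up in the
    // _corrupt_record column rather than throwing during parsing.
    sqlContext.read.json(records).show()
  }
}
```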
@CyrusRoshan

+1

@andrewor14
Contributor

please file a JIRA and add it to the title of this PR. See how other patches are opened.

@macalinao macalinao changed the title Prevent failure on corrupt JSON records [SPARK-12057] Prevent failure on corrupt JSON records Nov 30, 2015
@macalinao
Author

Done @andrewor14

@macalinao macalinao changed the title [SPARK-12057] Prevent failure on corrupt JSON records [SPARK-12057] [SQL] Prevent failure on corrupt JSON records Nov 30, 2015
@marmbrus
Contributor

marmbrus commented Dec 1, 2015

ok to test

@marmbrus
Contributor

marmbrus commented Dec 1, 2015

can you add a regression test that reproduces the issue you are trying to fix?

@SparkQA

SparkQA commented Dec 1, 2015

Test build #46948 has finished for PR 10043 at commit d461552.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@macalinao
Author

Sure! Looking into it.

@SparkQA

SparkQA commented Dec 1, 2015

Test build #46982 has finished for PR 10043 at commit 02a742b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Dec 1, 2015

test this please

@SparkQA

SparkQA commented Dec 1, 2015

Test build #47003 has finished for PR 10043 at commit 8fd677f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Dec 2, 2015

Seems there is a legitimate failure?

@macalinao
Author

I'm trying to replicate the error I get on my machine within the test cases. Do you have any tips for writing tests? It takes 20 minutes to test using the provided script.

s"Failed to parse record $record. Please make sure that each line of " +
"the file (or each string in the RDD) is a valid JSON object or " +
"an array of JSON objects.")
case _ => failedRecord(record)
Contributor

For that place, we can still throw. But then we catch the exception at https://github.com/apache/spark/pull/10043/files#diff-8affe5ec7d691943a88e43eb30af656eR272.
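A rough sketch of the throw-then-catch shape being described here; `failedRecord` appears in the diff above, while `parseLine` is a stand-in for the parser internals, not an actual method in the patch:

```scala
// The inner conversion may still throw; an outer loop catches the
// exception and routes the raw text to failedRecord.
import com.fasterxml.jackson.core.JsonProcessingException
import org.apache.spark.sql.Row

def parseJsonLines(lines: Iterator[String],
                   parseLine: String => Seq[Row],
                   failedRecord: String => Seq[Row]): Iterator[Row] = {
  lines.flatMap { record =>
    try {
      parseLine(record) // may still throw for records such as """42"""
    } catch {
      case _: JsonProcessingException => failedRecord(record)
    }
  }
}
```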

@marmbrus
Contributor

marmbrus commented Dec 2, 2015

`build/sbt ~sql/test-only *JsonSuite -- -z SPARK-12057` should run pretty quickly and will auto rerun as you change code.

```scala
sqlContext.sparkContext.parallelize(
  """{"dummy":"test"}""" ::
  """42""" ::
  """ ","ian":"test"}""" :: Nil)
```
Contributor

Can you add a record like `"""[1,, 2, 3]"""` (a top-level JSON array whose elements are not JSON objects)?
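Under that suggestion, the test data might look like the following (a sketch; the exact records in the final test may differ):

```scala
// The requested test data, extended with a top-level JSON array whose
// elements are not JSON objects.
val corruptRecords = sqlContext.sparkContext.parallelize(
  """{"dummy":"test"}""" ::
  """42""" ::
  """ ","ian":"test"}""" ::
  """[1,, 2, 3]""" :: Nil)
```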

@yhuai
Contributor

yhuai commented Dec 2, 2015

I am wondering if we should have an option in JSONOptions to control this behavior. For some cases, I feel it is still useful to see the exception (rather than finding that column values are just nulls), so that I know something is wrong (e.g. I tried to apply a bad schema to the JSON data).
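Sketched concretely, such a flag might be surfaced through the reader options; the option name below is invented for illustration and is not an actual Spark option, and `records` refers to the RDD from the earlier sketch:

```scala
// Hypothetical illustration of the flag being proposed here.
val strict = sqlContext.read
  .option("throwOnCorruptRecord", "true") // made-up name: keep the old throwing behavior
  .json(records)
```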

@macalinao
Author

very useful tip @marmbrus -- thanks!

@yhuai I believe that the current default behavior is to put everything into the corrupt record column; it is just that not all cases are handled.

@yhuai
Contributor

yhuai commented Dec 2, 2015

@simplyianm Those records will go to the corrupt record column if it is there. But right now, it is possible that the corrupt record column is not in the schema, for example when you apply your own schema to a JSON dataset, or when the data does not trigger a JsonParseException during schema inference. Also, since we are in the QA cycle of 1.6, I think introducing a flag and keeping the old behavior by default is the safer option. If a user hits the problem, he/she can enable the flag. Later, we can think about changing the default value of the conf.
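For the user-supplied-schema case mentioned above, the corrupt-record column exists only if the schema declares it; a sketch, again reusing the `records` RDD from earlier:

```scala
// With a user-supplied schema, corrupt records are only captured if the
// schema includes the corrupt-record column. The column name here follows
// the default value of spark.sql.columnNameOfCorruptRecord.
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(
  StructField("dummy", StringType, nullable = true) ::
  StructField("_corrupt_record", StringType, nullable = true) :: Nil)

val withCorruptColumn = sqlContext.read.schema(schema).json(records)
```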

@srowen
Member

srowen commented Dec 3, 2015

@yhuai as an aside, I moderately prefer not to introduce flags merely for the sake of being conservative or flexible. It rarely achieves that goal; it just introduces complexity and rarely gets cleaned out, since you have simply continued to promise a particular old behavior.

@yhuai
Contributor

yhuai commented Dec 4, 2015

@srowen Yeah, I agree. My only concern is that users who originally saw exceptions (I do agree that some exceptions we should just catch) will only find out about the problem after looking at the data. What do you think?

asfgit pushed a commit that referenced this pull request Dec 17, 2015
This PR makes JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference.

Regarding the schema inference change, if we have something like
```
{"f1":1}
[1,2,3]
```
originally, we would get a DF without any columns.
After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`.

When merging this PR, please make sure that the author is simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

Author: Ian Macalinao <me@ian.pw>
Author: Yin Huai <yhuai@databricks.com>

Closes #10288 from yhuai/handleCorruptJson.

(cherry picked from commit 9d66c42)
Signed-off-by: Reynold Xin <rxin@databricks.com>
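A sketch of the inference behavior the commit message describes, assuming the same two-line input:

```scala
// Mixed object and non-object input now infers both the object's field
// and _corrupt_record, instead of producing a DF without any columns.
val mixed = sqlContext.sparkContext.parallelize(
  """{"f1":1}""" :: """[1,2,3]""" :: Nil)

val df = sqlContext.read.json(mixed)
df.printSchema() // expected: f1 plus _corrupt_record
// For the second row, "[1,2,3]" becomes the value of _corrupt_record.
```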
@asfgit asfgit closed this in 9d66c42 Dec 17, 2015