
[SPARK-21024][SQL] CSV parse mode handles Univocity parser exceptions#18250

Closed
maropu wants to merge 2 commits into apache:master from maropu:SPARK-21024

Conversation

@maropu
Member

@maropu maropu commented Jun 9, 2017

What changes were proposed in this pull request?

This PR fixes the code so that the CSV parse modes handle Univocity parser exceptions.
The current master cannot skip illegal records that make the Univocity parser throw:

scala> Seq("0,1", "0,1,2,3").toDF().write.text("/Users/maropu/Desktop/data")
scala> val df = spark.read.format("csv").schema("a int, b int").option("maxColumns", "3").load("/Users/maropu/Desktop/data")
scala> df.show

com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 3
Hint: Number of columns processed may have exceeded limit of 3 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
        Auto configuration enabled=true
        Autodetect column delimiter=false
        Autodetect quotes=false
        Column reordering enabled=true
        Empty value=null
        Escape unquoted values=false
        ...

at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
at com.univocity.parsers.common.AbstractParser.handleEOF(AbstractParser.java:195)
at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:544)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:60)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
...
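In outline, the proposed handling catches Univocity's TextParsingException per record and re-throws it as Spark's internal BadRecordException, which FailureSafeParser (visible in the stack trace above) already translates into the configured parse mode. A minimal sketch, not the literal diff; `parseSafely`, `tokenizer`, and `convert` are stand-ins for the corresponding members of UnivocityParser:

import com.univocity.parsers.common.TextParsingException
import com.univocity.parsers.csv.CsvParser
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.BadRecordException
import org.apache.spark.unsafe.types.UTF8String

// Sketch: convert Univocity's failure into a BadRecordException so that
// FailureSafeParser can apply PERMISSIVE / DROPMALFORMED / FAILFAST
// semantics instead of crashing the whole task.
def parseSafely(
    tokenizer: CsvParser,
    convert: Array[String] => InternalRow)(input: String): InternalRow = {
  try {
    convert(tokenizer.parseLine(input))
  } catch {
    case e: TextParsingException =>
      throw BadRecordException(() => UTF8String.fromString(input), () => None, e)
  }
}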

How was this patch tested?

Added tests in CSVSuite.

@SparkQA

SparkQA commented Jun 9, 2017

Test build #77839 has finished for PR 18250 at commit 5304fbd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon HyukjinKwon left a comment

Sounds okay to me.

import scala.util.Try
import scala.util.control.NonFatal

import com.univocity.parsers.common.TextParsingException

Member

It looks like this one can be removed.

Member Author

oh, I missed that. Thanks!

.option("maxColumns", "2")
.option("mode", "PERMISSIVE")
.load(path.getAbsolutePath)
checkAnswer(df, Row(0, 1) :: Row(null, null) :: Nil)

Member

Should we maybe also check what is put in the malformed column?

Member Author

okay, I'll set the columnNameOfCorruptRecord column.
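(For reference, a reader configured that way might look like the following; the schema and corrupt-record column name here are illustrative, not taken from the patch.)

// Illustrative: with a user-specified schema, the corrupt-record column
// must be declared as a string field for PERMISSIVE mode to populate it.
val df = spark.read.format("csv")
  .schema("a INT, b INT, _corrupt_record STRING")
  .option("maxColumns", "2")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .load(path.getAbsolutePath)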

@HyukjinKwon
Member

@maropu, just to make sure, do you mind if I ask you to test this with wholeFile too? IIRC, it calls a different code path for parsing tokens.

@maropu
Member Author

maropu commented Jun 9, 2017

oh, sure. I'll add tests.

@maropu
Member Author

maropu commented Jun 13, 2017

@HyukjinKwon I tried to fix this case for wholeFile mode too, but I couldn't do it easily because the CSVParser in wholeFile mode holds state while parsing InputStream data. Once the parser hits an exception, IIUC it cannot restart parsing the input, so fixing this in wholeFile mode might require refactoring the current logic there. IMHO, when a parsing exception happens in wholeFile mode, we can't easily restart because the position to resume parsing from is unclear. WDYT?
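(To illustrate that statefulness with a standalone, hypothetical Univocity snippet, not code from this PR: once the stream-based parser throws mid-stream, there is no well-defined position to resume from.)

import java.io.StringReader
import com.univocity.parsers.common.TextParsingException
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

// Hedged sketch of the stream-based (wholeFile-style) parsing path.
val settings = new CsvParserSettings()
settings.setMaxColumns(2)
val parser = new CsvParser(settings)
parser.beginParsing(new StringReader("0,1\n0,1,2,3\n0,1"))
try {
  var row = parser.parseNext()   // Array("0", "1")
  while (row != null) {
    row = parser.parseNext()     // throws TextParsingException on "0,1,2,3"
  }
} catch {
  case _: TextParsingException =>
    // By now the parser has consumed input past the start of the bad
    // record, so there is no reliable offset to restart from, unlike the
    // per-line path, where the next line can simply go to parseLine again.
}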

.option("columnNameOfCorruptRecord", columnNameOfCorruptRecord)
.option("wholeFile", wholeFile)
.load(path.getAbsolutePath)
checkAnswer(df, Row(0, 1, null) :: Row(null, null, "0,1,2,") :: Nil)

Member Author

@HyukjinKwon Weird behaviour... when we set maxColumns in a Univocity parser, currentParsedContent seems to return (maxColumns + 1) elements of the input.
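(A hypothetical standalone check of that observation; the expected output matches the corrupt-record value asserted in the test above.)

import com.univocity.parsers.common.TextParsingException
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

// With maxColumns = 2, parsing "0,1,2,3" fails, and the content recorded
// at the point of failure spans maxColumns + 1 fields.
val settings = new CsvParserSettings()
settings.setMaxColumns(2)
val parser = new CsvParser(settings)
try {
  parser.parseLine("0,1,2,3")
} catch {
  case _: TextParsingException =>
    // Presumably prints "0,1,2," per the behaviour described above.
    println(parser.getContext.currentParsedContent())
}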

@SparkQA

SparkQA commented Jun 13, 2017

Test build #77956 has finished for PR 18250 at commit b8c4462.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@giaosudau

I found that if you use the inferSchema option, it throws the same error.
We should handle this case too (see the repro sketch below).
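(A reproduction along those lines, for reference; with inferSchema enabled, the inference pass parses the input itself, so presumably the same TextParsingException surfaces there before the mode handling added in this PR applies.)

// Hypothetical repro per the report above: the failure happens while
// inferring the schema, not while reading rows with a known schema.
val df = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("maxColumns", "2")
  .option("mode", "PERMISSIVE")
  .load(path.getAbsolutePath)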

@maropu
Member Author

maropu commented Jun 14, 2017

Yeah, I think so. But to fix the inferSchema case, we need some more modifications. So, I feel we should first fix this case (schema specified by the user), then fix that one in a follow-up.

@maropu
Member Author

maropu commented Jun 25, 2017

@gatorsmile @HyukjinKwon ping

@HyukjinKwon
Member

I am sorry, @maropu. I can't think of a good way to handle this for now... I'll come back after thinking about it more, maybe.

@maropu
Member Author

maropu commented Jul 24, 2017

ok, thanks!

@maropu
Member Author

maropu commented Jul 24, 2017

I think we can come back to this once we have a better way to handle it, so I'll close this for now (we'd better keep this discussion in the JIRA).
