[SPARK-22516][SQL] Bump up Univocity version to 2.5.9
## What changes were proposed in this pull request?

A bug in the Univocity parser caused the issue reported in SPARK-22516. Upgrading the library from version 2.5.4 to 2.5.9 fixes it:

**Executing**
```
spark.read.option("header","true").option("inferSchema", "true").option("multiLine", "true").option("comment", "g").csv("test_file_without_eof_char.csv").show()
```
**Before**
```
ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached
...
Internal state when error was thrown: line=3, column=0, record=2, charIndex=31
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
```
**After**
```
+-------+-------+
|column1|column2|
+-------+-------+
|    abc|    def|
+-------+-------+
```
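
For anyone reproducing this locally: the file name above and the note added to `comments.csv` suggest the input must end with a comment line and no trailing newline. A minimal sketch for generating such a file follows. The `CommentAtEofFixture` object, the `writeWithoutTrailingNewline` helper, and the sample contents are assumptions for illustration, not the file from the original report.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object CommentAtEofFixture {
  // Hypothetical helper: joins the lines with '\n' and deliberately writes
  // no newline after the last one.
  def writeWithoutTrailingNewline(path: Path, lines: Seq[String]): Path =
    Files.write(path, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))

  // Assumed sample contents: a header, one record, and a final line that
  // starts with the comment character "g" used in the command above.
  val sampleLines: Seq[String] = Seq("column1,column2", "abc,def", "gcomment")

  def main(args: Array[String]): Unit = {
    val path = Files.createTempFile("test_file_without_eof_char", ".csv")
    writeWithoutTrailingNewline(path, sampleLines)
    val bytes = Files.readAllBytes(path)
    // The last byte belongs to the comment line, not to a line terminator.
    assert(bytes.last != '\n'.toByte)
    println(s"Wrote ${bytes.length} bytes to $path")
  }
}
```

Pointing the `spark.read` call above at the generated path should exercise the same code path, assuming these contents match the shape of the original input.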

## How was this patch tested?
The existing `CSVSuite.commented lines in CSV data` test was extended to also parse the file in multiline mode. The test input file was modified to include a comment on its last line.

Author: smurakozi <smurakozi@gmail.com>

Closes #19906 from smurakozi/SPARK-22516.
smurakozi authored and Marcelo Vanzin committed Dec 6, 2017
1 parent effca98 commit 9948b86
Showing 5 changed files with 17 additions and 13 deletions.
2 changes: 1 addition & 1 deletion dev/deps/spark-deps-hadoop-2.6
@@ -180,7 +180,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.5.4.jar
+univocity-parsers-2.5.9.jar
 validation-api-1.1.0.Final.jar
 xbean-asm5-shaded-4.4.jar
 xercesImpl-2.9.1.jar
2 changes: 1 addition & 1 deletion dev/deps/spark-deps-hadoop-2.7
@@ -181,7 +181,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.5.4.jar
+univocity-parsers-2.5.9.jar
 validation-api-1.1.0.Final.jar
 xbean-asm5-shaded-4.4.jar
 xercesImpl-2.9.1.jar
2 changes: 1 addition & 1 deletion sql/core/pom.xml
@@ -38,7 +38,7 @@
     <dependency>
       <groupId>com.univocity</groupId>
       <artifactId>univocity-parsers</artifactId>
-      <version>2.5.4</version>
+      <version>2.5.9</version>
       <type>jar</type>
     </dependency>
     <dependency>
1 change: 1 addition & 0 deletions sql/core/src/test/resources/test-data/comments.csv
@@ -4,3 +4,4 @@
 6,7,8,9,0,2015-08-21 16:58:01
 ~0,9,8,7,6,2015-08-22 17:59:02
 1,2,3,4,5,2015-08-23 18:00:42
+~ comment in last line to test SPARK-22516 - do not add empty line at the end of this file!
@@ -483,18 +483,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   }
 
   test("commented lines in CSV data") {
-    val results = spark.read
-      .format("csv")
-      .options(Map("comment" -> "~", "header" -> "false"))
-      .load(testFile(commentsFile))
-      .collect()
+    Seq("false", "true").foreach { multiLine =>
 
-    val expected =
-      Seq(Seq("1", "2", "3", "4", "5.01", "2015-08-20 15:57:00"),
-        Seq("6", "7", "8", "9", "0", "2015-08-21 16:58:01"),
-        Seq("1", "2", "3", "4", "5", "2015-08-23 18:00:42"))
+      val results = spark.read
+        .format("csv")
+        .options(Map("comment" -> "~", "header" -> "false", "multiLine" -> multiLine))
+        .load(testFile(commentsFile))
+        .collect()
 
-    assert(results.toSeq.map(_.toSeq) === expected)
+      val expected =
+        Seq(Seq("1", "2", "3", "4", "5.01", "2015-08-20 15:57:00"),
+          Seq("6", "7", "8", "9", "0", "2015-08-21 16:58:01"),
+          Seq("1", "2", "3", "4", "5", "2015-08-23 18:00:42"))
+
+      assert(results.toSeq.map(_.toSeq) === expected)
+    }
   }
 
   test("inferring schema with commented lines in CSV data") {
