
[SPARK-25393][SQL] Adding new function from_csv() #22379

Closed · wants to merge 64 commits into apache:master from MaxGekk:from_csv

Conversation

@MaxGekk (Member) commented Sep 10, 2018

What changes were proposed in this pull request?

The PR adds a new function `from_csv()`, similar to `from_json()`, to parse columns containing CSV strings. I added the following method:

```Scala
def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column
```

and this signature to call it from Python, R and Java:

```Scala
def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column
```
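For illustration, a minimal usage sketch of the `StructType` overload (an editorial addition, not part of the PR; it assumes a Spark build containing this patch, and the schema and sample data are made up):

```Scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_csv
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Schema for CSV records like "1,Alice".
val schema = new StructType().add("id", IntegerType).add("name", StringType)

// Parse a string column of CSV records into a struct column.
Seq("1,Alice", "2,Bob").toDF("value")
  .select(from_csv($"value", schema, Map.empty[String, String]).as("csv"))
  .show()
```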

How was this patch tested?

Added new test suites `CsvExpressionsSuite`, `CsvFunctionsSuite` and SQL tests.

@MaxGekk (Member, Author) commented Sep 10, 2018

jenkins, retest this, please

@SparkQA commented Sep 10, 2018

Test build #95870 has finished for PR 22379 at commit cfb2ac3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 10, 2018

Test build #95874 has finished for PR 22379 at commit cfb2ac3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<dependency>
  <groupId>com.univocity</groupId>
  <artifactId>univocity-parsers</artifactId>
  <version>2.7.3</version>
  <type>jar</type>
</dependency>
Review comment (Member):

Hi, @MaxGekk, @gatorsmile, and @cloud-fan.

I know this is the same approach as from_json, but suddenly I'm wondering whether this is the right evolution direction: sql -> catalyst. Recently we made Avro an external module, and the development direction was the opposite. If we put this into catalyst, it becomes harder to separate from the sql module.

Ideally, we should separate parquet, orc and the other built-in data sources from the sql module.

@MaxGekk (Member, Author) replied Sep 10, 2018:

I added the dependency only because I had to move UnivocityParser from sql/core to sql/catalyst, since it wasn't accessible from sql/catalyst. Please tell me, what is the right approach here?

Review comment (Member):

Ideally we should just make the CSV one a separate external module; that would be the right way given the discussion.

The current change isn't necessarily blocked, but I can see the point that moving the dependency makes further refactoring potentially harder, as pointed out. It looks like many people agreed on separating them.

The concern here is that it sounds like we are stepping back from the ideal approach.

@SparkQA commented Sep 10, 2018

Test build #95885 has finished for PR 22379 at commit d2bfd94.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Sep 10, 2018

jenkins, retest this, please

"from_csv",
x@jc, schema, options)
column(jc)
})
Review comment (Member):

Newline at the end.


@SparkQA commented Sep 11, 2018

Test build #95901 has finished for PR 22379 at commit d2bfd94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#'
#' @note from_csv since 3.0.0
setMethod("from_csv", signature(x = "Column", schema = "character"),
          function(x, schema, ...) {
Review comment (Member):

Can you add to the doc for `...` (in column_collection_functions) an indication of the options used by this function, if there is anything new?

@MaxGekk (Member, Author) replied:

@felixcheung I added a doc, but when I run run-tests.sh, it outputs the warning:

* checking Rd \usage sections ... WARNING
Duplicated \argument entries in documentation object 'column_collection_functions':
  ‘schema’

Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.

Is it ok? If it isn't, any ideas what could cause it?

Review comment (Member):

No no, this will break. I am referring to finding the original doc, @rdname column_collection_functions, that has `...` already documented, and then adding this in.

Review comment (Member):

Here:

#' @param ... additional argument(s). In \code{to_json} and \code{from_json}, this contains

@SparkQA commented Sep 11, 2018

Test build #95949 has finished for PR 22379 at commit 42b8227.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Sep 11, 2018

jenkins, retest this, please

@SparkQA commented Sep 11, 2018

Test build #95948 has finished for PR 22379 at commit 147c978.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 11, 2018

Test build #95963 has finished for PR 22379 at commit 42b8227.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 12, 2018

Test build #95965 has finished for PR 22379 at commit 2a0b65b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member)

See comment above.

@SparkQA commented Sep 12, 2018

Test build #95980 has finished for PR 22379 at commit 1ccca30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Looks pretty good! Just one minor comment.

@cloud-fan (Contributor)

BTW, how would the schema_of_csv function help with from_csv?

@HyukjinKwon (Member) commented Oct 15, 2018

Like this:

```Scala
val csvData = "a,1,2"
from_csv(column, schema = schema_of_csv(lit(csvData)))
```

That's why I suggested #22379 (comment)

@cloud-fan (Contributor)

I wouldn't make schema a Column. Column means it's dynamic, while I think it should be a static value, to simplify the implementation.

@HyukjinKwon (Member)

I wouldn't either, but there's no way to use schema_of_csv otherwise ..
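To illustrate the tension (an editorial sketch, not from the thread: it assumes a `from_csv` overload taking the schema as a `Column`, a `schema_of_csv` function, which landed separately, and a DataFrame `df` with a string column `value`, with `spark.implicits._` in scope): a `Column`-typed schema stays effectively static only when the expression is foldable, i.e. evaluable once at analysis time.

```Scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.functions.{from_csv, lit, schema_of_csv}

val options = Map.empty[String, String].asJava

// Foldable: schema_of_csv over a literal can be evaluated once during
// analysis, so the inferred schema is still a static value in practice.
df.select(from_csv($"value", schema_of_csv(lit("a,1,2")), options))

// A truly dynamic schema, e.g. schema_of_csv($"value"), cannot be resolved
// up front and would have to be rejected at analysis time.
```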

@HyukjinKwon (Member)

retest this please

@HyukjinKwon (Member)

While addressing the review comments, I also reviewed the change at the same time. It looks pretty good to go.

For #22379 (comment), I guess we can add the string one later on the Scala side.

@@ -777,7 +777,6 @@ case class SchemaOfJson(
 }

 object JsonExprUtils {
-
   def evalSchemaExpr(exp: Expression): DataType = exp match {
Review comment (Contributor):

Do we still need it?

Reply (Member): (reply content not captured in this export)

Reply (Contributor):

Makes sense, thanks!

@cloud-fan (Contributor): (comment content not captured in this export)
@HyukjinKwon (Member)

Sorry, I missed that comment. I replied in that comment.

@SparkQA commented Oct 15, 2018

Test build #97385 has finished for PR 22379 at commit 93d094f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 15, 2018

Test build #97383 has finished for PR 22379 at commit b26e49e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LGTM

@SparkQA commented Oct 16, 2018

Test build #97429 has finished for PR 22379 at commit cb23bd7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Oct 16, 2018

Test build #97440 has finished for PR 22379 at commit cb23bd7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 16, 2018

Test build #97455 has finished for PR 22379 at commit 205e4a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master.

@HyukjinKwon (Member)

Thanks all!

@asfgit closed this in e9af946 on Oct 17, 2018
@MaxGekk (Member, Author) commented Oct 31, 2018

@HyukjinKwon Thank you for your work on the PR. @cloud-fan @felixcheung @dongjoon-hyun @gatorsmile Thanks for your reviews.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

The PR adds a new function `from_csv()` similar to `from_json()` to parse columns with CSV strings. I added the following methods:
```Scala
def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column
```
and this signature to call it from Python, R and Java:
```Scala
def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column
```

## How was this patch tested?

Added new test suites `CsvExpressionsSuite`, `CsvFunctionsSuite` and sql tests.

Closes apache#22379 from MaxGekk/from_csv.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
@MaxGekk deleted the from_csv branch on August 17, 2019, 13:35
7 participants