
[SPARK-25393][SQL] Adding new function from_csv() #22379

Closed · wants to merge 64 commits into apache:master from MaxGekk:from_csv

Conversation

@MaxGekk (Member) commented Sep 10, 2018

What changes were proposed in this pull request?

The PR adds a new function `from_csv()`, similar to `from_json()`, to parse columns containing CSV strings. I added the following method:

```Scala
def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column
```

and this signature to call it from Python, R and Java:

```Scala
def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column
```
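For illustration, a minimal usage sketch of the `StructType` overload (an editorial addition, not part of the PR; it assumes a Spark build containing this patch, and the schema and sample data are made up):

```Scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_csv
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Schema for CSV records like "1,Alice".
val schema = new StructType().add("id", IntegerType).add("name", StringType)

// Parse a string column of CSV records into a struct column.
Seq("1,Alice", "2,Bob").toDF("value")
  .select(from_csv($"value", schema, Map.empty[String, String]).as("csv"))
  .show()
```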

How was this patch tested?

Added new test suites `CsvExpressionsSuite`, `CsvFunctionsSuite` and SQL tests.

@MaxGekk (Member, Author) commented Sep 10, 2018

jenkins, retest this, please

@SparkQA commented Sep 10, 2018

Test build #95870 has finished for PR 22379 at commit cfb2ac3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 10, 2018

Test build #95874 has finished for PR 22379 at commit cfb2ac3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<dependency>
  <groupId>com.univocity</groupId>
  <artifactId>univocity-parsers</artifactId>
  <version>2.7.3</version>
  <type>jar</type>
</dependency>
Review comment (Member):

Hi, @MaxGekk, @gatorsmile, and @cloud-fan.

I know this is the same approach as from_json, but suddenly I'm wondering whether this is the right evolution direction: sql -> catalyst. Recently we made Avro an external module, and the development direction was the opposite. If we put this into catalyst, it becomes harder to separate from the sql module.

Ideally, we should separate parquet, orc and the other built-in data sources from the sql module.

@MaxGekk (Member, Author) replied Sep 10, 2018:

I added the dependency only because I had to move UnivocityParser from sql/core to sql/catalyst, since it wasn't accessible from sql/catalyst. Please tell me, what is the right approach here?

Review comment (Member):

Ideally we should just make the CSV one a separate external module; that would be the right way given the discussion.

The current change isn't necessarily blocked, but I can see the point that moving the dependency makes further refactoring potentially harder, as pointed out. It looks like many people agreed on separating them.

The concern here is that it sounds like we are stepping back from the ideal approach.

@SparkQA commented Sep 10, 2018

Test build #95885 has finished for PR 22379 at commit d2bfd94.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Sep 10, 2018

jenkins, retest this, please

"from_csv",
x@jc, schema, options)
column(jc)
})
Review comment (Member):

Newline at the end.


@SparkQA commented Sep 11, 2018

Test build #95901 has finished for PR 22379 at commit d2bfd94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#'
#' @note from_csv since 3.0.0
setMethod("from_csv", signature(x = "Column", schema = "character"),
          function(x, schema, ...) {
Review comment (Member):

Can you add to the doc for `...` (in column_collection_functions) an indication of the options used by this function, if there is anything new?

@MaxGekk (Member, Author) replied:

@felixcheung I added a doc, but when I run run-tests.sh, it outputs the warning:

* checking Rd \usage sections ... WARNING
Duplicated \argument entries in documentation object 'column_collection_functions':
  ‘schema’

Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.

Is it ok? If it isn't, any ideas what could cause it?

Review comment (Member):

No no, this will break. I am referring to finding the original doc, @rdname column_collection_functions, that has `...` already documented, and then adding this in.

Review comment (Member):

Here:

#' @param ... additional argument(s). In \code{to_json} and \code{from_json}, this contains

@SparkQA commented Sep 11, 2018

Test build #95949 has finished for PR 22379 at commit 42b8227.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Sep 11, 2018

jenkins, retest this, please

@SparkQA commented Sep 11, 2018

Test build #95948 has finished for PR 22379 at commit 147c978.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 11, 2018

Test build #95963 has finished for PR 22379 at commit 42b8227.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 12, 2018

Test build #95965 has finished for PR 22379 at commit 2a0b65b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member)

See comment above.

@SparkQA commented Sep 12, 2018

Test build #95980 has finished for PR 22379 at commit 1ccca30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Looks pretty good! Just one minor comment.

@cloud-fan (Contributor)

BTW, how would the schema_of_csv function help with from_csv?

@HyukjinKwon (Member) commented Oct 15, 2018

Like this:

```Scala
val csvData = "a,1,2"
from_csv(column, schema = schema_of_csv(lit(csvData)))
```

That's why I suggested #22379 (comment)

@cloud-fan (Contributor)

I wouldn't make schema a Column. Column means it's dynamic, while I think it should be a static value, to simplify the implementation.

@HyukjinKwon (Member)

I wouldn't either, but there's no way to use schema_of_csv otherwise ..
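To illustrate the tension (an editorial sketch, not from the thread: it assumes a `from_csv` overload taking the schema as a `Column`, a `schema_of_csv` function, which landed separately, and a DataFrame `df` with a string column `value`, with `spark.implicits._` in scope): a `Column`-typed schema stays effectively static only when the expression is foldable, i.e. evaluable once at analysis time.

```Scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.functions.{from_csv, lit, schema_of_csv}

val options = Map.empty[String, String].asJava

// Foldable: schema_of_csv over a literal can be evaluated once during
// analysis, so the inferred schema is still a static value in practice.
df.select(from_csv($"value", schema_of_csv(lit("a,1,2")), options))

// A truly dynamic schema, e.g. schema_of_csv($"value"), cannot be resolved
// up front and would have to be rejected at analysis time.
```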

@HyukjinKwon (Member)

retest this please

@HyukjinKwon (Member)

While addressing the review comments, I also reviewed the change at the same time. It looks pretty good to go.

For #22379 (comment), I guess we can add the string one later on the Scala side.

@@ -777,7 +777,6 @@ case class SchemaOfJson(
 }

 object JsonExprUtils {
-
   def evalSchemaExpr(exp: Expression): DataType = exp match {
Review comment (Contributor):

Do we still need it?

Reply (Member): (reply content not captured in this export)

Reply (Contributor):

Makes sense, thanks!

@cloud-fan (Contributor): (comment content not captured in this export)
@HyukjinKwon (Member)

Sorry, I missed that comment. I replied in that comment.

@SparkQA commented Oct 15, 2018

Test build #97385 has finished for PR 22379 at commit 93d094f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 15, 2018

Test build #97383 has finished for PR 22379 at commit b26e49e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LGTM

@SparkQA commented Oct 16, 2018

Test build #97429 has finished for PR 22379 at commit cb23bd7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Oct 16, 2018

Test build #97440 has finished for PR 22379 at commit cb23bd7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 16, 2018

Test build #97455 has finished for PR 22379 at commit 205e4a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master.

@HyukjinKwon (Member)

Thanks all!

@asfgit closed this in e9af946 on Oct 17, 2018
@MaxGekk (Member, Author) commented Oct 31, 2018

@HyukjinKwon Thank you for your work on the PR. @cloud-fan @felixcheung @dongjoon-hyun @gatorsmile Thanks for your reviews.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

The PR adds a new function `from_csv()` similar to `from_json()` to parse columns with CSV strings. I added the following methods:
```Scala
def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column
```
and this signature to call it from Python, R and Java:
```Scala
def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column
```

## How was this patch tested?

Added new test suites `CsvExpressionsSuite`, `CsvFunctionsSuite` and sql tests.

Closes apache#22379 from MaxGekk/from_csv.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
@MaxGekk deleted the from_csv branch on August 17, 2019, 13:35
7 participants