[SPARK-25393][SQL] Adding new function from_csv() #22379
Conversation
jenkins, retest this, please
Test build #95870 has finished for PR 22379 at commit
Test build #95874 has finished for PR 22379 at commit
<artifactId>univocity-parsers</artifactId>
<version>2.7.3</version>
<type>jar</type>
</dependency>
Hi, @MaxGekk , @gatorsmile , and @cloud-fan .
I know this is the same approach as from_json, but suddenly I'm wondering if this is the right evolution direction: sql -> catalyst. Recently, we made avro an external module, and the development direction was the opposite. If we put this into catalyst, it becomes harder to separate from the sql module.
Ideally, we should separate parquet, orc and other built-in data sources from the sql module.
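For context on the avro precedent: an external module owns its parser dependency itself instead of pushing it into sql/catalyst. A hypothetical sketch of what such a module's pom fragment could look like — the module artifact name is illustrative and no such Spark module exists in this PR:

```xml
<!-- Hypothetical external/csv module POM fragment; this mirrors how
     external/avro declares its own dependencies rather than adding
     them to sql/catalyst. The artifactId below is invented. -->
<artifactId>spark-csv_2.11</artifactId>
<dependencies>
  <dependency>
    <groupId>com.univocity</groupId>
    <artifactId>univocity-parsers</artifactId>
    <version>2.7.3</version>
    <type>jar</type>
  </dependency>
</dependencies>
```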
I added the dependency only because I had to move UnivocityParser from sql/core to sql/catalyst, since it wasn't accessible from sql/catalyst. Please tell me, what is the right approach here?
Ideally we should just make the CSV one a separate external module, and that would be the right way given the discussion.
The current change isn't necessarily blocked, but I can see the point that moving the dependency makes further refactoring potentially harder, as pointed out. Many people seem to have agreed on separating them.
The concern here is that it sounds like we are stepping back from the ideal approach.
Test build #95885 has finished for PR 22379 at commit
jenkins, retest this, please
R/pkg/R/functions.R
Outdated
"from_csv",
x@jc, schema, options)
column(jc)
})
newline at the end
Test build #95901 has finished for PR 22379 at commit
R/pkg/R/functions.R
Outdated
#'
#' @note from_csv since 3.0.0
setMethod("from_csv", signature(x = "Column", schema = "character"),
          function(x, schema, ...) {
Can you add to the doc for ... (in column_collection_functions) to indicate the options that can be used for this function, if there is anything new?
@felixcheung I added a doc, but when I run run-tests.sh, it outputs this warning:
* checking Rd \usage sections ... WARNING
Duplicated \argument entries in documentation object 'column_collection_functions':
  ‘schema’
Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.
Is it ok? If it isn't, any ideas what could cause it?
No, no, this will break. I am referring to finding the original doc @rdname column_collection_functions that has ... already documented, and then adding this there.
here
Line 199 in d2bfd94
#' @param ... additional argument(s). In \code{to_json} and \code{from_json}, this contains
Test build #95949 has finished for PR 22379 at commit
jenkins, retest this, please
Test build #95948 has finished for PR 22379 at commit
Test build #95963 has finished for PR 22379 at commit
Test build #95965 has finished for PR 22379 at commit
see comment above
Test build #95980 has finished for PR 22379 at commit
Looks pretty good! Just one minor comment
BTW how would the
Like ..
That's why I suggested #22379 (comment)
I wouldn't make schema a
I wouldn't either, but there's no way to use
retest this please
While addressing the review comments, I also reviewed at the same time. The change looks pretty good to go. For #22379 (comment), I guess we can add the string one later on the Scala side ..
@@ -777,7 +777,6 @@ case class SchemaOfJson(
}

object JsonExprUtils {

  def evalSchemaExpr(exp: Expression): DataType = exp match {
do we still need it?
I was following @MaxGekk's opinion (https://github.com/apache/spark/pull/22379/files/93d094f45b02afc0ab2f0650bbde1513823471a2#r224846183).
makes sense, thanks!
have you addressed https://github.com/apache/spark/pull/22379/files#r225033808 ?
Sorry, I missed that comment. I replied there.
Test build #97385 has finished for PR 22379 at commit
Test build #97383 has finished for PR 22379 at commit
LGTM
Test build #97429 has finished for PR 22379 at commit
retest this please
Test build #97440 has finished for PR 22379 at commit
Test build #97455 has finished for PR 22379 at commit
Merged to master.
Thanks all!
@HyukjinKwon Thank you for your work on the PR. @cloud-fan @felixcheung @dongjoon-hyun @gatorsmile Thanks for your reviews.
## What changes were proposed in this pull request?

The PR adds new function `from_csv()` similar to `from_json()` to parse columns with CSV strings. I added the following methods:

```Scala
def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column
```

and this signature to call it from Python, R and Java:

```Scala
def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column
```

## How was this patch tested?

Added new test suites `CsvExpressionsSuite`, `CsvFunctionsSuite` and sql tests.

Closes apache#22379 from MaxGekk/from_csv.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
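Conceptually, `from_csv` parses one CSV string per row against a declared schema and returns a struct. The sketch below is a hypothetical, simplified stand-in for that per-row step — it is not Spark's actual implementation (which delegates to univocity-parsers); `FromCsvSketch`, `parseCsvLine`, and the string type tags are invented here for illustration:

```scala
// Hypothetical sketch of per-row CSV parsing against a schema.
// NOT Spark's implementation; names and type tags are made up.
object FromCsvSketch {
  // Split a CSV line and coerce each field to its declared type,
  // pairing fields with schema entries positionally.
  def parseCsvLine(line: String, schema: Seq[(String, String)]): Map[String, Any] = {
    val fields = line.split(",").map(_.trim)
    schema.zip(fields).map { case ((name, dataType), field) =>
      val value: Any = dataType match {
        case "int"    => field.toInt
        case "double" => field.toDouble
        case _        => field // fall back to the raw string
      }
      name -> value
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val schema = Seq("id" -> "int", "score" -> "double", "label" -> "string")
    println(parseCsvLine("7,0.25,spam", schema))
  }
}
```

In real Spark usage, the equivalent row would be parsed with the `from_csv(e, schema, options)` signatures shown above, applied to a string column.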