New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-16285][SQL] Implement sentences SQL functions #14004
Conversation
Test build #61575 has finished for PR 14004 at commit
|
Test build #61590 has finished for PR 14004 at commit
|
cc @rxin and @cloud-fan |
/** | ||
* Splits a string into arrays of sentences, where each sentence is an array of words. | ||
*/ | ||
public ArrayList<ArrayList<UTF8String>> sentences(UTF8String language, UTF8String country) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is implemented by String not UTF8String, I think we should put it into another util object(maybe in scala) and takes String as arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I thought it's a convention.
+1. Splitting not-native UTF8String functions sounds good to me. We need to split the followings together.
- split
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L795 - translate
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L805 - toTitleCaseSlow
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L408
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For backward compatibility, we need to make Wrappers here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm, I'm confused too. @rxin what's the convention here?
Just rebased. |
Test build #61664 has finished for PR 14004 at commit
|
Test build #61680 has finished for PR 14004 at commit
|
Rebased to resolve conflicts. |
Test build #61694 has finished for PR 14004 at commit
|
|
||
checkEvaluation( | ||
Sentences("Hi there! The price was $1,234.56.... But, not now.", "en", "US"), | ||
Seq( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrong ident here
Test build #61710 has finished for PR 14004 at commit
|
/** | ||
* Return a locale of the given language and country, or a default locale when failures occur. | ||
*/ | ||
private Locale getLocale(UTF8String language, UTF8String country) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok to inline this method into sentences
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. No problem. I'll move this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I mean the sentences
method, not the Sentences
expression...
Now, the PR became more concise. Thank you for decision, @cloud-fan . |
Oh, @cloud-fan . |
If then, I will reposition that. |
Test build #61807 has finished for PR 14004 at commit
|
def this(str: Expression) = this(str, Literal(""), Literal("")) | ||
def this(str: Expression, language: Expression) = this(str, language, Literal("")) | ||
|
||
override def dataType: DataType = ArrayType(ArrayType(StringType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the String element nullable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Right. The return dataType
could be null
, but has no nullable
element. I'll fix like the following.
By the way, I found that |
Test build #61816 has finished for PR 14004 at commit
|
Test build #61818 has finished for PR 14004 at commit
|
Hi, @cloud-fan . I've updated the PR. The following is a summary of changes.
|
Test build #61827 has finished for PR 14004 at commit
|
Test build #61829 has finished for PR 14004 at commit
|
checkEvaluation( | ||
Sentences("Hi there! The price was $1,234.56.... But, not now."), | ||
correct_answer, | ||
EmptyRow) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EmptyRow
is the default value, we don't need to pass it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
LGTM except some style comment, thanks for working on it! |
Thank you, @cloud-fan . |
null | ||
} else { | ||
var locale = Locale.getDefault | ||
if (language != null && language.eval(input) != null && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we don't check the nullability of expression, but the eval result of expression. And it's really bad we call eval
twice: one in the if condition and one in the if body.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. I didn't think in that way. Right for both. Thank you.
Test build #61895 has finished for PR 14004 at commit
|
Test build #61899 has finished for PR 14004 at commit
|
Hi, @cloud-fan . |
var locale = Locale.getDefault | ||
val lang = language.eval(input) | ||
val coun = country.eval(input) | ||
if (lang != null && coun != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to write:
val languageStr = language.eval(input).asInstanceOf[UTF8String]
val countryStr = country.eval(input).asInstanceOf[UTF8String]
val locale = if (languageStr != null && countryStr != null) {
new Locale(languageStr, countryStr)
} else {
Locale.getDefault
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
language.eval(input)
is null
, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's my mistake. Sorry. I'll fix soon.
and one comment for the old thread: #14004 (comment) |
Test build #61920 has finished for PR 14004 at commit
|
Hi, @rxin . |
* The 'lang' and 'country' arguments are optional, and if omitted, the default locale is used. | ||
*/ | ||
@ExpressionDescription( | ||
usage = "_FUNC_(str, lang, country) - Splits str into an array of array of words.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_FUNC_(str[, lang, country])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for review! I'll fix soon.
This looks alright. I left some minor comments. Please move this out of the regex file. Seems like it should go into stringExpressions file. |
LGTM pending Jenkins. |
Thank you for review again! |
Test build #61969 has finished for PR 14004 at commit
|
Hi, @cloud-fan . |
Oops. You already did. Thank you, @cloud-fan . |
## What changes were proposed in this pull request? This PR implements `sentences` SQL function. ## How was this patch tested? Pass the Jenkins tests with a new testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14004 from dongjoon-hyun/SPARK_16285. (cherry picked from commit a54438c) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
thanks, merging to master and 2.0! |
What changes were proposed in this pull request?
This PR implements
sentences
SQL function.How was this patch tested?
Pass the Jenkins tests with a new testcase.