[SPARK-16285][SQL] Implement sentences SQL functions #14004

Closed
wants to merge 9 commits into from

dongjoon-hyun
Member

What changes were proposed in this pull request?

This PR implements the sentences SQL function.

How was this patch tested?

Passes the Jenkins tests with a new test case.

SparkQA commented Jul 1, 2016

Test build #61575 has finished for PR 14004 at commit f29a8c3.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Sentences(

SparkQA commented Jul 1, 2016

Test build #61590 has finished for PR 14004 at commit d5f84c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Sentences(

@dongjoon-hyun
Member Author

cc @rxin and @cloud-fan

/**
* Splits a string into arrays of sentences, where each sentence is an array of words.
*/
public ArrayList<ArrayList<UTF8String>> sentences(UTF8String language, UTF8String country) {
Contributor

This method is implemented with String, not UTF8String. I think we should put it into another util object (maybe in Scala) that takes String arguments.

Member Author

For backward compatibility, do we need to make wrappers here?

Contributor

hmmm, I'm confused too. @rxin what's the convention here?
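
For readers following this thread: the point that the method works on String rather than UTF8String can be illustrated with a minimal Java sketch built on java.text.BreakIterator. This is not the code from the patch. The class name SentencesSketch and the method names split and words are hypothetical, and keeping only tokens that start with a letter or digit is an assumption about what counts as a word.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the patch: it works on plain String values.
final class SentencesSketch {
    private SentencesSketch() {}

    /** Splits text into sentences; each sentence becomes a list of words. */
    static List<List<String>> split(String text, Locale locale) {
        List<List<String>> sentences = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int start = sentenceIt.first();
        // Walk consecutive sentence boundaries until the iterator is exhausted.
        for (int end = sentenceIt.next(); end != BreakIterator.DONE;
                start = end, end = sentenceIt.next()) {
            sentences.add(words(text.substring(start, end), locale));
        }
        return sentences;
    }

    /** Keeps only tokens whose first character is a letter or digit (assumed rule). */
    private static List<String> words(String sentence, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator wordIt = BreakIterator.getWordInstance(locale);
        wordIt.setText(sentence);
        int start = wordIt.first();
        for (int end = wordIt.next(); end != BreakIterator.DONE;
                start = end, end = wordIt.next()) {
            String token = sentence.substring(start, end);
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                words.add(token);
            }
        }
        return words;
    }
}
```

With this sketch, SentencesSketch.split("Hi there! How are you?", Locale.US) yields [[Hi, there], [How, are, you]]. A caller that holds UTF8String values would convert them to String before the call and back afterwards, which is the wrapper question raised above.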

@dongjoon-hyun
Member Author

Just rebased.

SparkQA commented Jul 2, 2016

Test build #61664 has finished for PR 14004 at commit c9e235a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Sentences(

SparkQA commented Jul 3, 2016

Test build #61680 has finished for PR 14004 at commit ea75373.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Rebased to resolve conflicts.

SparkQA commented Jul 3, 2016

Test build #61694 has finished for PR 14004 at commit d021d39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


checkEvaluation(
Sentences("Hi there! The price was $1,234.56.... But, not now.", "en", "US"),
Seq(
Contributor

Wrong indent here.

SparkQA commented Jul 4, 2016

Test build #61710 has finished for PR 14004 at commit 922e6e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Return a locale of the given language and country, or a default locale when failures occur.
*/
private Locale getLocale(UTF8String language, UTF8String country) {
Contributor

I think it's ok to inline this method into sentences

Member Author

Okay. No problem. I'll move this.

Contributor

Sorry I mean the sentences method, not the Sentences expression...

@dongjoon-hyun
Member Author

Now the PR is more concise. Thank you for the decision, @cloud-fan.

@dongjoon-hyun
Member Author

Oh, @cloud-fan .
Is there some misunderstanding?

@dongjoon-hyun
Member Author

If so, I will move it.
Do you mean creating a new Java file for functions that are not really UTF8String functions?

SparkQA commented Jul 6, 2016

Test build #61807 has finished for PR 14004 at commit f1a5c1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def this(str: Expression) = this(str, Literal(""), Literal(""))
def this(str: Expression, language: Expression) = this(str, language, Literal(""))

override def dataType: DataType = ArrayType(ArrayType(StringType))
Contributor

Is the String element nullable?

Member Author

Hmm. Right. The returned value could be null, but it has no nullable elements. I'll fix it like the following.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jul 6, 2016

By the way, I found that TernaryExpression.eval seems to be incompatible with sentences('', null, null).
I'll let you know after finishing the update.

SparkQA commented Jul 6, 2016

Test build #61816 has finished for PR 14004 at commit 50629b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 6, 2016

Test build #61818 has finished for PR 14004 at commit a98c05e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @cloud-fan . I've updated the PR. The following is a summary of changes.

  • To support sentences('', null, null), Sentences extends Expression instead of TernaryExpression.
  • getLocale is merged into getSentences (according to the comment).
  • getSentences is now in the Sentences expression. (Yep, I know. It was accidental, due to my misunderstanding.)
  • Added more test cases and moved the tests into StringExpressionsSuite and StringFunctionsSuite.
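
The null-handling difference behind the first bullet can be sketched in plain Java. This is an illustration only and does not use Spark classes: NullSemanticsSketch, Fn3, strictEval, and lenientEval are hypothetical names. It assumes that a ternary base class returns null as soon as any argument is null, which the thread implies but does not spell out.

```java
// Hypothetical illustration of two null-handling policies; no Spark classes involved.
final class NullSemanticsSketch {
    private NullSemanticsSketch() {}

    /** Minimal three-argument function type used by both policies. */
    interface Fn3<A, B, C, R> {
        R apply(A a, B b, C c);
    }

    /** Assumed strict policy: any null argument makes the whole result null. */
    static <A, B, C, R> R strictEval(A a, B b, C c, Fn3<A, B, C, R> f) {
        if (a == null || b == null || c == null) {
            return null;
        }
        return f.apply(a, b, c);
    }

    /**
     * Lenient policy: only a null first argument yields null. If either of the
     * other two is null, both fall back to the supplied defaults.
     */
    static <A, B, C, R> R lenientEval(A a, B b, C c, B defaultB, C defaultC,
                                      Fn3<A, B, C, R> f) {
        if (a == null) {
            return null;
        }
        boolean bothGiven = b != null && c != null;
        return f.apply(a, bothGiven ? b : defaultB, bothGiven ? c : defaultC);
    }
}
```

Under the strict policy a call shaped like sentences('', null, null) produces null, while the lenient policy still evaluates the function with the defaults. That difference is the motivation given above for extending Expression directly.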

SparkQA commented Jul 6, 2016

Test build #61827 has finished for PR 14004 at commit d2da078.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 6, 2016

Test build #61829 has finished for PR 14004 at commit 9e8cfbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkEvaluation(
Sentences("Hi there! The price was $1,234.56.... But, not now."),
correct_answer,
EmptyRow)
Contributor

EmptyRow is the default value, we don't need to pass it.

Member Author

Thanks.

@cloud-fan
Contributor

LGTM except for some style comments. Thanks for working on it!

@dongjoon-hyun
Member Author

Thank you, @cloud-fan .
I updated the PR according to your comments.

null
} else {
var locale = Locale.getDefault
if (language != null && language.eval(input) != null &&
Contributor

Usually we don't check the nullability of the expression itself, but of its eval result. Also, it's really bad that we call eval twice: once in the if condition and once in the if body.

Member Author

Oh, I didn't think of it that way. You're right on both points. Thank you.

SparkQA commented Jul 7, 2016

Test build #61895 has finished for PR 14004 at commit 8d7c3d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 7, 2016

Test build #61899 has finished for PR 14004 at commit 4144e7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @cloud-fan .
Finally, it passed again.

var locale = Locale.getDefault
val lang = language.eval(input)
val coun = country.eval(input)
if (lang != null && coun != null) {
Contributor

I'd like to write:

val languageStr = language.eval(input).asInstanceOf[UTF8String]
val countryStr = country.eval(input).asInstanceOf[UTF8String]
val locale = if (languageStr != null && countryStr != null) {
  new Locale(languageStr.toString, countryStr.toString)
} else {
  Locale.getDefault
}

Member Author

language.eval(input) is null, isn't it?

Member Author

Oh, it's my mistake. Sorry. I'll fix soon.
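
The two review points in this thread (evaluate each argument once, then fall back to the default locale unless both values are present) can be sketched in plain Java. LocaleSketch and resolve are hypothetical names, and the Supplier arguments merely stand in for the expression evaluation discussed above. They are not Spark APIs.

```java
import java.util.Locale;
import java.util.function.Supplier;

// Hypothetical stand-in for the locale selection discussed in the review.
final class LocaleSketch {
    private LocaleSketch() {}

    /**
     * Evaluates each supplier exactly once and builds a Locale only when both
     * results are non-null; otherwise returns the JVM default locale.
     */
    @SuppressWarnings("deprecation") // the two-argument constructor is deprecated on newer JDKs
    static Locale resolve(Supplier<String> language, Supplier<String> country) {
        String lang = language.get();   // single evaluation, result reused below
        String coun = country.get();    // single evaluation, result reused below
        if (lang != null && coun != null) {
            return new Locale(lang, coun);
        }
        return Locale.getDefault();
    }
}
```

Storing the evaluated values in locals keeps the null check and the constructor call on the same result. Calling the evaluation once in the condition and again in the body would not guarantee that.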

@cloud-fan
Contributor

and one comment for the old thread: #14004 (comment)

SparkQA commented Jul 7, 2016

Test build #61920 has finished for PR 14004 at commit 9164f54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @rxin .
Could you review and merge this sentences PR?

* The 'lang' and 'country' arguments are optional, and if omitted, the default locale is used.
*/
@ExpressionDescription(
usage = "_FUNC_(str, lang, country) - Splits str into an array of array of words.",
Contributor

_FUNC_(str[, lang, country])

Member Author

Thank you for the review! I'll fix it soon.

@rxin
Contributor

rxin commented Jul 8, 2016

This looks alright. I left some minor comments. Please move this out of the regex file. Seems like it should go into stringExpressions file.

@rxin
Contributor

rxin commented Jul 8, 2016

LGTM pending Jenkins.

@dongjoon-hyun
Member Author

Thank you for review again!

SparkQA commented Jul 8, 2016

Test build #61969 has finished for PR 14004 at commit 7912bf7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Sentences(

@dongjoon-hyun
Member Author

Hi, @cloud-fan .
Could you merge this sentences PR, too?

@dongjoon-hyun
Member Author

Oops. You already did. Thank you, @cloud-fan .
And, thank you, @rxin .

@asfgit asfgit closed this in a54438c Jul 8, 2016
asfgit pushed a commit that referenced this pull request Jul 8, 2016
## What changes were proposed in this pull request?

This PR implements `sentences` SQL function.

## How was this patch tested?

Pass the Jenkins tests with a new testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14004 from dongjoon-hyun/SPARK_16285.

(cherry picked from commit a54438c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Contributor

thanks, merging to master and 2.0!

@dongjoon-hyun dongjoon-hyun deleted the SPARK_16285 branch July 20, 2016 07:43