@OopsOutOfMemory
Contributor

@marmbrus @chenghao-intel
Add the string functions trim, ltrim, rtrim, and length to support Spark SQL and HiveQL.
e.g.:
sql("select trim(' a b ') from src ").collect() --> 'a b'
sql("select ltrim(' a b ') from src ").collect() --> 'a b '
sql("select rtrim(' a b ') from src ").collect() --> ' a b'
sql("select length('ab') from src ").collect() --> 2

Also rename the trait in stringOperations.scala.
I would rename trait CaseConversionExpression to StringTransformationExpression. The new name makes more sense because the trait can then support string transformations in general, not only case conversion.

Also add a trait StringCalculationExpression for string computations such as length, indexOf, etc.
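
A minimal sketch of how the two traits described above could divide the work. It uses simplified stand-in types, not the real Catalyst classes: Expression, Literal, DataType, convert and calc are assumptions made for illustration, and the traits in the actual patch may be shaped differently.

```scala
object StringExprSketch {
  // Simplified stand-ins for Catalyst types (hypothetical, not the Spark classes).
  sealed trait DataType
  case object StringType extends DataType
  case object IntegerType extends DataType

  trait Expression {
    def dataType: DataType
    def eval(input: Any): Any
  }

  // A constant value; `input` is ignored.
  case class Literal(value: Any, dataType: DataType) extends Expression {
    def eval(input: Any): Any = value
  }

  // String in, string out: upper, lower, trim, ltrim, rtrim.
  trait StringTransformationExpression extends Expression {
    def child: Expression
    def convert(v: String): String
    def dataType: DataType = StringType
    def eval(input: Any): Any = child.eval(input) match {
      case null => null // null in, null out
      case s    => convert(s.toString)
    }
  }

  // String in, number out: length, indexOf, ...
  trait StringCalculationExpression extends Expression {
    def child: Expression
    def calc(v: String): Int
    def dataType: DataType = IntegerType
    def eval(input: Any): Any = child.eval(input) match {
      case null => null
      case s    => calc(s.toString)
    }
  }

  // Note: java.lang.String.trim strips all chars <= U+0020, not only spaces.
  case class Trim(child: Expression) extends StringTransformationExpression {
    def convert(v: String): String = v.trim
  }

  case class Length(child: Expression) extends StringCalculationExpression {
    def calc(v: String): Int = v.length
  }
}
```

With these definitions, Trim(Literal(" a b ", StringType)).eval(null) evaluates to "a b" and Length(Literal("ab", StringType)).eval(null) evaluates to 2, matching the examples in the description.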

@AmplabJenkins

Can one of the admins verify this patch?

OopsOutOfMemory changed the title from "[SQL] Add string operation function trim, ltrim, rtrim, length to support SparkSql (HiveQL)" to "[SPARK-4151][SQL] Add string operation function trim, ltrim, rtrim, length to support SparkSql (HiveQL)" on Oct 30, 2014
@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Oct 30, 2014

Test build #22550 has started for PR 2998 at commit ab29a7e.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 30, 2014

Test build #22550 has finished for PR 2998 at commit ab29a7e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait StringTransformationExpression
    • trait StringCalculationExpression
    • case class Ltrim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Rtrim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Length(child: Expression) extends UnaryExpression with StringCalculationExpression
    • case class Trim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Upper(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Lower(child: Expression) extends UnaryExpression with StringTransformationExpression

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22550/
Test FAILed.

Contributor

You have to add override def dataType = IntegerType; it is StringType by default, which is why the unit test fails.
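
A small sketch of the point made in this comment, assuming a base trait whose dataType defaults to StringType. The trait name StringExpressionLike and the simplified eval signature are hypothetical and only stand in for the real Catalyst code.

```scala
object DataTypeOverrideSketch {
  sealed trait DataType
  case object StringType extends DataType
  case object IntegerType extends DataType

  // Hypothetical base trait: the declared result type defaults to StringType.
  trait StringExpressionLike {
    def dataType: DataType = StringType
    def eval(input: String): Any
  }

  // Returns a string, so the default dataType is already correct.
  case object Upper extends StringExpressionLike {
    def eval(input: String): Any =
      if (input == null) null else input.toUpperCase
  }

  // Returns an Int, so the default must be overridden; otherwise the
  // declared type (StringType) disagrees with the value eval produces.
  case object Length extends StringExpressionLike {
    override def dataType: DataType = IntegerType
    def eval(input: String): Any =
      if (input == null) null else input.length
  }
}
```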

Contributor Author

so it is.

@chenghao-intel
Contributor

@OopsOutOfMemory thank you for working on this; it would be nice to have those functions. I have some comments on it.

@SparkQA

SparkQA commented Oct 31, 2014

Test build #22606 has started for PR 2998 at commit b7790f4.

  • This patch merges cleanly.

@OopsOutOfMemory
Contributor Author

@chenghao-intel thanks for your review and comment :)

@SparkQA

SparkQA commented Oct 31, 2014

Test build #22606 has finished for PR 2998 at commit b7790f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait StringTransformationExpression
    • trait StringCalculationExpression
    • case class Ltrim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Rtrim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Length(child: Expression) extends UnaryExpression with StringCalculationExpression
    • case class Trim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Upper(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Lower(child: Expression) extends UnaryExpression with StringTransformationExpression

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22606/
Test PASSed.

@OopsOutOfMemory
Contributor Author

@marmbrus the tests passed; this can be merged.

Contributor

Is there a reason to add this dependency to catalyst? Doesn't Scala have these string functions natively? Are these significantly faster?

Contributor Author

@marmbrus I'm not sure which is faster; they may be about the same.
But I take your point: do not add external dependencies to catalyst unless they bring a significant improvement.
The native implementation can be done with scala.collection.immutable.StringOps.
e.g.:

ltrim : str.dropWhile(_ == ' ')
rtrim : str.reverse.dropWhile(_ == ' ').reverse

So I will change this and retest it. Thanks!
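
The two one-liners above can be made runnable roughly as follows. This is a sketch rather than the code that ended up in the patch: the object name and the rtrimNoReverse variant are additions for illustration, and none of these helpers handle null input.

```scala
object NativeTrimSketch {
  // Drop leading spaces (U+0020 only), as in the comment above.
  def ltrim(str: String): String = str.dropWhile(_ == ' ')

  // Drop trailing spaces by reversing, dropping, and reversing back.
  def rtrim(str: String): String = str.reverse.dropWhile(_ == ' ').reverse

  // Alternative (not from the comment): scan back from the end,
  // which avoids building two reversed copies of the string.
  def rtrimNoReverse(str: String): String = {
    var end = str.length
    while (end > 0 && str.charAt(end - 1) == ' ') end -= 1
    str.substring(0, end)
  }

  // Trim both sides using the helpers above.
  def trim(str: String): String = ltrim(rtrim(str))
}
```

For example, NativeTrimSketch.ltrim(" a b ") gives "a b " and NativeTrimSketch.rtrim(" a b ") gives " a b", matching the results listed in the PR description. Unlike java.lang.String.trim, these helpers remove only the space character, not tabs or other control characters.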

Contributor

It would be nicer if you could provide a micro-benchmark comparison, including the regex version. :)
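
A rough sketch of what such a comparison might look like for ltrim, timing the dropWhile version against a precompiled regex. The object name, helper names, input string and iteration count are all assumptions, and a hand-rolled System.nanoTime loop like this gives only a coarse indication; a harness such as JMH would be more reliable for real measurements.

```scala
object TrimBenchSketch {
  // Candidate 1: StringOps.dropWhile.
  def ltrimDropWhile(s: String): String = s.dropWhile(_ == ' ')

  // Candidate 2: precompiled regex matching a leading run of spaces.
  private val LeadingSpaces = java.util.regex.Pattern.compile("^ +")
  def ltrimRegex(s: String): String = LeadingSpaces.matcher(s).replaceFirst("")

  // Average nanoseconds per call over `iters` calls, after an equal warm-up.
  def avgNanos(iters: Int)(f: String => String, input: String): Double = {
    var sink = 0
    var i = 0
    while (i < iters) { sink += f(input).length; i += 1 } // warm-up pass
    val start = System.nanoTime()
    i = 0
    while (i < iters) { sink += f(input).length; i += 1 } // timed pass
    val elapsed = System.nanoTime() - start
    if (sink == -1) println(sink) // use `sink` so the loops are not dead code
    elapsed.toDouble / iters
  }

  def main(args: Array[String]): Unit = {
    val input = "     a b     "
    val iters = 1000000
    println(f"dropWhile: ${avgNanos(iters)(ltrimDropWhile, input)}%.1f ns/op")
    println(f"regex:     ${avgNanos(iters)(ltrimRegex, input)}%.1f ns/op")
  }
}
```

The numbers depend heavily on the JVM, the input length, and how many leading spaces there are, so several input shapes should be tried before drawing conclusions.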

Contributor Author

@chenghao-intel that's good advice; I should also look into their implementations. I will do it later :)

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22793 has started for PR 2998 at commit dca6adb.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22793 has finished for PR 2998 at commit dca6adb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait StringTransformationExpression
    • trait StringCalculationExpression
    • case class Ltrim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Rtrim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Length(child: Expression) extends UnaryExpression with StringCalculationExpression
    • case class Trim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Upper(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Lower(child: Expression) extends UnaryExpression with StringTransformationExpression

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22793/
Test PASSed.

Contributor

indent issue

@tianyi
Contributor

tianyi commented Nov 4, 2014

You have added an empty file "case sensitivity" to the golden files. Is it related to this PR?

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22855 has started for PR 2998 at commit 5989358.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22856 has started for PR 2998 at commit 0925b32.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22855 has finished for PR 2998 at commit 5989358.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait StringTransformationExpression
    • trait StringCalculationExpression
    • case class Ltrim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Rtrim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Length(child: Expression) extends UnaryExpression with StringCalculationExpression
    • case class Trim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Upper(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Lower(child: Expression) extends UnaryExpression with StringTransformationExpression

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22855/
Test PASSed.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22856 has finished for PR 2998 at commit 0925b32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait StringTransformationExpression
    • trait StringCalculationExpression
    • case class Ltrim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Rtrim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Length(child: Expression) extends UnaryExpression with StringCalculationExpression
    • case class Trim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Upper(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Lower(child: Expression) extends UnaryExpression with StringTransformationExpression

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22856/
Test PASSed.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22861 has started for PR 2998 at commit addfbd9.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22861 has finished for PR 2998 at commit addfbd9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait StringTransformationExpression
    • trait StringCalculationExpression
    • case class Ltrim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Rtrim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Length(child: Expression) extends UnaryExpression with StringCalculationExpression
    • case class Trim(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Upper(child: Expression) extends UnaryExpression with StringTransformationExpression
    • case class Lower(child: Expression) extends UnaryExpression with StringTransformationExpression

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22861/
Test PASSed.

@OopsOutOfMemory
Contributor Author

@tianyi thanks for your review and comment : )

@marmbrus
Contributor

Thanks for working on this, but I'm afraid that our current approach to adding functions is becoming unsustainable. I've detailed the reasons in SPARK-4867. For this reason, I propose we close this issue for now and reopen it once that work is complete. What do you think?

asfgit closed this in ca12608 on Dec 17, 2014
@OopsOutOfMemory
Contributor Author

@marmbrus
Sorry for the late reply.
Yes, I agree with you :)
