[SPARK-4151][SQL] Add string operation function trim, ltrim, rtrim, length to support SparkSql (HiveQL) #2998

OopsOutOfMemory · 2014-10-29T12:09:33Z

@marmbrus @chenghao-intel
Add four string functions (trim, ltrim, rtrim, length) to Spark SQL and HiveQL.
e.g.:
sql("select trim(' a b ') from src ").collect() --> 'a b'
sql("select ltrim(' a b ') from src ").collect() --> 'a b '
sql("select rtrim(' a b ') from src ").collect() --> ' a b'
sql("select length('ab') from src ").collect() --> 2

This PR also renames a trait in stringOperations.scala.
I prefer to rename the trait CaseConversionExpression to StringTransformationExpression. The new name makes more sense because the trait can then cover string transformations in general, not only case conversion.

It also adds a trait StringCalculationExpression for string computations such as length, indexOf, etc.
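For readers without the diff at hand, the split between the two traits can be sketched roughly as follows. This is a simplified, standalone model and not the Catalyst code: the names Expr, StringTransformation, StringCalculation, convert and calculate are assumptions made for illustration only.

```scala
object StringOpsSketch {
  // Hypothetical, simplified stand-ins; these are NOT the Catalyst classes.
  sealed trait Expr { def eval(input: String): Any }

  // String => String operations (upper, lower, trim, ...).
  trait StringTransformation extends Expr {
    def convert(v: String): String
    // Null input propagates as null, otherwise apply the transformation.
    def eval(input: String): Any = if (input == null) null else convert(input)
  }

  // String => Int operations (length, indexOf, ...).
  trait StringCalculation extends Expr {
    def calculate(v: String): Int
    // Null input propagates as null, otherwise compute the value.
    def eval(input: String): Any = if (input == null) null else calculate(input)
  }

  case object Upper extends StringTransformation {
    def convert(v: String): String = v.toUpperCase
  }
  case object Trim extends StringTransformation {
    def convert(v: String): String = v.trim
  }
  case object Length extends StringCalculation {
    def calculate(v: String): Int = v.length
  }
}
```

In this model, a transformation keeps the string type while a calculation produces a number, which is the distinction the two trait names are meant to express.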

AmplabJenkins · 2014-10-29T12:12:10Z

Can one of the admins verify this patch?

marmbrus · 2014-10-30T19:38:29Z

ok to test

SparkQA · 2014-10-30T19:45:03Z

Test build #22550 has started for PR 2998 at commit ab29a7e.

This patch merges cleanly.

SparkQA · 2014-10-30T20:17:06Z

Test build #22550 has finished for PR 2998 at commit ab29a7e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait StringTransformationExpression
- trait StringCalculationExpression
- case class Ltrim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Rtrim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Length(child: Expression) extends UnaryExpression with StringCalculationExpression
- case class Trim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Upper(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Lower(child: Expression) extends UnaryExpression with StringTransformationExpression

AmplabJenkins · 2014-10-30T20:17:09Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22550/
Test FAILed.

chenghao-intel · 2014-10-31T05:48:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala

You have to override dataType here (override def dataType = IntegerType); it is StringType by default. That is why the unit test fails.
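A minimal sketch of the problem described in this comment, assuming a base trait that defaults dataType to a string type. All types below (DataType, StringType, IntegerType, StringExpr) are simplified stand-ins defined locally, not the Catalyst ones, and the consistent helper is hypothetical.

```scala
object DataTypeSketch {
  // Hypothetical, simplified stand-ins; NOT the Catalyst types.
  sealed trait DataType
  case object StringType extends DataType
  case object IntegerType extends DataType

  trait StringExpr {
    // Default inherited by every string expression unless overridden.
    def dataType: DataType = StringType
    def eval(input: String): Any
  }

  // Inherits the default: declares StringType but evaluates to an Int.
  case object LengthWithoutOverride extends StringExpr {
    def eval(input: String): Any = input.length
  }

  // Declared type now matches the evaluated value.
  case object LengthWithOverride extends StringExpr {
    override def dataType: DataType = IntegerType
    def eval(input: String): Any = input.length
  }

  // True when the runtime value agrees with the declared type.
  def consistent(e: StringExpr, input: String): Boolean =
    (e.dataType, e.eval(input)) match {
      case (StringType, _: String) => true
      case (IntegerType, _: Int)   => true
      case _                       => false
    }
}
```

Here consistent(LengthWithoutOverride, "ab") is false and consistent(LengthWithOverride, "ab") is true, which mirrors the kind of mismatch a type-checking test would catch.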

chenghao-intel · 2014-10-31T06:09:45Z

@OopsOutOfMemory thank you for working on this; it would be nice to have those functions. I have some comments on it.

SparkQA · 2014-10-31T07:24:51Z

Test build #22606 has started for PR 2998 at commit b7790f4.

This patch merges cleanly.

OopsOutOfMemory · 2014-10-31T07:32:54Z

@chenghao-intel thanks for your review and comment :)

SparkQA · 2014-10-31T08:15:56Z

Test build #22606 has finished for PR 2998 at commit b7790f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait StringTransformationExpression
- trait StringCalculationExpression
- case class Ltrim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Rtrim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Length(child: Expression) extends UnaryExpression with StringCalculationExpression
- case class Trim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Upper(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Lower(child: Expression) extends UnaryExpression with StringTransformationExpression

AmplabJenkins · 2014-10-31T08:15:59Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22606/
Test PASSed.

OopsOutOfMemory · 2014-11-01T06:04:48Z

@marmbrus the tests passed; this can be merged.

marmbrus · 2014-11-03T00:19:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala

Is there a reason to add this dependency to catalyst? Doesn't Scala have these string functions natively? Are these significantly faster?

@marmbrus I'm not sure which is faster; they may well be the same.
But I take your point: do not add external dependency libraries to catalyst unless they bring a significant improvement.
The native implementation can be done with scala.collection.immutable.StringOps.
e.g.:

ltrim: str.dropWhile(_ == ' ')
rtrim: str.reverse.dropWhile(_ == ' ').reverse

So I will change this and retest it. Thanks!
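The two one-liners above can be wrapped into a small self-contained object to check them against the examples in the PR description. The object name NativeTrim and the composed trim helper are assumptions for illustration; the code actually committed in this PR may differ.

```scala
object NativeTrim {
  // Drop leading spaces only (the ' ' character, not tabs or newlines).
  def ltrim(s: String): String = s.dropWhile(_ == ' ')

  // Drop trailing spaces by reversing, dropping leading spaces, reversing back.
  def rtrim(s: String): String = s.reverse.dropWhile(_ == ' ').reverse

  // Assumed composition: strip both ends using the two helpers above.
  def trim(s: String): String = rtrim(ltrim(s))
}
```

Note that these helpers strip only the space character, whereas java.lang.String.trim removes every leading and trailing character with a code point up to U+0020 (tabs and newlines included), so the two can disagree on such input. The double reverse in rtrim also allocates two intermediate strings.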

It would be nicer if you could provide a micro-benchmark comparison :), including the regex version.

@chenghao-intel that's good advice; I should also look into their implementations. I will do it later :)
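No benchmark numbers were posted in this thread. As a rough idea of what such a comparison could look like, here is a crude timing sketch for ltrim. The regex "^ +" is an assumed stand-in for the earlier regex version (which is not shown here), the names TrimBench and time are hypothetical, and a harness such as JMH would give far more trustworthy results than a hand-rolled loop.

```scala
object TrimBench {
  // Native version, as proposed in the thread.
  val ltrimNative: String => String = _.dropWhile(_ == ' ')

  // Assumed regex version: remove one or more spaces at the start.
  val ltrimRegex: String => String = _.replaceAll("^ +", "")

  // Average nanoseconds per call, after one warm-up pass of the same size.
  def time(f: String => String, input: String, iters: Int): Double = {
    var sink = 0
    var i = 0
    while (i < iters) { sink += f(input).length; i += 1 } // warm-up
    i = 0
    val start = System.nanoTime()
    while (i < iters) { sink += f(input).length; i += 1 }
    val elapsed = System.nanoTime() - start
    // Use the accumulated value so the loop body is not trivially dead code.
    if (sink == Int.MinValue) println(sink)
    elapsed.toDouble / iters
  }

  def main(args: Array[String]): Unit = {
    val input = "     a b     "
    val iters = 200000
    println(f"native: ${time(ltrimNative, input, iters)}%.1f ns/op")
    println(f"regex:  ${time(ltrimRegex, input, iters)}%.1f ns/op")
  }
}
```

The printed figures depend on the JVM, hardware and input, so they should be read as indicative at best.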

SparkQA · 2014-11-03T02:40:01Z

Test build #22793 has started for PR 2998 at commit dca6adb.

This patch merges cleanly.

SparkQA · 2014-11-03T03:34:10Z

Test build #22793 has finished for PR 2998 at commit dca6adb.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait StringTransformationExpression
- trait StringCalculationExpression
- case class Ltrim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Rtrim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Length(child: Expression) extends UnaryExpression with StringCalculationExpression
- case class Trim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Upper(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Lower(child: Expression) extends UnaryExpression with StringTransformationExpression

AmplabJenkins · 2014-11-03T03:34:12Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22793/
Test PASSed.

tianyi · 2014-11-04T01:45:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala

indent issue

tianyi · 2014-11-04T01:59:46Z

You have added an empty file "case sensitivity" in the golden files. Is it related to this PR?

SparkQA · 2014-11-04T03:10:13Z

Test build #22855 has started for PR 2998 at commit 5989358.

This patch merges cleanly.

SparkQA · 2014-11-04T03:12:32Z

Test build #22856 has started for PR 2998 at commit 0925b32.

This patch merges cleanly.

SparkQA · 2014-11-04T04:05:55Z

Test build #22855 has finished for PR 2998 at commit 5989358.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait StringTransformationExpression
- trait StringCalculationExpression
- case class Ltrim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Rtrim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Length(child: Expression) extends UnaryExpression with StringCalculationExpression
- case class Trim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Upper(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Lower(child: Expression) extends UnaryExpression with StringTransformationExpression

AmplabJenkins · 2014-11-04T04:05:57Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22855/
Test PASSed.

SparkQA · 2014-11-04T04:07:51Z

Test build #22856 has finished for PR 2998 at commit 0925b32.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait StringTransformationExpression
- trait StringCalculationExpression
- case class Ltrim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Rtrim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Length(child: Expression) extends UnaryExpression with StringCalculationExpression
- case class Trim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Upper(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Lower(child: Expression) extends UnaryExpression with StringTransformationExpression

AmplabJenkins · 2014-11-04T04:07:54Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22856/
Test PASSed.

SparkQA · 2014-11-04T05:04:55Z

Test build #22861 has started for PR 2998 at commit addfbd9.

This patch merges cleanly.

SparkQA · 2014-11-04T06:06:07Z

Test build #22861 has finished for PR 2998 at commit addfbd9.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait StringTransformationExpression
- trait StringCalculationExpression
- case class Ltrim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Rtrim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Length(child: Expression) extends UnaryExpression with StringCalculationExpression
- case class Trim(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Upper(child: Expression) extends UnaryExpression with StringTransformationExpression
- case class Lower(child: Expression) extends UnaryExpression with StringTransformationExpression

AmplabJenkins · 2014-11-04T06:06:09Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22861/
Test PASSed.

OopsOutOfMemory · 2014-11-04T09:49:55Z

@tianyi thanks for your review and comment : )

marmbrus · 2014-12-17T19:24:02Z

Thanks for working on this, but I'm afraid that our current approach to adding functions is becoming unsustainable. I've detailed the reasons in SPARK-4867. For this reason, I propose we close this issue for now and reopen it once that work is complete. What do you think?

OopsOutOfMemory · 2014-12-19T14:27:53Z

@marmbrus
Sorry for the late reply.
Yeah, I agree with you~
:)

OopsOutOfMemory added 15 commits October 28, 2014 16:43

Add description of SQL CLI and HiveServer.

5809496

Merge branch 'master' of https://github.com/apache/spark

c577589

Merge branch 'master' of https://github.com/apache/spark

e654704

Add system function trim

3269eab

HiveQL support trim

ce6899a

modify keyword of Trim

e2781ee

correct spelling mistake

2166c77

correct spelling mistake

a4f4e4b

add test suit for trim

2137e28

support 3 functions : ltrim rtrim length in SparkQL and HiveQl

10d8ace

change the name of the trait CaseConversionExpression to StringTransf…

0a0f4e0

…ormationExpression and add a StringCalculationExpression for string calculation in stringOperation.scala , eg: trim is for transformation and length is for calculation

change return type

0fa2cd6

deleted: sql/README.md

b358048

new file: sql/README.md

558d7bf

Merge branch 'sparksql' of https://github.com/OopsOutOfMemory/spark i…

ab29a7e

…nto sparksql

OopsOutOfMemory changed the title ~~[SQL] Add string operation function trim, ltrim, rtrim, length to support SparkSql (HiveQL)~~ [SPARK-4151][SQL] Add string operation function trim, ltrim, rtrim, length to support SparkSql (HiveQL) Oct 30, 2014

chenghao-intel reviewed Oct 31, 2014

OopsOutOfMemory added 2 commits October 31, 2014 15:11

change the implementation of ltrim and trim from regex to apache comm…

57111f5

…ons-lang StringUtils

Merge branch 'master' of https://github.com/OopsOutOfMemory/spark int…

b7790f4

…o sparksql

marmbrus reviewed Nov 3, 2014

change to scala native implementation

dca6adb

tianyi reviewed Nov 4, 2014

OopsOutOfMemory added 2 commits November 4, 2014 11:02

fix ident issues and add test case

5989358

spelling correct

0925b32

remove unused golden file

addfbd9

asfgit closed this in ca12608 Dec 17, 2014
