-
Notifications
You must be signed in to change notification settings - Fork 29.1k
[SPARK-4151][SQL] Add string operation function trim, ltrim, rtrim, length to support SparkSql (HiveQL) #2998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ormationExpression and add a StringCalculationExpression for string calculation in stringOperation.scala , eg: trim is for transformation and length is for calculation
|
Can one of the admins verify this patch? |
|
ok to test |
|
Test build #22550 has started for PR 2998 at commit
|
|
Test build #22550 has finished for PR 2998 at commit
|
|
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to override the override def dataType=IntegerType, it's StringType by default. That's why it causes failure in the unittest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it is.
|
@OopsOutOfMemory thank you working on this, it will be nice if we have those functions. I have some comments on it. |
…ons-lang StringUtils
|
Test build #22606 has started for PR 2998 at commit
|
|
@chenghao-intel thanks for your review and comment :) |
|
Test build #22606 has finished for PR 2998 at commit
|
|
Test PASSed. |
|
@marmbrus test passed, this can be merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to add this dependency to catalyst? Doesn't Scala have these string functions natively? Are these significantly faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marmbrus I'm not sure which is more faster, may be could be the same.
But I got your idea is do not add external dependency libs into catalyst if only they have significantly improvement.
The native implementation can be done with scala.collection.immutable.StringOps.
eg:
ltrim : str.dropWhile ( _ == ' ')
rtrim : str.reverse.dropWhile(_ == ' ').reverse
So I will change this and retest it . Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be nicer if you can provide a micro-benchmark comparison. :), and also the regex version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chenghao-intel it's a good advice, this also should look into the their implementations. I will do it later :)
|
Test build #22793 has started for PR 2998 at commit
|
|
Test build #22793 has finished for PR 2998 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indent issue
|
You have added a empty file "case sensitivity" in golden files, is it related to this PR? |
|
Test build #22855 has started for PR 2998 at commit
|
|
Test build #22856 has started for PR 2998 at commit
|
|
Test build #22855 has finished for PR 2998 at commit
|
|
Test PASSed. |
|
Test build #22856 has finished for PR 2998 at commit
|
|
Test PASSed. |
|
Test build #22861 has started for PR 2998 at commit
|
|
Test build #22861 has finished for PR 2998 at commit
|
|
Test PASSed. |
|
@tianyi thanks for your review and comment : ) |
|
Thanks for working on this, but I'm afraid that our current approach to adding functions is becoming unsustainable. I've detailed the reasons in SPARK-4867. For this reason, I propose we close this issue for now and reopen it once that work is complete. What do you think? |
|
@marmbrus |
@marmbrus @chenghao-intel
Add three string operation functions to support spark sql and hiveql.
eg:
sql("select trim(' a b ') from src ").collect() --> 'a b'
sql("select ltrim(' a b ') from src ").collect() --> 'a b '
sql("select rtrim(' a b ') from src ").collect() --> ' a b'
sql("select length('ab') from src ").collect() --> 2
And Rename the trait of stringOperations.scala.
I prefer to rename trait CaseConversionExpression to StringTransformationExpression, it is more make sence than before so that this trait can support more string transformation but not only caseconversion.
And also add a trait StringCalculationExpression that do string computation like length, indexof etc....