[SPARK-14402][SQL] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string by dongjoon-hyun · Pull Request #12175 · apache/spark

dongjoon-hyun · 2016-04-05T09:36:20Z

What changes were proposed in this pull request?

Currently, SparkSQL's initCap uses the toTitleCase function. However, the UTF8String.toTitleCase implementation changes only the first letter and just copies the other letters, e.g. sParK --> SParK. This is the correct behavior for toTitleCase, but it does not match Hive's initcap:

hive> select initcap('sParK');
Spark

scala> sql("select initcap('sParK')").head
res0: org.apache.spark.sql.Row = [SParK]

This PR updates the implementation of initcap using toLowerCase and toTitleCase.
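To make the difference concrete, here is a minimal sketch on plain Scala strings. The object `InitCapSketch` and its two helpers are hypothetical stand-ins written for illustration; they are not Spark's `UTF8String` methods, and they assume words are separated by single spaces.

```scala
import java.util.Locale

// Illustrative stand-ins on plain Strings; NOT Spark's UTF8String API.
object InitCapSketch {
  // Uppercase the first character of each space-separated word and
  // copy every other character unchanged (the old initcap behavior).
  def titleCase(s: String): String = {
    val sb = new StringBuilder(s.length)
    var startOfWord = true
    for (c <- s) {
      sb.append(if (startOfWord) c.toUpper else c)
      startOfWord = c == ' '
    }
    sb.toString
  }

  // Lowercase the whole string first, then title-case it
  // (the behavior this PR proposes, matching the Hive output above).
  def initCap(s: String): String = titleCase(s.toLowerCase(Locale.ROOT))
}
```

With these helpers, `InitCapSketch.titleCase("sParK")` gives `SParK`, while `InitCapSketch.initCap("sParK")` gives `Spark`.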

How was this patch tested?

Pass the Jenkins tests (including new testcase).

SparkQA · 2016-04-05T11:39:44Z

Test build #54976 has finished for PR 12175 at commit d165bad.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-04-05T17:12:44Z

Hi, @srowen. I minimized the change on master.

Undo the changes to the common module.
Implement initCap with the following changes in stringExpressions.scala of the catalyst module.

   override def nullSafeEval(string: Any): Any = {
-    string.asInstanceOf[UTF8String].toTitleCase
+    string.asInstanceOf[UTF8String].toLowerCase.toTitleCase
   }
   override def genCode(ctx: CodegenContext, ev: ExprCode): String = {
-    defineCodeGen(ctx, ev, str => s"$str.toTitleCase()")
+    defineCodeGen(ctx, ev, str => s"$str.toLowerCase().toTitleCase()")
   }

I think this is enough for the initCap function as a small fix for now. What do you think about this?

srowen · 2016-04-05T17:34:22Z

I think that's pretty reasonable as a minimally invasive fix. CC @marmbrus for visibility as it's technically a behavior change

dongjoon-hyun · 2016-04-05T17:39:18Z

Thank you, @srowen !

marmbrus · 2016-04-05T18:10:36Z

It does seem reasonable to match hive since that was probably the original intention. I've tagged the JIRA for inclusion in the release notes. A few comments:

Update the scala doc to correctly describe the new behavior.
(existing) it would be great to add an expression description annotation to InitCap
(minor) We are now double allocating the byte array. It might be nice to actually implement this in UTF8String, but we don't have to do this to merge this PR.
Is there no way to do title case anymore?
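The double-allocation point in the list above can be illustrated on plain strings. The following is only a sketch under stated assumptions: `SinglePassInitCap` is a hypothetical name, it works on `String` rather than on `UTF8String` bytes, it treats a single space as the only word separator, and it cases one UTF-16 char at a time, so supplementary characters are not handled.

```scala
// Hypothetical single-pass variant; NOT the UTF8String implementation.
object SinglePassInitCap {
  // Fill one output buffer in one pass: uppercase at word starts,
  // lowercase everywhere else, with no intermediate lowercased copy.
  def initCap(s: String): String = {
    val out = new Array[Char](s.length)
    var startOfWord = true
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      out(i) =
        if (startOfWord) Character.toUpperCase(c)
        else Character.toLowerCase(c)
      startOfWord = c == ' '
      i += 1
    }
    new String(out)
  }
}
```

Chaining a lowercase call and a title-case call builds a full intermediate string that is discarded immediately; the single pass decides the case of each character once, which is the kind of saving the comment suggests could later live in UTF8String itself.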

dongjoon-hyun · 2016-04-05T18:18:16Z

Thank you, @marmbrus. I will update the scaladoc and add the description annotation for InitCap.
As for the last comment, if you agree, I hope to implement toTitleCase as a new function (in another PR).

SparkQA · 2016-04-05T18:25:42Z

Test build #55000 has finished for PR 12175 at commit 69c6e1c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-05T19:41:34Z

Test build #55006 has finished for PR 12175 at commit e6258aa.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-05T20:15:34Z

Test build #55009 has finished for PR 12175 at commit 003ab98.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-04-05T20:29:49Z

Finally, the updated code passes the Jenkins tests.
Please let me know if there is something I missed.
Thank you, @marmbrus and @srowen.

marmbrus · 2016-04-05T20:30:35Z

Thanks, merging to master.

[SPARK-14402][CORE] UTF8String.toTitleCase should return non-first le…

d165bad

…tters in lowercase

dongjoon-hyun changed the title ~~[SPARK-14402][CORE] UTF8String.toTitleCase should return non-first letters in lowercase~~ [WIP][SPARK-14402][CORE] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string Apr 5, 2016

Undo the changes on toTitleCase function and implement initCap in…

69c6e1c

… stringExpression.

dongjoon-hyun changed the title ~~[WIP][SPARK-14402][CORE] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string~~ [SPARK-14402][CORE] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string Apr 5, 2016

dongjoon-hyun changed the title ~~[SPARK-14402][CORE] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string~~ [SPARK-14402][SQL] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string Apr 5, 2016

Add ExpressionDescription annotation and update scaladoc.

e6258aa

Fix unittest.

003ab98

asfgit closed this in c59abad Apr 5, 2016

dongjoon-hyun deleted the SPARK-14402 branch May 12, 2016 00:57

dongjoon-hyun wants to merge 4 commits into apache:master from dongjoon-hyun:SPARK-14402
