-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-8271][SQL]string function: soundex #7115
Conversation
@@ -153,7 +153,8 @@ object FunctionRegistry { | |||
expression[Substring]("substr"), | |||
expression[Substring]("substring"), | |||
expression[Upper]("ucase"), | |||
expression[Upper]("upper") | |||
expression[Upper]("upper"), | |||
expression[SoundEx]("SoundEx") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The registered function name should be in lower case, and sort the code in alphabet order.
if (string == null) { | ||
null | ||
} else { | ||
val str: String = string.toString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: you don't need to declare the type explicitly. val str = string.toString
will be OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Remove the : String
.
@hujiayin can you rebase the code? |
@rxin @davies @liancheng can you trigger the unit test? |
Test build #999 has finished for PR 7115 at commit
|
null | ||
} else { | ||
val str: String = string.toString | ||
UTF8String.fromString(so.encode(str)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to return null if there is an exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also can you copy the body of common codec's soundex function into UTF8String?
ok to test |
} else { | ||
try { | ||
val str: String = string.toString | ||
UTF8String.fromString(so.encode(str)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hujiayin I was talking about this line (in the other pr #7208). Even if str = ""
Soundex gets created (because of so.encode
), but as @chenghao-intel mentioned, it might be no big deal.
if(idxp >= 0 && idxp <= US_ENGLISH_MAPPING.length | ||
&& US_ENGLISH_MAPPING[idxp] - '0' == 0 | ||
&& ch - data[j-1] == 0) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: move the {
to the end of last line.
Looks much better now, except some minor issues. And I didn't check the logic of |
Test build #37565 has finished for PR 7115 at commit
|
Test build #37581 has finished for PR 7115 at commit
|
cc @davies for review too |
if (j > 3) break; | ||
if (getByte(i) == getByte(i - 1) | ||
|| (getByte(i) - 32 == getByte(i - 1) | ||
|| (getByte(i) == getByte(i - 1) - 32))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not correct, for example, A = ! + 32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the check was added
Test build #37785 has finished for PR 7115 at commit
|
} | ||
if (ch - '0' > 0) { | ||
tmp = Character.toString(ch); | ||
System.arraycopy(tmp.getBytes(), 0, data, j, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems too heavy, can we update the data
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just tested.
scala> val c = 'A'
c: Char = A
scala> Character.toString(c)
res0: String = A
scala> val c = '0'
c: Char = 0
scala> Character.toString(c)
res1: String = 0
As we are sure the ch
will be the character defined in US_ENGLISH_MAPPING
, only a single byte will be copied, then we can simplify the code like:
data[j] = (byte)ch;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ch is just one byte and the data is only 4 bytes. what about the type conversion failed, from char to byte : (
@hujiayin Thanks for working on this, do you mind me to take over it? |
ok : ) the unit test is ok from my side. |
This PR brings SQL function soundex(), see https://issues.apache.org/jira/browse/HIVE-9738 It's based on #7115 , thanks to HuJiayin Author: HuJiayin <jiayin.hu@intel.com> Author: Davies Liu <davies@databricks.com> Closes #7812 from davies/soundex and squashes the following commits: fa75941 [Davies Liu] Merge branch 'master' of github.com:apache/spark into soundex a4bd6d8 [Davies Liu] fix soundex 2538908 [HuJiayin] add codegen soundex d15d329 [HuJiayin] add back ut ded1a14 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark e2dec2c [HuJiayin] support soundex rebase code
Add soundex for SQL
It is written according to https://en.wikipedia.org/wiki/Soundex
b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6
Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261" and not "A226" (the chars 's' and 'c' in the name would receive a single number of 2 and not 22 since an 'h' lies in between them). "Tymczak" yields "T522" not "T520" (the chars 'z' and 'k' in the name are coded as 2 twice since a vowel lies in between them). "Pfister" yields "P236" not "P123" (the first two letters have the same number and are coded once as 'P').