[SPARK-8271][SQL]string function: soundex #7812
for (int i = 1; i < numBytes; i++) {
  b = getByte(i);
  if ('a' <= b && b <= 'z') {
@hujiayin are you saying the current code has a problem or the previous code had a problem?
@rxin The current code has a problem. I have encountered Chinese characters that are encoded as multiple bytes, and one of those bytes falls in the range 'a' to 'z'.
In UTF-8, every byte of a multi-byte character is 128 or greater (negative as a signed Java byte), so it can never overlap with an ASCII byte such as 'a' to 'z'. This is the beauty of UTF-8; see https://en.wikipedia.org/wiki/UTF-8.
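The point about UTF-8 byte ranges can be checked with a small standalone sketch. The class and method names here (`Utf8ByteRanges`, `hasLowercaseAsciiByte`) are hypothetical and not part of this PR; the loop only mirrors the comparison from the diff above, applied to the bytes of an ordinary Java `String`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not code from this PR.
class Utf8ByteRanges {
    /** Returns true if any byte of the UTF-8 encoding of s lies in 'a'..'z'. */
    static boolean hasLowercaseAsciiByte(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Same comparison as in the diff: the byte is promoted to int,
            // so bytes >= 0x80 compare as negative values.
            if ('a' <= b && b <= 'z') {
                return true;
            }
        }
        return false;
    }

    /** Returns true if every byte of the UTF-8 encoding of s is negative as a Java byte. */
    static boolean allBytesNegative(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (b >= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two CJK characters, three UTF-8 bytes each
        System.out.println(hasLowercaseAsciiByte(chinese)); // prints "false"
        System.out.println(allBytesNegative(chinese));      // prints "true"
        System.out.println(hasLowercaseAsciiByte("a" + chinese)); // prints "true"
    }
}
```

If a byte in the 'a' to 'z' range does show up inside a multi-byte character, the input is most likely in some other encoding rather than UTF-8.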
Yes, maybe the Chinese encoding I encountered previously was not standard UTF-8.
Test build #39128 has finished for PR 7812 at commit
|
Test build #1252 has finished for PR 7812 at commit
|
the other problem is |
@hujiayin This situation is not well defined. Some non-letters, for example digits and '-', should be skipped; see http://rosettacode.org/wiki/Soundex. The current approach just treats all non-letters the same way. We can revisit this if it really bothers users.
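For reference, here is a minimal sketch of the "skip non-letters" behaviour, written as plain Java over a `String`. It is not the implementation merged in this PR: the class name `SoundexSketch`, the code table, and the use of '7' as a marker for h and w are all assumptions made for illustration, and it follows the common American Soundex rules in which h and w do not separate equal codes but vowels do.

```java
// Hypothetical illustration; not the UTF8String code from this PR.
class SoundexSketch {
    // Codes for 'A'..'Z'. '0' marks vowels and y; '7' is an assumed marker for h and w.
    private static final String CODES = "01230127022455012623017202";

    static String soundex(String s) {
        StringBuilder out = new StringBuilder(4);
        char last = '0';
        for (int i = 0; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if ('a' <= c && c <= 'z') {
                c = (char) (c - 'a' + 'A'); // ASCII-only upper-casing
            }
            if (c < 'A' || c > 'Z') {
                continue; // assumption: digits, '-', and all other non-letters are skipped
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                last = code;
            } else if (code == '7') {
                // h and w are ignored and do not separate equal codes
            } else if (code == '0') {
                last = '0'; // a vowel separates equal codes
            } else {
                if (code != last) {
                    out.append(code);
                }
                last = code;
            }
        }
        if (out.length() == 0) {
            return ""; // no letters at all
        }
        while (out.length() < 4) {
            out.append('0'); // pad to one letter plus three digits
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Robert"));   // prints "R163"
        System.out.println(soundex("Ashcraft")); // prints "A261"
        System.out.println(soundex("Tymczak"));  // prints "T522"
        System.out.println(soundex("Pfister"));  // prints "P236"
    }
}
```

With this variant, a hyphen or digit inside a name has no effect on the result, so "Ro-bert" encodes the same as "Robert".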
ping @rxin
Thanks - merging this in.
This PR adds the SQL function soundex(); see https://issues.apache.org/jira/browse/HIVE-9738
It's based on #7115, thanks to @hujiayin