[SPARK-8271][SQL]string function: soundex #7812

Closed
wants to merge 6 commits into from

@davies davies commented Jul 31, 2015

This PR adds the SQL function soundex(); see https://issues.apache.org/jira/browse/HIVE-9738

It is based on #7115, thanks to @hujiayin.


for (int i = 1; i < numBytes; i++) {
  b = getByte(i);
  if ('a' <= b && b <= 'z') {

@hujiayin Are you saying the current code has a problem, or that the previous code had a problem?

@rxin The current code has a problem. I have encountered Chinese characters that are encoded as multiple bytes, where one of those bytes has a value that falls in the range 'a' to 'z'.

In UTF-8, every byte of a multi-byte character is 128 or greater (negative when read as a signed byte), so it can never overlap with an ASCII letter. This is one of the strengths of UTF-8; see https://en.wikipedia.org/wiki/UTF-8.
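
A minimal sketch of that property, for illustration only. The class and method names are hypothetical and not part of this PR. It encodes a string as UTF-8 and counts the bytes that pass the same ASCII-letter range check used in the loop under review; bytes belonging to multi-byte characters should never be counted.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the PR.
final class Utf8AsciiOverlapCheck {
    /** Counts the UTF-8 bytes of s that fall in the ASCII letter ranges. */
    static int countAsciiLetterBytes(String s) {
        int count = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Lead and continuation bytes of multi-byte characters are 0x80..0xFF,
            // i.e. -128..-1 as a signed Java byte, so they fail both range checks.
            if (('a' <= b && b <= 'z') || ('A' <= b && b <= 'Z')) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String simplified = "\u6d4b\u8bd5"; // 测试
        String mixed = "z\u6e2c\u8a66";     // z測試
        System.out.println(countAsciiLetterBytes(simplified));
        System.out.println(countAsciiLetterBytes(mixed));
    }
}
```

Only the leading `z` in the second string is counted; the Chinese characters contribute nothing, because none of their bytes is below 128.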

Yes, maybe the Chinese encoding I ran into previously was not a standard one.

SparkQA commented Jul 31, 2015

Test build #39128 has finished for PR 7812 at commit fa75941.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 31, 2015

Test build #1252 has finished for PR 7812 at commit fa75941.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

hujy commented Jul 31, 2015

The other problem is that the current code fails on z測試 (actual: Z000; expected: z測試).
z測試 returns Z000, but 测试 returns 测试.

davies commented Jul 31, 2015

@hujiayin This situation is not well defined. Some non-letters, for example digits and '-', should be skipped; see http://rosettacode.org/wiki/Soundex. The current approach just treats all non-letters the same way.

We can revisit this if it really bothers users.
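
To make the discussed behavior concrete, here is a rough byte-level Soundex sketch. It is not the code from this PR: the class name, the helper, and the mapping table are written for illustration. It assumes two things consistent with the results reported in this thread: input whose first byte is not an ASCII letter is returned unchanged, and every non-letter byte is treated uniformly as a separator rather than being skipped.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names and structure are assumptions, not the PR's code.
final class SoundexSketch {
    // Codes for 'A'..'Z': '0' marks vowels and Y, '7' marks H and W (ignored).
    private static final byte[] CODES =
        "01230127022455012623017202".getBytes(StandardCharsets.US_ASCII);

    static String soundex(String input) {
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        if (bytes.length == 0) return input;
        int first = toUpperLetter(bytes[0]);
        if (first < 0) return input; // assumption: non-letter start returns input as-is
        byte[] out = {(byte) first, '0', '0', '0'};
        int n = 1;
        byte last = CODES[first - 'A'];
        for (int i = 1; i < bytes.length && n < 4; i++) {
            int c = toUpperLetter(bytes[i]);
            if (c < 0) {       // assumption: every non-letter byte acts as a separator
                last = '0';
                continue;
            }
            byte code = CODES[c - 'A'];
            if (code == '7') continue; // H and W do not separate equal codes
            if (code != '0' && code != last) out[n++] = code;
            last = code;
        }
        return new String(out, StandardCharsets.US_ASCII);
    }

    /** Returns the upper-case ASCII letter for b, or -1 if b is not an ASCII letter. */
    private static int toUpperLetter(byte b) {
        if ('a' <= b && b <= 'z') return b - 32;
        if ('A' <= b && b <= 'Z') return b;
        return -1;
    }
}
```

Under these assumptions, `soundex("z測試")` yields `Z000` and `soundex("测试")` returns the input unchanged, matching the results reported above. Skipping digits and '-' as Rosetta Code describes would mean replacing the separator branch with a plain `continue` for those bytes.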

davies commented Jul 31, 2015

ping @rxin

rxin commented Jul 31, 2015

Thanks - merging this in.

@asfgit asfgit closed this in 4d5a6e7 Jul 31, 2015