Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8271][SQL]string function: soundex #7115

Closed
wants to merge 4 commits into from
Closed

Conversation

hujy
Copy link
Contributor

@hujy hujy commented Jun 30, 2015

Add soundex for SQL

It is written according to https://en.wikipedia.org/wiki/Soundex

  1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
  2. Replace consonants with digits as follows (after the first letter):
    b, f, p, v → 1
    c, g, j, k, q, s, x, z → 2
    d, t → 3
    l → 4
    m, n → 5
    r → 6
  3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
  4. Iterate the previous step until you have one letter and three numbers. If you have too few letters in your word that you can't assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261" and not "A226" (the chars 's' and 'c' in the name would receive a single number of 2 and not 22 since an 'h' lies in between them). "Tymczak" yields "T522" not "T520" (the chars 'z' and 'k' in the name are coded as 2 twice since a vowel lies in between them). "Pfister" yields "P236" not "P123" (the first two letters have the same number and are coded once as 'P').

@@ -153,7 +153,8 @@ object FunctionRegistry {
expression[Substring]("substr"),
expression[Substring]("substring"),
expression[Upper]("ucase"),
expression[Upper]("upper")
expression[Upper]("upper"),
expression[SoundEx]("SoundEx")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The registered function name should be in lower case, and sort the code in alphabet order.

if (string == null) {
null
} else {
val str: String = string.toString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you don't need to declare the type explicitly. val str = string.toString will be OK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Remove the : String.

@chenghao-intel
Copy link
Contributor

@hujiayin can you rebase the code?

@chenghao-intel
Copy link
Contributor

@rxin @davies @liancheng can you trigger the unit test?

@SparkQA
Copy link

SparkQA commented Jul 6, 2015

Test build #999 has finished for PR 7115 at commit f88d320.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class Word2VecModel(JavaVectorTransformer, JavaSaveable, JavaLoader):
    • case class CreateNamedStruct(children: Seq[Expression]) extends Expression
    • case class Factorial(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class ShiftLeft(left: Expression, right: Expression) extends BinaryExpression
    • case class ShiftRight(left: Expression, right: Expression) extends BinaryExpression
    • case class Md5(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Sha1(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Crc32(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Not(child: Expression) extends UnaryExpression with Predicate with ExpectsInputTypes
    • trait StringRegexExpression extends ExpectsInputTypes
    • trait CaseConversionExpression extends ExpectsInputTypes
    • trait StringComparison extends ExpectsInputTypes
    • case class StringLength(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class SoundEx(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • protected[sql] abstract class AtomicType extends DataType
    • abstract class NumericType extends AtomicType
    • abstract class DataType extends AbstractDataType

null
} else {
val str: String = string.toString
UTF8String.fromString(so.encode(str))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to return null if there is an exception

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also can you copy the body of common codec's soundex function into UTF8String?

@liancheng
Copy link
Contributor

ok to test

} else {
try {
val str: String = string.toString
UTF8String.fromString(so.encode(str))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hujiayin I was talking about this line (in the other pr #7208). Even if str = "" Soundex gets created (because of so.encode), but as @chenghao-intel mentioned, it might be no big deal.

if(idxp >= 0 && idxp <= US_ENGLISH_MAPPING.length
&& US_ENGLISH_MAPPING[idxp] - '0' == 0
&& ch - data[j-1] == 0)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: move the { to the end of last line.

@chenghao-intel
Copy link
Contributor

Looks much better now, except some minor issues. And I didn't check the logic of soundex, it will be great if you can add more unit test for soundex.
Some examples can be found at https://issues.apache.org/jira/browse/HIVE-9738

@SparkQA
Copy link

SparkQA commented Jul 17, 2015

Test build #37565 has finished for PR 7115 at commit 9a35ada.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SoundEx(child: Expression) extends UnaryExpression with ExpectsInputTypes

@SparkQA
Copy link

SparkQA commented Jul 17, 2015

Test build #37581 has finished for PR 7115 at commit a2d6313.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SoundEx(child: Expression) extends UnaryExpression with ExpectsInputTypes

@rxin
Copy link
Contributor

rxin commented Jul 18, 2015

cc @davies for review too

if (j > 3) break;
if (getByte(i) == getByte(i - 1)
|| (getByte(i) - 32 == getByte(i - 1)
|| (getByte(i) == getByte(i - 1) - 32))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct, for example, A = ! + 32

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the check was added

@SparkQA
Copy link

SparkQA commented Jul 20, 2015

Test build #37785 has finished for PR 7115 at commit 8302b94.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class SoundEx(child: Expression) extends UnaryExpression with ExpectsInputTypes

}
if (ch - '0' > 0) {
tmp = Character.toString(ch);
System.arraycopy(tmp.getBytes(), 0, data, j, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems too heavy, can we update the data directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just tested.

scala> val c = 'A'
c: Char = A

scala> Character.toString(c)
res0: String = A

scala> val c = '0'
c: Char = 0

scala> Character.toString(c)
res1: String = 0

As we are sure the ch will be the character defined in US_ENGLISH_MAPPING, only a single byte will be copied, then we can simplify the code like:

  data[j] = (byte)ch;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ch is just one byte and the data is only 4 bytes. what about the type conversion failed, from char to byte : (

@davies
Copy link
Contributor

davies commented Jul 30, 2015

@hujiayin Thanks for working on this, do you mind me to take over it?

@hujy
Copy link
Contributor Author

hujy commented Jul 31, 2015

ok : ) the unit test is ok from my side.

asfgit pushed a commit that referenced this pull request Jul 31, 2015
This PR brings SQL function soundex(), see https://issues.apache.org/jira/browse/HIVE-9738

It's based on #7115 , thanks to HuJiayin

Author: HuJiayin <jiayin.hu@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7812 from davies/soundex and squashes the following commits:

fa75941 [Davies Liu] Merge branch 'master' of github.com:apache/spark into soundex
a4bd6d8 [Davies Liu] fix soundex
2538908 [HuJiayin] add codegen soundex
d15d329 [HuJiayin] add back ut
ded1a14 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark
e2dec2c [HuJiayin] support soundex rebase code
@hujy hujy closed this Aug 3, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants