# Phonetic Hashing

Canonicalisation is a technique to reduce a word to its base form. Stemming and lemmatisation are types of canonicalisation: stemming reduces a word to its root form, and lemmatisation reduces a word to its lemma. The root and the lemma are the base forms of the inflected words.

But some cases can be handled neither by stemming nor by lemmatisation. Another preprocessing step is needed before such words can be stemmed or lemmatised effectively.

One such case is variation in spelling and pronunciation across the dialects of a single language. For example, the word ‘colour’ is used in British English, while ‘color’ is used in American English. Both are correct spellings, but ‘colouring’ and ‘coloring’ will result in different stems and lemmas.
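To make the problem concrete, here is a minimal sketch using a deliberately naive suffix stripper. `naive_stem` is a hypothetical helper written only for this illustration; it is not the Porter stemmer or any library's stemmer, and it assumes that simply dropping a common suffix is enough to show the effect.

```python
def naive_stem(word):
    """Toy suffix stripper (hypothetical helper, not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "s"):
        # only strip the suffix when a stem of at least three letters remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


british = naive_stem("colouring")
american = naive_stem("coloring")
print(british, american)  # colour color
```

The two spellings of the same word end up with two different base forms, so later steps would treat them as unrelated tokens.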

To deal with different spellings that arise from different pronunciations, <b>phonetic hashing</b> can be used to canonicalise different versions of the same word to a base word.

Phonetic hashing places all words made up of similar phonemes (that is, words with a similar sound or pronunciation) into a single bucket and gives all these variations a single hash code. Hence, the words ‘Dilli’ and ‘Delhi’ will have the same code.
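The bucketing idea can be sketched as grouping words by a phonetic key. The key function below is a simplified Soundex-style hash chosen as an assumption for this example; `phonetic_key` and `bucket_by_sound` are hypothetical names, and any other phonetic hash could be substituted.

```python
from collections import defaultdict


def phonetic_key(word):
    """Simplified Soundex-style key (illustrative assumption, not a full implementation)."""
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
    digit = {ch: d for group, d in groups.items() for ch in group}
    word = word.upper()
    key = word[0]
    previous = digit.get(word[0], "")
    for ch in word[1:]:
        current = digit.get(ch, "")  # vowels, H, W and Y carry no digit
        if current and current != previous:
            key += current
        previous = current
    return key[:4].ljust(4, "0")


def bucket_by_sound(words):
    """Group words that share the same phonetic key."""
    buckets = defaultdict(list)
    for word in words:
        buckets[phonetic_key(word)].append(word)
    return dict(buckets)


buckets = bucket_by_sound(["Delhi", "Dilli", "colour", "color", "Mumbai"])
print(buckets)
# {'D400': ['Delhi', 'Dilli'], 'C460': ['colour', 'color'], 'M510': ['Mumbai']}
```

Each bucket can then be mapped to one representative spelling before stemming or lemmatising.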

# Soundex Algorithm

Phonetic hashing is commonly done using the Soundex algorithm. American Soundex is the most popular Soundex variant. It can bucket British and American spellings of a word, such as ‘colour’ and ‘color’, to a common code. Soundex does not check which language the input word comes from; it looks only at the letters, so words that start with the same letter and have similar-sounding consonants get the same hash code.

The algorithm is as follows:

- The Soundex code is a four-character code: a letter followed by three digits. The first letter of the code is the first letter of the input word, retained as is.

- Next, map all the consonants (except the first letter) to digits as below. Adjacent letters that map to the same digit are encoded only once.

    1. b, f, p, v => 1
    2. c, g, j, k, q, s, x, z => 2
    3. d, t => 3
    4. l => 4
    5. m, n => 5
    6. r => 6
    7. h, w, y => not encoded

    
- The third step is to remove all the vowels.

- The fourth step is to force the code to be exactly four characters long: pad it with zeroes on the right if it is shorter than four characters, or truncate it from the right if it is longer.
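The steps above can be traced one at a time. The sketch below returns the intermediate result of each step; `soundex_stages` is a hypothetical helper for illustration only. It additionally merges adjacent repeated digits, an assumption taken from the `get_soundex` function in this section, and it ignores the case where the second letter shares a digit with the first.

```python
# letter -> digit table from the list above, flattened to one entry per letter
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
DIGIT = {letter: digit for group, digit in GROUPS.items() for letter in group}


def soundex_stages(word):
    """Return (first letter, mapped tail, digits kept, final code)."""
    word = word.upper()
    first = word[0]  # step 1: keep the first letter

    # step 2: map the remaining letters; "." marks anything without a digit
    mapped = [DIGIT.get(ch, ".") for ch in word[1:]]

    # merge runs of the same symbol so that e.g. "LL" is encoded once
    collapsed = []
    for symbol in mapped:
        if not collapsed or symbol != collapsed[-1]:
            collapsed.append(symbol)

    # step 3: drop the placeholders for vowels, H, W and Y
    digits = "".join(symbol for symbol in collapsed if symbol != ".")

    # step 4: truncate or pad to four characters
    code = (first + digits)[:4].ljust(4, "0")
    return first, "".join(mapped), digits, code


print(soundex_stages("Delhi"))  # ('D', '.4..', '4', 'D400')
print(soundex_stages("Dilli"))  # ('D', '.44.', '4', 'D400')
```

‘Delhi’ shows the padding case; a longer word such as ‘Washington’ produces more than three digits and is truncated to `W252`.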

In [1]:
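def get_soundex(token):
    """Return a simplified four-character Soundex code for `token`."""

    token = token.upper()

    # an empty token has no first letter to keep
    if not token:
        return ""

    # groups of letters and their soundex codes. Vowels, H, W and Y will be represented with .
    consonant_dict = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6", "AEIOUHWY": "."}

    # flatten the groups into a single letter -> code lookup
    codes = {letter: code for letters, code in consonant_dict.items() for letter in letters}

    # take the first letter of the token
    soundex = token[0]

    # start from the first letter's own code so that a second letter
    # from the same group is not encoded again
    previous = codes.get(token[0], "")

    for char in token[1:]:
        code = codes.get(char)
        # skip characters that are not letters, and letters that repeat the previous code
        if code is not None and code != previous:
            soundex = soundex + code
            previous = code

    # remove vowels and 'H', 'W' and 'Y' from soundex
    soundex = soundex.replace(".", "")

    # trim or pad to make soundex a 4-character code
    soundex = soundex[:4].ljust(4, "0")

    return soundex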
    

In [3]:
# Let's validate
print(get_soundex("Bombay"))
print(get_soundex("Bambai"))

B510
B510


In [4]:
print(get_soundex("Aggrawal"))
print(get_soundex("Agrawal"))
print(get_soundex("Aggarwal"))
print(get_soundex("Agarwal"))

A264
A264
A264
A264
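The function above treats H and W like vowels, so a name such as ‘Ashcraft’ comes out as `A226`. The commonly published American Soundex rules instead say that two letters with the same digit separated only by H or W are coded once, while a vowel between them lets both be coded. A minimal sketch of that stricter variant follows; `american_soundex` is a hypothetical name, and the rule is stated here as an assumption rather than taken from this article's implementation.

```python
def american_soundex(name):
    """Sketch of American Soundex including the H/W rule (assumed from the published rules)."""
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
    digit = {ch: d for group, d in groups.items() for ch in group}

    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""

    code = letters[0]
    # the first letter's digit also suppresses an immediate repeat
    previous = digit.get(letters[0], "")

    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal digits
        current = digit.get(ch, "")  # vowels and Y have no digit and do break a run
        if current and current != previous:
            code += current
        previous = current

    return code[:4].ljust(4, "0")


print(american_soundex("Ashcraft"))  # A261
print(american_soundex("Tymczak"))   # T522
print(american_soundex("Pfister"))   # P236
```

For the names validated above, such as ‘Agarwal’ and ‘Bombay’, this variant returns the same codes (`A264` and `B510`); the two functions differ only on words where H or W sits between letters of the same group.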
