In [10]:
import jellyfish

# 1. Phonetic Encoding (English Characters Only)
These algorithms convert a word to a normalized code that represents its pronunciation. Each takes a single string and returns its encoded form.

## 1.1. American Soundex
Soundex is an algorithm that converts a word (typically a name) to a four-character code in the form ‘A123’, where ‘A’ is the first letter of the name and the three digits represent groups of similar-sounding consonants.

For example `soundex('Ann') == soundex('Anne') == 'A500'` and `soundex('Rupert') == soundex('Robert') == 'R163'`.

https://en.wikipedia.org/wiki/Soundex

In [11]:
jellyfish.soundex(u'Jellyfish')

'J412'
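
To make the encoding rules concrete, here is a minimal pure-Python sketch of American Soundex. It is an illustration written for this notebook, not jellyfish's implementation; the name `soundex_sketch` is hypothetical, and the handling of `H` and `W` follows the commonly documented rule that they do not separate letters with the same code.

```python
def soundex_sketch(name: str) -> str:
    """Illustrative American Soundex (not the jellyfish implementation)."""
    # Consonant groups that share a digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                  # the first letter is kept as-is
    prev = codes.get(letters[0], "")     # its digit still suppresses a repeat
    for ch in letters[1:]:
        digit = codes.get(ch, "")        # vowels, H, W, Y have no digit
        if digit and digit != prev:
            result += digit
        if ch not in "HW":               # H and W do not reset the previous digit
            prev = digit
    return (result + "000")[:4]          # pad or truncate to four characters


print(soundex_sketch("Rupert"), soundex_sketch("Robert"))
```

Both names print the same code, which is what makes Soundex useful for grouping spelling variants of a name.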

## 1.2. Metaphone
The Metaphone algorithm was designed as an improvement on Soundex. It transforms a word into a string drawn from the characters ‘0BFHJKLMNPRSTWXY’, where ‘0’ represents the ‘th’ sound and ‘X’ represents a ‘[sc]h’ sound.

For example `metaphone('Klumpz') == metaphone('Clumps') == 'KLMPS'`.

https://en.wikipedia.org/wiki/Metaphone

In [12]:
jellyfish.metaphone(u'Jellyfish')

'JLFX'

## 1.3. NYSIIS
The NYSIIS algorithm was developed by the New York State Identification and Intelligence System. It transforms a word into a phonetic code. Like Soundex and Metaphone, it is primarily intended for use on names (as they would be pronounced in English).

For example `nysiis('John') == nysiis('Jan') == 'JAN'`.

https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System

In [13]:
jellyfish.nysiis(u'Jellyfish')

'JALYF'

## 1.4. Match Rating Approach (codex)
The Match Rating Approach determines whether or not two names are pronounced similarly. It consists of an encoding function (similar to Soundex or NYSIIS), shown here, and `match_rating_comparison()`, which does the actual comparison.

https://en.wikipedia.org/wiki/Match_rating_approach

In [14]:
jellyfish.match_rating_codex(u'Jellyfish')

'JLYFSH'
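
The encoding step can be sketched in a few lines. The version below follows the three rules as they are usually described (drop vowels except a leading one, collapse doubled consonants, keep the first three and last three letters if more than six remain). The function name `match_rating_codex_sketch` is hypothetical, and applying the rules in this order is an assumption; jellyfish may treat edge cases differently.

```python
def match_rating_codex_sketch(name: str) -> str:
    """Illustrative Match Rating Approach codex (not the jellyfish implementation)."""
    letters = [c for c in name.upper() if c.isalpha()]

    # Rule 1: delete vowels unless the vowel begins the word.
    kept = [c for i, c in enumerate(letters) if i == 0 or c not in "AEIOU"]

    # Rule 2: drop the second letter of any doubled pair.
    # Assumption: doubles are collapsed after vowel removal.
    deduped = []
    for c in kept:
        if not deduped or deduped[-1] != c:
            deduped.append(c)

    # Rule 3: reduce to six letters by joining the first three and last three.
    code = "".join(deduped)
    if len(code) > 6:
        code = code[:3] + code[-3:]
    return code


print(match_rating_codex_sketch("Jellyfish"))
```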

-----

# 2. String Comparison
These functions each measure the difference between two strings, expressed as a distance (a form of edit distance).

## 2.1. Levenshtein Distance
Levenshtein distance represents the number of insertions, deletions, and substitutions required to change one word to another.

For example: `levenshtein_distance('berne', 'born') == 2` representing the transformation of the first e to o and the deletion of the second e.

https://en.wikipedia.org/wiki/Levenshtein_distance

In [15]:
distance = jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
print(distance, 'edit(s)')

2 edit(s)
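
For readers who want to see how the count is derived, the following is a minimal sketch of the standard dynamic-programming formulation, keeping only two rows of the table. It is written for illustration (the name `levenshtein_sketch` is hypothetical) and is not how jellyfish implements the function.

```python
def levenshtein_sketch(a: str, b: str) -> int:
    """Illustrative Levenshtein distance using two rows of the DP table."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein_sketch("berne", "born"), "edit(s)")
```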


## 2.2. Damerau-Levenshtein Distance
A modification of Levenshtein distance, Damerau-Levenshtein distance counts transpositions (such as ifsh for fish) as a single edit.

For example, `levenshtein_distance('fish', 'ifsh') == 2`, since plain Levenshtein needs two edits (such as a deletion and an insertion), whereas `damerau_levenshtein_distance('fish', 'ifsh') == 1`, since swapping the adjacent `f` and `i` counts as a single transposition.

https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

In [16]:
distance = jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
print(distance, 'edit(s)')

1 edit(s)
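
The transposition rule can be illustrated with a small extension of the Levenshtein table. Note that the sketch below implements the restricted variant known as optimal string alignment, which is simpler than unrestricted Damerau-Levenshtein and can give a larger result on some inputs; it agrees with the examples in this section. The name `osa_distance_sketch` is hypothetical.

```python
def osa_distance_sketch(a: str, b: str) -> int:
    """Illustrative optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # An adjacent swap such as "fi" -> "if" counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance_sketch("fish", "ifsh"), "edit(s)")
```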


## 2.3. Hamming Distance
Hamming distance is the number of positions at which two strings have different characters.

Typically Hamming distance is undefined when strings are of different length, but this implementation considers extra characters as differing. For example `hamming_distance('abc', 'abcd') == 1`.

https://en.wikipedia.org/wiki/Hamming_distance

In [17]:
distance = jellyfish.hamming_distance(u'jellyfish', u'smellyfish')
print(distance, 'different character(s)')

9 different character(s)
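
The result of 9 may look surprising next to the Levenshtein distance of 2 for the same pair: Hamming distance compares position by position, so the extra leading character shifts almost every letter out of alignment. A minimal sketch of the behaviour described above, including the rule that surplus characters count as differences (the name `hamming_sketch` is hypothetical):

```python
def hamming_sketch(a: str, b: str) -> int:
    """Illustrative Hamming distance that counts surplus characters as differences."""
    # Compare characters position by position over the shared length.
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    # Every character beyond the shorter string counts as one difference.
    return mismatches + abs(len(a) - len(b))


print(hamming_sketch("abc", "abcd"), "different character(s)")
```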
