# Basic Examples of using Abydos

We start by importing everything from the phonetic and distance modules of Abydos, along with pandas.

In [1]:
from abydos.phonetic import *
from abydos.distance import *

import pandas as pd

Then we load some data into a DataFrame. In this case, we'll load the US Census surnames data, ranked by frequency.

In [2]:
names = pd.read_csv('../tests/corpora/uscensus2000.csv',
                    comment='#', index_col=1, usecols=(0,1), keep_default_na=False)
names.head()

rank,name
1,SMITH
2,JOHNSON
3,WILLIAMS
4,BROWN
5,JONES


We can create a dictionary that maps each Soundex code to all the surnames sharing that code. Each group is a set of Soundex collisions (a block, in record-linkage terms). Getting the basic Soundex value of a string is as simple as calling ``soundex()`` on it.

In [3]:
soundex('WILLIAMSON')

'W452'
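To see where a code like 'W452' comes from, here is a minimal pure-Python sketch of the classic American Soundex rules. It is an illustration only, not Abydos's implementation; the name `simple_soundex` is hypothetical, and the sketch assumes a non-empty, purely alphabetic ASCII input.

```python
# Consonant groups of classic American Soundex
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def simple_soundex(word, length=4):
    """Hypothetical helper: encode a non-empty alphabetic word."""
    word = word.upper()
    encoded = word[0]                 # the first letter is kept as-is
    prev = CODES.get(word[0], "")     # its digit still suppresses repeats
    for ch in word[1:]:
        if ch in "HW":
            continue                  # H and W do not separate equal codes
        digit = CODES.get(ch, "")     # vowels (and Y) map to no digit
        if digit and digit != prev:
            encoded += digit
        prev = digit                  # a vowel resets prev, allowing repeats
    return (encoded + "000")[:length]  # pad with zeros, then truncate


print(simple_soundex("WILLIAMSON"))
```

WILLIAMSON keeps its W, then contributes L (4), M (5), S (2), and N (5); truncating to four characters drops the final 5, which is why WILLIAMS and WILLIAMSON collide.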

Better yet, we can construct a ``Soundex()`` object to reuse for encoding multiple names.

In [4]:
sdx = Soundex()
# Map each Soundex code to the set of surnames that share it
reverse_soundex = {}
for name in names.name:
    reverse_soundex.setdefault(sdx.encode(name), set()).add(name)

With this dictionary, we can retrieve all the names that map to the same Soundex value as, for example, the name Williamson.

In [5]:
reverse_soundex[soundex('WILLIAMSON')]

{'WALENGA',
 'WALING',
 'WALINSKI',
 'WALLENIUS',
 'WALLENS',
 'WALLENSTEIN',
 'WALLING',
 'WALLINGA',
 'WALLINGER',
 'WALLINGFORD',
 'WALLINGSFORD',
 'WALLINGTON',
 'WALMSLEY',
 'WEHLING',
 'WELENC',
 'WELLENS',
 'WELLENSTEIN',
 'WELLING',
 'WELLINGER',
 'WELLINGHOFF',
 'WELLINGS',
 'WELLINGTON',
 'WELLINS',
 'WELLMAKER',
 'WELLONS',
 'WELMAKER',
 'WELNIAK',
 'WHALING',
 'WHEELING',
 'WHEELINGTON',
 'WIELENGA',
 'WIELINSKI',
 'WILAMOWSKI',
 'WILENS',
 'WILENSKY',
 'WILINSKI',
 'WILLAIMS',
 'WILLAMSON',
 'WILLEMS',
 'WILLEMSE',
 'WILLEMSEN',
 'WILLEMSSEN',
 'WILLENS',
 'WILLIAMS',
 'WILLIAMSBEY',
 'WILLIAMSBROWN',
 'WILLIAMSEN',
 'WILLIAMSJONES',
 'WILLIAMSLEE',
 'WILLIAMSMAE',
 'WILLIAMSON',
 'WILLIAMSSMITH',
 'WILLIAMSTON',
 'WILLIANSON',
 'WILLIMAS',
 'WILLIMSON',
 'WILLING',
 'WILLINGER',
 'WILLINGHAM',
 'WILLINGS',
 'WILLINGTON',
 'WILLINK',
 'WILLINS',
 'WILLLIAMS',
 'WILLMES',
 'WILLMS',
 'WILMES',
 'WILMS',
 'WILMSEN',
 'WILMSMEYER',
 'WOHLENHAUS',
 'WOLANSKI',
 'WOLANSKY',
 ...}
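The practical value of grouping names this way is that pairwise comparisons only need to happen inside each block. The sketch below shows the idea in plain Python. The helpers `block_by` and `candidate_pairs` are hypothetical names, the six-name list is made up for illustration, and the first letter stands in for a phonetic code so that the example has no dependencies.

```python
from collections import defaultdict
from itertools import combinations


def block_by(names, key):
    """Group names into blocks that share the same key value."""
    blocks = defaultdict(set)
    for name in names:
        blocks[key(name)].add(name)
    return dict(blocks)


def candidate_pairs(blocks):
    """Yield only the pairs of names that fall in the same block."""
    for members in blocks.values():
        yield from combinations(sorted(members), 2)


# Toy data; the first letter is a stand-in for a real phonetic key
names = ['WILLIAMS', 'WILLIAMSON', 'WALLING', 'SMITH', 'SMYTHE', 'JONES']
blocks = block_by(names, key=lambda name: name[0])
pairs = list(candidate_pairs(blocks))
all_pairs = len(names) * (len(names) - 1) // 2

print(len(pairs), 'of', all_pairs, 'possible pairs remain after blocking')
```

Swapping the `key` argument for a Soundex encoder would give the same kind of grouping as `reverse_soundex`.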

We can build up a DataFrame with some interesting information about these names. First, we'll just collect all the names in a column.

In [6]:
df = pd.DataFrame(sorted(reverse_soundex[soundex('WILLIAMSON')]), columns=['name'])
df

Unnamed: 0,name
0,WALENGA
1,WALING
2,WALINSKI
3,WALLENIUS
4,WALLENS
5,WALLENSTEIN
6,WALLING
7,WALLINGA
8,WALLINGER
9,WALLINGFORD


To that, let's add a few distance measures.

In [7]:
# Levenshtein distance from 'WILLIAMSON'
lev = Levenshtein()
df['Levenshtein'] = df.name.apply(lambda name: lev.dist_abs('WILLIAMSON', name))
# Jaccard similarity on 2-grams
jac = Jaccard()
df['Jaccard'] = df.name.apply(lambda name: jac.sim('WILLIAMSON', name))
# Jaro-Winkler similarity
jw = JaroWinkler()
df['Jaro_Winkler'] = df.name.apply(lambda name: jw.sim('WILLIAMSON', name))
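For readers who want to see what the first two measures compute, here is a dependency-free sketch. It is not Abydos's code. The function names are hypothetical, and the Jaccard variant assumes character bigrams over a string padded with a start symbol `$` and a stop symbol `#`, treated as plain sets; those choices are assumptions made for this illustration.

```python
def levenshtein(a, b):
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))          # distances from '' to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete ca
                            curr[j - 1] + 1,           # insert cb
                            prev[j - 1] + (ca != cb))) # substitute or keep
        prev = curr
    return prev[-1]


def padded_bigrams(word, start='$', stop='#'):
    """Set of 2-grams of the word with assumed start/stop padding."""
    padded = start + word + stop
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def jaccard(a, b):
    """Jaccard similarity: shared bigrams over all distinct bigrams."""
    x, y = padded_bigrams(a), padded_bigrams(b)
    return len(x & y) / len(x | y)


print(levenshtein('WILLIAMSON', 'WILLAMSON'))
print(jaccard('WILLIAMSON', 'WILLAMSON'))
```

WILLIAMSON has 11 padded bigrams and WILLAMSON has 10, of which 9 are shared, so the similarity is 9 / 12 = 0.75.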

And finally, we'll add a few phonetic encodings.

In [8]:
# Double Metaphone (first code only)
dm = DoubleMetaphone()
df['Double_Metaphone'] = df.name.apply(lambda name: dm.encode(name)[0])
# NYSIIS
nysiis = NYSIIS()
df['NYSIIS'] = df.name.apply(lambda name: nysiis.encode(name))
# Alpha-SIS (first code only)
alphasis = AlphaSIS()
df['Alpha_SIS'] = df.name.apply(lambda name: alphasis.encode(name)[0])

In [9]:
df

Unnamed: 0,name,Levenshtein,Jaccard,Jaro_Winkler,Double_Metaphone,NYSIIS,Alpha_SIS
0,WALENGA,8,0.055556,0.465079,ALNK,WALANG,45270000000000
1,WALING,7,0.125000,0.605556,ALNK,WALANG,45270000000000
2,WALINSKI,6,0.111111,0.755000,ALNSK,WALANS,45207000000000
3,WALLENIUS,7,0.105263,0.757619,ALNS,WALAN,45200000000000
4,WALLENS,6,0.117647,0.737143,ALNS,WALAN,45200000000000
5,WALLENSTEIN,7,0.150000,0.604040,ALNSTN,WALANS,45201200000000
6,WALLING,6,0.187500,0.787143,ALNK,WALANG,45270000000000
7,WALLINGA,6,0.176471,0.755000,ALNK,WALANG,45270000000000
8,WALLINGER,6,0.166667,0.730000,ALNKR,WALANG,45274000000000
9,WALLINGFORD,6,0.150000,0.683550,ALNKFRT,WALANG,45278410000000
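The Jaro_Winkler column can be understood with a short sketch of the textbook algorithm. This is an illustration rather than Abydos's implementation. The parameter names and defaults (a scaling factor of 0.1, a prefix of at most 4 characters, and a prefix boost applied only when the Jaro score exceeds 0.7) are assumptions based on the common Winkler formulation.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    # Characters match only if they are within this distance of each other
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half-transpositions
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1, boost_threshold=0.7, max_prefix=4):
    """Jaro similarity with a bonus for a shared prefix."""
    score = jaro(s1, s2)
    if score <= boost_threshold:
        return score                      # too dissimilar for the bonus
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scaling * (1 - score)


print(round(jaro_winkler('WILLIAMSON', 'WALLING'), 6))
```

For WALLING, six characters match with one transposition, giving a Jaro score of about 0.763492; the shared one-letter prefix W then raises it to about 0.787143.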


Let's check the row for WILLIAMSON.

In [10]:
df[df.name == 'WILLIAMSON']

Unnamed: 0,name,Levenshtein,Jaccard,Jaro_Winkler,Double_Metaphone,NYSIIS,Alpha_SIS
50,WILLIAMSON,0,1.0,1.0,ALMSN,WALANS,45302000000000


Beyond colliding on Soundex, 7 names (including WILLIAMSON itself) share the same first Double Metaphone encoding.

In [11]:
df[df.Double_Metaphone == 'ALMSN']

Unnamed: 0,name,Levenshtein,Jaccard,Jaro_Winkler,Double_Metaphone,NYSIIS,Alpha_SIS
37,WILLAMSON,1,0.75,0.98,ALMSN,WALANS,45302000000000
40,WILLEMSEN,3,0.4,0.895556,ALMSN,WALANS,45302000000000
41,WILLEMSSEN,4,0.375,0.88,ALMSN,WALANS,45302000000000
46,WILLIAMSEN,1,0.692308,0.96,ALMSN,WALANS,45302000000000
50,WILLIAMSON,0,1.0,1.0,ALMSN,WALANS,45302000000000
55,WILLIMSON,1,0.75,0.98,ALMSN,WALANS,45302000000000
68,WILMSEN,4,0.357143,0.873333,ALMSN,WALNSA,45302000000000


28 have matching NYSIIS encodings (only ten rows appear in the output below).

In [12]:
df[df.NYSIIS == 'WALANS']

Unnamed: 0,name,Levenshtein,Jaccard,Jaro_Winkler,Double_Metaphone,NYSIIS,Alpha_SIS
2,WALINSKI,6,0.111111,0.755,ALNSK,WALANS,45207000000000
5,WALLENSTEIN,7,0.15,0.60404,ALNSTN,WALANS,45201200000000
16,WELLENSTEIN,7,0.15,0.584848,ALNSTN,WALANS,45201200000000
31,WIELINSKI,5,0.166667,0.76,ALNSK,WALANS,45207000000000
34,WILENSKY,6,0.176471,0.633333,ALNSK,WALANS,45207000000000
35,WILINSKI,5,0.25,0.795833,ALNSK,WALANS,45207000000000
37,WILLAMSON,1,0.75,0.98,ALMSN,WALANS,45302000000000
39,WILLEMSE,4,0.333333,0.87,ALMS,WALANS,45300000000000
40,WILLEMSEN,3,0.4,0.895556,ALMSN,WALANS,45302000000000
41,WILLEMSSEN,4,0.375,0.88,ALMSN,WALANS,45302000000000


And 7 have matching first Alpha-SIS encodings.

In [13]:
df[df.Alpha_SIS == '45302000000000']

Unnamed: 0,name,Levenshtein,Jaccard,Jaro_Winkler,Double_Metaphone,NYSIIS,Alpha_SIS
37,WILLAMSON,1,0.75,0.98,ALMSN,WALANS,45302000000000
40,WILLEMSEN,3,0.4,0.895556,ALMSN,WALANS,45302000000000
41,WILLEMSSEN,4,0.375,0.88,ALMSN,WALANS,45302000000000
46,WILLIAMSEN,1,0.692308,0.96,ALMSN,WALANS,45302000000000
50,WILLIAMSON,0,1.0,1.0,ALMSN,WALANS,45302000000000
55,WILLIMSON,1,0.75,0.98,ALMSN,WALANS,45302000000000
68,WILMSEN,4,0.357143,0.873333,ALMSN,WALNSA,45302000000000


Across all four phonetic algorithms considered here (Soundex, Double Metaphone, NYSIIS, and Alpha-SIS), 6 names match, counting WILLIAMSON itself.

In [14]:
df[(df.Alpha_SIS == '45302000000000') & (df.NYSIIS == 'WALANS') &
   (df.Double_Metaphone == 'ALMSN')]

Unnamed: 0,name,Levenshtein,Jaccard,Jaro_Winkler,Double_Metaphone,NYSIIS,Alpha_SIS
37,WILLAMSON,1,0.75,0.98,ALMSN,WALANS,45302000000000
40,WILLEMSEN,3,0.4,0.895556,ALMSN,WALANS,45302000000000
41,WILLEMSSEN,4,0.375,0.88,ALMSN,WALANS,45302000000000
46,WILLIAMSEN,1,0.692308,0.96,ALMSN,WALANS,45302000000000
50,WILLIAMSON,0,1.0,1.0,ALMSN,WALANS,45302000000000
55,WILLIMSON,1,0.75,0.98,ALMSN,WALANS,45302000000000
