# Comparing Different Techniques

## **Table of Contents:**
* [Setting Up](#1)
* [Fuzz Ratio](#2)
* [Fuzz Partial Ratio](#3)
* [Token Sort Ratio](#4)
* [Token Set Ratio](#5)
* [Comparison](#6)

## Setting Up <a class="anchor" id="1"></a>


In [42]:
# pip install recordlinkage

In [43]:
import recordlinkage
import pandas as pd
import time

We use FEBRL4, a dataset that ships with the recordlinkage package.

In [44]:
from recordlinkage.datasets import load_febrl4

In [45]:
dfA, dfB, true_links = load_febrl4(return_links=True)
print("Dataset A")
display(dfA.sort_index().head())
print("Dataset B")
display(dfB.sort_index().head())

Dataset A


| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-0-org | rachael | dent | 1 | knox street | lakewood estate | byford | 4129 | vic | 19280722.0 | 1683994 |
| rec-1-org | isabella | everett | 25 | pike place | rowethorpe | marsden | 2152 | nsw | 19110816.0 | 6653129 |
| rec-10-org | lachlan | reid | 5 | carrington road | legacy vlge | yagoona | 2464 | nsw | 19500531.0 | 3232033 |
| rec-100-org | hayden | stapley | 38 | tindale street | villa 2 | cromer heights | 4125 | vic | | 4620080 |
| rec-1000-org | victoria | zbierski | 70 | wybalena grove | inverneath | paralowie | 5065 | nsw | 19720503.0 | 1267612 |


Dataset B


| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-0-dup-0 | rachael | dent | 4.0 | knox street | lakewood estate | byford | 4129 | vic | 19280722.0 | 1683994 |
| rec-1-dup-0 | isabella | everett | 25.0 | pike mlace | rowethorpe | marsden | 2152 | nsw | 19110816.0 | 6653129 |
| rec-10-dup-0 | lachlnn | reid | 5.0 | carrington road | legacy vlge | yagoona | 2446 | nsw | 19500531.0 | 3232033 |
| rec-100-dup-0 | hayden | stapley | | tindale street | villa 2 | cromer heights | 4125 | vic | | 4620080 |
| rec-1000-dup-0 | victoria | zbierski | 70.0 | wybalena grove | inverbeath | paralowie | 5065 | nsw | 19720503.0 | 1267612 |


We now add a column to both dataframes containing each person's initials: the first letter of the given name followed by the first letter of the surname.

In [46]:
dfA["initials"] = (dfA["given_name"].str[0]  + dfA["surname"].str[0])
dfB["initials"] = (dfB["given_name"].str[0]  + dfB["surname"].str[0])

Converting the values in soc_sec_id to numeric type

In [47]:
dfA['soc_sec_id']= pd.to_numeric(dfA['soc_sec_id'])
dfB['soc_sec_id']= pd.to_numeric(dfB['soc_sec_id'])

We create the candidate links, a MultiIndex of record pairs obtained by blocking on the initials. It contains every pair of one record from dfA and one record from dfB whose initials are equal. For example, for the initials AB, the candidate links contain every combination of a dfA record with initials AB and a dfB record with initials AB.
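To make the idea of blocking concrete, here is a minimal pure-Python sketch of what pairing records on a shared key amounts to. The function `block_on_key` and the toy record ids are hypothetical names chosen for illustration; this is not how recordlinkage implements `indexer.block`, only a picture of the result it produces.

```python
from collections import defaultdict

def block_on_key(keys_a, keys_b):
    """Return every (id_a, id_b) pair whose blocking keys are equal.

    keys_a and keys_b map a record id to its blocking key (here, initials).
    """
    # Group the ids of the second dataset by blocking key.
    ids_by_key_b = defaultdict(list)
    for rec_id, key in keys_b.items():
        ids_by_key_b[key].append(rec_id)

    # Pair each record of the first dataset with the records sharing its key.
    pairs = []
    for rec_id, key in keys_a.items():
        for other_id in ids_by_key_b.get(key, []):
            pairs.append((rec_id, other_id))
    return pairs

# Hypothetical toy data: record id -> initials
toy_a = {"a1": "rd", "a2": "ie", "a3": "rd"}
toy_b = {"b1": "rd", "b2": "lr"}
toy_pairs = block_on_key(toy_a, toy_b)
print(toy_pairs)
```

Only the pairs that share initials survive, so far fewer comparisons are needed than for the full cross product of the two datasets.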

In [48]:
indexer = recordlinkage.Index()
indexer.block('initials')
candidate_links = indexer.index(dfA, dfB)

In [49]:
!pip install fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process



For every technique below, we first block on the initials and then apply the technique to each candidate pair. We keep the index pair of every pair of records whose similarity score exceeds the chosen threshold.

We compare this list of predicted matches with true_links (the index pairs of all the actual matches) to calculate precision and recall, which we use to evaluate the technique. We also record how long each technique takes to run.
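As a reminder of what the two metrics measure, the sketch below computes them directly from two collections of index pairs. The helper `precision_recall` and the toy pairs are hypothetical and only illustrate the definitions; the notebook itself relies on `recordlinkage.precision` and `recordlinkage.recall`.

```python
def precision_recall(true_pairs, predicted_pairs):
    """Compute precision and recall from two collections of (id_a, id_b) pairs."""
    true_set = set(true_pairs)
    predicted_set = set(predicted_pairs)
    true_positives = len(true_set & predicted_set)
    # Precision: share of predicted matches that are actual matches.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    # Recall: share of actual matches that were predicted.
    recall = true_positives / len(true_set) if true_set else 0.0
    return precision, recall

# Hypothetical toy data: four actual matches, three predictions (two correct)
toy_true = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
toy_pred = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
p, r = precision_recall(toy_true, toy_pred)
print(p, r)
```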



### Fuzz Ratio <a class="anchor" id="2"></a>


This technique compares two strings A and B and returns a similarity score between 0 and 100. The score is based on the Levenshtein distance, which is the number of single-character changes (removing, adding or substituting a character) needed to transform string A into string B. The fewer changes needed, the more similar A and B are and the higher the score.
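The edit distance described above can be sketched with a short dynamic-programming routine. Both `levenshtein` and `similarity` are illustrative helpers, and the normalisation by the longer string's length is an assumption made here for simplicity; fuzzywuzzy's `fuzz.ratio` computes its score differently, so the numbers will not match it exactly.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Map the distance to a 0-100 score (assumed normalisation, not fuzzywuzzy's)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))

print(levenshtein("lachlan reid", "lachlnn reid"))
print(similarity("lachlan reid", "lachlnn reid"))
```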

In [60]:
matches=[]
threshold=80
start = time.time()
for i in candidate_links:
  ind1=i[0]
  ind2=i[1]
  compare1=dfA.loc[ind1,'given_name']+' '+dfA.loc[ind1,'surname']
  compare2=dfB.loc[ind2,'given_name']+' '+dfB.loc[ind2,'surname']
  val=fuzz.ratio(compare1.lower(),compare2.lower())
  if val>threshold:
    matches.append((ind1,ind2))
end = time.time()
fr = end - start
print("The time of execution of above program is :", fr)



The time of execution of above program is : 2.28541898727417


In [51]:
matches1=pd.MultiIndex.from_tuples(matches)
pre_fr = recordlinkage.precision(true_links, matches1)
rcl_fr = recordlinkage.recall(true_links, matches1)
print("When using Fuzz Ratio " + " precision is " + str(pre_fr) + " and recall is " + str(rcl_fr))

When using Fuzz Ratio  precision is 0.7295585032666798 and recall is 0.737


### Fuzz Partial Ratio <a class="anchor" id="3"></a>

fuzz.partial_ratio (PR) compares the shorter string against substrings of the longer one and returns the score of the best match. For example, it returns 100 when comparing Dwayne The Rock Johnson with Dwayne.
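The substring idea can be illustrated with a simplified sliding-window sketch built on the standard library's `difflib`. The helpers `simple_ratio` and `simple_partial_ratio` are hypothetical and are not fuzzywuzzy's implementation, which aligns the strings more cleverly; the sketch only shows why a short string contained in a longer one scores 100.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity of two strings on a 0-100 scale, using difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def simple_partial_ratio(a, b):
    """Best score of the shorter string against every equally long
    window of the longer string (simplified illustration)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    window = len(shorter)
    best = 0
    for start in range(len(longer) - window + 1):
        best = max(best, simple_ratio(shorter, longer[start:start + window]))
    return best

full_score = simple_ratio("Dwayne", "Dwayne The Rock Johnson")
partial_score = simple_partial_ratio("Dwayne", "Dwayne The Rock Johnson")
print(full_score, partial_score)
```

The plain ratio penalises the extra words, while the partial score does not, which is why partial matching can raise recall at the cost of precision.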

In [52]:
matches=[]
threshold=80
start = time.time()
for i in candidate_links:
  ind1=i[0]
  ind2=i[1]
  compare1=dfA.loc[ind1,'given_name']+' '+dfA.loc[ind1,'surname']
  compare2=dfB.loc[ind2,'given_name']+' '+dfB.loc[ind2,'surname']
  val=fuzz.partial_ratio(compare1.lower(),compare2.lower())
  if val>threshold:
    matches.append((ind1,ind2))
end = time.time()
fpr= end-start
print("The time of execution of above program is :", fpr)    


The time of execution of above program is : 2.9320318698883057


In [53]:
matches1=pd.MultiIndex.from_tuples(matches)
pre_fpr = recordlinkage.precision(true_links, matches1)
rcl_fpr = recordlinkage.recall(true_links, matches1)
print("When using Fuzz Partial Ratio " + " precision is " + str(pre_fpr) + " and recall is " + str(rcl_fpr))

When using Fuzz Partial Ratio  precision is 0.7032146957520092 and recall is 0.735


### Token Sort Ratio <a class="anchor" id="4"></a>


Token methods have the advantage of ignoring case and punctuation (all characters are converted to lowercase). In fuzz.token_sort_ratio (TSoR), each string is split into tokens (one per word), the tokens are sorted in alphanumeric order, and the basic fuzz.ratio (R) is applied to the sorted strings, so the order of the words in the two strings doesn't matter (unlike the previous non-token methods).
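The tokenise, sort and compare steps can be sketched as follows with the standard library. `normalize_tokens` and `simple_token_sort_ratio` are hypothetical helpers, and the ASCII-only cleaning rule is a simplifying assumption; fuzzywuzzy's own preprocessing may treat some characters differently.

```python
import re
from difflib import SequenceMatcher

def normalize_tokens(text):
    """Lowercase, replace punctuation with spaces, split into words and sort them."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return sorted(cleaned.split())

def simple_token_sort_ratio(a, b):
    """Compare the two strings after sorting their tokens (0-100 scale)."""
    sorted_a = " ".join(normalize_tokens(a))
    sorted_b = " ".join(normalize_tokens(b))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())

score = simple_token_sort_ratio("Dent, Rachael", "rachael dent")
print(score)
```

Because both inputs reduce to the same sorted string, the swapped word order, the comma and the capital letters have no effect on the score.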

In [54]:
matches=[]
threshold=80
start = time.time()
for i in candidate_links:
  ind1=i[0]
  ind2=i[1]
  compare1=dfA.loc[ind1,'given_name']+' '+dfA.loc[ind1,'surname']
  compare2=dfB.loc[ind2,'given_name']+' '+dfB.loc[ind2,'surname']
  val=fuzz.token_sort_ratio(compare1.lower(),compare2.lower())
  if val>threshold:
    matches.append((ind1,ind2))
end = time.time()
tsr= end-start
print("The time of execution of above program is :", tsr)    
   

The time of execution of above program is : 2.7590198516845703


In [55]:
matches1=pd.MultiIndex.from_tuples(matches)
pre_tsr = recordlinkage.precision(true_links, matches1)
rcl_tsr = recordlinkage.recall(true_links, matches1)
print("When using Token Sort Ratio " + " precision is " + str(pre_tsr) + " and recall is " + str(rcl_tsr))

When using Token Sort Ratio  precision is 0.7309533306741125 and recall is 0.733


### Token Set Ratio <a class="anchor" id="5"></a>

Token Set Ratio is similar to Token Sort Ratio, except that it ignores duplicated words. It also splits the tokens into those common to both strings and those unique to each string, and scores pairwise comparisons built from these groups.
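A rough sketch of the set-based comparison, again using only the standard library, might look like this. The helper names are hypothetical and the code is modelled on the general idea rather than copied from fuzzywuzzy, so edge cases may behave differently from `fuzz.token_set_ratio`.

```python
import re
from difflib import SequenceMatcher

def ratio(a, b):
    """Similarity of two strings on a 0-100 scale, using difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_set(text):
    """Lowercase, drop punctuation and return the set of distinct words."""
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def simple_token_set_ratio(a, b):
    """Score built from the shared tokens and each string's leftover tokens."""
    tokens_a, tokens_b = token_set(a), token_set(b)
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    # Take the best of the three pairwise comparisons.
    return max(
        ratio(common, combined_a),
        ratio(common, combined_b),
        ratio(combined_a, combined_b),
    )

duplicate_score = simple_token_set_ratio("rachael dent dent", "dent rachael")
subset_score = simple_token_set_ratio("rachael dent", "rachael dent smith")
print(duplicate_score, subset_score)
```

Note that a string whose tokens are a subset of the other's reaches the maximum score in this sketch, which helps explain why set-based matching tends to be more permissive than the sort-based variant.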

In [56]:
matches=[]
threshold=80
start = time.time()
for i in candidate_links:
  ind1=i[0]
  ind2=i[1]
  compare1=dfA.loc[ind1,'given_name']+' '+dfA.loc[ind1,'surname']
  compare2=dfB.loc[ind2,'given_name']+' '+dfB.loc[ind2,'surname']
  val=fuzz.token_set_ratio(compare1.lower(),compare2.lower())
  if val>threshold:
    matches.append((ind1,ind2))
end = time.time()
tsr1= end-start
print("The time of execution of above program is :", tsr1)    


The time of execution of above program is : 3.0303218364715576


In [57]:
matches1=pd.MultiIndex.from_tuples(matches)
pre_tsr1 = recordlinkage.precision(true_links, matches1)
rcl_tsr1 = recordlinkage.recall(true_links, matches1)
print("When using Token Set Ratio " + " precision is " + str(pre_tsr1) + " and recall is " + str(rcl_tsr1))

When using Token Set Ratio  precision is 0.7265532251681837 and recall is 0.7344


### Comparison <a class="anchor" id="6"></a>

In [61]:
Techniques=['Fuzz Ratio', 'Fuzz Partial Ratio', 'Token Sort Ratio', 'Token Set Ratio']
Precision=[pre_fr, pre_fpr, pre_tsr, pre_tsr1]
Recall=[rcl_fr, rcl_fpr, rcl_tsr, rcl_tsr1]
Time_Execution= [fr, fpr, tsr, tsr1]

In [62]:
df = pd.DataFrame(list(zip(Techniques, Precision, Recall, Time_Execution)), 
                  columns =['Method', 'Precision', 'Recall', 'Time_Execution'])
df

| | Method | Precision | Recall | Time_Execution |
|---|---|---|---|---|
| 0 | Fuzz Ratio | 0.729559 | 0.737 | 2.233487 |
| 1 | Fuzz Partial Ratio | 0.703215 | 0.735 | 2.932032 |
| 2 | Token Sort Ratio | 0.730953 | 0.733 | 2.75902 |
| 3 | Token Set Ratio | 0.726553 | 0.7344 | 3.030322 |
