## Entity Resolution using PyPI Edit Distance

In [1]:
import edit_distance
import pandas as pd

In [2]:
actual_names = ['Los Angeles', 'New York City', 'Bangalore', 'Mumbai', 'Chennai', 'Kolkata', 'New Delhi',
                'Saint Petersburg', 'Melbourne', 'Gothenburg', 'Vienna', 'Barcelona', 'Las Vegas']

input_names = ['City of Los Angeles', 'New York', 'Bengaluru', 'Bombay', 'Madras', 'Calutta', 'Delhi',
               'St. Petersburg', 'Melborne', 'Goteborg', 'Wien', 'Barca', 'Las Vegas']

In [3]:
def edit_dist_metrics(actual_names, input_names, threshold):
    '''
    input:  list : actual_names list
            list : input_names list, aligned by index with actual_names (ground truth)
            float : threshold value for similarity score, 0 <= threshold <= 1

    The function compares every string in actual_names with every string in input_names
    using edit_distance and computes a similarity score. If the score is greater than or equal to
    the given threshold, the two strings are predicted to be the same entity, and the prediction
    is checked against the ground truth (a pair is correct when both names share the same index).
    The matched pairs, precision and recall are printed out.
    '''
    res = []
    for i, a_name in enumerate(actual_names):
        for j, i_name in enumerate(input_names):
            # Case-insensitive similarity ratio between the two names
            r = edit_distance.SequenceMatcher(a_name.lower(), i_name.lower()).ratio()
            if r >= threshold:
                res.append([i_name, a_name, r, i == j])

    df = pd.DataFrame(res, columns=['Input Name','Predicted Name','Similarity Score','Ground Truth'])
    true_matches = sum(df['Ground Truth'])
    # Guard against division by zero when no pair reaches the threshold
    precision = round(true_matches/len(df), 3) if len(df) else 0.0
    recall = round(true_matches/len(actual_names), 3)
    print(df,'\n')
    print("Precision: "+str(precision))
    print("Recall: "+str(recall))

### The entity resolution is run for different threshold values

### The function takes a threshold value: the similarity score at or above which a pair of strings is predicted to be a match. The threshold can range from 0 to 1, where 1 is a perfect match.
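
To make the similarity score concrete, here is a minimal sketch of a ratio of the form 2 * M / (len(a) + len(b)), where M is the number of matching characters. The scores printed in this notebook are consistent with that form (for example 16/17 = 0.941176 for 'Melborne' v 'Melbourne'). The helpers `lcs_length`, `similarity_ratio` and `is_match` are hypothetical names, and using the longest common subsequence for M is an assumption made so the sketch needs no third-party package; the `edit_distance` library may count matches differently on other inputs.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_ratio(a: str, b: str) -> float:
    """2 * M / (len(a) + len(b)); M is approximated by the LCS length (assumption)."""
    a, b = a.lower(), b.lower()  # case-insensitive, as in edit_dist_metrics
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    return 2.0 * lcs_length(a, b) / total


def is_match(a: str, b: str, threshold: float) -> bool:
    """A pair is predicted as a match when its score is at or above the threshold."""
    return similarity_ratio(a, b) >= threshold


print(round(similarity_ratio('Melbourne', 'Melborne'), 6))
print(is_match('Gothenburg', 'Goteborg', 0.6))
print(is_match('Los Angeles', 'Las Vegas', 0.6))
```

Under this assumption the three calls give 0.941176, True and False, which agrees with the scores shown in the output tables of this notebook.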

In [4]:
edit_dist_metrics(actual_names,input_names,0.4)

             Input Name    Predicted Name  Similarity Score  Ground Truth
0   City of Los Angeles       Los Angeles          0.733333          True
1             Las Vegas       Los Angeles          0.500000         False
2              New York     New York City          0.761905          True
3             Bengaluru         Bangalore          0.666667          True
4                 Barca         Bangalore          0.428571         False
5                Bombay            Mumbai          0.500000          True
6               Calutta           Kolkata          0.428571          True
7              New York         New Delhi          0.470588         False
8                 Delhi         New Delhi          0.714286          True
9        St. Petersburg  Saint Petersburg          0.800000          True
10             Goteborg  Saint Petersburg          0.416667         False
11             Melborne         Melbourne          0.941176          True
12       St. Petersburg        Gothenb

In [5]:
edit_dist_metrics(actual_names,input_names,0.5)

             Input Name    Predicted Name  Similarity Score  Ground Truth
0   City of Los Angeles       Los Angeles          0.733333          True
1             Las Vegas       Los Angeles          0.500000         False
2              New York     New York City          0.761905          True
3             Bengaluru         Bangalore          0.666667          True
4                Bombay            Mumbai          0.500000          True
5                 Delhi         New Delhi          0.714286          True
6        St. Petersburg  Saint Petersburg          0.800000          True
7              Melborne         Melbourne          0.941176          True
8              Goteborg        Gothenburg          0.777778          True
9                  Wien            Vienna          0.600000          True
10                Barca         Barcelona          0.714286          True
11            Las Vegas         Las Vegas          1.000000          True 

Precision: 0.917
Recall: 0.846


In [6]:
edit_dist_metrics(actual_names,input_names,0.6)

            Input Name    Predicted Name  Similarity Score  Ground Truth
0  City of Los Angeles       Los Angeles          0.733333          True
1             New York     New York City          0.761905          True
2            Bengaluru         Bangalore          0.666667          True
3                Delhi         New Delhi          0.714286          True
4       St. Petersburg  Saint Petersburg          0.800000          True
5             Melborne         Melbourne          0.941176          True
6             Goteborg        Gothenburg          0.777778          True
7                 Wien            Vienna          0.600000          True
8                Barca         Barcelona          0.714286          True
9            Las Vegas         Las Vegas          1.000000          True 

Precision: 1.0
Recall: 0.769


In [7]:
edit_dist_metrics(actual_names,input_names,0.7)

            Input Name    Predicted Name  Similarity Score  Ground Truth
0  City of Los Angeles       Los Angeles          0.733333          True
1             New York     New York City          0.761905          True
2                Delhi         New Delhi          0.714286          True
3       St. Petersburg  Saint Petersburg          0.800000          True
4             Melborne         Melbourne          0.941176          True
5             Goteborg        Gothenburg          0.777778          True
6                Barca         Barcelona          0.714286          True
7            Las Vegas         Las Vegas          1.000000          True 

Precision: 1.0
Recall: 0.615


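### As the threshold increases, precision increases but recall decreases: precision rises from 0.917 at a threshold of 0.5 to 1.0 at 0.6 and 0.7, while recall falls from 0.846 to 0.769 to 0.615. A threshold of 0.6 is a good choice here, as it is the lowest tested value that gives 100% precision while still keeping recall at 0.769.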
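
The precision and recall figures can be checked by hand from the printed tables. The sketch below assumes a hypothetical helper `precision_recall` and uses counts read off the outputs for thresholds 0.5, 0.6 and 0.7 (correct matches, total matches emitted), with 13 actual names. It is an illustration of the arithmetic, not part of the original function.

```python
def precision_recall(true_matches: int, predicted_matches: int, actual_entities: int):
    """Return (precision, recall) rounded to 3 decimals, guarding against empty inputs."""
    precision = true_matches / predicted_matches if predicted_matches else 0.0
    recall = true_matches / actual_entities if actual_entities else 0.0
    return round(precision, 3), round(recall, 3)


# threshold -> (rows with Ground Truth == True, total rows in the table)
counts = {0.5: (11, 12), 0.6: (10, 10), 0.7: (8, 8)}
n_actual = 13  # len(actual_names)

for threshold, (tp, predicted) in counts.items():
    precision, recall = precision_recall(tp, predicted, n_actual)
    print(f"threshold={threshold}: precision={precision}, recall={recall}")
```

Precision divides by the number of pairs the matcher emitted, while recall divides by the number of actual names, which is why raising the threshold can only remove rows from the table and therefore can never raise recall.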

### This threshold of 0.6 comfortably resolves local names against official names, like 'Goteborg' v 'Gothenburg' and 'Bengaluru' v 'Bangalore', while disambiguating similar but different names like 'Los Angeles' and 'Las Vegas'. It also resolves spelling mistakes like 'Melborne' and short forms of names like 'St. Petersburg'. It does not match names that share few characters, such as 'Bombay' v 'Mumbai' (0.5) and 'Calutta' v 'Kolkata' (0.43), which is why recall stays below 1.