# Basic String Matching

Below is an example of calculating string distances and selecting the candidate with the shortest distance. No tie-breaking rules were used, and the text data was not pre-cleaned. The data shown in the tables below are <b>fake</b>.

#### Import Libraries:
<ol>
<li>Pandas</li>
<li>Numpy</li>
<li>JellyFish</li>
<li>IterTools</li>
</ol>

In [None]:
import pandas as pd
import numpy as np
import jellyfish as jf
import itertools as it

#### Import Datasets

<ol>
<li>
The first table holds all of the outlets you want to standardize (i.e., clean or get a suggested outlet for). It should have two columns: the first is the unstandardized (dirty) outlet name, and the second is the answer key, i.e., the correct outlet name. A correct outlet name is required so that the suggested outlet can be compared against it.
</li>

<table style="width:25%">
  <caption>Format for Table 1</caption>
    <tr>
      <th>Unstandardized Outlet Name</th>
      <th>Standardized Outlet Name</th> 
    </tr>
    <tr>
      <td>Bst Bye</td>
      <td>Best Buy</td>
    </tr>
    <tr>
      <td>Wal mert</td>
      <td>Wal-Mart</td> 
    </tr>
    <tr>
      <td>While food</td>
      <td>WholeFoods</td> 
    </tr>
</table>

<br>

<li>
The second table holds the universe of clean (standardized) outlet names. Each dirty outlet name is compared against this table to get a suggestion. For this experiment, the correct answer must exist in this table; otherwise the results will include errors caused by a missing candidate rather than by the matching method, and those are not of interest here. Such errors will certainly occur in practice, however.
</li>

<table style="width:15%">
    <caption>Format for Table 2</caption>
    <tr>
        <th>Standardized Outlet Name</th> 
    </tr>
    <tr>
        <td>Wal-Mart</td>
    </tr>
    <tr>
        <td>Best Buy</td>
    </tr>
    <tr>
        <td>WholeFoods</td>
    </tr>
</table>

</ol>

In [None]:
#Table 1
unstandard_outlets = pd.read_excel(io="",
                             names=["Unstandardized","Answer"])

#Table 2
universe_outlets = pd.read_excel(io="",
                              names=["Universe"])

#### Functions
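Define the functions that calculate the string distances here. I used five measures, listed below, but for simplicity only one function is shown as an example. Note that Soundex produces a phonetic code rather than a number, and jellyfish's Jaro and Jaro-Winkler functions return similarity scores (higher means closer), so those measures need adapting before the minimum-distance step described later.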
<ol>
<li>Soundex</li>
<li>Levenshtein Distance</li>
<li>Damerau-Levenshtein Distance</li>
<li>Jaro Distance</li>
<li>Jaro-Winkler Distance</li>
</ol>

In [None]:
def calc_leven(row):
    outlet_d = row['Unstandardized']
    outlet_u = row['Universe']
    return jf.levenshtein_distance(outlet_d, outlet_u)
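
To make concrete what a Levenshtein distance counts, here is a minimal plain-Python sketch of the standard dynamic-programming approach. The function name `levenshtein` is hypothetical and this is not jellyfish's implementation; it is only meant to illustrate the measure that `calc_leven` delegates to.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A smaller value means the two strings are closer, which is why the steps below sort from low to high.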

#### Find Shortest Distance (Best Match)
<ol>
<li>
For each outlet to clean (standardize), create a dataset with one column for the dirty (unstandardized) outlet and one column for the standardized outlets in the universe. The table should have as many rows as there are standardized outlets in the universe, with the same dirty outlet repeated on every row.
</li>
<table style="width:25%">
    <caption>Example for "Bst Bye"</caption>
    <tr>
      <th>Unstandardized Outlet Name (Dirty)</th>
      <th>Standardized Outlet (Universe)</th>
    </tr>
    <tr>
        <td>Bst Bye</td>
        <td>Wal-Mart</td>
    </tr>
    <tr>
        <td>Bst Bye</td>
        <td>Best Buy</td>
    </tr>
    <tr>
        <td>Bst Bye</td>
        <td>WholeFoods</td>
    </tr>
</table>
<br>
<li>
Add the clean/standardized version of the dirty outlet name (answer key). We need it back to compare it to the "suggested" outlet from the algorithm.
</li>
<table style="width:50%">
    <caption>Example for "Bst Bye"</caption>
    <tr>
      <th>Unstandardized Outlet Name (Dirty)</th>
      <th>Standardized Outlet (Answer/Clean)</th>
      <th>Standardized Outlet (Universe)</th>
    </tr>
    <tr>
        <td>Bst Bye</td>
        <td>Best Buy</td>
        <td>Wal-Mart</td>
    </tr>
    <tr>
        <td>Bst Bye</td>
        <td>Best Buy</td>
        <td>Best Buy</td>
    </tr>
    <tr>
        <td>Bst Bye</td>
        <td>Best Buy</td>
        <td>WholeFoods</td>
    </tr>
</table>
<br>
<li>
Calculate the distance between the dirty outlet name and each outlet in the universe. Sort the rows by distance from low to high.
</li>
<table style="width:75%">
    <caption>Example for "Bst Bye"</caption>
    <tr>
      <th>Unstandardized Outlet Name (Dirty)</th>
      <th>Standardized Outlet (Answer/Clean)</th>
      <th>Standardized Outlet (Universe)</th>
      <th>Distance Measure (Unstandard vs Universe)</th>
    </tr>
    <tr>
        <td>Bst Bye</td>
        <td>Best Buy</td>
        <td>Best Buy</td>
        <td>0</td>
    </tr>
    <tr>
        <td>Bst Bye</td>
        <td>Best Buy</td>
        <td>WholeFoods</td>
        <td>10</td>
    </tr>
    <tr>
        <td>Bst Bye</td>
        <td>Best Buy</td>
        <td>Wal-Mart</td>
        <td>12</td>
    </tr>
</table>
<br>
<li>
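Take the first row of the sorted dataset (the minimum distance). Point of concern: <i>there are no rules for breaking ties</i>, so when several candidates share the minimum, whichever happens to sort first is kept. Append the result to a running dataset. Once every outlet has been processed, the final dataset holds the suggested outlet for each dirty outlet based on the minimum distance.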
</li>
<table style="width:75%">
    <caption>Final dataset with suggested outlet</caption>
    <tr>
      <th>Unstandardized Outlet Name (Dirty)</th>
      <th>Standardized Outlet (Answer/Clean)</th>
      <th>Standardized Outlet (Universe)</th>
      <th>Distance Measure (Unstandard vs Universe)</th>
    </tr>
    <tr>
        <td>Bst Bye</td>
        <td>Best Buy</td>
        <td>Best Buy</td>
        <td>0</td>
    </tr>
</table>
</ol>

In [None]:
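# Collect the best-match row for each dirty outlet, then combine once at the end.
best_rows = []

for index, outlet in unstandard_outlets.iterrows():
    #1. Pair the dirty outlet with every outlet in the universe.
    combination = list(it.product([outlet['Unstandardized']], universe_outlets['Universe']))
    tdf = pd.DataFrame(combination, columns=['Unstandardized', 'Universe'])
    
    #2. Add the answer key for later comparison.
    tdf['Answer'] = outlet['Answer']
    
    #3. Distance between the dirty outlet and each universe outlet.
    tdf['Distance'] = tdf.apply(calc_leven, axis=1)

    #4. Keep the row with the smallest distance (ties are not broken explicitly).
    best_rows.append(tdf.sort_values('Distance').head(1))

# DataFrame.append was removed in pandas 2.0, so concatenate instead.
tdf_f = pd.concat(best_rows, ignore_index=True)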
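
Because step 4 keeps only the first sorted row, ties go unnoticed. One possible way to surface them, sketched here in plain Python under the assumption that the distances from step 3 are available as (candidate, distance) pairs, is to return every candidate that shares the minimum. The helper name `tied_best_matches` and the "Outlet A/B/C" names are hypothetical and not part of the original workflow.

```python
def tied_best_matches(scored):
    """Return (minimum distance, sorted candidates sharing that distance).

    `scored` is a list of (candidate, distance) pairs, i.e. the result of step 3.
    """
    best = min(distance for _, distance in scored)
    tied = sorted(name for name, distance in scored if distance == best)
    return best, tied


# Distances taken from the fake "Bst Bye" example: a single clear winner.
print(tied_best_matches([("Best Buy", 0), ("Wal-Mart", 12), ("WholeFoods", 10)]))
# (0, ['Best Buy'])

# Hypothetical tie: two candidates share the minimum, so both are reported.
print(tied_best_matches([("Outlet A", 2), ("Outlet B", 2), ("Outlet C", 5)]))
# (2, ['Outlet A', 'Outlet B'])
```

When the returned list has more than one entry, the suggestion is ambiguous and could be flagged for manual review instead of being silently resolved by sort order.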

#### Determining a Match

If the suggested outlet equals the clean outlet (the answer), then Match = 1; otherwise, Match = 0.

In [None]:
tdf_f['Match'] = np.where(tdf_f['Answer'] == tdf_f['Universe'], 1, 0)

The match rate is the proportion of rows with Match = 1, computed and displayed below for reference.

In [None]:
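Match_Counts = tdf_f['Match'].value_counts()
# The mean of a 0/1 column is the proportion of 1s. Unlike indexing
# Match_Counts[0] and Match_Counts[1], this also works when every row
# (or no row) is a match and one of those keys is absent.
Match_Rate = tdf_f['Match'].mean()
Match_Rate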
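
As a small worked illustration of the match-rate arithmetic, the sketch below uses a plain Python list of hypothetical 0/1 flags (not results from this experiment). The helper name `match_rate` is an assumption made for the example.

```python
# Hypothetical 0/1 match flags, one per dirty outlet (illustrative values only).
match_flags = [1, 1, 0, 1]


def match_rate(flags):
    """Return the share of flags equal to 1, or None for an empty input."""
    if not flags:
        return None
    return sum(flags) / len(flags)


print(match_rate(match_flags))  # 0.75
print(match_rate([1, 1, 1]))    # 1.0
```

The second call shows the edge case where every suggestion is correct, so there is no count of zeros to look up.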

#### Export to Excel

In [None]:
tdf_f.to_excel("",
               sheet_name="Suggested Outlets",
               index=False)