## Cleaning Strings

We'll be working with the restaurants DataFrame, which holds data on various restaurants. Our ultimate goal is to create a restaurant recommendation engine, but first we need to clean the data.

In [1]:
# Import pandas
import pandas as pd

# Import process from fuzzywuzzy
from fuzzywuzzy import process



In [2]:
# Read the CSV file
data = pd.read_csv("datasets/restaurants_L2.csv")

# Print a summary of the DataFrame
print(data.info())

# Take a look at the first rows of the data
print(data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336 entries, 0 to 335
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    336 non-null    int64 
 1   name          336 non-null    object
 2   addr          336 non-null    object
 3   city          336 non-null    object
 4   phone         336 non-null    int64 
 5   type          336 non-null    object
 6   cuisine_type  336 non-null    object
dtypes: int64(2), object(5)
memory usage: 18.5+ KB
None
   Unnamed: 0                       name                       addr  \
0           0  arnie morton's of chicago   435 s. la cienega blv .    
1           1         art's delicatessen       12224 ventura blvd.    
2           2                  campanile       624 s. la brea ave.    
3           3                      fenix    8358 sunset blvd. west    
4           4         grill on the alley           9560 dayton way    

          city       phone      typ

In [4]:
# Store the unique values of cuisine_type in unique_types
unique_types = data["cuisine_type"].unique()

# print the unique values
print(unique_types)

['america' 'mercian' 'amurican' 'americen' 'americann' 'asiian' 'italian'
 'asiann' 'asian' 'american' 'co0ffebar' 'assina' 'southwestern'
 'steakhouses' 'southern' 'mexicam' 'mexicana' 'itallian' 'talina'
 'mexico' 'coffee bar' 'cofebar' 'cooffeebar' 'coffeebar' 'mexicann'
 'mejicana' 'mexican' 'itlian']


In [7]:
# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = len(unique_types)))

[('asian', 100), ('asiian', 91), ('asiann', 91), ('assina', 73), ('italian', 67), ('amurican', 62), ('american', 62), ('itallian', 62), ('americann', 57), ('talina', 55), ('itlian', 55), ('mexicana', 54), ('mexicann', 54), ('mejicana', 54), ('america', 50), ('mercian', 50), ('mexican', 50), ('americen', 46), ('southwestern', 36), ('southern', 31), ('co0ffebar', 26), ('coffee bar', 26), ('cooffeebar', 26), ('coffeebar', 26), ('steakhouses', 25), ('mexico', 18), ('mexicam', 17), ('cofebar', 17)]
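To build some intuition for what these scores mean, the sketch below computes a simple ratio-style similarity with Python's standard-library `difflib` rather than fuzzywuzzy. This is an assumption made for illustration: `process.extract` applies its own default scorer, so its numbers need not agree with this one in general. The helper name `simple_score` is hypothetical.

```python
from difflib import SequenceMatcher


def simple_score(a: str, b: str) -> int:
    """Return a 0-100 similarity: 2 * matched characters / total characters."""
    # Stand-in scorer for illustration only, not fuzzywuzzy's own scorer
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Compare 'asian' against a few values taken from unique_types
for candidate in ["asian", "asiann", "assina", "southern"]:
    print(candidate, simple_score("asian", candidate))
```

An identical string scores 100, a one-letter typo stays close to that, and an unrelated category falls far below.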


In [8]:
# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit = len(unique_types)))

[('american', 100), ('americann', 94), ('america', 93), ('amurican', 88), ('americen', 88), ('mercian', 80), ('mexican', 80), ('mexicana', 75), ('mexicann', 75), ('mejicana', 75), ('mexicam', 67), ('asian', 62), ('asiian', 57), ('asiann', 57), ('mexico', 57), ('italian', 53), ('itallian', 50), ('assina', 43), ('talina', 43), ('itlian', 43), ('southwestern', 41), ('southern', 38), ('cofebar', 27), ('co0ffebar', 24), ('coffeebar', 24), ('coffee bar', 22), ('cooffeebar', 22), ('steakhouses', 21)]


In [9]:
# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit = len(unique_types)))

[('italian', 100), ('itallian', 93), ('itlian', 92), ('talina', 77), ('asian', 67), ('asiian', 62), ('asiann', 62), ('mercian', 43), ('mexican', 43), ('amurican', 40), ('american', 40), ('mexicana', 40), ('mexicann', 40), ('mejicana', 40), ('americann', 38), ('assina', 31), ('america', 29), ('mexicam', 29), ('americen', 27), ('southern', 27), ('southwestern', 26), ('steakhouses', 26), ('mexico', 15), ('cofebar', 14), ('co0ffebar', 12), ('coffee bar', 12), ('cooffeebar', 12), ('coffeebar', 12)]


Take a look at the output. What do you think the similarity cutoff point should be when remapping categories?

80 could be a good cutoff point.
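To see what a cutoff of 80 keeps and what it leaves out, here is a small sketch that filters a handful of (value, score) pairs copied from the 'italian' output above. No fuzzywuzzy call is made; the names `scored`, `to_remap`, and `left_alone` are illustrative.

```python
# A few (value, score) pairs copied from the 'italian' output above
scored = [("italian", 100), ("itallian", 93), ("itlian", 92),
          ("talina", 77), ("asian", 67), ("mexican", 43)]

CUTOFF = 80  # similarity threshold suggested in the text

# Values at or above the cutoff would be remapped to 'italian'
to_remap = [value for value, score in scored if score >= CUTOFF]

# Values below the cutoff would be left untouched
left_alone = [value for value, score in scored if score < CUTOFF]

print(to_remap)
print(left_alone)
```

Note that 'talina' scores 77, so a cutoff of 80 does not catch every misspelling.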

## Remapping categories 
In the last exercise, we determined that the similarity cutoff point for remapping typos of 'american', 'asian', and 'italian' in the cuisine_type column should be 80.

Here, we're going to put it all together by using the extract() function from fuzzywuzzy.process to find matches with similarity scores greater than or equal to 80.

In [13]:
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', data['cuisine_type'], limit = len(data))

# Inspect the first 5 matches
print(matches[0:5])

[('italian', 100, 6), ('italian', 100, 10), ('italian', 100, 11), ('italian', 100, 16), ('italian', 100, 19)]
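Because the choices here are a pandas Series, each match is a three-element tuple rather than a pair. As far as we can tell from the output, the elements are the matched value, its similarity score, and the index label of the row it came from. The sketch below unpacks the tuples printed above; it makes no fuzzywuzzy call, and the variable names are illustrative.

```python
# The first five matches printed above, copied verbatim
matches = [("italian", 100, 6), ("italian", 100, 10), ("italian", 100, 11),
           ("italian", 100, 16), ("italian", 100, 19)]

# Unpack each tuple into value, score, and row index,
# keeping the index of every match that meets the cutoff
rows_at_or_above_cutoff = [index for value, score, index in matches if score >= 80]

print(rows_at_or_above_cutoff)
```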


Now we're getting somewhere! We can iterate through matches and reassign similar entries.

In [16]:
# Iterate through the list of matches to 'italian'
for match in matches:

    # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:

        # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
        data.loc[data['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'

In [18]:
# Store again the unique values of cuisine_type in unique_types to see changes
unique_types2 = data["cuisine_type"].unique()

# print the unique values 
print(unique_types2)

['america' 'mercian' 'amurican' 'americen' 'americann' 'asiian' 'italian'
 'asiann' 'asian' 'american' 'co0ffebar' 'assina' 'southwestern'
 'steakhouses' 'southern' 'mexicam' 'mexicana' 'talina' 'mexico'
 'coffee bar' 'cofebar' 'cooffeebar' 'coffeebar' 'mexicann' 'mejicana'
 'mexican']


In [23]:
# Adapt the code to work with every restaurant type in unique_types

# Iterate through unique_types
for cuisine in unique_types:
    # Create a list of matches, comparing cuisine with the cuisine_type column
    matches = process.extract(cuisine, data['cuisine_type'], limit=len(data.cuisine_type))

    # Iterate through the list of matches
    for match in matches:
        # Check whether the similarity score is greater than or equal to 80
        if match[1] >= 80:
            # If it is, select all rows where cuisine_type is spelled this way,
            # and set only the cuisine_type column to the current cuisine
            data.loc[data['cuisine_type'] == match[0], 'cuisine_type'] = cuisine

# Inspect the final result
print(data['cuisine_type'].unique())


['mexican' 'asian' 'italian' 'coffeebar' 'assina' 'southern' 'steakhouses'
 'talina' 'mexico']
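The final list still contains entries such as 'assina' and 'talina', and 'american' no longer appears. That is most likely because 'mexican' scored exactly 80 against 'american' earlier, so looping over every unique value lets one category overwrite another. One possible refinement, sketched below, is to match only against a hand-picked list of correct categories and keep the single best match. The sketch uses the standard-library `difflib.get_close_matches` in place of fuzzywuzzy, so its scores are not the ones shown above; the `categories` list and the `remap` helper are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical list of correct categories, chosen by hand
categories = ["american", "asian", "italian", "mexican"]


def remap(value, categories, cutoff=0.8):
    """Return the closest category at or above cutoff, else the value unchanged."""
    # difflib scores range from 0 to 1, so 0.8 plays the role of 80
    best = get_close_matches(value, categories, n=1, cutoff=cutoff)
    return best[0] if best else value


# A few raw values taken from unique_types
raw = ["americann", "itallian", "asiann", "southern"]
cleaned = [remap(value, categories) for value in raw]

print(cleaned)
```

Values with no close category, such as 'southern', pass through unchanged instead of being forced into the nearest label.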


## Linkage

Record linkage is the act of linking data from different sources regarding the same entity. Unlike joins, record linkage does not require exact matches between pairs of records; instead, it can find close matches using string similarity. This makes record linkage effective when the data sources share no common unique key, such as a unique identifier, that you can rely on when linking them.
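A minimal sketch of the idea, assuming two small lists of restaurant names with no shared key. It uses `difflib.get_close_matches` from the standard library as a stand-in similarity measure; dedicated record linkage tools work differently and do considerably more. The contents of `source_b` and the `link_records` helper are hypothetical.

```python
from difflib import get_close_matches

# Names from the restaurants data shown earlier
source_a = ["arnie morton's of chicago", "art's delicatessen", "campanile"]

# Hypothetical second source with slightly different spellings
source_b = ["arnie mortons of chicago", "arts delicatessen", "cafe bizou"]


def link_records(left, right, cutoff=0.8):
    """Pair each name in left with its closest name in right, or None."""
    links = []
    for name in left:
        candidates = get_close_matches(name, right, n=1, cutoff=cutoff)
        links.append((name, candidates[0] if candidates else None))
    return links


for pair in link_records(source_a, source_b):
    print(pair)
```

An exact join on the name would link none of these records, while the similarity-based pairing links the two near-duplicates and leaves 'campanile' unmatched.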