# Prototype Notebook - Fuzzy String Matching
## Chris Kimber
## Insight Project

This notebook addresses the problem of matching webshop product entries to ingredients in a cleaned list. At this stage in development, an index of product names has been scraped from the Metro online grocery shop (as they appear in the product tile layout), and the ingredient list comes from the simplified-1M+ recipes database.

The challenge is that product names often contain a lot of extraneous information. In general, ingredient names are substrings of product names. The product names have not yet been cleaned, and the ingredient list contains some near-duplicate entries (partial pseudo-replication). Nevertheless, this notebook explores matching with fuzzywuzzy to remove products that do not match any ingredient well and to find the 'best' match for those that do.
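To make the substring observation concrete, here is a minimal baseline sketch that looks for ingredients appearing as whole words or phrases inside a product name. This is plain exact matching with the standard library, not the fuzzy matching used in the rest of the notebook; the function name `contained_ingredients` and the sample ingredient list are hypothetical and only for illustration.

```python
import re


def contained_ingredients(product_name, ingredients):
    """Return ingredients found as whole words/phrases in the name, longest first."""
    hits = [
        ingredient for ingredient in ingredients
        if re.search(r'\b' + re.escape(ingredient) + r'\b', product_name)
    ]
    # Longer hits are usually more specific ('ground beef' beats 'ground').
    return sorted(hits, key=len, reverse=True)


# Hypothetical sample ingredients; the product name is taken from the scraped list.
sample = ['ground', 'beef', 'ground beef', 'pepper']
print(contained_ingredients('lean ground beef, value pack', sample))
```

Exact containment misses plurals and spelling variants, which is one reason to prefer a fuzzy score.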

In [39]:
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz, process

Load the simplified-1M+ dataset and extract the ingredients list from it.

In [7]:
with np.load('/Users/chrki23/Documents/Insight_Project/data/raw/simplified-recipes-1M.npz', allow_pickle = True) as data:
    ingredients_raw = data['ingredients']

Load the scraped webshop data from a pickle file.

In [8]:
import pickle

# A context manager closes the file even if unpickling fails
with open('/Users/chrki23/Documents/Insight_Project/data/cleaned/grocery_names.data', 'rb') as file:
    names_raw = pickle.load(file)

In [9]:
print(type(names_raw))
names_raw[:5]

<class 'list'>


['Banana',
 'English cucumber',
 'Raspberries',
 'Lean Ground Beef, Value Pack',
 'White mushrooms']

Lightly preprocess both lists by converting them to lower case.

In [11]:
ingredients = [x.lower() for x in ingredients_raw]
names = [x.lower() for x in names_raw]
print(ingredients[:5], names[:5])

['salt', 'pepper', 'butter', 'garlic', 'sugar'] ['banana', 'english cucumber', 'raspberries', 'lean ground beef, value pack', 'white mushrooms']


Since the goal is to find the best match for a given product in the ingredient list, the most suitable fuzzywuzzy option seems to be extract/extractOne from the process module, which scores a target string against every string in a list and returns the highest-scoring candidates. extract uses WRatio by default, which combines several of the fuzzywuzzy metrics with different weights.
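As a rough picture of what this kind of ranking does, the sketch below scores every candidate against a query and keeps the top few. It is not fuzzywuzzy's implementation: `simple_ratio` and `top_matches` are hypothetical names, and the scorer is a plain `difflib` ratio standing in for WRatio, so the scores will differ from those in the cells that follow.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """0-100 similarity from difflib; a stand-in for a fuzzywuzzy scorer."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def top_matches(query, choices, scorer=simple_ratio, limit=5):
    """Score every choice against the query and return the best `limit` pairs."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    # Sort by score, highest first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Hypothetical mini ingredient list for illustration.
sample_ingredients = ['salt', 'banana', 'bananas', 'butter', 'mashed banana']
print(top_matches('banana', sample_ingredients, limit=3))
```

Any function taking two strings and returning a number can be passed as `scorer`, for example `fuzz.WRatio`.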

Start by testing with an example string from the products.

In [12]:
print(names[0])
test_name = names[0]
Ratios = process.extract(test_name, ingredients)
print(Ratios)

[('banana', 100), ('bananas', 92), ('a', 90), ('mashed banana', 90), ('frozen banana', 90)]


The single-letter entry 'a' scoring 90 points to a stop word problem in the ingredient list. Let's try another example.
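One possible way to deal with entries like 'a' is to filter the ingredient list before matching. The sketch below is an assumption about how that cleaning could look, not something done in this notebook: the stop word set, the minimum length of 3, and the name `drop_uninformative` are all hypothetical.

```python
# Hypothetical stop word list for illustration; a real one would need curating.
STOP_WORDS = {'a', 'an', 'the', 'of', 'and', 'or'}


def drop_uninformative(ingredients, min_length=3, stop_words=STOP_WORDS):
    """Remove stop words and very short strings, and drop exact duplicates."""
    kept = []
    seen = set()
    for ingredient in ingredients:
        cleaned = ingredient.strip().lower()
        if len(cleaned) < min_length or cleaned in stop_words or cleaned in seen:
            continue
        seen.add(cleaned)
        kept.append(cleaned)
    return kept


print(drop_uninformative(['banana', 'a', 'bananas', 'Banana', 'raw']))
```

A length cutoff would also discard any legitimate two-letter ingredient, so the value should be checked against the actual list.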

In [13]:
print(names[1])
test_name = names[1]
Ratios = process.extract(test_name, ingredients)
print(Ratios)

[('english cucumber', 100), ('cucumber', 90), ('cucumbers', 85), ('seedless cucumber', 73), ('amber', 72)]


These first examples are somewhat forgiving because they are simple (popular?) grocery products. Next, I will manually pick a non-food item that should match poorly to the ingredient list and test extract on it.

In [17]:
print(names[10980])
test_name = names[10980]
Ratios = process.extract(test_name, ingredients)
print(Ratios)

[('dairy', 72), ('peaches', 64), ('cashews', 64), ('half & half', 64), ('thai chile', 63)]


As hoped, the non-food item matches poorly: its best score is only 72. A similarity threshold above 72 therefore seems like a reasonable starting point. A couple more tests will give a rough idea of where to set the threshold, and then an automatic matcher can be written.

In [18]:
print(names[14013])
test_name = names[14013]
Ratios = process.extract(test_name, ingredients)
print(Ratios)

extra hold hairspray, style+care
[('extra', 90), ('extra firm tofu', 86), ('cream style corn', 86), ('ranch style beans', 86), ('extra light olive oil', 86)]


'extra' is a poor match for a hairspray, which again suggests the ingredient list needs cleaning. Even with 'extra' removed, the remaining matches score 86, so a useful threshold probably needs to be above 86.
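The thresholding idea can be sketched as a small helper that accepts the best match only when its score is strictly above a cutoff. The function name `best_above_threshold` is hypothetical; the score lists are copied from the outputs above (truncated to the first three entries).

```python
def best_above_threshold(matches, threshold=86):
    """Return the top (match, score) pair if its score exceeds the threshold, else None."""
    if not matches:
        return None
    best = max(matches, key=lambda pair: pair[1])
    return best if best[1] > threshold else None


# First three results from the hairspray and cucumber cells above.
hairspray = [('extra', 90), ('extra firm tofu', 86), ('cream style corn', 86)]
cucumber = [('english cucumber', 100), ('cucumber', 90), ('cucumbers', 85)]

print(best_above_threshold(hairspray, threshold=86))  # 'extra' still passes
print(best_above_threshold(hairspray, threshold=90))  # rejected
print(best_above_threshold(cucumber, threshold=90))   # genuine match kept
```

With the uncleaned list, the hairspray is only rejected once the cutoff reaches 90, because 'extra' scores 90.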

In [19]:
print(names[14053])
test_name = names[14053]
Ratios = process.extract(test_name, ingredients)
print(Ratios)

odour abosrber
[('beer', 68), ('crab', 68), ('kosher', 60), ('a', 60), ('raw', 60)]


More examples would help tune a 'useful' similarity cutoff, but in the interest of time I am moving on. First, loop through all the product names in the webshop and extract the best ingredient match for each.

In [36]:
products_table = []
for name in names:
    # extractOne returns the single best (match, score) pair for this product
    match, score = process.extractOne(name, ingredients)
    products_table.append({'name': name, 'ingredient_match': match, 'score': score})

In [37]:
products_table[0:2]

[{'name': 'banana', 'ingredient_match': 'banana', 'score': 100},
 {'name': 'english cucumber',
  'ingredient_match': 'english cucumber',
  'score': 100}]

Convert the resulting list of dictionaries to a DataFrame.

In [40]:
products_df = pd.DataFrame(products_table)

In [41]:
products_df.head()

|  | name | ingredient_match | score |
|---|---|---|---|
| 0 | banana | banana | 100 |
| 1 | english cucumber | english cucumber | 100 |
| 2 | raspberries | raspberries | 100 |
| 3 | lean ground beef, value pack | ground | 90 |
| 4 | white mushrooms | white mushrooms | 100 |


In [42]:
products_df.tail()

|  | name | ingredient_match | score |
|---|---|---|---|
| 16356 | smartfoam™ effervescent mint whitening toothpa... | mint leaves | 86 |
| 16357 | ground espelette pepper | pepper | 90 |
| 16358 | gluten free organic chewy candies | organic | 90 |
| 16359 | horseradish mustard | mustard | 90 |
| 16360 | soya and lavender scented candle, loft | lavender | 90 |


In [43]:
# The context manager flushes and closes the file once the dump finishes
with open('/Users/chrki23/Documents/Insight_Project/data/cleaned/products_matched_scores.data', 'wb') as filehandler:
    pickle.dump(products_df, filehandler)

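The head and tail already show some issues caused by poor ingredient curation: 'lean ground beef, value pack' matches 'ground', and a scented candle matches 'lavender', both with a score of 90. These will have to be addressed later. For now, the data is saved because the loop takes over an hour to run. Since poor matches still occur at a score of 90, keep only scores above 90 to get a set of hopefully 'good' products for the MVP.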
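Given the long runtime, one cheap mitigation is to score each distinct product name only once and reuse the result. This only helps if the scraped list contains repeated names, which has not been checked here. The sketch below uses hypothetical names (`match_unique`, `simple_ratio`) and a plain `difflib` ratio as a stand-in scorer, so it illustrates the caching pattern rather than reproducing the notebook's scores.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """0-100 similarity from difflib; a stand-in for a fuzzywuzzy scorer."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def match_unique(names, ingredients, scorer=simple_ratio):
    """Find the best ingredient per product, scoring each distinct name once."""
    cache = {}
    rows = []
    for name in names:
        if name not in cache:
            # Best (ingredient, score) pair; ties resolve to the earliest ingredient.
            cache[name] = max(
                ((ingredient, scorer(name, ingredient)) for ingredient in ingredients),
                key=lambda pair: pair[1],
            )
        match, score = cache[name]
        rows.append({'name': name, 'ingredient_match': match, 'score': score})
    return rows


# Hypothetical mini lists, including a repeated product name.
rows = match_unique(['banana', 'banana', 'salt'], ['salt', 'banana'])
print(rows)
```

Passing `scorer=fuzz.WRatio` would bring the scores closer to those above, although extractOne also preprocesses strings before scoring.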

In [44]:
products_good_match = products_df[products_df.score > 90]

In [45]:
products_good_match.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1601 entries, 0 to 16354
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   name              1601 non-null   object
 1   ingredient_match  1601 non-null   object
 2   score             1601 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 50.0+ KB


In [46]:
# The context manager flushes and closes the file once the dump finishes
with open('/Users/chrki23/Documents/Insight_Project/data/cleaned/products_matched_scores_filt.data', 'wb') as filehandler:
    pickle.dump(products_good_match, filehandler)

Inspecting the length, only 1,601 products, about 1 in 10, will move forward to interact with the model results in the MVP. This is quite poor, so curating the lists to improve their quality, and thereby allow a lower threshold, will be an important step in week 2. For now, the filtered set is saved for use in the MVP.
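The retention figure can be checked with a little arithmetic, and the same idea extends to comparing candidate thresholds. In the sketch below, the total of 16,361 assumes the 0 to 16360 index shown by tail() is contiguous, `retained_fraction` is a hypothetical helper, and the ten scores are just the ones visible in the head and tail above, so the sweep is illustrative rather than representative.

```python
def retained_fraction(scores, threshold):
    """Share of scores strictly above the threshold."""
    return sum(score > threshold for score in scores) / len(scores)


total_products = 16361  # assumes the index 0..16360 from tail() is contiguous
kept_products = 1601    # entries reported by products_good_match.info()
print(f'{kept_products / total_products:.1%} of products kept')

# The ten scores visible in head() and tail() above.
visible_scores = [100, 100, 100, 90, 100, 86, 90, 90, 90, 90]
for threshold in (80, 86, 90):
    print(threshold, retained_fraction(visible_scores, threshold))
```

Running the same helper over the full `score` column would show how many products each candidate threshold keeps.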