# Ingredient Standardization

## Importing packages and loading data

In [None]:
import pandas as pd
from itertools import chain
from sklearn.feature_extraction.text import TfidfVectorizer
#from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel
import re
import pickle
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn

In [None]:
# Load JSON ingredient data
recipes = pd.read_json("Documents/Food-Network/recipes.json")

In [None]:
# Load validation set
with open("Documents/Food-Network/validationSet.py", "rb") as f:
    validationSet = pickle.load(f)

## Cleaning up the text and prepping it for analysis

In [None]:
# Pull ingredients into a list
ingredients = list(chain.from_iterable(recipes.ingredients.tolist()))
ingredientsLower = [i.lower() for i in ingredients]

In [None]:
# Extract unique ingredients
ingredientsLower = list(set(ingredientsLower))

In [None]:
# Take the last word of each ingredient phrase (the "anchor" word),
# skipping any empty strings
last_tokens = {i.lower().split()[-1] for i in ingredientsLower if i.split()}

In [None]:
# Stem words and take the unique entries
ps = PorterStemmer()
last_tokens = set([ps.stem(w) for w in last_tokens])

## Method 1: TF-IDF Vectorization (Word Analyzer) and Cosine Similarity

### The TF-IDF vectorizer call below uses the "word" analyzer.

In [None]:
# Obtain a TF-IDF matrix of vectors
# to which we can apply the cosine similarity algorithm
vectorizer = TfidfVectorizer(min_df=1, analyzer="word")
tfidf = vectorizer.fit_transform(ingredientsLower)

print(tfidf[0:2])

According to https://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity, the linear kernel (a plain dot product) gives the same result as cosine similarity here, because the TF-IDF vectors are normalized to unit length.
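A quick way to see why this holds: cosine similarity divides the dot product by the product of the two vector norms, and both norms are 1 for L2-normalized vectors. The sketch below checks this in plain Python on two made-up vectors (the values are illustrative, not taken from the recipe data), assuming the default `norm="l2"` setting of `TfidfVectorizer`.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Illustrative raw weights for two "documents" over a three-word vocabulary
u = l2_normalize([1.0, 2.0, 0.0])
v = l2_normalize([2.0, 1.0, 1.0])

# For unit-length vectors the two numbers agree up to floating-point error
print(dot(u, v), cosine(u, v))
```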

In [None]:
# Get the cosine similarity of the first ingredient ("document") with every other ingredient    
cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()

In [None]:
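# What are the top five matches? (The best match is the ingredient itself.)
# A stop of -6 is needed to return five indices; -5 would return only four.
indicies = cosine_similarities.argsort()[:-6:-1]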

Now, I'll loop over all rows of the TF-IDF matrix and pull out the top five similarities for each. I compute the similarities one row at a time because computing the similarity of every document with every other document at once gave me nonsense indices that did not correspond to the ingredients dataset.
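The reversed slice used to pick the best matches is easy to get off by one, so it can help to check it on a small list. The sketch below uses plain Python in place of NumPy's `argsort`; `argsort` and `top_k_indices` here are hypothetical helpers, not part of the notebook, and the scores are made up.

```python
def argsort(scores):
    """Return indices that would sort the scores in ascending order."""
    return sorted(range(len(scores)), key=lambda idx: scores[idx])

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    # Walking backwards from the end, a stop of -(k + 1) yields exactly k items
    return argsort(scores)[:-(k + 1):-1]

# Made-up similarity scores; index 1 plays the role of the query itself
scores = [0.10, 1.00, 0.35, 0.80, 0.00, 0.55, 0.20]

print(top_k_indices(scores, 5))   # five indices, best first
print(argsort(scores)[:-5:-1])    # a stop of -5 returns only four indices
```

Because each row is compared against the whole matrix, the first index returned is normally the query ingredient itself.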

In [None]:
similarities = []

# Loop over every ingredient (the original upper bound skipped the last one)
for i in range(len(ingredientsLower)):
    cosine_similarities = linear_kernel(tfidf[i], tfidf).flatten()
    # Indices of the five highest similarities, best first
    indicies = cosine_similarities.argsort()[:-6:-1]
    similarities.append({ingredientsLower[i]: [ingredientsLower[j] for j in indicies]})

del i, vectorizer, tfidf, cosine_similarities, indicies

In [None]:
# Check the accuracy of the matches
similarities[0:5]

So this method works to a degree. But in the third entry of the similarities list, "pineapple rings" matches "onion rings" and "hot pepper rings", which is clearly inaccurate. Let's try the same approach with a different analyzer: the n-gram analyzer.
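To see what drives the match between these phrases, and what a character n-gram analyzer looks at instead, the sketch below compares the shared word tokens and the shared character trigrams for two of the ingredients mentioned above. It is plain Python rather than scikit-learn, and `char_trigrams` is a hypothetical helper used only for illustration.

```python
def char_trigrams(text, n=3):
    """Return the set of overlapping character n-grams in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

a, b = "pineapple rings", "onion rings"

# Word-level view: which whole words do the two phrases share?
shared_words = set(a.split()) & set(b.split())

# Character-level view: which three-character windows do they share?
shared_trigrams = char_trigrams(a) & char_trigrams(b)

print(shared_words)
print(sorted(shared_trigrams))
```

In both views the overlap comes from the trailing "rings", so switching analyzers changes how the overlap is counted rather than removing it.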

## Method 2: TF-IDF Vectorization (N-gram Analyzer) and Cosine Similarity 

In [None]:
# Implementing method 1 using an n-gram analyzer
# Adapted from https://bergvca.github.io/2017/10/14/super-fast-string-matching.html
def ngrams(string, n=3):
    # Strip commas, hyphens, periods and slashes (the "\sBD" pattern is carried over from the source post)
    string = re.sub(r'[,\-./]|\sBD', r'', string)
    # Slide a window of n characters across the string
    grams = zip(*[string[i:] for i in range(n)])
    return [''.join(gram) for gram in grams]

vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tfidf = vectorizer.fit_transform(ingredientsLower)

In [None]:
# Identify the similarity of every ingredient with every other ingredient and record the top five matches
similarities = []

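# Loop over every ingredient (the original upper bound skipped the last one)
for i in range(len(ingredientsLower)):
    cosine_similarities = linear_kernel(tfidf[i], tfidf).flatten()
    # Indices of the five highest similarities, best first
    indicies = cosine_similarities.argsort()[:-6:-1]
    similarities.append({ingredientsLower[i]: [ingredientsLower[j] for j in indicies]})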

similarities[0:5]

OK, looking at the third entry in similarities again, the matches are more similar. But how swappable is "pineapple juice" with "pineapple salsa"? The TF-IDF method is still not quite working. One solution is to place greater weight on words that appear at the end of a phrase. We will use the "word" analyzer to implement this method.
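One way to picture the weighting idea is a toy bag-of-words model in which the last word of each phrase counts more than the others. The sketch below is plain Python, not the `TfidfVectorizer` pipeline; the helper names, the weight of 5.0, the use of raw counts instead of TF-IDF weights, and the "orange juice" candidate are all assumptions made for illustration.

```python
import math
from collections import Counter

def weighted_bag(phrase, last_word_weight=5.0):
    """Count the words in a phrase, multiplying the final word's count by a weight."""
    words = phrase.lower().split()
    bag = {w: float(c) for w, c in Counter(words).items()}
    bag[words[-1]] *= last_word_weight
    return bag

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

query = "pineapple juice"
for candidate in ["pineapple salsa", "orange juice"]:
    # Weight 1.0 reproduces an unweighted bag of words
    plain = cosine(weighted_bag(query, 1.0), weighted_bag(candidate, 1.0))
    anchored = cosine(weighted_bag(query), weighted_bag(candidate))
    print(candidate, round(plain, 3), round(anchored, 3))
```

In this toy setting both candidates tie without weighting, while the weighted version favors the candidate that shares the final word with the query.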

## Method 3: TF-IDF Vectorization ("Anchor" Analyzer) and Cosine Similarity

In [None]:
# Obtain a TF-IDF matrix of vectors
# to which we can apply the cosine similarity algorithm
vectorizer = TfidfVectorizer(min_df=1, analyzer="word")
tfidf = vectorizer.fit_transform(ingredientsLower)

In [None]:
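# Look up the vocabulary (column) index of each anchor word.
# Note: last_tokens were stemmed earlier, so only stems that also appear
# verbatim in the vocabulary will be found here.
position = []
for i in last_tokens:
    if i in vectorizer.vocabulary_:
        position.append(vectorizer.vocabulary_[i])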

In [None]:
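# Upweight the TF-IDF columns that correspond to anchor words. The weight of 5.0 is notional.
# Note: after this step the rows are no longer unit length, so linear_kernel
# returns a weighted dot product rather than a true cosine similarity.
tfidf[:, position] *= 5.0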

In [None]:
# Obtain word similarities
similarities = []

# Loop over every ingredient (the original upper bound skipped the last one)
for i in range(len(ingredientsLower)):
    cosine_similarities = linear_kernel(tfidf[i], tfidf).flatten()
    # Indices of the five highest scores, best first
    indicies = cosine_similarities.argsort()[:-6:-1]
    similarities.append({ingredientsLower[i]: [ingredientsLower[j] for j in indicies]})

del i, vectorizer, tfidf, cosine_similarities, indicies

similarities[0:5]

What I did not realize when implementing this is that it essentially just scales up the weights on all the entries, since every ingredient has an "anchor" word. One to-do item for me is to find a smarter weighting method that weights anchor words more heavily when they're already assigned as matches to the main ingredient.
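The observation above can be checked on toy vectors: multiplying every entry by the same constant leaves the cosine similarity unchanged, while the raw dot product (which is what `linear_kernel` returns) grows. The sketch below is plain Python with made-up unit vectors; it covers uniform scaling only, which is an idealization of what happens when the anchor columns are reweighted.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two made-up unit-length vectors
u = [0.6, 0.8, 0.0]
v = [0.0, 0.6, 0.8]

# Scale every entry of both vectors by the same notional weight
scaled_u = [5.0 * x for x in u]
scaled_v = [5.0 * x for x in v]

print(cosine(u, v), cosine(scaled_u, scaled_v))  # the same, up to rounding
print(dot(u, v), dot(scaled_u, scaled_v))        # the dot product grows 25-fold
```

Since every row's dot products grow by the same factor, the ranking of matches within a row would not change under uniform scaling.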

This leaves me with a set of matches for each ingredient, some of which are relevant and some of which are not. How can we weed out the irrelevant ones? One thought is to use WordNet synsets. NLTK's WordNet interface has functions to compute path similarities, as well as to look up a word's hyponyms and hypernyms. The first method I will try is to drop matched ingredients whose path similarity to the anchor word is lower than some arbitrary threshold. Another way to determine "families" of words is to compare the hypernyms of the words within a phrase and keep only the words that fall under the same family. Instructions on using the NLTK implementation of WordNet can be found here: http://www.nltk.org/howto/wordnet.html.
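As a rough picture of the thresholding idea, the sketch below computes a path similarity of 1 / (1 + shortest path length) over a tiny hand-written is-a hierarchy and keeps only the matches that score above a cutoff. Everything here is an assumption for illustration: the hierarchy is hypothetical and is not WordNet's actual taxonomy, the helper names are made up, and the 0.3 threshold is arbitrary. It does not call NLTK.

```python
# Hypothetical is-a hierarchy (child -> parent); NOT WordNet's actual taxonomy
PARENT = {
    "juice": "beverage",
    "cider": "beverage",
    "salsa": "condiment",
    "beverage": "food",
    "condiment": "food",
}

def ancestors(word):
    """Return the chain from a word up to the root, starting with the word itself."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def path_similarity(a, b):
    """1 / (1 + number of edges on the path through the lowest common ancestor)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    common = [w for w in chain_a if w in chain_b]
    if not common:
        return None  # no connecting path in this toy hierarchy
    lowest = common[0]
    distance = chain_a.index(lowest) + chain_b.index(lowest)
    return 1.0 / (1 + distance)

def keep_matches(anchor, candidates, threshold=0.3):
    """Keep candidates whose path similarity to the anchor meets the threshold."""
    return [c for c in candidates if (path_similarity(anchor, c) or 0.0) >= threshold]

print(keep_matches("juice", ["cider", "salsa"]))
```

With real WordNet data a word can belong to several synsets, so a sense would have to be chosen for each word before comparing them.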

## Method 4: Synsets

In [None]:
# Define a function to further weed out irrelevant entries identified by TF-IDF

def cutter(textbox):
    for i in range(0, len(textbox)):
        cut = []
        anchor = "".join(textbox[i].keys()).split()[1]
        matches = [i for i in textbox[i].values()][0]
        for i in matches:
            if i.split()[1] == anchor:
                cut.append(i)
            else:
                i = i.split()