In [1]:
import pandas as pd
from gensim.models import Word2Vec
import nltk
import warnings
import ast
import numpy as np
import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score
from collections import defaultdict
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
import unidecode
from nltk import WordNetLemmatizer
import re
import itertools
from collections import Counter

warnings.filterwarnings("ignore")

# Content Based Recommender
In a content-based recommendation system, the objective is to discern the user's preferences and construct a set of features that the algorithm can utilize to suggest previously unseen recipes that align with the user's unique taste profile.

In our recommendation system, we will be using ingredients as the features to construct the item profile. We will not be constructing user profiling as we do not have an implicit feedback that we can use. This information includes number of clicks/view on a recipe, view time of recipe, etc.

To develop our recommender system, we must transform each ingredient into a vector, followed by the aggregation of these vectors within the recipe. This aggregation process involves calculating the average of these vectors, resulting in the creation of a comprehensive document vector.

In our project, our objective is to generate recipe recommendations based on a user's input list of ingredients. To achieve this, we will recommend recipes to the user by evaluating the vectorized representations of the input ingredients in relation to the mean aggregation of a list of potential ingredients. This comparison will be facilitated by assessing the recipes that exhibit the closest proximity in vector space.

Furthermore, we will employ the Jaccard Similarity score to enhance our recipe recommendations based on user-input ingredients. The Jaccard similarity metric quantifies similarity by considering the shared and distinct values between sets. In this approach, as the number of input ingredients found in a recipe increases, so does the Jaccard similarity score, leading to a more robust recommendation for the user.

In [2]:
#Import recipe dataset
df = pd.read_csv('Cleaned_recipes.csv')

In [3]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,RecipeId,RecipeName,RecipeCategory,Keywords,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,...,ingredients,ingredients_raw_str,minutes,n_steps,n_ingredients,AggregatedRating,ReviewCount,year,month,day_of_week
0,0,120,Carrot Cake II,Dessert,['Vegetable' 'Weeknight' 'Oven' '< 4 Hours'],1173.5,69.8,19.4,154.7,720.2,...,"['carrot', 'egg', 'sugar', 'all purpose flour'...","[""1 lb carrot, freshly grated "",""4 larg...",75,11,13,4.25,4.0,1999,9,6
1,1,122,Commissary Carrot Cake,Dessert,['Vegetable' 'Low Protein' 'Weeknight' 'For La...,1011.8,67.4,28.3,146.0,401.4,...,"['sugar', 'flour', 'salt', 'heavy cream', 'uns...","[""1 1/2 cups sugar"",""1/4 cup flour"",""3...",240,34,18,3.0,1.0,1999,9,3


In [4]:
#Drop Unnamed: 0 column
df.drop(columns = ['Unnamed: 0'], inplace = True)

## Preprocessing ingredients
Before we run out model, we need to ensure that the ingredients column is cleaned up before we run our Word2Vec algorithm. We will build a function to clean the ingredients.

In [5]:
#Change String to List
df['ingredients'] = df['ingredients'].apply(lambda s: list(ast.literal_eval(s)))

In [6]:
#Ingredient preprocess function
def ingredient_preprocess(ingredients):
        
    ingrd_list = []
    translator = str.maketrans('','', string.punctuation)
    num_pattern = r'[0-9]'
    non_alphabet = r'[\W_]'
    lemmatizer = WordNetLemmatizer()
    for i in ingredients:
        #remove punctuations
        items = i.translate(translator)
        
        #Making all characters lowercase
        items = i.lower()
        
        #remove any numbers
        items = re.sub(num_pattern, ' ', items)
        
        #remove accents
        items = unidecode.unidecode(items)
        
        #remove any non-alphabet characters
        items = re.sub(non_alphabet, ' ', items)
        
        #Lemmatize words
        items = lemmatizer.lemmatize(items)
        
        ingrd_list.append(items)
    return ingrd_list
   

In [7]:
#Clean up output of ingredients from recommendation
def ingredient_parser_final(ingredient):
    """
    cleanup ingredients output
    """
    if isinstance(ingredient, list):
        ingredients = ingredient
    else:
        ingredients = ast.literal_eval(ingredient)

    ingredients = ",".join(ingredients)
    ingredients = unidecode.unidecode(ingredients)
    return ingredients

# Word2Vec

There are two main training algorithms for Word2Vec gensim model. One is continous bag of words(CBOW) and the other is skip-gram. <br>

CBOW method uses context and the surrounding words to predict the middle word. Skip-gram method uses a word to predict a target context. <br>

We will be using the CBOW method for our Word2Vec model as it is a more efficient method and works well with more frequent words, which is the case for our ingredients column.

Since the CBOW method uses surrounding words to predict the middle word, the structure of the list is very important in the prediction. To standardize and ensure that we have the best accuracy in the prediction, we need to sort the ingredients list by alphabetical order.

In [8]:
#To sort ingredients list in alphabetical order
def get_and_sort_corpus(data):
    corpus_sorted = []
    for doc in data['ingredients'].values:
        doc.sort()
        corpus_sorted.append(doc)
    return corpus_sorted

In [9]:
ingredients_corpus = get_and_sort_corpus(df)

In [10]:
#Building Word2Vec model
total_lengths = [len(ingredients) for ingredients in df['ingredients']]
avg_len = sum(total_lengths) / len(total_lengths)

model_Word2Vec = Word2Vec(ingredients_corpus, 
                          sg = 0, 
                          workers = 3, 
                          min_count = 1, 
                          window = avg_len, 
                          vector_size = 100)

**Parameters** <br>
sg: CBOW (0) or skip gram (1) <br>
workers: number of patritions during training (default is 3) <br>
min_count = minimum count of words to consider when training the model <br>
window: maximum distance between target word and words around (sliding window length) <br>
vector size: dimensionality of word vectors. Smaller vector size captures more general word relationship, larger vector size captuer more nuanced and specific words. 

In [11]:
w2v = {word: model_Word2Vec.wv[word] for word in model_Word2Vec.wv.key_to_index}

Now that we have the word vectors created, we will now created the mean embedding vector for the recipe.

In [12]:
class MeanEmbeddingVectorizer(object):
    
    def __init__(self, model_Word2Vec):
        self.model_Word2Vec = model_Word2Vec
        self.vector_size = model_Word2Vec.wv.vector_size
    
    def transform(self, docs):
        doc_word_vector = self.doc_average_list(docs)
        return doc_word_vector
    
    def doc_average(self, doc):
        """
        Compute average word vector for a recipe's ingredient
        
        :param doc: list of ingredients
        :return
            mean: float of average word vectors
        
        """
        
        
        mean = []
        for word in doc:
            if word in self.model_Word2Vec.wv.index_to_key:
                mean.append(self.model_Word2Vec.wv.get_vector(word))
                
        if not mean: #empty words
            #If text empty, return vector of zeros
            return np.zeros(self.vector_size)
        else:
            mean = np.array(mean).mean(axis = 0)
            return mean
        
    def doc_average_list(self, docs):
        """
        Compute average word vector for multiple docs (doc has been tokenized)
        
        :param docs: list of recipes in list of tokens
        :return
            array of average word vector
        
        """
        return np.vstack([self.doc_average(doc) for doc in docs])

In [13]:
mean_vec_tr = MeanEmbeddingVectorizer(model_Word2Vec)

In [14]:
doc_vec = mean_vec_tr.transform(ingredients_corpus)

### Cosine similarity recommendation
Now we can build our cosine similarity recommendation system. We will have a recommendation funciton that will recommend top-N based on the highest cosine similarity score.

In [15]:
#Cosine Recommendation
def get_recommendations(N, scores):
    """
    Rank scores and output a pandas data frame containing all the details of the top N recipes.
    :param scores: list of cosine similarities
    """
    # order the scores with and filter to get the highest N scores
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:N]
    # create dataframe to load in recommendations
    recommendation = pd.DataFrame(columns=["recipe", "ingredients", "score","rating"])
    count = 0
    for i in top:
        recommendation.loc[count, "recipe"] = df["RecipeName"][i]
        recommendation.loc[count, "ingredients"] = df["ingredients"][i]
        recommendation.loc[count, "score"] = f"{scores[i]}"
        recommendation.loc[count, "rating"] = df['AggregatedRating'][i]
        count += 1
    return recommendation

def get_recs_cosine(ingredients, N=5):
    """
    Get the top N recipe recomendations.
    :param ingredients: comma seperated string listing ingredients
    :param N: number of recommendations
    """
    # load in word2vec model
    model = model_Word2Vec
    # normalize embeddings
    model.init_sims(replace=True)
    # load in data
    data = df
    # create corpus
    corpus = get_and_sort_corpus(data)

    # get average embdeddings for each document
    mean_vec_tr = MeanEmbeddingVectorizer(model_Word2Vec)
    doc_vec = mean_vec_tr.transform(corpus)
    doc_vec = [doc.reshape(1, -1) for doc in doc_vec]
    assert len(doc_vec) == len(corpus)
    

    # create embeddings for input text
    input = ingredients
    # create tokens with elements
    input = input.split(",")
    # parse ingredient list
    input = ingredient_preprocess(input)
    # get embeddings for ingredient doc
    input_embedding = mean_vec_tr.transform([input])[0].reshape(1, -1)
   
    # get cosine similarity between input embedding and all the document embeddings
    cos_sim = map(lambda x: cosine_similarity(input_embedding, x)[0][0], doc_vec)
    scores = list(cos_sim)
    # Filter top N recommendations
    recommendations = get_recommendations(N, scores)
    return recommendations

  

In [16]:
input_ingredient = "chicken, onion, spinach, garlic, pasta"
rec_cosine = get_recs_cosine(input_ingredient, N=10)
print(rec_cosine)

                                              recipe  \
0  Chicken or Beef Flavored Brown Rice Using Pamp...   
1            Nif's Butterflied Grilled Whole Chicken   
2                          Amanda's Chicken and Rice   
3   Buffalo Chicken Deviled Eggs (Aka Buffalo Horns)   
4    Too Tired, &amp; Broke, Yellow Rice and Chicken   
5                             No Noodle Chicken Soup   
6                                         Cup A Rice   
7                      Easiest Chicken Recipe of All   
8                                     Phoney Abalone   
9              Quick Chicken, Rice &amp; Veggie Soup   

                                         ingredients               score  \
0               [brown rice, butter, chicken, water]  0.7459433078765869   
1                      [chicken, dry rub seasonings]   0.743054986000061   
2                      [butter, chicken, white rice]  0.7408969402313232   
3  [butter, celery, chicken, cooked chicken, hard...   0.732718825340271   
4  

### Jaccard similarity
We will now build the recommendation system based on the highest jaccard similarity score.

In [17]:
def get_recommendations(N, scores):
    """
    Rank scores and output a pandas data frame containing all the details of the top N recipes.
    :param scores: list of cosine similarities
    """
    # order the scores with and filter to get the highest N scores
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:N]
    # create dataframe to load in recommendations
    recommendation = pd.DataFrame(columns=["recipe", "ingredients", "score", "rating"])
    count = 0
    for i in top:
        recommendation.loc[count, "recipe"] = df["RecipeName"][i]
        recommendation.loc[count, "ingredients"] = df["ingredients"][i]
        recommendation.loc[count, "score"] = f"{scores[i]}"
        recommendation.loc[count, "rating"] = df['AggregatedRating'][i]
        count += 1
    return recommendation


def get_recs_jaccard(ingredients, N=5):
    """
    Get the top N recipe recomendations.
    :param ingredients: comma seperated string listing ingredients
    :param N: number of recommendations
    """
    # load in word2vec model
    model = model_Word2Vec
    # normalize embeddings
    model.init_sims(replace=True)
    # load in data
    data = df[df['n_ingredients'] >5]
    
     # create embeddings for input text
    input = ingredients
    # create tokens with elements
    input = input.split(",")
    # parse ingredient list
    input_ingredient = ingredient_preprocess(input)
    
    scores = []
    
    for i, row in data.iterrows():
        intersection = len(set(input_ingredient).intersection(set(row['ingredients'])))
        union = len(set(input_ingredient).union(set(row['ingredients'])))
        jaccard_similarity = intersection / union
        scores.append(jaccard_similarity)
        
    # Filter top N recommendations
    recommendations = get_recommendations(N, scores)
    return recommendations

  

In [18]:
input_ingredient = "chicken, onion, spinach, garlic, pasta"
rec_jaccard = get_recs_jaccard(input_ingredient, N=10)
print(rec_jaccard)

                                   recipe  \
0                   Lemon-Buttered Salmon   
1                   Lemony Salmon Patties   
2                        Shepherd's Pie V   
3                        Spicy Fish Cakes   
4                          Karo Pecan Pie   
5                Puffy Parmesan Pinwheels   
6                     Ricotta Spinach Pie   
7  Baked Chicken with Garlic and Rosemary   
8                Aussie Tuna Summer Salad   
9      Baked Brie with Caramelized Pecans   

                                         ingredients score rating  
0  [butter, herb seasoned salt, lemon juice, papr...   0.1    4.6  
1  [all purpose flour, butter, cayenne pepper, eg...   0.1   4.42  
2  [beef, cream style corn, mashed potatoes, onio...   0.1    4.5  
3  [butter, creole seasoning, egg, fish fillets, ...   0.1    3.0  
4  [butter, dark karo syrup, egg, pecan, salt, su...   0.1    4.0  
5  [black olives, hungarian paprika, parmesan che...   0.1    5.0  
6  [butter, butter, egg yolk

### Combination of cosine and Jaccard
We will not test the the recommendation with a mix of the cosine and Jaccard similarity score and evaluate what results are produced.

In [19]:
def get_recommendations(N, combine_scores):
    """
    Rank scores and output a pandas data frame containing all the details of the top N recipes.
    :param scores: list of cosine similarities
    """
    # order the scores with and filter to get the highest N scores
    top = sorted(range(len(combine_scores)), key=lambda i: combine_scores[i], reverse=True)[:N]
    # create dataframe to load in recommendations
    recommendation = pd.DataFrame(columns=["recipe", "ingredients", "score", "rating"])
    count = 0
    for i in top:
        recommendation.loc[count, "recipe"] = df["RecipeName"][i]
        recommendation.loc[count, "ingredients"] = df["ingredients"][i]
        recommendation.loc[count, "score"] = f"{combine_scores[i]}"
        recommendation.loc[count, "rating"] = df['AggregatedRating'][i]
        count += 1
    return recommendation


def get_recs(ingredients, cosine_weight, N=5):
    """
    Get the top N recipe recomendations.
    :param ingredients: comma seperated string listing ingredients
    :param N: number of recommendations
    :param cosine_weight: weight applied to cosine similarity score; other weight used is the jaccard similarity score
    """
    assert cosine_weight <= 1
    
    # load in word2vec model
    model = model_Word2Vec
    # normalize embeddings
    model.init_sims(replace=True)
    # load in data
    data = df
    # create corpus
    corpus = get_and_sort_corpus(data)

    # get average embdeddings for each document
    mean_vec_tr = MeanEmbeddingVectorizer(model_Word2Vec)
    doc_vec = mean_vec_tr.transform(corpus)
    doc_vec = [doc.reshape(1, -1) for doc in doc_vec]
    assert len(doc_vec) == len(corpus)
    
     # create embeddings for input text
    input = ingredients
    # create tokens with elements
    input = input.split(",")
    # parse ingredient list
    input = ingredient_preprocess(input)
    # get embeddings for ingredient doc
    input_embedding = mean_vec_tr.transform([input])[0].reshape(1, -1)
    input_ingredient = input
    
   
    # get cosine similarity between input embedding and all the document embeddings
    cos_sim = map(lambda x: cosine_similarity(input_embedding, x)[0][0], doc_vec)
    cosine_scores = list(cos_sim)
    
    jaccard_scores = []
    
    for i, row in data.iterrows():
        intersection = len(set(input_ingredient).intersection(set(row['ingredients'])))
        union = len(set(input_ingredient).union(set(row['ingredients'])))
        jaccard_similarity = intersection / union
        jaccard_scores.append(jaccard_similarity)
        
    jaccard_weight = 1 - cosine_weight    
        
    combine_scores = (np.array(cosine_scores) * cosine_weight) + (np.array(jaccard_scores) * jaccard_weight)
        
        
    # Filter top N recommendations
    recommendations = get_recommendations(N, combine_scores)
    return recommendations

  

In [20]:
#Cosine_weight of 0.2
input_ingredient = "chicken, onion, spinach, garlic, pasta"
combined_similar = get_recs(input_ingredient, cosine_weight = 0.2, N = 10)
print(combined_similar)

                                              recipe  \
0            Nif's Butterflied Grilled Whole Chicken   
1    Too Tired, &amp; Broke, Yellow Rice and Chicken   
2                      Easiest Chicken Recipe of All   
3                  The Easiest Chicken in the World!   
4                             Barbecued Oven Chicken   
5                          Amanda's Chicken and Rice   
6                     Crock Pot Tasty Tomato Chicken   
7                           Rick's Crock Pot Chicken   
8                              Beerinthebutt chicken   
9  Chicken or Beef Flavored Brown Rice Using Pamp...   

                            ingredients                score rating  
0         [chicken, dry rub seasonings]  0.28194432755311327    4.0  
1                [chicken, yellow rice]   0.2791937967141469    4.0  
2                       [chicken, salt]   0.2768626441558202    3.0  
3                    [chicken, ketchup]   0.2743554939826329    4.0  
4             [barbecue sauce, ch

In [21]:
#Cosine weight of 0.5
input_ingredient = "chicken, onion, spinach, garlic, pasta"
combined_similar = get_recs(input_ingredient, N=10, cosine_weight = 0.5)
print(combined_similar)

                                              recipe  \
0            Nif's Butterflied Grilled Whole Chicken   
1    Too Tired, &amp; Broke, Yellow Rice and Chicken   
2                      Easiest Chicken Recipe of All   
3                          Amanda's Chicken and Rice   
4                  The Easiest Chicken in the World!   
5  Chicken or Beef Flavored Brown Rice Using Pamp...   
6                           Rick's Crock Pot Chicken   
7                             Barbecued Oven Chicken   
8                              Beerinthebutt chicken   
9   Buffalo Chicken Deviled Eggs (Aka Buffalo Horns)   

                                         ingredients                score  \
0                      [chicken, dry rub seasonings]  0.45486082633336383   
1                             [chicken, yellow rice]   0.4479844768842061   
2                                    [chicken, salt]   0.4421566029389699   
3                      [butter, chicken, white rice]    0.441877041544233  

In [23]:
#Cosine weight of 0.8
input_ingredient = "chicken, onion, spinach, garlic, pasta"
combined_similar = get_recs(input_ingredient, cosine_weight = 0.8, N = 10)
print(combined_similar)

                                              recipe  \
0            Nif's Butterflied Grilled Whole Chicken   
1  Chicken or Beef Flavored Brown Rice Using Pamp...   
2                          Amanda's Chicken and Rice   
3    Too Tired, &amp; Broke, Yellow Rice and Chicken   
4                      Easiest Chicken Recipe of All   
5   Buffalo Chicken Deviled Eggs (Aka Buffalo Horns)   
6                  The Easiest Chicken in the World!   
7                                     Phoney Abalone   
8  All Purpose Chicken &amp; Broth from the Crock...   
9                           Rick's Crock Pot Chicken   

                                         ingredients               score  \
0                      [chicken, dry rub seasonings]  0.6277773102124532   
1               [brown rice, butter, chicken, water]  0.6217546701431275   
2                      [butter, chicken, white rice]   0.621289016519274   
3                             [chicken, yellow rice]  0.6167751868565877   
4  

# Initial Model Iteration
With the following parameters (sg = 0, workers = 3, min_count = 1, window = 9, vector_size = 100), the model produced good results with cosine similarity, but horrible results for jaccard similarity. Our goal for the project is to produce recipes that have same ingredients and produce a grocery list of ingredients from those recipes.Even with a good cosine similarity result, the recipes produced were very simple and plain recipes. 

Since GridSearchCV doesn't work with Word2Vec, we need to manually try other hyperparameters to see if the outcomes from the new model will provide better results. The hyperparameters I believe that we won't need to change as it won't make a drastic change to the results are
- sg as we want to use only CBOW model
- workers, we are going to leave it as default
- min_count as we want all ingredients to have an embedding
- window we want to set as the average length of all documents.

The only hyperparameter that we can adjust and may see different results is the vector size. The vector size refers to the dimensionality and determine whether general word or specific nuanced relationship between words will affect the prediction.

# 2nd Iteration with vec size 50

In [24]:
#Building Word2Vec model with vec size of 50
total_lengths = [len(ingredients) for ingredients in df['ingredients']]
avg_len = sum(total_lengths) / len(total_lengths)

model_Word2Vec = Word2Vec(ingredients_corpus, 
                          sg = 0, 
                          workers = 3, 
                          min_count = 1, 
                          window = avg_len, 
                          vector_size = 50)

w2v = {word: model_Word2Vec.wv[word] for word in model_Word2Vec.wv.key_to_index}

In [25]:
#Cosine Recommendation
def get_recommendations(N, scores):
    """
    Rank scores and output a pandas data frame containing all the details of the top N recipes.
    :param scores: list of cosine similarities
    """
    # order the scores with and filter to get the highest N scores
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:N]
    # create dataframe to load in recommendations
    recommendation = pd.DataFrame(columns=["recipe", "ingredients", "score","rating"])
    count = 0
    for i in top:
        recommendation.loc[count, "recipe"] = df["RecipeName"][i]
        recommendation.loc[count, "ingredients"] = df["ingredients"][i]
        recommendation.loc[count, "score"] = f"{scores[i]}"
        recommendation.loc[count, "rating"] = df['AggregatedRating'][i]
        count += 1
    return recommendation

def get_recs_cosine_2(ingredients, N=5):
    """
    Get the top N recipe recomendations.
    :param ingredients: comma seperated string listing ingredients
    :param N: number of recommendations
    """
    # load in word2vec model
    model = model_Word2Vec
    # normalize embeddings
    model.init_sims(replace=True)
    # load in data
    data = df
    # create corpus
    corpus = get_and_sort_corpus(data)

    # get average embdeddings for each document
    mean_vec_tr = MeanEmbeddingVectorizer(model_Word2Vec)
    doc_vec = mean_vec_tr.transform(corpus)
    doc_vec = [doc.reshape(1, -1) for doc in doc_vec]
    assert len(doc_vec) == len(corpus)
    

    # create embeddings for input text
    input = ingredients
    # create tokens with elements
    input = input.split(",")
    # parse ingredient list
    input = ingredient_preprocess(input)
    # get embeddings for ingredient doc
    input_embedding = mean_vec_tr.transform([input])[0].reshape(1, -1)
   
    # get cosine similarity between input embedding and all the document embeddings
    cos_sim = map(lambda x: cosine_similarity(input_embedding, x)[0][0], doc_vec)
    scores = list(cos_sim)
    # Filter top N recommendations
    recommendations = get_recommendations(N, scores)
    return recommendations

  

In [26]:
input_ingredient = "chicken, onion, spinach, garlic, pasta"
rec_cosine_2 = get_recs_cosine_2(input_ingredient, N=10)
print(rec_cosine_2)

                                              recipe  \
0                          Amanda's Chicken and Rice   
1  Chicken or Beef Flavored Brown Rice Using Pamp...   
2            Nif's Butterflied Grilled Whole Chicken   
3   Buffalo Chicken Deviled Eggs (Aka Buffalo Horns)   
4                             No Noodle Chicken Soup   
5                                         Cup A Rice   
6                    Crock Pot Chicken With Potatoes   
7                                     Phoney Abalone   
8              Quick Chicken, Rice &amp; Veggie Soup   
9  All Purpose Chicken &amp; Broth from the Crock...   

                                         ingredients               score  \
0                      [butter, chicken, white rice]   0.759509265422821   
1               [brown rice, butter, chicken, water]  0.7584822773933411   
2                      [chicken, dry rub seasonings]  0.7480008602142334   
3  [butter, celery, chicken, cooked chicken, hard...  0.7480000257492065   
4  

In [27]:
def get_recommendations(N, scores):
    """
    Rank scores and output a pandas data frame containing all the details of the top N recipes.
    :param scores: list of cosine similarities
    """
    # order the scores with and filter to get the highest N scores
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:N]
    # create dataframe to load in recommendations
    recommendation = pd.DataFrame(columns=["recipe", "ingredients", "score", "rating"])
    count = 0
    for i in top:
        recommendation.loc[count, "recipe"] = df["RecipeName"][i]
        recommendation.loc[count, "ingredients"] = df["ingredients"][i]
        recommendation.loc[count, "score"] = f"{scores[i]}"
        recommendation.loc[count, "rating"] = df['AggregatedRating'][i]
        count += 1
    return recommendation


def get_recs_jaccard_2(ingredients, N=5):
    """
    Get the top N recipe recomendations.
    :param ingredients: comma seperated string listing ingredients
    :param N: number of recommendations
    """
    # load in word2vec model
    model = model_Word2Vec
    # normalize embeddings
    model.init_sims(replace=True)
    # load in data
    data = df[df['n_ingredients'] >5]
    
     # create embeddings for input text
    input = ingredients
    # create tokens with elements
    input = input.split(",")
    # parse ingredient list
    input_ingredient = ingredient_preprocess(input)
    
    scores = []
    
    for i, row in data.iterrows():
        intersection = len(set(input_ingredient).intersection(set(row['ingredients'])))
        union = len(set(input_ingredient).union(set(row['ingredients'])))
        jaccard_similarity = intersection / union
        scores.append(jaccard_similarity)
        
    # Filter top N recommendations
    recommendations = get_recommendations(N, scores)
    return recommendations

  

In [28]:
input_ingredient = "chicken, onion, spinach, garlic, pasta"
rec_jaccard_2 = get_recs_jaccard_2(input_ingredient, N=10)
print(rec_jaccard_2)

                                   recipe  \
0                   Lemon-Buttered Salmon   
1                   Lemony Salmon Patties   
2                        Shepherd's Pie V   
3                        Spicy Fish Cakes   
4                          Karo Pecan Pie   
5                Puffy Parmesan Pinwheels   
6                     Ricotta Spinach Pie   
7  Baked Chicken with Garlic and Rosemary   
8                Aussie Tuna Summer Salad   
9      Baked Brie with Caramelized Pecans   

                                         ingredients score rating  
0  [butter, herb seasoned salt, lemon juice, papr...   0.1    4.6  
1  [all purpose flour, butter, cayenne pepper, eg...   0.1   4.42  
2  [beef, cream style corn, mashed potatoes, onio...   0.1    4.5  
3  [butter, creole seasoning, egg, fish fillets, ...   0.1    3.0  
4  [butter, dark karo syrup, egg, pecan, salt, su...   0.1    4.0  
5  [black olives, hungarian paprika, parmesan che...   0.1    5.0  
6  [butter, butter, egg yolk

## 2nd Iteration Conclusion
We can see the cosine scores for the recipes increased drastically for the top recipes up from approximtaley 77% to 90%. For the Jaccard scores, they remain the same at 0.1.

The cosine recommendation have a few of the same recipes as the first interation, but ranked differently.

# 3rd Iteration with vec size 150

In [29]:
#Building Word2Vec model with vec size of 150
total_lengths = [len(ingredients) for ingredients in df['ingredients']]
avg_len = sum(total_lengths) / len(total_lengths)

model_Word2Vec = Word2Vec(ingredients_corpus, 
                          sg = 0, 
                          workers = 3, 
                          min_count = 1, 
                          window = avg_len, 
                          vector_size = 150)

w2v = {word: model_Word2Vec.wv[word] for word in model_Word2Vec.wv.key_to_index}

In [30]:
#Cosine Recommendation
def get_recommendations(N, scores):
    """
    Rank scores and output a pandas data frame containing all the details of the top N recipes.
    :param scores: list of cosine similarities
    """
    # order the scores with and filter to get the highest N scores
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:N]
    # create dataframe to load in recommendations
    recommendation = pd.DataFrame(columns=["recipe", "ingredients", "score","rating"])
    count = 0
    for i in top:
        recommendation.loc[count, "recipe"] = df["RecipeName"][i]
        recommendation.loc[count, "ingredients"] = df["ingredients"][i]
        recommendation.loc[count, "score"] = f"{scores[i]}"
        recommendation.loc[count, "rating"] = df['AggregatedRating'][i]
        count += 1
    return recommendation

def get_recs_cosine_3(ingredients, N=5):
    """
    Get the top N recipe recomendations.
    :param ingredients: comma seperated string listing ingredients
    :param N: number of recommendations
    """
    # load in word2vec model
    model = model_Word2Vec
    # normalize embeddings
    model.init_sims(replace=True)
    # load in data
    data = df
    # create corpus
    corpus = get_and_sort_corpus(data)

    # get average embdeddings for each document
    mean_vec_tr = MeanEmbeddingVectorizer(model_Word2Vec)
    doc_vec = mean_vec_tr.transform(corpus)
    doc_vec = [doc.reshape(1, -1) for doc in doc_vec]
    assert len(doc_vec) == len(corpus)
    

    # create embeddings for input text
    input = ingredients
    # create tokens with elements
    input = input.split(",")
    # parse ingredient list
    input = ingredient_preprocess(input)
    # get embeddings for ingredient doc
    input_embedding = mean_vec_tr.transform([input])[0].reshape(1, -1)
   
    # get cosine similarity between input embedding and all the document embeddings
    cos_sim = map(lambda x: cosine_similarity(input_embedding, x)[0][0], doc_vec)
    scores = list(cos_sim)
    # Filter top N recommendations
    recommendations = get_recommendations(N, scores)
    return recommendations

  

In [31]:
input_ingredient = "chicken, onion, spinach, garlic, pasta"
rec_cosine_3 = get_recs_cosine_3(input_ingredient, N=10)
print(rec_cosine_3)

                                              recipe  \
0  Chicken or Beef Flavored Brown Rice Using Pamp...   
1            Nif's Butterflied Grilled Whole Chicken   
2                          Amanda's Chicken and Rice   
3                                     Phoney Abalone   
4                                         Cup A Rice   
5                             No Noodle Chicken Soup   
6   Buffalo Chicken Deviled Eggs (Aka Buffalo Horns)   
7                    Crock Pot Chicken With Potatoes   
8  All Purpose Chicken &amp; Broth from the Crock...   
9                   Quick and Easy Blackened Chicken   

                                         ingredients               score  \
0               [brown rice, butter, chicken, water]  0.7620192766189575   
1                      [chicken, dry rub seasonings]   0.760386049747467   
2                      [butter, chicken, white rice]  0.7549707889556885   
3  [chicken, chicken breasts, cracker crumbs, egg...  0.7502956390380859   
4  

In [32]:
def get_recommendations(N, scores):
    """
    Rank scores and output a pandas data frame containing all the details of the top N recipes.
    :param scores: list of cosine similarities
    """
    # order the scores with and filter to get the highest N scores
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:N]
    # create dataframe to load in recommendations
    recommendation = pd.DataFrame(columns=["recipe", "ingredients", "score", "rating"])
    count = 0
    for i in top:
        recommendation.loc[count, "recipe"] = df["RecipeName"][i]
        recommendation.loc[count, "ingredients"] = df["ingredients"][i]
        recommendation.loc[count, "score"] = f"{scores[i]}"
        recommendation.loc[count, "rating"] = df['AggregatedRating'][i]
        count += 1
    return recommendation


def get_recs_jaccard_3(ingredients, N=5):
    """
    Get the top N recipe recomendations.
    :param ingredients: comma seperated string listing ingredients
    :param N: number of recommendations
    """
    # load in word2vec model
    model = model_Word2Vec
    # normalize embeddings
    model.init_sims(replace=True)
    # load in data
    data = df[df['n_ingredients'] >5]
    
     # create embeddings for input text
    input = ingredients
    # create tokens with elements
    input = input.split(",")
    # parse ingredient list
    input_ingredient = ingredient_preprocess(input)
    
    scores = []
    
    for i, row in data.iterrows():
        intersection = len(set(input_ingredient).intersection(set(row['ingredients'])))
        union = len(set(input_ingredient).union(set(row['ingredients'])))
        jaccard_similarity = intersection / union
        scores.append(jaccard_similarity)
        
    # Filter top N recommendations
    recommendations = get_recommendations(N, scores)
    return recommendations

  

In [33]:
input_ingredient = "chicken, onion, spinach, garlic, pasta"
rec_jaccard_3 = get_recs_jaccard_3(input_ingredient, N=10)
print(rec_jaccard_3)

                                   recipe  \
0                   Lemon-Buttered Salmon   
1                   Lemony Salmon Patties   
2                        Shepherd's Pie V   
3                        Spicy Fish Cakes   
4                          Karo Pecan Pie   
5                Puffy Parmesan Pinwheels   
6                     Ricotta Spinach Pie   
7  Baked Chicken with Garlic and Rosemary   
8                Aussie Tuna Summer Salad   
9      Baked Brie with Caramelized Pecans   

                                         ingredients score rating  
0  [butter, herb seasoned salt, lemon juice, papr...   0.1    4.6  
1  [all purpose flour, butter, cayenne pepper, eg...   0.1   4.42  
2  [beef, cream style corn, mashed potatoes, onio...   0.1    4.5  
3  [butter, creole seasoning, egg, fish fillets, ...   0.1    3.0  
4  [butter, dark karo syrup, egg, pecan, salt, su...   0.1    4.0  
5  [black olives, hungarian paprika, parmesan che...   0.1    5.0  
6  [butter, butter, egg yolk

# Vector Size Conclusion
Based on a the three choices for vector size (50, 100, 150), 150 attained the highest cosine similarity score while vector size of 100 attained the lowest cosine similarity score. It seems that a more general relationship between words allowed for the model to attain a better performance. As such we will use the 50 as our vector size for the model.

# Test trial with Content-Based Recommender

In [34]:
input_ingredient = "beef, potato, rice, pepper"
rec_cosine_3 = get_recs_cosine_3(input_ingredient, N=10)
print(rec_cosine_3)

                                  recipe  \
0                Man-Style Spanish Steak   
1                J.w.'s Quick Coq Au Vin   
2          South African Chutney Chicken   
3  Easy Delicious Slow Cooker Roast Beef   
4            Sweet and Sour Cabbage Stew   
5        Yorkshire Corned Beef Hash Soup   
6                      Lo Sung Beef Soup   
7              Beef and Black Bean Sauce   
8                         Plum Pot Roast   
9                       Chinese Stir-Fry   

                                         ingredients               score  \
0  [bay leaf, canned tomatoes, chuck steaks, dry ...  0.8232737183570862   
1  [beef bouillon cube, chicken, dry red wine, li...  0.8156026601791382   
2      [chicken thighs, chutney, dry onion soup mix]  0.7985493540763855   
3  [bay leaves, beef roast, black pepper, dry oni...  0.7871913313865662   
4  [beef stew meat, cabbage, chili sauce, jellied...  0.7787079215049744   
5  [bay leaves, beef stock cube, button mushrooms...  0.777

In [35]:
input_ingredient = "fish, carrot, rice, celery"
rec_cosine_3 = get_recs_cosine_3(input_ingredient, N=10)
print(rec_cosine_3)

                                            recipe  \
0                                 Lao Papaya Salad   
1            Thai Catfish Salad (Yam Pla Dook Foo)   
2  Delicious Indian Spicy Chicken Sandwich Filling   
3                  Vietnamese Chicken Lettuce Cups   
4           Ohn-No-Kauk-Swe (Burmese Chicken Soup)   
5                                Aussie Rice Salad   
6                 Chez's Apricot and Mango Chicken   
7       Indonesian Chicken Noodle Soup (Soto Ayam)   
8                                            Laksa   
9                  Creamy Garlicy Seafood Marinara   

                                         ingredients               score  \
0  [cherry tomatoes, chilies, crab, fish, fish sa...  0.8127458095550537   
1  [birds eye chiles, breadcrumb, brown sugar, ca...  0.7886019349098206   
2  [chestnut, chicken breast, chicken stock powde...  0.7845373749732971   
3  [carrot, chicken, chili, iceberg lettuce, mint...  0.7814772129058838   
4  [boiling water, boilin

# Conclusion
Using Jaccard similarity score does not produce any good results for our recommendation as all the scores are very low. The low scores are indicating that not a lot of recipes contain the same ingredients as the input ingredients. Even when using a hybrid score of Jaccard and Cosine, when the Jaccard socre weight increases, the overall score decreases. Given that our initial goal of recommending a list of recipes with the same ingredients but differing taste, our model using the jaccard score is not doing a good job. We will need to dropthe jaccard similarity score in our content-based recommender system and just use the cosine similarity score. 

Our cosine recommender system runs with with multiple different input ingredients receiving cosine scores between 70% to 80%.

Recapping on our conclusion, we will be using the following hyperparameters for our Word2Vec model for the most optimized cosine similarity score:  sg = 0, workers = 3, min_count = 1, window = average length of document in corpuse, vector_size = 50. 

The downside of how our content-based recommender system is that it doesn't take into account of the history of the user's like. We may be recommending certain items that although has a high cosine similarity score, may not be liked by the user. One way we can solve this is to combine both the content-based recommender and collaborative recommender, which will be discussed in the hybrid model notebook.

The pros of using the content-based recommender system is that it can capture specific recipes that the user inputs, which the collaborative model does not do.