## Faezeh Yazdi Capstone Project - Food Recommendation System

## Notebook: Content-based recommendation system
This notebook aims to build a recommendation system by finding recipes that are similar to each other, based on different features.

In [2]:
#import packages
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity 


In [3]:
#pd.set_option('display.max_colwidth', None)
# Set the maximum number of columns to display to None
#pd.set_option('display.max_columns', None)

In [4]:
np.random.seed(0)

Because the dataset is very large and our computational power is limited, we build our models on a sample of the dataset. This smaller dataset can be selected at random, or it can be the one we have filtered for recipes with more than 

In [5]:
#use a small sample to speed up the computation

#df = pd.read_parquet("cleanedRecipedf.parquet")
#big_df, small_df = train_test_split(df, test_size=0.05, random_state=25, stratify=df["AggregatedRating"])
#small_df.reset_index(inplace=True)
#df=small_df

In [6]:
df = pd.read_parquet('Transformed Data/ Cleaned-Sampled-Recipes.parquet')

In [7]:
df.head(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,...,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeInstructions,PrepTimeHours,CookingTimeHours,TotalTimeHours,CaloriesCategory,rateTwoScale
0,45809,Bourbon Chicken,58278,LinMarie,2002-11-12 20:13:00+00:00,I searched and finally found this recipe on th...,[https://img.sndimg.com/food/image/upload/w_55...,Chicken Breast,"[Chicken, Poultry, Meat, Chinese, Asian, High ...","[2, 1 -2, 1, 1⁄4, 3⁄4, 1⁄4, 1⁄3, 2, 1, 1⁄2, 1⁄3]",...,23.4,0.3,21.5,50.1,[Editor's Note: Named Bourbon Chicken because...,0.25,0.333333,0.583333,lessthan550,1
1,2886,Best Banana Bread,1762,lkadlec,1999-09-26 20:49:00+00:00,Make and share this Best Banana Bread recipe f...,[https://img.sndimg.com/food/image/upload/w_55...,Quick Breads,"[Breads, Fruit, Oven, < 4 Hours]","[1⁄2, 1, 2, 3, 1 1⁄2, 1, 1⁄2, 1⁄2]",...,42.5,1.4,24.4,3.7,"[Remove odd pots and pans from oven., Preheat ...",0.166667,1.0,1.166667,lessthan550,1


### Preprocessing Data types
Before using our columns, the types of some of them should change, namely `Keywords` and `RecipeIngredientParts`. Each cell in these two columns holds an array of strings, and we want a single plain string so that the columns are easier to work with.

In [8]:
#data type before change
type(df['Keywords'][0])

numpy.ndarray

In [9]:
#change column type Keywords, RecipeIngredientParts

def set_to_string (s):
    try:
        rec = ''
        for i in range(len(s)):
            rec = rec + s[i] + ', '
        return rec
    except:
        return 'Nan'
    
df["Keywords"] = df["Keywords"].apply(lambda x: set_to_string(x))
df["RecipeIngredientParts"] = df["RecipeIngredientParts"].apply(lambda x: set_to_string(x))

In [10]:
#sanity check
#data type after change
type(df['Keywords'][0])

str
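
The loop in `set_to_string` is equivalent to joining the items and appending a trailing separator to each. The sketch below shows this on a plain Python list, used here as a stand-in for the NumPy array stored in each cell; the helper name `join_items` is hypothetical and not part of the notebook.

```python
def join_items(items):
    """Join strings the way set_to_string does, e.g. 'a, b, ' (note the trailing ', ')."""
    try:
        return ''.join(item + ', ' for item in items)
    except TypeError:
        # Missing values (None, NaN) are not iterable, so fall back to a marker string
        return 'Nan'

keywords = ['Breads', 'Fruit', 'Oven', '< 4 Hours']
print(join_items(keywords))  # Breads, Fruit, Oven, < 4 Hours, 
print(join_items(None))      # Nan
```

The trailing `', '` matters later, because the phrase vectorizer splits these strings on commas.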

### Choosing Model Features

We do not use the rating, the review count, the author, or the recipe name as similarity indicators.

In [11]:
X = df[['Description','RecipeCategory', 'Keywords','RecipeIngredientParts',
        'Calories', 'FatContent', 'SaturatedFatContent', 
        'CholesterolContent',
       'SodiumContent', 'CarbohydrateContent', 'FiberContent', 'SugarContent',
       'ProteinContent', 'PrepTimeHours',
       'CookingTimeHours', 'TotalTimeHours']]

#separating text and numeric columns
X_non_numeric = ['Description','RecipeCategory', 'Keywords','RecipeIngredientParts']
X_numeric = ['Calories', 'FatContent', 'SaturatedFatContent', 
        'CholesterolContent',
       'SodiumContent', 'CarbohydrateContent', 'FiberContent', 'SugarContent',
       'ProteinContent', 'PrepTimeHours',
       'CookingTimeHours', 'TotalTimeHours']

### Custom text transformer

Based on the analysis of the `Description` column in the previous notebook, we designed the following custom text vectorizer.

In [12]:
# `description` column tokenizer


#import:
#stemmer
import nltk
stemmer = nltk.stem.PorterStemmer()
# import stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords 
# import punctuation list
import string

#remove digit
def removedigits(s):
    result = ''.join([i for i in s if not i.isdigit()])
    return result

def removebadcharacters(s):
    s = s.replace('\r','')
    s = s.replace('\n','')
    s = s.replace('\nPs','')
    s = s.replace('\ni','')
    s = s.replace('\nthi','')
    return s
    

#load the English stopwords from NLTK
ENGLISH_STOP_WORDS = stopwords.words('english')

# define custom tokenizer
def my_tokenizer(sentence):
    
    #remove numbers
    sentence = removedigits(sentence)
    sentence = removebadcharacters(sentence)
    
    # remove punctuation and set to lower case
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark,'').lower()

    # split sentence into words
    listofwords = sentence.split(' ')
    listofstemmed_words = []
    
    # remove stopwords and any tokens that are just empty strings
    for word in listofwords:
        if (not word in ENGLISH_STOP_WORDS) and (word!=''):
            # Stem words
            stemmed_word = stemmer.stem(word)
            listofstemmed_words.append(stemmed_word)

    return listofstemmed_words

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Arianayazdi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Then we transform the description column using TF-IDF, which is more suitable here than bag of words. Because TF-IDF gives more weight to the words that are specific to each description, we will be able to find similarities better.
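
To make the weighting concrete, the sketch below computes TF-IDF by hand for three toy descriptions. It uses the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The toy corpus and the function name `tfidf` are illustrative assumptions, not taken from the notebook.

```python
import math
from collections import Counter

# Toy, already-tokenized descriptions (illustrative, not from the dataset)
docs = [
    ['easy', 'chicken', 'recipe'],
    ['easy', 'banana', 'bread', 'recipe'],
    ['easy', 'lemon', 'chicken'],
]

def tfidf(docs):
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: counts[t] * idf[t] for t in counts}
        # L2-normalize so that every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

vectors = tfidf(docs)
# 'easy' occurs in every description and gets the lowest weight;
# 'banana' occurs in only one description and gets a higher weight.
print(vectors[1])
```

A plain bag-of-words count would give `easy` and `banana` the same weight in the second description, which is why the rarer, more descriptive terms dominate the similarity under TF-IDF.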

In [13]:
#description column transformer
My_normal_TFIDF_vectorizer = TfidfVectorizer(tokenizer = my_tokenizer, min_df = 10, max_df = 0.9)

#keywords and ingredients
My_phrase_vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(','),min_df=10)

### Columns transformer
Transform text to numeric values and scale the numeric columns, so that we have a fully numeric table of all features. For the `Description` column we use the tokenizer defined above with the TF-IDF transformer; for ingredients and keywords we also use TF-IDF, but with a tokenizer that separates the phrases on ','; for the `RecipeCategory` column we use a one-hot encoder to transform each category into a column. <br>Note: for the reasons behind choosing each of these transformers, please see the Text cleaning notebook.
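
The comma-based tokenizer keeps multi-word phrases such as `< 30 Mins` together as single tokens. The sketch below shows what it returns for a keyword string in the format produced earlier. The second function, `split_phrases_clean`, is a hypothetical variant that is not used in this notebook; adopting it may change the resulting vocabulary and feature counts.

```python
def split_phrases(text):
    """Split on commas, as the phrase vectorizer's tokenizer does."""
    return text.split(',')

def split_phrases_clean(text):
    """Hypothetical variant: strip whitespace and drop empty tokens."""
    return [t.strip() for t in text.split(',') if t.strip()]

keywords = 'Asian, < 30 Mins, Easy, '
print(split_phrases(keywords))        # ['Asian', ' < 30 Mins', ' Easy', ' ']
print(split_phrases_clean(keywords))  # ['Asian', '< 30 Mins', 'Easy']
```

With the plain split, tokens after the first keep a leading space and the trailing separator yields a whitespace-only token, so the same phrase can appear as two different features depending on its position in the list.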

In [14]:
preprocessor = ColumnTransformer(
    transformers=[
        ('RecipeCategory_transform', OneHotEncoder(), ['RecipeCategory']),
        ('description_transform', My_normal_TFIDF_vectorizer, 'Description'),
        ('keywords_transform', My_phrase_vectorizer, 'Keywords'),
        ('ingredients_transform', My_phrase_vectorizer, 'RecipeIngredientParts'),
        ('num', StandardScaler(), X_numeric)
    ])

# Fit the column transformer to the data and transform the input
X_transformed = preprocessor.fit_transform(df)

# Output the transformed data
X_transformed.toarray()

array([[ 0.        ,  0.        ,  0.        , ..., -0.05368676,
        -0.04060806, -0.05300098],
       [ 0.        ,  0.        ,  0.        , ..., -0.06646121,
        -0.01351221, -0.03008818],
       [ 0.        ,  0.        ,  0.        , ..., -0.07923566,
         0.31163793,  0.28087122],
       ...,
       [ 0.        ,  0.        ,  0.        , ..., -0.06646121,
        -0.04873681, -0.06413005],
       [ 0.        ,  0.        ,  0.        , ..., -0.08690033,
        -0.04534983, -0.066094  ],
       [ 0.        ,  0.        ,  0.        , ..., -0.06646121,
         0.18970663,  0.16630723]])

In [15]:
X_transformed.shape

(9750, 3084)

Now we have all the recipes in rows and about 3K different features in columns.
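
The nutrition and time columns are measured on very different scales, which is why the column transformer standardizes them. As a minimal pure-Python sketch of that step: each value has the column mean subtracted and is divided by the population standard deviation, which is what `StandardScaler` computes. The helper name `standardize` is hypothetical, and the calorie values are taken from the sample output shown later in this notebook.

```python
import math

def standardize(values):
    """Scale values to zero mean and unit variance (population standard deviation)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

calories = [425.9, 560.3, 407.9, 299.3, 582.8]
scaled = standardize(calories)
print([round(z, 3) for z in scaled])
```

After scaling, a column such as `SodiumContent` no longer dominates the similarity just because its raw numbers are larger than those of `FiberContent`.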

## Find Similarity

Here, based on all 3K features, we compute the similarity between every pair of recipes.

In [15]:
#use cosine similarity for similar recipes
similarities = cosine_similarity(X_transformed, dense_output=False)

In [16]:
similarities

<9750x9750 sparse matrix of type '<class 'numpy.float64'>'
	with 95062500 stored elements in Compressed Sparse Row format>

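We now have a 9750 x 9750 matrix holding the similarity between every pair of recipes. Although it is stored in a sparse format, all 95,062,500 entries are stored, so in practice it is dense.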
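
Cosine similarity is the dot product of two feature vectors divided by the product of their lengths. In general it ranges from -1 to 1; when all features are non-negative, as with TF-IDF weights alone, it falls between 0 and 1. The sketch below works through the formula on made-up three-feature vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up feature vectors: the two pies share features, the soup shares none
pie_a = [1.0, 1.0, 0.0]
pie_b = [1.0, 0.5, 0.0]
soup = [0.0, 0.0, 1.0]

print(round(cosine(pie_a, pie_b), 3))  # 0.949
print(cosine(pie_a, soup))             # 0.0
```

Because the measure depends on direction and not on vector length, a recipe with a long description is not automatically more similar to everything else.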

## RECIPE RECOMMENDER
In this step, we define a function that receives a recipe name and returns the most similar recipes that have more than a given number of reviews (10 by default).

In [17]:
def content_recommender(recipename, df, similarities, RevCount_threshold=10) :
    
    # Get the recipe by the name
    Recipe_index = df[df['Name'] == recipename].index
    
    # Create a dataframe with info of recipe
    sim_df = pd.DataFrame(
        {'RecipeName': df['Name'],'RecipeId':df['RecipeId'],'Description': df['Description'],'rating':df['AggregatedRating'],'rev':df['ReviewCount'],
         'cal':df['Calories'],'Keywords':df['Keywords'],'ingredients':df['RecipeIngredientParts'],'FatContent':df['FatContent'],
         'SaturatedFatContent':df['SaturatedFatContent'], 'CholesterolContent':df['CholesterolContent'],'SodiumContent':df['SodiumContent'],
         'CarbohydrateContent':df['CarbohydrateContent'],'FiberContent':df['FiberContent'],'SugarContent':df['SugarContent'],
         'ProteinContent':df['ProteinContent'],'PrepTimeHours':df['PrepTimeHours'],'CookingTimeHours':df['CookingTimeHours'],
         'TotalTimeHours':df['TotalTimeHours'],'image':df['image'],
         'similarity': np.array(similarities[Recipe_index, :].todense()).squeeze()
        })
    
    # Get the top 10 recipes with more than RevCount_threshold reviews
    top_recipe = sim_df[sim_df['rev'] > RevCount_threshold].sort_values(by='similarity', ascending=False).head(10)
    
    #drop the input recipe
    top_recipe = top_recipe[top_recipe['RecipeName'] != recipename]
    
    return top_recipe
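
The core logic of the function above is filter, sort, and take the top results. The sketch below restates it in plain Python on made-up data. The name `recommend` and its parameters are hypothetical. Unlike the function above, it removes the query recipe before taking the top `k`, so it can return a full `k` results.

```python
def recommend(query, names, review_counts, sim_row, min_reviews=10, k=10):
    """Return up to k (name, similarity) pairs most similar to `query`."""
    candidates = [
        (name, sim)
        for name, reviews, sim in zip(names, review_counts, sim_row)
        if reviews > min_reviews and name != query
    ]
    # Most similar first
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]

names = ['Peanut Butter Pie', 'Lemon Pie', 'Chess Pie', 'Crab Dip']
review_counts = [30, 25, 24, 3]
sim_row = [1.0, 0.55, 0.52, 0.10]  # made-up similarities to the query recipe

print(recommend('Peanut Butter Pie', names, review_counts, sim_row, min_reviews=4, k=2))
# [('Lemon Pie', 0.55), ('Chess Pie', 0.52)]
```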

## Evaluating the model
Now let's choose a recipe and find similar ones.

In [18]:
#sample the df to choose a name you like
df.sample(3)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,...,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeInstructions,PrepTimeHours,CookingTimeHours,TotalTimeHours,CaloriesCategory,rateTwoScale
7398,206003,Linda's Thai Sweet Chili Sauce for Dipping (...,68526,Lindas Busy Kitchen,2007-01-17 15:18:00+00:00,I created this recipe to make Thai Chicken Win...,[https://img.sndimg.com/food/image/upload/w_55...,Sauces,"Asian, < 30 Mins, Easy,","[2, 1, 2, 1⁄3, 2, 1, 2, 1⁄4]",...,41.8,0.2,33.3,0.2,[In a small pan add the first 6 ingredients to...,0.083333,0.25,0.333333,lessthan550,1
7767,16957,Hot Creamy Crab Dip,27416,William Uncle Bill,2002-01-08 09:43:00+00:00,An excellent easy to make appetizer. I have be...,[https://img.sndimg.com/food/image/upload/w_55...,Spreads,"Lunch/Snacks, Crab, Canadian, < 30 Mins, Oven,...","[8, 1⁄2, 7 1⁄2, 4, 1, 1⁄8]",...,1.2,0.1,0.7,5.7,"[Preheat oven to 350 F degrees., In a mixing b...",0.166667,0.333333,0.5,lessthan550,1
7638,137627,Lemon Chicken-Just Like Take out !,162888,Chef Dee,2005-09-15 19:45:00+00:00,"Lightly breaded and a distinct lemon flavor, t...",[https://img.sndimg.com/food/image/upload/w_55...,Chicken,"Poultry, Meat, Asian, Savory, < 60 Mins, Easy,","[6, 2, 1⁄4, 1⁄2, 2, 2, 1⁄2, 2, 2, 1 1⁄2, 2]",...,25.1,0.1,17.1,29.1,"[Preheat oven to 325., Cut each chicken breast...",0.333333,0.5,0.833333,between550-4K,1


Some recipe names to try: Peanut Butter Pie, Lemon Chicken-Just Like Take out !.

In [27]:
#input the chosen name to below function
similar_recipe = content_recommender('Peanut Butter Pie', df, similarities, RevCount_threshold=4)

# see the top 5 recommended recipes
similar_recipe.head(5)

Unnamed: 0,RecipeName,RecipeId,Description,rating,rev,cal,Keywords,ingredients,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,PrepTimeHours,CookingTimeHours,TotalTimeHours,similarity
7330,Lemon Pie,8264,Make and share this Lemon Pie recipe from Food...,5.0,25.0,425.9,"Dessert, < 15 Mins,","sugar, all-purpose flour, cornstarch, salt, bo...",13.2,3.6,110.7,364.9,74.2,0.7,50.4,4.1,0.0,0.0,0.0,0.550458
9148,No Bake Creamy Peanut Butter Fudge Pie,35084,I found this recipe on a message board a whil...,5.0,22.0,560.3,"Dessert, < 15 Mins, Easy,","cream cheese, powdered sugar, creamy peanut bu...",38.7,20.9,31.2,329.5,49.5,1.4,40.2,7.5,0.166667,0.0,0.166667,0.546098
7903,Chess Pie,8167,Make and share this Chess Pie recipe from Food...,5.0,24.0,407.9,"Dessert, < 60 Mins, Oven, Easy,","eggs, sugar, butter, vinegar, cornmeal, flour,...",20.9,8.4,144.3,271.6,49.3,1.2,33.5,6.3,0.25,0.583333,0.833333,0.525053
9537,Orange Creamsicle Pie,95375,Make and share this Orange Creamsicle Pie reci...,5.0,21.0,299.3,"Dessert, < 15 Mins, For Large Groups, Beginner...","cream cheese, instant vanilla pudding, orange ...",17.5,10.5,15.6,297.4,33.8,0.3,27.2,2.8,0.166667,0.0,0.166667,0.522662
376,Butterfinger Pie,29478,This chilled pie will be gone right before you...,5.0,188.0,582.8,"Dessert, < 15 Mins, Easy,","cream cheese, Cool Whip, graham cracker crust,",36.3,20.5,31.2,374.2,62.7,1.3,42.5,5.8,0.166667,0.0,0.166667,0.519569


Now, let's redo it with fewer but more important features: keywords and ingredients.

In [20]:
X_important = df[['Keywords','RecipeIngredientParts']]

In [21]:
preprocessor_second = ColumnTransformer(
    transformers=[
        ('keywords_transform', My_phrase_vectorizer, 'Keywords'),
        ('ingredients_transform', My_phrase_vectorizer, 'RecipeIngredientParts')
    ])

# Fit the column transformer to the data and transform the input
# by default, ColumnTransformer drops the columns that are not listed in its transformers
X_important_transformed = preprocessor_second.fit_transform(df)

# Output the transformed data
X_important_transformed.toarray()

array([[0.09092719, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.16601963, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.0957935 , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.11343809, 0.        , 0.29045596, ..., 0.        , 0.        ,
        0.        ],
       [0.13384394, 0.38896704, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.20175782, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [22]:
X_important_transformed.shape

(9750, 954)

Now we have about one third of the features of the previous model (954 instead of 3,084).

In [23]:
#use cosine similarity for similar recipes
similarities_second = cosine_similarity(X_important_transformed, dense_output=False)

In [24]:
similarities_second

<9750x9750 sparse matrix of type '<class 'numpy.float64'>'
	with 95055660 stored elements in Compressed Sparse Row format>

In [None]:
#import joblib

# Pickle the similarity matrix so that the web application can load it
#joblib.dump(similarities_second, "similarities_second_image.pkl")

To build a web application, we retrained the model on data that excludes recipes without an image and saved the similarities in a pkl file. The code above is therefore commented out to avoid overwriting that file.
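
As a minimal sketch of this save-and-load round trip, the example below uses the standard-library `pickle` module as a stand-in for `joblib`, with a tiny nested list in place of the real similarity matrix. The file name `similarities_toy.pkl` and the temporary directory are assumptions made for illustration only.

```python
import os
import pickle
import tempfile

# Tiny stand-in for the real similarity matrix
similarities_toy = [[1.0, 0.5], [0.5, 1.0]]

# Write to a temporary directory so that no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'similarities_toy.pkl')

with open(path, 'wb') as f:
    pickle.dump(similarities_toy, f)

# Later, for example when the web application starts, load it back
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded == similarities_toy)  # True
```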

In [25]:
def content_recommender_second(recipename, df, similarities, RevCount_threshold=10) :
    
    # Get the recipe by the name
    Recipe_index = df[df['Name'] == recipename].index
    
    # Create a dataframe with info of recipe
    sim_df = pd.DataFrame(
        {'RecipeName': df['Name'],'RecipeId':df['RecipeId'],'Description': df['Description'],'rating':df['AggregatedRating'],'rev':df['ReviewCount'],
         'category':df['RecipeCategory'],'cal':df['Calories'],'Keywords':df['Keywords'],'ingredients':df['RecipeIngredientParts'],
         'similarity': np.array(similarities[Recipe_index, :].todense()).squeeze()
        })
    
    # Get the top 10 recipes with more than RevCount_threshold reviews
    top_recipe = sim_df[sim_df['rev'] > RevCount_threshold].sort_values(by='similarity', ascending=False).head(10)
    
    #drop the input recipe
    top_recipe = top_recipe[top_recipe['RecipeName'] != recipename]
    
    return top_recipe

In [28]:
#input the chosen name to below function
similar_recipe = content_recommender_second('Peanut Butter Pie', df, similarities_second, RevCount_threshold=4)

# see the top 5 recommended recipes
similar_recipe.head(5)

Unnamed: 0,RecipeName,RecipeId,Description,rating,rev,category,cal,Keywords,ingredients,similarity
2514,Best Ever Buckeyes,16716,Small peanut butter balls that taste even bett...,5.0,56.0,Candy,151.8,"Dessert, Fruit, Peanut Butter, Kid Friendly, C...","peanut butter, margarine, butter, powdered sug...",0.542836
2337,Super-Easy Microwave Peanut Butter Fudge,42547,"This is a very easy, delicious peanut butter f...",5.0,59.0,Candy,136.0,"Dessert, Fruit, Nuts, Microwave, < 15 Mins, Ea...","peanut butter,",0.542686
1394,Peanut Butter Frosting,33520,This is soooo good on a chocolate cake! I usua...,5.0,83.0,Dessert,288.7,"Kid Friendly, Kosher, < 15 Mins, Easy,","peanut butter, margarine, powdered sugar, milk,",0.534876
6356,White Chocolate No-Bake Cheesecake Pie,24423,Make and share this White Chocolate No-Bake Ch...,5.0,28.0,Cheesecake,547.2,"Dessert, Cheese, European, Kid Friendly, Potlu...","white chocolate chips, cream cheese, 9-inch gr...",0.534001
6155,"Chocolate, Butterscotch, Pb Rice Krispies Treats",100298,"Born with a Rice Krispie treat in my hand, you...",5.0,29.0,Bar Cookie,56.4,"Dessert, Lunch/Snacks, Cookie & Brownie, Potlu...","peanut butter, corn syrup, sugar,",0.533334


As we see, the two models give two different answers, both of which are acceptable and related recipes. However, the weakness of this approach is that we do not have a numeric criterion to evaluate the model. In the next notebook we will address this problem by using other recommendation methods.