<h3 style="display:inline">Content-based Collaborative Filtering:</h3><h4 style="display:inline; margin-left:-40px;">Food Recommender System Case Study</h4>


In this case study, you will develop a food recommender system using content-based filtering. You are given records of different food recipes, together with the ratings users have given to these recipes. Your task consists of:

<ol>
    <li>Building a food recommender engine that suggests the recipes most similar to a given recipe</li>
    <li>Estimating the rating a user would give to a recipe they have never rated</li>
</ol>

<b style="color:blue">Step 1. Load the datasets</b>

In [55]:
import pandas as pd

food_df=pd.read_csv('../datasets/food_recommender_datasets/1662574418893344.csv')
food_df.head()

Unnamed: 0,Food_ID,Name,C_Type,Veg_Non,Describe
0,1,summer squash salad,Healthy Food,veg,"white balsamic vinegar, lemon juice, lemon rin..."
1,2,chicken minced salad,Healthy Food,non-veg,"olive oil, chicken mince, garlic (minced), oni..."
2,3,sweet chilli almonds,Snack,veg,"almonds whole, egg white, curry leaves, salt, ..."
3,4,tricolour salad,Healthy Food,veg,"vinegar, honey/sugar, soy sauce, salt, garlic ..."
4,5,christmas cake,Dessert,veg,"christmas dry fruits (pre-soaked), orange zest..."


<h3>Preprocessing and Feature Extraction</h3><br/>
<b style="color:blue">Step 2. Verify whether there are missing values and Impute data/Remove rows if necessary</b>

In [57]:
food_df.isna().any(axis=1).sum()

0
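
The check above returns 0, so nothing has to be imputed or removed for this dataset. For completeness, here is a minimal sketch of what the step could look like if gaps were present. The toy frame and the placeholder values ('unknown', empty string) are assumptions made for illustration, not part of the original data.

```python
import pandas as pd

# Toy frame standing in for food_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Name": ["salad", None, "cake"],
    "C_Type": ["Healthy Food", "Snack", None],
    "Describe": ["lemon juice, salt", "almonds, salt", None],
})

# Count the rows that contain at least one missing value.
n_missing_rows = toy_df.isna().any(axis=1).sum()

# A recipe without a name cannot be recommended, so drop those rows.
cleaned_df = toy_df.dropna(subset=["Name"]).copy()

# Impute the remaining gaps with neutral placeholders (an assumed policy).
cleaned_df["C_Type"] = cleaned_df["C_Type"].fillna("unknown")
cleaned_df["Describe"] = cleaned_df["Describe"].fillna("")

print(n_missing_rows)
print(cleaned_df)
```

Whether to drop or impute depends on the column: a missing description can reasonably become an empty string, while a missing identifier usually makes the row unusable.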

<b style="color:blue">Step 3. Remove unwanted characters and convert all text to lower case</b>

In [58]:
import re

food_df['Describe'] = food_df['Describe'].apply(lambda x:re.sub(r'[(),]','',x).lower())
food_df['Name'] = food_df['Name'].apply(lambda x:x.lower())
food_df['C_Type'] = food_df['C_Type'].apply(lambda x:x.lower())
food_df.head()

Unnamed: 0,Food_ID,Name,C_Type,Veg_Non,Describe
0,1,summer squash salad,healthy food,veg,white balsamic vinegar lemon juice lemon rind ...
1,2,chicken minced salad,healthy food,non-veg,olive oil chicken mince garlic minced onion sa...
2,3,sweet chilli almonds,snack,veg,almonds whole egg white curry leaves salt suga...
3,4,tricolour salad,healthy food,veg,vinegar honey/sugar soy sauce salt garlic clov...
4,5,christmas cake,dessert,veg,christmas dry fruits pre-soaked orange zest le...


<b style="color:blue">Step 4. One-hot encode C_Type</b>

In [59]:
food_df = pd.get_dummies(food_df,columns=['C_Type'])


<b style="color:blue">Step 5. Transform Veg_Non to a binary variable </b>

In [60]:
food_df['Veg_Non'] = food_df['Veg_Non'].apply(lambda x: 1 if x == 'veg' else 0)

In [61]:
food_df.head()

Unnamed: 0,Food_ID,Name,Veg_Non,Describe,C_Type_ korean,C_Type_beverage,C_Type_chinese,C_Type_dessert,C_Type_french,C_Type_healthy food,C_Type_indian,C_Type_italian,C_Type_japanese,C_Type_korean,C_Type_mexican,C_Type_nepalese,C_Type_snack,C_Type_spanish,C_Type_thai,C_Type_vietnames
0,1,summer squash salad,1,white balsamic vinegar lemon juice lemon rind ...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,2,chicken minced salad,0,olive oil chicken mince garlic minced onion sa...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,3,sweet chilli almonds,1,almonds whole egg white curry leaves salt suga...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,4,tricolour salad,1,vinegar honey/sugar soy sauce salt garlic clov...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,christmas cake,1,christmas dry fruits pre-soaked orange zest le...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
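
The output above lists both `C_Type_ korean` and `C_Type_korean`. This suggests (an assumption, since the raw labels are not shown) that at least one category value carries a leading space, so lower-casing alone leaves two distinct labels. The sketch below reproduces the effect on a toy column and shows that stripping whitespace before encoding merges the two.

```python
import pandas as pd

# Toy column; the leading space in the first label is deliberate.
toy_df = pd.DataFrame({"C_Type": [" Korean", "Korean", "Snack"]})

# Lower-casing alone keeps the leading space, so two 'korean' columns appear.
lowered = toy_df["C_Type"].str.lower()
dummies_raw = pd.get_dummies(lowered, prefix="C_Type")

# Stripping whitespace first merges the two labels into one column.
normalised = toy_df["C_Type"].str.strip().str.lower()
dummies_clean = pd.get_dummies(normalised, prefix="C_Type")

print(list(dummies_raw.columns))
print(list(dummies_clean.columns))
```

Applying the same normalisation in Step 3 would change the number of one-hot columns, and therefore the similarity values computed later.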


<b style="color:blue">Step 6. Vectorise Describe using TF-IDF and check the new data shape</b>

In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1),
                     min_df=0.0, stop_words='english')
tf_feats_describe = tf.fit_transform(food_df['Describe'])
tf_feats_describe.shape

(400, 1234)

<b style="color:blue">Step 7. Vectorise Name using TF-IDF and check the new data shape</b>

In [63]:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1),
                     min_df=0.0, stop_words='english')
tf_feats_name = tf.fit_transform(food_df['Name'])
tf_feats_name.shape

(400, 588)
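
To make the two shapes above concrete: each row of a TF-IDF matrix is one recipe, and each column is one term of the fitted vocabulary. The following sketch runs the same vectoriser on three invented ingredient strings; the toy documents are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented ingredient lists standing in for the Describe column.
toy_docs = [
    "olive oil garlic onion",
    "olive oil lemon juice",
    "sugar butter flour",
]

toy_tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 1), stop_words="english")
toy_matrix = toy_tf.fit_transform(toy_docs)  # sparse matrix: documents x terms

# Rows are documents, columns are vocabulary terms.
print(toy_matrix.shape)
print(sorted(toy_tf.vocabulary_))

# With the default settings every row is scaled to unit L2 norm.
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(row_norms)
```

Terms shared by several documents (here 'olive' and 'oil') receive a lower weight than terms that appear in only one document.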

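<b style="color:blue">Step 8. Reduce the Describe and Name TF-IDF features with truncated SVD</b>
Use n_components = 10 and 'from sklearn.decomposition import TruncatedSVD'. The pca_ prefix is kept in the column names below, although TruncatedSVD, unlike PCA, does not centre the data before decomposing it.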

In [64]:
from sklearn.decomposition import TruncatedSVD

n_pc = 10
pca_describe = TruncatedSVD(n_components=10)#10 principal components
tf_feats_describe_pca = pca_describe.fit_transform(tf_feats_describe)
#tf_feats_describe.shape
tf_feats_desc_df = pd.DataFrame(tf_feats_describe_pca,columns=['pca_desc_%d'%(i) for i in range(n_pc)])
tf_feats_desc_df.head()

Unnamed: 0,pca_desc_0,pca_desc_1,pca_desc_2,pca_desc_3,pca_desc_4,pca_desc_5,pca_desc_6,pca_desc_7,pca_desc_8,pca_desc_9
0,0.284247,-0.062287,-0.264641,-0.27105,0.08296,-0.107568,0.069868,0.05581,0.015324,-0.139829
1,0.392255,-0.111961,-0.415806,0.150743,-0.138408,0.039652,-0.001579,-0.008053,0.126398,-0.032689
2,0.318873,0.102878,0.066758,-0.032148,0.05719,-0.084277,-0.105956,-0.121203,0.249922,0.05625
3,0.264772,-0.019824,-0.218803,0.133301,0.005273,0.006286,-0.027738,0.039464,0.139446,-0.121798
4,0.088976,0.1651,-0.045557,-0.055352,0.046763,0.071036,0.141771,0.111638,-0.088913,0.001936


In [65]:
n_pc = 10
pca_name = TruncatedSVD(n_components=n_pc)  # 10 components
# fit a separate model for Name so that pca_describe stays fitted on Describe
tf_feats_name_pca = pca_name.fit_transform(tf_feats_name)
tf_feats_name_df = pd.DataFrame(tf_feats_name_pca,columns=['pca_name_%d'%(i) for i in range(n_pc)])
tf_feats_name_df.head()

Unnamed: 0,pca_name_0,pca_name_1,pca_name_2,pca_name_3,pca_name_4,pca_name_5,pca_name_6,pca_name_7,pca_name_8,pca_name_9
0,0.036675,0.019676,-0.032374,0.002551,-0.065253,0.367056,-0.310481,0.129158,0.144913,-0.151642
1,0.340218,-0.077537,-0.041473,-0.005378,-0.048795,0.287903,-0.301515,0.081607,0.117242,-0.184053
2,0.056903,0.023022,0.009557,0.004509,-0.027871,0.068497,0.043004,-0.140271,-0.079894,0.083603
3,0.033116,0.007037,-0.02258,0.008105,-0.069008,0.361292,-0.344352,0.100822,0.156287,-0.196954
4,0.006947,0.014694,-0.018714,0.453186,-0.419888,-0.238931,-0.11974,0.101354,-0.020678,-0.022727
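
One way to judge whether 10 components are enough is to look at how much variance the reduced representation retains. The sketch below does this on a small random sparse matrix; the data, the matrix size and the choice of 3 components are assumptions for illustration and say nothing about the recipe dataset itself.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Small random sparse matrix standing in for a TF-IDF matrix.
rng = np.random.default_rng(0)
dense = rng.random((20, 8))
dense[dense < 0.6] = 0.0          # zero out most entries to make it sparse
toy_sparse = csr_matrix(dense)

# TruncatedSVD accepts sparse input directly and does not centre the data.
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(toy_sparse)

retained = svd.explained_variance_ratio_.sum()
print(reduced.shape)
print(retained)
```

On the real features, `pca_describe.explained_variance_ratio_.sum()` would give the corresponding figure; a low value would suggest trying a larger number of components.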


<b style="color:blue">Step 9. Merge all columns - pca_features, food_id, veg and c_types into a new feature dataframe</b>


In [66]:
food_feats_df = pd.concat([food_df.drop(columns=['Name','Describe']),tf_feats_desc_df,tf_feats_name_df],axis=1)
food_feats_df.head()

Unnamed: 0,Food_ID,Veg_Non,C_Type_ korean,C_Type_beverage,C_Type_chinese,C_Type_dessert,C_Type_french,C_Type_healthy food,C_Type_indian,C_Type_italian,...,pca_name_0,pca_name_1,pca_name_2,pca_name_3,pca_name_4,pca_name_5,pca_name_6,pca_name_7,pca_name_8,pca_name_9
0,1,1,0,0,0,0,0,1,0,0,...,0.036675,0.019676,-0.032374,0.002551,-0.065253,0.367056,-0.310481,0.129158,0.144913,-0.151642
1,2,0,0,0,0,0,0,1,0,0,...,0.340218,-0.077537,-0.041473,-0.005378,-0.048795,0.287903,-0.301515,0.081607,0.117242,-0.184053
2,3,1,0,0,0,0,0,0,0,0,...,0.056903,0.023022,0.009557,0.004509,-0.027871,0.068497,0.043004,-0.140271,-0.079894,0.083603
3,4,1,0,0,0,0,0,1,0,0,...,0.033116,0.007037,-0.02258,0.008105,-0.069008,0.361292,-0.344352,0.100822,0.156287,-0.196954
4,5,1,0,0,0,1,0,0,0,0,...,0.006947,0.014694,-0.018714,0.453186,-0.419888,-0.238931,-0.11974,0.101354,-0.020678,-0.022727


<b style="color:blue">Step 10. Build the similarity matrix</b>

In [69]:
from sklearn.metrics.pairwise import cosine_similarity 

food_feats_df_clean = food_feats_df.drop(columns=['Food_ID'])
cosine_sim_matrix = cosine_similarity(food_feats_df_clean,food_feats_df_clean)
cosine_sim_matrix.shape

(400, 400)

In [70]:
cosine_sim_matrix

array([[ 1.        ,  0.68162744,  0.43269221, ...,  0.06431563,
         0.02802861,  0.91217603],
       [ 0.68162744,  1.        ,  0.04499624, ...,  0.09198534,
         0.03936256,  0.60176082],
       [ 0.43269221,  0.04499624,  1.        , ...,  0.04295885,
         0.02114053,  0.47767572],
       ...,
       [ 0.06431563,  0.09198534,  0.04295885, ...,  1.        ,
        -0.0076961 ,  0.06977001],
       [ 0.02802861,  0.03936256,  0.02114053, ..., -0.0076961 ,
         1.        ,  0.02372617],
       [ 0.91217603,  0.60176082,  0.47767572, ...,  0.06977001,
         0.02372617,  1.        ]])
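
Each entry of the matrix above is the cosine of the angle between two feature vectors: their dot product divided by the product of their lengths. The sketch below computes this by hand for a few invented vectors; the helper name `cosine_sim` is hypothetical and not part of scikit-learn.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 1.0, 0.0])
w = np.array([2.0, 0.0, 2.0])   # same direction as u, twice the length

print(cosine_sim(u, v))  # roughly 0.5: the vectors share one of two active features
print(cosine_sim(u, w))  # roughly 1.0: the length of a vector does not matter
```

The measure ignores the overall length of a vector but not the relative size of its entries. In this notebook the one-hot and Veg_Non columns take the value 1 while the SVD components are small fractions, so the category and vegetarian flags are likely to dominate the similarity unless the features are rescaled.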

<b style="color:blue">Step 11. Build the recommendation function</b>

<b>1. Using the similarity matrix, create a function that takes only a food_id and returns the other foods sorted in descending order of similarity</b>

The returned dataframe must have four columns: food_id, name, describe, sim

In [82]:
import numpy as np

def top_food_resemblance(food_id,cosine_matrix,food_df):
    # assumes Food_ID runs from 1 to len(food_df) in row order, so row = food_id-1
    num_foods = len(food_df)
    if food_id > num_foods or food_id < 1:
        raise ValueError('incorrect food id')

    sort_ids = np.argsort(cosine_matrix[food_id-1])[::-1]  # most similar first
    similarity_vals = cosine_matrix[food_id-1][sort_ids]
    food_df_tmp = food_df.iloc[sort_ids,:]
    output_df = pd.DataFrame({'food_id':food_df_tmp.Food_ID,'name':food_df_tmp.Name,'describe':food_df_tmp.Describe,'sim':similarity_vals})
    return output_df
    



<b style="color:blue">Step 12. Call the Function on any food id and return the dataframe with a list of suggestions</b>

In [84]:
food_id = 3
output_df = top_food_resemblance(food_id,cosine_sim_matrix,food_df)
output_df.head(10)

Unnamed: 0,food_id,name,describe,sim
2,3,sweet chilli almonds,almonds whole egg white curry leaves salt suga...,1.0
279,280,baked raw banana samosa,onion ginger curry powder fresh coriander gree...,0.949952
305,306,banana chips,dried slices of bananas fruits of herbaceous p...,0.94988
270,271,californian breakfast benedict,brioche loaf avocado paste eggs tomato spinach...,0.942452
273,274,banana phirni tartlets with fresh strawberries,basmati rice soaked in water milk cardamom pow...,0.941029
59,60,caramelized sesame smoked almonds,red lentils or masoor dal half-boiled potato g...,0.93754
247,248,cheese and ham roll,hung curd butter cream ground pimento lemon ju...,0.935924
264,265,arbi kofta with mint yogurt dip,arbi/colocasia roots water chestnut flour kutt...,0.93541
25,26,almond pearls,toasted almonds blueberries oats corn flakes o...,0.92339
16,17,baked namakpara with roasted almond dip,almonds crushed tomato garlic cloves basil spr...,0.909895


<b style="color:blue">Step 13. Create a function that predicts a user's rating for a food they have not rated yet</b>

In [85]:
rating_df = pd.read_csv('../datasets/food_recommender_datasets/ratings.csv')
rating_df.head()

Unnamed: 0,User_ID,Food_ID,Rating
0,1.0,88.0,4.0
1,1.0,46.0,3.0
2,1.0,24.0,5.0
3,1.0,25.0,4.0
4,2.0,49.0,1.0


In [136]:
import warnings
import heapq # <-- efficient top-k selection without sorting the whole list

def rate_food(user_id,food_id,cosine_matrix,rating_df,food_df,k_thr=10,sim_thr=0.0):
    n_f = len(food_df)
    if food_id > n_f or food_id < 1:
        raise ValueError('food id does not exist')
    neighs = []
    tmp_frame = rating_df[(rating_df['Food_ID']==food_id) & (rating_df['User_ID']==user_id)]
    if len(tmp_frame) != 0:
        warnings.warn('Food already rated..')
        return tmp_frame.Rating.iloc[0]  # positional access: the filtered row keeps its original index label
    else:        
        for index, row in rating_df[rating_df['User_ID']==user_id].iterrows():            
            sim_ij = cosine_matrix[food_id-1,int(row['Food_ID'])-1]
            neighs.append((sim_ij,row['Rating']))
        # keep the k_thr rated foods that are most similar to the target food
        k_neighbors = heapq.nlargest(k_thr,neighs,key=lambda t:t[0])
        
        # compute the similarity-weighted average of their ratings
        sim_tot = 0
        weighSum = 0
        for (sim_ij,r_i) in k_neighbors:
            if sim_ij > sim_thr:
                sim_tot+= sim_ij
                weighSum += sim_ij*r_i
        try:
            pred_r = weighSum/sim_tot
        except ZeroDivisionError:
            pred_r = np.mean(rating_df[rating_df['Food_ID']==food_id]['Rating'])  
        pred_r = max(pred_r,0)
        pred_r = min(pred_r,5)
        return pred_r
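
The core of the function above is a similarity-weighted average: the predicted rating is the sum of similarity times rating over the retained neighbours, divided by the sum of their similarities. The sketch below isolates that computation on invented (similarity, rating) pairs; `weighted_rating` is a hypothetical helper, and it returns None where `rate_food` falls back to the mean rating of the food.

```python
def weighted_rating(neighbours, sim_thr=0.0):
    """Similarity-weighted mean of ratings.

    neighbours: list of (similarity, rating) pairs.
    Returns None when no neighbour exceeds sim_thr.
    """
    kept = [(s, r) for s, r in neighbours if s > sim_thr]
    total = sum(s for s, _ in kept)
    if total == 0:
        return None
    return sum(s * r for s, r in kept) / total

# Invented neighbours: the pair with negative similarity is ignored.
example = [(0.9, 5.0), (0.3, 2.0), (-0.1, 1.0)]
print(weighted_rating(example))  # about 4.25, i.e. (0.9*5 + 0.3*2) / (0.9 + 0.3)
```

Because the weights are positive and sum to one after normalisation, the prediction always lies between the lowest and highest rating among the retained neighbours.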

<b style="color:blue">Step 14. Rate any food</b>

In [149]:
user_id = 9
food_id = 13

rating = rate_food(user_id,food_id,cosine_sim_matrix,rating_df,food_df)
print('Rating for foodID',food_id,'by user',user_id,':',rating)

Rating for foodID 13 by user 9 : 3.4410696586371183
