<h3 style="display:inline">Content-based Collaborative Filtering:</h3><h4 style="display:inline; margin-left:-40px;">Food Recommender System Case Study</h4>


In this case study, you will develop a food recommender system using content-based filtering. You are given records of different food recipes and the ratings users have given to these recipes. Your task consists of:

<ol>
    <li>Building a food recommender engine that suggests the recipes most similar to a given recipe</li>
    <li>Estimating a user's rating for a recipe they have never rated</li>
</ol>

<b style="color:blue">Step 1. Load the datasets</b>

In [8]:
import pandas as pd

food_df=pd.read_csv('../datasets/food_recommender_datasets/1662574418893344.csv')
food_df.head()

Unnamed: 0,Food_ID,Name,C_Type,Veg_Non,Describe
0,1,summer squash salad,Healthy Food,veg,"white balsamic vinegar, lemon juice, lemon rin..."
1,2,chicken minced salad,Healthy Food,non-veg,"olive oil, chicken mince, garlic (minced), oni..."
2,3,sweet chilli almonds,Snack,veg,"almonds whole, egg white, curry leaves, salt, ..."
3,4,tricolour salad,Healthy Food,veg,"vinegar, honey/sugar, soy sauce, salt, garlic ..."
4,5,christmas cake,Dessert,veg,"christmas dry fruits (pre-soaked), orange zest..."


<h3>Preprocessing and Feature Extraction</h3><br/>
<b style="color:blue">Step 2. Check for missing values and impute data or remove rows if necessary</b>

In [9]:
food_df.isna().any(axis=1).sum()

0
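The check above counts rows with at least one missing value, and here it finds none. If it had found some, the next step would be to locate and handle them. The following is a minimal sketch on a toy frame (the data and the names `toy_df`, `dropped_df`, and `imputed_df` are hypothetical, not part of the dataset) showing per-column counts and the two options named in the step title.

```python
import pandas as pd

# Toy frame standing in for food_df (hypothetical values).
toy_df = pd.DataFrame({
    'Name': ['soup', 'cake', None],
    'Describe': ['water salt', None, 'flour sugar'],
})

# Missing values per column, to see where the gaps are.
missing_per_column = toy_df.isna().sum()

# Option 1: remove every row with at least one missing value.
dropped_df = toy_df.dropna().reset_index(drop=True)

# Option 2: impute missing text with an empty string so the row is kept.
imputed_df = toy_df.fillna('')
```

Resetting the index after dropping rows matters later in a notebook like this one, because positional lookups into a similarity matrix assume that row order and index agree.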

<b style="color:blue">Step 3. Remove unwanted characters and convert all text to lowercase</b>

In [10]:
import re

food_df['Describe'] = food_df['Describe'].apply(lambda x:re.sub(r'[(),/]','',x).lower())
food_df['Name'] = food_df['Name'].apply(lambda x:x.lower())
food_df['C_Type'] = food_df['C_Type'].apply(lambda x:x.lower())
food_df.head()

Unnamed: 0,Food_ID,Name,C_Type,Veg_Non,Describe
0,1,summer squash salad,healthy food,veg,white balsamic vinegar lemon juice lemon rind ...
1,2,chicken minced salad,healthy food,non-veg,olive oil chicken mince garlic minced onion sa...
2,3,sweet chilli almonds,snack,veg,almonds whole egg white curry leaves salt suga...
3,4,tricolour salad,healthy food,veg,vinegar honeysugar soy sauce salt garlic clove...
4,5,christmas cake,dessert,veg,christmas dry fruits pre-soaked orange zest le...
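Note that deleting the separators outright fuses neighbouring words: row 3 above now contains `honeysugar` where the source had `honey/sugar`. One possible alternative, sketched below with a hypothetical helper `clean_text`, replaces each separator with a space and then collapses repeated whitespace. Applying it to the notebook would change the vocabulary and therefore the shapes and values printed in later steps.

```python
import re

def clean_text(text):
    """Lowercase the text and replace ( ) , / with spaces so words stay separate."""
    text = re.sub(r'[(),/]', ' ', text)
    # Collapse the runs of whitespace created by the substitution.
    return re.sub(r'\s+', ' ', text).strip().lower()

print(clean_text('vinegar, honey/sugar, soy sauce'))
# vinegar honey sugar soy sauce
```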


<b style="color:blue">Step 4. One-hot encode C_Type</b>

In [11]:
food_df = pd.get_dummies(food_df,columns=['C_Type'])

<b style="color:blue">Step 5. Transform Veg_Non to a binary variable </b>

In [12]:
food_df['Veg_Non'] = food_df['Veg_Non'].apply(lambda x: 1 if x == 'veg' else 0)

In [13]:
food_df.head()

Unnamed: 0,Food_ID,Name,Veg_Non,Describe,C_Type_ korean,C_Type_beverage,C_Type_chinese,C_Type_dessert,C_Type_french,C_Type_healthy food,C_Type_indian,C_Type_italian,C_Type_japanese,C_Type_korean,C_Type_mexican,C_Type_nepalese,C_Type_snack,C_Type_spanish,C_Type_thai,C_Type_vietnames
0,1,summer squash salad,1,white balsamic vinegar lemon juice lemon rind ...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,2,chicken minced salad,0,olive oil chicken mince garlic minced onion sa...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,3,sweet chilli almonds,1,almonds whole egg white curry leaves salt suga...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,4,tricolour salad,1,vinegar honeysugar soy sauce salt garlic clove...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,christmas cake,1,christmas dry fruits pre-soaked orange zest le...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
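The encoded frame contains both `C_Type_ korean` and `C_Type_korean`, which suggests that some `C_Type` values carry a stray leading space and were treated as a separate category. A minimal sketch of one way to avoid this, shown on a toy column with hypothetical values, is to strip whitespace before encoding. Doing so in the notebook would remove one column and slightly change the similarity values shown later.

```python
import pandas as pd

# Toy column with a stray leading space (hypothetical values).
toy_df = pd.DataFrame({'C_Type': [' Korean', 'Korean', 'Snack']})

# Normalise before encoding so both spellings map to one category.
toy_df['C_Type'] = toy_df['C_Type'].str.strip().str.lower()
encoded_df = pd.get_dummies(toy_df, columns=['C_Type'])

print(list(encoded_df.columns))
# ['C_Type_korean', 'C_Type_snack']
```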


<b style="color:blue">Step 6. Vectorise Describe using TF-IDF and check the new data shape</b>

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1),
                     min_df=0.0, stop_words='english')
tf_feats_describe = tf.fit_transform(food_df['Describe'])
tf_feats_describe.shape

(400, 1250)
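To see what the vectoriser produces, here is a small sketch on three made-up ingredient strings (the corpus is hypothetical). Each row is one document, each column one vocabulary term, and a term that appears in fewer documents receives a higher weight than one that appears in many.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus of ingredient lists.
docs = ['lemon juice salt', 'chicken salt garlic', 'lemon zest sugar']

vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), stop_words='english')
tfidf = vectorizer.fit_transform(docs)

# vocabulary_ maps each term to its column index in the matrix.
vocab = vectorizer.vocabulary_
row0 = tfidf[0].toarray().ravel()

print(tfidf.shape)  # (documents, distinct terms)
# 'juice' occurs in one document, 'salt' in two, so 'juice' weighs more in row 0.
print(row0[vocab['juice']] > row0[vocab['salt']])
```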

<b style="color:blue">Step 7. Vectorise Name using TF-IDF and check the new data shape</b>

In [15]:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1),
                     min_df=0.0, stop_words='english')
tf_feats_name = tf.fit_transform(food_df['Name'])
tf_feats_name.shape

(400, 588)

<b style="color:blue">Step 8. Transform the Describe and Name TF-IDF features using PCA</b>
Set n_components = 10 and use 'from sklearn.decomposition import TruncatedSVD'.

In [17]:
from sklearn.decomposition import TruncatedSVD

n_pc = 10  # number of components to keep
pca_describe = TruncatedSVD(n_components=n_pc)
tf_feats_describe_pca = pca_describe.fit_transform(tf_feats_describe)
tf_feats_desc_df = pd.DataFrame(tf_feats_describe_pca,columns=['pca_desc_%d'%(i) for i in range(n_pc)])
tf_feats_desc_df.head()

Unnamed: 0,pca_desc_0,pca_desc_1,pca_desc_2,pca_desc_3,pca_desc_4,pca_desc_5,pca_desc_6,pca_desc_7,pca_desc_8,pca_desc_9
0,0.283839,-0.061271,-0.267903,-0.263829,0.097769,-0.097875,0.063718,0.028383,-0.025399,-0.099756
1,0.390966,-0.112194,-0.416189,0.143468,-0.149485,0.04557,-0.008012,-0.011423,-0.12474,-0.004027
2,0.318375,0.104172,0.063715,-0.030263,0.062052,-0.087501,-0.138645,-0.187421,-0.182885,0.112693
3,0.234773,-0.093126,-0.206978,0.116605,-0.017578,0.005853,0.004593,0.04701,-0.092201,0.000155
4,0.089116,0.164535,-0.04528,-0.0536,0.044816,0.086403,0.145371,0.116911,0.068564,-0.023995


In [18]:
n_pc = 10  # number of components to keep
pca_name = TruncatedSVD(n_components=n_pc)
tf_feats_name_pca = pca_name.fit_transform(tf_feats_name)
tf_feats_name_df = pd.DataFrame(tf_feats_name_pca,columns=['pca_name_%d'%(i) for i in range(n_pc)])
tf_feats_name_df.head()

Unnamed: 0,pca_name_0,pca_name_1,pca_name_2,pca_name_3,pca_name_4,pca_name_5,pca_name_6,pca_name_7,pca_name_8,pca_name_9
0,0.036757,0.019132,-0.032075,0.00242,-0.071726,0.397194,-0.281854,0.106626,0.150609,-0.175634
1,0.340223,-0.077403,-0.041345,-0.007075,-0.05568,0.323284,-0.276088,0.065762,0.098386,-0.208115
2,0.057051,0.022189,0.00899,0.005966,-0.030988,0.065085,0.044,-0.099904,-0.028384,0.040872
3,0.033061,0.007468,-0.022531,0.005832,-0.074487,0.407478,-0.318196,0.087056,0.129986,-0.211353
4,0.006724,0.014475,-0.018324,0.451858,-0.436046,-0.198094,-0.15349,0.107566,-0.057731,-0.009076
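Two points are worth keeping in mind about this step. `TruncatedSVD` works directly on sparse matrices and does not centre the data, so the result is closer to latent semantic analysis than to textbook PCA. Also, since no `random_state` is set, the exact values above may differ between runs. The sketch below, on a hypothetical toy corpus, fixes the seed and inspects `explained_variance_ratio_` to see how much variance the retained components capture.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; the real notebook uses food_df['Describe'].
docs = [
    'lemon juice salt',
    'lemon zest sugar',
    'chicken garlic salt',
    'chicken onion garlic',
]
tfidf = TfidfVectorizer().fit_transform(docs)

# random_state makes the randomized solver reproducible.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

# Fraction of the total variance captured by the kept components.
retained = svd.explained_variance_ratio_.sum()
print(reduced.shape, retained)
```

If the retained fraction is low for the real data, trying a larger `n_components` would be a reasonable experiment.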


<b style="color:blue">Step 9. Merge all columns - pca_features, food_id, veg and c_types into a new feature dataframe</b>


In [20]:
food_feats_df = pd.concat([food_df.drop(columns=['Name','Describe']),tf_feats_desc_df,tf_feats_name_df],axis=1)
food_feats_df.head()

Unnamed: 0,Food_ID,Veg_Non,C_Type_ korean,C_Type_beverage,C_Type_chinese,C_Type_dessert,C_Type_french,C_Type_healthy food,C_Type_indian,C_Type_italian,...,pca_name_0,pca_name_1,pca_name_2,pca_name_3,pca_name_4,pca_name_5,pca_name_6,pca_name_7,pca_name_8,pca_name_9
0,1,1,0,0,0,0,0,1,0,0,...,0.036757,0.019132,-0.032075,0.00242,-0.071726,0.397194,-0.281854,0.106626,0.150609,-0.175634
1,2,0,0,0,0,0,0,1,0,0,...,0.340223,-0.077403,-0.041345,-0.007075,-0.05568,0.323284,-0.276088,0.065762,0.098386,-0.208115
2,3,1,0,0,0,0,0,0,0,0,...,0.057051,0.022189,0.00899,0.005966,-0.030988,0.065085,0.044,-0.099904,-0.028384,0.040872
3,4,1,0,0,0,0,0,1,0,0,...,0.033061,0.007468,-0.022531,0.005832,-0.074487,0.407478,-0.318196,0.087056,0.129986,-0.211353
4,5,1,0,0,0,1,0,0,0,0,...,0.006724,0.014475,-0.018324,0.451858,-0.436046,-0.198094,-0.15349,0.107566,-0.057731,-0.009076


In [21]:
food_feats_df.columns

Index(['Food_ID', 'Veg_Non', 'C_Type_ korean', 'C_Type_beverage',
       'C_Type_chinese', 'C_Type_dessert', 'C_Type_french',
       'C_Type_healthy food', 'C_Type_indian', 'C_Type_italian',
       'C_Type_japanese', 'C_Type_korean', 'C_Type_mexican', 'C_Type_nepalese',
       'C_Type_snack', 'C_Type_spanish', 'C_Type_thai', 'C_Type_vietnames',
       'pca_desc_0', 'pca_desc_1', 'pca_desc_2', 'pca_desc_3', 'pca_desc_4',
       'pca_desc_5', 'pca_desc_6', 'pca_desc_7', 'pca_desc_8', 'pca_desc_9',
       'pca_name_0', 'pca_name_1', 'pca_name_2', 'pca_name_3', 'pca_name_4',
       'pca_name_5', 'pca_name_6', 'pca_name_7', 'pca_name_8', 'pca_name_9'],
      dtype='object')

<b style="color:blue">Step 10. Build the similarity matrix</b>

In [22]:
from sklearn.metrics.pairwise import cosine_similarity 

food_feats_df_clean = food_feats_df.drop(columns=['Food_ID'])
cosine_sim_matrix = cosine_similarity(food_feats_df_clean,food_feats_df_clean)
cosine_sim_matrix.shape

(400, 400)

In [23]:
cosine_sim_matrix

array([[1.        , 0.68380929, 0.44243159, ..., 0.05940871, 0.03060034,
        0.91653207],
       [0.68380929, 1.        , 0.05232306, ..., 0.08546885, 0.04955832,
        0.59964096],
       [0.44243159, 0.05232306, 1.        , ..., 0.04573374, 0.03485136,
        0.48057429],
       ...,
       [0.05940871, 0.08546885, 0.04573374, ..., 1.        , 0.00617223,
        0.07004375],
       [0.03060034, 0.04955832, 0.03485136, ..., 0.00617223, 1.        ,
        0.02508271],
       [0.91653207, 0.59964096, 0.48057429, ..., 0.07004375, 0.02508271,
        1.        ]])
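Cosine similarity between two feature vectors is their dot product divided by the product of their lengths. The sketch below checks this by hand on a toy feature matrix (hypothetical values). Because the 0/1 columns (`Veg_Non`, `C_Type_*`) are larger in magnitude than most of the SVD components shown earlier, agreement on category probably contributes heavily to the scores in this notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy feature matrix: three items, three features (hypothetical values).
features = np.array([[1.0, 0.0, 1.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])

# Passing a single matrix compares every row with every other row.
sim = cosine_similarity(features)

# Manual computation for items 0 and 1: dot product over product of norms.
a, b = features[0], features[1]
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(sim[0, 1], manual))  # True
```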

<b style="color:blue">Step 11. Suggest Top N similar Food Recipes for a given recipe</b>

<b>1. Given the similarity matrix, create a function that takes only a food_id and returns the food ids and names sorted in descending order of similarity</b>

The returned dataframe must have four columns: food_id, name, describe, sim

In [24]:
import numpy as np

def top_food_resemblance(food_id, cosine_matrix, food_df):
    """Return all foods sorted by decreasing similarity to food_id."""
    num_foods = len(food_df)
    # Food_ID is assumed to run from 1 to num_foods, matching the row order.
    if food_id > num_foods or food_id < 1:
        raise ValueError('incorrect food id')

    sort_ids = np.argsort(cosine_matrix[food_id-1])[::-1]  # most similar first
    similarity_vals = cosine_matrix[food_id-1][sort_ids]
    food_df_tmp = food_df.iloc[sort_ids, :]
    output_df = pd.DataFrame({'food_id': food_df_tmp.Food_ID, 'name': food_df_tmp.Name,
                              'describe': food_df_tmp.Describe, 'sim': similarity_vals})
    return output_df



<b style="color:blue">Step 12. Call the Function on any food id and return the dataframe with a list of suggestions</b>

In [26]:
food_id = 1
output_df = top_food_resemblance(food_id,cosine_sim_matrix,food_df)
output_df.head(10)

Unnamed: 0,food_id,name,describe,sim
0,1,summer squash salad,white balsamic vinegar lemon juice lemon rind ...,1.0
39,40,corn and raw mango salad,corn kernels onions green onions paprika raw m...,0.986028
278,279,mixed beans salad,mixed boiled beans choose from rajma chawli ch...,0.983001
69,70,shepherds salad (tamatar-kheera salaad),1 cucumber peeled and chopped onion tomato gre...,0.979341
148,149,prawn and litchi salad,prawns shelled and cleaned spring onions mango...,0.974504
358,359,shirazi salad,spring onion cheese lemon juice cucumber onion...,0.973449
10,11,coconut lime quinoa salad,uncooked quinoa water red onion cucumber diced...,0.969144
3,4,tricolour salad,vinegar honeysugar soy sauce salt garlic clove...,0.960929
26,27,hawaiin papaya salad,papaya fresh lime juiced watermelon balls or s...,0.960773
27,28,vegetable som tam salad,raw papaya carrot french bean diamond cherry t...,0.956705


<b style="color:blue">Step 13. Load the ratings dataset, which is needed to estimate a user's rating for a food_id they have never rated</b>

In [27]:
rating_df = pd.read_csv('../datasets/food_recommender_datasets/ratings.csv')
rating_df.head()

Unnamed: 0,User_ID,Food_ID,Rating
0,1.0,88.0,4.0
1,1.0,46.0,3.0
2,1.0,24.0,5.0
3,1.0,25.0,4.0
4,2.0,49.0,1.0
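The IDs and ratings are displayed as floats. In pandas this often happens when a numeric column contains missing values, although that cannot be confirmed from the five rows shown. If it is the case here, casting an ID with `int(...)` would fail on the missing rows. The sketch below, on a toy frame with hypothetical values, drops incomplete rows and restores integer IDs.

```python
import numpy as np
import pandas as pd

# Toy ratings with one fully empty row (hypothetical values).
toy_ratings = pd.DataFrame({
    'User_ID': [1.0, 1.0, np.nan],
    'Food_ID': [88.0, 46.0, np.nan],
    'Rating': [4.0, 3.0, np.nan],
})

# Remove incomplete rows, then store the IDs as integers.
clean_ratings = toy_ratings.dropna().astype({'User_ID': int, 'Food_ID': int})
print(clean_ratings.dtypes)
```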


<b style="color:blue">Step 14. Create a function that estimates a user's rating for any food</b>

In [28]:
import warnings
import heapq  # efficient selection of the k largest items

def rate_food(user_id, food_id, cosine_matrix, rating_df, food_df, k_thr=10, sim_thr=0.0):
    """Predict user_id's rating of food_id from the user's k most similar rated foods."""
    n_f = len(food_df)
    if food_id > n_f or food_id < 1:
        raise ValueError('food id does not exist')
    tmp_frame = rating_df[(rating_df['Food_ID']==food_id) & (rating_df['User_ID']==user_id)]
    if len(tmp_frame) != 0:
        warnings.warn('Food already rated..')
        return tmp_frame['Rating'].iloc[0]  # positional access; index label 0 may not exist

    # Collect (similarity, rating) pairs for every food this user has rated.
    neighs = []
    for index, row in rating_df[rating_df['User_ID']==user_id].iterrows():
        sim_ij = cosine_matrix[food_id-1, int(row['Food_ID'])-1]
        neighs.append((sim_ij, row['Rating']))
    k_neighbors = heapq.nlargest(k_thr, neighs, key=lambda t: t[0])

    # Similarity-weighted average over the neighbours above the threshold.
    sim_tot = 0
    weighSum = 0
    for (sim_ij, r_i) in k_neighbors:
        if sim_ij > sim_thr:
            sim_tot += sim_ij
            weighSum += sim_ij*r_i
    if sim_tot != 0:
        pred_r = weighSum/sim_tot
    else:
        # No usable neighbour: fall back to the food's mean rating across all users.
        pred_r = np.mean(rating_df[rating_df['Food_ID']==food_id]['Rating'])
    # Clamp the prediction to the range [0, 5].
    return min(max(pred_r, 0), 5)

In [None]:
user_id = 9
food_id = 13

rating = rate_food(user_id,food_id,cosine_sim_matrix,rating_df,food_df)
print('Rating for foodID',food_id,'by user',user_id,':',rating)
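The prediction made by `rate_food` is a similarity-weighted average: the sum of similarity times rating, divided by the sum of similarities, over the most similar foods the user has already rated. The sketch below isolates that arithmetic in a hypothetical helper, `weighted_rating`, with made-up (similarity, rating) pairs so the effect of the weights is easy to follow.

```python
def weighted_rating(neighbours, sim_thr=0.0):
    """Similarity-weighted mean of ratings; neighbours is a list of (similarity, rating)."""
    kept = [(s, r) for s, r in neighbours if s > sim_thr]
    sim_total = sum(s for s, _ in kept)
    if sim_total == 0:
        return None  # no usable neighbour; the caller needs a fallback
    return sum(s * r for s, r in kept) / sim_total

# A very similar food rated 5 outweighs a barely similar food rated 1:
# (0.9 * 5 + 0.1 * 1) / (0.9 + 0.1) is about 4.6.
prediction = weighted_rating([(0.9, 5.0), (0.1, 1.0)])
print(round(prediction, 2))
```

An unweighted mean of the same two ratings would give 3.0, so the weighting pulls the estimate towards the ratings of the foods that most resemble the target.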