<h3 style="display:inline">Content-based Collaborative Filtering:</h3><h4 style="display:inline; margin-left:-40px;">Food Recommender System Case Study</h4>


In this case study, you are asked to develop a food recommender system using content-based filtering. You are given records of different types of food recipes and the ratings users have given to these recipes. Your task consists of:

<ol>
    <li>Building a food recommender engine that suggests the recipes most similar to a given recipe</li>
    <li>Estimating the rating a user would give to a recipe they have never tasted</li>
</ol>

<b style="color:blue">Step 1. Load the datasets</b>

In [103]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

food_df=pd.read_csv('../datasets/food_recommender_datasets/1662574418893344.csv')
food_df.head()

Unnamed: 0,Food_ID,Name,C_Type,Veg_Non,Describe
0,1,summer squash salad,Healthy Food,veg,"white balsamic vinegar, lemon juice, lemon rin..."
1,2,chicken minced salad,Healthy Food,non-veg,"olive oil, chicken mince, garlic (minced), oni..."
2,3,sweet chilli almonds,Snack,veg,"almonds whole, egg white, curry leaves, salt, ..."
3,4,tricolour salad,Healthy Food,veg,"vinegar, honey/sugar, soy sauce, salt, garlic ..."
4,5,christmas cake,Dessert,veg,"christmas dry fruits (pre-soaked), orange zest..."


<h3>Preprocessing and Feature Extraction</h3><br/>
<b style="color:blue">Step 2. Check for missing values, and impute data or remove rows if necessary</b>

In [104]:
food_df.isna().any(axis=1).sum()

0
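The check above returns 0, so nothing needs to be imputed or dropped for this dataset. If rows with gaps did exist, the two options named in the step could look roughly like the sketch below. The toy frame and its values are made up for illustration; only the column names mirror `food_df`.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror food_df, the rows are invented.
toy_df = pd.DataFrame({
    "Name": ["salad", "cake", None],
    "Describe": ["lemon juice, salt", None, "sugar, flour"],
})

# Count missing values per column to see where the gaps are.
missing_per_column = toy_df.isna().sum()

# Option 1: remove every row that has at least one missing value.
dropped_df = toy_df.dropna()

# Option 2: impute text columns with an empty string so that later
# string operations (regex cleaning, TF-IDF) do not fail on NaN.
imputed_df = toy_df.fillna({"Name": "", "Describe": ""})

print(missing_per_column)
print(len(dropped_df), "rows left after dropping")
```

Dropping is the simpler choice when few rows are affected; imputing keeps the recipe in the catalogue at the cost of an empty text feature.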

<b style="color:blue">Step 3. Remove unwanted characters and convert all text to lowercase</b>

In [105]:
import re

food_df['Describe'] = food_df['Describe'].apply(lambda x:re.sub(r'[(),/]','',x).lower())
food_df['Name'] = food_df['Name'].apply(lambda x:x.lower())
food_df['C_Type'] = food_df['C_Type'].apply(lambda x:x.lower())
food_df.head()

Unnamed: 0,Food_ID,Name,C_Type,Veg_Non,Describe
0,1,summer squash salad,healthy food,veg,white balsamic vinegar lemon juice lemon rind ...
1,2,chicken minced salad,healthy food,non-veg,olive oil chicken mince garlic minced onion sa...
2,3,sweet chilli almonds,snack,veg,almonds whole egg white curry leaves salt suga...
3,4,tricolour salad,healthy food,veg,vinegar honeysugar soy sauce salt garlic clove...
4,5,christmas cake,dessert,veg,christmas dry fruits pre-soaked orange zest le...
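Note that deleting the `/` character outright joins neighbouring words: in the output above, `honey/sugar` has become the single token `honeysugar`. One possible alternative, sketched below, replaces the unwanted characters with a space and then collapses repeated whitespace. The helper name `clean_text` is hypothetical and not part of the original notebook.

```python
import re

def clean_text(text):
    """Replace brackets, commas and slashes with spaces, then lower-case."""
    text = re.sub(r"[(),/]", " ", text)   # a space keeps adjacent words apart
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip().lower()

cleaned = clean_text("vinegar, honey/sugar, soy sauce, garlic (minced)")
print(cleaned)
```

With this variant `honey` and `sugar` stay separate tokens, so the TF-IDF step can match them against other recipes that list either ingredient.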


<b style="color:blue">Step 4. Transform C_Type using one-hot encoding</b>

In [106]:
food_df = pd.get_dummies(food_df,columns=['C_Type'])


<b style="color:blue">Step 5. Transform Veg_Non to a binary variable </b>

In [107]:
food_df['Veg_Non'] = food_df['Veg_Non'].apply(lambda x: 1 if x == 'veg' else 0)

In [108]:
food_df.head()

Unnamed: 0,Food_ID,Name,Veg_Non,Describe,C_Type_ korean,C_Type_beverage,C_Type_chinese,C_Type_dessert,C_Type_french,C_Type_healthy food,C_Type_indian,C_Type_italian,C_Type_japanese,C_Type_korean,C_Type_mexican,C_Type_nepalese,C_Type_snack,C_Type_spanish,C_Type_thai,C_Type_vietnames
0,1,summer squash salad,1,white balsamic vinegar lemon juice lemon rind ...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,2,chicken minced salad,0,olive oil chicken mince garlic minced onion sa...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,3,sweet chilli almonds,1,almonds whole egg white curry leaves salt suga...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,4,tricolour salad,1,vinegar honeysugar soy sauce salt garlic clove...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,christmas cake,1,christmas dry fruits pre-soaked orange zest le...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
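The encoded frame above contains both `C_Type_ korean` and `C_Type_korean`, which suggests that some category labels carry a leading space and are therefore treated as a separate cuisine. A minimal sketch of how stripping whitespace before encoding avoids this, using a made-up three-row frame:

```python
import pandas as pd

# Hypothetical labels: the second one has a leading space.
toy_df = pd.DataFrame({"C_Type": ["Korean", " Korean", "Snack"]})

# Lower-casing alone keeps " korean" and "korean" as two categories.
raw_dummies = pd.get_dummies(toy_df["C_Type"].str.lower(), prefix="C_Type")

# Stripping whitespace first merges them into one column.
clean_dummies = pd.get_dummies(
    toy_df["C_Type"].str.strip().str.lower(), prefix="C_Type"
)

print(list(raw_dummies.columns))
print(list(clean_dummies.columns))
```

Applying the same `.str.strip()` to `food_df['C_Type']` in Step 3 would presumably remove the duplicate column, but that would also change the outputs shown in the rest of this notebook.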


<b style="color:blue">Step 6. Vectorise Describe using TF-IDF and check the new data shape</b>

In [109]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1),
                     min_df=0.0, stop_words='english')
tf_feats_describe = tf.fit_transform(food_df['Describe'])
tf_feats_describe.shape

(400, 1250)

<b style="color:blue">Step 7. Vectorise Name using TF-IDF and check the new data shape</b>

In [110]:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1),
                     min_df=0.0, stop_words='english')
tf_feats_name = tf.fit_transform(food_df['Name'])
tf_feats_name.shape

(400, 588)

<b style="color:blue">Step 8. Transform Describe and Name TF-IDF features using PCA</b>
Use n_components = 10 and 'from sklearn.decomposition import TruncatedSVD', which performs a PCA-like reduction and accepts sparse TF-IDF matrices directly.

In [111]:
from sklearn.decomposition import TruncatedSVD

n_pc = 10
pca_describe = TruncatedSVD(n_components=10)#10 principal components
tf_feats_describe_pca = pca_describe.fit_transform(tf_feats_describe)
tf_feats_desc_df = pd.DataFrame(tf_feats_describe_pca,columns=['pca_desc_%d'%(i) for i in range(n_pc)])
tf_feats_desc_df.head()

Unnamed: 0,pca_desc_0,pca_desc_1,pca_desc_2,pca_desc_3,pca_desc_4,pca_desc_5,pca_desc_6,pca_desc_7,pca_desc_8,pca_desc_9
0,0.283839,-0.061375,-0.267959,-0.263435,0.09893,-0.103976,0.076266,0.041666,-0.025774,-0.100784
1,0.390966,-0.112175,-0.416042,0.142686,-0.150914,0.038732,-0.011227,-0.036221,-0.127551,-0.025904
2,0.318374,0.104133,0.064408,-0.027618,0.061958,-0.109967,-0.142331,-0.166184,-0.149108,0.109038
3,0.234773,-0.093218,-0.206799,0.117222,-0.018748,0.008214,0.004309,0.046693,-0.078165,0.02052
4,0.089115,0.164629,-0.045145,-0.056056,0.045667,0.086528,0.15068,0.12534,0.051864,-0.037825


In [112]:
n_pc = 10
pca_name = TruncatedSVD(n_components=n_pc)  # 10 components
tf_feats_name_pca = pca_name.fit_transform(tf_feats_name)  # fit pca_name, not pca_describe
tf_feats_name_df = pd.DataFrame(tf_feats_name_pca,columns=['pca_name_%d'%(i) for i in range(n_pc)])
tf_feats_name_df.head()

Unnamed: 0,pca_name_0,pca_name_1,pca_name_2,pca_name_3,pca_name_4,pca_name_5,pca_name_6,pca_name_7,pca_name_8,pca_name_9
0,0.036623,0.019966,-0.032677,0.003674,-0.078604,0.403915,-0.274101,0.09447,0.184145,-0.138741
1,0.340199,-0.077651,-0.041336,-0.007138,-0.055736,0.330942,-0.272682,0.055859,0.152091,-0.155796
2,0.056959,0.022358,0.009651,0.001166,-0.03103,0.075919,0.041209,-0.113304,-0.045147,0.102087
3,0.033014,0.0071,-0.022282,0.004736,-0.077539,0.412557,-0.314333,0.079123,0.184503,-0.175123
4,0.006752,0.014371,-0.018692,0.449449,-0.440409,-0.188924,-0.135861,0.097986,-0.05472,-0.013005
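Ten components is a fixed choice here. One way to judge whether it is reasonable is to look at how much variance the components retain. The sketch below does this on a random sparse matrix that only stands in for the TF-IDF features; the data and the `random_state` values are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Stand-in for a sparse TF-IDF matrix: 50 "documents", 200 "terms",
# roughly 5% non-zero entries. The values are random, not real features.
rng = np.random.default_rng(0)
dense = rng.random((50, 200))
dense[dense < 0.95] = 0.0
X = csr_matrix(dense)

svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)

# Fraction of the total variance kept by the 10 components.
explained = svd.explained_variance_ratio_.sum()

print(X_reduced.shape)
print(f"variance retained: {explained:.3f}")
```

On the real features, the same attribute on `pca_describe` and `pca_name` would show how much information the 10-dimensional representation keeps.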


<b style="color:blue">Step 9. Merge all columns - pca_features, food_id, veg and c_types into a new feature dataframe</b>


In [113]:
food_feats_df = pd.concat([food_df.drop(columns=['Name','Describe']),tf_feats_desc_df,tf_feats_name_df],axis=1)
food_feats_df.head()

Unnamed: 0,Food_ID,Veg_Non,C_Type_ korean,C_Type_beverage,C_Type_chinese,C_Type_dessert,C_Type_french,C_Type_healthy food,C_Type_indian,C_Type_italian,...,pca_name_0,pca_name_1,pca_name_2,pca_name_3,pca_name_4,pca_name_5,pca_name_6,pca_name_7,pca_name_8,pca_name_9
0,1,1,0,0,0,0,0,1,0,0,...,0.036623,0.019966,-0.032677,0.003674,-0.078604,0.403915,-0.274101,0.09447,0.184145,-0.138741
1,2,0,0,0,0,0,0,1,0,0,...,0.340199,-0.077651,-0.041336,-0.007138,-0.055736,0.330942,-0.272682,0.055859,0.152091,-0.155796
2,3,1,0,0,0,0,0,0,0,0,...,0.056959,0.022358,0.009651,0.001166,-0.03103,0.075919,0.041209,-0.113304,-0.045147,0.102087
3,4,1,0,0,0,0,0,1,0,0,...,0.033014,0.0071,-0.022282,0.004736,-0.077539,0.412557,-0.314333,0.079123,0.184503,-0.175123
4,5,1,0,0,0,1,0,0,0,0,...,0.006752,0.014371,-0.018692,0.449449,-0.440409,-0.188924,-0.135861,0.097986,-0.05472,-0.013005
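A feature frame like this one is what the first task (suggesting similar recipes) needs. A minimal sketch of that lookup follows, assuming cosine similarity over all feature columns except `Food_ID`. The toy frame, the column names `feat_0` and `feat_1`, and the helper `top_similar` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature frame with the same layout idea as food_feats_df:
# an ID column followed by numeric feature columns.
toy_feats_df = pd.DataFrame({
    "Food_ID": [1, 2, 3, 4],
    "Veg_Non": [1, 0, 1, 1],
    "feat_0": [0.9, 0.8, 0.1, 0.85],
    "feat_1": [0.1, 0.2, 0.9, 0.15],
})

def top_similar(feats_df, food_id, n=2):
    """Return the n foods most similar to food_id, best match first."""
    ids = feats_df["Food_ID"].to_numpy()
    # The ID is a label, not a feature, so it is excluded from the vectors.
    X = feats_df.drop(columns=["Food_ID"]).to_numpy(dtype=float)
    sims = cosine_similarity(X)
    row = sims[np.flatnonzero(ids == food_id)[0]]
    scores = pd.Series(row, index=ids).drop(food_id)  # skip the item itself
    return scores.sort_values(ascending=False).head(n)

recs = top_similar(toy_feats_df, food_id=1, n=2)
print(recs)
```

Because the one-hot and binary columns are on a larger scale than the reduced TF-IDF columns, they tend to dominate the similarity; rescaling or weighting the feature groups is a design choice left open here.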


<b style="color:blue">Step 10. Create a User-Item Matrix</b>

In [114]:
rating_df = pd.read_csv('../datasets/food_recommender_datasets/ratings.csv')
rating_df.head()

Unnamed: 0,User_ID,Food_ID,Rating
0,1.0,88.0,4.0
1,1.0,46.0,3.0
2,1.0,24.0,5.0
3,1.0,25.0,4.0
4,2.0,49.0,1.0


In [115]:
avg_ratings_df = rating_df.groupby('Food_ID').agg(avg_rating=('Rating','mean'),num_ratings=('Rating','count')).reset_index()
avg_ratings_df = avg_ratings_df.sort_values(by=['num_ratings'],ascending=False).reset_index()
avg_ratings_df.drop(columns=['index'],inplace=True)
avg_ratings_df.head()

Unnamed: 0,Food_ID,avg_rating,num_ratings
0,163.0,3.571429,7
1,23.0,3.333333,6
2,5.0,6.5,6
3,49.0,5.5,6
4,65.0,4.8,5


In [116]:
import numpy as np

threshold_r = 5  # hard-coded minimum number of ratings for a food to count as popular
print(f'minimum number of ratings {threshold_r}')

minimum number of ratings 5
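The threshold in this cell is fixed by hand. If it should instead follow from the data, it could be derived from the distribution of `num_ratings`, for example as the median or an upper quantile. The counts below are made up for illustration.

```python
import pandas as pd

# Hypothetical rating counts per food.
toy_counts = pd.Series([7, 6, 6, 6, 5, 2, 1, 1, 1], name="num_ratings")

# Median: half of the foods have at least this many ratings.
median_threshold = toy_counts.median()

# 75th percentile: a stricter cut that keeps roughly the top quarter.
quantile_threshold = toy_counts.quantile(0.75)

popular = toy_counts[toy_counts >= quantile_threshold]

print(median_threshold, quantile_threshold, len(popular))
```

A higher threshold gives more reliable averages per food but leaves fewer foods and users in the user-item matrix.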


In [117]:
top_rated_movies_df = avg_ratings_df[avg_ratings_df['num_ratings']>=threshold_r].copy()  # copy so the assignment below does not write to a view
top_rated_movies_df['Food_ID'] = top_rated_movies_df['Food_ID'].astype('int')
top_rated_movies_df.head()

Unnamed: 0,Food_ID,avg_rating,num_ratings
0,163,3.571429,7
1,23,3.333333,6
2,5,6.5,6
3,49,5.5,6
4,65,4.8,5


In [118]:
print('Number of popular foods:', len(top_rated_movies_df))

Number of popular foods: 12


In [119]:
#len(rating_df)

In [124]:
merged_rating_df = pd.merge(rating_df,top_rated_movies_df,on='Food_ID',how='inner')
merged_rating_df['Food_ID'] = merged_rating_df['Food_ID'].astype('int')
merged_rating_df.head()

Unnamed: 0,User_ID,Food_ID,Rating,avg_rating,num_ratings
0,1.0,46,3.0,5.4,5
1,3.0,46,2.0,5.4,5
2,20.0,46,6.0,5.4,5
3,69.0,46,9.0,5.4,5
4,97.0,46,7.0,5.4,5


In [121]:
#len(merged_rating_df)

In [148]:
pivot_df = merged_rating_df.pivot_table(index='User_ID',columns='Food_ID',values='Rating',aggfunc='mean',fill_value=0).reset_index()
pivot_df.head()

Food_ID,User_ID,5,7,18,21,22,23,46,47,49,53,65,163
0,1.0,0,0,0,0,0,0,3,0,0.0,0,0,0
1,2.0,0,0,0,0,0,0,0,0,1.0,0,0,0
2,3.0,0,0,0,0,0,0,2,0,0.0,0,3,0
3,4.0,0,0,0,1,0,0,0,0,0.0,0,0,0
4,6.0,0,0,0,0,5,0,0,0,0.0,0,0,0
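Because of `reset_index()`, `User_ID` is an ordinary column of `pivot_df`, so any distance computed on the whole frame treats the ID as if it were a rating. The sketch below shows the effect on a made-up example in which users 1 and 90 gave identical ratings; keeping `User_ID` as the index is one way to avoid it.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: users 1 and 90 rate both foods identically.
toy_ratings = pd.DataFrame({
    "User_ID": [1, 1, 2, 90, 90],
    "Food_ID": [5, 7, 7, 5, 7],
    "Rating": [4, 2, 5, 4, 2],
})

# User_ID stays in the index, so only ratings are used as features.
user_item = toy_ratings.pivot_table(index="User_ID", columns="Food_ID",
                                    values="Rating", aggfunc="mean",
                                    fill_value=0)

sim_ratings_only = cosine_similarity(user_item.to_numpy())

# After reset_index() the ID becomes the first feature column.
sim_with_id = cosine_similarity(user_item.reset_index().to_numpy())

print(sim_ratings_only[0, 2])  # user 1 vs user 90, ratings only
print(sim_with_id[0, 2])       # same pair with User_ID as a feature
```

With ratings only, the two users are recognised as identical; once the ID is included, the large value 90 dominates the vector and the similarity drops sharply.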


<b style="color:blue">Step 11. Build User-based Similarity matrix</b>

In [146]:
from sklearn.metrics.pairwise import cosine_similarity 

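# Note: pivot_df still holds User_ID as a column (from reset_index), so it enters the similarity as a feature
user_cosine_sim_matrix = cosine_similarity(pivot_df,pivot_df)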
user_cosine_sim_matrix.shape

(51, 51)

In [147]:
user_cosine_sim_matrix

array([[1.        , 0.28284271, 0.60677988, ..., 0.38369165, 0.3153588 ,
        0.31528023],
       [0.28284271, 1.        , 0.57207755, ..., 0.89210726, 0.8919694 ,
        0.90525847],
       [0.60677988, 0.57207755, 1.        , ..., 0.6686346 , 0.63784459,
        0.68277455],
       ...,
       [0.38369165, 0.89210726, 0.6686346 , ..., 1.        , 0.99466547,
        0.99441763],
       [0.3153588 , 0.8919694 , 0.63784459, ..., 0.99466547, 1.        ,
        0.99426396],
       [0.31528023, 0.90525847, 0.68277455, ..., 0.99441763, 0.99426396,
        1.        ]])
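A user-user similarity matrix like this can serve the second task, estimating a rating for a recipe the user has not tried. One common approach, sketched here as an assumption rather than as the method this case study prescribes, is a similarity-weighted average of the ratings other users gave to that recipe. The toy matrix and the helper `predict_rating` are hypothetical; a 0 means "not rated", as in `pivot_df`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item matrix; user 1 has not rated food 30.
user_item = pd.DataFrame(
    [[5, 3, 0],
     [4, 3, 4],
     [1, 5, 2]],
    index=pd.Index([1, 2, 3], name="User_ID"),
    columns=pd.Index([10, 20, 30], name="Food_ID"),
)

def predict_rating(user_item, user_id, food_id):
    """Similarity-weighted mean of other users' ratings for food_id."""
    sim_matrix = cosine_similarity(user_item.to_numpy())
    sims = pd.Series(sim_matrix[user_item.index.get_loc(user_id)],
                     index=user_item.index)
    ratings = user_item[food_id]
    # Use only other users who actually rated this food.
    mask = (ratings > 0) & (ratings.index != user_id)
    if not mask.any() or sims[mask].sum() == 0:
        return float("nan")  # no usable neighbours
    return float(np.average(ratings[mask], weights=sims[mask]))

estimate = predict_rating(user_item, user_id=1, food_id=30)
print(round(estimate, 2))
```

The estimate lands between the two neighbours' ratings and closer to that of user 2, who is the more similar of the two. Variants such as restricting to the k nearest neighbours or mean-centring each user's ratings are left out of this sketch.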