# Item Collaborative Recommender Systems

Reference: GA class notebook by Riley Dallas
**This notebook answers the question: given a person’s preference for one recipe, can we recommend other recipes they might enjoy?**

In [1]:
#
import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances, cosine_distances, cosine_similarity
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'    # Workaround: allow duplicate OpenMP runtimes to load, which avoids a dead-kernel error on some setups.
import time

## Load `organized_recipes.csv` and `cleaned_reviews.csv`

In [2]:
t0=time.time()
recipes = pd.read_csv('organized_recipes.csv')
recipes.head(3)

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients,calories,total_fat,sugar,sodium,protein,sat_fat,carbs
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7,51.5,0.0,13.0,0.0,2.0,0.0,4.0
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6,173.4,18.0,0.0,17.0,22.0,35.0,1.0
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13,269.8,22.0,32.0,48.0,39.0,27.0,5.0


In [3]:
ratings = pd.read_csv('cleaned_reviews.csv')
ratings.head(3)

Unnamed: 0,user_id,recipe_id,date,rating,review
0,38094,40893,2003-02-17,4,Great with a salad. Cooked on top of stove for...
1,1293707,40893,2011-12-21,5,"So simple, so delicious! Great for chilly fall..."
2,8937,44394,2002-12-01,4,This worked very well and is EASY. I used not...


## Drop unnecessary columns
---

We won't need the `date` or `review` columns from `ratings`, and we only need the `name` and `id` columns from `recipes`.

In [4]:
ratings = ratings[['user_id', 'recipe_id', 'rating']]

In [5]:
recipes = recipes[['id','name']]

In [6]:
recipes.shape,  ratings.shape

((191481, 2), (1071351, 3))

In [7]:
df = pd.merge(ratings, recipes, how='inner', left_on='recipe_id', right_on='id').drop(columns='id')

In [8]:
print(df.shape)
df.head()

(885798, 4)


Unnamed: 0,user_id,recipe_id,rating,name
0,76535,134728,4,kfc honey bbq strips
1,273745,134728,5,kfc honey bbq strips
2,353911,134728,5,kfc honey bbq strips
3,190375,134728,5,kfc honey bbq strips
4,255338,134728,5,kfc honey bbq strips


In [9]:
df['user_id'].value_counts().head(20)

424680     6740
37449      4914
383346     4263
169430     3583
128473     3445
89831      2882
58104      2761
199848     2720
133174     2697
226863     2592
305531     2591
498271     2425
369715     2246
4470       2234
286566     2029
1072593    2000
176615     1995
107583     1974
80353      1950
166642     1889
Name: user_id, dtype: int64

In [10]:
review_count = df.groupby('recipe_id').count()
review_count

recipe_id,user_id,rating,name
40,9,9,9
45,2,2,2
46,2,2,2
49,18,18,18
58,7,7,7
...,...,...,...
537319,1,1,1
537458,1,1,1
537459,1,1,1
537485,1,1,1


In [11]:
# Keep only recipes that have between 5 and 8 ratings (inclusive)
selected_recipes = review_count[(review_count['rating'] > 4) & (review_count['rating'] < 9)].index
selected_recipes

Int64Index([    58,     91,     92,     93,    136,    139,    170,    210,
               224,    240,
            ...
            526222, 530478, 531253, 532736, 532740, 533699, 534900, 535779,
            536119, 536678],
           dtype='int64', name='recipe_id', length=23095)

In [12]:
df = df.set_index('recipe_id').loc[selected_recipes,:]

In [13]:
df.reset_index(inplace=True)

In [14]:
print(df.shape)
df.head()

(141512, 4)


Unnamed: 0,recipe_id,user_id,rating,name
0,58,437767,3,low fat burgundy beef vegetable stew
1,58,162826,5,low fat burgundy beef vegetable stew
2,58,5060,5,low fat burgundy beef vegetable stew
3,58,1060485,3,low fat burgundy beef vegetable stew
4,58,1279229,5,low fat burgundy beef vegetable stew


In [15]:
# Free the memory held by the original DataFrames
del ratings
ratings = pd.DataFrame() 

del recipes
recipes = pd.DataFrame()

In [16]:
t1=time.time()-t0
t1

11.103318929672241

## Create pivot table
---

Because we're creating an item-based collaborative recommender (where the items are recipes), we'll set up our pivot table as follows:
1. The `name` will be the index
2. The `user_id` will be the columns
3. The `rating` will be the values
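
To make the layout concrete, here is a minimal sketch on hypothetical toy data (the recipe names, user ids, and the variables `toy` and `toy_pivot` are made up for illustration and are not part of the dataset). It shows how each recipe becomes a row, each user becomes a column, and unrated combinations show up as `NaN`.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as `df`
toy = pd.DataFrame({
    'name':    ['chili', 'chili', 'pizza', 'pizza', 'stew'],
    'user_id': [1, 2, 1, 3, 2],
    'rating':  [5, 4, 3, 5, 4],
})

# One row per recipe, one column per user; unrated pairs become NaN
toy_pivot = pd.pivot_table(toy, index='name', columns='user_id', values='rating')
print(toy_pivot)
```

Note that `pd.pivot_table` averages by default, so if the same user rated two recipes that share a name, the cell holds the mean of those ratings.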


In [17]:
pivot = pd.pivot_table(df, index='name', columns='user_id', values='rating')

pivot.head()

user_id,1533,1535,1634,1676,1792,1891,1962,2046,2054,2059,...,2002361642,2002363091,2002363779,2002364091,2002364382,2002368192,2002368412,2002368953,2002369279,2002371843
1 000 artichoke hearts,,,,,,,,,,,...,,,,,,,,,,
1 2 3 jambalaya,,,,,,,,,,,...,,,,,,,,,,
1 asian noodle salad,,,,,,,,,,,...,,,,,,,,,,
1 favorite chinese steamed whole fish by sy,,,,,,,,,,,...,,,,,,,,,,
1 gram fat pumpkin spice muffins low fat,,,,,,,,,,,...,,,,,,,,,,


In [18]:
pivot.shape

(23079, 37737)

In [19]:
t2=time.time()-t0
t2

69.66885495185852

## Create sparse matrix
---

We'll calculate the cosine distance between every pair of recipes using the `pairwise_distances` function. Before that, we convert the pivot table to a sparse matrix with `scipy`'s `sparse` module, filling the missing ratings with 0:
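
The dense pivot table is the slowest and most memory-hungry step so far. One possible alternative, sketched here on hypothetical toy data, builds the sparse matrix straight from the long-format ratings. The variables `toy`, `toy_sparse`, and `dense_route` are illustrative only. The sketch assumes each (`name`, `user_id`) pair occurs at most once: `csr_matrix` sums duplicate entries, whereas `pd.pivot_table` averages them, so the two routes can differ when duplicates exist.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy ratings in the same long format as `df`
toy = pd.DataFrame({
    'name':    ['chili', 'chili', 'pizza', 'pizza', 'stew'],
    'user_id': [1, 2, 1, 3, 2],
    'rating':  [5.0, 4.0, 3.0, 5.0, 4.0],
})

# Map recipe names and user ids to consecutive integer positions.
# sort=True keeps the same ordering as the sorted pivot index and columns.
row_codes, row_labels = pd.factorize(toy['name'], sort=True)
col_codes, col_labels = pd.factorize(toy['user_id'], sort=True)

# Build the recipe-by-user matrix from (value, (row, column)) triples
toy_sparse = sparse.csr_matrix(
    (toy['rating'].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
)

# The route used in this notebook: dense pivot table, then sparse conversion
dense_route = sparse.csr_matrix(
    pd.pivot_table(toy, index='name', columns='user_id', values='rating').fillna(0)
)

print(np.array_equal(toy_sparse.toarray(), dense_route.toarray()))
```

With this approach, `row_labels` plays the role of `pivot.index` when labelling the similarity matrix later.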


In [20]:
sparse_pivot = sparse.csr_matrix(pivot.fillna(0))
print(sparse_pivot)

  (0, 72)	5.0
  (0, 2407)	3.0
  (0, 3538)	5.0
  (0, 6774)	5.0
  (0, 7450)	5.0
  (0, 8494)	5.0
  (0, 25630)	5.0
  (0, 36215)	5.0
  (1, 287)	4.0
  (1, 1150)	4.0
  (1, 5088)	5.0
  (1, 14120)	5.0
  (1, 15235)	5.0
  (1, 32119)	5.0
  (2, 2754)	3.0
  (2, 2869)	5.0
  (2, 3223)	4.0
  (2, 7743)	4.0
  (2, 14727)	5.0
  (3, 4630)	5.0
  (3, 7070)	5.0
  (3, 19533)	5.0
  (3, 22772)	5.0
  (3, 24045)	4.0
  (4, 11704)	5.0
  :	:
  (23074, 15606)	4.0
  (23075, 1128)	5.0
  (23075, 4884)	5.0
  (23075, 6174)	5.0
  (23075, 6284)	5.0
  (23075, 7394)	5.0
  (23075, 9696)	5.0
  (23075, 21189)	5.0
  (23076, 401)	4.0
  (23076, 3788)	5.0
  (23076, 6863)	5.0
  (23076, 9954)	4.0
  (23076, 20353)	5.0
  (23076, 24324)	5.0
  (23077, 6386)	5.0
  (23077, 29081)	4.0
  (23077, 29546)	5.0
  (23077, 31893)	4.0
  (23077, 36524)	3.0
  (23077, 37370)	4.0
  (23078, 703)	5.0
  (23078, 4772)	5.0
  (23078, 8339)	5.0
  (23078, 14081)	5.0
  (23078, 21189)	5.0


## Calculate cosine similarity
---

`sklearn` has a built-in `pairwise_distances` function that we can use for our recommender. It returns a square matrix that compares every recipe with every other recipe in the dataset.
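
As a small worked example, the sketch below uses a hypothetical 3x3 ratings matrix (`toy_ratings` and the other variable names are illustrative) to compute cosine similarity by hand and compare it with `sklearn`'s results. It also shows that two recipes with no raters in common end up with a similarity of 0, which is why most entries in the real matrix are at a distance of 1.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Hypothetical ratings: rows are recipes, columns are users, 0 means "not rated"
toy_ratings = np.array([
    [5.0, 4.0, 0.0],   # recipe A
    [3.0, 0.0, 5.0],   # recipe B
    [0.0, 4.0, 0.0],   # recipe C
])

# Cosine similarity by hand: dot product divided by the product of the norms
norms = np.linalg.norm(toy_ratings, axis=1)
manual_sim = (toy_ratings @ toy_ratings.T) / np.outer(norms, norms)

toy_sim = cosine_similarity(toy_ratings)
toy_dist = pairwise_distances(toy_ratings, metric='cosine')

print(np.allclose(manual_sim, toy_sim))       # checks the manual formula against sklearn
print(np.allclose(toy_dist, 1.0 - toy_sim))   # checks that distance is 1 - similarity
print(toy_sim[1, 2])                          # recipes B and C share no raters
```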

In [21]:
# Note that a distance of 1 is a similarity of 0.
dists = pairwise_distances(sparse_pivot, metric='cosine')
# dists = cosine_distances(sparse_pivot)                         # Identical but more concise

dists

array([[0., 1., 1., ..., 1., 1., 1.],
       [1., 0., 1., ..., 1., 1., 1.],
       [1., 1., 0., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 0., 1., 1.],
       [1., 1., 1., ..., 1., 0., 1.],
       [1., 1., 1., ..., 1., 1., 0.]])

In [22]:
np.round(dists,3)

array([[0., 1., 1., ..., 1., 1., 1.],
       [1., 0., 1., ..., 1., 1., 1.],
       [1., 1., 0., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 0., 1., 1.],
       [1., 1., 1., ..., 1., 0., 1.],
       [1., 1., 1., ..., 1., 1., 0.]])

In [23]:
# Here, similarity is 1 - distance.
similarities = cosine_similarity(sparse_pivot)

In [24]:
# # Verify they are the same

# np.all(np.isclose((1.0 - dists), similarities))

## Create similarities DataFrame
---

At this point, we essentially have a recommender. We'll load it into a `pandas` DataFrame for readability. 

You'll notice that each recipe has a similarity of 1 with itself (along the diagonal).

In [25]:
recommender_df = pd.DataFrame(similarities, 
                              columns=pivot.index, 
                              index=pivot.index)
recommender_df.head()

name,1 000 artichoke hearts,1 2 3 jambalaya,1 asian noodle salad,1 favorite chinese steamed whole fish by sy,1 gram fat pumpkin spice muffins low fat,1 hour smoky ham and lentil soup,1 minute stromboli,1 squash dressing,10 bean soup,10 layer poor man s lasagna casserole,...,zucchini with bacon cheese,zucchini with chickpea and mushroom stuffing,zucchini with salsa,zucchini yellow squash stir fry,zuccuash bake from nimz territory,zuke soup,zulu cabbage,zuppa di broccoli broccoli soup,zwiebelkuchen southwest german onion cake,zydeco ya ya deviled eggs
1 000 artichoke hearts,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1 2 3 jambalaya,0.0,1.0,0.0,0.0,0.0,0.142134,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1 asian noodle salad,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1 favorite chinese steamed whole fish by sy,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1 gram fat pumpkin spice muffins low fat,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
recommender_df.shape

(23079, 23079)

In [27]:
t3=time.time()-t0
t3

250.87963700294495

In [28]:
#recommender_df.to_csv('recommender_df.csv', index=False)  # Save this file for quick access

## Evaluate recommender performance
---

Now comes the fun part! Let's check out a few recipes to see if the recommender aligns with our intuition. In the cell below we'll do the following:
1. Create a search term
2. Use that to find all titles matching the search query
3. For each title, print the ten most similar recipes

The average rating and the number of ratings would have to come from the original ratings in `df`, because `recommender_df` holds similarities rather than ratings. Those lines are left commented out in the cell.
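
The lookup described above could be wrapped in two small helpers. This is only a sketch: `top_similar`, `search_names`, and the toy matrix `toy_sim_df` are hypothetical names standing in for the notebook's `recommender_df`. It drops the queried recipe by label rather than by position, so the result does not depend on the recipe sorting first among ties, and it passes `regex=False` so the query is matched as a literal substring.

```python
import pandas as pd

def top_similar(sim_df, name, n=10):
    """Return the n recipes most similar to `name`, excluding `name` itself."""
    scores = sim_df[name].drop(labels=name)   # remove the self-similarity by label
    return scores.sort_values(ascending=False).head(n)

def search_names(sim_df, query):
    """Return the index labels that contain `query` as a literal substring."""
    return sim_df.index[sim_df.index.str.contains(query, regex=False)]

# Hypothetical 3x3 similarity matrix standing in for `recommender_df`
labels = ['fried chicken', 'baked chicken', 'sugar cookies']
toy_sim_df = pd.DataFrame(
    [[1.0, 0.6, 0.1],
     [0.6, 1.0, 0.0],
     [0.1, 0.0, 1.0]],
    index=labels, columns=labels,
)

for match in search_names(toy_sim_df, 'chicken'):
    print(match)
    print(top_similar(toy_sim_df, match, n=2))
    print('')
```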

In [29]:
# User input recipe name
query = 'besan  chickpea flour  pastry'
names = recommender_df[recommender_df.index.str.contains(query)].index

for name in names:
    print(name)
#     print('Average rating', recommender_df.loc[name, :].mean())
#     print('Number of ratings', recommender_df.T[name].count())
#     print('')
#     print('10 closest recipes')
    print(recommender_df[name].sort_values(ascending=False)[1:11])
    print('')
    print('*******************************************************************************************')
    print('')

besan  chickpea flour  pastry
name
vegan fried chicken vegan chicken nuggets  gluten free    0.256495
good old fashioned english chip shop style chips          0.253967
garam masala green beans                                  0.247963
gluten free cornmeal muffins                              0.247963
vegan speedy alfredo style sauce  no tofu                 0.240229
havarti and sun dried tomato cheesecake                   0.238149
southwestern sugar cookies                                0.238149
palm springs bagel                                        0.229416
instant cookies                                           0.219942
cinnamon sugar rattle snakes  i mean snacks               0.216007
Name: besan  chickpea flour  pastry, dtype: float64

*******************************************************************************************



In [36]:
q = 'fried chicken'
names = recommender_df[recommender_df.index.str.contains(q)].index

for name in names:
#     print(name)
#     print('Average rating', recommender_df.loc[name, :].mean())
#     print('Number of ratings', recommender_df.T[name].count())
#     print('')
#     print('10 closest recipes')
    print(recommender_df[name].sort_values(ascending=False)[1:11])
    print('')
    print('*******************************************************************************************')
    print('')

name
deep in the heart of texas bbq rub    0.432338
spicy egg salad sandwiches            0.406579
cheesy bacon stuffed burgers          0.400000
popover pizza                         0.400000
hawaiian ham and swiss sandwich       0.400000
three onion crossroads potatoes       0.400000
thai chili mayonnaise                 0.376622
grilled new yorker                    0.376622
fruity ice cubes                      0.365148
italian style lasagna                 0.365148
Name: 4th of july fried chicken, dtype: float64

*******************************************************************************************

name
pollo en crema                             0.380129
osman s weiner schnitzel                   0.290228
basic biscuits                             0.228571
chicken panang                             0.219133
baked mushrooms   a vegetable side dish    0.217855
ginger lime salmon                         0.214152
aunt mag s spoonburgers                    0.212675
beef teriyaki 

name
green chile baked chicken                         0.391031
cheesy green chile corn                           0.262838
oven fried chicken breasts with new potatoes      0.253823
mom s macaroni   cheese                           0.251019
double crunch honey garlic chicken                0.245677
linda s veggie and cream cheese stuffed celery    0.241888
the blt omelet                                    0.241888
hot cheese hors d oeuvres                         0.241888
cornflake ranch chicken fingers or breasts        0.241888
potato and beef pie                               0.241888
Name: garlic fried chicken breasts, dtype: float64

*******************************************************************************************

name
herbed zucchini soup                       0.298169
kicked up carrot salad                     0.269435
lemon turkey breast                        0.259099
asian chicken pitas                        0.259099
mojo chicken                               0.25

name
afghani lamb and rice dish                 0.317408
curry butter chicken legs on the barbie    0.228104
cocktail turkey meatballs                  0.228104
broccoli with browned butter               0.219806
roast lime chicken                         0.217597
baked chicken breasts and rice             0.217597
porky pockets  rsc                         0.216517
kit kat bars ii                            0.213371
potato egg pie                             0.210359
kumquat curry with shrimp                  0.207471
Name: oven fried chicken monterey, dtype: float64

*******************************************************************************************

name
chicken and noodle stew                                  0.203535
spanakopita  spinach and cheese pie                      0.203535
joe s stone crab garlic creamed spinachby todd wilbur    0.203535
warm lobster rolls                                       0.195480
grilled german potato salad                              0.195

In [31]:
query2 = 'fried chicken'
names = recommender_df[recommender_df.index.str.contains(query2)].index

In [32]:
pd.DataFrame(recommender_df[[name]].sort_values(by=name, ascending=False).head(6))

name,vegan fried chicken vegan chicken nuggets gluten free
vegan fried chicken vegan chicken nuggets gluten free,1.0
besan chickpea flour pastry,0.256495
good old fashioned english chip shop style chips,0.247537
garam masala green beans,0.241684
vegan speedy alfredo style sauce no tofu,0.234146
southwestern sugar cookies,0.232119


In [33]:
recommender_df[[name]].sort_values(by=name, ascending=False).head(6).set_axis(['Value'], axis=1)

name,Value
vegan fried chicken vegan chicken nuggets gluten free,1.0
besan chickpea flour pastry,0.256495
good old fashioned english chip shop style chips,0.247537
garam masala green beans,0.241684
vegan speedy alfredo style sauce no tofu,0.234146
southwestern sugar cookies,0.232119


In [34]:
chosen_recipe2 = pd.DataFrame()
for name in names:
    chosen_recipe2 = pd.concat([chosen_recipe2, recommender_df[[name]].sort_values(by=name, ascending=False).head(6).set_axis(['Value'], axis=1)])

In [35]:
# Display each recipe whose name contains 'fried chicken', followed by its most similar recipes
chosen_recipe2

name,Value
4th of july fried chicken,1.000000
deep in the heart of texas bbq rub,0.432338
spicy egg salad sandwiches,0.406579
hawaiian ham and swiss sandwich,0.400000
three onion crossroads potatoes,0.400000
...,...
oven fried chicken ww,0.547723
nif s baked chicken fingers,0.507093
diet chocolate sauce,0.507093
mr howell s left bank french toast,0.447214
