# Recommending Movies Using Item-Item Collaborative Filtering

## Project Goals
The goal of this project is to recommend new movies to users based on their existing movie ratings. I used item-item collaborative filtering with cosine similarity to predict each user's ratings for unseen movies, and recommended the movies with the highest predicted ratings.

# 1. Load movie rating data

First, I loaded the data into a pandas DataFrame. There are 100,000 ratings of 1,682 movies by 943 users.

In [2]:
import numpy as np
import pandas as pd
#from scipy import sparse
#from time import time

In [27]:
## Load data to pandas
df_ratings_contents = pd.read_table("movierating.data", names=["user_name", "movie_name", "rating", "timestamp"])

In [28]:
# data dimension
df_ratings_contents.shape

(100000, 4)

In [29]:
# 943 users
df_ratings_contents.user_name.nunique()

943

In [30]:
# 1682 movies
df_ratings_contents.movie_name.nunique()

1682

In [31]:
df_ratings_contents.head(3)

|   | user_name | movie_name | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |


In [32]:
df_ratings_contents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
user_name     100000 non-null int64
movie_name    100000 non-null int64
rating        100000 non-null int64
timestamp     100000 non-null int64
dtypes: int64(4)
memory usage: 3.1 MB


In [33]:
# user_name and movie_name start at 1 and end at 943 and 1682, respectively
df_ratings_contents.describe()

|   | user_name | movie_name | rating | timestamp |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 462.48475 | 425.53013 | 3.52986 | 883528900.0 |
| std | 266.61442 | 330.798356 | 1.125674 | 5343856.0 |
| min | 1.0 | 1.0 | 1.0 | 874724700.0 |
| 25% | 254.0 | 175.0 | 3.0 | 879448700.0 |
| 50% | 447.0 | 322.0 | 4.0 | 882826900.0 |
| 75% | 682.0 | 631.0 | 4.0 | 888260000.0 |
| max | 943.0 | 1682.0 | 5.0 | 893286600.0 |


# 2. Convert movie rating data into user-item rating matrix

I used pivot_table to convert the long-form rating data into a user-item utility matrix. Movies that a user has not rated are filled with 0. The result is a 943 x 1682 rating matrix.

In [42]:
# transform long-form to wide-form matrix, fill 0 when cell is NaN
rating_mat = pd.pivot_table(data=df_ratings_contents, values='rating', 
                            index='user_name', columns='movie_name', fill_value=0)     

In [45]:
rating_mat.shape

(943, 1682)

In [43]:
rating_mat.head(3)

| user_name / movie_name | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 3 | 4 | 3 | 3 | 5 | 4 | 1 | 5 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
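
To make the long-to-wide step concrete, here is a minimal sketch on a made-up four-row data set. The `toy` frame and its values are hypothetical and are not taken from the rating file above; only the `pd.pivot_table` call mirrors the notebook.

```python
import pandas as pd

# Hypothetical toy ratings in long form (not taken from the data set above)
toy = pd.DataFrame({
    "user_name":  [1, 1, 2, 3],
    "movie_name": [10, 20, 10, 30],
    "rating":     [5, 3, 4, 2],
})

# One row per user, one column per movie; unrated cells become 0
toy_mat = pd.pivot_table(data=toy, values="rating",
                         index="user_name", columns="movie_name", fill_value=0)
print(toy_mat)
```

Two details are worth keeping in mind: `pivot_table` averages duplicate (user, movie) pairs by default, and the 0 fill value is later treated as an ordinary number by the cosine similarity, not as a missing rating.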


# 3. Calculate item-item cosine similarity score

I calculated the cosine similarity between every pair of movies from the user-item rating matrix. The result is a 1682 x 1682 item-item similarity matrix.

In [48]:
from sklearn.metrics.pairwise import cosine_similarity

# Item-item similarity matrix using cosine similarity on ratings
# cosine_similarity compares rows, so transpose the utility matrix to item x user
item_sim_mat = cosine_similarity(rating_mat.T)

In [49]:
item_sim_mat.shape  

(1682, 1682)

In [60]:
# peek at the top-left corner of the similarity matrix
item_sim_mat[0:5, 0:5]

array([[1.        , 0.40238218, 0.33024479, 0.45493792, 0.28671351],
       [0.40238218, 1.        , 0.27306918, 0.50257077, 0.31883618],
       [0.33024479, 0.27306918, 1.        , 0.32486639, 0.21295656],
       [0.45493792, 0.50257077, 0.32486639, 1.        , 0.33423948],
       [0.28671351, 0.31883618, 0.21295656, 0.33423948, 1.        ]])
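
As an illustration of what the similarity scores mean, the sketch below computes column-wise cosine similarity with plain NumPy on a made-up 3 x 2 matrix. The helper `cosine_sim_columns` and the matrix `R` are hypothetical names for this example; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

# Hypothetical 3-user x 2-item rating matrix (0 = not rated)
R = np.array([[5, 4],
              [0, 2],
              [3, 0]], dtype=float)

def cosine_sim_columns(mat):
    """Cosine similarity between every pair of columns of `mat`."""
    norms = np.linalg.norm(mat, axis=0)   # length of each item vector
    norms[norms == 0] = 1.0               # avoid dividing by zero for items nobody rated
    unit = mat / norms                    # scale each column to unit length
    return unit.T @ unit                  # dot products of the unit vectors

S = cosine_sim_columns(R)
print(S)
```

Each entry is the dot product of two item vectors divided by the product of their lengths, so an item always has similarity 1 with itself and the matrix is symmetric.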

# 4. Find the top-75 most similar movies for each movie based on similarity scores

I sorted the similarity scores and identified the 75 most similar movies for each movie. These nearest neighbors are used to calculate the similarity-weighted ratings in Step 5.

In [84]:
# sort each row of the matrix and return the column INDICES ordered from least to most similar
least_to_most_sim_indexes = np.argsort(item_sim_mat, axis=1)

# take the indices of the 75 most similar movies: positions -76 to -2 of each sorted row
# (position -1 is the movie itself, with similarity 1)
neighborhood_size = 75

neighborhoods_loc = least_to_most_sim_indexes[:, -(neighborhood_size+1):-1:1]

In [85]:
neighborhoods_loc[0:2,:]

array([[  98,  844,  273,   87,    7,  185,   69,  190,  283,  596,  160,
         495,  257,    8,  392,  142,  233,  275,  131,  567,  152,  264,
          88,   96,  281,  215,  195,  317,  175,  545,  143,  182,   21,
         470,  293,  234,   63,  201,   55,  227,  126,   70,   95,   81,
         587,  110,  422,  194,  167,  124,  172,   24,  209,   97,   14,
         741,   78,   94,   68,  203,  256,   27,  171,  117,    6,  173,
         236,   99,  221,  150,  404,  116,  120,  180,   49],
       [ 422,  684,   63,  195,   87,   70,  390,  392,  185,   11,   79,
         719,  754,   30,  187,  469,  180,  182,  801,   55,  366,   10,
         167,  731,   68,   32,  398,   27,   21,   88,  172,  201,  203,
          53,    3,  175,  183,   93,  678,  231,  745, 1227,  264,  171,
          94,  143,  173,  430,  209,  229,   78,  553,  227,  226,  683,
         238,  228,  194,   81,   67,   37,  565,  577,   28,  567,   95,
         549,  230,  225,  575,   61,  402,  384,

In [86]:
neighborhoods_loc.shape, type(neighborhoods_loc)

((1682, 75), numpy.ndarray)
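
The slicing trick is easier to follow on a small case. This is a minimal sketch with a made-up 4 x 4 similarity matrix and `k = 2`; the values and the names `sim`, `k`, `order` and `top_k` are hypothetical.

```python
import numpy as np

# Hypothetical 4x4 item-item similarity matrix (symmetric, 1.0 on the diagonal)
sim = np.array([[1.0, 0.2, 0.9, 0.5],
                [0.2, 1.0, 0.4, 0.7],
                [0.9, 0.4, 1.0, 0.1],
                [0.5, 0.7, 0.1, 1.0]])
k = 2

order = np.argsort(sim, axis=1)   # column indices per row, least to most similar
top_k = order[:, -(k + 1):-1]     # drop the last column (the item itself), keep the k before it
print(top_k)
# [[3 2]
#  [2 3]
#  [1 0]
#  [0 1]]
```

The slice assumes that each item sorts last in its own row. That holds when the self-similarity of 1.0 is the unique maximum, but it can fail if another item has a similarity of exactly 1.0 or if an item has no ratings at all.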

# 5. Predict ratings of unseen movies for a user

In this section, I predicted the ratings of unseen movies for user_id=100.

In [105]:
user_id = 100

In [106]:
# find the index location for user_id=100
user_id_list = list(rating_mat.index)
user_id_loc = user_id_list.index(user_id)
user_id_loc

99

In [107]:
# rating_mat.loc[user_id] returns the row of ratings for user_id=100
# .to_numpy().nonzero() returns a tuple holding one array of positions where the rating is non-zero,
# i.e. the column positions of the movies this user has rated; [0] unpacks that array
# (Series.nonzero() is not available in pandas 1.0 and later, so convert to a NumPy array first)
# user_id=100 has rated 59 movies
itemloc_rated_by_this_user = rating_mat.loc[user_id].to_numpy().nonzero()[0]

# these are column INDEX positions, NOT movie names !!!
itemloc_rated_by_this_user, len(itemloc_rated_by_this_user)

(array([ 257,  265,  267,  268,  269,  270,  271,  285,  287,  288,  291,
         293,  299,  301,  309,  312,  314,  315,  320,  322,  325,  327,
         332,  339,  341,  343,  345,  346,  347,  348,  353,  354,  677,
         688,  689,  690,  749,  750,  751,  873,  878,  879,  880,  884,
         885,  886,  891,  894,  897,  899,  904,  907,  989, 1232, 1233,
        1234, 1235, 1236, 1237], dtype=int64), 59)

In [109]:
# confirm that movie_name 258 (column index 257) is rated by user_id=100
rating_mat.loc[100, 250:260]

movie_name
250    0
251    0
252    0
253    0
254    0
255    0
256    0
257    0
258    4
259    0
260    0
Name: 100, dtype: int64

In [111]:
# movie_name's that are rated by user 100
itemname_rated_by_this_user = rating_mat.columns[itemloc_rated_by_this_user]    
itemname_rated_by_this_user, len(itemname_rated_by_this_user) 

(Int64Index([ 258,  266,  268,  269,  270,  271,  272,  286,  288,  289,  292,
              294,  300,  302,  310,  313,  315,  316,  321,  323,  326,  328,
              333,  340,  342,  344,  346,  347,  348,  349,  354,  355,  678,
              689,  690,  691,  750,  751,  752,  874,  879,  880,  881,  885,
              886,  887,  892,  895,  898,  900,  905,  908,  990, 1233, 1234,
             1235, 1236, 1237, 1238],
            dtype='int64', name='movie_name'), 59)

In [114]:
# cross-check: selecting columns with a rating > 0 gives the same movie names as above
rating_mat.columns[rating_mat.loc[user_id] > 0]

Int64Index([ 258,  266,  268,  269,  270,  271,  272,  286,  288,  289,  292,
             294,  300,  302,  310,  313,  315,  316,  321,  323,  326,  328,
             333,  340,  342,  344,  346,  347,  348,  349,  354,  355,  678,
             689,  690,  691,  750,  751,  752,  874,  879,  880,  881,  885,
             886,  887,  892,  895,  898,  900,  905,  908,  990, 1233, 1234,
            1235, 1236, 1237, 1238],
           dtype='int64', name='movie_name')
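
Because the notebook switches between column positions and movie names, a small sketch of the mapping may help. The Series `user_row` and its labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings of one user, indexed by movie label
user_row = pd.Series([0, 4, 0, 2], index=[101, 102, 103, 104])

positions = np.flatnonzero(user_row.to_numpy())   # positional indices of rated movies
labels = user_row.index[positions]                # corresponding movie labels
print(positions, list(labels))
```

Positions are what NumPy arrays such as `item_sim_mat` and `neighborhoods_loc` understand, while labels are what the final recommendation should report.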

In [116]:
# Initialize prediction matrix dimension
n_users = rating_mat.shape[0]    
n_items = rating_mat.shape[1]
print(n_users, n_items)

943 1682

In [117]:
# create empty array to save predicted ratings
out = np.zeros(n_items)
len(out)

1682

Apply the item-item collaborative filtering equation to predict this user's rating for every movie. A movie gets a prediction of 0 when the user has rated none of its neighbors.
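
Written out, the prediction computed in the next cell is a similarity-weighted average. The symbols below are notation introduced here for illustration: $N(i)$ is the set of 75 nearest neighbors of movie $i$, $I_u$ is the set of movies rated by user $u$, $s_{ij}$ is the cosine similarity between movies $i$ and $j$, and $r_{u,j}$ is the rating user $u$ gave to movie $j$.

$$
\hat{r}_{u,i} = \frac{\sum_{j \in N(i) \cap I_u} s_{ij} \, r_{u,j}}{\sum_{j \in N(i) \cap I_u} s_{ij}}
$$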

In [151]:
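# loop through all movies, including those already rated by user 100
for x in range(n_items):

    # neighbors of movie x that user 100 has rated (column index positions)
    relevant_items_loc = np.intersect1d(neighborhoods_loc[x,:], itemloc_rated_by_this_user, assume_unique=True)

    # predict the rating for (user 100, movie x) as the similarity-weighted average
    # of the user's ratings; the result is NaN (0/0) when no neighbor has been rated
    out[x] = np.sum(rating_mat.iloc[user_id_loc, relevant_items_loc] * item_sim_mat[x, relevant_items_loc]) /  \
    (item_sim_mat[x, relevant_items_loc].sum())

# replace NaN by 0
pred_ratings = np.nan_to_num(out)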

#check result
print(pred_ratings[:20])



[4.         0.         0.         0.         2.         3.
 3.3075684  0.         3.48191162 0.         2.         0.
 3.         3.45513014 4.         3.         0.         4.
 3.60698455 3.        ]
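
For a single movie, the prediction reduces to a weighted average that can be checked by hand. The sketch below uses made-up similarities and ratings; `predict_rating` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

# Hypothetical values for one target movie (not taken from the notebook output)
neighbor_sims = np.array([0.9, 0.5, 0.1])   # similarity of the target movie to 3 rated neighbors
user_ratings = np.array([5.0, 3.0, 1.0])    # this user's ratings of those neighbors

def predict_rating(sims, ratings):
    """Similarity-weighted average; returns 0.0 when there is nothing to average."""
    total = sims.sum()
    if total == 0:
        return 0.0
    return float(np.dot(sims, ratings) / total)

pred = predict_rating(neighbor_sims, user_ratings)
print(pred)   # (0.9*5 + 0.5*3 + 0.1*1) / (0.9 + 0.5 + 0.1)
```

Returning 0.0 for an empty neighborhood plays the same role as `np.nan_to_num` in the loop above, without triggering a division-by-zero warning.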


# 6. Recommend new movies to user based on highest predicted ratings

Recommend top-100 new movies for user_id=100 based on the highest predicted ratings.

In [152]:
# sort the predicted ratings in decreasing order, return item INDICES
itemloc_sorted_by_pred_rating = list(np.argsort(pred_ratings))[::-1]

# check the sorted predicted ratings (these still include movies already rated by user_id=100)
print(pred_ratings[itemloc_sorted_by_pred_rating])
print(pred_ratings[itemloc_sorted_by_pred_rating].shape)

[4.0128994 4.        4.        ... 0.        0.        0.       ]
(1682,)


### Make a final recommendation

In [153]:
# number of movies to recommend
n = 100

# extract the movie_names in order of decreasing predicted rating,
# excluding the movies that have already been rated by user_id=100
itemname_recommend_excludeseen = [rating_mat.columns[i] for i in itemloc_sorted_by_pred_rating 
                                      if rating_mat.columns[i] not in itemname_rated_by_this_user]

print('Top-100 new movie recommendation for user_id=100:','\n', itemname_recommend_excludeseen[:n])

Top-100 new movie recommendation for user_id=100: 
 [912, 911, 409, 181, 1152, 50, 845, 811, 799, 596, 220, 756, 533, 1602, 1197, 1317, 1314, 1621, 256, 1627, 1628, 25, 1416, 864, 1281, 146, 109, 926, 934, 121, 122, 1482, 125, 1462, 993, 1443, 865, 456, 1023, 1028, 1033, 1047, 1051, 1056, 1060, 866, 713, 1, 274, 267, 291, 1214, 1245, 284, 669, 281, 280, 1255, 225, 18, 15, 1279, 1661, 1483, 1656, 1464, 345, 916, 250, 871, 252, 297, 236, 906, 116, 298, 311, 275, 744, 126, 19, 255, 362, 312, 245, 754, 1313, 1618, 257, 894, 1596, 1595, 896, 1061, 904, 471, 676, 306, 237, 902]
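
The same "rank, then drop what the user has already rated" logic can be sketched with NumPy set operations. The arrays `pred` and `seen` below are hypothetical, and this is one possible formulation, not the notebook's exact code.

```python
import numpy as np

# Hypothetical predicted ratings for 5 items and the positions the user already rated
pred = np.array([3.2, 4.5, 0.0, 4.1, 3.9])
seen = np.array([1])
n = 2

candidates = np.setdiff1d(np.arange(len(pred)), seen)      # positions not yet rated
ranked = candidates[np.argsort(pred[candidates])[::-1]]    # highest prediction first
top_n = ranked[:n]
print(top_n)   # [3 4]
```

Filtering before sorting avoids a membership test per item, which is a modest saving at this scale but keeps the intent explicit.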


# 7. Check prediction performance

During prediction, I also predicted ratings for the movies that user_id=100 has already rated. Here I compared the true ratings given by user_id=100 with the predicted ratings, and calculated the MAE and MSE.

In [154]:
# true ratings by user_id=100
ratings_true = rating_mat.iloc[user_id_loc, itemloc_rated_by_this_user] 

# prediction
ratings_pred = pred_ratings[itemloc_rated_by_this_user]

#print(list(zip(np.array(ratings_true).squeeze(),ratings_pred)))
print(list(zip(ratings_true,ratings_pred)))

[(4, 3.5468646513368975), (2, 2.962536455874424), (3, 3.401921624724125), (4, 3.5666630685180025), (3, 3.5099163641159863), (3, 3.5548628579705066), (4, 3.609749850957933), (3, 3.5294282809148756), (2, 3.477526852312378), (3, 3.4882560994076277), (2, 3.2034047164739223), (4, 3.350995270945521), (4, 3.5032582049344105), (4, 3.5827822648537713), (3, 3.4663421405345094), (5, 3.4812572429409854), (5, 3.458844884992249), (5, 3.5322998697089107), (1, 3.537638791279426), (3, 3.4632683887987477), (3, 3.3912796507695298), (4, 3.428473563448733), (3, 3.5138802341121385), (3, 3.4599994730404005), (3, 3.4700421665530485), (4, 3.4880892316710463), (3, 3.543307540075546), (4, 3.560884154166691), (3, 3.5148395097422753), (3, 3.212814683011735), (2, 3.6624373488994157), (4, 3.3748907859711106), (3, 3.3796380263097747), (3, 3.5259327233309836), (4, 3.3399419844783793), (4, 3.564995857590156), (4, 3.5379402996125306), (4, 3.5287043022808495), (4, 3.329584760967123), (1, 2.6649515920723115), (4, 3.328171

### Evaluate rating prediction accuracy using MAE and MSE

In [155]:
# Mean Absolute Error (MAE)
print(abs(ratings_true-ratings_pred).mean())

0.8103744080815104


In [156]:
# Mean Squared Error (MSE)
(np.array(ratings_true-ratings_pred)**2).mean()

0.9880603990967685
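
For reference, both metrics are simple averages over the prediction errors. The sketch below uses made-up ratings and also reports the RMSE, which is an addition here and is not computed in the notebook.

```python
import numpy as np

# Hypothetical true and predicted ratings
y_true = np.array([4, 2, 3, 5], dtype=float)
y_pred = np.array([3.5, 3.0, 3.0, 3.5])

errors = y_true - y_pred
mae = np.abs(errors).mean()    # mean absolute error
mse = (errors ** 2).mean()     # mean squared error
rmse = np.sqrt(mse)            # root mean squared error, in the same units as the ratings
print(mae, mse, rmse)
```

MSE penalizes large misses more heavily than MAE, so comparing the two gives a rough sense of whether the error is spread evenly or driven by a few poor predictions.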

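* With an MAE of about 0.81 and an MSE of about 0.99 on the 1-5 scale, the predicted ratings for user_id=100 are off by a little less than one star on average. These ratings were also used to build the similarity matrix, so this is not a held-out estimate.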

# 8. Summary

* Item-item collaborative filtering with cosine similarity and 75 neighbors predicted the known ratings of user_id=100 with an MAE of about 0.81 and an MSE of about 0.99 on the 1-5 rating scale.