In this notebook, we demonstrate a movie recommender system based on user-based collaborative filtering. We find the neighborhood of a given user (the most similar users) with a pairwise similarity measure: cosine similarity in the code below, with Pearson correlation kept as a commented-out alternative. We then use a normalized weighted average of the neighbors' ratings to score each movie. We use the MovieLens dataset, which is widely available online.

Below we explore the data and build the recommender system step by step. At the end, we put it all together in one function that can be used to recommend N movies to a given user.

In [1]:
import os
import pandas as pd

In [2]:
# configure file path
data_path = 'data/movielens-small'
movies_filename = 'movies.csv'
ratings_filename = 'ratings.csv'

# read data
df_movies = pd.read_csv(
    os.path.join(data_path, movies_filename),
    usecols=['movieId', 'title'],
    dtype={'movieId': 'int32', 'title': 'str'})

df_ratings = pd.read_csv(
    os.path.join(data_path, ratings_filename),
    usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})

In [3]:
df_movies.head()

,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [4]:
df_movies.shape

(9742, 2)

In [5]:
df_ratings.head()

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [6]:
df_ratings.shape

(100836, 3)

We want the data in an m x n array, where m is the number of movies and n is the number of users. To reshape the ratings dataframe, we’ll pivot it to wide format with movies as rows and users as columns. Then we’ll fill the missing observations with 0s, since we’re going to perform linear algebra operations on it.

In [7]:
# pivot ratings into movie features with movies in the rows and users as columns
df_movie_user = df_ratings.pivot(
    index='movieId',
    columns='userId',
    values='rating'
).fillna(0)

print(df_movie_user.shape)
df_movie_user.head()

(9724, 610)


movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
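
Most entries of this matrix are zeros that stand for "not rated" rather than for a rating of 0: with 100836 ratings spread over 9724 x 610 cells, only about 1.7% of the entries are filled. A minimal sketch of how this density could be measured, shown on a small made-up ratings table (the `toy_ratings` data and the `density` helper are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# toy ratings in the same long format as ratings.csv (made-up values)
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# movies as rows, users as columns, missing ratings filled with 0
toy_movie_user = toy_ratings.pivot(
    index='movieId', columns='userId', values='rating').fillna(0)

def density(df_wide):
    """Fraction of cells that hold an actual (non-zero) rating."""
    return (df_wide != 0).to_numpy().mean()

print(toy_movie_user.shape)      # (3, 3)
print(density(toy_movie_user))   # 5 ratings spread over 9 cells
```
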


In [8]:
num_users = len(df_ratings.userId.unique())
num_items = len(df_ratings.movieId.unique())
print('There are {} unique users and {} unique movies in this data set'.format(num_users, num_items))

There are 610 unique users and 9724 unique movies in this data set


In [9]:
# get rating frequency: the number of ratings each movie received
df_movies_cnt = pd.DataFrame(df_ratings.groupby('movieId').size(), columns=['count'])
df_movies_cnt.head()

movieId,count
1,215
2,110
3,52
4,7
5,49


Next, we keep only movies that have been rated at least 10 times, so that each remaining movie has enough ratings to give some idea of how users react to it.

In [10]:
popularity_thres = 10
popular_movies = list(set(df_movies_cnt.query('count >= @popularity_thres').index))
df_ratings_drop_movies = df_ratings[df_ratings.movieId.isin(popular_movies)]
print('shape of original ratings data: ', df_ratings.shape)
print('shape of ratings data after dropping unpopular movies: ', df_ratings_drop_movies.shape)

shape of original ratings data:  (100836, 3)
shape of ratings data after dropping unpopular movies:  (81116, 3)
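
The same kind of filter can also be written without building an intermediate list of movieIds. A sketch on made-up data, assuming a threshold of 2 instead of 10 to keep the example small (`toy_ratings` and `min_ratings` are illustrative names):

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 3],
    'movieId': [10, 10, 10, 20, 20, 30],
    'rating':  [4.0, 5.0, 3.0, 2.0, 4.5, 1.0],
})

min_ratings = 2  # hypothetical threshold for this toy example

# number of ratings per movie, broadcast back onto every rating row
ratings_per_movie = toy_ratings.groupby('movieId')['rating'].transform('count')
toy_popular = toy_ratings[ratings_per_movie >= min_ratings]

print(sorted(toy_popular['movieId'].unique().tolist()))   # movie 30 is dropped
```
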


In [11]:
# pivot ratings into movie features with movies in the rows and users as columns
df_pop_movie_user = df_ratings_drop_movies.pivot(
    index='movieId',
    columns='userId',
    values='rating'
).fillna(0)

print(df_pop_movie_user.shape)
df_pop_movie_user.head()

(2269, 610)


movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
6,4.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,3.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,5.0


In [16]:
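# df_user_sim = df_pop_movie_user.corr(method='pearson')
# df_user_sim.head()

# NOTE: cosine_similarity compares rows. df_pop_movie_user has movies as rows,
# so this matrix is 2269 x 2269 and labelled by row position, not by userId.
from sklearn.metrics.pairwise import cosine_similarity
df_user_sim = pd.DataFrame(cosine_similarity(df_pop_movie_user))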
df_user_sim.head()

,0,1,2,3,4,5,6,7,8,9,...,2259,2260,2261,2262,2263,2264,2265,2266,2267,2268
0,1.0,0.410562,0.296917,0.308762,0.376316,0.277491,0.232586,0.395573,0.323976,0.21968,...,0.066689,0.174998,0.162959,0.106232,0.169289,0.128873,0.14025,0.068119,0.116364,0.095825
1,0.410562,1.0,0.282438,0.287795,0.297009,0.228576,0.044835,0.417693,0.322252,0.168642,...,0.092751,0.233863,0.09654,0.095374,0.256621,0.112057,0.203355,0.037398,0.154905,0.162151
2,0.296917,0.282438,1.0,0.417802,0.284257,0.402831,0.30484,0.242954,0.249568,0.203237,...,0.0,0.056977,0.052839,0.012543,0.054778,0.081208,0.078187,0.0,0.060823,0.0
3,0.308762,0.287795,0.417802,1.0,0.298968,0.474002,0.335058,0.218061,0.272182,0.207889,...,0.060572,0.06172,0.029112,0.036859,0.033018,0.034417,0.045416,0.0,0.033512,0.0
4,0.376316,0.297009,0.284257,0.298968,1.0,0.244105,0.214088,0.386414,0.289365,0.168019,...,0.0,0.177203,0.056091,0.095533,0.190849,0.12694,0.174638,0.0,0.065079,0.088478
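
The matrix above has one row and one column per movie (labels 0 to 2268), because `cosine_similarity` measures similarity between the rows of its input. A minimal sketch of a user-to-user version, assuming the transpose of the movie-by-user matrix is what is wanted; the toy matrix and the name `toy_user_sim` are illustrative:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# toy movie-by-user matrix (rows: movieId, columns: userId), 0 = not rated
toy_movie_user = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [3.0, 0.0, 1.0],
     [0.0, 5.0, 4.0],
     [4.0, 4.0, 0.0]],
    index=pd.Index([10, 20, 30, 40], name='movieId'),
    columns=pd.Index([1, 2, 3], name='userId'),
)

# transpose so that each row is a user, then keep the userId labels
toy_user_sim = pd.DataFrame(
    cosine_similarity(toy_movie_user.T),
    index=toy_movie_user.columns,
    columns=toy_movie_user.columns,
)

print(toy_user_sim.shape)    # (3, 3): one row and one column per user
print(toy_user_sim.round(3))
```

With userId labels on both axes, `toy_user_sim[1]` returns the similarities of user 1 to every other user, so the index of the sorted result can be used directly to select columns of the ratings matrix.
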


In [17]:
# pick user by userId
userNum = 3

# let's get an idea of what our user has rated (sorted by most liked)
# seems like some science-fiction, horror, dystopian
df_user_ratings = df_ratings[df_ratings['userId']==userNum]
df_user_ratings = pd.merge(df_user_ratings, 
                           df_movies[['movieId', 'title']],
                           left_on='movieId',
                           right_on='movieId',
                           how='left').sort_values(by=['rating'], ascending=False)
df_user_ratings.head(20)

,userId,movieId,rating,title
28,3,5181,5.0,Hangar 18 (1980)
37,3,70946,5.0,Troll 2 (1990)
35,3,7991,5.0,Death Race 2000 (1975)
5,3,849,5.0,Escape from L.A. (1996)
33,3,6835,5.0,Alien Contamination (1980)
21,3,2851,5.0,Saturn 3 (1980)
31,3,5919,5.0,Android (1982)
24,3,3703,5.0,"Road Warrior, The (Mad Max 2) (1981)"
29,3,5746,5.0,Galaxy of Terror (Quest) (1981)
26,3,4518,5.0,The Lair of the White Worm (1988)


In [18]:
num_neighbors = 30

# get weights for top k neighbors (most similar)
top_sim_neighbors = df_user_sim[userNum].sort_values(ascending=False)[1:num_neighbors+1]

# get ratings for top neighbors
neighbor_ratings = df_pop_movie_user[top_sim_neighbors.index]
# ignore 0.0 ratings (treat them as NaN)
neighbor_ratings = neighbor_ratings[neighbor_ratings[top_sim_neighbors.index]!=0]
neighbor_ratings.head()

movieId \ userId,5,46,320,40,2,140,291,311,420,345,...,323,6,328,302,304,294,214,56,259,168
1,4.0,5.0,,5.0,,3.0,4.0,,4.0,,...,3.5,,5.0,,5.0,,3.0,,,
2,,,,,,3.5,,,,,...,4.0,4.0,,,4.0,3.0,,,2.0,
3,,,,,,,,,,,...,,5.0,,3.0,,1.0,,,,
5,,,,,,,,,,,...,,5.0,,,,,,,,
6,,,,,,5.0,,,,,...,,4.0,,,,3.0,,,,


In [19]:
# compute the average rating of neighbor while ignoring 0.0 ratings (no ratings)
neighbor_mean_ratings = neighbor_ratings.mean()
neighbor_mean_ratings

userId
5      3.636364
46     4.000000
320    3.666667
40     3.784946
2      3.980769
140    3.572248
291    4.300000
311    2.173913
420    3.838710
345    3.852273
290    4.230047
66     4.092982
289    3.527778
334    3.410072
333    2.550000
52     4.500000
324    3.461539
575    3.650000
65     4.029412
306    3.312500
323    3.200000
6      3.615044
328    3.237154
302    4.032258
304    3.892157
294    2.742604
214    2.842105
56     3.804348
259    3.053571
168    4.432099
dtype: float32

In [20]:
# normalize the ratings of neighbors by subtracting their mean ratings (ignore 0.0 ratings)
neighbor_norm_ratings = neighbor_ratings - neighbor_mean_ratings
neighbor_norm_ratings.head()

movieId \ userId,5,46,320,40,2,140,291,311,420,345,...,323,6,328,302,304,294,214,56,259,168
1,0.363636,1.0,,1.215054,,-0.572248,-0.3,,0.16129,,...,0.3,,1.762846,,1.107843,,0.157895,,,
2,,,,,,-0.072248,,,,,...,0.8,0.384956,,,0.107843,0.257396,,,-1.053571,
3,,,,,,,,,,,...,,1.384956,,-1.032258,,-1.742604,,,,
5,,,,,,,,,,,...,,1.384956,,,,,,,,
6,,,,,,1.427752,,,,,...,,0.384956,,,,0.257396,,,,
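
Mean-centering removes each neighbor's personal rating bias, so a 4.0 from a generous rater counts for less than a 4.0 from a harsh one. A small sketch with made-up ratings (`toy` is an illustrative name); `DataFrame.mean()` skips NaN by default, so unrated movies do not drag the averages down:

```python
import numpy as np
import pandas as pd

# toy neighbor ratings (rows: movieId, columns: userId), NaN = not rated
toy = pd.DataFrame(
    {5: [4.0, np.nan, 5.0], 46: [2.0, 3.0, np.nan]},
    index=pd.Index([1, 2, 3], name='movieId'),
)

col_means = toy.mean()           # per-user mean, NaN entries ignored
toy_centered = toy - col_means   # subtract each user's mean from that user's column

print(col_means.to_dict())       # {5: 4.5, 46: 2.5}
print(toy_centered)
```
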


In [21]:
top_sim_neighbors

5      0.474002
46     0.465586
320    0.425670
40     0.418757
2      0.417802
140    0.416203
291    0.409324
311    0.401510
420    0.400913
345    0.400142
290    0.388230
66     0.385533
289    0.378982
334    0.377953
333    0.375399
52     0.363690
324    0.359142
575    0.355113
65     0.353285
306    0.345097
323    0.335926
6      0.335058
328    0.332776
302    0.329232
304    0.326705
294    0.321737
214    0.319684
56     0.319414
259    0.318525
168    0.318133
Name: 3, dtype: float32

In [22]:
# compute sum of products of normalized neighbor ratings
sum_prod = neighbor_norm_ratings.fillna(0).dot(top_sim_neighbors)
sum_prod.head()

movieId
1    1.859113
2    0.150111
3   -0.825983
5    0.428193
6    0.806031
dtype: float32

In [23]:
# create a rating mask (True/False) and apply it to the weights (similarities)
# so that we sum only the weights of neighbors that rated each movie
rating_mask = neighbor_ratings[top_sim_neighbors.index]>0
rating_mask.head(10)

movieId \ userId,5,46,320,40,2,140,291,311,420,345,...,323,6,328,302,304,294,214,56,259,168
1,True,True,False,True,False,True,True,False,True,False,...,True,False,True,False,True,False,True,False,False,False
2,False,False,False,False,False,True,False,False,False,False,...,True,True,False,False,True,True,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,...,False,True,False,True,False,True,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
6,False,False,False,False,False,True,False,False,False,False,...,False,True,False,False,False,True,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,True,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
10,False,True,False,False,False,False,False,False,False,False,...,False,True,False,False,True,True,False,True,False,False
11,False,False,False,False,False,True,False,False,False,True,...,False,True,False,False,True,False,False,True,False,False
12,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False


In [24]:
sum_weights = rating_mask.dot(top_sim_neighbors)
sum_weights.head(10)

movieId
1     5.051592
2     2.054156
3     1.365010
5     0.720591
6     1.072999
7     0.661764
9     0.000000
10    2.146455
11    2.185752
12    0.321737
dtype: float32
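
To make the two dot products concrete, here is a minimal sketch on made-up numbers (all names and values are illustrative). The numerator weights each centered rating by the neighbor's similarity, and the denominator adds up the similarities of only those neighbors who rated the movie. The mask here is taken from the centered ratings with `notna()`, which is assumed to be equivalent to testing the raw ratings for `> 0`:

```python
import numpy as np
import pandas as pd

# centered neighbor ratings (rows: movieId, columns: userId), NaN = not rated
toy_centered = pd.DataFrame(
    {5: [0.5, np.nan, -1.0], 46: [1.0, -0.5, np.nan]},
    index=pd.Index([1, 2, 3], name='movieId'),
)
# similarity of each neighbor to the target user
toy_weights = pd.Series({5: 0.4, 46: 0.2})

# numerator: similarity-weighted sum of centered ratings (missing counts as 0)
toy_sum_prod = toy_centered.fillna(0).dot(toy_weights)

# denominator: total similarity of the neighbors that rated each movie
toy_mask = toy_centered.notna()
toy_sum_weights = toy_mask.astype(float).dot(toy_weights)

toy_score = toy_sum_prod / toy_sum_weights
print(toy_score)
```

Movie 1 is rated by both neighbors, so its score is (0.5 * 0.4 + 1.0 * 0.2) / (0.4 + 0.2); movies 2 and 3 each have a single rater, so their scores reduce to that neighbor's centered rating.
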

In [25]:
# get user ratings
user_ratings = df_pop_movie_user[userNum]
# ignore 0.0 ratings (treat them as NaN)
user_mean_rating = user_ratings[user_ratings!=0].mean()
user_mean_rating

1.4791666269302368

In [26]:
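# compute the normalized weighted average score; this is an offset from the
# user's mean, so adding user_mean_rating would give a predicted rating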
weighted_avg_score = (sum_prod/sum_weights)
weighted_avg_score.sort_values(ascending=False, inplace=True)
weighted_avg_score.head()

movieId
5992    2.326087
2301    2.257396
417     2.257396
441     2.257396
1884    2.257396
dtype: float32
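
Written out, the score computed above can be summarized as follows. The notation is an assumption introduced here for illustration: u is the target user, i a movie, N_i(u) the neighbors of u who rated i, w_uv the similarity weight between u and v, r_vi a rating, and r-bar_v the mean rating of neighbor v.

```latex
s_{ui} \;=\;
\frac{\sum_{v \in N_i(u)} w_{uv}\,\bigl(r_{vi} - \bar{r}_v\bigr)}
     {\sum_{v \in N_i(u)} w_{uv}},
\qquad
\hat{r}_{ui} \;=\; \bar{r}_u + s_{ui}
```

The second expression is the predicted rating on the original scale, with r-bar_u corresponding to `user_mean_rating`. Adding this constant does not change the ordering of the movies, so ranking by the score alone gives the same recommendations.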

In [27]:
# keep only movies the user has not rated yet
seen_movies = df_ratings.loc[df_ratings['userId'] == userNum, 'movieId']
df_unseen_movies = df_movies[~df_movies['movieId'].isin(seen_movies)]
df_unseen_movies.head(10)

,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)
5,6,Heat (1995)
6,7,Sabrina (1995)
7,8,Tom and Huck (1995)
8,9,Sudden Death (1995)
9,10,GoldenEye (1995)
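
One way to express "unseen by the user" is to collect the movieIds the user has rated and exclude them from the movie list. A minimal sketch on made-up data (`toy_movies`, `toy_ratings` and `unseen_for` are illustrative names):

```python
import pandas as pd

toy_movies = pd.DataFrame({
    'movieId': [10, 20, 30],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2],
    'movieId': [10, 20, 10],
    'rating':  [4.0, 3.0, 5.0],
})

def unseen_for(user_id):
    """Movies in toy_movies that user_id has not rated."""
    seen = toy_ratings.loc[toy_ratings['userId'] == user_id, 'movieId']
    return toy_movies[~toy_movies['movieId'].isin(seen)]

print(unseen_for(1)['title'].tolist())   # ['Movie C']
print(unseen_for(2)['title'].tolist())   # ['Movie B', 'Movie C']
```
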


In [28]:
top_n = 10
recommended_movies = []

# collect the titles of the top_n movies from df_unseen_movies with the
# highest scores in weighted_avg_score (already sorted from high to low)
for movie_id, score in weighted_avg_score.items():
    titles = df_unseen_movies.loc[df_unseen_movies['movieId'] == movie_id, 'title']
    if titles.empty:
        continue  # not among the unseen candidates: skip
    recommended_movies.append(titles.iloc[0])
    if len(recommended_movies) >= top_n:
        break

recommended_movies

['Hours, The (2002)',
 'History of the World: Part I (1981)',
 'Barcelona (1994)',
 'Dazed and Confused (1993)',
 'Fear and Loathing in Las Vegas (1998)',
 "Cheech and Chong's Up in Smoke (1978)",
 'Bulworth (1998)',
 'Welcome to the Dollhouse (1995)',
 'Young Frankenstein (1974)',
 'Toy Story 3 (2010)']

In [31]:
# needs user movie ratings (df_pop_movie_user) to be available

from sklearn.metrics.pairwise import cosine_similarity

def get_similarity_matrix():
    
    # pearson correlation
    # df_user_sim = df_pop_movie_user.corr(method='pearson')     
    # cosine similarity
    df_user_sim = pd.DataFrame(cosine_similarity(df_pop_movie_user))
    
    return df_user_sim
    

# putting it all together in a function
def get_movie_recommendations(user_num, top_n, num_neighbors):
    
    df_user_sim = get_similarity_matrix()
    
    # get weights for top similar neighbor
    top_sim_neighbors = df_user_sim[user_num].sort_values(ascending=False)[1:num_neighbors+1]
    
    # get ratings for top neighbors
    neighbor_ratings = df_pop_movie_user[top_sim_neighbors.index]
    # ignore 0.0 ratings (treat them as NaN)
    neighbor_ratings = neighbor_ratings[neighbor_ratings[top_sim_neighbors.index]!=0]
    
    # compute the average rating of neighbor while ignoring 0.0 ratings (no ratings)
    neighbor_mean_ratings = neighbor_ratings.mean()

    # normalize the ratings of neighbors by subtracting their mean ratings (ignore 0.0 ratings)
    neighbor_norm_ratings = neighbor_ratings - neighbor_mean_ratings

    # compute sum of products of normalized neighbor ratings
    sum_prod = neighbor_norm_ratings.fillna(0).dot(top_sim_neighbors)

    # create a rating mask (0,1) and apply to weights (correlations)
    # so that we sum over neighbor weights that have ratings
    rating_mask = neighbor_ratings[top_sim_neighbors.index]>0
    sum_weights = rating_mask.dot(top_sim_neighbors)

    # get user ratings
    user_ratings = df_pop_movie_user[user_num]
    # ignore 0.0 ratings (treat them as NaN)
    user_mean_rating = user_ratings[user_ratings!=0].mean()

    # compute the normalized weighted average score and sort from highest to lowest
    weighted_avg_score = (sum_prod/sum_weights)
    weighted_avg_score.sort_values(ascending=False, inplace=True)

    # keep only movies the user has not rated yet
    # (uses the user_num argument rather than the global userNum)
    seen_movies = df_ratings.loc[df_ratings['userId'] == user_num, 'movieId']
    df_unseen_movies = df_movies[~df_movies['movieId'].isin(seen_movies)]

    # collect the titles of the top_n unseen movies with the highest scores
    recommended_movies = []
    for movie_id, score in weighted_avg_score.items():
        titles = df_unseen_movies.loc[df_unseen_movies['movieId'] == movie_id, 'title']
        if titles.empty:
            continue  # already rated by the user: skip
        recommended_movies.append(titles.iloc[0])
        if len(recommended_movies) >= top_n:
            break

    return recommended_movies

In [32]:
user_num = 3
top_n = 10
num_neighbors = 30
rec_movies = get_movie_recommendations(user_num, top_n, num_neighbors)
rec_movies

['Hours, The (2002)',
 'History of the World: Part I (1981)',
 'Barcelona (1994)',
 'Dazed and Confused (1993)',
 'Fear and Loathing in Las Vegas (1998)',
 "Cheech and Chong's Up in Smoke (1978)",
 'Bulworth (1998)',
 'Welcome to the Dollhouse (1995)',
 'Young Frankenstein (1974)',
 'Toy Story 3 (2010)']
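
For comparison, here is a compact, self-contained sketch of the same pipeline on a made-up 4 x 4 matrix. It is not the notebook's function: it assumes user-to-user cosine similarity (computed on the transposed matrix and labelled by userId) and drops already-rated movies before ranking. The `recommend` helper and the `toy` data are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend(movie_user, user_id, top_n=2, num_neighbors=2):
    """Rank unseen movies for user_id from a movie-by-user matrix (0 = not rated)."""
    # user-to-user cosine similarity, labelled by userId
    users = movie_user.columns
    sim = pd.DataFrame(cosine_similarity(movie_user.T), index=users, columns=users)

    # most similar users, excluding the user itself
    weights = sim[user_id].drop(user_id).nlargest(num_neighbors)

    # neighbor ratings with 0 treated as missing, centered on each neighbor's mean
    ratings = movie_user[weights.index].replace(0, np.nan)
    centered = ratings - ratings.mean()

    # similarity-weighted average of the centered ratings
    sum_prod = centered.fillna(0).dot(weights)
    sum_weights = ratings.notna().astype(float).dot(weights)
    scores = (sum_prod / sum_weights).dropna()

    # keep only movies the user has not rated, best first
    unseen = movie_user.index[movie_user[user_id] == 0]
    return scores[scores.index.isin(unseen)].sort_values(ascending=False).head(top_n)

toy = pd.DataFrame(
    [[5.0, 4.0, 0.0, 5.0],
     [0.0, 5.0, 4.0, 4.0],
     [4.0, 0.0, 5.0, 0.0],
     [0.0, 2.0, 1.0, 3.0]],
    index=pd.Index([10, 20, 30, 40], name='movieId'),
    columns=pd.Index([1, 2, 3, 4], name='userId'),
)
result = recommend(toy, user_id=1)
print(result)
```

User 1 has rated movies 10 and 30, so only movies 20 and 40 are candidates. The returned Series is indexed by movieId and holds mean-centered scores, which could be joined with a title table in a separate step.
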