This notebook implements a movie recommender algorithm using matrix factorization.

A user-movie ratings matrix is a sparse, high-dimensional representation of users' tastes in movies. By performing dimensionality reduction on this matrix, we reduce the computational complexity of our movie predictions and get a more compact representation of users' tastes. To do this, we use singular value decomposition (SVD) to decompose the user-movie ratings matrix into k latent factors. This method decomposes the ratings matrix (R) into three matrices: a user matrix U, an item matrix V, and a diagonal matrix S that contains the singular values, which represent how important each latent factor is in expressing R.

R = U x S x VT

(VT is V transposed)
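
As a small illustration of the decomposition, the sketch below applies NumPy's SVD to a toy 4 x 3 matrix. The matrix values and the names R_toy, U_toy, s_toy, Vt_toy and R_k are made up for this example and are not part of the notebook's data.

```python
import numpy as np

# Toy ratings matrix (hypothetical values): 4 users x 3 movies
R_toy = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Thin SVD: R_toy = U @ diag(s) @ Vt, with s sorted from largest to smallest
U_toy, s_toy, Vt_toy = np.linalg.svd(R_toy, full_matrices=False)
R_full = U_toy @ np.diag(s_toy) @ Vt_toy

# Rank-k approximation: keep only the k largest singular values
k = 2
R_k = U_toy[:, :k] @ np.diag(s_toy[:k]) @ Vt_toy[:k, :]

print(np.allclose(R_full, R_toy))   # all three factors reproduce R_toy
print(np.linalg.norm(R_toy - R_k))  # error introduced by dropping one factor
```

With all singular values kept, the product reproduces the matrix exactly. With only k of them, the product is an approximation, and here its Frobenius-norm error equals the single discarded singular value.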

In [40]:
import os
import pandas as pd
import numpy as np

Read the MovieLens files and explore the movies and ratings data.

In [41]:
# configure file path
data_path = 'data/movielens-small'
movies_filename = 'movies.csv'
ratings_filename = 'ratings.csv'

# read data
df_movies = pd.read_csv(
    os.path.join(data_path, movies_filename),
    usecols=['movieId', 'title'],
    dtype={'movieId': 'int32', 'title': 'str'})

df_ratings = pd.read_csv(
    os.path.join(data_path, ratings_filename),
    usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})

In [42]:
df_movies.shape

(9742, 2)

In [43]:
df_movies.head()

,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [44]:
df_ratings.shape

(100836, 3)

In [45]:
df_ratings.head()

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


Create the R matrix by pivoting the ratings table so that users are the rows and movies are the columns. Movies a user has not rated are filled with 0.

In [46]:
df_R = df_ratings.pivot(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(0)

print(df_R.shape)
df_R.head()

(610, 9724)


movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
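
To see how sparse R is, the shapes printed above are enough: 100836 ratings spread over a 610 x 9724 matrix. The sketch below just redoes that arithmetic. It assumes each (userId, movieId) pair appears at most once, which pivot requires anyway, and the variable names are illustrative.

```python
# Figures taken from df_ratings.shape and df_R.shape above
n_ratings = 100836
n_users, n_movies = 610, 9724

# Fraction of user-movie cells that hold an actual rating
density = n_ratings / (n_users * n_movies)
print(f"{density:.2%}")  # 1.70%
```

So roughly 98% of the entries in R are the zeros introduced by fillna(0), not ratings.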


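Normalize the ratings matrix by subtracting each user's mean rating from that user's row. Note that np.mean(R, axis = 1) averages over all 9724 movies, including the unrated ones filled with 0, so each mean is much lower than the user's average over the movies they actually rated.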

In [47]:
R = df_R.values
mean_user_ratings = np.mean(R, axis = 1)
R_norm = R - mean_user_ratings.reshape(-1, 1)
R_norm

array([[ 3.8958247 , -0.10417524,  3.8958247 , ..., -0.10417524,
        -0.10417524, -0.10417524],
       [-0.01177499, -0.01177499, -0.01177499, ..., -0.01177499,
        -0.01177499, -0.01177499],
       [-0.00976964, -0.00976964, -0.00976964, ..., -0.00976964,
        -0.00976964, -0.00976964],
       ...,
       [ 2.2321575 ,  1.7321576 ,  1.7321576 , ..., -0.26784244,
        -0.26784244, -0.26784244],
       [ 2.9875565 , -0.01244344, -0.01244344, ..., -0.01244344,
        -0.01244344, -0.01244344],
       [ 4.506119  , -0.4938811 , -0.4938811 , ..., -0.4938811 ,
        -0.4938811 , -0.4938811 ]], dtype=float32)
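
The call np.mean(R, axis = 1) averages over every column, unrated zeros included. A common alternative is to average over rated entries only. The sketch below contrasts the two on a toy matrix. It is not what this notebook does, and the names R_toy, rated, mean_all, mean_rated and R_centered are hypothetical.

```python
import numpy as np

# Toy matrix (hypothetical values); 0 marks "not rated", as in df_R
R_toy = np.array([
    [4.0, 0.0, 5.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
])

rated = R_toy != 0

# Mean over every column, unrated zeros included
mean_all = R_toy.mean(axis=1)

# Mean over rated entries only; np.maximum guards against users with no ratings
mean_rated = R_toy.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)

# Center only the rated entries and leave the unrated ones at 0
R_centered = np.where(rated, R_toy - mean_rated[:, None], 0.0)

print(mean_all)
print(mean_rated)
print(R_centered)
```

For the first toy user the two means are 2.25 and 4.5. If this variant were used, mean_rated would be the value to add back after reconstruction.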

Singular Value Decomposition

Here, we use SciPy's svds function to decompose our normalized ratings matrix. NumPy also has an SVD function, but svds lets us set how many latent factors (k) we want.

In [51]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_norm, k = 30)

In [52]:
U.shape

(610, 30)

In [53]:
Vt.shape

(30, 9724)

In [54]:
sigma

array([ 79.21219 ,  80.5675  ,  81.54676 ,  82.19736 ,  83.0445  ,
        85.116875,  85.74872 ,  86.51713 ,  87.91551 ,  90.33576 ,
        90.934074,  92.262726,  93.399796,  97.10067 ,  99.289055,
        99.8236  , 101.84795 , 105.97368 , 107.04784 , 109.20842 ,
       112.8084  , 120.61529 , 122.64724 , 134.58719 , 139.63722 ,
       153.93088 , 163.73082 , 184.86183 , 231.22456 , 474.2061  ],
      dtype=float32)

Note that sigma here is a 1-D array of singular values and not a diagonal matrix. In order to perform our matrix multiplication, we convert sigma into a diagonal matrix.

In [55]:
sigma = np.diag(sigma)
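
The singular values printed above run from smallest to largest, the reverse of what np.linalg.svd returns. This does not affect the product U x S x VT, but it matters when inspecting the most important factors. The sketch below reorders the output of svds on a random matrix and checks it against a full SVD. The matrix A and the seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))  # hypothetical dense matrix

k = 3
U_k, s_k, Vt_k = svds(A, k=k)

# Reorder so the largest singular value comes first
order = np.argsort(s_k)[::-1]
U_k, s_k, Vt_k = U_k[:, order], s_k[order], Vt_k[order, :]

# The k values should match the k largest values of a full SVD
s_full = np.linalg.svd(A, compute_uv=False)
print(np.allclose(s_k, s_full[:k]))
```

Sorting with argsort avoids relying on any particular output order from svds.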

Now let's generate the rating predictions by multiplying our three matrices together and adding back the mean user ratings we subtracted earlier to normalize the ratings matrix R.

In [69]:
predicted_ratings = np.dot(np.dot(U, sigma), Vt) + mean_user_ratings.reshape(-1, 1)
df_preds = pd.DataFrame(predicted_ratings, columns = df_R.columns)

In [70]:
df_preds.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
0,2.673603,1.051826,0.985558,-0.086072,0.122101,2.101631,-0.019406,0.019755,0.160969,1.726027,...,-0.010627,-0.009506,-0.011747,-0.011747,-0.010627,-0.011747,-0.010627,-0.010627,-0.010627,-0.042164
1,0.270084,-0.032866,0.014413,-0.009994,0.042291,-0.022205,-0.014739,0.005446,0.05657,-0.035256,...,0.007933,0.006519,0.009346,0.009346,0.007933,0.009346,0.007933,0.007933,0.007933,0.014577
2,-0.080945,0.035863,0.053648,0.008028,-0.010964,0.080076,-0.009024,-0.001583,0.006937,-0.014392,...,0.009266,0.009331,0.009202,0.009202,0.009266,0.009202,0.009266,0.009266,0.009266,0.008447
3,2.879184,0.068719,0.051349,0.060496,0.262046,0.604343,0.549285,0.029039,0.160018,-0.037984,...,-0.017895,-0.017231,-0.01856,-0.01856,-0.017895,-0.01856,-0.017895,-0.017895,-0.017895,-0.02358
4,1.399122,0.934423,0.292726,0.108831,0.430248,0.599319,0.434722,0.132197,0.012172,1.155039,...,-0.006577,-0.006308,-0.006847,-0.006847,-0.006577,-0.006847,-0.006577,-0.006577,-0.006577,-0.002686


Make top-n recommendations for user_id = 3, using our predictions matrix.

In [85]:
user_id = 3
user_row = user_id - 1
sorted_user_predictions = df_preds.iloc[user_row].sort_values(ascending=False)
sorted_user_predictions.head()

movieId
1200    0.294680
1214    0.265374
1127    0.219211
2529    0.208552
1240    0.207508
Name: 2, dtype: float32
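
The lookup above uses user_id - 1 because df_preds has a default 0-based row index (hence "Name: 2" for user 3), which only works while user ids are contiguous and start at 1. A sketch of a label-based alternative follows, using made-up ids and scores. In this notebook the equivalent would presumably be passing index=df_R.index when building df_preds, which is an assumption and not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions for three users whose ids are not contiguous
user_ids = pd.Index([1, 2, 5], name='userId')
movie_ids = pd.Index([10, 20, 30], name='movieId')
preds = np.array([
    [0.1, 0.9, 0.3],
    [0.7, 0.2, 0.4],
    [0.5, 0.6, 0.8],
])

# Keep userId as the row index instead of the default 0..n-1
df_preds_by_user = pd.DataFrame(preds, index=user_ids, columns=movie_ids)

# Label-based lookup: no user_id - 1 arithmetic needed
top = df_preds_by_user.loc[5].sort_values(ascending=False)
print(top.index.tolist())  # [30, 20, 10]
```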

In [86]:
# keep the movies that have been rated by at least one other user
# NOTE: this does not remove movies that user_id has already rated
unseen_movies = df_ratings[df_ratings['userId']!=user_id]['movieId'].to_list()
df_unseen_movies = df_movies[df_movies['movieId'].isin(unseen_movies)]
df_unseen_movies.head()

,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [87]:
# collect the titles of the top_n movies with the highest scores in sorted_user_predictions
top_n = 10
recommended_movies = []
for movie_id in sorted_user_predictions.index[:top_n]:
    title = df_unseen_movies[df_unseen_movies['movieId'] == movie_id]['title']
    recommended_movies.append(title.to_string(index=False).strip())

recommended_movies

['Aliens (1986)',
 'Alien (1979)',
 'Abyss, The (1989)',
 'Planet of the Apes (1968)',
 'Terminator, The (1984)',
 'Thing, The (1982)',
 'Army of Darkness (1993)',
 'Star Wars: Episode V - The Empire Strikes Back...',
 'Predator (1987)',
 'Road Warrior, The (Mad Max 2) (1981)']

Let's put our recommender into a nice function.

In [88]:
def recommend_movies(df_predictions, user_id, df_movies, df_ratings, top_n=5):
    """Return the titles of the top_n highest-scoring movies for user_id."""

    # Get and sort the user's predictions (rows are positional: user 1 is row 0)
    user_row = user_id - 1
    sorted_user_predictions = df_predictions.iloc[user_row].sort_values(ascending=False)

    # Keep the movies that have been rated by at least one other user
    # NOTE: this does not remove movies that user_id has already rated
    unseen_movies = df_ratings[df_ratings['userId'] != user_id]['movieId'].to_list()
    df_unseen_movies = df_movies[df_movies['movieId'].isin(unseen_movies)]

    # Collect the titles of the top_n movies with the highest predicted scores
    recommended_movies = []
    for movie_id in sorted_user_predictions.index[:top_n]:
        title = df_unseen_movies[df_unseen_movies['movieId'] == movie_id]['title']
        recommended_movies.append(title.to_string(index=False).strip())

    return recommended_movies

In [98]:
recommendations = recommend_movies(df_predictions=df_preds, user_id=3, df_movies=df_movies, df_ratings=df_ratings, top_n=10)
recommendations

['Aliens (1986)',
 'Alien (1979)',
 'Abyss, The (1989)',
 'Planet of the Apes (1968)',
 'Terminator, The (1984)',
 'Thing, The (1982)',
 'Army of Darkness (1993)',
 'Star Wars: Episode V - The Empire Strikes Back...',
 'Predator (1987)',
 'Road Warrior, The (Mad Max 2) (1981)']
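
One caveat about the filter used above: df_ratings['userId'] != user_id selects movies rated by other users, so a movie the target user has already rated can still be recommended. Below is a sketch of a variant that drops the user's own rated movies before ranking. The function name recommend_unseen_movies and the toy frames are hypothetical, and its results on the MovieLens data are not shown here.

```python
import pandas as pd

def recommend_unseen_movies(df_predictions, user_id, df_movies, df_ratings, top_n=5):
    """Sketch: top_n titles for user_id, skipping movies the user already rated.

    Assumes df_predictions has one row per user in userId order starting at 1
    (as df_preds does) and movieId values as column labels.
    """
    user_predictions = df_predictions.iloc[user_id - 1]

    # movieIds this user has already rated
    seen = df_ratings.loc[df_ratings['userId'] == user_id, 'movieId']

    # Drop the seen movies, then take the highest remaining scores
    unseen_predictions = user_predictions[~user_predictions.index.isin(seen)]
    top_ids = unseen_predictions.sort_values(ascending=False).head(top_n).index

    # Map ids to titles while preserving the ranking order
    titles = df_movies.set_index('movieId')['title']
    return titles.reindex(top_ids).dropna().tolist()

# Toy data (hypothetical): user 1 has rated movie 2, user 2 has rated movie 1
df_movies_toy = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})
df_ratings_toy = pd.DataFrame({'userId': [1, 2], 'movieId': [2, 1], 'rating': [5.0, 4.0]})
df_preds_toy = pd.DataFrame([[0.2, 0.9, 0.5], [0.8, 0.1, 0.3]], columns=[1, 2, 3])

result = recommend_unseen_movies(df_preds_toy, 1, df_movies_toy, df_ratings_toy, top_n=2)
print(result)  # ['C', 'A']
```

In the toy data, movie B has the highest predicted score for user 1 but is skipped because user 1 already rated it.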