# Movie Recommendations Using Matrix Factorization

Recommendation algorithms are being implemented everywhere nowadays. So many companies require the use of a good recomendation system to have their technology and product run effectively. 

Good recommendations create more positive experiences for the user and thus more time on the product. That was the basis for Netflix to hold their Netflix Prize, a competition to improve their collaborative filtering algorithm that predicts a users ratings for movies based on only a dataset users and their ratings of movies. 

I attempt to tackle this problem by using Matrix Factorization. This method was implemented in the winning team's solution. Matrix Factorization essentially is stemmed from Model-Based Collaborative Filtering. It works as an unsupervised learning problem dealing with latent factors that exist in the problem. MF works to learn the latent factors of both users and movies and taking the dot product results in the prediction of the unknown rating.

The original user-movie matrix is very sparse as it contains mostly 0s of unknown ratings. MF turns this sparse matrix into low-rank structure by compressing the sparse information into a k-dimenional space, where k represents the number of latent factors. It creates a smaller U matrix ("row factor") and smaller V matrix ("column factor"). Multiplying these creates the approximation of the orginial, sparse matrix.

In [245]:
import pandas as pd
import numpy as np
import implicit
from scipy.sparse import csr_matrix
from scipy.sparse import coo_matrix
from sklearn.model_selection import train_test_split
import io

In [246]:
ratings = '/Users/dgrubis/Desktop/u.data'
headers = ['userID', 'movieID', 'rating', 'timestamp']
header_row = None
ratings_df = pd.read_csv(ratings,
                         sep='\t',
                         names=headers,
                         header=header_row,
                         dtype={
                           'userID': np.int32,
                           'movieID': np.int32,
                           'rating': np.float32,
                           'timestamp': np.int32,
                         })
ratings_df.head()

#For this project I'm just going with the ratings file of the 100k MovieLens dataset to predict from only previous ratings
#Further work for me to try out is working in the users.csv(includes age and gender for users) and movies.tsv(includes year and genre of movies) to see how the predictions change with more features

Unnamed: 0,userID,movieID,rating,timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596


In [247]:
n_users = np.unique(ratings_df.userID.as_matrix()).shape[0]
n_movies = np.unique(ratings_df.movieID.as_matrix()).shape[0]

In [248]:
n_users

943

In [249]:
n_movies

1682

Dataset contains ratings for 943 users and 1,682 movies. This will also be the dimension of the n_users x n_movies sparse matrix.

In [250]:
#We need to turn the pandas dataframe into a numpy array in order to create the sparse matrix
ratings = ratings_df.as_matrix(['userID', 'movieID', 'rating'])
ratings[:,0] -= 1
ratings[:,1] -= 1
ratings

array([[1.950e+02, 2.410e+02, 3.000e+00],
       [1.850e+02, 3.010e+02, 3.000e+00],
       [2.100e+01, 3.760e+02, 1.000e+00],
       ...,
       [2.750e+02, 1.089e+03, 1.000e+00],
       [1.200e+01, 2.240e+02, 2.000e+00],
       [1.100e+01, 2.020e+02, 3.000e+00]])

In [251]:
#split the ratings array into training and testing data
ratings_train, ratings_test = train_test_split(ratings, test_size = 0.10)

In [252]:
len(ratings_train)

90000

In [253]:
len(ratings_test)

10000

## Creating the Sparse Matrix

In [254]:
user_tr, movie_tr, rating_tr = zip(*ratings_train)
sparse_tr = csr_matrix((rating_tr, (user_tr, movie_tr)), shape = (n_users, n_movies))
user_ts, movie_ts, rating_ts = zip(*ratings_test)
sparse_ts = csr_matrix((rating_ts, (user_ts, movie_ts)), shape = (n_users, n_movies))

Zip unpacks the matrix of each element contained and then the sparse matrix is created by csr_matrix(). This creates a sparse matrix with confidence levels and uses the user id and movie id as the indexes to each rating.

The dimensions of the matrix is 943 x 1682.

## Using Alternating Least Squares to Fit Model

The algorithm I will be implementing to Matrix Factorization is Alternating Least Squares. ALS is the method for finding the row factor (which can be called U) and the column factor (which can be called V) described above. It works by randomly initializing U and solving for V. Then that value of V can then be used to solve U. This proccess then iterates back and forth until it converges to the best approximation of the orginal matrix.

This is a popular algorithm for dealing with implicit interactions. These consist of ratings that users don't give directly but can be assumed and determined from other methods and tracking data. Explicit ratings are ratings that users give directly. While the Netflix prize dealt with explicit ratings, implicit ratings are far more common in actual recommendation systems. Users don't always give ratings or other feedback so the data is much more scare than implicit feedback, which could be time spent on a video or number of clicks. This kind of data is much more available and thus easier to work into recommendation systems. 

I will make a key assumption about the MovieLens dataset and say that the ratings 1-5, represent an implicit feedback. Perhaps representing the time spent on a certain movie and then binned up into a discrete scale.

In [300]:
model = implicit.als.AlternatingLeastSquares(factors = 50,
                                            regularization = 0.1,
                                            iterations = 150,
                                            calculate_training_loss = True)
#A regularization term is used as a parameter to avoid overfitting

In [301]:
model.fit(sparse_tr)

100%|██████████| 150.0/150 [00:21<00:00,  7.00it/s, loss=0.114]


After tuning the hyperparamters the final model defines 50 latent factors to, a regularization term of 0.1 and 150 iterations to run until convergence.

## Evaluating the Model

In [302]:
movies_df = pd.read_table('/Users/dgrubis/Desktop/STATS 535/movies.tsv')
movies_df.head()

Unnamed: 0,movieID,name,year,genre1,genre2,genre3
0,1,Toy Story,1995,Animation,Children's,Comedy
1,2,Jumanji,1995,Adventure,Children's,Fantasy
2,3,Grumpier Old Men,1995,Comedy,Romance,
3,4,Waiting to Exhale,1995,Comedy,Drama,
4,5,Father of the Bride Part II,1995,Comedy,,


In [346]:
def recommendations(userID):
    
    user_data = ratings_df[ratings_df.userID == (userID)]
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'movieID', right_on = 'movieID').
                     sort_values(['rating'], ascending=False)
                 )
    print ('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    movie_rec = pd.DataFrame(model.recommend(userID, sparse_ts), columns = ['movieID', 'rating'])
    user_movie = movie_rec.merge(movies_df, how = 'outer', left_on = 'movieID', right_on = 'movieID').sort_values('rating', ascending = False)
    user_movie_2 = user_movie[['name', 'genre1', 'genre2', 'genre3']]
    return user_movie_2.head()

recommendations(670)

User 670 has already rated 46 movies.


Unnamed: 0,name,genre1,genre2,genre3
0,Toy Story,Animation,Children's,Comedy
1,Dracula: Dead and Loving It,Comedy,Horror,
2,Exotica,Drama,,
3,Spellbound,Mystery,Romance,Thriller
4,Heat,Action,Crime,Thriller


I wrote a function above that generates the top 5 recommendations for a certain user. 

I included the genre types of each movie to show that the latent factors are present. They could potentially be represented by these genres as a feature but the recommendation system quantifies this without the direct inclusion of these features. Pretty neat!

In [307]:
user_vecs = model.user_factors
item_vecs = model.item_factors

In [308]:
def predict(user_vecs, item_vecs):
    preds = []
    i = 0
    for i in range(n_users):
            pred = np.dot(user_vecs[i], item_vecs[i])
            preds.append(pred)
            i += 1
    return sum(preds) / len(preds)
predict(user_vecs, item_vecs)

#the mean prediction for all users by taking the dot product of each user with each movie

0.14052335815286918