PDA 2019 Kaggle Data Competition
For the data competition, the task was to build a recommender system using information from a movie rating database. The training data consisted of users' ratings of different movies and information about the movies (genre, visual features, release year, tags and title). The objective of the system was to recommend to each user in the test set their top 10 favorite movies.
The algorithm I decided to use for this assignment was SVD (Singular Value Decomposition), a matrix factorization technique. Matrix factorization breaks one matrix down into the product of several matrices. Although there are many ways to do matrix factorization, SVD is quite powerful when used for recommendations. It decomposes a matrix into two unitary matrices and a diagonal matrix.

                                               R=UΣV^T
                                               
Where R is our user ratings matrix, U is the user features matrix, Σ is the diagonal matrix of singular values and V^T is the movie features matrix. U and V^T are orthogonal: U represents how much users ‘like’ each feature and V^T represents how relevant each feature is to a movie. To get the lower-rank approximation, I keep only the top k features of these matrices, which correspond to the k largest singular values and therefore to the k most relevant taste and preference vectors.
My objective with this model is to return the movies with the highest predicted rating that a specific user has not yet seen.
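
To make the rank-k idea concrete, here is a minimal sketch on a made-up 4 x 5 matrix. The matrix R_toy and the helper rank_k_approximation are illustrative assumptions, not part of the competition code, and the sketch uses np.linalg.svd (full decomposition, then truncation) rather than the svds call used later.

```python
import numpy as np

# Hypothetical 4-user x 5-movie ratings matrix, for illustration only.
R_toy = np.array([
    [5.0, 4.0, 1.0, 1.0, 2.0],
    [4.0, 5.0, 1.0, 2.0, 1.0],
    [1.0, 1.0, 5.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 5.0, 4.0],
])

# Full SVD: np.linalg.svd returns the singular values as a 1-D array,
# sorted in descending order.
U_full, s_full, Vt_full = np.linalg.svd(R_toy, full_matrices=False)


def rank_k_approximation(U, s, Vt, k):
    """Rebuild the matrix from the k largest singular values only."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


R_rank2 = rank_k_approximation(U_full, s_full, Vt_full, k=2)
R_rank4 = rank_k_approximation(U_full, s_full, Vt_full, k=4)

# Frobenius-norm reconstruction error: shrinks as k grows and is
# numerically zero once all singular values are kept.
err_rank2 = np.linalg.norm(R_toy - R_rank2)
err_rank4 = np.linalg.norm(R_toy - R_rank4)
print(err_rank2, err_rank4)
```

The error of the rank-2 version equals the square root of the sum of the squared singular values that were dropped, which is why keeping the largest ones loses the least information.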


In [13]:
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds

# Load the training ratings, the movie content features and the test users.
rating_df = pd.read_csv('../data/train-PDA2019.csv', sep=',')
movies_df = pd.read_csv('../data/content-PDA2019.csv', sep=',')
test = pd.read_csv('../data/test-PDA2019.csv')
rating_df.head()



I pivot the original ratings dataset so that every userID is on one row and every itemID (movie) is on one column.

In [None]:
r_df = rating_df.pivot(index = 'userID', columns ='itemID', values = 'rating')


Then I compute the average rating for each user and normalize the data by subtracting each user's average from their ratings, which gives R_demeaned. Missing ratings in the pivoted dataset are then filled with 0. Because the data is already centered, this is the same as setting every unknown rating to the user's mean (0 after subtraction).

In [None]:
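# Per-user mean over the movies each user actually rated (NaN values are skipped).
users_mean = r_df.mean(axis=1).to_numpy()
R_demeaned = r_df.sub(r_df.mean(axis=1), axis=0)
# Unknown ratings become 0, i.e. the user's mean after centering.
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0.
R_demeaned = R_demeaned.fillna(0).to_numpy()
R_demeaned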

In [None]:
R_demeaned.shape
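
As a quick sanity check of the pivot and demeaning steps, here is a minimal sketch on a made-up ratings table. The toy_* names and values are hypothetical; only the column names userID, itemID and rating come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings with the same column names as the training file.
toy_ratings = pd.DataFrame({
    "userID": [1, 1, 2, 2, 2],
    "itemID": [10, 20, 10, 20, 30],
    "rating": [4.0, 2.0, 5.0, 3.0, 1.0],
})

# One row per user, one column per movie; unrated movies become NaN.
toy_r = toy_ratings.pivot(index="userID", columns="itemID", values="rating")

# Mean over rated movies only (NaN is skipped), then center and fill with 0.
toy_means = toy_r.mean(axis=1)
toy_demeaned = toy_r.sub(toy_means, axis=0).fillna(0).to_numpy()

print(toy_demeaned)
# [[ 1. -1.  0.]
#  [ 2.  0. -2.]]
```

User 1 never rated item 30, so that cell ends up at 0, which corresponds to their mean rating of 3 once the mean is added back.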

Now that my matrix is properly formatted and normalized, I can go on with the singular value decomposition. I used the svds function from scipy because it lets me choose how many latent factors I use to approximate the original ratings matrix (instead of having to truncate afterwards). The call returns three arrays, U, sigma and Vt, which are used below to calculate the reconstructed matrix.
For movies, predictions from lower-rank matrices with values of k between roughly 20 and 100 are commonly reported to generalize best to unseen data.
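
One way to check a choice of k, rather than relying on a rule of thumb, is to hide some known ratings and compare the error for a few values. The sketch below does this on synthetic data; the data, the 20% hold-out split, the candidate values of k and the helper holdout_rmse are all assumptions made for illustration, not the procedure used in this notebook.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)

# Synthetic "ratings": 60 users x 40 movies generated from 3 latent factors plus noise.
true_users = rng.normal(size=(60, 3))
true_items = rng.normal(size=(3, 40))
full = true_users @ true_items + 0.1 * rng.normal(size=(60, 40))

# Pretend only about half of the ratings are observed,
# and hold out about 20% of those for validation.
observed = rng.random(full.shape) < 0.5
holdout = observed & (rng.random(full.shape) < 0.2)
train_mask = observed & ~holdout


def holdout_rmse(k):
    """Fit a rank-k SVD on the training entries and score the held-out ones."""
    counts = train_mask.sum(axis=1, keepdims=True)
    sums = np.where(train_mask, full, 0.0).sum(axis=1, keepdims=True)
    means = sums / np.maximum(counts, 1)          # per-user mean of training ratings
    demeaned = np.where(train_mask, full - means, 0.0)
    U, s, Vt = svds(demeaned, k=k)
    pred = U @ np.diag(s) @ Vt + means
    err = (pred - full)[holdout]
    return float(np.sqrt(np.mean(err ** 2)))


rmse_by_k = {k: holdout_rmse(k) for k in (2, 5, 10, 20)}
best_k = min(rmse_by_k, key=rmse_by_k.get)
print(rmse_by_k, best_k)
```

The same loop could be run on the real R_demeaned by masking a sample of the known ratings before the decomposition.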

In [None]:
U, sigma, Vt = svds(R_demeaned, k = 20,maxiter=20)

The svds function returns the singular values of Σ only as a 1-D array, so for matrix multiplication we need to convert it into a diagonal matrix.

In [None]:
sigma = np.diag(sigma)
sigma

Now we have everything we need to predict the ratings for all users and all movies in the original dataset. The predictions are computed with matrix products: U is multiplied by Σ, and the result is multiplied by Vt to obtain the reconstructed matrix. Adding each user's mean back to the reconstructed matrix gives the predicted rating for every user and every movie.

In [None]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + users_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = r_df.columns)
preds_df.index=r_df.index.values
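
The reconstruction step can be checked in isolation. The following is a minimal sketch on random data (toy_demeaned and toy_means are made up) showing that the nested np.dot call, the @ operator and plain column scaling all give the same matrix, and how the reshaped mean vector is broadcast across each row.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)

# Hypothetical centered matrix (6 users x 5 movies) and per-user means.
toy_demeaned = rng.normal(size=(6, 5))
toy_means = np.array([3.0, 3.5, 2.0, 4.0, 2.5, 3.0])

# k must be smaller than the smaller matrix dimension (here 5).
U_t, s_t, Vt_t = svds(toy_demeaned, k=2)

# Three equivalent ways to form U * Sigma * V^T.
via_dot = np.dot(np.dot(U_t, np.diag(s_t)), Vt_t)
via_matmul = U_t @ np.diag(s_t) @ Vt_t
via_scaling = (U_t * s_t) @ Vt_t   # scales each column of U, no diagonal matrix needed

# reshape(-1, 1) turns the means into a column so that each user's mean
# is added to every movie in that user's row.
toy_predictions = via_matmul + toy_means.reshape(-1, 1)
print(toy_predictions.shape)
```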

In [None]:
all_user_predicted_ratings
preds_df

In the preds_df dataframe, each column refers to a movie (itemID) and each row represents one user. Next, I sort the ratings in descending order to get the top 10 predicted ratings for each user. I also use the userID column from the test dataset as the index, so that predictions are extracted only for the users in the test set.

In [None]:
users = test.loc[:,'userID']
users
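
Looking up a test user in preds_df only works if that user also appears in the training ratings. A minimal sketch of such a check is shown below; the toy_* objects are hypothetical, and whether any test user is actually missing from the training data is not known from this notebook.

```python
import pandas as pd

# Hypothetical predictions for users 1 and 2, and a test set that also contains user 3.
toy_preds = pd.DataFrame([[4.0, 3.0], [2.0, 5.0]], index=[1, 2], columns=[10, 20])
toy_test_users = pd.Series([1, 2, 3], name="userID")

# Split the test users into those with a prediction row and those without one.
has_row = toy_test_users.isin(toy_preds.index)
known_users = toy_test_users[has_row]
missing_users = toy_test_users[~has_row]

print(known_users.tolist(), missing_users.tolist())
```

Users without a row would need a fallback, for example the most popular movies, before a .loc lookup is attempted.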

Finally, I build the recommendations dataset with the following for loop. For each user in the index above, I sort the predicted ratings in descending order and keep the top 10 movie IDs. I then convert these IDs into a single string and store them in final_df, with one row per user holding the userID and the top 10 recommended unseen movies.

In [None]:
# Build one row per test user: the userID plus a space-separated string of item IDs.
rows = []
for user in users:
    # Predicted ratings for this user, keeping only movies they have not rated yet.
    rating = preds_df.loc[user, :]
    rating = rating[r_df.loc[user, :].isna()]
    top10 = rating.sort_values(ascending=False)[0:10]
    indices_str = ' '.join(str(i) for i in top10.index)
    rows.append({'userID': user, 'recommended_itemIDs': indices_str})
final_df = pd.DataFrame(rows, columns=['userID', 'recommended_itemIDs'])
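
The "unseen movies" requirement can be made explicit with a small helper. This is a minimal sketch on made-up data; the function name top_n_unseen and the toy series are assumptions for illustration, with pred_row standing in for one row of preds_df and rated_row for the same user's row of r_df.

```python
import numpy as np
import pandas as pd


def top_n_unseen(pred_row, rated_row, n=10):
    """Return the n item IDs with the highest predicted rating among unrated items."""
    candidates = pred_row[rated_row.isna()]   # drop movies the user already rated
    return candidates.nlargest(n).index.tolist()


items = [10, 20, 30, 40]
toy_pred = pd.Series([4.8, 3.1, 4.5, 2.0], index=items)          # predicted ratings
toy_rated = pd.Series([5.0, np.nan, np.nan, np.nan], index=items)  # user rated item 10 only

recs = top_n_unseen(toy_pred, toy_rated, n=2)
recs_str = " ".join(str(i) for i in recs)
print(recs_str)
# 30 20
```

Item 10 has the highest prediction but is skipped because the user has already rated it.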

In the end, I export final_df to a csv file so that I can upload the result to the Kaggle competition page.

In [None]:
final_df.to_csv(path_or_buf = 'recommendations.csv', 
                  index = False,
                  header = True, sep = ',')