PDA 2019 Kaggle Data Competition
For the Data competition the task was to build a recommender system using information from a movie rating database. The training data was composed of user's ratings of different movies and information on the movies (genre, visual features, release year, tags and title). The objective of the system was to recommend to each user in the test set their top 10 favorite movies. 
The algorithm I decided to use for this assignment was SVD. SVD stands for Singular Value Decomposition which works using matrix factorization. Matrix factorization breaks down one matrix into the product of multiple matrices. Although there are many ways to do matrix factorization, SVD is quite powerful when used for recommendations. It decomposes a matrix into two unitary matrices and a diagonal matrix.

                                               R=UΣV^T
                                               
Where R is our user ratings matrix, U is the user features matrix, Σ is the diagonal matrix of singular values and V^T is the movie features matrix. U and V^T are orthogonal, U represents how much users ‘like’ each feature and V^T represents how relevant each feature is to a movie. To get the lower rank approximation, I use the matrices and keep only the top k features, which are the k most relevant taste and preference vectors.


In [1]:
import pandas as pd
import numpy as np
from surprise import Dataset,Reader,SVD
from scipy.sparse.linalg import svds
rating_df = pd.read_csv('../data/train-PDA2019.csv',sep=',')
rating_df.head()
rating_df.head()
movies_df = pd.read_csv('../data/content-PDA2019.csv',sep=',')
rating_df.head()
test = pd.read_csv('../data/test-PDA2019.csv')


ModuleNotFoundError: No module named 'surprise'

In [2]:

reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(rating_df[['userID','itemID','rating']],reader)

NameError: name 'Reader' is not defined

I pivot the original ratings dataset as to have every usedID on 1 row and every movieID on 1 column. Then we can compute the average rating for each user, then normalize the data by using the average of each user in R_demeaned.

In [17]:
r_df = rating_df.pivot(index = 'userID', columns ='itemID', values = 'rating')


Then we can compute the average rating for each user, then normalize the data by using the average of each user in R_demeaned. We used the pivoted dataset to replace the values for each movies with the average for each user. Doing this we normalize and set the unknown rates with the user mean(0 after substraction)

In [6]:
users_mean=np.array(r_df.mean(axis=1))
R_demeaned=r_df.sub(r_df.mean(axis=1), axis=0)
R_demeaned=R_demeaned.fillna(0).as_matrix()
R_demeaned

NameError: name 'r_df' is not defined

In [21]:
R_demeaned.shape

(5690, 1824)

Now my matrix is properly formatted and normalized, I can go on with the singular value decomposition. I used the svds function from scipy because it lets me choose how many latent factors I use to approimate the original ratings matrix (instead of having to truncate afterwards)

In [22]:
U, sigma, Vt = svds(R_demeaned, k = 20,maxiter=20)

The svds function just prints out the values of the diagonal matrix we defined above as Σ, but for matrix multiplication we need to convert it into a diagonal matrix.

In [None]:
sigma = np.diag(sigma)
sigma

Now we have everything we need to predict the ratings for all users and all movies in the original dataset. The way the predictions are computed is by using dot product(matrix) and multiplying the U,Σ and then use that dot product to multiplicate again with  the Vt matrixe to get predicted user ratings. Then I add back the mean of the users to get the predicted ratings.

In [12]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + users_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = r_df.columns)
preds_df.index=r_df.index.values

In [13]:
all_user_predicted_ratings
preds_df

itemID,89,93,94,95,97,98,100,101,102,104,...,3929,3930,3931,3932,3937,3938,3945,3946,3950,3952
1,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,...,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000
3,4.068796,4.061406,4.056998,4.071683,4.065849,4.080423,4.096717,4.089263,4.051843,4.097777,...,4.082270,4.114935,4.064369,4.064677,4.037437,4.067180,4.082781,4.076501,4.087368,4.075616
5,4.353793,4.314717,4.284275,4.356992,4.317164,4.302795,4.257032,4.293577,4.326788,4.172904,...,4.293126,4.256334,4.303050,4.294403,4.354948,4.273185,4.277415,4.354635,4.334967,4.355816
7,4.112597,4.143994,4.149516,4.061567,4.146658,4.144155,4.149181,4.116078,4.182591,4.159403,...,4.148637,4.132310,4.137964,4.157293,4.191658,4.151345,4.162280,4.169970,4.114594,4.127912
9,3.301906,3.388734,3.338637,3.260410,3.335613,3.339122,3.371196,3.429613,3.327346,3.326691,...,3.329216,3.188028,3.277524,3.311117,3.355953,3.343276,3.434101,3.324074,3.225159,3.155942
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12069,2.526723,2.527010,2.656473,2.536531,2.660777,2.676989,2.541800,2.829495,2.439088,3.045958,...,2.612726,2.522441,2.672929,2.627734,2.693609,2.669347,2.655035,2.898758,2.706967,2.866595
12071,3.407234,3.238238,3.126906,3.233675,3.275500,3.372047,3.369980,3.169492,3.460149,3.038088,...,3.426444,3.453158,3.242676,3.402915,3.217141,3.333078,3.724724,3.167315,3.405940,3.251843
12073,3.546506,3.561170,3.507800,3.525806,3.505455,3.527509,3.536614,3.527397,3.481934,3.496873,...,3.500170,3.483428,3.535617,3.498467,3.528067,3.546146,3.470147,3.548009,3.524073,3.553793
12077,4.000676,4.039090,4.025830,3.958430,4.006532,4.018978,3.971515,3.988499,4.029222,3.963382,...,4.029392,3.913115,3.999561,4.020765,4.028088,3.994795,4.000643,4.019498,4.007097,4.053066


In the preds_df dataframe, each column of itemID refers to a movie and each row represents on user. What I did next is sort the ratings in descdendig order to get the top 10 predited ratings for each user. Also, I used the userID column from thetest dataset as the index to extract the predictions only for the users in the test set

In [14]:
users = test.loc[:,'userID']
users

0           1
1           3
2          11
3          29
4          31
        ...  
1987    12047
1988    12051
1989    12061
1990    12063
1991    12073
Name: userID, Length: 1992, dtype: int64

In [15]:
col_names = preds_df.columns
ind = 0
for j in range(len(users)):
    user = users[j]
    user_pred = []
    rating = preds_df.loc[user,:]
    rating = rating.sort_values(ascending=False)
    top10 = rating[0:10]
    indices = top10.index[:].tolist()
    indices_str = str(indices).strip('[]').replace(",", " ")
    user_pred = [indices]
    final_df.loc[ind] = pd.Series({'userID':user,'recommended_itemIDs':indices_str})
    ind += 1

NameError: name 'final_df' is not defined

In [16]:
final_df




NameError: name 'final_df' is not defined

In [None]:
final_df.to_csv(path_or_buf = 'recommendations.csv', 
                  index = False,
                  header = True, sep = ',')