PDA 2019 Kaggle Data Competition
For the data competition, the task was to build a recommender system using information from a movie rating database. The training data consisted of users' ratings of different movies and information on the movies (genre, visual features, release year, tags and title). The objective of the system was to recommend to each user in the test set their top 10 favorite movies.
The algorithm I decided to use for this assignment was SVD (Singular Value Decomposition), a form of matrix factorization. Matrix factorization breaks one matrix down into the product of several matrices. Although there are many ways to do matrix factorization, SVD is quite powerful when used for recommendations. It decomposes a matrix into two orthogonal (unitary) matrices and a diagonal matrix.

                                               R=UΣV^T
                                               
Here R is our user ratings matrix, U is the user features matrix, Σ is the diagonal matrix of singular values and V^T is the movie features matrix. U and V have orthonormal columns; U represents how much users ‘like’ each feature and V^T represents how relevant each feature is to a movie. To get the lower-rank approximation, I keep only the top k features of these matrices, which correspond to the k most relevant taste and preference vectors.
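As a small illustration of the rank-k truncation described above, the following sketch uses NumPy's dense `np.linalg.svd` on a made-up 4×3 ratings matrix. The matrix values and the `_toy` names are invented for the example, and this is not the routine used later in the notebook. It keeps the top k singular values and compares the approximation error with the size of the singular values that were dropped.

```python
import numpy as np

# Toy ratings matrix (4 users x 3 movies); values are invented for illustration.
R_toy = np.array([[5.0, 4.0, 1.0],
                  [4.0, 5.0, 1.0],
                  [1.0, 1.0, 5.0],
                  [2.0, 1.0, 4.0]])

# Thin SVD: R_toy = U @ diag(s) @ Vt, singular values in descending order.
U_toy, s_toy, Vt_toy = np.linalg.svd(R_toy, full_matrices=False)

k = 2  # number of latent factors to keep
R_k = U_toy[:, :k] @ np.diag(s_toy[:k]) @ Vt_toy[:k, :]

# Frobenius error of the approximation vs. norm of the dropped singular values.
error = np.linalg.norm(R_toy - R_k, ord='fro')
dropped = np.sqrt(np.sum(s_toy[k:] ** 2))
print(error, dropped)
```

The two printed numbers should agree up to rounding, which is why dropping the smallest singular values loses the least information for a given k.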
My objective with this model is to return the movies with the highest predicted rating that a specific user has not yet seen. I chose the svds function from scipy for three reasons: it lets me choose the number of latent factors, there is no need to truncate after the calculation, and when I used the SVD algorithm from surprise instead, the competition result I got back was half that of the svds model.

### Loading the data
I began by loading the packages I used and the datasets provided in the competition.

In [108]:
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds
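# Ratings given by users to movies (training data)
rating_df = pd.read_csv('../data/train-PDA2019.csv',sep=',')
rating_df.head()
# Movie content information (genre, visual features, release year, tags, title)
movies_df = pd.read_csv('../data/content-PDA2019.csv',sep=',')
# Users for whom recommendations have to be produced
test = pd.read_csv('../data/test-PDA2019.csv')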


### Transforming the original dataset
I pivot the original ratings dataset so that every userID is on one row and every itemID (movie) is on one column. As the output below shows, most entries are missing (NaN), because each user has rated only some of the movies.

In [109]:
r_df = rating_df.pivot(index = 'userID', columns ='itemID', values = 'rating')
r_df

userID / itemID,89,93,94,95,97,98,100,101,102,104,...,3929,3930,3931,3932,3937,3938,3945,3946,3950,3952
1,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,4.0,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12069,,,,,,,,,,,...,,,,,,,,,,
12071,,,,,,,,,,,...,,,,,,,,,,
12073,,,,,,,,,,,...,,,,,,,,,,
12077,,,,,,,,,,,...,,,,,,,,,,
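To make the pivot step concrete, here is a minimal sketch on a hypothetical long-format table with the same column names (`userID`, `itemID`, `rating`). The four ratings are invented, and `toy`, `toy_wide` and `sparsity` are names introduced only for this example.

```python
import pandas as pd

# Hypothetical long-format ratings, mirroring the userID/itemID/rating columns.
toy = pd.DataFrame({
    'userID': [1, 1, 3, 5],
    'itemID': [89, 93, 89, 104],
    'rating': [4.0, 3.0, 5.0, 4.0],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_wide = toy.pivot(index='userID', columns='itemID', values='rating')
print(toy_wide)

# Fraction of user/movie pairs that have no rating.
sparsity = toy_wide.isna().to_numpy().mean()
print(sparsity)
```

Note that `pivot` raises an error if the same (userID, itemID) pair appears more than once, so this approach assumes each user rated each movie at most once.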


### Calculating the mean for each user and demeaning the original data
Next I compute the average rating for each user and normalize the data by subtracting that average from each of the user's ratings, which gives R_demeaned. Unknown ratings are then filled with 0, which after subtraction corresponds to the user's mean. The matrix has to be demeaned and filled in this way before the three matrices can be calculated, because the decomposition needs a complete numeric matrix.

In [110]:
users_mean=np.array(r_df.mean(axis=1))
users_mean

array([4.        , 4.07692308, 4.2972973 , ..., 3.52173913, 4.01190476,
       3.55744681])

In [111]:
R_demeaned=r_df.sub(r_df.mean(axis=1), axis=0)
# Fill unrated movies with 0 and convert to a NumPy array
R_demeaned=R_demeaned.fillna(0).to_numpy()
R_demeaned

  


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [112]:
R_demeaned.shape

(5690, 1824)
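The demeaning step can be checked by hand on a tiny example. The sketch below applies the same `sub` / `fillna(0)` pattern to an invented 2×3 pivoted table; the `toy_` names and the values are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Toy pivoted ratings (NaN = not rated); values are invented for illustration.
toy_wide = pd.DataFrame(
    [[4.0, np.nan, 2.0],
     [np.nan, 5.0, 5.0]],
    index=[1, 3], columns=[89, 93, 94])

# Per-user mean over rated movies only (NaN is skipped by default).
toy_mean = toy_wide.mean(axis=1)

# Subtract each user's mean row-wise, then treat unrated movies as 0,
# i.e. "exactly average" for that user.
toy_demeaned = toy_wide.sub(toy_mean, axis=0).fillna(0).to_numpy()
print(toy_demeaned)
```

The second toy user gave every rated movie the same score, so that row becomes all zeros. The same effect appears to apply to userID 1 in this notebook, whose mean is 4.0 and whose predicted ratings all equal 4.0.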

### Using SVDS to calculate the U,Sigma and V^T matrices
Now that my matrix is properly formatted and normalized, I can go on with the singular value decomposition. I used the svds function from scipy because it lets me choose how many latent factors to use to approximate the original ratings matrix (instead of having to truncate afterwards). The call below returns the three matrices, which are then used to calculate the reconstructed matrix.
For movie ratings, low-rank approximations with k between roughly 20 and 100 are often reported to generalize best to unseen data; I used k = 20.

In [113]:
U, sigma, Vt = svds(R_demeaned, k = 20,maxiter=20)

The svds function returns only the singular values of the diagonal matrix we defined above as Σ, as a one-dimensional array, so for matrix multiplication we need to convert them into a diagonal matrix.

In [114]:
sigma = np.diag(sigma)
sigma

array([[ 42.49678016,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ],
       [  0.        ,  42.83402439,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,  43.6471488 ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,  
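In the output above the singular values appear in ascending order (42.50, 42.83, 43.65, ...), whereas `np.linalg.svd` returns them in descending order, and SciPy's documentation does not guarantee a particular order for `svds`. The sketch below, on a small random matrix with hypothetical `_toy` names, shows one way to reorder the factors so that the largest singular value comes first. The reordering is optional: it does not change the reconstructed matrix.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A_toy = rng.standard_normal((8, 6))  # small random matrix, for illustration only

U_toy, s_toy, Vt_toy = svds(A_toy, k=3)

# Reorder all three factors consistently, largest singular value first.
order = np.argsort(s_toy)[::-1]
U_toy, s_toy, Vt_toy = U_toy[:, order], s_toy[order], Vt_toy[order, :]

# Compare with the three largest singular values from the dense SVD.
s_full = np.linalg.svd(A_toy, compute_uv=False)
print(s_toy)
print(s_full[:3])
```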

In [115]:
U

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.00177074, -0.0063769 ,  0.00504841, ...,  0.00243039,
         0.00276918, -0.00072927],
       [ 0.01433633,  0.00286957,  0.00753249, ...,  0.00398687,
         0.00119028, -0.0012506 ],
       ...,
       [ 0.00294853,  0.00196687,  0.00243141, ...,  0.00016709,
        -0.00017324, -0.00064984],
       [-0.00048142, -0.00102176,  0.0010381 , ...,  0.00682453,
        -0.00228173, -0.00321078],
       [ 0.01036243,  0.01327087,  0.00269019, ...,  0.03296971,
        -0.04673466, -0.02218824]])

In [116]:
Vt

array([[-0.00679028,  0.00572232,  0.00465115, ..., -0.00114808,
        -0.00649961, -0.01167487],
       [ 0.0037671 , -0.00372763, -0.00441896, ..., -0.00595917,
        -0.00256974, -0.00410129],
       [ 0.02936087, -0.0017641 , -0.01129981, ...,  0.00477656,
        -0.00148752, -0.0124447 ],
       ...,
       [ 0.0005008 ,  0.00463704, -0.00486389, ...,  0.00382646,
         0.00110186, -0.0050521 ],
       [-0.00052407, -0.0157004 , -0.00138182, ...,  0.0033236 ,
         0.00359132,  0.00344058],
       [ 0.01144526,  0.01543923, -0.00552631, ...,  0.01057347,
        -0.00032013, -0.00570157]])

Now we have everything we need to predict the ratings for all users and all movies in the original dataset. The predictions are computed with matrix (dot) products: first U is multiplied by Σ, and the result is then multiplied by Vt to get the reconstructed matrix. Adding the users' means to the reconstructed matrix gives the predicted ratings for each user and each movie.
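The two chained products can be verified on stand-in matrices. The sketch below uses random factors with invented shapes and hypothetical `_toy` names; it also shows an equivalent formulation that skips building the diagonal matrix by scaling the columns of U directly. This is offered as an alternative, not as what the notebook does.

```python
import numpy as np

rng = np.random.default_rng(1)
U_toy = rng.standard_normal((5, 2))    # stand-in for the user factors
s_toy = np.array([3.0, 1.5])           # stand-in for the singular values (1-D)
Vt_toy = rng.standard_normal((2, 4))   # stand-in for the movie factors

# Explicit version: build the diagonal matrix, then chain two dot products.
explicit = np.dot(np.dot(U_toy, np.diag(s_toy)), Vt_toy)

# Broadcasting version: scale each column of U by its singular value.
scaled = (U_toy * s_toy) @ Vt_toy

print(np.allclose(explicit, scaled))
```

Both versions give a matrix with one row per user and one column per movie, here 5×4.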

### Computing the user predictions
Below I obtain the predicted values for each user, which still need to be combined with the mean rating of each user. This is the rank-k reconstruction of the demeaned ratings matrix R.

In [117]:
predicted_ratings = np.dot(np.dot(U, sigma), Vt)
predicted_ratings

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.00836806,  0.00237305, -0.00237038, ...,  0.01207046,
         0.00315937,  0.005653  ],
       [-0.00089247,  0.01384197, -0.01075017, ...,  0.02438522,
         0.00067371, -0.00598711],
       ...,
       [ 0.00504862,  0.0103067 ,  0.00422247, ...,  0.00415781,
        -0.00209619,  0.007994  ],
       [ 0.00740605, -0.00269385, -0.0074761 , ..., -0.0001236 ,
        -0.00257709, -0.00475029],
       [ 0.04380527, -0.0238351 ,  0.01825917, ...,  0.00154547,
        -0.00993703,  0.00102514]])

Now I can add back the users' means.

In [118]:
all_user_predicted_ratings = predicted_ratings+users_mean.reshape(-1, 1)
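The `reshape(-1, 1)` in the cell above matters because of how NumPy broadcasting lines up shapes. A minimal sketch with invented numbers and hypothetical `_toy` names:

```python
import numpy as np

# Demeaned predictions for 2 users x 3 movies (invented values).
pred_toy = np.array([[0.1, -0.2, 0.0],
                     [0.0, 0.3, -0.1]])
mean_toy = np.array([4.0, 3.5])  # one mean per user, shape (2,)

# reshape(-1, 1) turns the means into a column of shape (2, 1), so each
# user's mean is added across that user's whole row.
restored = pred_toy + mean_toy.reshape(-1, 1)
print(restored)
```

Without the reshape, NumPy would try to align the means with the movie columns instead of the user rows, which fails here because shapes (2, 3) and (2,) are incompatible.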

Now that we have all the predicted ratings, I use the index and columns from the pivoted dataset to match them to the corresponding userID and itemID.

In [119]:
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = r_df.columns)
preds_df.index=r_df.index.values
preds_df

itemID,89,93,94,95,97,98,100,101,102,104,...,3929,3930,3931,3932,3937,3938,3945,3946,3950,3952
1,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,...,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000
3,4.085291,4.079296,4.074553,4.075073,4.077867,4.077451,4.085337,4.074300,4.067506,4.090479,...,4.080160,4.076845,4.074610,4.069150,4.083393,4.077336,4.071088,4.088994,4.080082,4.082576
5,4.296405,4.311139,4.286547,4.242423,4.298161,4.297616,4.299339,4.285723,4.293009,4.288283,...,4.295841,4.282518,4.307778,4.288229,4.305012,4.291263,4.305418,4.321683,4.297971,4.291310
7,4.137748,4.151516,4.146522,4.115969,4.142099,4.143494,4.131382,4.143822,4.148396,4.151630,...,4.141997,4.138665,4.142214,4.139667,4.144548,4.144551,4.148126,4.135352,4.141047,4.143131
9,3.295425,3.334860,3.361143,3.315955,3.320208,3.322225,3.306172,3.354904,3.366350,3.148644,...,3.333369,3.339932,3.334072,3.354983,3.287454,3.310640,3.301873,3.303933,3.323655,3.303599
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12069,2.593849,2.560817,2.822270,2.182376,2.637337,2.665910,2.675123,2.789620,2.382447,2.564328,...,2.627376,2.632516,2.648115,2.606990,2.665217,2.621461,2.521409,2.581606,2.692735,2.880911
12071,3.401296,3.305412,3.358908,2.974460,3.395930,3.322987,3.228641,3.395010,3.384801,2.940222,...,3.474766,3.474951,3.253646,3.493205,3.146316,3.338301,3.370784,3.128883,3.341565,3.463674
12073,3.526788,3.532046,3.525962,3.506134,3.523103,3.523525,3.520468,3.522763,3.515796,3.522240,...,3.519022,3.508769,3.523716,3.508988,3.523508,3.522024,3.517813,3.525897,3.519643,3.529733
12077,4.019311,4.009211,4.004429,4.076412,4.012606,4.011201,4.010056,4.004379,4.009873,4.024203,...,4.025315,4.006079,4.001331,4.020181,4.008827,4.010763,4.010388,4.011781,4.009328,4.007154


### Finding the top 10 recommendations for each user
In the preds_df dataframe, each itemID column refers to a movie and each row represents one user. What I did next was sort the ratings in descending order to get the top 10 predicted ratings for each user. I also used the userID column from the test dataset to extract the predictions only for the users in the test set.

In [120]:
users = test.loc[:,'userID']
users

0           1
1           3
2          11
3          29
4          31
        ...  
1987    12047
1988    12051
1989    12061
1990    12063
1991    12073
Name: userID, Length: 1992, dtype: int64

Finally, I build the recommendations dataset with the following for loop. I use the userID values above to determine the users I want to predict for, then sort each user's predicted ratings in descending order and take the first 10 to get the top 10 recommended movies. I then transform these itemIDs into a single string and store it in a Series whose keys are userID and recommended_itemIDs, which is written back into the test dataframe. Note that the loop ranks every movie by predicted rating and does not itself exclude movies the user has already rated.

In [121]:
for j in range(len(users)):
    user = users[j]
    # Predicted ratings for this user, sorted from highest to lowest
    rating = preds_df.loc[user,:]
    rating = rating.sort_values(ascending=False)
    # The 10 highest predicted ratings (selected by position)
    top10 = rating.iloc[0:10]
    indices = top10.index[:].tolist()
    # Turn the list of itemIDs into one space-separated string
    indices_str = str(indices).strip('[]').replace(",", " ")
    # Overwrite row j of test with the user and their recommendations
    test.loc[j] = pd.Series({'userID':user,'recommended_itemIDs':indices_str})
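If the goal is to recommend only movies a user has not yet rated, one possible approach is to mask out the rated movies before ranking. The sketch below is an assumption about how that could be done, not part of the submitted solution: `r_toy` and `preds_toy` are invented stand-ins for r_df and preds_df, and `top_n_unseen` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for r_df (observed ratings) and preds_df (predicted ratings).
r_toy = pd.DataFrame([[5.0, np.nan, np.nan, 3.0],
                      [np.nan, 4.0, np.nan, np.nan]],
                     index=[1, 3], columns=[89, 93, 94, 95])
preds_toy = pd.DataFrame([[4.9, 4.2, 4.5, 3.1],
                          [3.0, 4.1, 3.8, 3.9]],
                         index=[1, 3], columns=[89, 93, 94, 95])

def top_n_unseen(user, n=2):
    """Return the n highest-predicted itemIDs the user has not rated."""
    unseen = r_toy.loc[user].isna()  # True where no rating exists
    ranked = preds_toy.loc[user, unseen].sort_values(ascending=False)
    return ranked.index[:n].tolist()

print(top_n_unseen(1))
print(top_n_unseen(3))
```

For the first toy user, movie 89 has the highest prediction but is skipped because it has already been rated.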

Here is the final output, with the top 10 movies for each user given by their itemIDs.

In [122]:
test.head(10)

Unnamed: 0,userID,recommended_itemIDs
0,1,3952 1401 1405 1406 1407 1408 1409 1412...
1,3,260 1196 1097 2628 1198 1270 3471 1197 ...
2,11,1097 2858 1721 1270 2657 587 924 3471 ...
3,29,318 2762 593 2028 3147 589 2858 457 23...
4,31,2683 1517 2700 2706 3253 223 919 1721 ...
5,33,260 1196 1198 924 1240 541 1197 919 10...
6,35,2028 1198 1196 1197 260 1617 1291 858 ...
7,51,2858 2628 1213 3418 1544 590 2706 2028 ...
8,53,2858 260 1196 2997 2628 1198 858 912 9...
9,55,260 1196 1198 1197 1291 919 912 1148 1...


In the end I export the final dataframe to a csv file so that I can upload the result to the Kaggle competition page.

In [123]:
test.to_csv(path_or_buf = 'recommendations.csv', 
                  index = False,
                  header = True, sep = ',')