Here, we use SVD to make our predictions. In SVD, we factor Y = U x S x V.T, where Y is the m x n matrix of ratings (m users, n movies) and k is the number of latent features we keep.

U is an m x k matrix whose rows are the feature vectors, in some unknown space, corresponding to the users.

V.T is a k x n matrix whose columns are the feature vectors, in some unknown space, corresponding to the movies.

S is a k x k diagonal matrix that acts as a sort of "scaling" factor for each of the unknown feature dimensions.
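
To make the shapes above concrete, here is a minimal sketch on a made-up ratings matrix. The names `Y_toy`, `U_toy`, `S_toy`, and `Vt_toy` and the toy ratings are hypothetical and are not part of this project's data. Note that `svds` returns the singular values as a 1D array rather than as a diagonal matrix, and returns the third factor already transposed.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical toy ratings matrix: m = 4 users, n = 5 movies (0 = unrated).
Y_toy = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)

k = 2  # number of latent features; must be smaller than min(m, n)
U_toy, S_toy, Vt_toy = svds(Y_toy, k=k)

print(U_toy.shape)   # (4, 2): one row of k features per user
print(S_toy.shape)   # (2,):   k singular values as a 1D array
print(Vt_toy.shape)  # (2, 5): one column of k features per movie
```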

This implementation is heavily based on https://cambridgespark.com/content/tutorials/implementing-your-own-recommender-systems-in-Python/index.html, which was provided in project2.pptx. Much of the code in this
implementation came from that site, with a few minor modifications to suit this project.

In [1]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds
import pickle
import numpy as np

In [2]:
# Import the data: each row is (user, movie, rating)
with open("data/y_train.p", "rb") as f:
    Y_train = pickle.load(f)
with open("data/y_test.p", "rb") as f:
    Y_test = pickle.load(f)

# IDs are 1-based, so the largest ID gives the number of users/movies
num_users = int(max(Y_train[:, 0].max(), Y_test[:, 0].max()))
num_movies = int(max(Y_train[:, 1].max(), Y_test[:, 1].max()))

In [3]:
# Create the training matrix of known points (IDs are 1-based; unrated entries stay 0)
training_matrix = np.zeros((num_users, num_movies))
for user, movie, Yij in Y_train:
    training_matrix[user - 1, movie - 1] = Yij

# Create the test matrix by the same method
test_matrix = np.zeros((num_users, num_movies))
for user, movie, Yij in Y_test:
    test_matrix[user - 1, movie - 1] = Yij
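
The loops above can also be written with NumPy integer-array indexing. This is only a sketch on hypothetical data: `triples`, `n_users`, `n_movies`, and `dense` are illustrative names, and it assumes each (user, movie) pair appears at most once.

```python
import numpy as np

# Hypothetical (user, movie, rating) rows with 1-based IDs.
triples = np.array([[1, 1, 5], [1, 3, 3], [2, 2, 4], [3, 1, 1]])
n_users, n_movies = 3, 3

# Convert the 1-based IDs to 0-based row and column indices.
rows = triples[:, 0].astype(int) - 1
cols = triples[:, 1].astype(int) - 1

# Fill all known ratings in one assignment; unrated entries stay 0.
dense = np.zeros((n_users, n_movies))
dense[rows, cols] = triples[:, 2]
print(dense)
```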

In [6]:
# Gets half the mean squared error between predictions and known ratings. Predictions is a
# dense matrix, while actual is a sparse matrix stored as an array of rows (i, j, Yij) with 1-based i, j
def get_err(predictions, actual):
    err = 0
    for user, movie, Yij in actual:
        err += 0.5 * ((Yij - predictions[user - 1][movie - 1]) ** 2)
    return err / len(actual)
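
For larger rating sets, the same quantity can be computed without a Python loop. The function below is a sketch, not the version used for the results in this notebook; `get_err_vectorized`, `preds`, and `ratings` are hypothetical names, and it assumes `actual` is a numeric array of (user, movie, rating) rows with 1-based IDs.

```python
import numpy as np

def get_err_vectorized(predictions, actual):
    """Half the mean squared error over the known (user, movie, rating) rows."""
    actual = np.asarray(actual)
    rows = actual[:, 0].astype(int) - 1  # 1-based user IDs -> row indices
    cols = actual[:, 1].astype(int) - 1  # 1-based movie IDs -> column indices
    residuals = actual[:, 2] - predictions[rows, cols]
    return 0.5 * np.mean(residuals ** 2)

# Tiny hypothetical check: residuals are 1 and 0, so the error is 0.5 * (1 + 0) / 2.
preds = np.array([[4.0, 2.0], [1.0, 3.0]])
ratings = np.array([[1, 1, 5], [2, 2, 3]])
print(get_err_vectorized(preds, ratings))  # 0.25
```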

In [7]:
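# Do the SVD. Note that svds returns the third factor already transposed,
# so V here is the k x n matrix V.T from Y = U x S x V.T
U, S, V = svds(training_matrix, k = 20)
s_diag = np.diag(S)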

# Get the predictions and find E_in and E_out
Y_pred = np.dot(np.dot(U, s_diag), V)
E_in = get_err(Y_pred, Y_train)
E_out = get_err(Y_pred, Y_test)

print("Mean squared in sample error was determined to be " + str(E_in))
print("Mean squared out of sample error was determined to be " + str(E_out))

Mean squared in sample error was determined to be 2.4833788027802512
Mean squared out of sample error was determined to be 3.1310127156459466
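
As a side note on the reconstruction step, scaling the columns of U by the singular values gives the same product as building the diagonal matrix explicitly. The sketch below checks this on a random matrix; `A`, `U_a`, `S_a`, `Vt_a`, `full`, and `scaled` are hypothetical names and are not taken from the cell above.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical random 6 x 8 matrix standing in for a ratings matrix.
rng = np.random.default_rng(0)
A = rng.random((6, 8))

U_a, S_a, Vt_a = svds(A, k=3)

# Reconstruction with an explicit k x k diagonal matrix.
full = np.dot(np.dot(U_a, np.diag(S_a)), Vt_a)

# Same product using broadcasting: each column of U_a is scaled by its singular value.
scaled = (U_a * S_a) @ Vt_a

print(np.allclose(full, scaled))  # True
```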


In [8]:
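# U is printed transposed, so each row below is one latent feature across all users
print("U is: ")
print(U.T)
# V from svds is already k x n, so V.T below has one row of k features per movie
print("V is: ")
print(V.T)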

U is: 
[[-0.0294572   0.02126707 -0.01676204 ...  0.04468681  0.01665304
  -0.07882847]
 [ 0.02872727 -0.01648492  0.01507433 ...  0.0022922   0.02876654
  -0.00834073]
 [ 0.09958267  0.00031876  0.00460181 ...  0.01187261  0.00460046
  -0.02471898]
 ...
 [ 0.00532886 -0.05162523 -0.02484103 ... -0.00641867 -0.02251532
   0.06118871]
 [-0.00646452  0.04947412  0.02738109 ...  0.02807787 -0.0081022
   0.00675117]
 [ 0.06858009  0.01456516  0.00619795 ...  0.00839252  0.02389851
   0.04082959]]
V is: 
[[-1.03083330e-02 -4.58842540e-02  1.50523556e-02 ...  1.47819672e-02
   9.26171052e-02  9.78700296e-02]
 [ 9.79796043e-03 -3.29048301e-02  2.81282526e-03 ...  6.43509404e-02
   3.70110538e-03  3.53071809e-02]
 [-5.17887985e-02  8.13820678e-03 -1.10328026e-02 ...  1.13746252e-02
   2.70581195e-02  1.92372756e-02]
 ...
 [-9.21480226e-04  1.33031796e-04 -1.09347307e-04 ... -5.94434363e-04
   5.10081390e-04  3.50995852e-05]
 [ 1.40381080e-03  1.54253189e-03  1.27155574e-03 ...  6.60396035e-04
