Here, we use SVD to make our predictions. In SVD, we set Y = U x S x V.T, where Y is the m x n matrix of ratings (m users, n movies).

U is an m x k matrix representing the feature vectors in some unknown space corresponding to the users.

V.T is a k x n matrix representing the feature vectors in some unknown space corresponding to the movies.

S is a k x k diagonal matrix that acts as a sort of "scaling" factor for each of the unknown feature dimensions.
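
To make the shapes above concrete, here is a minimal sketch on a toy matrix. The 6 x 5 size, the random ratings, and k = 3 are assumptions chosen purely for illustration; they are not the project's data. Note that `svds` returns the singular values as a 1D array and the third factor already transposed.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy ratings matrix: 6 users (rows) x 5 movies (columns). Values are made up.
rng = np.random.default_rng(0)
Y = rng.integers(1, 6, size=(6, 5)).astype(float)

k = 3  # number of latent features; must be smaller than min(Y.shape)
U, S, Vt = svds(Y, k=k)

print(U.shape)   # (6, 3): one k-dimensional feature vector per user
print(S.shape)   # (3,): singular values, returned as a 1D array
print(Vt.shape)  # (3, 5): one k-dimensional feature vector per movie

# Rank-k approximation of Y, built the same way as U x S x V.T
Y_approx = U @ np.diag(S) @ Vt
print(Y_approx.shape)  # (6, 5)
```

Because only k singular values are kept, `Y_approx` is an approximation of `Y` rather than an exact reconstruction.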

This implementation is heavily based on https://cambridgespark.com/content/tutorials/implementing-your-own-recommender-systems-in-Python/index.html, which was provided in project2.pptx. Much of the code in this
implementation came from that site, although a few minor modifications were made to suit this project.

In [1]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds
import pickle
import numpy as np

In [2]:
# Import the data
with open("data/y_train.p", "rb") as f:
    Y_train = pickle.load(f)
with open("data/y_test.p", "rb") as f:
    Y_test = pickle.load(f)

# IDs are 1-based, so the largest ID seen gives the matrix dimension
num_users = int(max(Y_train[:, 0].max(), Y_test[:, 0].max()))
num_movies = int(max(Y_train[:, 1].max(), Y_test[:, 1].max()))

In [3]:
# Create the training matrix of known points
training_matrix = np.zeros((num_users, num_movies))
for user, movie, Yij in Y_train:
    training_matrix[user - 1][movie - 1] = Yij

# Create the test matrix by the same method, this time from the test data
test_matrix = np.zeros((num_users, num_movies))
for user, movie, Yij in Y_test:
    test_matrix[user - 1][movie - 1] = Yij
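
The two loops above can also be written with NumPy index arrays. This is only a sketch: the helper name `to_dense` is hypothetical, and it assumes the ratings are an (N, 3) array of (user, movie, rating) rows with 1-based IDs, as in the loops above.

```python
import numpy as np

def to_dense(ratings, num_users, num_movies):
    """Build a dense user x movie matrix from (user, movie, rating) rows.

    Hypothetical helper; assumes 1-based user and movie IDs.
    Unrated entries are left as 0.
    """
    ratings = np.asarray(ratings)
    dense = np.zeros((num_users, num_movies))
    rows = ratings[:, 0].astype(int) - 1  # shift to 0-based row indices
    cols = ratings[:, 1].astype(int) - 1  # shift to 0-based column indices
    dense[rows, cols] = ratings[:, 2]
    return dense

# Made-up example: user 1 rated movie 2 as 5, user 3 rated movie 1 as 4
example = to_dense(np.array([[1, 2, 5], [3, 1, 4]]), num_users=3, num_movies=2)
print(example)
```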

In [4]:
# Gets half the mean squared error between predictions and actual ratings.
# predictions is a dense matrix, while actual is a sparse matrix represented
# by a sequence of rows in the form (i, j, Yij) with 1-based indices i and j
def get_err(predictions, actual):
    err = 0
    for user, movie, Yij in actual:
        err += 0.5 * ((Yij - predictions[user - 1][movie - 1]) ** 2)
    return err / len(actual)
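
For larger rating sets, the same quantity can be computed without a Python loop. The sketch below is an assumption-laden alternative, not the notebook's own function: `get_err_vectorized` is a hypothetical name, and it assumes `actual` is an (N, 3) array of (user, movie, rating) rows with 1-based IDs.

```python
import numpy as np

def get_err_vectorized(predictions, actual):
    """Half the mean squared error over the known ratings (hypothetical helper)."""
    actual = np.asarray(actual)
    rows = actual[:, 0].astype(int) - 1  # 1-based user IDs -> 0-based rows
    cols = actual[:, 1].astype(int) - 1  # 1-based movie IDs -> 0-based columns
    residuals = actual[:, 2] - predictions[rows, cols]
    return 0.5 * np.mean(residuals ** 2)

# Made-up check: residuals are 1 and 0, so the result is 0.5 * (1 + 0) / 2
preds = np.array([[1.0, 2.0], [3.0, 4.0]])
ratings = np.array([[1, 1, 2], [2, 2, 4]])
print(get_err_vectorized(preds, ratings))  # 0.25
```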

In [5]:
# Do the SVD. svds returns the third factor already transposed (k x n),
# so it is named Vt here
U, S, Vt = svds(training_matrix, k=20)
s_diag = np.diag(S)

# Get the predictions and find E_in and E_out
Y_pred = np.dot(np.dot(U, s_diag), Vt)
E_in = get_err(Y_pred, Y_train)
E_out = get_err(Y_pred, Y_test)

print("In sample error was determined to be " + str(E_in))
print("Out of sample error was determined to be " + str(E_out))

In sample error was determined to be 2.483378802780253
Out of sample error was determined to be 3.1310127156459444
