# The MovieLens Dataset

[MovieLens](https://movielens.org/) is a non-commercial web-based movie recommender system, created in 1997 by GroupLens, a research lab at the University of Minnesota, in order to gather movie rating data for research purposes.


## Getting the Data


The MovieLens dataset is hosted on the [GroupLens](https://grouplens.org/datasets/movielens/) website, where several versions are available. I will use the latest small version, [ml-latest-small.zip](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip).
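
As a rough sketch of this step, the archive can be downloaded and `ratings.csv` read straight out of it without unpacking to disk. The helper names `read_ratings_from_zip` and `download_ratings` are hypothetical, and the member path `ml-latest-small/ratings.csv` is an assumption about the archive layout.

```python
import io
import urllib.request
import zipfile

import numpy as np

URL = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"


def read_ratings_from_zip(zip_bytes, member="ml-latest-small/ratings.csv"):
    """Read the ratings table from an in-memory copy of the archive.

    The member path is an assumption about how the archive is laid out.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8")
            # Skip the header row; columns are userId, movieId, rating, timestamp
            return np.loadtxt(text, delimiter=",", skiprows=1, ndmin=2)


def download_ratings(url=URL):
    """Download the archive and return the ratings as a float array."""
    with urllib.request.urlopen(url) as response:
        return read_ratings_from_zip(response.read())


# Usage (requires network access):
# rating = download_ratings()
```

The cells below instead read a local `ratings.csv` with pandas, which gives the same array of values.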

## Custom Code

`soft_impute` and `functionsCF` are custom packages; they are imported below alongside the third-party `fancyimpute` library.

In [1]:
# Import necessary packages
import numpy as np
import pandas as pd
from fancyimpute import BiScaler
from soft_impute import SoftImpute
from functionsCF import GenerateTrainingSet

## Create the incomplete matrices for training and testing

In [2]:
# Read the MovieLens ratings from file (adjust the path to where the data is stored).
# The small dataset has 100836 rows and four columns: userId, movieId, rating, timestamp.
# The small dataset is used instead of the full one to keep the run time short.
# Keep only the values, as a NumPy array.
rating = pd.read_csv('ratings.csv', sep=',').values

In [3]:
# Here we only care about the ratings, so we keep only the first three columns, which contain user IDs, movie IDs, and ratings.
rating = rating[:,0:3]

In [4]:
# Show the first 5 rows
print(rating[:5, :])

[[ 1.  1.  4.]
 [ 1.  3.  4.]
 [ 1.  6.  4.]
 [ 1. 47.  5.]
 [ 1. 50.  5.]]


In [5]:

# First, we create an empty matrix
matrix_incomplete = np.zeros((len(np.unique(rating[:,0])), len(np.unique(rating[:,1]))))

# Second, since some movies don't have any ratings, we only use the movies that have ratings.
# We remap the movie IDs accordingly so that every column contains at least one rating.
# Create a sorted array of all movie IDs that appear in the ratings
usedID = np.unique(rating[:, 1])
# Replace each movie ID by its 1-based position in that array
# (usedID is sorted and contains every ID, so searchsorted returns the exact position)
rating[:, 1] = np.searchsorted(usedID, rating[:, 1]) + 1
    
# Finally, we construct the incomplete matrix, in which every entry without a rating is NaN.
# Start by setting all entries to NaN
matrix_incomplete[:] = np.nan
# create the index pair of the components with ratings
indices = np.array(rating[:,0] - 1).astype(int), np.array(rating[:,1] - 1).astype(int)
# change the values in the corresponding positions to the known rating information
matrix_incomplete[indices] = rating[:,2]
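
The same construction can be illustrated on a toy example. This is only a sketch: the `toy` triples are made up, and it uses `np.unique(..., return_inverse=True)` to remap both user and movie IDs to 0-based positions in one step, which is an alternative to the remapping used in the cell above.

```python
import numpy as np

# Made-up (userId, movieId, rating) triples; the movie IDs are deliberately non-contiguous
toy = np.array([
    [1, 10, 4.0],
    [1, 50, 5.0],
    [2, 10, 3.0],
    [3, 70, 2.0],
])

# return_inverse gives, for every triple, the position of its ID in the sorted unique array
user_ids, row_idx = np.unique(toy[:, 0], return_inverse=True)
movie_ids, col_idx = np.unique(toy[:, 1], return_inverse=True)

# Start from an all-NaN matrix and fill in the known ratings
toy_matrix = np.full((len(user_ids), len(movie_ids)), np.nan)
toy_matrix[row_idx, col_idx] = toy[:, 2]
print(toy_matrix)
```

Every column of `toy_matrix` has at least one rating, and every entry without a rating stays NaN.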

In [6]:
# Obtain the index pairs of the training set and the validation set, with 90% of the ratings used for training
train_indices, validation_indices = GenerateTrainingSet(rating[:,0], rating[:,1], 0.90)
# Then use the training index pairs to create the incomplete training matrix
matrix_train = matrix_incomplete.copy()
matrix_train[:] = np.nan
matrix_train[train_indices] = matrix_incomplete[train_indices]
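
`GenerateTrainingSet` comes from the custom `functionsCF` package, so its implementation is not shown here. The sketch below is a hypothetical stand-in, named `split_index_pairs`, that assumes the split is a simple random partition of the rated entries and that the returned values are `(rows, cols)` tuples usable for NumPy indexing, as in the cell above.

```python
import numpy as np


def split_index_pairs(user_ids, movie_ids, train_fraction, seed=0):
    """Hypothetical stand-in for GenerateTrainingSet: randomly split rated entries.

    Returns two (rows, cols) tuples of 0-based indices, one for training
    and one for validation.
    """
    rows = np.asarray(user_ids).astype(int) - 1  # 1-based IDs -> 0-based row indices
    cols = np.asarray(movie_ids).astype(int) - 1
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(rows))
    n_train = int(round(train_fraction * len(rows)))
    train, valid = order[:n_train], order[n_train:]
    return (rows[train], cols[train]), (rows[valid], cols[valid])


# Ten made-up (user, movie) pairs, split 90% / 10%
users = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
movies = np.array([1, 2, 1, 3, 2, 3, 1, 2, 3, 1], dtype=float)
train_pairs, valid_pairs = split_index_pairs(users, movies, 0.90)
print(len(train_pairs[0]), len(valid_pairs[0]))
```

Because the two index sets are disjoint, the validation ratings stay NaN in the training matrix and are never seen by the model.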

## Run the softImpute model for collaborative filtering

In [7]:
# Create the BiScaler model
biscaler = BiScaler(scale_rows=False, scale_columns=False, max_iters=50, verbose=False)
# Center both rows and columns to have zero mean (no rescaling, since scale_rows and scale_columns are False)
matrix_train_normalized = biscaler.fit_transform(matrix_train)
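
To make the centering step concrete, here is a simplified sketch of row and column centering on a matrix with missing entries. It is not fancyimpute's implementation: the function `center_rows_and_columns` is hypothetical, it simply alternates between subtracting row means and column means of the observed entries, and it assumes every row and column has at least one observed entry.

```python
import numpy as np


def center_rows_and_columns(X, n_iters=50):
    """Alternately subtract observed row and column means (illustrative only).

    Returns the centered matrix plus the accumulated row and column offsets,
    so that X == centered + row_offsets[:, None] + col_offsets[None, :]
    on the observed entries.
    """
    centered = X.astype(float).copy()
    row_offsets = np.zeros(X.shape[0])
    col_offsets = np.zeros(X.shape[1])
    for _ in range(n_iters):
        r = np.nanmean(centered, axis=1)  # mean of observed entries per row
        centered -= r[:, None]
        row_offsets += r
        c = np.nanmean(centered, axis=0)  # mean of observed entries per column
        centered -= c[None, :]
        col_offsets += c
    return centered, row_offsets, col_offsets


# Made-up ratings matrix with one missing entry per row
X_toy = np.array([
    [5.0, 3.0, np.nan],
    [4.0, np.nan, 2.0],
    [np.nan, 1.0, 3.0],
])
X_centered, row_off, col_off = center_rows_and_columns(X_toy)
print(np.nanmean(X_centered, axis=0), np.nanmean(X_centered, axis=1))
```

Adding the stored offsets back recovers the original scale, which is the role `inverse_transform` plays later in the notebook.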

In [8]:
# Use softImpute to complete the matrix. J is defined in soft_impute.py and is the number of archetypes.
softImpute = SoftImpute(J = 4, maxit = 200, random_seed = 1, verbose = False)

In [9]:
# We run the softImpute model on the normalized training set
matrix_train_softImpute = softImpute.fit(matrix_train_normalized)
# Use the fitted softImpute model to create the predicted matrix. Set copyto to False to avoid changing the value of matrix_train_normalized
matrix_train_filled_normalized = matrix_train_softImpute.predict(matrix_train_normalized, copyto = False)
# Inverse transformation to undo the centering we did two cells above
matrix_train_filled = biscaler.inverse_transform(matrix_train_filled_normalized)
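
The custom `soft_impute` package is not shown, so the following is only a minimal sketch of the general soft-impute idea, not the package's code: repeatedly fill the missing entries with the current estimate, take an SVD, shrink the singular values, and keep the leading factors. The function name `soft_impute_sketch` and its `shrinkage` parameter are assumptions; `rank` plays the role that `J` plays above.

```python
import numpy as np


def soft_impute_sketch(X, rank, shrinkage, n_iters=200):
    """Illustrative soft-impute loop with a rank cap (not the custom package)."""
    observed = ~np.isnan(X)
    estimate = np.zeros(X.shape)
    for _ in range(n_iters):
        # Keep observed values, use the current estimate for the missing ones
        filled = np.where(observed, X, estimate)
        U, d, Vt = np.linalg.svd(filled, full_matrices=False)
        # Soft-threshold the leading singular values and drop the rest
        d_shrunk = np.maximum(d[:rank] - shrinkage, 0.0)
        estimate = (U[:, :rank] * d_shrunk) @ Vt[:rank, :]
    return estimate


# Made-up matrix with two missing entries
X_small = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 6.0],
    [3.0, 6.0, 9.0],
])
completed = soft_impute_sketch(X_small, rank=1, shrinkage=0.1)
print(completed)
```

A larger `shrinkage` pulls the estimate toward zero, which is why centering the data first matters: after centering, zero corresponds to the baseline formed by the row and column offsets.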

## Analysis of the predicted ratings

### Out-of-sample R^2

In [10]:
# Baseline: predict the average training rating for every entry
train_average = np.average(matrix_train[train_indices])

In [11]:
# Calculate out-of-sample R2 and in-sample R2

validation_mse = ((matrix_train_filled[validation_indices] - matrix_incomplete[validation_indices]) ** 2).mean()
training_mse = ((matrix_train_filled[train_indices] - matrix_incomplete[train_indices]) ** 2).mean()
validation_mse_baseline = ((train_average - matrix_incomplete[validation_indices]) ** 2).mean()
training_mse_baseline = ((train_average - matrix_incomplete[train_indices]) ** 2).mean()
print("out-of-sample R2: %.4f, in-sample R2: %.4f." % (1 - validation_mse / validation_mse_baseline, 1 - training_mse / training_mse_baseline))

out-of-sample R2: 0.1854, in-sample R2: 0.6270.
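
The R2 used here compares the model's mean squared error with that of the constant baseline. The helper below, a hypothetical `r_squared` function on made-up numbers, restates the formula from the cell above as a small sketch.

```python
import numpy as np


def r_squared(predicted, actual, baseline):
    """1 - MSE(model) / MSE(baseline); baseline may be a scalar or an array."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    mse_model = np.mean((predicted - actual) ** 2)
    mse_baseline = np.mean((baseline - actual) ** 2)
    return 1.0 - mse_model / mse_baseline


# Made-up ratings and predictions
actual = np.array([1.0, 2.0, 3.0, 4.0])
baseline = actual.mean()
predicted = actual + 0.5  # every prediction is off by half a star
print(r_squared(predicted, actual, baseline))
```

A value of 1 means perfect predictions, 0 means no better than the baseline, and negative values mean worse than the baseline.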


### Get low-rank factors

In [12]:
# Obtain the ratings of each archetype
# Each row of this matrix corresponds to a movie and each column corresponds to an archetype
softImpute.v

array([[-0.00818305, -0.00186758, -0.0064995 , -0.00933809],
       [-0.00363814, -0.0054402 , -0.00101252,  0.00464615],
       [-0.00436636, -0.01031742, -0.01088822,  0.00168521],
       ...,
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ]],
      shape=(9724, 4))

In [13]:
softImpute.v.shape

(9724, 4)

In [14]:

# The weight of each archetype for each user
# Each row of this matrix corresponds to a user and each column corresponds to an archetype
weights = np.dot(softImpute.u, np.diagflat(softImpute.d).T)
weights

array([[ -0.53562161, -18.40094309,   2.04833569,  -4.7306278 ],
       [ -7.96613867,   9.96148122,   4.36921993,  -0.54960925],
       [-33.19272433,  26.54336994,  39.60564913,  59.11771063],
       ...,
       [ 41.74092437,  28.96409398,  18.97959189,  12.41164149],
       [ -3.35984854,  -4.20183855,  -1.98668248,   5.32306638],
       [-10.43299218,   6.68294442,  -5.86424168, -21.54523686]],
      shape=(610, 4))

In [15]:
weights.shape

(610, 4)

In [16]:
# The predicted matrix is then the product of the two low-rank matrices
new_prediction = np.dot(weights, softImpute.v.T)

In [17]:
# The result matches the output of the previous section: the total absolute difference is negligible
np.sum(np.abs(new_prediction - matrix_train_filled_normalized))

np.float64(7.381488988216507e-11)
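
The same factor structure can be checked on a small example with a plain NumPy SVD. This is a sketch on random data, independent of the custom package; the names `U`, `d`, and `Vt` mirror `softImpute.u`, `softImpute.d`, and `softImpute.v.T` only by analogy.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))  # made-up "users x movies" matrix

U, d, Vt = np.linalg.svd(A, full_matrices=False)

# User weights: scale each column of U by its singular value.
# Broadcasting U * d is equivalent to U @ np.diagflat(d).
toy_weights = U * d
full_product = toy_weights @ Vt

# Keeping only the first k columns gives a rank-k approximation
k = 2
low_rank = toy_weights[:, :k] @ Vt[:k, :]

print(np.abs(full_product - A).sum())
print(np.linalg.matrix_rank(low_rank))
```

With all factors kept, the product reproduces `A` up to floating-point error; with only `k` factors it is a rank-`k` approximation, which is the form of the predicted matrix above.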
