# SVD Predictor

Creating and testing the SVD Predictor model on our dataframe. First we need to split the data in train and test set, and then run GridSearchCV on the training set in order to find the best n factors value to pass into the SVD model creation. 

In [29]:
import pickle
from datetime import datetime
from tqdm import tqdm
import numpy as np
import pandas as pd
import os
import sklearn
from sklearn.metrics.pairwise import cosine_similarity
import random
from surprise import Reader, Dataset
from surprise import SVD
from surprise import SVDpp
from surprise.model_selection import GridSearchCV
import xgboost as xbg
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [27]:
movie = pd.read_pickle("cleanedMovie.pkl")
movie.head()

Unnamed: 0,MovieID,CustomerID,Rating
0,1,1488844,3
3,1,30878,4
7,1,1248029,3
19,1,372233,5
20,1,1080361,3


In [32]:
movie_train, movie_test = sklearn.model_selection.train_test_split(movie, train_size = 0.8)

(3751486, 3)

In [34]:
print(movie_test.shape)
movie_test.head()

(3751486, 3)


Unnamed: 0,MovieID,CustomerID,Rating
7882419,1582,826184,1
99705020,17580,1833576,3
45681607,8171,395747,5
22273012,4225,2361318,4
55260630,10109,864647,5


In [36]:
print(movie_train.shape)
movie_train.head()

(15005940, 3)


Unnamed: 0,MovieID,CustomerID,Rating
11836212,2284,1352912,4
88353372,15717,574834,4
3977489,758,698147,4
76706859,13923,1200886,3
85305777,15134,2222815,3


In [37]:
reader = Reader(rating_scale=(1,5))
movieInput = pd.DataFrame()
movieInput['CustomerID'] = movie_train['CustomerID']
movieInput['MovieID'] = movie_train['MovieID']
movieInput['Rating'] = movie_train['Rating']

train_data = Dataset.load_from_df(movieInput, reader)
trainset = train_data.build_full_trainset()

In [38]:
testset = list(zip(movie_test["CustomerID"].values, movie_test["MovieID"].values, movie_test["Rating"].values))

In [40]:
error_table = pd.DataFrame(columns = ["Model", "Train_RMSE", "Test_RMSE"])

In [46]:
trainset.global_mean

3.4188866542182628

## Creating Fit and Prediction Method

In [60]:
def run_surprise(algo, trainset, testset, model_name):
    start = datetime.now()
    algo.fit(trainset)
    
    pred_train = algo.test(trainset.build_testset())
    
    trainActual = np.array([p.r_ui for p in pred_train])
    trainPred = np.array([p.est for p in pred_train]) 
    trainRMSE = np.sqrt(mean_squared_error(trainActual, trainPred))
    
    print("Train Data RMSE: {}".format(trainRMSE))
    print("\n")
    
    train = {"RMSE": trainRMSE, "Prediction": trainPred}
    
    pred_test = algo.test(testset)
    testActual = np.array([p.r_ui for p in pred_test])
    testPred = np.array([p.est for p in pred_test])
    testRMSE = np.sqrt(mean_squared_error(testActual, testPred))
    
    print("Test Data RMSE: {}".format(testRMSE))
    print("\n")
    
    test = {"RMSE": testRMSE, "Prediction": testPred}
    
    print("Time Taken = " + str(datetime.now() - start))
        
    return train, test

## Finding N Factors

GridSearchCV cannot handle the amount of data we are passing through, so we will run GridSearchCV on a smaller portion of the dataset in order to return the best n_factors to pass into SVD. SVD will itself get the full dataset we are using. Only run the code below for GridSearchCV if you want to run through the whole code. The fit command can take up to an hour, and the results are always the same, so we can hard code in the parameter for SVD.

In [66]:
params = { 'n_factors': [5, 10, 15, 20, 25, 30, 35, 40, 50]}
grid = GridSearchCV(SVD, params, measures=['rmse'], cv=3, refit=True)
grid.fit(Dataset.load_from_df(movieInput.iloc[:1500000], reader))
print(grid.best_score['rmse'])

0.9141425899963144


Of the N factors passed in, we can find the one that had the best RMSE and use that in the SVD model. Below, we use that directly from the calculation above. In the following class file, we use the value as a static variable in order to minimize processing time on unnecessary calculations. The results of grid.best_params['rmse']['n_factors'] are always 5, so feel free to input that into the parameters.

In [67]:
algo = SVD(n_factors = grid.best_params['rmse']['n_factors'], biased=True, verbose=True)
train_result, test_result = run_surprise(algo, trainset, testset, "SVD")

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Train Data RMSE: 0.8198083205483772


Test Data RMSE: 0.828587192328288


Time Taken = 0:10:29.761314


The best results for the test data are RMSE of 0.829, which is an extremely good RMSE for the Netflix Prize Data! Of course, the data may be somewhat skewed from only taking the most active users and popular movies. However, for our data of active users, this is highly effective! We can test below on less active users to see the RMSE for users we didn't include.

## Testing on Unused Data

Here we read in some of the rest of the data that was not passed into the pickle file in the beginning. We test for the RMSE of this test data using the SVD algorithm above.

In [153]:
cwd = os.getcwd()
movie = pd.read_csv(cwd + "/data/final.csv")
movie.describe()

Unnamed: 0,MovieID,CustomerID,Rating
count,100480500.0,100480500.0,100480500.0
mean,9070.915,1322489.0,3.60429
std,5131.891,764536.8,1.085219
min,1.0,6.0,1.0
25%,4677.0,661198.0,3.0
50%,9051.0,1319012.0,4.0
75%,13635.0,1984455.0,4.0
max,17770.0,2649429.0,5.0


In [154]:
reduced_data = movie.drop(columns=['Date'])

reduced_data['MovieID'] = reduced_data['MovieID'].astype('int16')
reduced_data['CustomerID'] = reduced_data['CustomerID'].astype('int32')
reduced_data['Rating'] = reduced_data['Rating'].astype('int8')

In [157]:
movie_freq = pd.DataFrame(reduced_data.groupby('MovieID').size(),columns=['count'])
threshold = 100

popular_movies = list(set(movie_freq.query('count>=@threshold').index))

# ratings df after dropping non popular movies
data_popular_movies = reduced_data[reduced_data.MovieID.isin(popular_movies)]

print('shape of original data:', reduced_data.shape)
print('shape of data_popular_movies', data_popular_movies.shape)
print("No. of movies which are rated more than 100 times:", len(popular_movies))

shape of original data: (100480507, 3)
shape of data_popular_movies (100400918, 3)
No. of movies which are rated more than 100 times: 16795


In [159]:
# reduce data_popular_movie to only have movies that are in movie
movieList = movie.MovieID.tolist()
popMoviesTest = data_popular_movies[data_popular_movies.MovieID.isin(movieList)]

The below code takes a very long time to run as well, as it is 100,000,000 rows of data being passed in. You can uncomment the line near the top to reduce the dataset to a more manageable size. If not, it takes about an hour to run.

In [160]:
start = datetime.now()

reducedtestset = list(zip(popMoviesTest["CustomerID"].values, popMoviesTest["MovieID"].values, popMoviesTest["Rating"].values))
#reducedtestset = reducedtestset.iloc[:2000000]

pred_test = algo.test(reducedtestset)
testActual = np.array([p.r_ui for p in pred_test])
testPred = np.array([p.est for p in pred_test])
testRMSE = np.sqrt(mean_squared_error(testActual, testPred))
    
print("Test Data RMSE: {}".format(testRMSE))
print("\n")
    
test = {"RMSE": testRMSE, "Prediction": testPred}
    
print("Time Taken = " + str(datetime.now() - start))

Test Data RMSE: 0.9949435296768612


Time Taken = 1:04:08.393445


Based on the RMSE from the test set we just ran, our algorithm is definitely biased toward users who have rated very frequently. There is likely a high level of correlation between those users, which affects our results. This proves that our algorithm is highly effective for those users, but only average for users who do not rate in the top percentile.

## Putting it all together

Take all the information we gathered, the functions we built, and the models we created, and put them all into one class. The class saves algorithms as pickle files to be reused later without having to calculate the model and algorithm all over again. The class also has no testset, as there is no need for verification at this stage- only fitting the model and predicting values for the given user.

In [146]:
import pickle
from tqdm import tqdm
import numpy as np
import pandas as pd
import os
import pathlib
from surprise import Reader, Dataset
from surprise import SVD
from surprise import SVDpp
from surprise.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

class SVDPredictor:
    error_table = pd.DataFrame(columns = ["Model", "Train_RMSE", "Test_RMSE"])
    
    #Class takes in final.csv as a whole as a DataFrame
    def __init__(self, data):
        self.movie = data
        self.createAlgorithmFromData()
        
    def createAlgorithmFromData(self):
        #check if algo and trainset/train_data files are already created
        file = pathlib.Path('svd.pickle')
        if not file.exists():
            self._reduceDataSize()
            self._splitMovie()
            self._createTrainSet()
        self._run_surprise()
        
    def predict(self, userID, movieID):
        #use algo to predict rating. Return predicted rating
        return self.algo.predict(userID, movieID)
    
    def _splitMovie(self):
        self.movie = self.movie.iloc[:1500000]
        
    def _createTrainSet(self):
        reader = Reader(rating_scale=(1,5))
        movieInput = pd.DataFrame()
        movieInput['CustomerID'] = self.movie['CustomerID']
        movieInput['MovieID'] = self.movie['MovieID']
        movieInput['Rating'] = self.movie['Rating']

        self.train_data = Dataset.load_from_df(movieInput, reader)
        self.trainset = self.train_data.build_full_trainset()
        #write to a file
    
    def _reduceDataSize(self):
        self.movie['Date'] = self.movie['Date'].astype('category')
        self.movie['MovieID'] = self.movie['MovieID'].astype('int16')
        self.movie['CustomerID'] = self.movie['CustomerID'].astype('int32')
        self.movie['Rating'] = self.movie['Rating'].astype('int8')
    
    def _run_surprise(self): 
        file = pathlib.Path('svd.pickle')
        if file.exists():
            with open('svd.pickle', 'rb') as f:
                self.algo = pickle.load(f)
        else:
            self.algo = SVD(n_factors = 5, biased=True, verbose=True)
            self.algo = self.algo.fit(self.trainset)
            with open('svd.pickle', 'wb') as f:
                pickle.dump(self.algo, f)

Let's now instantiate the class we just created above with our curated data set, and then test it on a random CustomerID and MovieID! In the actual application of a recommender, we would want to set a specific CustomerID, but allow the MovieID to vary to get our ratings.

In [147]:
svd = SVDPredictor(movie)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
3.9115833649385836


In [151]:
print("Prediction for Customer 1 and Movie 5: ", svd.predict(1, 5).est)
print("Prediction for Customer 1 and Movie 16378: ", svd.predict(7, 16378).est)

4.340447112643969


## Creating a Mini-Recommender

Though we already have 2 models for recommenders, we can also use the SVD model as a recommender for a Customer at a time. Below is code to recommend 10 movies to a given Customer.

In [221]:
def recommendFor(customerID, movieIDs, model):
    predictions = []
    for movie in movieIDs:
        predictions.append(model.predict(customerID, movie).est)
    return predictions

In [222]:
preds = recommendFor(1, movie.MovieID.unique().tolist(), svd)

In [223]:
movie_title = pd.read_csv(cwd + "/data/movie_titles.csv", encoding='unicode_escape', usecols=[2], header=None)
movie_title.columns = ['title']
movie_title

Unnamed: 0,title
0,Dinosaur Planet
1,Isle of Man TT 2004 Review
2,Character
3,Paula Abdul's Get Up & Dance
4,The Rise and Fall of ECW
...,...
17765,Where the Wild Things Are and Other Maurice Se...
17766,Fidel Castro: American Experience
17767,Epoch
17768,The Company


In [224]:
def recommendedMovies(count, preds):
    movieAndRating = {}
    copyPreds = preds
    for i in range(count):
        index = copyPreds.index(max(copyPreds))
        maxPred = max(copyPreds)
        title = movie_title.iloc[index:index+1]['title'][index]
        movieAndRating[title] = maxPred
        copyPreds.pop(index)
    return movieAndRating

In [228]:
recommendedMovies(10, preds)

{'Animation Legend: Winsor McCay': 3.9761161781499674,
 'Winnie the Pooh: Springtime with Roo': 3.9734730585465985,
 'Fatal Beauty': 3.9620931808825235,
 'The Great Race': 3.959445030171472,
 'Mann': 3.9564895830237434,
 "That '70s Show: Season 1": 3.952919822356805,
 'Lost in the Pershing Point Hotel': 3.9486325753206364,
 'Pressure': 3.9241229015910952,
 'The Hunchback of Notre Dame II': 3.9224312892992375,
 'Chain of Command': 3.9193358298762853}

The final product is a recommender that returns the count of top movies for the chosen Customer, and gives the predicted rating.