# Finding the Best Model-Based Matrix-Factorization Algorithm

Two model-based matrix-factorization algorithms were compared using grid search:

1. SVD - Singular Value Decomposition
2. NMF - Non-negative Matrix Factorization

Note that several grid searches were run for each model, but many were removed for the sake of conciseness. The best result is placed at the start of each model's section.
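
For reference, the prediction rules behind the two algorithms can be written as below. The notation follows the Surprise documentation and is not taken from the original notebook: $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the user and item factor vectors.

```latex
% SVD (biased by default)
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

% NMF (unbiased by default), with all factors kept non-negative
\hat{r}_{ui} = q_i^{\top} p_u, \qquad p_u \ge 0,\; q_i \ge 0

% NMF with biased=True uses the same form as SVD
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
```

The `reg_pu`, `reg_qi`, `reg_bu` and `reg_bi` parameters tuned below are the regularization strengths applied to $p_u$, $q_i$, $b_u$ and $b_i$ respectively.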

In [1]:
import pandas as pd
from surprise import SVD, NMF
from surprise import Dataset, Reader
from surprise.model_selection import GridSearchCV
from surprise.accuracy import rmse

In [2]:
sample = pd.read_csv('../data/books_reviews_sample.csv')
sample.head(2)

| Unnamed: 0 | book_id | title | user_id | rating |
| --- | --- | --- | --- | --- |
| 0 | 5 | Harry Potter and the Prisoner of Azkaban (Harr... | 84f866eb6dae54d7ac52d45a4c9b4d1f | 4 |
| 1 | 5 | Harry Potter and the Prisoner of Azkaban (Harr... | f1b86bf7c103c46fcb854e1fb711b1ec | 5 |


In [3]:
# Reorder columns to (user, item, rating), the order surprise expects in load_from_df

ratings = sample[['user_id', 'book_id', 'rating']]

In [5]:
ratings.shape

(91567, 3)

In [6]:
reader = Reader(rating_scale=(1,5))

In [7]:
dataset = Dataset.load_from_df(ratings,reader)

### Matrix Factorization-based Algorithms

#### Non-Negative Matrix Factorization

Since the biased version of NMF is highly prone to overfitting:
- the number of factors was reduced
- the regularization parameters were increased
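
To illustrate why a larger regularization parameter restrains the model, here is a minimal sketch of one multiplicative update of a single user's factor vector, in the style of regularized NMF. It is a simplified illustration rather than Surprise's actual implementation. The function name `nmf_user_update` and the toy numbers are hypothetical.

```python
def nmf_user_update(p_u, item_factors, ratings, reg_pu):
    """Apply one multiplicative update to a user's non-negative factor vector.

    p_u          -- current user factors (list of non-negative floats)
    item_factors -- factor vectors of the items this user rated
    ratings      -- the user's observed ratings for those items
    reg_pu       -- regularization strength for user factors
    """
    n_rated = len(ratings)
    # Current predictions q_i . p_u for every rated item (unbiased form)
    preds = [sum(q * p for q, p in zip(q_i, p_u)) for q_i in item_factors]

    updated = []
    for f, p_uf in enumerate(p_u):
        numer = sum(q_i[f] * r for q_i, r in zip(item_factors, ratings))
        denom = sum(q_i[f] * r_hat for q_i, r_hat in zip(item_factors, preds))
        # The regularization term only enlarges the denominator,
        # so a larger reg_pu yields a smaller updated factor.
        denom += reg_pu * n_rated * p_uf
        updated.append(p_uf * numer / denom)
    return updated


# Toy data (made up for illustration): one user, two rated items, two factors
p_u = [0.5, 0.5]
item_factors = [[1.0, 0.2], [0.3, 0.9]]
ratings = [4, 5]

weak_reg = nmf_user_update(p_u, item_factors, ratings, reg_pu=0.06)
strong_reg = nmf_user_update(p_u, item_factors, ratings, reg_pu=0.8)
```

With the same starting point, every entry of `strong_reg` is smaller than the matching entry of `weak_reg`, and both stay positive because the update only multiplies by a positive ratio.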

In [10]:
param_grid = {'n_factors': [8],
              'n_epochs': [40],
              'biased': [True],
              'reg_pu': [.8],
              'reg_qi': [2],
              'reg_bu': [.03, .02],
              'reg_bi': [.03, .02],
              'lr_bu': [.005],
              'lr_bi': [.005],
              }
# NOTE: this search passes SVD, not NMF, although it appears under the NMF heading
gs = GridSearchCV(SVD, param_grid, measures=['rmse'])
gs.fit(dataset)

In [11]:
gs.best_score['rmse'] # 0.7961

0.7961335036831231

In [12]:
gs.best_params['rmse']

{'n_factors': 8,
 'n_epochs': 40,
 'biased': True,
 'reg_pu': 0.8,
 'reg_qi': 2,
 'reg_bu': 0.03,
 'reg_bi': 0.03,
 'lr_bu': 0.005,
 'lr_bi': 0.005}
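
As a rough sketch of how `best_score` and `best_params` relate: the grid search averages the RMSE over the cross-validation folds for each parameter combination and keeps the combination with the lowest mean. The fold scores and the helper `pick_best` below are hypothetical and only illustrate the selection step.

```python
from statistics import mean

# Hypothetical per-fold RMSE values for two parameter combinations
# (made-up numbers for illustration, not results from this notebook).
fold_scores = {
    (("reg_bu", 0.03), ("reg_bi", 0.03)): [0.80, 0.79, 0.81],
    (("reg_bu", 0.02), ("reg_bi", 0.02)): [0.82, 0.80, 0.83],
}


def pick_best(scores_by_combo):
    """Return (params, mean_rmse) for the combination with the lowest mean RMSE."""
    means = {combo: mean(scores) for combo, scores in scores_by_combo.items()}
    best_combo = min(means, key=means.get)
    return dict(best_combo), means[best_combo]


best_params, best_score = pick_best(fold_scores)
```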

In [14]:
param_grid = {'n_factors': [8],
              'n_epochs': [40],
              'biased': [True],
              'reg_pu': [.8],
              'reg_qi': [2],
              'reg_bu': [.03],
              'reg_bi': [.03],
              'lr_bu': [.005],
              'lr_bi': [.005],
              }
# NOTE: this search also passes SVD, not NMF, although it appears under the NMF heading
gs = GridSearchCV(SVD, param_grid, measures=['rmse'])
gs.fit(dataset)

In [15]:
gs.best_score['rmse']

0.8182415007035473

In [19]:
param_grid = {'n_factors': [8],
              'n_epochs': [40],
              'biased': [True],
              'reg_pu': [.8],
              'reg_qi': [2, 2.5,3, 5.5, 6],
              }
gs = GridSearchCV(NMF, param_grid, measures=['rmse'])
gs.fit(dataset)

In [20]:
gs.best_score['rmse']

0.8179601512706389

In [21]:
gs.best_params['rmse']

{'n_factors': 8, 'n_epochs': 40, 'biased': True, 'reg_pu': 0.8, 'reg_qi': 2}

In [84]:
param_grid = {'n_factors': [8],
              'n_epochs': [40],
              'biased': [True],
              'reg_pu': [.8],
              'reg_qi': [2, 3, 4, 5, 6],
              }
gs = GridSearchCV(NMF, param_grid, measures=['rmse'])
gs.fit(dataset)

In [86]:
gs.best_params['rmse']

{'n_factors': 8, 'n_epochs': 40, 'biased': True, 'reg_pu': 0.8, 'reg_qi': 2}

In [85]:
gs.best_score['rmse']

0.8783387360596535

#### SVD

In [15]:
param_grid = {'n_factors': [8],
              'n_epochs': [20],
              'lr_all': [.005],
              'reg_all': [.06],
              'reg_pu': [.8],
              'reg_qi': [2],
              'reg_bu': [.02],
              'reg_bi': [.03],
              }
gs = GridSearchCV(SVD, param_grid, measures=['rmse'])
gs.fit(dataset)

In [16]:
gs.best_score['rmse'] # 0.7964

0.7963993239440491

In [17]:
gs.best_params['rmse']

{'n_factors': 8,
 'n_epochs': 20,
 'lr_all': 0.005,
 'reg_all': 0.06,
 'reg_pu': 0.8,
 'reg_qi': 2,
 'reg_bu': 0.02,
 'reg_bi': 0.03}

In [31]:
param_grid = {'n_factors': [5,8,10],
              'n_epochs': [20],
              'lr_all': [.005],
              'reg_all': [.06],
              'reg_pu': [.8, .06],
              'reg_qi': [2, .06],
              'reg_bu': [.03, .02],
              'reg_bi': [.03, .02],
              }
gs = GridSearchCV(SVD, param_grid, measures=['rmse'])
gs.fit(dataset)
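
To get a feel for the cost of a grid like the one above, the sketch below expands it into individual parameter combinations using only the standard library. The helper `expand_grid` is hypothetical, and the fold count assumes Surprise's default of 5-fold cross-validation when `cv` is not passed.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {'n_factors': [5, 8, 10],
              'n_epochs': [20],
              'lr_all': [.005],
              'reg_all': [.06],
              'reg_pu': [.8, .06],
              'reg_qi': [2, .06],
              'reg_bu': [.03, .02],
              'reg_bi': [.03, .02],
              }


def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


combos = expand_grid(param_grid)
n_folds = 5                       # assumption: default cv in GridSearchCV
n_fits = len(combos) * n_folds    # total number of model fits
```

Each extra value in any list multiplies the number of combinations, which is why most searches in this notebook vary only one or two parameters at a time.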

In [32]:
gs.best_score['rmse']

0.8173772613703478

In [33]:
gs.best_params['rmse']

{'n_factors': 8,
 'n_epochs': 20,
 'lr_all': 0.005,
 'reg_all': 0.06,
 'reg_pu': 0.8,
 'reg_qi': 2,
 'reg_bu': 0.02,
 'reg_bi': 0.03}

In [25]:
param_grid = {'n_factors': [5,8,10],
              'n_epochs': [20],
              'lr_all': [.005],
              'reg_all': [.06, .08, .1]
              }
gs = GridSearchCV(SVD, param_grid, measures=['rmse'])
gs.fit(dataset)

In [26]:
gs.best_score['rmse']

0.8175484213735096

In [27]:
gs.best_params['rmse']

{'n_factors': 5, 'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.06}

Once the hyperparameters for each algorithm were tuned, both models were fit and evaluated on the whole dataset to see which one performed better:

### Testing on the whole dataset

In [19]:
algo_nmf = NMF(n_factors=8, n_epochs=40, biased=True,
               reg_pu=0.8, reg_qi=2,
               reg_bu=.03, reg_bi=0.3,  # note: the grid search above selected reg_bi=0.03
               random_state=123)

algo_svd = SVD(n_factors=8, n_epochs=20, lr_all=0.005,
               reg_pu=0.8, reg_qi=2,
               reg_bu=.02, reg_bi=0.3,  # note: the grid search above selected reg_bi=0.03
               random_state=123)

# Retrieve trainset as the entire dataset
trainset = dataset.build_full_trainset()

# Create a testset from the same ratings (so the RMSE below is an in-sample error)
testset = trainset.build_testset()

# Train the algorithms on the trainset (dataset)
# Predict on the testset

algo_nmf.fit(trainset)
preds_nmf = algo_nmf.test(testset)

algo_svd.fit(trainset)
preds_svd = algo_svd.test(testset)

In [20]:
rmse(preds_nmf) # 0.7204

RMSE: 0.7204


0.7204389953844127

In [21]:
rmse(preds_svd) # 0.7351

RMSE: 0.7351


0.7351107190415771
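
For clarity on what these numbers measure, here is a minimal sketch of the RMSE computation: the square root of the mean squared difference between the true and estimated ratings. Surprise's `rmse` works on its own prediction objects, so the helper `rmse_from_pairs` and the example pairs below are hypothetical stand-ins.

```python
from math import sqrt


def rmse_from_pairs(pairs):
    """Compute RMSE from an iterable of (true_rating, estimated_rating) pairs."""
    squared_errors = [(r_ui - est) ** 2 for r_ui, est in pairs]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up pairs for illustration: every estimate is off by exactly 0.5
example_pairs = [(4, 3.5), (5, 4.5), (3, 3.5), (2, 2.5)]
example_rmse = rmse_from_pairs(example_pairs)
```

Because errors are squared before averaging, a few large misses raise the RMSE more than many small ones.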

### Conclusion

On the whole dataset, the non-negative matrix factorization (NMF) algorithm achieved a lower RMSE (0.7204) than SVD (0.7351), so NMF is the better of the two.