# Finding the Best Memory-Based Algorithm

Four memory-based models were compared using grid search (`GridSearchCV`):

1. KNNBasic
2. KNNWithMeans
3. KNNWithZScore
4. KNNBaseline

After selecting the best of these algorithms, its parameters were tuned further using the whole dataset as the trainset.

Note that several grid searches were run for each model, but many were removed for conciseness. The best result is shown first in each model's section.

In [22]:
import pandas as pd
from surprise import KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline
from surprise import Dataset, Reader
from surprise.model_selection import GridSearchCV
from surprise.accuracy import rmse

In [13]:
# Read in data

sample = pd.read_csv('../data/books_reviews_sample.csv')
sample.head(2)

| | book_id | title | user_id | rating |
|---|---|---|---|---|
| 0 | 5 | Harry Potter and the Prisoner of Azkaban (Harry Potter, #3) | 84f866eb6dae54d7ac52d45a4c9b4d1f | 4 |
| 1 | 5 | Harry Potter and the Prisoner of Azkaban (Harry Potter, #3) | f1b86bf7c103c46fcb854e1fb711b1ec | 5 |


In [4]:
# Reorder columns to (user, item, rating), the order Surprise expects

ratings = sample[['user_id', 'book_id', 'rating']]

In [5]:
ratings.head(2)

| | user_id | book_id | rating |
|---|---|---|---|
| 0 | 84f866eb6dae54d7ac52d45a4c9b4d1f | 5 | 4 |
| 1 | f1b86bf7c103c46fcb854e1fb711b1ec | 5 | 5 |


In [6]:
ratings.shape

(91567, 3)

In [7]:
reader = Reader(rating_scale=(1,5))

In [8]:
dataset = Dataset.load_from_df(ratings,reader)

### KNNBasic

In [14]:
param_grid = {'k': [40],
              'min_k':[5],
              'sim_options': {'name': ['msd'],
                              'min_support': [2],
                              'user_based': [False]}
              }
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'])
gs.fit(dataset)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


In [15]:
gs.best_score['rmse'] # 0.8839

0.8839189944775429
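The `msd` option used above is the mean squared difference similarity. As a rough sketch of the idea (plain Python, not Surprise's implementation; the function name and the toy ratings are made up for illustration), the similarity between two books is the inverse of one plus the mean squared difference over the users who rated both, and pairs with fewer than `min_support` common users get a similarity of 0:

```python
def msd_similarity(ratings_a, ratings_b, min_support=1):
    """MSD similarity between two rating dicts keyed by the co-rating entity.

    For item-based similarity the keys are user ids; for user-based
    similarity they would be item ids.
    """
    common = set(ratings_a) & set(ratings_b)
    # Too few (or no) common raters: treat the pair as unrelated
    if not common or len(common) < min_support:
        return 0.0
    msd = sum((ratings_a[k] - ratings_b[k]) ** 2 for k in common) / len(common)
    return 1.0 / (msd + 1.0)


# Hypothetical ratings for two books, keyed by user id
book_x = {"u1": 4, "u2": 5, "u3": 3}
book_y = {"u1": 5, "u2": 5, "u3": 1, "u4": 2}

sim = msd_similarity(book_x, book_y, min_support=2)
print(sim)  # squared differences 1, 0, 4 -> msd = 5/3 -> 1 / (5/3 + 1) = 0.375
```

Identical ratings give a similarity of 1, and the value shrinks toward 0 as the ratings diverge.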

In [None]:
param_grid = {'k': [40, 50],
              'min_k':[ 5, 10],
              'sim_options': {'name': ['cosine', 'msd', 'pearson'],
                              'min_support': [2,5],
                              'user_based': [False, True]}
              }
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'])
gs.fit(dataset)

In [31]:
gs.best_score['rmse']

0.8873281790635819

In [32]:
gs.best_params['rmse']

{'k': 40,
 'min_k': 5,
 'sim_options': {'name': 'msd', 'min_support': 2, 'user_based': False}}
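To give a sense of the cost of the full KNNBasic search above, the sketch below counts the parameter combinations in that grid. It assumes that nested option dictionaries such as `sim_options` are expanded into every combination of their values and that the default 5-fold cross-validation is used (consistent with the five similarity computations logged for the single-combination run). The helper `count_combinations` is hypothetical, not part of Surprise.

```python
def count_combinations(param_grid):
    """Count parameter combinations, expanding nested option dicts."""
    total = 1
    for value in param_grid.values():
        if isinstance(value, dict):
            # Nested dicts (e.g. sim_options) contribute the product of their lists
            total *= count_combinations(value)
        else:
            total *= len(value)
    return total


knn_basic_grid = {'k': [40, 50],
                  'min_k': [5, 10],
                  'sim_options': {'name': ['cosine', 'msd', 'pearson'],
                                  'min_support': [2, 5],
                                  'user_based': [False, True]}
                  }

n_combos = count_combinations(knn_basic_grid)
n_folds = 5  # assumption: default number of cross-validation folds

print(n_combos)            # 2 * 2 * (3 * 2 * 2) = 48
print(n_combos * n_folds)  # 240 model fits
```

Each fit recomputes a similarity matrix, which is why the larger grids take much longer than the single-combination runs.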

### KNNWithMeans

In [18]:
param_grid = {'k': [60],
              'min_k':[4],
              'sim_options': {'name': ['cosine'],
                              'min_support': [1],
                              'user_based': [False]}
              }
gs = GridSearchCV(KNNWithMeans, param_grid, measures=['rmse'])
gs.fit(dataset)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [19]:
gs.best_score['rmse'] # 0.8326

0.8325582778870373
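For context on the `cosine` option used above, here is a minimal sketch of cosine similarity computed only over common raters. It is plain Python for illustration, not Surprise's implementation, and the function name and toy ratings are assumptions.

```python
import math


def cosine_similarity(ratings_a, ratings_b, min_support=1):
    """Cosine similarity over the keys (raters) common to both rating dicts."""
    common = set(ratings_a) & set(ratings_b)
    if not common or len(common) < min_support:
        return 0.0
    dot = sum(ratings_a[k] * ratings_b[k] for k in common)
    norm_a = math.sqrt(sum(ratings_a[k] ** 2 for k in common))
    norm_b = math.sqrt(sum(ratings_b[k] ** 2 for k in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical ratings: book_b is rated exactly half as high as book_a
book_a = {"u1": 4, "u2": 2}
book_b = {"u1": 2, "u2": 1, "u3": 5}

sim = cosine_similarity(book_a, book_b)
print(round(sim, 6))  # proportional ratings give a similarity of 1.0
```

Cosine on raw ratings ignores differences in rating level: the two books above look identical even though one is consistently rated lower. KNNWithMeans and the other centered variants compensate for such level differences in the prediction step.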

In [33]:
param_grid = {'k': [60, 50],
              'min_k':[4, 5, 6],
              'sim_options': {'name': ['cosine'],
                              'min_support': [1,2,3],
                              'user_based': [False, True]}
              }
gs = GridSearchCV(KNNWithMeans, param_grid, measures=['rmse'])
gs.fit(dataset)

Computing the cosine similarity matrix...
Done computing similarity matrix.
[... the same two lines repeat for each fit; output truncated ...]

In [34]:
gs.best_score['rmse']

0.8387498355432106

In [35]:
gs.best_params['rmse']

{'k': 60,
 'min_k': 4,
 'sim_options': {'name': 'cosine', 'min_support': 1, 'user_based': False}}

### KNNWithZScore

In [16]:
param_grid = {'k': [40],
              'min_k':[5],
              'sim_options': {'name': ['cosine'],
                              'min_support': [2],
                              'user_based': [False]}
              }
gs = GridSearchCV(KNNWithZScore, param_grid, measures=['rmse'])
gs.fit(dataset)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [17]:
gs.best_score['rmse'] # 0.8537

0.8536795289821354

In [36]:
param_grid = {'k': [40, 30],
              'min_k':[5, 10],
              'sim_options': {'name': ['cosine', 'msd', 'pearson'],
                              'min_support': [2,9],
                              'user_based': [False, True]}
              }
gs = GridSearchCV(KNNWithZScore, param_grid, measures=['rmse'])
gs.fit(dataset)

Computing the cosine similarity matrix...
Done computing similarity matrix.
[... the same two lines repeat for each fit; output truncated ...]

In [38]:
gs.best_score['rmse']

0.8543751702794602

In [37]:
gs.best_params['rmse']

{'k': 40,
 'min_k': 5,
 'sim_options': {'name': 'cosine', 'min_support': 2, 'user_based': False}}

### KNNBaseline

In [9]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [.08],
                              'learning_rate': [.005],
                              'n_epochs': [40]
                             },
              'k': [40],
              'min_k': [10],
              'sim_options': {'name': ['msd'],
                              'min_support':[2,9],
                              'user_based': [False, True]}
              }
gs = GridSearchCV(KNNBaseline, param_grid, measures=['rmse'])
gs.fit(dataset)

Estimating biases using sgd...
Computing the msd similarity matrix...
Done computing similarity matrix.
[... the same three lines repeat for each fit; output truncated ...]

In [10]:
gs.best_score['rmse'] # 0.7942

0.7941704856202715

In [11]:
gs.best_params['rmse']

{'bsl_options': {'method': 'sgd',
  'reg': 0.08,
  'learning_rate': 0.005,
  'n_epochs': 40},
 'k': 40,
 'min_k': 10,
 'sim_options': {'name': 'msd', 'min_support': 9, 'user_based': True}}

### Comparison Result: KNNBaseline

**KNNBaseline** had the lowest cross-validated RMSE (0.7942) of the four algorithms, ahead of KNNWithMeans (0.8326), KNNWithZScore (0.8537), and KNNBasic (0.8839).
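KNNBaseline differs from the other three models in that it predicts a deviation from a baseline estimate (global mean plus user and item biases) rather than from a raw rating or a mean. The sketch below illustrates that prediction rule in plain Python. It is a simplified illustration under stated assumptions, not Surprise's code: the function name, the input layout, and the numbers are hypothetical, and neighbor selection and clipping to the rating scale are omitted.

```python
def knn_baseline_estimate(baseline_target, neighbors, min_k=1):
    """Baseline estimate plus a similarity-weighted average of neighbor residuals.

    neighbors: list of (similarity, rating, neighbor_baseline) tuples
    for the nearest neighbors that have a known rating.
    """
    weighted_sum = 0.0
    sim_sum = 0.0
    used = 0
    for sim, rating, nb_baseline in neighbors:
        if sim > 0:  # only positively similar neighbors contribute
            weighted_sum += sim * (rating - nb_baseline)
            sim_sum += sim
            used += 1
    # With too few usable neighbors, fall back to the baseline alone
    if used < min_k or sim_sum == 0:
        return baseline_target
    return baseline_target + weighted_sum / sim_sum


# Hypothetical values: baseline for the target pair is 3.8
neighbors = [(0.5, 5, 4.0),    # rated 1.0 above its own baseline
             (0.25, 3, 3.5)]   # rated 0.5 below its own baseline

est = knn_baseline_estimate(3.8, neighbors, min_k=2)
print(round(est, 4))  # 3.8 + (0.5 * 1.0 + 0.25 * -0.5) / 0.75 = 4.3
```

With `min_k=3` the same call returns the baseline 3.8, which shows how a larger `min_k` pulls sparse predictions back toward the baseline.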

### Testing on the whole dataset

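Check whether user-based or item-based similarity should be used. Both variants are trained on the full dataset and scored on those same ratings, so the RMSE values below are in-sample and are not directly comparable to the cross-validated scores above.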

In [29]:
# Instantiate algorithms

bsl_options = {'method': 'sgd',
               'reg': .08,
               'learning_rate': .005,
               'n_epochs': 40}

sim_options_1 = {'name': 'msd',
                 'min_support': 1,
                 'user_based': False}  # Item-based

sim_options_2 = {'name': 'msd',
                 'min_support': 1,
                 'user_based': True}   # User-based

algo_knn_1 = KNNBaseline(k=40, min_k=2, sim_options=sim_options_1, bsl_options=bsl_options)

algo_knn_2 = KNNBaseline(k=40, min_k=2, sim_options=sim_options_2, bsl_options=bsl_options)

# Use the entire dataset as the trainset
trainset = dataset.build_full_trainset()

# Build the testset from the trainset itself (all known ratings),
# so the RMSE below is measured on the training data
testset = trainset.build_testset()

# Fit each algorithm on the trainset, then predict on the testset

algo_knn_1.fit(trainset)
preds_knn_1 = algo_knn_1.test(testset)

algo_knn_2.fit(trainset)
preds_knn_2 = algo_knn_2.test(testset)

Estimating biases using sgd...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the msd similarity matrix...
Done computing similarity matrix.


In [24]:
rmse(preds_knn_1) # 0.47397

RMSE: 0.4740


0.47396721637185607
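For reference, the RMSE reported by `rmse` is the square root of the mean squared difference between predicted and true ratings. A minimal sketch on made-up numbers (the helper name `rmse_plain` is hypothetical, chosen to avoid clashing with the imported `rmse`):

```python
import math


def rmse_plain(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predictions and true ratings
predicted = [4.0, 3.0, 5.0, 2.0]
actual = [5, 3, 4, 2]

value = rmse_plain(predicted, actual)
print(round(value, 4))  # errors 1, 0, 1, 0 -> sqrt(0.5) = 0.7071
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.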

In [25]:
rmse(preds_knn_2) # 0.48106

RMSE: 0.4811


0.4810623512466855
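Both scores above come from predicting the same ratings the models were fitted on. One way to get a less optimistic comparison would be to hold out part of the ratings before fitting. The sketch below shows the idea in plain Python on made-up rating tuples; `holdout_split` is a hypothetical helper, and in Surprise the `train_test_split` function from `surprise.model_selection` serves a similar purpose.

```python
import random


def holdout_split(ratings, test_fraction=0.2, seed=42):
    """Shuffle the ratings and split them into (train, test) lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (user_id, book_id, rating) tuples
ratings = [("u%d" % (i % 5), "b%d" % (i % 7), 1 + i % 5) for i in range(20)]

train, test = holdout_split(ratings, test_fraction=0.2)
print(len(train), len(test))  # 16 4
```

A model fitted on `train` and scored on `test` never sees the held-out ratings, so its RMSE reflects how well it generalizes rather than how well it reproduces known ratings.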

### Conclusion

With the lower RMSE on the full dataset (0.4740 vs. 0.4811 for user-based), item-based KNNBaseline is the best memory-based algorithm.