# `Recommender System Models`

&emsp;&emsp;&emsp;<font size="4">In this project, user ratings from a subset of Amazon Reviews are used to train a collaborative-filtering recommendation system, with various algorithms evaluated and tuned using hyperparameter optimization. The models return a list of recommended items based on the reviewer's previous actions.</font>

# `Data`
&emsp;&emsp;&emsp;<font size="4">The `Movies_and_TV` ratings data was retrieved from [here](https://jmcauley.ucsd.edu/data/amazon/). This includes (item, user, rating, timestamp) tuples. A subset of the data was used in the analysis because of the sparsity of the data and the resulting computational constraints.</font>

# `Preprocessing`
&emsp;&emsp;&emsp;<font size="4">The code that was used for preprocessing and EDA can be found [here](https://github.com/adataschultz/RecSys/blob/main/Notebooks_Scripts/Recommender_System.py). 
- First, the environment is set up with the dependencies, the library options, the seed for reproducibility, and the location of the project directory. Then the data is read, duplicate observations are dropped, and the columns are named.</font>

In [None]:
import os
import random
import numpy as np
import warnings
import sys
import pandas as pd
import time
import json
from surprise import Dataset, Reader, BaselineOnly, NormalPredictor
from surprise import KNNBaseline, KNNWithMeans, KNNBasic, KNNWithZScore
from surprise import SVD, SVDpp, NMF, CoClustering, dump, accuracy
from surprise.model_selection import cross_validate, GridSearchCV
from sklearn.utils.extmath import randomized_svd
from sklearn.decomposition import TruncatedSVD
from tempfile import mkdtemp
import os.path as path
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Set seed 
seed_value = 42
os.environ['Recommender'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

# Set path
path = r'D:\AmazonReviews\Data'
os.chdir(path)

# Read data
df = pd.read_csv('Movies_and_TV.csv', header=None, skiprows=[0],
                 low_memory=False)
df = df.drop_duplicates()

# Name columns
df.columns = ['item', 'reviewerID', 'rating', 'timestamp']

print('Sample observations:')
df.head()

Sample observations:


      item      reviewerID  rating   timestamp
0  1527665  A2VHSG6TZHU1OB     5.0  1361145600
1  1527665  A23EJWOW1TLENE     5.0  1358380800
2  1527665  A1KM9FNEJ8Q171     5.0  1357776000
3  1527665  A38LY2SSHVHRYB     4.0  1356480000
4  1527665   AHTYUW2H1276L     5.0  1353024000


<font size="4">Then a function is defined to examine the data for the number of missing observations, the data types, and the number of unique values in the initial set. The `timestamp` variable is dropped since it will not be used.</font>

In [None]:
# Define a function to examine the data
def data_summary(df):
    print('Number of Rows: {}, Columns: {}'.format(df.shape[0], df.shape[1]))
    a = pd.DataFrame()
    a['Number of Missing Values'] = df.isnull().sum()
    a['Data type of variable'] = df.dtypes
    a['Number of Unique Values'] = df.nunique()
    print(a)

print('Initial Data Summary:')
# data_summary prints its report and returns None, so it is called directly
data_summary(df)

df = df.drop(['timestamp'], axis=1)

Initial Data Summary:
Number of Rows: 8522125, Columns: 4
            Number of Missing Values Data type of variable  \
item                               0                object   
reviewerID                         0                object   
rating                             0               float64   
timestamp                          0                 int64   

            Number of Unique Values  
item                         182032  
reviewerID                  3826085  
rating                            5  
timestamp                      7476  


<font size="4">The 10 reviewers with the most ratings in the initial set each have over 1,600 ratings.</font>

In [None]:
reviewers_top10 = df.groupby('reviewerID').size().sort_values(ascending=False)[:10]
print('Reviewers with highest number of ratings in initial set:')
print(reviewers_top10)

Reviewers with highest number of ratings in initial set:
reviewerID
AV6QDP8Q0ONK4     4101
A1GGOC9PVDXW7Z    2114
ABO2ZI2Y5DQ9T     2073
A328S9RN3U5M68    2059
A3MV1KKHX51FYT    1989
A2EDZH51XHFA9B    1842
A3LZGLA88K0LA0    1814
A16CZRQL23NOIW    1808
AIMR915K4YCN      1719
A2NJO6YE954DBH    1699
dtype: int64


<font size="4">Among the top 10 items in the initial set, the most-rated item has 24,554 ratings while the 10th most-rated item has 14,174 ratings.</font>

In [None]:
items_top10 = df.groupby('item').size().sort_values(ascending=False)[:10]
print('Items with highest number of ratings in initial set:')
print(items_top10)

Items with highest number of ratings in initial set:
item
B00YSG2ZPA    24554
B00006CXSS    24485
B00AQVMZKQ    21015
B01BHTSIOC    20889
B00NAQ3EOK    16857
6305837325    16671
B00WNBABVC    15205
B017S3OP7A    14795
B009934S5M    14481
B00FL31UF0    14174
dtype: int64


<font size="4">Since the data is sparse, a new integer id is created for `item` rather than using the initial string variable.</font>

In [None]:
value_counts = df['item'].value_counts(dropna=True, sort=True)
df1 = pd.DataFrame(value_counts)
df1 = df1.reset_index()
df1.columns = ['item_unique', 'counts'] 
df1 = df1.reset_index()
df1.rename(columns={'index': 'item_id'}, inplace=True)
df1 = df1.drop(['counts'], axis=1)
df = pd.merge(df, df1, how='left', left_on=['item'],
              right_on=['item_unique'])
df = df.drop_duplicates()
df = df.drop(['item_unique'], axis=1)

del value_counts, df1
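
<font size="4">The frequency-ranked id assignment above can be illustrated on a toy list using only the standard library. This is a minimal sketch rather than the code used in the project: `assign_frequency_ids` and the sample item codes are hypothetical, and ties are broken by first appearance, which may differ from how `value_counts` orders ties.</font>

```python
from collections import Counter


def assign_frequency_ids(values):
    """Map each distinct value to an integer id, where 0 is the most frequent value."""
    counts = Counter(values)
    # most_common() orders by descending count; equal counts keep first-seen order
    return {value: idx for idx, (value, _) in enumerate(counts.most_common())}


# Hypothetical item codes, not taken from the data set
items = ['B00A', 'B00B', 'B00A', 'B00C', 'B00A', 'B00B']
item_ids = assign_frequency_ids(items)
encoded = [item_ids[value] for value in items]

print(item_ids)
print(encoded)
```

<font size="4">As in the dataframe version, the most frequently rated item receives id 0, so low ids correspond to popular items.</font>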

<font size="4">The same process is used for `reviewerID`. A key matching the new integer variables to the original identifiers is then saved so that it can later be used to join the original data. For this set, the original string identifiers are then dropped.</font>

In [None]:
value_counts = df['reviewerID'].value_counts(dropna=True, sort=True)
df1 = pd.DataFrame(value_counts)
df1 = df1.reset_index()
df1.columns = ['id_unique', 'counts'] 
df1 = df1.reset_index()
df1.rename(columns={'index': 'reviewer_id'}, inplace=True)
df1 = df1.drop(['counts'], axis=1)
df = pd.merge(df, df1, how='left', left_on=['reviewerID'],
              right_on=['id_unique'])
df = df.drop_duplicates()
df = df.drop(['id_unique'], axis=1)

del value_counts, df1

df1 = df[['item', 'item_id', 'reviewerID', 'reviewer_id']]
df1.to_csv('Movies_and_TV_idMatch.csv', index=False)

del df1

df = df.drop(['item', 'reviewerID'], axis=1)

<font size="4">Due to sparsity, the data is then filtered to reviewers who have at least 25 ratings. This results in a set containing 1,113,396 ratings with 19,639 unique reviewers and 103,687 unique items. The majority of the ratings are 5 stars.</font>

In [None]:
reviewer_count = df.reviewer_id.value_counts()
df = df[df.reviewer_id.isin(reviewer_count[reviewer_count >= 25].index)]
df = df.drop_duplicates()

del reviewer_count

print('- Number of ratings after filtering: ', len(df))
print('- Number of unique reviewers: ', df['reviewer_id'].nunique())
print('- Number of unique items: ', df['item_id'].nunique())
for i in range(1,6):
  print('- Number of items with {0} rating = {1}'.format(i, df[df['rating'] == i].shape[0]))

- Number of ratings after filtering:  1113396
- Number of unique reviewers:  19639
- Number of unique items:  103687
- Number of items with 1 rating = 59470
- Number of items with 2 rating = 65558
- Number of items with 3 rating = 141436
- Number of items with 4 rating = 252584
- Number of items with 5 rating = 594348
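
<font size="4">The counts above are enough to quantify how sparse the filtered reviewer-by-item matrix is and how skewed the ratings are. The following is a small arithmetic sketch using only the numbers printed above; the variable names are chosen here for illustration.</font>

```python
# Counts reported in the filtered-set summary above
n_ratings = 1_113_396
n_reviewers = 19_639
n_items = 103_687
ratings_per_star = {1: 59_470, 2: 65_558, 3: 141_436, 4: 252_584, 5: 594_348}

# Fraction of the reviewer x item matrix that holds an observed rating
density = n_ratings / (n_reviewers * n_items)
sparsity = 1 - density

# Share of all ratings at each star level
star_share = {star: count / n_ratings for star, count in ratings_per_star.items()}

print(f'Density:  {density:.6%}')
print(f'Sparsity: {sparsity:.4%}')
for star, share in star_share.items():
    print(f'{star} star: {share:.1%}')
```

<font size="4">Even after filtering, well under 0.1% of the possible reviewer-item pairs have a rating, and slightly more than half of the ratings are 5 stars.</font>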


<font size="4">The 10 reviewers with the most ratings in the filtered set still each have over 1,600 ratings.</font>

In [None]:
reviewers_top10 = df.groupby('reviewer_id').size().sort_values(ascending=False)[:10]
print('Reviewers with highest number of ratings in filtered set:')
print(reviewers_top10)

del reviewers_top10

Reviewers with highest number of ratings in filtered set:
reviewer_id
0    3981
1    2068
2    1997
3    1986
4    1838
5    1811
6    1797
7    1733
8    1706
9    1634
dtype: int64


<font size="4">The top 10 items in the filtered set show a large reduction, with the most-rated item falling from 24,554 to 1,136 ratings and the 10th most-rated item falling from 14,174 to 853 ratings.</font>

In [None]:
items_top10 = df.groupby('item_id').size().sort_values(ascending=False)[:10]
print('Items with highest number of ratings filtered set:')
print(items_top10)

del items_top10

Items with highest number of ratings filtered set:
item_id
8     1136
14    1042
15    1040
13    1040
29     964
22     903
53     895
87     870
67     860
81     853
dtype: int64


## Create Recommendation Systems using Surprise

<font size="4">The data is loaded from a `pandas` dataframe using the `Reader` class prior to modeling. For the initial training of the models using `surprise`, the default parameters of `BaselineOnly`, `KNNBaseline`, `KNNBasic`, `KNNWithMeans`, `KNNWithZScore`, `CoClustering`, `SVD`, `SVDpp` (an extension of SVD that takes implicit ratings into account), `NMF`, and `NormalPredictor` were evaluated with the `cross_validate` method using 3-fold cross validation to determine which algorithm yielded the lowest `RMSE`. This revealed that `SVDpp` generated the lowest `RMSE`, but it took by far the longest to fit and test. The default parameters for `SVDpp` use 20 epochs for fitting the model, so experimenting with fewer epochs and other model parameters will reduce the runtime and may maintain a low `RMSE`. `KNNBaseline` produced a reasonably close `RMSE` with a significantly lower runtime, so hyperparameter tuning might allow this to be a better choice, especially given larger sample sizes.</font>

In [None]:
# Set path for results
path = r'D:\AmazonReviews\Models'
os.chdir(path)

# Load data using reader
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['reviewer_id', 'item_id', 'rating']], reader)
del df

# Iterate over all algorithms
print('Time for iterating through different algorithms..')
search_time_start = time.time()
benchmark = []
for algorithm in [BaselineOnly(), KNNBaseline(), KNNBasic(), KNNWithMeans(), 
                  KNNWithZScore(), CoClustering(), SVD(), SVDpp(), NMF(), 
                  NormalPredictor()]:
    # Cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, 
                             verbose=False, n_jobs=-1)
    
    # Model results
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0; pd.concat gives the same result
    tmp = pd.concat([tmp,
                     pd.Series([str(algorithm).split(' ')[0].split('.')[-1]],
                               index=['Algorithm'])])
    benchmark.append(tmp)
print('Finished iterating through different algorithms:',
      time.time() - search_time_start)

# Create df with results and save
surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')    
print('Results from testing different algorithms:')
print(surprise_results)
surprise_results.to_csv('MoviesTV_results_algorithms.csv')
del surprise_results

Time for iterating through different algorithms..
Finished iterating through different algorithms: 1877.5992422103882
Results from testing different algorithms:
                 test_rmse     fit_time  test_time
Algorithm                                         
SVDpp             0.951981  1449.569379  34.628475
SVD               0.956978    56.881580   3.054269
BaselineOnly      0.958829     0.541074   2.015631
KNNBaseline       0.986208    12.493270  27.932690
KNNWithMeans      0.996443    12.406364  24.509705
KNNWithZScore     1.002136    13.109460  26.066691
CoClustering      1.012724    18.731312   2.226075
NMF               1.051432    59.292935   2.782387
KNNBasic          1.106805    12.031695  23.257696
NormalPredictor   1.505521     0.629097   2.384375
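
<font size="4">For reference, the two error metrics used in this notebook can be written out directly. The following is a minimal sketch with made-up (actual, estimated) rating pairs, not project data; `rmse` and `mae` here are plain-Python stand-ins for the functions in `surprise.accuracy`.</font>

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, estimated) rating pairs."""
    return math.sqrt(sum((actual - est) ** 2 for actual, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (actual, estimated) rating pairs."""
    return sum(abs(actual - est) for actual, est in pairs) / len(pairs)


# Made-up (actual rating, estimated rating) pairs for illustration
pairs = [(5.0, 4.5), (3.0, 3.5), (1.0, 2.0), (4.0, 4.0)]

print('RMSE:', round(rmse(pairs), 4))
print('MAE: ', round(mae(pairs), 4))
```

<font size="4">Because the errors are squared before averaging, RMSE penalizes large misses more heavily than MAE, which is why the two metrics can rank models differently.</font>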


<font size="4">The filtered train and test sets are then read from file and loaded using the same `Reader`.</font>

In [None]:
# Set path for loading train/test
path = r'D:\AmazonReviews\Data'
os.chdir(path)

# Read train/test sets
train = pd.read_csv('train_filtered.csv', sep='|')
train.columns = ['item_id', 'reviewer_id', 'rating']

test = pd.read_csv('eval_filtered.csv', sep='|')
test.columns = ['item_id', 'reviewer_id', 'rating']

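# Load data using reader
train = Dataset.load_from_df(train[['reviewer_id', 'item_id', 'rating']], reader)
test = Dataset.load_from_df(test[['reviewer_id', 'item_id', 'rating']], reader)

# Build the Trainset expected by fit() and the list of
# (reviewer, item, rating) tuples expected by test()
train = train.build_full_trainset()
test = test.build_full_trainset().build_testset()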

### SVDpp with lowest RMSE
Fit the model with default parameters for 3 epochs and examine the RMSE on the train/test sets.

In [None]:
# Set path for results
path = r'D:\AmazonReviews\Models'
os.chdir(path)

print('Train/predict using SVDpp default parameters for 3 epochs:')
print('\n')
print('Time for iterating through SVDpp default parameters..')
search_time_start = time.time()
algo = SVDpp(n_epochs=3, random_state=seed_value)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False, 
                    n_jobs=-1)
print('Finished iterating through SVDpp default parameters:',
      time.time() - search_time_start)
print('\n')
print('Cross validation results:')
# Iterate over key/value pairs in cv results dict 
for key, value in cv.items():
    print(key, ' : ', value)
print('\n')

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVDpp_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./SVDpp_3epochs_DefaultParamModel_file')

Train/predict using SVDpp default parameters for 3 epochs:


Time for iterating through SVDpp default parameters..
Finished iterating through SVDpp default parameters: 259.01811718940735


Cross validation results:
test_rmse  :  [0.99186233 0.99067176 0.98925811]
test_mae  :  [0.75639429 0.75586697 0.75391987]
fit_time  :  (212.8136830329895, 213.71945691108704, 214.16111540794373)
test_time  :  (34.81285309791565, 35.151495695114136, 34.626320362091064)


RMSE from fit best parameters on train predict on test:
RMSE: 0.9811
0.9811266996736911


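<font size="4">To examine the results from the predictions, helper functions are defined that return the number of items rated by a given reviewer and the number of reviewers who rated a given item.</font>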

In [None]:
def get_Ir(reviewerID):
    """
    Determine the number of items rated by given reviewer
    Args: 
      reviewerID: the id of the reviewer
    Returns: 
      Number of items rated by the reviewer
    """
    try:
        return len(train.ur[train.to_inner_uid(reviewerID)])
    except ValueError: 
        return 0
    
def get_Ri(itemID):
    """ 
    Determine number of reviewers that rated given item
    Args:
      itemID: the id of the item
    Returns:
     Number of reviewers that have rated the item
    """
    try: 
        return len(train.ir[train.to_inner_iid(itemID)])
    except ValueError:
        return 0
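
<font size="4">`get_Ir` and `get_Ri` rely on the `ur` and `ir` mappings of a `surprise` trainset and return 0 for ids that were not seen during training. The same counting idea can be sketched with plain dictionaries; the toy ratings and the names `n_items_rated` and `n_reviewers_for` are hypothetical and only illustrate the logic.</font>

```python
from collections import defaultdict

# Toy (reviewer, item, rating) tuples standing in for the training set
ratings = [(0, 10, 5.0), (0, 11, 4.0), (1, 10, 3.0), (2, 12, 5.0), (2, 10, 1.0)]

items_by_reviewer = defaultdict(set)
reviewers_by_item = defaultdict(set)
for reviewer, item, _ in ratings:
    items_by_reviewer[reviewer].add(item)
    reviewers_by_item[item].add(reviewer)


def n_items_rated(reviewer):
    """Number of items rated by the reviewer; 0 if the reviewer is not in training."""
    return len(items_by_reviewer.get(reviewer, ()))


def n_reviewers_for(item):
    """Number of reviewers who rated the item; 0 if the item is not in training."""
    return len(reviewers_by_item.get(item, ()))


print(n_items_rated(0), n_items_rated(99))
print(n_reviewers_for(10), n_reviewers_for(99))
```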

<font size="4">Create a dataframe of the prediction results, apply the helper functions, and save the prediction results.</font>

In [None]:
df1 = pd.DataFrame(predictions, columns=['reviewerID', 'itemID', 'rui', 'est',
                                         'details']) 
df1['Iu'] = df1.reviewerID.apply(get_Ir)
df1['Ui'] = df1.itemID.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df1.to_csv('predictions_SVDpp_DefaultParamModel.csv')

<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df1.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
        reviewerID  itemID  rui  est                    details   Iu   Ui  err
2465          1901     286  5.0  5.0  {'was_impossible': False}   81  247  0.0
103694        2922     300  5.0  5.0  {'was_impossible': False}   59  190  0.0
140212       11094     444  5.0  5.0  {'was_impossible': False}   28  193  0.0
196650         198    1488  5.0  5.0  {'was_impossible': False}  284   47  0.0
34119         2454    2501  5.0  5.0  {'was_impossible': False}   62   43  0.0
14052         3719     151  5.0  5.0  {'was_impossible': False}   51  440  0.0
34137         3916    2600  5.0  5.0  {'was_impossible': False}   50   92  0.0
159982        1066      14  5.0  5.0  {'was_impossible': False}   97  837  0.0
64899         5448     147  5.0  5.0  {'was_impossible': False}   39  223  0.0
103569         882    5672  5.0  5.0  {'was_impossible': False}  121   79  0.0
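
<font size="4">Estimates of exactly 5.0 are plausible here because `surprise` clips predictions to the `rating_scale` given to the `Reader` by default, so a raw estimate above 5 is reported as 5.0 and the error for a 5-star rating becomes 0. The following sketch shows the clipping and the absolute error on made-up raw estimates; `clip_estimate` is a hypothetical helper, not a `surprise` function.</font>

```python
def clip_estimate(raw_estimate, rating_scale=(1.0, 5.0)):
    """Clamp a raw estimate to the bounds of the rating scale."""
    lower, upper = rating_scale
    return min(max(raw_estimate, lower), upper)


# Made-up (actual rating, raw estimate) pairs for illustration
examples = [(5.0, 5.37), (1.0, 5.12), (4.0, 3.6), (2.0, 0.4)]
for actual, raw in examples:
    est = clip_estimate(raw)
    err = abs(est - actual)
    print(f'actual={actual} raw={raw} clipped={est} err={err:.2f}')
```

<font size="4">The same mechanism bounds the largest possible error at 4.0 on a 1 to 5 scale, which is the case when a 1-star rating receives a clipped estimate of 5.0.</font>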


<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df1.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df1, best_predictions, worst_predictions

Worst 10 predictions:
        reviewerID  itemID  rui       est                    details   Iu  \
26621         4844     269  1.0  4.995164  {'was_impossible': False}   38   
148629          42     172  1.0  5.000000  {'was_impossible': False}  660   
44833         5789     409  1.0  5.000000  {'was_impossible': False}   41   
67997         8107     534  1.0  5.000000  {'was_impossible': False}   29   
29490          778    2691  1.0  5.000000  {'was_impossible': False}  122   
110642        1311     451  1.0  5.000000  {'was_impossible': False}   93   
119911        1331    1915  1.0  5.000000  {'was_impossible': False}   95   
109341         835    2752  1.0  5.000000  {'was_impossible': False}  116   
33420         4933     752  1.0  5.000000  {'was_impossible': False}   43   
118018        1666    1834  1.0  5.000000  {'was_impossible': False}   77   

         Ui       err  
26621   158  3.995164  
148629  588  4.000000  
44833   243  4.000000  
67997   262  4.000000  
29490    7

#### SVDpp HPO using Grid Search

<font size="4">Hyperparameter optimization using `GridSearchCV` was performed to find the best parameters. Since this algorithm is computationally expensive with gradient descent, 10 epochs were used. A larger number of factors than the default `n_factors=20` was evaluated, and the default values `lr_all=0.007` and `reg_all=0.02` were included in the search.</font>

<font size="4">Define the parameters for the grid search</font>



In [None]:
param_grid = {'n_epochs': [10],
              'n_factors': [30, 40, 50], 
              'lr_all': [7e-4, 7e-3, 7e-2], 
              'reg_all': [2e-4, 2e-3, 2e-2],
              'random_state': [seed_value]}
print('Grid search parameters:')
param_grid

Grid search parameters:


{'n_epochs': [10],
 'n_factors': [30, 40, 50],
 'lr_all': [0.0007, 0.007, 0.07],
 'reg_all': [0.0002, 0.002, 0.02],
 'random_state': [42]}
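
<font size="4">A grid search fits one model per parameter combination per fold, so the size of the grid determines the runtime. The following sketch enumerates the grid above with `itertools.product`; with 3 folds it gives 81 fits, which is consistent with the 81 tasks reported by the parallel backend in the search log. The enumeration order shown here is not claimed to match the internal ordering used by `GridSearchCV`.</font>

```python
from itertools import product

param_grid = {'n_epochs': [10],
              'n_factors': [30, 40, 50],
              'lr_all': [7e-4, 7e-3, 7e-2],
              'reg_all': [2e-4, 2e-3, 2e-2],
              'random_state': [42]}

# Every combination of one value per parameter
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

n_folds = 3
n_fits = len(combinations) * n_folds

print('Parameter combinations:', len(combinations))
print('Model fits with 3-fold cross validation:', n_fits)
```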

<font size="4">Run grid search with `rmse` and `mae` as the metrics. Then use the parameters that resulted in the lowest RMSE on the train/test sets.</font>



In [None]:
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])

Time for iterating grid search parameters..


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed: 98.9min


Finished iterating grid search parameters: 12332.196253061295


Lowest RMSE from Grid Search:
0.9525513633721562


Parameters of Model with lowest RMSE from Grid Search:
{'n_epochs': 10, 'n_factors': 30, 'lr_all': 0.007, 'reg_all': 0.02, 'random_state': 42}


[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed: 205.4min finished


In [None]:
# Save results to df
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)

print('SVDpp GridSearch HPO Cross Validation Results:')
print(results_df.head())
print('\n')
results_df.to_csv('SVDpp_gridSearch_cvResults.csv', index=False)

del results_df

SVDpp GridSearch HPO Cross Validation Results:
    split0_test_rmse  split1_test_rmse  split2_test_rmse  mean_test_rmse  \
5           0.955645          0.950836          0.951173        0.952551   
14          0.956360          0.951313          0.951621        0.953098   
4           0.956469          0.951755          0.951959        0.953394   
23          0.957312          0.952152          0.953361        0.954275   
13          0.957878          0.953556          0.953487        0.954974   

    std_test_rmse  rank_test_rmse  split0_test_mae  split1_test_mae  \
5        0.002192               1         0.697202         0.694840   
14       0.002310               2         0.698210         0.695600   
4        0.002176               3         0.694615         0.692085   
23       0.002204               4         0.698765         0.696435   
13       0.002054               5         0.695999         0.693928   

    split2_test_mae  mean_test_mae  std_test_mae  rank_test_mae  \
5 
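
<font size="4">The reported `mean_test_rmse` is the average of the per-split RMSE values, and the best parameter set is the row with the lowest mean. This can be checked by hand with the five rows printed above; the dictionary keys are the dataframe row labels and the values are copied from the output.</font>

```python
# Per-split test RMSE for the five best rows of the grid search output above
split_rmse = {
    5: (0.955645, 0.950836, 0.951173),
    14: (0.956360, 0.951313, 0.951621),
    4: (0.956469, 0.951755, 0.951959),
    23: (0.957312, 0.952152, 0.953361),
    13: (0.957878, 0.953556, 0.953487),
}

# Average across the three splits, then pick the row with the lowest mean
mean_rmse = {row: sum(scores) / len(scores) for row, scores in split_rmse.items()}
best_row = min(mean_rmse, key=mean_rmse.get)

for row, value in mean_rmse.items():
    print(f'row {row}: mean test RMSE = {value:.6f}')
print('Row with lowest mean RMSE:', best_row)
```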

<font size="4">Fit and predict on the best model, apply functions and save prediction results</font>

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVDpp_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./SVDpp_bestGrid_Model_file')
  
df1 = pd.DataFrame(predictions, columns=['reviewerID', 'itemID', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewerID.apply(get_Ir)
df1['Ui'] = df1.itemID.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df1.to_csv('predictions_SVDpp_gridSearch.csv')

RMSE from fit best parameters on train predict on test:
RMSE: 0.9466
0.9465595720332523


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df1.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
        reviewerID  itemID  rui  est                    details   Iu   Ui  err
66148         1115    3391  5.0  5.0  {'was_impossible': False}  101  112  0.0
68178        12878    5195  5.0  5.0  {'was_impossible': False}   25   29  0.0
68175         3186    9565  5.0  5.0  {'was_impossible': False}   51   28  0.0
175864        3773    2453  5.0  5.0  {'was_impossible': False}   47  137  0.0
213457        6532     420  5.0  5.0  {'was_impossible': False}   40  185  0.0
68171         9096     512  5.0  5.0  {'was_impossible': False}   27  168  0.0
68168        13356    2956  5.0  5.0  {'was_impossible': False}   25   75  0.0
68158         3830    8075  5.0  5.0  {'was_impossible': False}   47   46  0.0
68123         7200    2041  5.0  5.0  {'was_impossible': False}   31   52  0.0
68113         7108    7983  5.0  5.0  {'was_impossible': False}   38   24  0.0


<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df1.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df1, best_predictions, worst_predictions

Worst 10 predictions:
        reviewerID  itemID  rui  est                    details   Iu   Ui  err
110642        1311     451  1.0  5.0  {'was_impossible': False}   93  461  4.0
39171         3522     906  1.0  5.0  {'was_impossible': False}   51   46  4.0
67997         8107     534  1.0  5.0  {'was_impossible': False}   29  262  4.0
12513          804   23913  1.0  5.0  {'was_impossible': False}  117    4  4.0
11064        10089    2718  1.0  5.0  {'was_impossible': False}   28   90  4.0
132808        4212   11537  1.0  5.0  {'was_impossible': False}   43   35  4.0
53552         2548   10674  1.0  5.0  {'was_impossible': False}   68   41  4.0
157725       14167    5196  1.0  5.0  {'was_impossible': False}   25   29  4.0
80945        18177    2118  1.0  5.0  {'was_impossible': False}   20   61  4.0
204198        6386    3565  1.0  5.0  {'was_impossible': False}   32   34  4.0


#### SVDpp HPO using Grid Search - 20 Epochs More Parameters

<font size="4">Define the parameters for the grid search</font>



In [None]:
param_grid = {'n_epochs': [20],
              'n_factors': [10, 20, 30, 40, 50], 
              'lr_all': [7e-6, 7e-5, 7e-4, 7e-3, 7e-2], 
              'reg_all': [2e-4, 2e-3, 2e-2, 2e-1, 2e-0],
              'random_state': [seed_value]}
print('Grid search parameters:')
param_grid

Grid search parameters:


{'n_epochs': [20],
 'n_factors': [10, 20, 30, 40, 50],
 'lr_all': [7e-06, 7e-05, 0.0007, 0.007, 0.07],
 'reg_all': [0.0002, 0.002, 0.02, 0.2, 2.0],
 'random_state': [42]}

<font size="4">Run grid search with `rmse` and `mae` as the metrics. Then use the parameters that resulted in the lowest RMSE on the train/test sets.</font>



In [None]:
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])

Time for iterating grid search parameters..


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed: 103.0min
[Parallel(n_jobs=-1)]: Done 256 tasks      | elapsed: 791.0min


Finished iterating grid search parameters: 82490.94835019112


Lowest RMSE from Grid Search:
0.9487615218367708


Parameters of Model with lowest RMSE from Grid Search:
{'n_epochs': 20, 'n_factors': 10, 'lr_all': 0.007, 'reg_all': 0.02, 'random_state': 42}


[Parallel(n_jobs=-1)]: Done 375 out of 375 | elapsed: 1374.8min finished


In [None]:
# Save results to df
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)

print('SVDpp GridSearch HPO Cross Validation Results:')
print(results_df.head())
print('\n')
results_df.to_csv('SVDpp_gridSearch_cvResults_moreParams.csv', index=False)

del results_df

SVDpp GridSearch HPO Cross Validation Results:
    split0_test_rmse  split1_test_rmse  split2_test_rmse  mean_test_rmse  \
17          0.949363          0.950721          0.946200        0.948762   
18          0.951589          0.950517          0.947567        0.949891   
43          0.951702          0.950655          0.947704        0.950020   
68          0.951936          0.950781          0.947783        0.950167   
93          0.952030          0.950867          0.947989        0.950295   

    std_test_rmse  rank_test_rmse  split0_test_mae  split1_test_mae  \
17       0.001894               1         0.680528         0.681833   
18       0.001701               2         0.699633         0.698991   
43       0.001693               3         0.699909         0.699289   
68       0.001750               4         0.700300         0.699534   
93       0.001699               5         0.700566         0.699855   

    split2_test_mae  mean_test_mae  std_test_mae  rank_test_mae  \
17

<font size="4">Fit and predict on the best model, apply functions and save prediction results</font>

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVDpp_bestGrid_Model_moreParams_file', predictions, algo)
#predictions, algo = dump.load('./SVDpp_bestGrid_Model_moreParams_file')
    
df1 = pd.DataFrame(predictions, columns=['reviewerID', 'itemID', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewerID.apply(get_Ir)
df1['Ui'] = df1.itemID.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df1.to_csv('predictions_SVDpp_gridSearch_moreParams.csv')

RMSE from fit best parameters on train predict on test:
RMSE: 0.9439
0.9438520213385112


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df1.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
        reviewerID  itemID  rui  est                    details   Iu   Ui  err
199772        5707   12824  5.0  5.0  {'was_impossible': False}   42   27  0.0
183867        4328     807  5.0  5.0  {'was_impossible': False}   46  196  0.0
83935          865    2576  5.0  5.0  {'was_impossible': False}  113   65  0.0
83923         5238    1240  5.0  5.0  {'was_impossible': False}   48   63  0.0
166321       12530    1752  5.0  5.0  {'was_impossible': False}   21   81  0.0
14868         1166    9580  5.0  5.0  {'was_impossible': False}  101   24  0.0
122028         892    3886  5.0  5.0  {'was_impossible': False}  115   43  0.0
33111         2022      45  5.0  5.0  {'was_impossible': False}   75  635  0.0
33094         1772   38840  5.0  5.0  {'was_impossible': False}   73    3  0.0
122030       14191   16620  5.0  5.0  {'was_impossible': False}   20   15  0.0


<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df1.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df1, best_predictions, worst_predictions

Worst 10 predictions:
        reviewerID  itemID  rui  est                    details   Iu   Ui  err
98501        10283    5532  1.0  5.0  {'was_impossible': False}   27   34  4.0
12513          804   23913  1.0  5.0  {'was_impossible': False}  117    4  4.0
157725       14167    5196  1.0  5.0  {'was_impossible': False}   25   29  4.0
11064        10089    2718  1.0  5.0  {'was_impossible': False}   28   90  4.0
140176          38     109  1.0  5.0  {'was_impossible': False}  701  556  4.0
67222        12174    1319  1.0  5.0  {'was_impossible': False}   26   92  4.0
13116         3284    3483  1.0  5.0  {'was_impossible': False}   54  101  4.0
147671         835   96950  1.0  5.0  {'was_impossible': False}  116    1  4.0
119911        1331    1915  1.0  5.0  {'was_impossible': False}   95   63  4.0
192547        5214     139  1.0  5.0  {'was_impossible': False}   41  535  4.0


#### SVDpp HPO using Grid Search - 20 Epochs More Parameters Fewer Factors

<font size="4">Define the parameters for the grid search</font>



In [None]:
param_grid = {'n_epochs': [20],
              'n_factors': [5, 10, 15], 
              'lr_all': [7e-4, 7e-3, 7e-2], 
              'reg_all': [7e-2, 5e-2, 2e-2, 7e-1],
              'random_state': [seed_value]}
print('Grid search parameters:')
param_grid

Grid search parameters:


{'n_epochs': [20],
 'n_factors': [5, 10, 15],
 'lr_all': [0.0007, 0.007, 0.07],
 'reg_all': [0.07, 0.05, 0.02, 0.7],
 'random_state': [42]}

<font size="4">Run grid search with `rmse` and `mae` as the metrics. Then use the parameters that resulted in the lowest RMSE on the train/test sets.</font>



In [None]:
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])

Time for iterating grid search parameters..


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed: 87.7min


Finished iterating grid search parameters: 13742.905663013458


Lowest RMSE from Grid Search:
0.9454947143378393


Parameters of Model with lowest RMSE from Grid Search:
{'n_epochs': 20, 'n_factors': 5, 'lr_all': 0.007, 'reg_all': 0.05, 'random_state': 42}


[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 229.0min finished
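
<font size="4">The three grid searches differ considerably in size and cost. The following sketch compares them using the grid sizes defined in each search and the elapsed times printed in the logs; the per-fit figure is wall-clock time averaged over the whole parallel run with 16 workers, not the time of a single fit in isolation.</font>

```python
# Number of parameter combinations and elapsed seconds for each search
searches = {
    '10 epochs': {'n_combinations': 3 * 3 * 3, 'seconds': 12332.196253061295},
    '20 epochs, more parameters': {'n_combinations': 5 * 5 * 5, 'seconds': 82490.94835019112},
    '20 epochs, fewer factors': {'n_combinations': 3 * 3 * 4, 'seconds': 13742.905663013458},
}

n_folds = 3
summary = {}
for name, search in searches.items():
    n_fits = search['n_combinations'] * n_folds
    summary[name] = {
        'n_fits': n_fits,
        'hours': search['seconds'] / 3600,
        'seconds_per_fit': search['seconds'] / n_fits,
    }

for name, row in summary.items():
    print(f"{name}: {row['n_fits']} fits, {row['hours']:.1f} h, "
          f"{row['seconds_per_fit']:.0f} s per fit")
```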


In [None]:
# Save results to df
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)

print('SVDpp GridSearch HPO Cross Validation Results:')
print(results_df.head())
print('\n')
results_df.to_csv('SVDpp_gridSearch_cvResults_moreParamsLessFactors.csv', index=False)

del results_df

SVDpp GridSearch HPO Cross Validation Results:
    split0_test_rmse  split1_test_rmse  split2_test_rmse  mean_test_rmse  \
5           0.948897          0.943507          0.944079        0.945495   
4           0.948880          0.943676          0.944297        0.945618   
16          0.948938          0.943844          0.944304        0.945695   
28          0.948962          0.943712          0.944634        0.945769   
17          0.949299          0.943932          0.944247        0.945826   

    std_test_rmse  rank_test_rmse  split0_test_mae  split1_test_mae  \
5        0.002417               1         0.684832         0.682043   
4        0.002321               2         0.686716         0.684125   
16       0.002301               3         0.686884         0.684269   
28       0.002289               4         0.686872         0.684129   
17       0.002459               5         0.685130         0.682168   

    split2_test_mae  mean_test_mae  std_test_mae  rank_test_mae  \
5 
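
The summary columns in the cross-validation table can be reproduced from the per-fold columns. The sketch below recomputes `mean_test_rmse` and `std_test_rmse` for the top-ranked row from its three split values as printed, so the results agree with the table only up to rounding. Using the population standard deviation is an assumption that happens to match the printed value.

```python
from statistics import fmean, pstdev

# Per-fold test RMSE of the top-ranked row (index 5) in the table above.
split_rmse = [0.948897, 0.943507, 0.944079]

mean_rmse = fmean(split_rmse)
# Population standard deviation (divides by the number of folds).
std_rmse = pstdev(split_rmse)

print(f'mean_test_rmse ~ {mean_rmse:.6f}')
print(f'std_test_rmse  ~ {std_rmse:.6f}')
```

The standard deviation across folds is small compared with the gaps between the top-ranked rows, so the ranking among them should be read with some caution.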

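<font size="4">Fit the best model on the train set and predict on the test set. Then apply the `get_Ir` and `get_Ri` helper functions, compute the absolute error of each prediction, and save the results.</font>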

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVDpp_bestGrid_Model_moreParams_file', predictions, algo)
#predictions, algo = dump.load('./SVDpp_bestGrid_Model_file')
    
df1 = pd.DataFrame(predictions, columns=['reviewerID', 'itemID', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewerID.apply(get_Ir)
df1['Ui'] = df1.itemID.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df1.to_csv('predictions_SVDpp_gridSearch_moreParamsLessFactors.csv')

RMSE from fit best parameters on train predict on test:
RMSE: 0.9405
0.940543273193018
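
For reference, the two metrics used throughout can be written out directly. This is a minimal sketch of the standard RMSE and MAE definitions over (true rating, estimate) pairs. The pairs are hypothetical and are not taken from the notebook's predictions.

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Hypothetical (rui, est) pairs for illustration only.
toy = [(5.0, 4.5), (1.0, 3.0), (4.0, 4.0), (3.0, 2.5)]
print('RMSE:', round(rmse(toy), 4))
print('MAE :', round(mae(toy), 4))
```

Because RMSE squares each error before averaging, the single large miss in the toy data moves it well above the MAE.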


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df1.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
        reviewerID  itemID  rui  est                    details    Iu   Ui  \
194083        3613    1417  5.0  5.0  {'was_impossible': False}    50   80   
192688        6628    2512  5.0  5.0  {'was_impossible': False}    37   24   
42659         1066   23696  5.0  5.0  {'was_impossible': False}    97   19   
104939        1505    7867  5.0  5.0  {'was_impossible': False}    82   19   
149062         602    9291  5.0  5.0  {'was_impossible': False}   141   22   
42666         3437    4268  5.0  5.0  {'was_impossible': False}    54   43   
104935          13   20848  5.0  5.0  {'was_impossible': False}  1171   12   
192690        3709    8889  5.0  5.0  {'was_impossible': False}    51   12   
213285        9615    1405  5.0  5.0  {'was_impossible': False}    31   69   
42680           13    2069  5.0  5.0  {'was_impossible': False}  1171  157   

        err  
194083  0.0  
192688  0.0  
42659   0.0  
104939  0.0  
149062  0.0  
42666   0.0  
104935  0.0  
192690  
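
The `Iu` and `Ui` columns come from the `get_Ir` and `get_Ri` helpers, which are defined outside this section. The sketch below assumes they return the number of training ratings made by a user and the number of training ratings received by an item. The function names and the toy ratings here are hypothetical.

```python
from collections import Counter

# Hypothetical training ratings as (reviewerID, itemID, rating) tuples.
train_ratings = [
    (13, 2069, 5.0), (13, 20848, 5.0), (13, 1417, 4.0),
    (602, 2069, 3.0), (3613, 1417, 5.0),
]

items_per_user = Counter(user for user, _, _ in train_ratings)
users_per_item = Counter(item for _, item, _ in train_ratings)


def count_items_for_user(user_id):
    """Number of training ratings made by this user (0 if unseen)."""
    return items_per_user.get(user_id, 0)


def count_users_for_item(item_id):
    """Number of training ratings received by this item (0 if unseen)."""
    return users_per_item.get(item_id, 0)


print(count_items_for_user(13), count_users_for_item(2069))
```

Counts like these help when reading the best and worst predictions, since a large error on an item with very few ratings is less surprising than one on a heavily rated item.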

<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df1.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df1, best_predictions, worst_predictions

Worst 10 predictions:
        reviewerID  itemID  rui  est                    details   Iu   Ui  err
13116         3284    3483  1.0  5.0  {'was_impossible': False}   54  101  4.0
220688       16290    2647  1.0  5.0  {'was_impossible': False}   22   37  4.0
22368        11682    2998  1.0  5.0  {'was_impossible': False}   24   84  4.0
21561           38   34867  5.0  1.0  {'was_impossible': False}  701   12  4.0
12513          804   23913  1.0  5.0  {'was_impossible': False}  117    4  4.0
147671         835   96950  1.0  5.0  {'was_impossible': False}  116    1  4.0
44833         5789     409  1.0  5.0  {'was_impossible': False}   41  243  4.0
190931        8522    1795  1.0  5.0  {'was_impossible': False}   35  126  4.0
118238        8688   12360  1.0  5.0  {'was_impossible': False}   30   13  4.0
147695        9038    5276  1.0  5.0  {'was_impossible': False}   31   20  4.0
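
The best predictions all show `est` of exactly 5.0 and the worst show an `err` of exactly 4.0. One plausible explanation is that estimates are clipped to the 1 to 5 rating scale, which is the default behaviour of `predict` in Surprise. The sketch below illustrates this with hypothetical raw model outputs and shows why 4.0 is the largest possible error on this scale.

```python
def clip_to_scale(raw_estimate, low=1.0, high=5.0):
    """Clamp a raw model output to the rating scale."""
    return max(low, min(high, raw_estimate))


def abs_error(true_rating, raw_estimate):
    """Absolute error after clipping, as in the `err` column."""
    return abs(clip_to_scale(raw_estimate) - true_rating)


# Hypothetical raw outputs: an overshoot clips to 5.0, an undershoot to 1.0.
print(abs_error(5.0, 5.7))  # exact hit after clipping
print(abs_error(1.0, 5.7))  # largest possible miss on a 1-5 scale
print(abs_error(5.0, 0.2))
```

Under this reading, a zero error for a 5-star rating only shows that the raw estimate was at least 5.0. It does not show that the model predicted exactly 5.0.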


### SVD
Fit the model with the default parameters for 3 epochs, then examine the cross-validation results, the RMSE on the train/test sets, and the best and worst predictions.

In [None]:
print('Train/predict using SVD default parameters for 3 epochs:')
print('\n')
print('Time for iterating through SVD default parameters..')
search_time_start = time.time()
algo = SVD(n_epochs=3, random_state=seed_value)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False, 
                    n_jobs=-1)
print('Finished iterating through SVD default parameters:',
      time.time() - search_time_start)
print('\n')
print('Cross validation results:')
# Iterate over key/value pairs in cv results dict 
for key, value in cv.items():
    print(key, ' : ', value)
print('\n')

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVD_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./SVD_3epochs_DefaultParamModel_file')
 
df1 = pd.DataFrame(predictions, columns=['reviewerID', 'itemID', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewerID.apply(get_Ir)
df1['Ui'] = df1.itemID.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)   
df1.to_csv('predictions_SVD_DefaultParamModel.csv')

Train/predict using SVD default parameters for 3 epochs:


Time for iterating through SVD default parameters..
Finished iterating through SVD default parameters: 21.718221426010132


Cross validation results:
test_rmse  :  [1.01170944 1.01306478 1.00997506]
test_mae  :  [0.77944209 0.78081871 0.77882791]
fit_time  :  (9.072836875915527, 8.729201793670654, 8.824720859527588)
test_time  :  (2.9705753326416016, 3.0572056770324707, 2.9552011489868164)


RMSE from fit best parameters on train predict on test:
RMSE: 0.9995
0.9995495492030135
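
To make the role of `n_epochs`, `lr_all`, and `reg_all` concrete, here is a minimal pure-Python sketch of the biased matrix factorization that `SVD` is based on, with the estimate r̂_ui = μ + b_u + b_i + q_i · p_u trained by stochastic gradient descent. It is an illustration under simplifying assumptions (tiny hypothetical data, two factors), not Surprise's implementation.

```python
import random


def train_biased_mf(ratings, n_factors=2, n_epochs=3, lr=0.005, reg=0.02, seed=42):
    """SGD on r_hat = mu + b_u + b_i + q_i . p_u over (user, item, rating) tuples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    # Small random factors with mean 0 and standard deviation 0.1.
    pu = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    qi = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(p * q for p, q in zip(pu[u], qi[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # Move each parameter along the error, shrunk by regularization.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                p, q = pu[u][f], qi[i][f]
                pu[u][f] += lr * (err * q - reg * p)
                qi[i][f] += lr * (err * p - reg * q)
    return mu, bu, bi, pu, qi


def estimate(model, u, i):
    """Predicted rating, clipped to the 1-5 scale."""
    mu, bu, bi, pu, qi = model
    dot = sum(p * q for p, q in zip(pu[u], qi[i]))
    return max(1.0, min(5.0, mu + bu[u] + bi[i] + dot))


def squared_error(model, ratings):
    return sum((r - estimate(model, u, i)) ** 2 for u, i, r in ratings)


# Hypothetical ratings for illustration only.
toy = [('a', 'x', 5.0), ('a', 'y', 4.0), ('b', 'x', 4.0),
       ('b', 'y', 2.0), ('c', 'x', 1.0), ('c', 'z', 2.0)]
untrained = train_biased_mf(toy, n_epochs=0)
trained = train_biased_mf(toy, n_epochs=50)
print('squared error before:', round(squared_error(untrained, toy), 3))
print('squared error after :', round(squared_error(trained, toy), 3))
```

With only 3 epochs and a small learning rate the parameters barely move from their starting values, which is consistent with the default 3-epoch run scoring worse than the tuned models.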


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df1.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
        reviewerID  itemID  rui  est                    details    Iu   Ui  \
0             4911       0  5.0  5.0  {'was_impossible': False}    41  334   
163930        1279    1286  5.0  5.0  {'was_impossible': False}    92  102   
210635          55     959  5.0  5.0  {'was_impossible': False}   605  186   
21967           10    1683  5.0  5.0  {'was_impossible': False}  1262  122   
195113         120   11155  5.0  5.0  {'was_impossible': False}   392   15   
2540          1761    4244  5.0  5.0  {'was_impossible': False}    76   61   
163781         313    1893  5.0  5.0  {'was_impossible': False}   216   71   
116666         192     407  5.0  5.0  {'was_impossible': False}   286  258   
54574          142    1238  5.0  5.0  {'was_impossible': False}   336  212   
163763        8633     526  5.0  5.0  {'was_impossible': False}    31  186   

        err  
0       0.0  
163930  0.0  
210635  0.0  
21967   0.0  
195113  0.0  
2540    0.0  
163781  0.0  
116666  

<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df1.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df1, best_predictions, worst_predictions

Worst 10 predictions:
        reviewerID  itemID  rui       est                    details    Iu  \
217222         246      92  1.0  4.974257  {'was_impossible': False}   248   
21188          291    3674  1.0  4.989281  {'was_impossible': False}   224   
83087        17415    1672  1.0  5.000000  {'was_impossible': False}    21   
185192          10     833  1.0  5.000000  {'was_impossible': False}  1262   
98948         3522     906  1.0  5.000000  {'was_impossible': False}    55   
184013         439    2453  1.0  5.000000  {'was_impossible': False}   175   
201875        3590     608  1.0  5.000000  {'was_impossible': False}    53   
71015         4132    1558  1.0  5.000000  {'was_impossible': False}    48   
135268          10     536  1.0  5.000000  {'was_impossible': False}  1262   
17450          243    3035  5.0  1.000000  {'was_impossible': False}   261   

         Ui       err  
217222  151  3.974257  
21188    92  3.989281  
83087   169  4.000000  
185192  187  4.000000  

#### SVD HPO using Grid Search

<font size="4">Define the parameter grid for the grid search.</font>



In [None]:
param_grid = {'n_epochs': [30, 35, 40, 45, 50, 55, 60, 65, 70], 
              'n_factors': [20, 25, 30, 35, 40, 45, 50], 
              'lr_all': [0.002, 0.003, 0.004, 0.005, 0.006, 0.007], 
              'reg_all': [0.0001, 0.001, 0.01, 0.02, 0.03, 0.04],
              'random_state': [seed_value]}
print('SVD HPO Grid search parameters:')
param_grid

SVD HPO Grid search parameters:


{'n_epochs': [30, 35, 40, 45, 50, 55, 60, 65, 70],
 'n_factors': [20, 25, 30, 35, 40, 45, 50],
 'lr_all': [0.002, 0.003, 0.004, 0.005, 0.006, 0.007],
 'reg_all': [0.0001, 0.001, 0.01, 0.02, 0.03, 0.04],
 'random_state': [42]}

<font size="4">Run the grid search with 3-fold cross-validation, using `rmse` and `mae` as the metrics. Then fit the parameters with the lowest mean RMSE on the train set and evaluate them on the test set.</font>



In [None]:
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])

Time for iterating grid search parameters..


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 256 tasks      | elapsed: 18.0min
[Parallel(n_jobs=-1)]: Done 616 tasks      | elapsed: 48.1min
[Parallel(n_jobs=-1)]: Done 1120 tasks      | elapsed: 90.8min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed: 153.8min
[Parallel(n_jobs=-1)]: Done 2560 tasks      | elapsed: 238.1min
[Parallel(n_jobs=-1)]: Done 3496 tasks      | elapsed: 350.8min
[Parallel(n_jobs=-1)]: Done 4576 tasks      | elapsed: 501.9min
[Parallel(n_jobs=-1)]: Done 5800 tasks      | elapsed: 687.2min


Finished iterating grid search parameters: 51872.37151479721


Lowest RMSE from Grid Search:
0.9470577657248332


Parameters of Model with lowest RMSE from Grid Search:
{'n_epochs': 65, 'n_factors': 20, 'lr_all': 0.002, 'reg_all': 0.04, 'random_state': 42}


[Parallel(n_jobs=-1)]: Done 6804 out of 6804 | elapsed: 864.5min finished


In [None]:
# Save results to df
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)
print('SVD GridSearch HPO Cross Validation Results:')
print(results_df.head())
print('\n')
results_df.to_csv('SVD_gridSearch_cvResults.csv', index=False)

del results_df

SVD GridSearch HPO Cross Validation Results:
      split0_test_rmse  split1_test_rmse  split2_test_rmse  mean_test_rmse  \
1769          0.947837          0.947292          0.946044        0.947058   
515           0.947919          0.947282          0.946094        0.947098   
1517          0.947917          0.947291          0.946087        0.947098   
767           0.947869          0.947354          0.946098        0.947107   
17            0.947931          0.947284          0.946113        0.947109   

      std_test_rmse  rank_test_rmse  split0_test_mae  split1_test_mae  \
1769       0.000751               1         0.684882         0.684427   
515        0.000756               2         0.685845         0.685306   
1517       0.000760               3         0.685833         0.685298   
767        0.000744               4         0.684506         0.684084   
17         0.000752               5         0.685866         0.685321   

      split2_test_mae  mean_test_mae  std_test_

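<font size="4">Fit the best model on the train set and predict on the test set. Then apply the `get_Ir` and `get_Ri` helper functions, compute the absolute error of each prediction, and save the results.</font>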

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
print('\n')
dump.dump('./SVD_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./SVD_bestGrid_Model_file')

df1 = pd.DataFrame(predictions, columns=['reviewerID', 'itemID', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewerID.apply(get_Ir)
df1['Ui'] = df1.itemID.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)     
df1.to_csv('predictions_SVD_gridSearch.csv')

RMSE from fit best parameters on train predict on test:
RMSE: 0.9426
0.9426279412758342




<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df1.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
        reviewerID  itemID  rui  est                    details   Iu   Ui  err
176329       16449     806  5.0  5.0  {'was_impossible': False}   18  206  0.0
13930         1410    1456  5.0  5.0  {'was_impossible': False}   84  122  0.0
38632        11133   21418  5.0  5.0  {'was_impossible': False}   27   15  0.0
197661        4184    6502  5.0  5.0  {'was_impossible': False}   47   23  0.0
38643         4880     362  5.0  5.0  {'was_impossible': False}   44  336  0.0
97976        10433     577  5.0  5.0  {'was_impossible': False}   25  245  0.0
97971        14696     435  5.0  5.0  {'was_impossible': False}   23  188  0.0
38631         3128   33032  5.0  5.0  {'was_impossible': False}   56    6  0.0
97969          192     918  5.0  5.0  {'was_impossible': False}  293  154  0.0
197652         835   22792  5.0  5.0  {'was_impossible': False}  131   14  0.0


<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df1.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

# Keep data, train, and test: the BaselineOnly and KNNBaseline cells below reuse them
del predictions, df1, best_predictions, worst_predictions

Worst 10 predictions:
        reviewerID  itemID  rui  est                    details   Iu   Ui  err
12813         6703      22  1.0  5.0  {'was_impossible': False}   36  739  4.0
114206        1053    4884  1.0  5.0  {'was_impossible': False}  109   25  4.0
113120        6928   22390  1.0  5.0  {'was_impossible': False}   34   11  4.0
77728         1563   23406  1.0  5.0  {'was_impossible': False}   85   11  4.0
80302          705      50  1.0  5.0  {'was_impossible': False}  133  646  4.0
57297        18154    1821  1.0  5.0  {'was_impossible': False}   21   47  4.0
121154        2098   45548  1.0  5.0  {'was_impossible': False}   70    5  4.0
99084        12373    4451  1.0  5.0  {'was_impossible': False}   24   39  4.0
907           5443   24721  1.0  5.0  {'was_impossible': False}   40    4  4.0
221274        5108    1178  1.0  5.0  {'was_impossible': False}   42  152  4.0


### BaselineOnly
Fit the model with the default parameters for 3 epochs using `method='als'`, then examine the RMSE on the train/test sets.

In [None]:
print('Train/predict using Baseline default parameters with Alternating Least Squares for 3 epochs:')
print('\n')
bsl_options = {'method': 'als',
               'n_epochs': 3}
print('Baselines estimates configuration:')
print(bsl_options)
print('\n')
print('Model parameters:')
print(BaselineOnly(bsl_options=bsl_options)) 

Train/predict using Baseline default parameters with Alternating Least Squares for 3 epochs:


Baselines estimates configuration:
{'method': 'als', 'n_epochs': 3}


Model parameters:
<surprise.prediction_algorithms.baseline_only.BaselineOnly object at 0x000002F2AB988C40>


<font size="4">Cross-validate the default model, fit it on the train set and predict on the test set. Then apply the helper functions and save the prediction results.</font>

In [None]:
print('Time for iterating through Baseline default parameters epochs=3 using ALS..')
search_time_start = time.time()
algo = BaselineOnly(bsl_options=bsl_options)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False, 
                    n_jobs=-1)
print('Finished iterating through Baseline default parameters epochs=3 using ALS:',
      time.time() - search_time_start)
print('\n')
print('Cross validation results:')
# Iterate over key/value pairs in cv results dict 
for key, value in cv.items():
    print(key, ' : ', value)

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./Baseline_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./Baseline_3epochs_DefaultParamModel_file')
 
df1 = pd.DataFrame(predictions, columns=['reviewerID', 'itemID', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewerID.apply(get_Ir)
df1['Ui'] = df1.itemID.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)  
df1.to_csv('predictions_Baseline_DefaultParamModel.csv')

Time for iterating through Baseline default parameters epochs=3 using ALS..
Finished iterating through Baseline default parameters epochs=3 using ALS: 13.459791421890259


Cross validation results:
test_rmse  :  [0.96003615 0.95827913 0.95906949]
test_mae  :  [0.7179837  0.71734625 0.7175912 ]
fit_time  :  (0.3123772144317627, 0.36087560653686523, 0.28015828132629395)
test_time  :  (2.41690731048584, 2.252049207687378, 2.211479425430298)
Estimating biases using als...
RMSE from fit best parameters on train predict on test:
RMSE: 0.9504
0.9504451345236158
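
`BaselineOnly` predicts r̂_ui = μ + b_u + b_i, and with `method='als'` the two sets of biases are estimated in turn. The sketch below follows the alternating update described in the Surprise documentation, where `reg_i` and `reg_u` shrink the item and user biases toward zero. The default values of 10 and 15 used here are what the documentation lists for ALS as far as I know, and the data is hypothetical.

```python
from collections import defaultdict


def als_baselines(ratings, n_epochs=3, reg_u=15, reg_i=10):
    """Estimate mu, b_u and b_i by alternating least squares."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    bu = {u: 0.0 for u in by_user}
    bi = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Update item biases with user biases held fixed.
        for i, rated in by_item.items():
            bi[i] = sum(r - mu - bu[u] for u, r in rated) / (reg_i + len(rated))
        # Update user biases with item biases held fixed.
        for u, rated in by_user.items():
            bu[u] = sum(r - mu - bi[i] for i, r in rated) / (reg_u + len(rated))
    return mu, bu, bi


# Hypothetical ratings: two users rate the same item.
toy = [('a', 'x', 5.0), ('b', 'x', 3.0)]
mu, bu, bi = als_baselines(toy, n_epochs=1)
print(mu, bu['a'], bi['x'])
```

With regularization of 15 and a single rating, the user bias is shrunk to 1/16 of the raw deviation. This suggests why smaller `reg_u` and `reg_i` values can fit the data more closely when users and items have few ratings.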


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df1.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
        reviewerID  itemID  rui  est                    details    Iu   Ui  \
140170        5718     164  5.0  5.0  {'was_impossible': False}    38  343   
118833        8919     357  5.0  5.0  {'was_impossible': False}    32  273   
13622         6881     738  5.0  5.0  {'was_impossible': False}    33  279   
209389        5673    1597  5.0  5.0  {'was_impossible': False}    38  143   
59221          204     396  5.0  5.0  {'was_impossible': False}   276  324   
114384          10    3429  5.0  5.0  {'was_impossible': False}  1297   51   
165059       16405    5977  5.0  5.0  {'was_impossible': False}    20   46   
132724         139   39056  5.0  5.0  {'was_impossible': False}   326    7   
209394        5069    6451  5.0  5.0  {'was_impossible': False}    42   24   
92324         1992   12968  5.0  5.0  {'was_impossible': False}    68   25   

        err  
140170  0.0  
118833  0.0  
13622   0.0  
209389  0.0  
59221   0.0  
114384  0.0  
165059  0.0  
132724  

<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df1.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df1, best_predictions, worst_predictions

Worst 10 predictions:
        reviewerID  itemID  rui  est                    details    Iu   Ui  \
35996         1955   25056  1.0  5.0  {'was_impossible': False}    71   14   
219408         969     798  1.0  5.0  {'was_impossible': False}   107  212   
39115         1331    1915  1.0  5.0  {'was_impossible': False}    94   57   
22671           10    6563  1.0  5.0  {'was_impossible': False}  1297   26   
25381        17531     890  1.0  5.0  {'was_impossible': False}    20  176   
50579         1696    1571  1.0  5.0  {'was_impossible': False}    78   69   
76566         2706   21086  1.0  5.0  {'was_impossible': False}    61   17   
140738        5260     678  1.0  5.0  {'was_impossible': False}    46   51   
50948          317   80685  1.0  5.0  {'was_impossible': False}   206    1   
133395          10    1999  1.0  5.0  {'was_impossible': False}  1297   28   

        err  
35996   4.0  
219408  4.0  
39115   4.0  
22671   4.0  
25381   4.0  
50579   4.0  
76566   4.0  
140738 

#### BaselineOnly HPO using Grid Search

<font size="4">Define the parameter grid for the grid search.</font>



In [None]:
print('BaselineOnly HPO using Grid Search Minimized:')
print('\n')
param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                             'n_epochs': [5, 10, 15, 20],
                             'reg_u': [5, 10, 15, 20],
                             'reg_i': [5, 10, 15, 20]}}
print('Grid search parameters:')
param_grid          

BaselineOnly HPO using Grid Search Minimized:


Grid search parameters:


{'bsl_options': {'method': ['als', 'sgd'],
  'n_epochs': [5, 10, 15, 20],
  'reg_u': [5, 10, 15, 20],
  'reg_i': [5, 10, 15, 20]}}
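
Because `bsl_options` is a nested dictionary, the grid search has to expand the inner lists into individual option dictionaries. The sketch below mimics that expansion to count the candidates. It also counts how many `sgd` candidates differ only in `reg_u` and `reg_i`, on the assumption (taken from the Surprise documentation on baseline configuration, and worth verifying against the installed version) that the SGD method reads `reg` and `learning_rate` rather than `reg_u` and `reg_i`. The helper name `expand_nested` is hypothetical.

```python
from itertools import product

param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                              'n_epochs': [5, 10, 15, 20],
                              'reg_u': [5, 10, 15, 20],
                              'reg_i': [5, 10, 15, 20]}}


def expand_nested(options):
    """Turn a dict of lists into a list of option dicts."""
    keys = list(options)
    return [dict(zip(keys, values)) for values in product(*options.values())]


bsl_candidates = expand_nested(param_grid['bsl_options'])
n_fits = len(bsl_candidates) * 3  # cv=3

sgd_candidates = [c for c in bsl_candidates if c['method'] == 'sgd']
# Under the stated assumption, only n_epochs changes the SGD fit.
distinct_sgd = {c['n_epochs'] for c in sgd_candidates}

print(len(bsl_candidates), 'candidates,', n_fits, 'fits')
print(len(sgd_candidates), 'sgd candidates,', len(distinct_sgd), 'distinct settings')
```

If that assumption holds, searching `als` and `sgd` in separate grids, each with its own regularization keys, would avoid refitting equivalent SGD models.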

<font size="4">Run the grid search with 3-fold cross-validation, using `rmse` and `mae` as the metrics. Then fit the parameters with the lowest mean RMSE on the train set and evaluate them on the test set.</font>



In [None]:
gs = GridSearchCV(BaselineOnly, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])

Time for iterating grid search parameters..


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:   45.3s
[Parallel(n_jobs=-1)]: Done 256 tasks      | elapsed:  4.5min


Finished iterating grid search parameters: 410.2233831882477


Lowest RMSE from Grid Search:
0.9453093755322367


Parameters of Model with lowest RMSE from Grid Search:
{'bsl_options': {'method': 'als', 'n_epochs': 20, 'reg_u': 5, 'reg_i': 5}}


[Parallel(n_jobs=-1)]: Done 384 out of 384 | elapsed:  6.7min finished


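<font size="4">Fit the best model on the train set and predict on the test set. Then apply the `get_Ir` and `get_Ri` helper functions, compute the absolute error of each prediction, and save the results.</font>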

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./BaselineOnly_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./BaselineOnly_bestGrid_Model_file')
   
df1 = pd.DataFrame(predictions, columns=['reviewerID', 'itemID', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewerID.apply(get_Ir)
df1['Ui'] = df1.itemID.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df1.to_csv('predictions_BaselineOnly_gridSearch.csv')

Estimating biases using als...
RMSE from fit best parameters on train predict on test:
RMSE: 0.9418
0.9418294572167315


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df1.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
        reviewerID  itemID  rui  est                    details   Iu   Ui  err
191694       10879    2578  5.0  5.0  {'was_impossible': False}   28   30  0.0
50481        10448     300  5.0  5.0  {'was_impossible': False}   28  190  0.0
12633         9398    1087  5.0  5.0  {'was_impossible': False}   31   98  0.0
82689          869     306  5.0  5.0  {'was_impossible': False}  120   52  0.0
82686         4469     270  5.0  5.0  {'was_impossible': False}   52  459  0.0
174741        1283    2020  5.0  5.0  {'was_impossible': False}   91   88  0.0
25555         6659    7980  5.0  5.0  {'was_impossible': False}   36   11  0.0
82667          464    2305  5.0  5.0  {'was_impossible': False}  185   77  0.0
82663          164    4373  5.0  5.0  {'was_impossible': False}  314   54  0.0
174729         484    8362  5.0  5.0  {'was_impossible': False}  172   18  0.0


<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df1.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df1, best_predictions, worst_predictions

Worst 10 predictions:
        reviewerID  itemID  rui  est                    details    Iu   Ui  \
33420         4933     752  1.0  5.0  {'was_impossible': False}    43  209   
108646        8302      20  1.0  5.0  {'was_impossible': False}    32   53   
11064        10089    2718  1.0  5.0  {'was_impossible': False}    28   90   
118238        8688   12360  1.0  5.0  {'was_impossible': False}    30   13   
56860         2159    3181  5.0  1.0  {'was_impossible': False}    75   53   
17953        11474   10529  1.0  5.0  {'was_impossible': False}    25   42   
199058       10055     577  1.0  5.0  {'was_impossible': False}    27  218   
11370         1747    1823  1.0  5.0  {'was_impossible': False}    76   72   
198775          10    2632  1.0  5.0  {'was_impossible': False}  1270  127   
99311         1094   12926  1.0  5.0  {'was_impossible': False}   102   18   

        err  
33420   4.0  
108646  4.0  
11064   4.0  
118238  4.0  
56860   4.0  
17953   4.0  
199058  4.0  
11370  

#### BaselineOnly HPO using Grid Search - More Params 

In [None]:
print('BaselineOnly HPO using Grid Search Minimized:')
print('\n')
param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                             'n_epochs': [10, 15, 20, 25, 30, 35, 40, 45, 50],
                             'reg_u': [0, 1, 2, 3, 4, 5, 10, 15, 20],
                             'reg_i': [0, 1, 2, 3, 4, 5, 10, 15, 20]}}
print('Grid search parameters:')
param_grid          

BaselineOnly HPO using Grid Search Minimized:


Grid search parameters:


{'bsl_options': {'method': ['als', 'sgd'],
  'n_epochs': [10, 15, 20, 25, 30, 35, 40, 45, 50],
  'reg_u': [0, 1, 2, 3, 4, 5, 10, 15, 20],
  'reg_i': [0, 1, 2, 3, 4, 5, 10, 15, 20]}}

<font size="4">Run the grid search with 3-fold cross-validation, using `rmse` and `mae` as the metrics. Then fit the parameters with the lowest mean RMSE on the train set and evaluate them on the test set.</font>



In [None]:
gs = GridSearchCV(BaselineOnly, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])

Time for iterating grid search parameters..


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:   45.6s
[Parallel(n_jobs=-1)]: Done 256 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done 616 tasks      | elapsed: 10.5min
[Parallel(n_jobs=-1)]: Done 1120 tasks      | elapsed: 19.6min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed: 32.1min
[Parallel(n_jobs=-1)]: Done 2560 tasks      | elapsed: 46.8min
[Parallel(n_jobs=-1)]: Done 3496 tasks      | elapsed: 63.2min


Finished iterating grid search parameters: 4817.081090688705


Lowest RMSE from Grid Search:
0.9429468447894139


Parameters of Model with lowest RMSE from Grid Search:
{'bsl_options': {'method': 'als', 'n_epochs': 50, 'reg_u': 1, 'reg_i': 3}}


[Parallel(n_jobs=-1)]: Done 4374 out of 4374 | elapsed: 80.2min finished


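<font size="4">Fit the best model on the train set and predict on the test set. Then apply the `get_Ir` and `get_Ri` helper functions, compute the absolute error of each prediction, and save the results.</font>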

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./BaselineOnly_moreParams_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./BaselineOnly_moreParams_bestGrid_Model_file')
    
df1 = pd.DataFrame(predictions, columns=['reviewerID', 'itemID', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewerID.apply(get_Ir)
df1['Ui'] = df1.itemID.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df1.to_csv('predictions_BaselineOnly_moreParams_gridSearch.csv')

Estimating biases using als...
RMSE from fit best parameters on train predict on test:
RMSE: 0.9400
0.9400458413742169
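
The test-set RMSE values reported so far in this section can be collected and ranked side by side. The scores below are copied from the outputs above, rounded to four decimals. The labels are shorthand introduced here for readability.

```python
# Test-set RMSE values reported in the outputs of this section.
test_rmse = {
    'SVDpp (grid search)': 0.9405,
    'SVD (default, 3 epochs)': 0.9995,
    'SVD (grid search)': 0.9426,
    'BaselineOnly (default ALS, 3 epochs)': 0.9504,
    'BaselineOnly (grid search)': 0.9418,
    'BaselineOnly (grid search, more params)': 0.9400,
}

# Lower RMSE is better, so sort ascending.
ranked = sorted(test_rmse.items(), key=lambda kv: kv[1])
for name, score in ranked:
    print(f'{score:.4f}  {name}')
```

The tuned models fall within about 0.003 RMSE of each other, while their grid searches took from roughly 7 minutes (first BaselineOnly grid) to over 14 hours (SVD). Search cost is therefore worth weighing alongside accuracy.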


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df1.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
        reviewerID  itemID  rui  est                    details   Iu   Ui  err
111339       14872     757  5.0  5.0  {'was_impossible': False}   19  107  0.0
204037        1218    7653  5.0  5.0  {'was_impossible': False}   98   32  0.0
74750        12389   18648  5.0  5.0  {'was_impossible': False}   29   17  0.0
158992       11073     306  5.0  5.0  {'was_impossible': False}   22   52  0.0
204044        5800     366  5.0  5.0  {'was_impossible': False}   38  365  0.0
204054       16798    2016  5.0  5.0  {'was_impossible': False}   22   50  0.0
158983        7591   13388  5.0  5.0  {'was_impossible': False}   33   12  0.0
158982       16037   20893  5.0  5.0  {'was_impossible': False}   18    6  0.0
204055         476    4689  5.0  5.0  {'was_impossible': False}  174   25  0.0
158978        2180      87  5.0  5.0  {'was_impossible': False}   66  694  0.0


<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df1.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df1, best_predictions, worst_predictions

Worst 10 predictions:
        reviewerID  itemID  rui  est                    details   Iu   Ui  err
162516        6479     148  1.0  5.0  {'was_impossible': False}   40  450  4.0
220011       15523    6988  1.0  5.0  {'was_impossible': False}   19   21  4.0
126619        5904    1667  1.0  5.0  {'was_impossible': False}   36  105  4.0
99713         3612   65249  1.0  5.0  {'was_impossible': False}   44    1  4.0
99311         1094   12926  1.0  5.0  {'was_impossible': False}  102   18  4.0
13116         3284    3483  1.0  5.0  {'was_impossible': False}   54  101  4.0
54348        11448   61254  1.0  5.0  {'was_impossible': False}   29    4  4.0
108646        8302      20  1.0  5.0  {'was_impossible': False}   32   53  4.0
26866         4268    2680  1.0  5.0  {'was_impossible': False}   45   14  4.0
96274          458    7802  1.0  5.0  {'was_impossible': False}  163   22  4.0


### KNNBaseline
Fit the model with the default parameters, estimating the baselines with `method='als'` for 3 epochs, then examine the RMSE on the train/test sets.

In [None]:
print('Train/predict using KNNBaseline default parameters with Alternating Least Squares for 3 epochs:')
print('\n')
bsl_options = {'method': 'als',
               'n_epochs': 3}
print('Baselines estimates configuration:')
print(bsl_options)
print('\n')
print('Model parameters:')
print('KNNBaseline(k=40, min_k=1, bsl_options=bsl_options, verbose=False)') 

Train/predict using KNNBaseline default parameters with Alternating Least Squares for 3 epochs:


Baselines estimates configuration:
{'method': 'als', 'n_epochs': 3}


Model parameters:
KNNBaseline(k=40, min_k=1, bsl_options=bsl_options, verbose=False)


<font size="4">Cross-validate the default model, then fit it on the train set and predict on the test set. Apply the helper functions and save the prediction results.</font>

In [None]:
print('Time for iterating through KNNBaseline default parameters epochs=3 using ALS..')
search_time_start = time.time()
algo = KNNBaseline(bsl_options=bsl_options)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False, 
                    n_jobs=-1)
print('Finished iterating through KNNBaseline default parameters epochs=3 using ALS:',
      time.time() - search_time_start)
print('\n')
print('Cross validation results:')
# Iterate over key/value pairs in cv results dict 
for key, value in cv.items():
    print(key, ' : ', value)

Time for iterating through KNNBaseline default parameters epochs=3 using ALS..
Finished iterating through KNNBaseline default parameters epochs=3 using ALS: 52.3457145690918


Cross validation results:
test_rmse  :  [0.98871454 0.98392946 0.98468003]
test_mae  :  [0.70193266 0.6991709  0.69981287]
fit_time  :  (12.814245462417603, 12.74767255783081, 12.50976276397705)
test_time  :  (27.62490439414978, 28.00885558128357, 28.005619525909424)
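
The per-fold arrays returned by `cross_validate` are easier to compare across models once they are reduced to a mean and a spread. A minimal sketch, with the fold scores copied from the output above into a plain dictionary (the names `cv_scores` and `summary` are introduced here for illustration):

```python
import numpy as np

# Fold scores copied from the cross-validation output above
cv_scores = {
    'test_rmse': np.array([0.98871454, 0.98392946, 0.98468003]),
    'test_mae': np.array([0.70193266, 0.6991709, 0.69981287]),
}

# Mean and standard deviation across the 3 folds for each metric
summary = {metric: (float(scores.mean()), float(scores.std()))
           for metric, scores in cv_scores.items()}

for metric, (mean, std) in summary.items():
    print('{}: mean={:.4f}, std={:.4f}'.format(metric, mean, std))
```

A small standard deviation relative to the mean suggests the folds agree with each other, so differences between models larger than that spread are more likely to be meaningful.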


In [None]:
predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./KNNBaseline_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./KNNBaseline_3epochs_DefaultParamModel_file')
  
df1 = pd.DataFrame(predictions, columns=['reviewerID', 'itemID', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewerID.apply(get_Ir)
df1['Ui'] = df1.itemID.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)  
df1.to_csv('predictions_KNNBaseline_DefaultParamModel.csv')

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE from fit best parameters on train predict on test:
RMSE: 0.9639
0.9638924023238667
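
The `err` column and the reported RMSE are directly related: RMSE is the square root of the mean squared `err`, while MAE is the mean of `err`. A small sketch on made-up predictions (the `toy` values are invented for illustration and are not taken from the data):

```python
import numpy as np
import pandas as pd

# Made-up true ratings (rui) and estimates (est)
toy = pd.DataFrame({'rui': [5.0, 1.0, 4.0, 3.0],
                    'est': [5.0, 5.0, 3.5, 3.0]})

# Absolute error per prediction, as in the `err` column above
toy['err'] = (toy.est - toy.rui).abs()

# RMSE penalizes the single large miss far more than MAE does
rmse = float(np.sqrt((toy.err ** 2).mean()))
mae = float(toy.err.mean())
print('RMSE:', round(rmse, 4), 'MAE:', round(mae, 4))
```

This is why a handful of predictions with `err = 4.0`, like those in the worst-prediction tables, can have a visible effect on RMSE.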


<font size="4">Find the best predictions<font>

In [None]:
best_predictions = df1.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
        reviewerID  itemID  rui  est  \
82292         7359     599  5.0  5.0   
66702        10988    1054  5.0  5.0   
199042        3797    4924  5.0  5.0   
66720          305    5079  5.0  5.0   
66728        13455    1670  5.0  5.0   
66730          567   10371  5.0  5.0   
199038        3490   20933  5.0  5.0   
66697        16570   33621  5.0  5.0   
66734         4546   10521  5.0  5.0   
66745        12943   32981  5.0  5.0   

                                          details   Iu   Ui  err  
82292   {'actual_k': 40, 'was_impossible': False}   36  104  0.0  
66702   {'actual_k': 40, 'was_impossible': False}   26  257  0.0  
199042  {'actual_k': 40, 'was_impossible': False}   47   73  0.0  
66720   {'actual_k': 40, 'was_impossible': False}  226   57  0.0  
66728    {'actual_k': 3, 'was_impossible': False}   24   13  0.0  
66730    {'actual_k': 5, 'was_impossible': False}  153    5  0.0  
199038   {'actual_k': 9, 'was_impossible': False}   49   15  0.0  
66

<font size="4"> Find the worst predictions<font>

In [None]:
worst_predictions = df1.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df1, best_predictions, worst_predictions

Worst 10 predictions:
        reviewerID  itemID  rui  est  \
50924         1518   41193  5.0  1.0   
162946        5015    3768  1.0  5.0   
64718         3387    2240  1.0  5.0   
121979       14782   46077  1.0  5.0   
87343         5468    6536  1.0  5.0   
1375            76   55298  5.0  1.0   
37276         3867   22619  5.0  1.0   
128317       17696   24076  1.0  5.0   
21647        11177   44655  5.0  1.0   
7373          1420   28629  1.0  5.0   

                                          details   Iu  Ui  err  
50924    {'actual_k': 1, 'was_impossible': False}   85   5  4.0  
162946   {'actual_k': 6, 'was_impossible': False}   43  21  4.0  
64718   {'actual_k': 22, 'was_impossible': False}   49  66  4.0  
121979   {'actual_k': 1, 'was_impossible': False}   23   1  4.0  
87343    {'actual_k': 7, 'was_impossible': False}   42  48  4.0  
1375     {'actual_k': 2, 'was_impossible': False}  503   2  4.0  
37276    {'actual_k': 1, 'was_impossible': False}   46   3  4.0  
128317   

#### KNNBaseline HPO using Grid Search

<font size="4">Define the parameters for the grid search <font>



In [None]:
print('KNNBaseline HPO using Grid Search:')
print('\n')
param_grid = {'bsl_options': {'method': ['als', 'sgd']}, 
              'k': [30, 35, 40, 45, 50], 
              'min_k': [20, 25],
              'random_state': [seed_value],
              'sim_options': {'name': ['pearson_baseline'],
                              'min_support': [5, 10],
                              'shrinkage': [0, 100]}}
print('Grid search parameters:')
param_grid          

KNNBaseline HPO using Grid Search:


Grid search parameters:


{'bsl_options': {'method': ['als', 'sgd']},
 'k': [30, 35, 40, 45, 50],
 'min_k': [20, 25],
 'random_state': [42],
 'sim_options': {'name': ['pearson_baseline'],
  'min_support': [5, 10],
  'shrinkage': [0, 100]}}
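
The cost of the search grows with the product of the list lengths in the grid. A minimal sketch that counts the combinations for the grid printed above, assuming nested dictionaries such as `bsl_options` and `sim_options` are expanded the same way as top-level keys (the helper `count_combinations` is hypothetical, not part of Surprise):

```python
def count_combinations(grid):
    """Number of parameter combinations in a (possibly nested) grid."""
    total = 1
    for values in grid.values():
        if isinstance(values, dict):
            # Nested option dicts contribute their own product
            total *= count_combinations(values)
        else:
            total *= len(values)
    return total


# Grid as printed in the output above
param_grid = {'bsl_options': {'method': ['als', 'sgd']},
              'k': [30, 35, 40, 45, 50],
              'min_k': [20, 25],
              'random_state': [42],
              'sim_options': {'name': ['pearson_baseline'],
                              'min_support': [5, 10],
                              'shrinkage': [0, 100]}}

n_combinations = count_combinations(param_grid)
n_folds = 3
n_fits = n_combinations * n_folds
print('Combinations:', n_combinations, 'Total fits:', n_fits)
```

With 3-fold cross validation this gives 240 fits, which agrees with the `240 out of 240` reported in the joblib log of the grid search.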

<font size="4">Run grid search with `rmse` and `mae` as the metrics. Then use the parameters that resulted in the lowest RMSE on the train/test sets<font>



In [None]:
gs = GridSearchCV(KNNBaseline, param_grid, measures=['rmse', 'mae'], cv=3, 
                  joblib_verbose=-1, n_jobs=5)
print('Start time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Model with lowest RMSE:')
print(gs.best_score['rmse'])
print('\n')
# Parameters with the lowest RMSE 
print('Parameters with the lowest RMSE:')
print(gs.best_params['rmse'])

Start time for iterating grid search parameters..


[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done  62 tasks      | elapsed: 11.1min


Finished iterating grid search parameters: 2447.833321094513


Model with lowest RMSE:
0.9453110845949882


Parameters with the lowest RMSE:
{'bsl_options': {'method': 'sgd'}, 'k': 40, 'min_k': 20, 'random_state': 42, 'sim_options': {'name': 'pearson_baseline', 'min_support': 5, 'shrinkage': 100, 'user_based': True}}


[Parallel(n_jobs=5)]: Done 240 out of 240 | elapsed: 40.7min finished


In [None]:
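# NOTE: `algo` is still the default-parameter KNNBaseline from the previous
# section here; the tuned estimator is assigned from gs.best_estimator in a later cell.
predictions = algo.fit(train).test(test)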
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
print('\n')
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)
print('KNNBaseline GridSearch HPO Cross Validation Results:')
print(results_df.head())
results_df.to_csv('KNNBaseline_gridSearch_cvResults.csv', index=False)

del results_df

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE from fit best parameters on train predict on test:
RMSE: 0.9661
0.9661408444802692


KNNBaseline GridSearch HPO Cross Validation Results:
    split0_test_rmse  split1_test_rmse  split2_test_rmse  mean_test_rmse  \
57          0.948154          0.943403          0.944376        0.945311   
49          0.948146          0.943407          0.944382        0.945312   
41          0.948147          0.943410          0.944380        0.945313   
65          0.948156          0.943407          0.944381        0.945315   
73          0.948159          0.943412          0.944383        0.945318   

    std_test_rmse  rank_test_rmse  split0_test_mae  split1_test_mae  \
57       0.002049               1         0.684004         0.681184   
49       0.002044               2         0.683991         0.681176   
41       0.002043               3         0.683976         0.681167   
65       0.0

<font size="4">Fit and predict on the best model, apply functions and save prediction results <font>

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
print('\n')
dump.dump('./KNNBaseline_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./KNNBaseline_bestGrid_Model_file')
   
df1 = pd.DataFrame(predictions, columns=['reviewerID', 'itemID', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewerID.apply(get_Ir)
df1['Ui'] = df1.itemID.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)   
df1.to_csv('predictions_KNNBaseline_gridSearch.csv')

Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE from fit best parameters on train predict on test:
RMSE: 0.9379
0.9378743356689234




<font size="4">Find the best predictions<font>

In [None]:
best_predictions = df1.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
        reviewerID  itemID  rui  est  \
90940         4608     211  5.0  5.0   
159489       18439   14836  5.0  5.0   
134670       12176    3672  5.0  5.0   
41605         5787      90  5.0  5.0   
214781       16444     188  5.0  5.0   
14334         7397    1319  5.0  5.0   
41607          602    8628  5.0  5.0   
200020        8201   17834  5.0  5.0   
41613         8629    7599  5.0  5.0   
91520         6810     874  5.0  5.0   

                                          details   Iu   Ui  err  
90940    {'actual_k': 6, 'was_impossible': False}   48  231  0.0  
159489   {'actual_k': 2, 'was_impossible': False}   17   26  0.0  
134670   {'actual_k': 0, 'was_impossible': False}   24   54  0.0  
41605    {'actual_k': 1, 'was_impossible': False}   40  152  0.0  
214781   {'actual_k': 2, 'was_impossible': False}   20   96  0.0  
14334    {'actual_k': 2, 'was_impossible': False}   35   96  0.0  
41607    {'actual_k': 0, 'was_impossible': False}  133    7  0.0  
20

<font size="4"> Find the worst predictions<font>

In [None]:
worst_predictions = df1.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df1, best_predictions, worst_predictions

Worst 10 predictions:
        reviewerID  itemID  rui  est  \
199613        2527     733  1.0  5.0   
111347         302   20835  1.0  5.0   
128997        7822     700  1.0  5.0   
13429        10089    2718  1.0  5.0   
135680         804   23913  1.0  5.0   
105987        9570     388  1.0  5.0   
145506          10    8074  1.0  5.0   
44675         5849    4328  1.0  5.0   
64715         3907     100  1.0  5.0   
201633        3958    1842  5.0  1.0   

                                          details    Iu   Ui  err  
199613   {'actual_k': 5, 'was_impossible': False}    65  417  4.0  
111347   {'actual_k': 0, 'was_impossible': False}   218    5  4.0  
128997   {'actual_k': 4, 'was_impossible': False}    34  114  4.0  
13429    {'actual_k': 0, 'was_impossible': False}    30   87  4.0  
135680   {'actual_k': 0, 'was_impossible': False}   122    5  4.0  
105987   {'actual_k': 0, 'was_impossible': False}    25  299  4.0  
145506   {'actual_k': 8, 'was_impossible': False}  1307   38 

## RecSys Methods without Surprise
### Create the rating matrix with items and reviewers

### Create SVD Based Recommendation System using SciPy
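<font size="4">For the training of the SVD-based models:
A rating matrix of reviewers by items was constructed and transposed so that the items are in the rows. The matrix was factorized with `U, sigma, Vt = randomized_svd(X, n_components, n_oversamples, random_state)`, where `randomized_svd` is imported from `sklearn.utils.extmath`. The singular values in `sigma` were placed on a diagonal matrix, the ratings were predicted as the product of `U`, `sigma` and `Vt`, and RMSE was calculated between the average actual and average predicted ratings. <font>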

In [None]:
# Create user-item matrix & transpose the utility matrix
ratingsMat = df.pivot_table(values='rating', index='reviewer_id', 
                            columns='item_id', fill_value=0)
X = ratingsMat.values.T
X.shape

(103687, 19639)
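
The idea behind the low-rank prediction can be seen on a small matrix. The sketch below is illustrative only: it uses exact `np.linalg.svd` in place of `randomized_svd` so that it has no dependency beyond NumPy, and the ratings in `R` are made up.

```python
import numpy as np

# Toy reviewer-by-item ratings (0 = not rated); values are made up
R = np.array([[5, 4, 0, 1],
              [4, 5, 0, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)

# Exact SVD, then keep only the k largest singular values
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius reconstruction error equals the norm of the
# singular values that were discarded
error = np.linalg.norm(R - R_hat)
print(R_hat.round(2))
print('Reconstruction error:', round(float(error), 4))
```

Entries that were 0 in `R` receive non-zero values in `R_hat`, and those filled-in values are what the model treats as predicted ratings.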

### Model with `n_components=15` and `n_oversamples=20`

In [None]:
# Define model components
U, sigma, Vt = randomized_svd(X, n_components=15, 
                              n_oversamples=20, random_state=42)

# Construct a diagonal matrix in SVD
sigma = np.diag(sigma)

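# Predicted ratings (items x reviewers, because X is the transposed matrix)
reviewers_predRating = np.dot(np.dot(U, sigma), Vt) 

# Create a dataframe of the predicted ratings, transposed back to
# reviewers x items so that it lines up with ratingsMat
ratingPred = pd.DataFrame(reviewers_predRating.T, columns=ratingsMat.columns)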

In [None]:
print('\nEvaluate the SciPy SVD Collaborative recommender model')
rmse_df = pd.concat([ratingsMat.mean(), ratingPred.mean()], axis=1)
rmse_df.columns = ['Avg_actual_rating', 'Avg_predicted_rating']
rmse_df['item_index'] = np.arange(0, rmse_df.shape[0], 1)

RMSE = round((((rmse_df.Avg_actual_rating 
                - rmse_df.Avg_predicted_rating) ** 2).mean() ** 0.5), 10)
print('\nRMSE of SciPy SVD Model = {} \n'.format(RMSE))


Evaluate the SciPy SVD Collaborative recommender model

RMSE of SciPy SVD Model = 0.0144595486 
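
Note that the RMSE above compares per-item average ratings, which is a more forgiving measure than the per-rating RMSE reported for the Surprise models, so the two values are not directly comparable. Averaging within a column lets positive and negative errors cancel, so the RMSE of the means can never exceed the entry-wise RMSE. A sketch on a made-up matrix contrasting the measures (exact SVD is used in place of `randomized_svd` to keep the example dependency-free, and all names are illustrative):

```python
import numpy as np

# Toy reviewer-by-item ratings (0 = not rated); values are made up
R = np.array([[5, 4, 0, 1],
              [4, 5, 0, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :1] @ np.diag(s[:1]) @ Vt[:1, :]   # rank-1 approximation

# RMSE between column (item) averages, as in the cell above
rmse_of_means = np.sqrt(((R.mean(axis=0) - R_hat.mean(axis=0)) ** 2).mean())

# RMSE over every individual entry of the matrix
rmse_all = np.sqrt(((R - R_hat) ** 2).mean())

# RMSE over the observed (non-zero) ratings only
observed = R > 0
rmse_observed = np.sqrt(((R - R_hat)[observed] ** 2).mean())

print('RMSE of item means:   ', round(float(rmse_of_means), 4))
print('RMSE over all entries:', round(float(rmse_all), 4))
print('RMSE over observed:   ', round(float(rmse_observed), 4))
```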



### Model with `n_components=10` and `n_oversamples=20`

In [None]:
# Define model components
U, sigma, Vt = randomized_svd(X, n_components=10,
                              n_oversamples=20, random_state=42)

# Construct a diagonal matrix in SVD
sigma = np.diag(sigma)

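# Predicted ratings (items x reviewers, because X is the transposed matrix)
reviewers_predRating = np.dot(np.dot(U, sigma), Vt) 

# Create a dataframe of the predicted ratings, transposed back to
# reviewers x items so that it lines up with ratingsMat
ratingPred = pd.DataFrame(reviewers_predRating.T, columns=ratingsMat.columns)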

In [None]:
print('\nEvaluate the SciPy SVD Collaborative recommender model')
rmse_df = pd.concat([ratingsMat.mean(), ratingPred.mean()], axis=1)
rmse_df.columns = ['Avg_actual_rating', 'Avg_predicted_rating']
rmse_df['item_index'] = np.arange(0, rmse_df.shape[0], 1)

RMSE = round((((rmse_df.Avg_actual_rating 
                - rmse_df.Avg_predicted_rating) ** 2).mean() ** 0.5), 10)
print('\nRMSE of SciPy SVD Model = {} \n'.format(RMSE))


Evaluate the SciPy SVD Collaborative recommender model

RMSE of SciPy SVD Model = 0.0144523715 



<a id="decompose-matrix"></a>
<font size="4">**Decomposing the Matrix**

`TruncatedSVD` from `sklearn` was used to compress the transposed matrix down to a smaller number of columns, using different numbers of components. The `items` remain in the rows, while the `users` are compressed down to `n_components` columns, providing a generalized view of the users' interests for the given set of items.  <font>

In [None]:
# Lowercase name avoids shadowing the `SVD` class imported from surprise
svd = TruncatedSVD(n_components=12, random_state=seed_value)
result_matrix = svd.fit_transform(X)
result_matrix.shape

(103687, 500)

In [None]:
# Lowercase name avoids shadowing the `SVD` class imported from surprise
svd = TruncatedSVD(n_components=50, random_state=seed_value)
result_matrix = svd.fit_transform(X)
result_matrix.shape

(103687, 50)

<a id="gen-corr-matrix"></a>
<font size="4">**Generating a Correlation Matrix**

The Pearson correlation coefficient was calculated for every item pair in `result_matrix`, based on similarities between users' interests. `numpy.memmap` was utilized because the dense item-by-item correlation matrix is too large to hold in memory. This involved subtracting the row means from the input data, normalizing each row, and computing the dot products in 1000-row chunks to create the correlation matrix.<font>




In [None]:
SPLITROWS = 1000
numrows = result_matrix.shape[0]

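# Center each item (row) on its mean
result_matrix -= np.mean(result_matrix, axis=1)[:,None]

# Scale each row to unit length so that dot products are Pearson correlations
result_matrix /= np.sqrt(np.sum(result_matrix * result_matrix, axis=1))[:,None]

# Disk-backed array for the item-by-item correlation matrix
corr_matrix = np.memmap('/mydata.dat', 'float64', mode='w+', 
                        shape=(numrows, numrows)) 

# Fill the matrix one SPLITROWS x SPLITROWS block at a time
for r in range(0, numrows, SPLITROWS):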
    for c in range(0, numrows, SPLITROWS):
        r1 = r + SPLITROWS
        c1 = c + SPLITROWS
        chunk1 = result_matrix[r:r1]
        chunk2 = result_matrix[c:c1]
        corr_matrix[r:r1, c:c1] = np.dot(chunk1, chunk2.T)
corr_matrix.shape

(103687, 103687)
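
A dense 103687 x 103687 `float64` matrix takes roughly 86 GB (103687² x 8 bytes), which is why the result is written to a memory-mapped file in blocks. The sketch below checks, on a small random matrix, that the center, normalize and chunked dot-product steps reproduce `np.corrcoef`. The matrix `M`, the chunk size `step` and the in-memory `corr` array are stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(7, 5))          # 7 "items", 5 components

# Center each row, then scale it to unit length
Z = M - M.mean(axis=1, keepdims=True)
Z /= np.sqrt((Z * Z).sum(axis=1, keepdims=True))

# Fill the correlation matrix block by block
n = Z.shape[0]
step = 3                             # deliberately does not divide n evenly
corr = np.empty((n, n))
for r in range(0, n, step):
    for c in range(0, n, step):
        corr[r:r + step, c:c + step] = Z[r:r + step] @ Z[c:c + step].T

# The chunked result matches NumPy's row-wise Pearson correlation
print(np.allclose(corr, np.corrcoef(M)))
```

Slices that run past the end of the array are simply truncated by NumPy, so the last, smaller block needs no special handling.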

<a id="isolate"></a>
<font size="4">**Extract the most popular item from the Correlation Matrix**

The most popular item is `item_id=8`. The correlation values between `item_id=8` and all other items in the matrix are then extracted. <font>

In [None]:
item_names = ratingsMat.columns
item_list = list(item_names)
item_list

popular_item = item_list.index(8)
print('index of the popular item: ', popular_item) 
 
corr_popular_item = corr_matrix[popular_item]

index of the popular item:  8


<a id="recommend"></a>
<font size="4">**Recommend Highly Correlated Items**

Now select the items that are most highly correlated with the target item by applying the correlation thresholds shown below. <font>

#### 12 components

In [None]:
# Construct a list of items with correlation > 0.90 to the target
a = list(item_names[(corr_popular_item < 1.0) & (corr_popular_item > 0.90)])
print('Number of items > 0.9 correlated with target:', len(a))

# Construct a list of items with correlation > 0.95 to the target
b = list(item_names[(corr_popular_item < 1.0) & (corr_popular_item > 0.95)])
print('Number of items > 0.95 correlated with target:', len(b))
print('Items > 0.95 correlated with target:', b)

Number of items > 0.9 correlated with target: 87
Number of items > 0.95 correlated with target: 7
Items > 0.95 correlated with target: [17, 11062, 26414, 29303, 34289, 40162, 54404]


#### 50 components

In [None]:
# Construct a list of items with correlation > 0.90 to the target
a = list(item_names[(corr_popular_item < 1.0) & (corr_popular_item > 0.90)])
print('Number of items > 0.9 correlated with target:', len(a))

# Construct a list of items with correlation > 0.80 to the target
b = list(item_names[(corr_popular_item < 1.0) & (corr_popular_item > 0.80)])
print('Number of items > 0.80 correlated with target:', len(b))

# Construct a list of items with correlation > 0.70 to the target
c = list(item_names[(corr_popular_item < 1.0) & (corr_popular_item > 0.70)])
print('Number of items > 0.70 correlated with target:', len(c))
print('Items > 0.80 correlated with target:', b)

Number of items > 0.9 correlated with target: 0
Number of items > 0.8 correlated with target: 7
Number of items > 0.7 correlated with target: 67
Items > 0.80 correlated with target: [37141, 37478, 48025, 51623, 66472, 70981, 89506]
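
As the two runs above show, a fixed threshold returns very different numbers of items depending on the number of components. An alternative is to take the top N items by correlation regardless of the absolute value. A minimal sketch, where `top_n_correlated` is a hypothetical helper and the correlation values and item ids are made up:

```python
import numpy as np


def top_n_correlated(corr_row, item_ids, target_pos, n=3):
    """Return the n (item_id, correlation) pairs most correlated with the target."""
    scores = np.asarray(corr_row, dtype=float).copy()
    scores[target_pos] = -np.inf          # never recommend the target itself
    order = np.argsort(scores)[::-1][:n]  # positions sorted by descending score
    return [(item_ids[i], float(scores[i])) for i in order]


# Made-up correlations of five items with the item at position 1
corr_row = [0.2, 1.0, 0.95, 0.7, 0.91]
item_ids = [10, 11, 12, 13, 14]

result = top_n_correlated(corr_row, item_ids, target_pos=1, n=2)
print(result)
```

Applied to the notebook's data, the same function could be called with `corr_popular_item`, `item_list` and `popular_item`.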


<font size="4">Recommend the items with the highest predicted rating by selecting and sorting the reviewer's rating and concatenating the actual rating with the predicted rating <font>

In [None]:
def recommend_items(reviewerID, ratingsMat, ratingPred, num_recommendations):
    reviewer_idx = reviewerID - 1
    
    sorted_reviewer_rating = ratingsMat.iloc[reviewer_idx].sort_values(ascending=False)
    sorted_reviewer_predictions = ratingPred.iloc[reviewer_idx].sort_values(ascending=False)

    tmp = pd.concat([sorted_reviewer_rating, sorted_reviewer_predictions], 
                     axis=1)
    tmp.index.name = 'Recommended Items'
    tmp.columns = ['reviewer_rating', 'reviewer_predictions']
    tmp = tmp.sort_values('reviewer_predictions', ascending=False)
    
    print('\nBelow are the recommended items for reviewer(reviewer_id = {}):\n'.format(reviewerID))
    print(tmp.head(num_recommendations))

In [None]:
reviewerID = 1
num_recommendations = 10
recommend_items(reviewerID, ratingsMat, ratingPred, num_recommendations)

reviewerID = 40
num_recommendations = 10
recommend_items(reviewerID, ratingsMat, ratingPred, num_recommendations)

reviewerID = 300
num_recommendations = 10
recommend_items(reviewerID, ratingsMat, ratingPred, num_recommendations)


Below are the recommended items for reviewer(reviewer_id = 1):

                   reviewer_rating  reviewer_predictions
Recommended Items                                       
9                              5.0              5.717030
34                             5.0              3.476725
46                             0.0              2.600543
3                              0.0              2.251135
33                             0.0              2.209918
43                             0.0              2.184715
42                             3.0              2.123471
39                             0.0              2.057666
55                             4.0              1.968689
173                            4.0              1.875312

Below are the recommended items for reviewer(reviewer_id = 40):

                   reviewer_rating  reviewer_predictions
Recommended Items                                       
0                              0.0              2.102204
13            
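
In the output above, some of the top-ranked items already have a rating from the reviewer. If the goal is to surface new items, the already-rated ones can be filtered out before ranking. A minimal sketch, assuming that a rating of 0 means "not rated" as in the pivot table (`recommend_unseen` is a hypothetical helper and the values are made up):

```python
import pandas as pd


def recommend_unseen(ratings_row, predicted_row, n=3):
    """Top-n predicted items among those the reviewer has not rated."""
    unseen = ratings_row[ratings_row == 0].index
    return predicted_row.loc[unseen].sort_values(ascending=False).head(n)


# Made-up actual and predicted ratings for one reviewer, indexed by item id
ratings = pd.Series([5.0, 0.0, 3.0, 0.0, 0.0], index=[9, 34, 46, 3, 33])
preds = pd.Series([5.7, 3.4, 2.6, 2.2, 2.9], index=[9, 34, 46, 3, 33])

top = recommend_unseen(ratings, preds, n=2)
print(top)
```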

### Popularity Model
<font size="4">For the construction of the popularity recommender model, a recommendation score was created by counting the number of reviewers for each unique item. The items were sorted by this score and a recommendation rank was assigned based on it. Then the top five recommendations were examined. <font>

In [None]:
# Examine train/test sets for modeling
print('Dimensions of train set:', train.shape)
print('Dimensions of test set:', test.shape)

Dimensions of train set: (890561, 3)
Dimensions of test set: (222833, 3)


In [None]:
train_grouped = train.groupby('item_id').agg({'reviewer_id': 'count'}).reset_index()
train_grouped.rename(columns = {'reviewer_id': 'count'}, inplace=True)

# Sort by count so the most reviewed items come first; ties broken by item_id
train_sort = train_grouped.sort_values(['count', 'item_id'], ascending=[False, True]) 
train_sort['rank'] = train_sort['count'].rank(ascending=False, method='first') 

popularity_recommendations = train_sort.head() 
print('\nTop 5 recommendations')
print(popularity_recommendations)


Top 5 recommendations
   item_id   count  rank
4      5.0  475124   1.0
3      4.0  202266   2.0
2      3.0  112957   3.0
1      2.0   52608   4.0
0      1.0   47606   5.0
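
The ranking logic can be checked on a small example. The sketch below uses made-up (reviewer, item, rating) rows; the item ids and the frame name `toy` are illustrative only.

```python
import pandas as pd

# Made-up ratings: item 10 has 3 reviewers, item 20 has 2, item 30 has 1
toy = pd.DataFrame({'reviewer_id': [1, 2, 3, 1, 2, 1],
                    'item_id': [10, 10, 10, 20, 20, 30],
                    'rating': [5, 4, 5, 3, 4, 2]})

# Popularity score = number of reviewers per item
counts = (toy.groupby('item_id')['reviewer_id'].count()
             .rename('count').reset_index())

# Most reviewed first, then assign a rank in that order
counts = counts.sort_values(['count', 'item_id'], ascending=[False, True])
counts['rank'] = counts['count'].rank(ascending=False, method='first')
print(counts)
```

Sorting on the count column, rather than on the item id, is what makes the head of the table correspond to the most popular items.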


<font size="4">Recommendations were then generated for various reviewers by defining a function that adds the `reviewer_id` as the first column of the recommendation table. For three different reviewers, the same items were returned. This is not a robust method for a recommendation system because the popularity ranking ignores each reviewer's own history, and there are many unaccounted-for variables such as age, gender, location and time, to name a few.<font>

In [None]:
def recommend(reviewer_id):   

    # Copy so the shared popularity table is not modified on each call
    reviewer_recommendations = popularity_recommendations.copy() 
    reviewer_recommendations['reviewer_id'] = reviewer_id 
    cols = reviewer_recommendations.columns.tolist() 
    cols = cols[-1:] + cols[:-1] 
    reviewer_recommendations = reviewer_recommendations[cols] 
          
    return reviewer_recommendations 

find_recom = [1,100,200]   
for i in find_recom:
    print('The list of recommendations for the reviewer_id: %d\n' %(i))
    print(recommend(i))    

The list of recommendations for the reviewer_id: 1

   reviewer_id  item_id   count  rank
4            1      5.0  475124   1.0
3            1      4.0  202266   2.0
2            1      3.0  112957   3.0
1            1      2.0   52608   4.0
0            1      1.0   47606   5.0
The list of recommendations for the reviewer_id: 100

   reviewer_id  item_id   count  rank
4          100      5.0  475124   1.0
3          100      4.0  202266   2.0
2          100      3.0  112957   3.0
1          100      2.0   52608   4.0
0          100      1.0   47606   5.0
The list of recommendations for the reviewer_id: 200

   reviewer_id  item_id   count  rank
4          200      5.0  475124   1.0
3          200      4.0  202266   2.0
2          200      3.0  112957   3.0
1          200      2.0   52608   4.0
0          200      1.0   47606   5.0
