# `Recommender System Models`

&emsp;&emsp;&emsp;<font size="4">In this project, user ratings from a subset of Amazon Reviews are used to train a collaborative-filtering recommendation system, evaluating several algorithms with hyperparameter optimization. The models return a list of recommended items based on the reviewer's previous actions.</font>

# `Data`
&emsp;&emsp;&emsp;<font size="4">The `Movies_and_TV` ratings data was retrieved from [here](https://jmcauley.ucsd.edu/data/amazon/). Each record is an (item, user, rating, timestamp) tuple. Only a subset of the data was used in the analysis because of the computational constraints imposed by its sparsity.</font>

# `Preprocessing`
&emsp;&emsp;&emsp;<font size="4">The code used for preprocessing and EDA can be found [here](https://github.com/adataschultz/RecSys/blob/main/Notebooks_Scripts/Recommender_System.py).
- First, the environment is set up: the dependencies, library options, the seed for reproducibility, and the location of the project directory. Then the data is read, duplicate observations are dropped, and the columns are named.</font>

In [None]:
import os
import random
import numpy as np
import warnings
import sys
import pandas as pd
import time
import json
from surprise import Dataset, Reader, BaselineOnly, NormalPredictor
from surprise import KNNBaseline, KNNWithMeans, KNNBasic, KNNWithZScore
from surprise import SVD, SVDpp, NMF, CoClustering, dump, accuracy
from surprise.model_selection import cross_validate, GridSearchCV
from sklearn.utils.extmath import randomized_svd
from sklearn.decomposition import TruncatedSVD
from tempfile import mkdtemp
import os.path as path
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Set seed
seed_value = 42
os.environ['Recommender'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

# Set path
path = r'D:\AmazonReviews\Data'
os.chdir(path)

# Read data
df = pd.read_csv('Movies_and_TV.csv', header=None, skiprows=[0],
                 low_memory=False)
df = df.drop_duplicates()

# Name columns
df.columns = ['item', 'reviewerID', 'rating', 'timestamp']

print('Sample observations:')
df.head()

Sample observations:


Unnamed: 0,item,reviewerID,rating,timestamp
0,1527665,A2VHSG6TZHU1OB,5.0,1361145600
1,1527665,A23EJWOW1TLENE,5.0,1358380800
2,1527665,A1KM9FNEJ8Q171,5.0,1357776000
3,1527665,A38LY2SSHVHRYB,4.0,1356480000
4,1527665,AHTYUW2H1276L,5.0,1353024000


<font size="4">Then a function is defined to examine the data for the number of missing observations, the data types, and the number of unique values in the initial set. The `timestamp` variable is dropped since it will not be used.</font>

In [None]:
# Define a function to examine the data
def data_summary(df):
    print('Number of Rows: {}, Columns: {}'.format(df.shape[0], df.shape[1]))
    a = pd.DataFrame()
    a['Number of Missing Values'] = df.isnull().sum()
    a['Data type of variable'] = df.dtypes
    a['Number of Unique Values'] = df.nunique()
    print(a)

print('Initial Data Summary:')
# data_summary() prints its own output and returns None, so call it directly
data_summary(df)

df = df.drop(['timestamp'], axis=1)

Initial Data Summary:
Number of Rows: 8522125, Columns: 4
            Number of Missing Values Data type of variable  \
item                               0                object   
reviewerID                         0                object   
rating                             0               float64   
timestamp                          0                 int64   

            Number of Unique Values  
item                         182032  
reviewerID                  3826085  
rating                            5  
timestamp                      7476  


<font size="4">The 10 reviewers with the most ratings in the initial set each have over 1,600 ratings.</font>

In [None]:
reviewers_top10 = df.groupby('reviewerID').size().sort_values(ascending=False)[:10]
print('Reviewers with highest number of ratings in initial set:')
print(reviewers_top10)

Reviewers with highest number of ratings in initial set:
reviewerID
AV6QDP8Q0ONK4     4101
A1GGOC9PVDXW7Z    2114
ABO2ZI2Y5DQ9T     2073
A328S9RN3U5M68    2059
A3MV1KKHX51FYT    1989
A2EDZH51XHFA9B    1842
A3LZGLA88K0LA0    1814
A16CZRQL23NOIW    1808
AIMR915K4YCN      1719
A2NJO6YE954DBH    1699
dtype: int64


<font size="4">Among the top 10 items in the initial set, the most-rated item has 24,554 ratings while the 10th most-rated item has 14,174 ratings.</font>

In [None]:
items_top10 = df.groupby('item').size().sort_values(ascending=False)[:10]
print('Items with highest number of ratings in initial set:')
print(items_top10)

Items with highest number of ratings in initial set:
item
B00YSG2ZPA    24554
B00006CXSS    24485
B00AQVMZKQ    21015
B01BHTSIOC    20889
B00NAQ3EOK    16857
6305837325    16671
B00WNBABVC    15205
B017S3OP7A    14795
B009934S5M    14481
B00FL31UF0    14174
dtype: int64


<font size="4">Since the data is sparse, a new integer id is created for `item` rather than using the initial string variable.</font>

In [None]:
value_counts = df['item'].value_counts(dropna=True, sort=True)
df1 = pd.DataFrame(value_counts)
df1 = df1.reset_index()
df1.columns = ['item_unique', 'counts']
df1 = df1.reset_index()
df1.rename(columns={'index': 'item_id'}, inplace=True)
df1 = df1.drop(['counts'], axis=1)
df = pd.merge(df, df1, how='left', left_on=['item'],
              right_on=['item_unique'])
df = df.drop_duplicates()
df = df.drop(['item_unique'], axis=1)

del value_counts, df1
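
The frequency-ranked mapping above can be illustrated on a toy list. This is a minimal sketch in plain Python rather than the `pandas` code used in this project; the function name `frequency_ranked_ids` and the sample ids are hypothetical. The most frequent value receives id 0, which mirrors how `value_counts` followed by `reset_index` orders the new ids (the ordering of ties may differ in `pandas`).

```python
from collections import Counter


def frequency_ranked_ids(values):
    """Map each distinct value to an integer id, most frequent value first."""
    counts = Counter(values)
    # most_common() sorts by descending count; ties keep first-seen order
    return {value: new_id
            for new_id, (value, _count) in enumerate(counts.most_common())}


# Hypothetical item strings standing in for the `item` column
items = ['B001', 'A123', 'B001', 'C777', 'B001', 'A123']
item_ids = frequency_ranked_ids(items)
encoded = [item_ids[item] for item in items]
print(item_ids)
print(encoded)
```

Keeping the mapping dictionary around plays the same role as the id-match file saved later: it allows the integer ids to be translated back to the original strings.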

<font size="4">The same process is used for `reviewerID`. A key is created for merging the new integer variables, which can later be used to join the original data. For this set, the unnecessary keys are then dropped.</font>

In [None]:
value_counts = df['reviewerID'].value_counts(dropna=True, sort=True)
df1 = pd.DataFrame(value_counts)
df1 = df1.reset_index()
df1.columns = ['id_unique', 'counts']
df1 = df1.reset_index()
df1.rename(columns={'index': 'reviewer_id'}, inplace=True)
df1 = df1.drop(['counts'], axis=1)
df = pd.merge(df, df1, how='left', left_on=['reviewerID'],
              right_on=['id_unique'])
df = df.drop_duplicates()
df = df.drop(['id_unique'], axis=1)

del value_counts, df1

df1 = df[['item', 'item_id', 'reviewerID', 'reviewer_id']]
df1.to_csv('Movies_and_TV_idMatch.csv', index=False)

del df1

df = df.drop(['item', 'reviewerID'], axis=1)

<font size="4">Due to sparsity, the data is then filtered to reviewers who have at least 25 ratings. This results in a set containing 1,113,396 ratings with 19,639 unique reviewers and 103,687 unique items. The majority of ratings are 5 stars.</font>

In [None]:
reviewer_count = df.reviewer_id.value_counts()
df = df[df.reviewer_id.isin(reviewer_count[reviewer_count >= 25].index)]
df = df.drop_duplicates()

del reviewer_count

print('- Number of ratings after filtering: ', len(df))
print('- Number of unique reviewers: ', df['reviewer_id'].nunique())
print('- Number of unique items: ', df['item_id'].nunique())
for i in range(1,6):
  print('- Number of items with {0} rating = {1}'.format(i, df[df['rating'] == i].shape[0]))

- Number of ratings after filtering:  1113396
- Number of unique reviewers:  19639
- Number of unique items:  103687
- Number of items with 1 rating = 59470
- Number of items with 2 rating = 65558
- Number of items with 3 rating = 141436
- Number of items with 4 rating = 252584
- Number of items with 5 rating = 594348
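
The minimum-count filter can be sketched without `pandas` as follows. This is only an illustration under stated assumptions: `filter_by_min_ratings` and the toy tuples are hypothetical, and a threshold of 2 is used so the effect is visible on six rows (the project uses 25).

```python
from collections import Counter


def filter_by_min_ratings(ratings, min_ratings=25):
    """Keep (reviewer_id, item_id, rating) rows from reviewers with enough ratings."""
    per_reviewer = Counter(reviewer_id for reviewer_id, _item, _rating in ratings)
    return [row for row in ratings if per_reviewer[row[0]] >= min_ratings]


# Hypothetical ratings: reviewer 0 has 3, reviewer 1 has 1, reviewer 2 has 2
toy = [(0, 10, 5.0), (0, 11, 4.0), (0, 12, 3.0),
       (1, 10, 1.0), (2, 11, 5.0), (2, 12, 2.0)]
kept = filter_by_min_ratings(toy, min_ratings=2)
print(kept)
```

Note that the filter is applied to reviewers only, so items with very few ratings can remain in the filtered set.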


<font size="4">The 10 reviewers with the most ratings in the filtered set still have over 1,600 ratings each.</font>

In [None]:
reviewers_top10 = df.groupby('reviewer_id').size().sort_values(ascending=False)[:10]
print('Reviewers with highest number of ratings in filtered set:')
print(reviewers_top10)

del reviewers_top10

Reviewers with highest number of ratings in filtered set:
reviewer_id
0    3981
1    2068
2    1997
3    1986
4    1838
5    1811
6    1797
7    1733
8    1706
9    1634
dtype: int64


<font size="4">The top 10 items in the filtered set show a large reduction: the most-rated item drops from 24,554 to 1,136 ratings, and the 10th most-rated item drops from 14,174 to 853 ratings.</font>

In [None]:
items_top10 = df.groupby('item_id').size().sort_values(ascending=False)[:10]
print('Items with highest number of ratings filtered set:')
print(items_top10)

del items_top10

Items with highest number of ratings filtered set:
item_id
8     1136
14    1042
15    1040
13    1040
29     964
22     903
53     895
87     870
67     860
81     853
dtype: int64


## Create Recommendation Systems using Surprise

<font size="4">The data is loaded from a `pandas` dataframe using the `Reader` class prior to modeling. For the initial training with `surprise`, the default parameters of `BaselineOnly`, `KNNBaseline`, `KNNBasic`, `KNNWithMeans`, `KNNWithZScore`, `CoClustering`, `SVD`, `SVDpp` (an extension of SVD that takes implicit ratings into account), `NMF`, and `NormalPredictor` were evaluated with the `cross_validate` method using 3-fold cross validation to determine which algorithm yielded the lowest `RMSE`. `SVDpp` generated the lowest `RMSE`, but it also took the longest to fit and test. By default, `SVDpp` uses 20 epochs to fit the model, so experimenting with fewer epochs and other model parameters will reduce the runtime and may still maintain a low `RMSE`. `KNNBaseline` produced a comparable `RMSE` with a significantly lower runtime, so hyperparameter tuning might make it a better choice, especially for larger sample sizes.</font>

In [None]:
# Set path for results
path = r'D:\AmazonReviews\Models'
os.chdir(path)

# Load data using reader
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['reviewer_id', 'item_id', 'rating']], reader)
del df

# Iterate over all algorithms
print('Time for iterating through different algorithms..')
search_time_start = time.time()
benchmark = []
for algorithm in [BaselineOnly(), KNNBaseline(), KNNBasic(), KNNWithMeans(),
                  KNNWithZScore(), CoClustering(), SVD(), SVDpp(), NMF(),
                  NormalPredictor()]:
    # Cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3,
                             verbose=False, n_jobs=-1)

    # Model results
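    # Mean of each metric across folds, labeled with the algorithm name
    # (pd.concat replaces Series.append, which was removed in pandas 2.0)
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__],
                                    index=['Algorithm'])])
    benchmark.append(tmp)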
print('Finished iterating through different algorithms:',
      time.time() - search_time_start)

# Create df with results and save
surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
print('Results from testing different algorithms:')
print(surprise_results)
# Keep the index so the algorithm names are written to the file
surprise_results.to_csv('MoviesTV_results_algorithms.csv')
del surprise_results

Time for iterating through different algorithms..
Finished iterating through different algorithms: 1877.5992422103882
Results from testing different algorithms:
                 test_rmse     fit_time  test_time
Algorithm                                         
SVDpp             0.951981  1449.569379  34.628475
SVD               0.956978    56.881580   3.054269
BaselineOnly      0.958829     0.541074   2.015631
KNNBaseline       0.986208    12.493270  27.932690
KNNWithMeans      0.996443    12.406364  24.509705
KNNWithZScore     1.002136    13.109460  26.066691
CoClustering      1.012724    18.731312   2.226075
NMF               1.051432    59.292935   2.782387
KNNBasic          1.106805    12.031695  23.257696
NormalPredictor   1.505521     0.629097   2.384375
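
For reference, the two error metrics used throughout can be written out directly. This is a minimal sketch with hypothetical (true rating, estimate) pairs, not the `surprise.accuracy` implementation itself.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Hypothetical predictions: three small errors and one error of 2 stars
pairs = [(5.0, 4.0), (3.0, 3.0), (1.0, 3.0), (4.0, 5.0)]
print(rmse(pairs), mae(pairs))
```

Because the errors are squared before averaging, RMSE weights the single 2-star miss more heavily than MAE does, which is why RMSE is never smaller than MAE on the same predictions.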



In [None]:
# Set path for loading train/test
path = r'D:\AmazonReviews\Data'
os.chdir(path)

# Read train/test sets
train = pd.read_csv('train_filtered.csv', sep='|')
train.columns = ['rating', 'item_id', 'reviewer_id']
train['reviewer_id'] = train['reviewer_id'].str.extract(pat=r'(\d+)',
                                                        expand=False)
train['item_id'] = train['item_id'].str.extract(pat=r'(\d+)', expand=False)
train['reviewer_id'] = train['reviewer_id'].astype(int)
train['item_id'] = train['item_id'].astype(int)

test = pd.read_csv('eval_filtered.csv', sep='|')
test.columns = ['rating', 'item_id', 'reviewer_id']
test['reviewer_id'] = test['reviewer_id'].str.extract(pat=r'(\d+)', expand=False)
test['item_id'] = test['item_id'].str.extract(pat=r'(\d+)', expand=False)
test['reviewer_id'] = test['reviewer_id'].astype(int)
test['item_id'] = test['item_id'].astype(int)

# Assumption: convert the dataframes to the surprise objects that
# fit()/test() and the train.ur/train.ir lookups below require
train = Dataset.load_from_df(train[['reviewer_id', 'item_id', 'rating']],
                             reader).build_full_trainset()
test = Dataset.load_from_df(test[['reviewer_id', 'item_id', 'rating']],
                            reader).build_full_trainset().build_testset()

### SVDpp with lowest rmse
Fit model with default parameters for 3 epochs and examine RMSE on train/test sets

In [None]:
# Set path for results
path = r'D:\AmazonReviews\Models'
os.chdir(path)

print('Train/predict using SVDpp default parameters for 3 epochs:')
print('\n')
print('Time for iterating through SVDpp default parameters..')
search_time_start = time.time()
algo = SVDpp(n_epochs=3, random_state=seed_value)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False,
                    n_jobs=-1)
print('Finished iterating through SVDpp default parameters:',
      time.time() - search_time_start)
print('\n')
print('Cross validation results:')
# Iterate over key/value pairs in cv results dict
for key, value in cv.items():
    print(key, ' : ', value)
print('\n')

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVDpp_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./SVDpp_3epochs_DefaultParamModel_file')

Train/predict using SVDpp default parameters for 3 epochs:


Time for iterating through SVDpp default parameters..
Finished iterating through SVDpp default parameters: 259.01811718940735


Cross validation results:
test_rmse  :  [0.99186233 0.99067176 0.98925811]
test_mae  :  [0.75639429 0.75586697 0.75391987]
fit_time  :  (212.8136830329895, 213.71945691108704, 214.16111540794373)
test_time  :  (34.81285309791565, 35.151495695114136, 34.626320362091064)


RMSE from fit best parameters on train predict on test:
RMSE: 0.9811
0.9811266996736911
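
To make the role of the implicit ratings concrete, the SVD++ estimate can be written as the global mean plus the reviewer and item biases plus the dot product of the item factors with the reviewer factors, where the reviewer factors are augmented by a normalized sum of implicit item factors. The sketch below evaluates that expression in plain Python; all parameter values are hypothetical and `svdpp_predict` is an illustrative helper, not part of `surprise`.

```python
import math


def svdpp_predict(mu, b_u, b_i, p_u, q_i, y_rated):
    """SVD++ estimate: mu + b_u + b_i + q_i . (p_u + |I_u|^-0.5 * sum_j y_j).

    y_rated holds one implicit factor vector y_j per item the reviewer rated.
    """
    n_factors = len(p_u)
    norm = 1.0 / math.sqrt(len(y_rated)) if y_rated else 0.0
    implicit = [norm * sum(y_j[f] for y_j in y_rated) for f in range(n_factors)]
    user_vec = [p_u[f] + implicit[f] for f in range(n_factors)]
    return mu + b_u + b_i + sum(q_i[f] * user_vec[f] for f in range(n_factors))


# Hypothetical parameters with 2 factors and 4 rated items
estimate = svdpp_predict(
    mu=4.0, b_u=0.1, b_i=-0.2,
    p_u=[0.5, 0.0], q_i=[1.0, -1.0],
    y_rated=[[0.1, 0.2], [0.3, 0.0], [0.0, 0.2], [0.0, 0.0]],
)
print(estimate)
```

The sum over every item a reviewer has rated is evaluated for each training rating in each epoch, which is consistent with the much longer fit time observed for `SVDpp` compared with `SVD`.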


Read the `Movies_and_TV_idMatch.csv` set so the original features are present

In [None]:
# Set path
path = r'D:\AmazonReviews\Data'
os.chdir(path)

df = pd.read_csv('Movies_and_TV_idMatch.csv')
df = df.drop_duplicates()

print('Sample observations:')
df.head()

Sample observations:


Unnamed: 0,item,item_id,reviewerID,reviewer_id
0,1527665,45044,A2VHSG6TZHU1OB,1406814
1,1527665,45044,A23EJWOW1TLENE,3722126
2,1527665,45044,A1KM9FNEJ8Q171,842275
3,1527665,45044,A38LY2SSHVHRYB,2915259
4,1527665,45044,AHTYUW2H1276L,2915260


<font size="4">Examine results from predictions</font>

In [None]:
def get_Ir(reviewer_id):
    """
    Determine the number of items rated by a given reviewer
    Args:
      reviewer_id: the raw id of the reviewer
    Returns:
      Number of items rated by the reviewer in the train set
      (0 if the reviewer is not in the train set)
    """
    try:
        return len(train.ur[train.to_inner_uid(reviewer_id)])
    except ValueError:
        return 0

def get_Ri(item_id):
    """
    Determine the number of reviewers that rated a given item
    Args:
      item_id: the raw id of the item
    Returns:
      Number of reviewers that have rated the item in the train set
      (0 if the item is not in the train set)
    """
    try:
        return len(train.ir[train.to_inner_iid(item_id)])
    except ValueError:
        return 0
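
The two helpers above rely on per-reviewer and per-item lookups of the train set. A plain-Python analogue, given here as a sketch with hypothetical ids, shows the same idea: group the ratings both ways and fall back to 0 for ids that never appear in training.

```python
from collections import defaultdict

# Hypothetical (reviewer_id, item_id, rating) training rows
train_ratings = [('r1', 'i1', 5.0), ('r1', 'i2', 3.0), ('r2', 'i1', 4.0)]

user_ratings = defaultdict(list)  # plays the role of the trainset's `ur`
item_ratings = defaultdict(list)  # plays the role of the trainset's `ir`
for reviewer_id, item_id, rating in train_ratings:
    user_ratings[reviewer_id].append((item_id, rating))
    item_ratings[item_id].append((reviewer_id, rating))


def n_items_rated(reviewer_id):
    """Number of items the reviewer rated in training, 0 if unseen."""
    return len(user_ratings.get(reviewer_id, []))


def n_reviewers(item_id):
    """Number of reviewers who rated the item in training, 0 if unseen."""
    return len(item_ratings.get(item_id, []))


print(n_items_rated('r1'), n_items_rated('unseen'), n_reviewers('i1'))
```

These counts give context for each prediction error: an estimate for a reviewer or item with few training ratings rests on little information.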

<font size="4">Create a dataframe of the prediction results, apply the functions, and save the prediction results</font>

In [None]:
# Set path for results
path = r'D:\AmazonReviews\Models'
os.chdir(path)

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_SVDpp_DefaultParamModel.csv', index=False)

del df1

<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
        reviewerID  itemID  rui  est                    details   Iu   Ui  err
2465          1901     286  5.0  5.0  {'was_impossible': False}   81  247  0.0
103694        2922     300  5.0  5.0  {'was_impossible': False}   59  190  0.0
140212       11094     444  5.0  5.0  {'was_impossible': False}   28  193  0.0
196650         198    1488  5.0  5.0  {'was_impossible': False}  284   47  0.0
34119         2454    2501  5.0  5.0  {'was_impossible': False}   62   43  0.0
14052         3719     151  5.0  5.0  {'was_impossible': False}   51  440  0.0
34137         3916    2600  5.0  5.0  {'was_impossible': False}   50   92  0.0
159982        1066      14  5.0  5.0  {'was_impossible': False}   97  837  0.0
64899         5448     147  5.0  5.0  {'was_impossible': False}   39  223  0.0
103569         882    5672  5.0  5.0  {'was_impossible': False}  121   79  0.0


<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions

Worst 10 predictions:
        reviewerID  itemID  rui       est                    details   Iu  \
26621         4844     269  1.0  4.995164  {'was_impossible': False}   38   
148629          42     172  1.0  5.000000  {'was_impossible': False}  660   
44833         5789     409  1.0  5.000000  {'was_impossible': False}   41   
67997         8107     534  1.0  5.000000  {'was_impossible': False}   29   
29490          778    2691  1.0  5.000000  {'was_impossible': False}  122   
110642        1311     451  1.0  5.000000  {'was_impossible': False}   93   
119911        1331    1915  1.0  5.000000  {'was_impossible': False}   95   
109341         835    2752  1.0  5.000000  {'was_impossible': False}  116   
33420         4933     752  1.0  5.000000  {'was_impossible': False}   43   
118018        1666    1834  1.0  5.000000  {'was_impossible': False}   77   

         Ui       err  
26621   158  3.995164  
148629  588  4.000000  
44833   243  4.000000  
67997   262  4.000000  
29490    7
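
Several of the worst predictions have an estimate of exactly 5.0. One plausible explanation, stated here as an assumption rather than something verified for this run, is that `surprise` clips estimates to the `rating_scale` passed to `Reader` by default. The sketch below uses hypothetical helper names to show the effect of such clipping on the absolute error.

```python
def clip_to_scale(raw_estimate, rating_scale=(1, 5)):
    """Clamp a raw model output to the allowed rating range."""
    lower, upper = rating_scale
    return min(upper, max(lower, raw_estimate))


def abs_error(true_rating, raw_estimate, rating_scale=(1, 5)):
    """Absolute error after clipping the estimate to the rating scale."""
    return abs(clip_to_scale(raw_estimate, rating_scale) - true_rating)


# Hypothetical raw outputs above and below the 1-5 scale
print(clip_to_scale(5.7), clip_to_scale(0.2), abs_error(1.0, 5.7))
```

With a 1 to 5 scale and clipping, the largest possible absolute error is 4, which matches the `err` values shown for the worst predictions.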

#### SVDpp HPO using Grid Search

<font size="4">Hyperparameter optimization using `GridSearchCV` was performed to find the best parameters. Since this algorithm is computationally expensive with gradient descent, 10 epochs were used. A larger number of factors than the default `n_factors=20` was searched, and the default values `lr_all=0.007` and `reg_all=0.02` were included in the search.</font>

<font size="4">Define the parameters for the grid search</font>



In [None]:
param_grid = {'n_epochs': [10],
              'n_factors': [30, 40, 50],
              'lr_all': [7e-4, 7e-3, 7e-2],
              'reg_all': [2e-4, 2e-3, 2e-2],
              'random_state': [seed_value]}
print('Grid search parameters:')
param_grid

Grid search parameters:


{'n_epochs': [10],
 'n_factors': [30, 40, 50],
 'lr_all': [0.0007, 0.007, 0.07],
 'reg_all': [0.0002, 0.002, 0.02],
 'random_state': [42]}
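
The cost of a grid search grows with the product of the list lengths in the grid. As a minimal sketch, the combinations for the grid above can be enumerated with `itertools.product`; the variable names are illustrative, and the fold count of 3 matches the `cv=3` used in this project.

```python
from itertools import product

param_grid = {'n_epochs': [10],
              'n_factors': [30, 40, 50],
              'lr_all': [7e-4, 7e-3, 7e-2],
              'reg_all': [2e-4, 2e-3, 2e-2],
              'random_state': [42]}

# One dict per combination of parameter values
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

n_folds = 3
n_fits = len(combinations) * n_folds
print(len(combinations), n_fits)
```

The 27 combinations evaluated on 3 folds give the 81 tasks reported in the `joblib` log for this search.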

<font size="4">Run grid search with `rmse` and `mae` as the metrics. Then use the parameters that resulted in the lowest RMSE on the train/test sets.</font>



In [None]:
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])

Time for iterating grid search parameters..


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed: 98.9min


Finished iterating grid search parameters: 12332.196253061295


Lowest RMSE from Grid Search:
0.9525513633721562


Parameters of Model with lowest RMSE from Grid Search:
{'n_epochs': 10, 'n_factors': 30, 'lr_all': 0.007, 'reg_all': 0.02, 'random_state': 42}


[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed: 205.4min finished


In [None]:
# Save results to df
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)

print('SVDpp GridSearch HPO Cross Validation Results:')
print(results_df.head())
print('\n')
results_df.to_csv('SVDpp_gridSearch_cvResults.csv', index=False)

del results_df

SVDpp GridSearch HPO Cross Validation Results:
    split0_test_rmse  split1_test_rmse  split2_test_rmse  mean_test_rmse  \
5           0.955645          0.950836          0.951173        0.952551   
14          0.956360          0.951313          0.951621        0.953098   
4           0.956469          0.951755          0.951959        0.953394   
23          0.957312          0.952152          0.953361        0.954275   
13          0.957878          0.953556          0.953487        0.954974   

    std_test_rmse  rank_test_rmse  split0_test_mae  split1_test_mae  \
5        0.002192               1         0.697202         0.694840   
14       0.002310               2         0.698210         0.695600   
4        0.002176               3         0.694615         0.692085   
23       0.002204               4         0.698765         0.696435   
13       0.002054               5         0.695999         0.693928   

    split2_test_mae  mean_test_mae  std_test_mae  rank_test_mae  \
5 

<font size="4">Fit and predict with the best model, apply the functions, and save the prediction results</font>

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVDpp_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./SVDpp_bestGrid_Model_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_SVDpp_gridSearch.csv', index=False)

del df1

RMSE from fit best parameters on train predict on test:
RMSE: 0.9466
0.9465595720332523
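
Predictions such as these can be turned into a recommendation list by keeping the highest estimates per reviewer. The sketch below assumes each prediction is a `(reviewer_id, item_id, true_rating, est, details)` tuple, as in the dataframe columns used in this notebook; `get_top_n` and the toy data are illustrative.

```python
from collections import defaultdict


def get_top_n(predictions, n=10):
    """Return {reviewer_id: [(item_id, est), ...]} with the n highest estimates."""
    top_n = defaultdict(list)
    for reviewer_id, item_id, _true_rating, est, _details in predictions:
        top_n[reviewer_id].append((item_id, est))
    # Sort each reviewer's items by estimate, highest first, and truncate
    for reviewer_id, rated in top_n.items():
        rated.sort(key=lambda pair: pair[1], reverse=True)
        top_n[reviewer_id] = rated[:n]
    return dict(top_n)


# Hypothetical predictions for two reviewers
toy_predictions = [(1, 'a', None, 4.2, {}), (1, 'b', None, 4.9, {}),
                   (1, 'c', None, 3.1, {}), (2, 'a', None, 2.5, {})]
recommendations = get_top_n(toy_predictions, n=2)
print(recommendations)
```

For actual recommendations, the estimates would be computed for items a reviewer has not yet rated rather than for the held-out test pairs scored here.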


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
66148   0790732300     3391  A26JGAM6GZMM4V         1115  5.0  5.0   
68178   B004AUP3F8     5195   A3B3182H8G4VF        12878  5.0  5.0   
68175   B000LXGXY8     9565  A3KOWOXEUEHJX8         3186  5.0  5.0   
175864  6304618352     2453  A1ADL615ZYH2ZZ         3773  5.0  5.0   
213457  B00364K7AU      420  A1E7VTRDMI4XMV         6532  5.0  5.0   
68171   0800141660      512  A37FC6SI13C7ZG         9096  5.0  5.0   
68168   6300213730     2956   APNWI1W2D3HGH        13356  5.0  5.0   
68158   6301933532     8075  A1TN8INXTQN040         3830  5.0  5.0   
68123   B0002F6BTM     2041   AI71P00BG70FQ         7200  5.0  5.0   
68113   B000BITV1A     7983   A4WNCSJJY011A         7108  5.0  5.0   

                          details   Iu   Ui  err  
66148   {'was_impossible': False}  101  112  0.0  
68178   {'was_impossible': False}   25   29  0.0  
68175   {'was_impossible': False}   51   28  0.0  
1758

<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions

Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
110642  B00005JNJV      451  A1JU644REKXCNO         1311  1.0  5.0   
39171   B000E5LEXS      906  A3TY31K1VXAGSX         3522  1.0  5.0   
67997   6302168465      534  A2X5WF3FPJ8C9J         8107  1.0  5.0   
12513   B00007K027    23913  A192UPT6KST3E2          804  1.0  5.0   
11064   6304961693     2718   AK2FSJJ7GAN8B        10089  1.0  5.0   
132808  6300158764    11537  A3LRLY9YUNZAGS         4212  1.0  5.0   
53552   6302969794    10674  A29ND8RD38SEUZ         2548  1.0  5.0   
157725  B00005U0JX     5196  A19RHMNYUFAN4L        14167  1.0  5.0   
80945   6300216977     2118  A1RVXYN8QAXXNB        18177  1.0  5.0   
204198  6305492042     3565  A2N03TUL3EZZDD         6386  1.0  5.0   

                          details   Iu   Ui  err  
110642  {'was_impossible': False}   93  461  4.0  
39171   {'was_impossible': False}   51   46  4.0  
67997   {'was_impossible': False}   29  262  4.0  
125

#### SVDpp HPO using Grid Search - 20 Epochs More Parameters

<font size="4">Define the parameters for the grid search</font>



In [None]:
param_grid = {'n_epochs': [20],
              'n_factors': [10, 20, 30, 40, 50],
              'lr_all': [7e-6, 7e-5, 7e-4, 7e-3, 7e-2],
              'reg_all': [2e-4, 2e-3, 2e-2, 2e-1, 2e-0],
              'random_state': [seed_value]}
print('Grid search parameters:')
param_grid

Grid search parameters:


{'n_epochs': [20],
 'n_factors': [10, 20, 30, 40, 50],
 'lr_all': [7e-06, 7e-05, 0.0007, 0.007, 0.07],
 'reg_all': [0.0002, 0.002, 0.02, 0.2, 2.0],
 'random_state': [42]}

<font size="4">Run grid search with `rmse` and `mae` as the metrics. Then use the parameters that resulted in the lowest RMSE on the train/test sets.</font>



In [None]:
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])

Time for iterating grid search parameters..


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed: 103.0min
[Parallel(n_jobs=-1)]: Done 256 tasks      | elapsed: 791.0min


Finished iterating grid search parameters: 82490.94835019112


Lowest RMSE from Grid Search:
0.9487615218367708


Parameters of Model with lowest RMSE from Grid Search:
{'n_epochs': 20, 'n_factors': 10, 'lr_all': 0.007, 'reg_all': 0.02, 'random_state': 42}


[Parallel(n_jobs=-1)]: Done 375 out of 375 | elapsed: 1374.8min finished


In [None]:
# Save results to df
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)

print('SVDpp GridSearch HPO Cross Validation Results:')
print(results_df.head())
print('\n')
results_df.to_csv('SVDpp_gridSearch_cvResults_moreParams.csv', index=False)

del results_df

SVDpp GridSearch HPO Cross Validation Results:
    split0_test_rmse  split1_test_rmse  split2_test_rmse  mean_test_rmse  \
17          0.949363          0.950721          0.946200        0.948762   
18          0.951589          0.950517          0.947567        0.949891   
43          0.951702          0.950655          0.947704        0.950020   
68          0.951936          0.950781          0.947783        0.950167   
93          0.952030          0.950867          0.947989        0.950295   

    std_test_rmse  rank_test_rmse  split0_test_mae  split1_test_mae  \
17       0.001894               1         0.680528         0.681833   
18       0.001701               2         0.699633         0.698991   
43       0.001693               3         0.699909         0.699289   
68       0.001750               4         0.700300         0.699534   
93       0.001699               5         0.700566         0.699855   

    split2_test_mae  mean_test_mae  std_test_mae  rank_test_mae  \
17
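
The summary columns in the cross validation results can be reproduced from the per-fold scores. As a small check, the sketch below takes the three fold values printed for the top-ranked row and computes their mean and population standard deviation; that the standard deviation is the population form is an inference from the printed numbers.

```python
from statistics import fmean, pstdev

# Fold scores copied from the top-ranked row (index 17) of the table
fold_rmse = [0.949363, 0.950721, 0.946200]

mean_rmse = fmean(fold_rmse)
std_rmse = pstdev(fold_rmse)  # population standard deviation
print(round(mean_rmse, 6), round(std_rmse, 6))
```

Both values agree with `mean_test_rmse` and `std_test_rmse` in the table up to the rounding of the printed fold scores. The spread across folds (about 0.002) is larger than the gap between the top-ranked configurations, so their ordering should be read with some caution.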

<font size="4">Fit and predict with the best model, apply the functions, and save the prediction results</font>

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVDpp_bestGrid_Model_moreParams_file', predictions, algo)
#predictions, algo = dump.load('./SVDpp_bestGrid_Model_moreParams_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_SVDpp_gridSearch_moreParams.csv', index=False)

del df1

RMSE from fit best parameters on train predict on test:
RMSE: 0.9439
0.9438520213385112


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
199772  B00A6SZRT0    12824  A2FJGG9JNDP3VF         5707  5.0  5.0   
183867  B00003CXP0      807   AEE3T3U2ZLOT4         4328  5.0  5.0   
83935   B00VGQJLDO     2576   ANIQWW6NR8RLE          865  5.0  5.0   
83923   B01DPW1A0I     1240   A6OYYJP09OH7B         5238  5.0  5.0   
166321  B0000A2ZNX     1752   A76FNWB1C1GPV        12530  5.0  5.0   
14868   B00DS79HBU     9580  A24KSFA19N69US         1166  5.0  5.0   
122028  6304392907     3886  A1O3TR94QGSKT4          892  5.0  5.0   
33111   B0059XTU1S       45  A3HWBN95DDD7UV         2022  5.0  5.0   
33094   B007CZ3418    38840  A3F3TYKEHDA4YX         1772  5.0  5.0   
122030  B004ITYDT8    16620   A84CCE6K6RTEJ        14191  5.0  5.0   

                          details   Iu   Ui  err  
199772  {'was_impossible': False}   42   27  0.0  
183867  {'was_impossible': False}   46  196  0.0  
83935   {'was_impossible': False}  113   65  0.0  
8392

<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions

Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
98501   B00007ELF1     5532  A2EH9PUFVZGI52        10283  1.0  5.0   
12513   B00007K027    23913  A192UPT6KST3E2          804  1.0  5.0   
157725  B00005U0JX     5196  A19RHMNYUFAN4L        14167  1.0  5.0   
11064   6304961693     2718   AK2FSJJ7GAN8B        10089  1.0  5.0   
140176  B001AVCFJM      109  A1KIQ4P4ZW3ALF           38  1.0  5.0   
67222   079073463X     1319  A1WQNBG0RY9GU3        12174  1.0  5.0   
13116   6303686850     3483  A3N3Y4UJ07LOVG         3284  1.0  5.0   
147671  B000KCSWZS    96950  A3NONVSAUJAFXC          835  1.0  5.0   
119911  B00DMK6WN4     1915   ANHTCXAG77MBM         1331  1.0  5.0   
192547  B00EXPOCXY      139   AU0NV5KSJ02QZ         5214  1.0  5.0   

                          details   Iu   Ui  err  
98501   {'was_impossible': False}   27   34  4.0  
12513   {'was_impossible': False}  117    4  4.0  
157725  {'was_impossible': False}   25   29  4.0  
110

#### SVDpp HPO using Grid Search - 20 Epochs More Parameters Fewer Factors

<font size="4">Define the parameters for the grid search</font>



In [None]:
param_grid = {'n_epochs': [20],
              'n_factors': [5, 10, 15],
              'lr_all': [7e-4, 7e-3, 7e-2],
              'reg_all': [7e-2, 5e-2, 2e-2, 7e-1],
              'random_state': [seed_value]}
print('Grid search parameters:')
param_grid

Grid search parameters:


{'n_epochs': [20],
 'n_factors': [5, 10, 15],
 'lr_all': [0.0007, 0.007, 0.07],
 'reg_all': [0.07, 0.05, 0.02, 0.7],
 'random_state': [42]}
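
The size of this search can be checked before it is launched. The sketch below uses only the standard library to count the parameter combinations in the grid above and multiply them by the number of folds (`cv=3` in the grid search call). The names `combinations`, `n_combinations` and `n_fits` are illustrative and not part of Surprise.

```python
from itertools import product

# Same grid as above, with seed_value written out as 42.
param_grid = {'n_epochs': [20],
              'n_factors': [5, 10, 15],
              'lr_all': [7e-4, 7e-3, 7e-2],
              'reg_all': [7e-2, 5e-2, 2e-2, 7e-1],
              'random_state': [42]}

# Every combination of values is one candidate model.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
n_combinations = len(combinations)

# Each candidate is fit once per cross-validation fold.
n_folds = 3
n_fits = n_combinations * n_folds
print(n_combinations, 'combinations,', n_fits, 'fits')
```

The total of 108 fits agrees with the `Done 108 out of 108` line in the grid search log.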

<font size="4">Run the grid search with 3-fold cross-validation, using `rmse` and `mae` as the metrics. The parameters with the lowest mean cross-validated RMSE are then refit on the train set and evaluated on the test set.</font>



In [None]:
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])

Time for iterating grid search parameters..


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed: 87.7min


Finished iterating grid search parameters: 13742.905663013458


Lowest RMSE from Grid Search:
0.9454947143378393


Parameters of Model with lowest RMSE from Grid Search:
{'n_epochs': 20, 'n_factors': 5, 'lr_all': 0.007, 'reg_all': 0.05, 'random_state': 42}


[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 229.0min finished


In [None]:
# Save results to df
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)

print('SVDpp GridSearch HPO Cross Validation Results:')
print(results_df.head())
print('\n')
results_df.to_csv('SVDpp_gridSearch_cvResults_moreParamsLessFactors.csv',
                  index=False)

del results_df

SVDpp GridSearch HPO Cross Validation Results:
    split0_test_rmse  split1_test_rmse  split2_test_rmse  mean_test_rmse  \
5           0.948897          0.943507          0.944079        0.945495   
4           0.948880          0.943676          0.944297        0.945618   
16          0.948938          0.943844          0.944304        0.945695   
28          0.948962          0.943712          0.944634        0.945769   
17          0.949299          0.943932          0.944247        0.945826   

    std_test_rmse  rank_test_rmse  split0_test_mae  split1_test_mae  \
5        0.002417               1         0.684832         0.682043   
4        0.002321               2         0.686716         0.684125   
16       0.002301               3         0.686884         0.684269   
28       0.002289               4         0.686872         0.684129   
17       0.002459               5         0.685130         0.682168   

    split2_test_mae  mean_test_mae  std_test_mae  rank_test_mae  \
5 

<font size="4">Fit the best model on the train set and predict on the test set, apply the `get_Ir` and `get_Ri` functions, and save the prediction results</font>

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVDpp_bestGrid_Model_moreParams_file', predictions, algo)
#predictions, algo = dump.load('./SVDpp_bestGrid_Model_moreParams_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_SVDpp_gridSearch_moreParamsLessFactors.csv',
           index=False)

del df1

RMSE from fit best parameters on train predict on test:
RMSE: 0.9405
0.940543273193018
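
For reference, RMSE and MAE are simple functions of the true ratings (`rui`) and the estimates (`est`). The sketch below computes both on a handful of made-up pairs. The data and the helper names `rmse` and `mae` are assumptions for illustration, not the `surprise.accuracy` implementation.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((r - e) ** 2 for r, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(r - e) for r, e in pairs) / len(pairs)

# Made-up (rui, est) pairs, not taken from the notebook's predictions.
toy_pairs = [(5.0, 4.0), (3.0, 3.0), (1.0, 3.0), (4.0, 5.0)]

toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)
print('RMSE:', round(toy_rmse, 4), 'MAE:', round(toy_mae, 4))
```

Because the errors are squared before averaging, a single large miss raises RMSE more than it raises MAE, which is why the two metrics are reported side by side.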


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
194083  B0024396EW     1417  A1VTFZ4K6CNNN5         3613  5.0  5.0   
192688  B009A87WU4     2512  A2BMWC2S2KG7YG         6628  5.0  5.0   
42659   B0000ADXGA    23696  A2YW1GDT8MQB88         1066  5.0  5.0   
104939  B001AI774S     7867  A3TPHG1RZTBEGT         1505  5.0  5.0   
149062  B000TQZBNQ     9291  A1QWQF0CXTE41E          602  5.0  5.0   
42666   B00M4KXLCI     4268  A1JXNTVO0IEOOS         3437  5.0  5.0   
104935  6302320429    20848  A2YUA3H1LLU53Z           13  5.0  5.0   
192690  B00H5NY7O0     8889   AIH5LH0J6QXWG         3709  5.0  5.0   
213285  B006WN5W5M     1405  A1ZWI72HBYGUB4         9615  5.0  5.0   
42680   B004BLJQOK     2069  A2YUA3H1LLU53Z           13  5.0  5.0   

                          details    Iu   Ui  err  
194083  {'was_impossible': False}    50   80  0.0  
192688  {'was_impossible': False}    37   24  0.0  
42659   {'was_impossible': False}    97   19  0.0  
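
The best and worst predictions are the two ends of the table once it is ordered by absolute error. A minimal standard-library sketch of the same idea on made-up rows follows; the rows, the helper `abs_error` and the variable names are illustrative assumptions.

```python
import heapq

# Made-up prediction rows: (reviewer_id, item_id, rui, est).
rows = [(13, 2069, 5.0, 5.0),
        (38, 109, 1.0, 5.0),
        (804, 23913, 4.0, 3.5),
        (602, 9291, 2.0, 4.5),
        (10, 1683, 5.0, 4.8)]

def abs_error(row):
    """Absolute difference between the true rating and the estimate."""
    _, _, rui, est = row
    return abs(est - rui)

# Smallest errors first (best), largest errors first (worst).
best = heapq.nsmallest(2, rows, key=abs_error)
worst = heapq.nlargest(2, rows, key=abs_error)
print('Best:', best)
print('Worst:', worst)
```

The notebook gets the same selection by sorting the full DataFrame on `err` and slicing the first or last ten rows.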


<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions

Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
13116   6303686850     3483  A3N3Y4UJ07LOVG         3284  1.0  5.0   
220688  B00VNX5S96     2647   AR7TXBRR5M79V        16290  1.0  5.0   
22368   6305013578     2998  A1EUIGIJLKS6CZ        11682  1.0  5.0   
21561   B0051O0NHK    34867  A1KIQ4P4ZW3ALF           38  5.0  1.0   
12513   B00007K027    23913  A192UPT6KST3E2          804  1.0  5.0   
147671  B000KCSWZS    96950  A3NONVSAUJAFXC          835  1.0  5.0   
44833   B0000VCZK2      409  A3K89LA3W7YS1Q         5789  1.0  5.0   
190931  0792839072     1795   AZE4HR7KPHEX8         8522  1.0  5.0   
118238  B00OAIHHRC    12360   A7O05G42L82N9         8688  1.0  5.0   
147695  B00465I1BA     5276  A17R1QJA3HUPJ6         9038  1.0  5.0   

                          details   Iu   Ui  err  
13116   {'was_impossible': False}   54  101  4.0  
220688  {'was_impossible': False}   22   37  4.0  
22368   {'was_impossible': False}   24   84  4.0  

### SVD
Fit the model with the default parameters for 3 epochs, then examine the cross-validation metrics, the RMSE on the test set, and the predictions.

In [None]:
print('Train/predict using SVD default parameters for 3 epochs:')
print('\n')
print('Time for iterating through SVD default parameters..')
search_time_start = time.time()
algo = SVD(n_epochs=3, random_state=seed_value)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False,
                    n_jobs=-1)
print('Finished iterating through SVD default parameters:',
      time.time() - search_time_start)
print('\n')
print('Cross validation results:')
# Iterate over key/value pairs in cv results dict
for key, value in cv.items():
    print(key, ' : ', value)
print('\n')

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVD_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./SVD_3epochs_DefaultParamModel_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_SVD_DefaultParamModel.csv', index=False)

del df1

Train/predict using SVD default parameters for 3 epochs:


Time for iterating through SVD default parameters..
Finished iterating through SVD default parameters: 21.718221426010132


Cross validation results:
test_rmse  :  [1.01170944 1.01306478 1.00997506]
test_mae  :  [0.77944209 0.78081871 0.77882791]
fit_time  :  (9.072836875915527, 8.729201793670654, 8.824720859527588)
test_time  :  (2.9705753326416016, 3.0572056770324707, 2.9552011489868164)


RMSE from fit best parameters on train predict on test:
RMSE: 0.9995
0.9995495492030135
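
The per-fold scores printed above are easier to compare with other models once they are reduced to a mean and a spread. The sketch below does this with the standard library, using the fold values copied from the cross-validation output. The names `cv_scores` and `cv_summary` are illustrative.

```python
from statistics import mean, stdev

# Per-fold scores copied from the cross-validation output above.
cv_scores = {'test_rmse': [1.01170944, 1.01306478, 1.00997506],
             'test_mae': [0.77944209, 0.78081871, 0.77882791]}

# Mean and sample standard deviation across the three folds.
cv_summary = {name: (mean(scores), stdev(scores))
              for name, scores in cv_scores.items()}
for name, (avg, sd) in cv_summary.items():
    print(f'{name}: mean={avg:.4f}, std={sd:.4f}')
```

`stdev` is the sample standard deviation. `statistics.pstdev` gives the population version if that convention is preferred.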


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
0       B00YSG2ZPA        0  A37PPLULHT6CFQ         4911  5.0  5.0   
163930  B017RR6YJE     1286   AUDXDMFM49NGY         1279  5.0  5.0   
210635  6303365752      959  A35ZK3M8L9JUPX           55  5.0  5.0   
21967   0790729342     1683   AWG2O9C42XW5G           10  5.0  5.0   
195113  B000QGDJGK    11155  A3FL6CIO8QIJ2F          120  5.0  5.0   
2540    B00BNAE6M4     4244   AS2SJP1G389FE         1761  5.0  5.0   
163781  6300215628     1893  A3DTNUZALZDZPR          313  5.0  5.0   
116666  B00006JE59      407  A3LZBOBV9H1HDV          192  5.0  5.0   
54574   B00004Z4U9     1238   AXOZ5BWOEDL76          142  5.0  5.0   
163763  6304500831      526   AIHBP1TQ3JA86         8633  5.0  5.0   

                          details    Iu   Ui  err  
0       {'was_impossible': False}    41  334  0.0  
163930  {'was_impossible': False}    92  102  0.0  
210635  {'was_impossible': False}   605  186  0.0  
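
Many estimates in these tables are exactly 5.0 or 1.0. This is consistent with Surprise clipping estimates to the rating scale by default (the `clip=True` argument of `predict`). The sketch below illustrates the biased matrix-factorization estimate `mu + b_u + b_i + dot(q_i, p_u)` followed by clipping. All numeric values and the function name `svd_estimate` are hypothetical, and a 1 to 5 rating scale is assumed.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, rating_scale=(1.0, 5.0)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = rating_scale
    return min(high, max(low, raw))

# Hypothetical global mean, biases and 2-dimensional factor vectors.
inside = svd_estimate(mu=4.0, b_u=0.5, b_i=-1.0, p_u=[0.5, 0.0], q_i=[0.5, 1.0])
above = svd_estimate(mu=4.0, b_u=0.5, b_i=0.5, p_u=[0.5, 0.5], q_i=[1.0, 1.0])
below = svd_estimate(mu=4.0, b_u=-2.0, b_i=-1.5, p_u=[0.5, 0.0], q_i=[-1.0, 0.0])
print(inside, above, below)
```

A raw score above the scale is reported as 5.0 and one below it as 1.0, so an absolute error of 4.0 is the largest value the `err` column can take on this scale.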


<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions

Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui       est  \
217222  B0009AK57Y       92  A1IWR4YH4ZA9BM          246  1.0  4.974257   
21188   6305622825     3674    ASRB35ZZQZGB          291  1.0  4.989281   
83087   B0002Z16HY     1672   AHW9HY6U7MV1Y        17415  1.0  5.000000   
185192  B005X5XIF6      833   AWG2O9C42XW5G           10  1.0  5.000000   
98948   B000E5LEXS      906  A3TY31K1VXAGSX         3522  1.0  5.000000   
184013  6304618352     2453  A3D2VIUT2HWP0Z          439  1.0  5.000000   
201875  1562550888      608  A3B1IPR94VFG1K         3590  1.0  5.000000   
71015   0788832492     1558  A335JCRHG40DUK         4132  1.0  5.000000   
135268  B000M341QE      536   AWG2O9C42XW5G           10  1.0  5.000000   
17450   B00003CXIU     3035  A2JQJ7S0LOXH1G          243  5.0  1.000000   

                          details    Iu   Ui       err  
217222  {'was_impossible': False}   248  151  3.974257  
21188   {'was_impossible': False}   22

#### SVD HPO using Grid Search

<font size="4">Define the parameters for the grid search</font>



In [None]:
param_grid = {'n_epochs': [30, 35, 40, 45, 50, 55, 60, 65, 70],
              'n_factors': [20, 25, 30, 35, 40, 45, 50],
              'lr_all': [0.002, 0.003, 0.004, 0.005, 0.006, 0.007],
              'reg_all': [0.0001, 0.001, 0.01, 0.02, 0.03, 0.04],
              'random_state': [seed_value]}
print('SVD HPO Grid search parameters:')
param_grid

SVD HPO Grid search parameters:


{'n_epochs': [30, 35, 40, 45, 50, 55, 60, 65, 70],
 'n_factors': [20, 25, 30, 35, 40, 45, 50],
 'lr_all': [0.002, 0.003, 0.004, 0.005, 0.006, 0.007],
 'reg_all': [0.0001, 0.001, 0.01, 0.02, 0.03, 0.04],
 'random_state': [42]}

<font size="4">Run the grid search with 3-fold cross-validation, using `rmse` and `mae` as the metrics. The parameters with the lowest mean cross-validated RMSE are then refit on the train set and evaluated on the test set.</font>



In [None]:
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])

Time for iterating grid search parameters..


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 256 tasks      | elapsed: 18.0min
[Parallel(n_jobs=-1)]: Done 616 tasks      | elapsed: 48.1min
[Parallel(n_jobs=-1)]: Done 1120 tasks      | elapsed: 90.8min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed: 153.8min
[Parallel(n_jobs=-1)]: Done 2560 tasks      | elapsed: 238.1min
[Parallel(n_jobs=-1)]: Done 3496 tasks      | elapsed: 350.8min
[Parallel(n_jobs=-1)]: Done 4576 tasks      | elapsed: 501.9min
[Parallel(n_jobs=-1)]: Done 5800 tasks      | elapsed: 687.2min


Finished iterating grid search parameters: 51872.37151479721


Lowest RMSE from Grid Search:
0.9470577657248332


Parameters of Model with lowest RMSE from Grid Search:
{'n_epochs': 65, 'n_factors': 20, 'lr_all': 0.002, 'reg_all': 0.04, 'random_state': 42}


[Parallel(n_jobs=-1)]: Done 6804 out of 6804 | elapsed: 864.5min finished


In [None]:
# Save results to df
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)
print('SVD GridSearch HPO Cross Validation Results:')
print(results_df.head())
print('\n')
results_df.to_csv('SVD_gridSearch_cvResults.csv', index=False)

del results_df

SVD GridSearch HPO Cross Validation Results:
      split0_test_rmse  split1_test_rmse  split2_test_rmse  mean_test_rmse  \
1769          0.947837          0.947292          0.946044        0.947058   
515           0.947919          0.947282          0.946094        0.947098   
1517          0.947917          0.947291          0.946087        0.947098   
767           0.947869          0.947354          0.946098        0.947107   
17            0.947931          0.947284          0.946113        0.947109   

      std_test_rmse  rank_test_rmse  split0_test_mae  split1_test_mae  \
1769       0.000751               1         0.684882         0.684427   
515        0.000756               2         0.685845         0.685306   
1517       0.000760               3         0.685833         0.685298   
767        0.000744               4         0.684506         0.684084   
17         0.000752               5         0.685866         0.685321   

      split2_test_mae  mean_test_mae  std_test_

<font size="4">Fit the best model on the train set and predict on the test set, apply the `get_Ir` and `get_Ri` functions, and save the prediction results</font>

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
print('\n')
dump.dump('./SVD_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./SVD_bestGrid_Model_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_SVD_gridSearch.csv', index=False)

del df1

RMSE from fit best parameters on train predict on test:
RMSE: 0.9426
0.9426279412758342




<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
176329  B000068TSI      806  A12BXTSI4P73PA        16449  5.0  5.0   
13930   0792846133     1456   AT07UZQQR7ZEH         1410  5.0  5.0   
38632   6305807655    21418  A3LKO630Z5MC5P        11133  5.0  5.0   
197661  B00005JO2V     6502  A2P5WV8GFN1HQO         4184  5.0  5.0   
38643   6303042503      362  A1VIZFPC5FDDI5         4880  5.0  5.0   
97976   6304401132      577   AL7HW45VDAWUZ        10433  5.0  5.0   
97971   6300215695      435  A32R3MTF6UVLGE        14696  5.0  5.0   
38631   B001KX50AQ    33032  A3QMO4Z0U2R8P1         3128  5.0  5.0   
97969   B00BEIYP1W      918  A3LZBOBV9H1HDV          192  5.0  5.0   
197652  B000095J2Q    22792  A3NONVSAUJAFXC          835  5.0  5.0   

                          details   Iu   Ui  err  
176329  {'was_impossible': False}   18  206  0.0  
13930   {'was_impossible': False}   84  122  0.0  
38632   {'was_impossible': False}   27   15  0.0  

<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions

Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
12813   B00OV3VGP0       22  A107BIW9F8FHZK         6703  1.0  5.0   
114206  6303832563     4884  A2UKC1X32PBQZ9         1053  1.0  5.0   
113120  B00ID3TPA2    22390  A2KASTU01ARCF9         6928  1.0  5.0   
77728   B00753M8U0    23406  A31W21E1FQ8Q76         1563  1.0  5.0   
80302   B002HEXVUI       50  A1JTTJ7M7EC7Q7          705  1.0  5.0   
57297   B004X0MHEU     1821  A3G16L6SE9WQJW        18154  1.0  5.0   
121154  B000069HYZ    45548   ASFCEIBZOZFHF         2098  1.0  5.0   
99084   B00E0KWBE4     4451   AH4USMNP4MV4I        12373  1.0  5.0   
907     B003Y7TJXA    24721   AH06UFDUCQLUI         5443  1.0  5.0   
221274  630365147X     1178  A35E0VGWQKCADS         5108  1.0  5.0   

                          details   Iu   Ui  err  
12813   {'was_impossible': False}   36  739  4.0  
114206  {'was_impossible': False}  109   25  4.0  
113120  {'was_impossible': False}   34   11  4.0  

### BaselineOnly
Fit the model with the default parameters for 3 epochs using `method='als'`, then examine the cross-validation metrics and the RMSE on the test set.

In [None]:
print('Train/predict using Baseline default parameters with Alternating Least Squares for 3 epochs:')
print('\n')
bsl_options = {'method': 'als',
               'n_epochs': 3}
print('Baselines estimates configuration:')
print(bsl_options)
print('\n')
print('Model parameters:')
print(BaselineOnly(bsl_options=bsl_options))

Train/predict using Baseline default parameters with Alternating Least Squares for 3 epochs:


Baselines estimates configuration:
{'method': 'als', 'n_epochs': 3}


Model parameters:
<surprise.prediction_algorithms.baseline_only.BaselineOnly object at 0x000002F2AB988C40>
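
`BaselineOnly` predicts each rating as the global mean plus a user bias and an item bias. With `method='als'`, the biases are estimated by alternating between the two sets while the other is held fixed. The sketch below is a plain-Python illustration of that alternating scheme on made-up ratings. It is not Surprise's implementation, the function name `als_baselines` is hypothetical, and the regularization values `reg_u=15` and `reg_i=10` are assumed to match the library defaults.

```python
from collections import defaultdict

def als_baselines(ratings, n_epochs=3, reg_u=15, reg_i=10):
    """Estimate the global mean and user/item biases by alternating updates."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    b_u = {u: 0.0 for u in by_user}
    b_i = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Update item biases with the user biases held fixed.
        for i, rated in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rated) / (reg_i + len(rated))
        # Update user biases with the item biases held fixed.
        for u, rated in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rated) / (reg_u + len(rated))
    return mu, b_u, b_i

# Made-up (user, item, rating) triples.
toy_ratings = [('u1', 'i1', 5.0), ('u1', 'i2', 4.0),
               ('u2', 'i1', 3.0), ('u2', 'i2', 1.0), ('u3', 'i1', 5.0)]
mu, b_u, b_i = als_baselines(toy_ratings)
# Baseline estimate for a (user, item) pair: mu + b_u + b_i.
print(round(mu + b_u['u1'] + b_i['i1'], 3))
```

Larger `reg_u` and `reg_i` values shrink the biases toward zero, which matters most for users and items with few ratings.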


<font size="4">Run 3-fold cross-validation, then fit the model on the train set and predict on the test set, apply the `get_Ir` and `get_Ri` functions, and save the prediction results</font>

In [None]:
print('Time for iterating through Baseline default parameters epochs=3 using ALS..')
search_time_start = time.time()
algo = BaselineOnly(bsl_options=bsl_options)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False,
                    n_jobs=-1)
print('Finished iterating through Baseline default parameters epochs=3 using ALS:',
      time.time() - search_time_start)
print('\n')
print('Cross validation results:')
# Iterate over key/value pairs in cv results dict
for key, value in cv.items():
    print(key, ' : ', value)

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./Baseline_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./Baseline_3epochs_DefaultParamModel_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_Baseline_DefaultParamModel.csv', index=False)

del df1

Time for iterating through Baseline default parameters epochs=3 using ALS..
Finished iterating through Baseline default parameters epochs=3 using ALS: 18.00507688522339


Cross validation results:
test_rmse  :  [0.96236564 0.9575281  0.95800474]
test_mae  :  [0.7192956  0.71624087 0.71744233]
fit_time  :  (0.45334315299987793, 0.5258574485778809, 0.4615175724029541)
test_time  :  (3.44765043258667, 3.460092782974243, 3.342712163925171)
Estimating biases using als...
RMSE from fit best parameters on train predict on test:
RMSE: 0.9532
0.9532457019820773


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
22719   6303824358      652  A3UGURGJMNRSH5         6615  5.0  5.0   
154698  6303521231     3897   ARMESRCOD7ZRV         5721  5.0  5.0   
190758  B00IV3FLO8      504  A3CXEYHXHR5RSE         1293  5.0  5.0   
86407   7799128836      123  A3TGQ0Y8ILUGBP         9774  5.0  5.0   
154699  0792834917     3899   A8KI69T8XEYH8        11191  5.0  5.0   
215986  B001KVZ6HK      489  A2YUA3H1LLU53Z           13  5.0  5.0   
51302   6302799112     7144  A18EXIAW8NU3DP          764  5.0  5.0   
66610   B0013LRKRQ     8910  A1QNLOV3985VUP         2518  5.0  5.0   
51298   6301966554      107  A2BVCF2AKHHCDF         3942  5.0  5.0   
51296   B001ATWK2Q    14068  A12CN8FQSR18H9          452  5.0  5.0   

                          details    Iu   Ui  err  
22719   {'was_impossible': False}    32  124  0.0  
154698  {'was_impossible': False}    36   28  0.0  
190758  {'was_impossible': False}    92   72  0.0  


<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions

Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
214948  B005BYBZCM    14976   APSOSLICIDIFX         4500  1.0  5.0   
118018  B00AMSLDW4     1834  A157SI9J9ECKYH         1666  1.0  5.0   
110642  B00005JNJV      451  A1JU644REKXCNO         1311  1.0  5.0   
108646  B0046ZT40W       20  A15XO0U10UDT8O         8302  1.0  5.0   
125152  B000BGQSEA     1999   AWG2O9C42XW5G           10  1.0  5.0   
33420   0800141709      752  A28YE81E63ZOZT         4933  1.0  5.0   
44833   B0000VCZK2      409  A3K89LA3W7YS1Q         5789  1.0  5.0   
29490   B01D9EUNBY     2691  A3FWD37OUCOQDL          778  1.0  5.0   
119911  B00DMK6WN4     1915   ANHTCXAG77MBM         1331  1.0  5.0   
39171   B000E5LEXS      906  A3TY31K1VXAGSX         3522  1.0  5.0   

                          details    Iu   Ui  err  
214948  {'was_impossible': False}    48   10  4.0  
118018  {'was_impossible': False}    77   97  4.0  
110642  {'was_impossible': False}    93  461  4.0  

#### BaselineOnly HPO using Grid Search

<font size="4">Define the parameters for the grid search</font>



In [None]:
print('BaselineOnly HPO using Grid Search Minimized:')
print('\n')
param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                             'n_epochs': [5, 10, 15, 20],
                             'reg_u': [5, 10, 15, 20],
                             'reg_i': [5, 10, 15, 20]}}
print('Grid search parameters:')
param_grid

BaselineOnly HPO using Grid Search Minimized:


Grid search parameters:


{'bsl_options': {'method': ['als', 'sgd'],
  'n_epochs': [5, 10, 15, 20],
  'reg_u': [5, 10, 15, 20],
  'reg_i': [5, 10, 15, 20]}}

<font size="4">Run the grid search with 3-fold cross-validation, using `rmse` and `mae` as the metrics. The parameters with the lowest mean cross-validated RMSE are then refit on the train set and evaluated on the test set.</font>



In [None]:
gs = GridSearchCV(BaselineOnly, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])

Time for iterating grid search parameters..


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:   45.3s
[Parallel(n_jobs=-1)]: Done 256 tasks      | elapsed:  4.5min


Finished iterating grid search parameters: 410.2233831882477


Lowest RMSE from Grid Search:
0.9453093755322367


Parameters of Model with lowest RMSE from Grid Search:
{'bsl_options': {'method': 'als', 'n_epochs': 20, 'reg_u': 5, 'reg_i': 5}}


[Parallel(n_jobs=-1)]: Done 384 out of 384 | elapsed:  6.7min finished
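
When the best value of a parameter sits on the edge of its grid, the optimum may lie outside the range that was searched. The sketch below flags such parameters for the result above. `boundary_params` is a hypothetical helper, not part of Surprise, and it only inspects numeric options.

```python
def boundary_params(best, grid):
    """Return the numeric parameters whose best value is the smallest or largest tried."""
    flagged = {}
    for name, value in best.items():
        tried = grid[name]
        if isinstance(value, (int, float)) and len(tried) > 1:
            if value == min(tried):
                flagged[name] = 'lower edge'
            elif value == max(tried):
                flagged[name] = 'upper edge'
    return flagged

# Options searched and the best options reported above.
bsl_grid = {'method': ['als', 'sgd'],
            'n_epochs': [5, 10, 15, 20],
            'reg_u': [5, 10, 15, 20],
            'reg_i': [5, 10, 15, 20]}
best_bsl = {'method': 'als', 'n_epochs': 20, 'reg_u': 5, 'reg_i': 5}

edges = boundary_params(best_bsl, bsl_grid)
print(edges)
```

All three numeric options land on an edge here, which is a reasonable motivation for widening the ranges in a follow-up search.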


<font size="4">Fit the best model on the train set and predict on the test set, apply the `get_Ir` and `get_Ri` functions, and save the prediction results</font>

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./BaselineOnly_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./BaselineOnly_bestGrid_Model_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_BaselineOnly_gridSearch.csv', index=False)

del df1

Estimating biases using als...
RMSE from fit best parameters on train predict on test:
RMSE: 0.9418
0.9418294572167315


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
191694  B016WU7GOO     2578   AWAZ7J9JXGBR0        10879  5.0  5.0   
50481   B005LAJ23A      300  A1JOXRFELGRLJL        10448  5.0  5.0   
12633   7799437278     1087   AL0G3NEW9J1JA         9398  5.0  5.0   
82689   B00FDJGL6A      306  A30L7BXXX0Q6AJ          869  5.0  5.0   
82686   0792151712      270  A3OX0WG51XGG4X         4469  5.0  5.0   
174741  6303853102     2020  A3RQ0VN4KUKH8G         1283  5.0  5.0   
25555   B000GDH8HE     7980   AJ3LYAOVG1RID         6659  5.0  5.0   
82667   6302158095     2305  A2M0FW9AFTG72V          464  5.0  5.0   
82663   0790744236     4373  A1Q9Q0EAC3TB02          164  5.0  5.0   
174729  B00X7HEIB0     8362  A2QJP924WOJ1RM          484  5.0  5.0   

                          details   Iu   Ui  err  
191694  {'was_impossible': False}   28   30  0.0  
50481   {'was_impossible': False}   28  190  0.0  
12633   {'was_impossible': False}   31   98  0.0  

<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions

Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
33420   0800141709      752  A28YE81E63ZOZT         4933  1.0  5.0   
108646  B0046ZT40W       20  A15XO0U10UDT8O         8302  1.0  5.0   
11064   6304961693     2718   AK2FSJJ7GAN8B        10089  1.0  5.0   
118238  B00OAIHHRC    12360   A7O05G42L82N9         8688  1.0  5.0   
56860   B004LROMSY     3181  A1KMF9130CHALE         2159  5.0  1.0   
17953   7885887014    10529  A1ZNCIO851MBOS        11474  1.0  5.0   
199058  6304401132      577   A3VD2H76PZOI6        10055  1.0  5.0   
11370   6303610811     1823   AR8JRW08OOEM6         1747  1.0  5.0   
198775  1572522232     2632   AWG2O9C42XW5G           10  1.0  5.0   
99311   B000NOIX6G    12926  A3VHB8OAQQF9W6         1094  1.0  5.0   

                          details    Iu   Ui  err  
33420   {'was_impossible': False}    43  209  4.0  
108646  {'was_impossible': False}    32   53  4.0  
11064   {'was_impossible': False}    28   90  4.0  

#### BaselineOnly HPO using Grid Search - More Params

In [None]:
print('BaselineOnly HPO using Grid Search Minimized:')
print('\n')
param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                             'n_epochs': [10, 15, 20, 25, 30, 35, 40, 45, 50],
                             'reg_u': [0, 1, 2, 3, 4, 5, 10, 15, 20],
                             'reg_i': [0, 1, 2, 3, 4, 5, 10, 15, 20]}}
print('Grid search parameters:')
param_grid

BaselineOnly HPO using Grid Search Minimized:


Grid search parameters:


{'bsl_options': {'method': ['als', 'sgd'],
  'n_epochs': [10, 15, 20, 25, 30, 35, 40, 45, 50],
  'reg_u': [0, 1, 2, 3, 4, 5, 10, 15, 20],
  'reg_i': [0, 1, 2, 3, 4, 5, 10, 15, 20]}}

<font size="4">Run the grid search with 3-fold cross-validation, using `rmse` and `mae` as the metrics. The parameters with the lowest mean cross-validated RMSE are then refit on the train set and evaluated on the test set.</font>



In [None]:
gs = GridSearchCV(BaselineOnly, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])

Time for iterating grid search parameters..


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:   45.6s
[Parallel(n_jobs=-1)]: Done 256 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done 616 tasks      | elapsed: 10.5min
[Parallel(n_jobs=-1)]: Done 1120 tasks      | elapsed: 19.6min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed: 32.1min
[Parallel(n_jobs=-1)]: Done 2560 tasks      | elapsed: 46.8min
[Parallel(n_jobs=-1)]: Done 3496 tasks      | elapsed: 63.2min


Finished iterating grid search parameters: 4817.081090688705


Lowest RMSE from Grid Search:
0.9429468447894139


Parameters of Model with lowest RMSE from Grid Search:
{'bsl_options': {'method': 'als', 'n_epochs': 50, 'reg_u': 1, 'reg_i': 3}}


[Parallel(n_jobs=-1)]: Done 4374 out of 4374 | elapsed: 80.2min finished


<font size="4">Fit the best model on the train set and predict on the test set, apply the `get_Ir` and `get_Ri` functions, and save the prediction results</font>

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./BaselineOnly_moreParams_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./BaselineOnly_moreParams_bestGrid_Model_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_BaselineOnly_moreParams_gridSearch.csv', index=False)

del df1

Estimating biases using als...
RMSE from fit best parameters on train predict on test:
RMSE: 0.9400
0.9400458413742169
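
The held-out RMSE values printed so far in this part of the notebook can be collected and ranked. In the sketch below the scores are copied from the outputs above. The dictionary keys are descriptive labels chosen here, and only the models shown in this part are included.

```python
# Held-out test RMSE values copied from the outputs in this section.
test_rmse = {'SVDpp grid search (more parameters, fewer factors)': 0.940543273193018,
             'SVD default, 3 epochs': 0.9995495492030135,
             'SVD grid search': 0.9426279412758342,
             'BaselineOnly default, 3 epochs (ALS)': 0.9532457019820773,
             'BaselineOnly grid search': 0.9418294572167315,
             'BaselineOnly grid search (more params)': 0.9400458413742169}

# Rank the models from lowest to highest RMSE.
ranking = sorted(test_rmse.items(), key=lambda item: item[1])
for name, score in ranking:
    print(f'{score:.4f}  {name}')
```

The tuned models differ by less than 0.003 RMSE, so fit time is worth weighing alongside accuracy: the BaselineOnly searches finished in minutes, while the SVD and SVDpp searches took hours.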


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
111339  B0053O89WY      757  A10EU5IAKELBDI        14872  5.0  5.0   
204037  B000YDMPCO     7653   A551XY0L7WQXU         1218  5.0  5.0   
74750   B00U38W0X4    18648  A227P36VP2YMFJ        12389  5.0  5.0   
158992  B00FDJGL6A      306  A16J08RGNPYXW6        11073  5.0  5.0   
204044  B00005JKZU      366  A36GQV2C4O0DXE         5800  5.0  5.0   
204054  B000059H6T     2016   AP40TY8769CBM        16798  5.0  5.0   
158983  B00005BJWN    13388  A3ESQSCCM2IR24         7591  5.0  5.0   
158982  B00S89IAWU    20893  A2545BZKJ6FLF6        16037  5.0  5.0   
204055  B0012Z36FI     4689   AWD6SR6I52C5C          476  5.0  5.0   
158978  B00D91GRA4       87  A2RC58KZZC41QW         2180  5.0  5.0   

                          details   Iu   Ui  err  
111339  {'was_impossible': False}   19  107  0.0  
204037  {'was_impossible': False}   98   32  0.0  
74750   {'was_impossible': False}   29   17  0.0  
1589

<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions

Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
162516  B000VS20M2      148  A1GFX8FS58ZDHR         6479  1.0  5.0   
220011  B01DPPMPSG     6988  A30BQKQU5XGGS1        15523  1.0  5.0   
126619  6305627401     1667   AJ32DH3NMIES9         5904  1.0  5.0   
99713   6302707722    65249  A1JFJFHKG8Y7LB         3612  1.0  5.0   
99311   B000NOIX6G    12926  A3VHB8OAQQF9W6         1094  1.0  5.0   
13116   6303686850     3483  A3N3Y4UJ07LOVG         3284  1.0  5.0   
54348   B01D5MQQ2A    61254  A3AEZXERB7JP4R        11448  1.0  5.0   
108646  B0046ZT40W       20  A15XO0U10UDT8O         8302  1.0  5.0   
26866   B00005T33K     2680  A28UFH7QJBZZMU         4268  1.0  5.0   
96274   B000BKVKSA     7802  A10175AMUHOQC4          458  1.0  5.0   

                          details   Iu   Ui  err  
162516  {'was_impossible': False}   40  450  4.0  
220011  {'was_impossible': False}   19   21  4.0  
126619  {'was_impossible': False}   36  105  4.0  
997

### KNNBaseline
Fit model with default parameters for 3 epochs using `method='als'` and examine RMSE on train/test sets

In [None]:
print('Train/predict using KNNBaseline default parameters with Alternating Least Squares for 3 epochs:')
print('\n')
bsl_options = {'method': 'als',
               'n_epochs': 3}
print('Baselines estimates configuration:')
print(bsl_options)
print('\n')
print('Model parameters:')
print('KNNBaseline(k=40, min_k=1, bsl_options=bsl_options, verbose=False)')

Train/predict using KNNBaseline default parameters with Alternating Least Squares for 3 epochs:


Baselines estimates configuration:
{'method': 'als', 'n_epochs': 3}


Model parameters:
KNNBaseline(k=40, min_k=1, bsl_options=bsl_options, verbose=False)
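
To make the roles of `k` and `min_k` concrete, here is a simplified sketch of the prediction rule that `KNNBaseline` is documented to use: a baseline estimate plus a similarity-weighted average of the neighbours' deviations from their own baselines. The function name, the tuple layout and the numbers are assumptions made for illustration; this is not the library's code, and clipping to the rating scale is omitted.

```python
import heapq


def knn_baseline_estimate(baseline_ui, neighbors, k=40, min_k=1):
    """Baseline plus similarity-weighted mean of neighbour deviations.

    neighbors: list of (similarity, rating, neighbour_baseline) tuples.
    """
    # Keep the k most similar neighbours
    top_k = heapq.nlargest(k, neighbors, key=lambda t: t[0])
    # Only positively similar neighbours contribute
    used = [(s, r, b) for s, r, b in top_k if s > 0]
    if len(used) < min_k or not used:
        return baseline_ui  # too few neighbours: fall back to the baseline
    weighted = sum(s * (r - b) for s, r, b in used)
    return baseline_ui + weighted / sum(s for s, _, _ in used)


# Hypothetical neighbours: (similarity, rating, baseline)
neighbors = [(0.9, 5.0, 4.0), (0.5, 3.0, 3.5), (-0.2, 1.0, 3.0)]
estimate = knn_baseline_estimate(3.8, neighbors, k=2)
fallback = knn_baseline_estimate(3.8, neighbors, k=2, min_k=3)
print(round(estimate, 4), fallback)
```

With a larger `min_k`, more predictions fall back to the pure baseline, which trades neighbourhood detail for stability when few neighbours are available.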


<font size="4">Cross-validate the default model, then fit it on the train set, predict on the test set, apply the helper functions and save the prediction results</font>

In [None]:
print('Time for iterating through KNNBaseline default parameters epochs=3 using ALS..')
search_time_start = time.time()
algo = KNNBaseline(bsl_options=bsl_options)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False,
                    n_jobs=-1)
print('Finished iterating through KNNBaseline default parameters epochs=3 using ALS:',
      time.time() - search_time_start)
print('\n')
print('Cross validation results:')
# Iterate over key/value pairs in cv results dict
for key, value in cv.items():
    print(key, ' : ', value)

Time for iterating through KNNBaseline default parameters epochs=3 using ALS..
Finished iterating through KNNBaseline default parameters epochs=3 using ALS: 52.3457145690918


Cross validation results:
test_rmse  :  [0.98871454 0.98392946 0.98468003]
test_mae  :  [0.70193266 0.6991709  0.69981287]
fit_time  :  (12.814245462417603, 12.74767255783081, 12.50976276397705)
test_time  :  (27.62490439414978, 28.00885558128357, 28.005619525909424)
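
The per-fold arrays are easier to compare across models as a mean and a spread. The sketch below summarizes the fold scores copied from the output above; the helper name `summarize` is an assumption introduced here.

```python
from statistics import mean, stdev

# Per-fold scores copied from the cross-validation output above
test_rmse = [0.98871454, 0.98392946, 0.98468003]
test_mae = [0.70193266, 0.6991709, 0.69981287]


def summarize(scores):
    """Return (mean, sample standard deviation) of the fold scores."""
    return mean(scores), stdev(scores)


rmse_mean, rmse_sd = summarize(test_rmse)
mae_mean, mae_sd = summarize(test_mae)
print(f'RMSE: {rmse_mean:.4f} +/- {rmse_sd:.4f}')
print(f'MAE:  {mae_mean:.4f} +/- {mae_sd:.4f}')
```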


In [None]:
predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./KNNBaseline_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./KNNBaseline_3epochs_DefaultParamModel_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_KNNBaseline_DefaultParamModel.csv', index=False)

del df1

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE from fit best parameters on train predict on test:
RMSE: 0.9639
0.9638924023238667


<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
82292   B002ZG97EC      599  A29TWAFI927A05         7359  5.0  5.0   
66702   B016LGTB1A     1054  A3UCR3ANY81Y3X        10988  5.0  5.0   
199042  B00000JQU8     4924   AHD101501WCN1         3797  5.0  5.0   
66720   B000ANDBTY     5079   ALWB64XOXNMDP          305  5.0  5.0   
66728   B00OJ0X41E     1670  A3HYD0KTB8XKPQ        13455  5.0  5.0   
66730   B001EHDSRK    10371   AR0BHE8D3J5LD          567  5.0  5.0   
199038  B007K7IC6K    20933  A35X4WCJ3K67BY         3490  5.0  5.0   
66697   B0066E6RWE    33621   A1UJPR7G1TX5D        16570  5.0  5.0   
66734   6304493703    10521  A386NVAVQV5WUO         4546  5.0  5.0   
66745   B0000C508X    32981  A1XEVKJR5FBBSK        12943  5.0  5.0   

                                          details   Iu   Ui  err  
82292   {'actual_k': 40, 'was_impossible': False}   36  104  0.0  
66702   {'actual_k': 40, 'was_impossible': False}   26  257  0.0  
199042 

<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions

Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
50924   B000UYN9PY    41193  A2POT13FER5L82         1518  5.0  1.0   
162946  B00AZMFG48     3768  A1NFCJXNYVJ0O1         5015  1.0  5.0   
64718   B00LT1JHLW     2240  A1QWGK4SXTRXMD         3387  1.0  5.0   
121979  630249933X    46077  A2WHWME9PTOS0E        14782  1.0  5.0   
87343   6304329008     6536  A30NXBU3CDNDFR         5468  1.0  5.0   
1375    B0009UVCO4    55298  A141HP4LYPWMSR           76  5.0  1.0   
37276   B00435KP44    22619  A347I00XTPNIUP         3867  5.0  1.0   
128317  B00000IYR0    24076  A1POCQWA3VAY8T        17696  1.0  5.0   
21647   B000JMKJPK    44655   AETEH7SOTOFRT        11177  5.0  1.0   
7373    B000QCUZ7U    28629   AJTQS6KM9ZV5L         1420  1.0  5.0   

                                          details   Iu  Ui  err  
50924    {'actual_k': 1, 'was_impossible': False}   85   5  4.0  
162946   {'actual_k': 6, 'was_impossible': False}   43  21  4.0  
64718   {

#### KNNBaseline HPO using Grid Search

<font size="4">Define the parameters for the grid search</font>



In [None]:
print('KNNBaseline HPO using Grid Search:')
print('\n')
param_grid = {'bsl_options': {'method': ['als', 'sgd']},
              'k': [30, 35, 40, 45, 50],
              'min_k': [20, 25],
              'random_state': [seed_value],
              'sim_options': {'name': ['pearson_baseline'],
                              'min_support': [5, 10],
                              'shrinkage': [0, 100]}}
print('Grid search parameters:')
param_grid

KNNBaseline HPO using Grid Search:


Grid search parameters:


{'bsl_options': {'method': ['als', 'sgd']},
 'k': [30, 35, 40, 45, 50],
 'min_k': [20, 25],
 'random_state': [42],
 'sim_options': {'name': ['pearson_baseline'],
  'min_support': [5, 10],
  'shrinkage': [0, 100]}}
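
The size of this grid determines the run time, so it helps to count the combinations before starting the search. The sketch below expands the printed grid, including the nested `bsl_options` and `sim_options` dictionaries, and multiplies by the number of folds. The `expand` helper is an assumption written for this illustration and only mirrors how the count arises; it is not Surprise's internal implementation.

```python
from itertools import product

# Grid as printed above
param_grid = {
    'bsl_options': {'method': ['als', 'sgd']},
    'k': [30, 35, 40, 45, 50],
    'min_k': [20, 25],
    'random_state': [42],
    'sim_options': {'name': ['pearson_baseline'],
                    'min_support': [5, 10],
                    'shrinkage': [0, 100]},
}


def expand(grid):
    """Return every parameter combination of a possibly nested grid."""
    keys, value_lists = [], []
    for key, values in grid.items():
        keys.append(key)
        # A nested dict contributes one option per combination of its own grid
        value_lists.append(expand(values) if isinstance(values, dict) else values)
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


combos = expand(param_grid)
n_folds = 3
print(len(combos), 'combinations,', len(combos) * n_folds, 'fits')
```

The 80 combinations times 3 folds give 240 fits, which agrees with the `240 out of 240` task count reported in the grid search log.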

<font size="4">Run grid search with `rmse` and `mae` as the metrics. Then use the parameters that resulted in the lowest RMSE on the train/test sets</font>



In [None]:
gs = GridSearchCV(KNNBaseline, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=5)
print('Start time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Model with lowest RMSE:')
print(gs.best_score['rmse'])
print('\n')
# Parameters with the lowest RMSE
print('Parameters with the lowest RMSE:')
print(gs.best_params['rmse'])

Start time for iterating grid search parameters..


[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done  62 tasks      | elapsed: 11.1min


Finished iterating grid search parameters: 2447.833321094513


Model with lowest RMSE:
0.9453110845949882


Parameters with the lowest RMSE:
{'bsl_options': {'method': 'sgd'}, 'k': 40, 'min_k': 20, 'random_state': 42, 'sim_options': {'name': 'pearson_baseline', 'min_support': 5, 'shrinkage': 100, 'user_based': True}}


[Parallel(n_jobs=5)]: Done 240 out of 240 | elapsed: 40.7min finished


In [None]:
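# NOTE: `algo` still refers to the default-parameter KNNBaseline defined earlier
# (ALS baselines, msd similarity). The grid-search estimator is only assigned in
# the next cell, so this RMSE is not from the best grid-search parameters.
predictions = algo.fit(train).test(test)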
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
print('\n')
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)
print('KNNBaseline GridSearch HPO Cross Validation Results:')
print(results_df.head())
results_df.to_csv('KNNBaseline_gridSearch_cvResults.csv', index=False)

del results_df

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE from fit best parameters on train predict on test:
RMSE: 0.9661
0.9661408444802692


KNNBaseline GridSearch HPO Cross Validation Results:
    split0_test_rmse  split1_test_rmse  split2_test_rmse  mean_test_rmse  \
57          0.948154          0.943403          0.944376        0.945311   
49          0.948146          0.943407          0.944382        0.945312   
41          0.948147          0.943410          0.944380        0.945313   
65          0.948156          0.943407          0.944381        0.945315   
73          0.948159          0.943412          0.944383        0.945318   

    std_test_rmse  rank_test_rmse  split0_test_mae  split1_test_mae  \
57       0.002049               1         0.684004         0.681184   
49       0.002044               2         0.683991         0.681176   
41       0.002043               3         0.683976         0.681167   
65       0.0

<font size="4">Fit the best model on the train set, predict on the test set, apply the helper functions and save the prediction results</font>

In [None]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
print('\n')
dump.dump('./KNNBaseline_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./KNNBaseline_bestGrid_Model_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_KNNBaseline_gridSearch.csv', index=False)

del df1

Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE from fit best parameters on train predict on test:
RMSE: 0.9379
0.9378743356689234




<font size="4">Find the best predictions</font>

In [None]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)

Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
90940   B00O2IZPD8      211  A2Q9961GP0RMBT         4608  5.0  5.0   
159489  6303686761    14836   AZIV57ZFSUYI6        18439  5.0  5.0   
134670  B000063EME     3672   AN49MA84I1W13        12176  5.0  5.0   
41605   B00C888NFQ       90  A2EQRK5VKWY4UL         5787  5.0  5.0   
214781  B00FEP9PQG      188   AKWS79G5WUYUE        16444  5.0  5.0   
14334   079073463X     1319  A3MQAQT8C6D1I7         7397  5.0  5.0   
41607   B000UFIYQ2     8628  A1QWQF0CXTE41E          602  5.0  5.0   
200020  B000CDSS18    17834  A2XOH1PC423E2S         8201  5.0  5.0   
41613   B00FR23GPW     7599   AJC05Q1I8IG0U         8629  5.0  5.0   
91520   B00HEPCRLE      874  A12XYDOWQA4H5B         6810  5.0  5.0   

                                          details   Iu   Ui  err  
90940    {'actual_k': 6, 'was_impossible': False}   48  231  0.0  
159489   {'actual_k': 2, 'was_impossible': False}   17   26  0.0  
134670 

<font size="4">Find the worst predictions</font>

In [None]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions

Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
199613  B00005JO20      733  A167KI3P7XN1AM         2527  1.0  5.0   
111347  B000J10F8C    20835  A33RN6T49VEFUO          302  1.0  5.0   
128997  B019T8QBR4      700  A2BXZJPLOTMI5A         7822  1.0  5.0   
13429   6304961693     2718   AK2FSJJ7GAN8B        10089  1.0  5.0   
135680  B00007K027    23913  A192UPT6KST3E2          804  1.0  5.0   
105987  B00FF9SKSK      388   AE35V0M480BBV         9570  1.0  5.0   
145506  6304178360     8074   AWG2O9C42XW5G           10  1.0  5.0   
44675   B00008MTW7     4328   AXOIYNI70LPN1         5849  1.0  5.0   
64715   B00G3D732Q      100  A3JDBLJIN704EW         3907  1.0  5.0   
201633  B0009PW4D2     1842  A2JHYW5V7UFIQ2         3958  5.0  1.0   

                                          details    Iu   Ui  err  
199613   {'actual_k': 5, 'was_impossible': False}    65  417  4.0  
111347   {'actual_k': 0, 'was_impossible': False}   218    5  4.0  
128

## RecSys Methods without Surprise
### Create the rating matrix with items and reviewers

### Create SVD Based Recommendation System using SciPy
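<font size="4">For the training of the SVD based models (labelled `SciPy SVD` in the printed output, although `randomized_svd` is imported from `sklearn.utils.extmath`):
A rating matrix with reviewers in the rows and items in the columns was constructed and transposed into `X`. The decomposition was computed with `U, sigma, Vt = randomized_svd(X, n_components, n_oversamples, random_state)`. The singular values in `sigma` were placed on a diagonal matrix, the ratings were reconstructed with `np.dot(np.dot(U, sigma), Vt)`, and an RMSE was calculated from the column-wise mean actual and predicted ratings rather than from individual ratings. </font>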
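
As a small, self-contained illustration of the low-rank idea, the sketch below reconstructs a toy rating matrix from its two largest singular triplets and measures the entry-wise RMSE on the observed ratings only. The matrix is hypothetical, the helper name is an assumption, and the exact `np.linalg.svd` is swapped in for `randomized_svd` because the example is tiny.

```python
import numpy as np

# Hypothetical 4 reviewers x 5 items rating matrix; 0 marks "not rated"
R = np.array([[5, 4, 0, 1, 0],
              [4, 5, 0, 0, 1],
              [0, 1, 5, 4, 0],
              [1, 0, 4, 5, 0]], dtype=float)


def low_rank_reconstruction(matrix, n_components):
    """Rank-k approximation built from the k largest singular triplets."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    k = n_components
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


R_hat = low_rank_reconstruction(R, n_components=2)

# Entry-wise RMSE restricted to the ratings that were actually observed
observed = R > 0
rmse_observed = np.sqrt(np.mean((R[observed] - R_hat[observed]) ** 2))
print(R_hat.shape, round(float(rmse_observed), 4))
```

The reconstruction keeps the shape and orientation of its input, so a matrix that was transposed before decomposition has to be transposed back before rows are read as reviewers. An error computed per observed entry is also on a different scale from one computed on column averages, so the two should not be compared directly.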

In [None]:
# Create user-item matrix & transpose the utility matrix
ratingsMat = df.pivot_table(values='rating', index='reviewer_id',
                            columns='item_id', fill_value=0)
X = ratingsMat.values.T
X.shape

(103687, 19639)

### Model with `n_components=15` and `n_oversamples=20`

In [None]:
# Define model components
U, sigma, Vt = randomized_svd(X, n_components=15,
                              n_oversamples=20, random_state=42)

# Construct a diagonal matrix in SVD
sigma = np.diag(sigma)

# Predicted rating
reviewers_predRating = np.dot(np.dot(U, sigma), Vt)

# Create a dataframe of the predicted ratings
ratingPred = pd.DataFrame(reviewers_predRating)

In [None]:
print('\nEvaluate the SciPy SVD Collaborative recommender model')
rmse_df = pd.concat([ratingsMat.mean(), ratingPred.mean()], axis=1)
rmse_df.columns = ['Avg_actual_rating', 'Avg_predicted_rating']
rmse_df['item_index'] = np.arange(0, rmse_df.shape[0], 1)

RMSE = round((((rmse_df.Avg_actual_rating
                - rmse_df.Avg_predicted_rating) ** 2).mean() ** 0.5), 10)
print('\nRMSE of SciPy SVD Model = {} \n'.format(RMSE))


Evaluate the SciPy SVD Collaborative recommender model

RMSE of SciPy SVD Model = 0.0144595486 



### Model with `n_components=10` and `n_oversamples=20`

In [None]:
# Define model components
U, sigma, Vt = randomized_svd(X, n_components=10,
                              n_oversamples=20, random_state=42)

# Construct a diagonal matrix in SVD
sigma = np.diag(sigma)

# Predicted rating
reviewers_predRating = np.dot(np.dot(U, sigma), Vt)

# Create a dataframe of the predicted ratings
ratingPred = pd.DataFrame(reviewers_predRating)

In [None]:
print('\nEvaluate the SciPy SVD Collaborative recommender model')
rmse_df = pd.concat([ratingsMat.mean(), ratingPred.mean()], axis=1)
rmse_df.columns = ['Avg_actual_rating', 'Avg_predicted_rating']
rmse_df['item_index'] = np.arange(0, rmse_df.shape[0], 1)

RMSE = round((((rmse_df.Avg_actual_rating
                - rmse_df.Avg_predicted_rating) ** 2).mean() ** 0.5), 10)
print('\nRMSE of SciPy SVD Model = {} \n'.format(RMSE))


Evaluate the SciPy SVD Collaborative recommender model

RMSE of SciPy SVD Model = 0.0144523715 



<a id="decompose-matrix"></a>
<font size="4">**Decomposing the Matrix**

`TruncatedSVD` from `sklearn` was used to compress the transposed matrix to the same number of rows with a smaller, varying number of columns. The `items` are in the rows, while the `users` are compressed down to `n_components` columns, providing a generalized view of the users' interests for this set of items. </font>

In [None]:
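# Named `truncated_svd` so the `SVD` class imported from Surprise is not shadowed
truncated_svd = TruncatedSVD(n_components=12, random_state=seed_value)
result_matrix = truncated_svd.fit_transform(X)
result_matrix.shape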

(103687, 12)

In [None]:
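# Named `truncated_svd` so the `SVD` class imported from Surprise is not shadowed
truncated_svd = TruncatedSVD(n_components=50, random_state=seed_value)
result_matrix = truncated_svd.fit_transform(X)
result_matrix.shape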

(103687, 50)

<a id="gen-corr-matrix"></a>
<font size="4">**Generating a Correlation Matrix**

Pearson's r was calculated for every item pair in `result_matrix` based on similarities between users' interests. `numpy.memmap` was used because the full item-by-item correlation matrix does not fit in memory. The row means were subtracted from the input data, each row was normalized to unit length, and the correlation matrix was then filled in chunks of 1000 rows.</font>




In [None]:
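SPLITROWS = 1000
numrows = result_matrix.shape[0]

# Centre each row (item) on its own mean
result_matrix -= np.mean(result_matrix, axis=1)[:,None]

# Scale each row to unit length so that row dot products are Pearson correlations
result_matrix /= np.sqrt(np.sum(result_matrix * result_matrix, axis=1))[:,None]

# Disk-backed array for the full item-by-item correlation matrix
corr_matrix = np.memmap('/mydata.dat', 'float64', mode='w+',
                        shape=(numrows, numrows))

# Fill the matrix block by block; slices past the end are truncated automatically
for r in range(0, numrows, SPLITROWS):
    for c in range(0, numrows, SPLITROWS):
        r1 = r + SPLITROWS
        c1 = c + SPLITROWS
        chunk1 = result_matrix[r:r1]
        chunk2 = result_matrix[c:c1]
        corr_matrix[r:r1, c:c1] = np.dot(chunk1, chunk2.T)
corr_matrix.shape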

(103687, 103687)
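
A small check can confirm that the chunked dot products reproduce an ordinary Pearson correlation matrix. The sketch below repeats the same centre, normalize and block-fill steps on a random toy matrix and compares the result with `np.corrcoef`. The data, the chunk size of 3 and the in-memory output array are assumptions made to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(42)
features = rng.normal(size=(7, 4))   # hypothetical 7 items x 4 components
chunk = 3                            # small stand-in for SPLITROWS

# Centre each row and scale it to unit length; row dot products are then
# Pearson correlations
z = features - features.mean(axis=1, keepdims=True)
z /= np.sqrt((z * z).sum(axis=1, keepdims=True))

n = z.shape[0]
corr = np.empty((n, n))  # an np.memmap could be used here for large n
for r in range(0, n, chunk):
    for c in range(0, n, chunk):
        # Slices past the end are truncated, so the last block may be smaller
        corr[r:r + chunk, c:c + chunk] = z[r:r + chunk] @ z[c:c + chunk].T

# The block-wise result should agree with NumPy's direct computation
assert np.allclose(corr, np.corrcoef(features))
```

At full scale the output has 103687 × 103687 `float64` entries, roughly 86 GB, which is why the result is written to disk rather than held in memory.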

<a id="isolate"></a>
<font size="4">**Extract the most popular item from the Correlation Matrix**

The most popular item is `item_id=8`. The correlation values between `item_id=8` and all other items are then extracted from the matrix. </font>

In [None]:
item_names = ratingsMat.columns
item_list = list(item_names)
item_list

popular_item = item_list.index(8)
print('index of the popular item: ', popular_item)

corr_popular_item = corr_matrix[popular_item]

index of the popular item:  8


<a id="recommend"></a>
<font size="4">**Recommend Highly Correlated Items**

Now filter for the items most highly correlated with the target item by applying the threshold conditions shown below. </font>

#### 12 components

In [None]:
# Construct a list of items whose correlation with the target is greater than 0.9
a = list(item_names[(corr_popular_item < 1.0) & (corr_popular_item > 0.90)])
print('Number of items > 0.9 correlated with target:', len(a))

# Construct a list of items whose correlation with the target is greater than 0.95
b = list(item_names[(corr_popular_item < 1.0) & (corr_popular_item > 0.95)])
print('Number of items > 0.95 correlated with target:', len(b))
print('Items > 0.95 correlated with target:', b)

Number of items > 0.9 correlated with target: 87
Number of items > 0.95 correlated with target: 7
Items > 0.95 correlated with target: [17, 11062, 26414, 29303, 34289, 40162, 54404]


#### 50 components

In [None]:
# Construct a list of items whose correlation with the target is greater than 0.9
a = list(item_names[(corr_popular_item < 1.0) & (corr_popular_item > 0.90)])
print('Number of items > 0.9 correlated with target:', len(a))

# Construct a list of items whose correlation with the target is greater than 0.8
b = list(item_names[(corr_popular_item < 1.0) & (corr_popular_item > 0.80)])
print('Number of items > 0.8 correlated with target:', len(b))

# Construct a list of items whose correlation with the target is greater than 0.7
c = list(item_names[(corr_popular_item < 1.0) & (corr_popular_item > 0.70)])
print('Number of items > 0.7 correlated with target:', len(c))
print('Items > 0.80 correlated with target:', b)

Number of items > 0.9 correlated with target: 0
Number of items > 0.8 correlated with target: 7
Number of items > 0.7 correlated with target: 67
Items > 0.80 correlated with target: [37141, 37478, 48025, 51623, 66472, 70981, 89506]
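
The threshold that yields seven items differs between the two runs (0.95 with 12 components, 0.80 with 50), so a fixed cut-off does not transfer well across models. One alternative is to pick the top N items by correlation. The sketch below shows this with a hypothetical helper and made-up correlation values.

```python
import heapq


def top_n_correlated(item_ids, correlations, target_id, n=7):
    """Return the n items most correlated with the target, excluding itself."""
    candidates = [(corr, item) for item, corr in zip(item_ids, correlations)
                  if item != target_id]
    return [item for corr, item in heapq.nlargest(n, candidates)]


# Hypothetical item ids and their correlations with item 8
item_ids = [8, 17, 23, 42, 57]
correlations = [1.0, 0.96, 0.41, 0.88, -0.10]
print(top_n_correlated(item_ids, correlations, target_id=8, n=2))
```

Excluding the target by its id rather than by the condition `corr < 1.0` also keeps any other item whose correlation happens to be exactly 1.0.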


<font size="4">Recommend the items with the highest predicted rating: select and sort the reviewer's actual and predicted ratings, concatenate them side by side, and order the result by predicted rating</font>

In [None]:
def recommend_items(reviewerID, ratingsMat, ratingPred, num_recommendations):
    # reviewerID is treated as a 1-based row position, not as an index label
    reviewer_idx = reviewerID - 1

    sorted_reviewer_rating = ratingsMat.iloc[reviewer_idx].sort_values(ascending=False)
    sorted_reviewer_predictions = ratingPred.iloc[reviewer_idx].sort_values(ascending=False)

    tmp = pd.concat([sorted_reviewer_rating, sorted_reviewer_predictions],
                     axis=1)
    tmp.index.name = 'Recommended Items'
    tmp.columns = ['reviewer_rating', 'reviewer_predictions']
    tmp = tmp.sort_values('reviewer_predictions', ascending=False)

    print('\nBelow are the recommended items for reviewer(reviewer_id = {}):\n'.format(reviewerID))
    print(tmp.head(num_recommendations))

In [None]:
reviewerID = 1
num_recommendations = 10
recommend_items(reviewerID, ratingsMat, ratingPred, num_recommendations)

reviewerID = 40
num_recommendations = 10
recommend_items(reviewerID, ratingsMat, ratingPred, num_recommendations)

reviewerID = 300
num_recommendations = 10
recommend_items(reviewerID, ratingsMat, ratingPred, num_recommendations)


Below are the recommended items for reviewer(reviewer_id = 1):

                   reviewer_rating  reviewer_predictions
Recommended Items                                       
9                              5.0              5.717030
34                             5.0              3.476725
46                             0.0              2.600543
3                              0.0              2.251135
33                             0.0              2.209918
43                             0.0              2.184715
42                             3.0              2.123471
39                             0.0              2.057666
55                             4.0              1.968689
173                            4.0              1.875312

Below are the recommended items for reviewer(reviewer_id = 40):

                   reviewer_rating  reviewer_predictions
Recommended Items                                       
0                              0.0              2.102204
13            
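
The table for reviewer 1 ranks items 9 and 34 first even though the reviewer has already rated them. A common refinement is to recommend only unrated items. The sketch below illustrates that filter with a hypothetical helper and a few values loosely modelled on the rows above; it is not part of the original analysis.

```python
def recommend_unseen(actual, predicted, n=3):
    """Rank items by predicted rating, skipping items already rated.

    actual / predicted: dicts mapping item id -> rating (0 means unrated).
    """
    unseen = {item: score for item, score in predicted.items()
              if actual.get(item, 0) == 0}
    return sorted(unseen, key=unseen.get, reverse=True)[:n]


# Values loosely modelled on the first rows of the reviewer 1 table above
actual = {9: 5.0, 34: 5.0, 46: 0.0, 3: 0.0, 33: 0.0, 42: 3.0}
predicted = {9: 5.72, 34: 3.48, 46: 2.60, 3: 2.25, 33: 2.21, 42: 2.12}
print(recommend_unseen(actual, predicted))
```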

### Popularity Model
<font size="4">For the construction of the popularity recommender model, a recommendation score was created by counting the reviewers for each unique item. The items were sorted by this score, a recommendation rank was assigned based on it, and the top five recommendations were examined. </font>

In [None]:
# Examine train/test sets for modeling
print('Dimensions of train set:', train.shape)
print('Dimensions of test set:', test.shape)

Dimensions of train set: (890561, 3)
Dimensions of test set: (222833, 3)


In [None]:
train_grouped = train.groupby('item_id').agg({'reviewer_id': 'count'}).reset_index()
train_grouped.rename(columns = {'reviewer_id': 'count'}, inplace=True)

# Sort by popularity (count) first, breaking ties by item_id
train_sort = train_grouped.sort_values(['count', 'item_id'], ascending=[False, True])
train_sort['rank'] = train_sort['count'].rank(ascending=False, method='first')

popularity_recommendations = train_sort.head()
print('\nTop 5 recommendations')
print(popularity_recommendations)


Top 5 recommendations
   item_id   count  rank
4      5.0  475124   1.0
3      4.0  202266   2.0
2      3.0  112957   3.0
1      2.0   52608   4.0
0      1.0   47606   5.0
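
The same count, sort and rank steps can be sketched without pandas. The interactions and the helper name below are hypothetical and only illustrate the ranking logic.

```python
from collections import Counter

# Hypothetical (reviewer_id, item_id) interactions
interactions = [(1, 'A'), (2, 'A'), (3, 'A'), (1, 'B'), (2, 'B'), (3, 'C')]


def popularity_ranking(interactions, top_n=5):
    """Rank items by how many distinct reviewers rated them."""
    # set() guards against a reviewer being counted twice for one item
    counts = Counter(item for _, item in set(interactions))
    # Highest count first; ties are broken by item id
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(item, count, rank)
            for rank, (item, count) in enumerate(ranked[:top_n], start=1)]


print(popularity_ranking(interactions))
```

The ranking depends only on the item counts, so every reviewer receives the same list.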


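<font size="4">Predictions were then generated for several reviewers with a function that adds `reviewer_id` as the first column of the recommendations. The same items were returned for all three reviewers, because a popularity model is not personalized. This is not a robust method for a recommendation system because many variables are unaccounted for, such as age, gender, location and time, to name a few. Note also that the `item_id` values shown above (1.0 to 5.0) and their counts, which sum to the train set size of 890561, suggest that the grouped column holds rating values rather than item identifiers, so the column assignment of `train` is worth checking.</font>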

In [None]:
def recommend(reviewer_id):

    # Work on a copy so the shared popularity_recommendations frame is not mutated
    reviewer_recommendations = popularity_recommendations.copy()
    reviewer_recommendations['reviewer_id'] = reviewer_id
    cols = reviewer_recommendations.columns.tolist()
    cols = cols[-1:] + cols[:-1]
    reviewer_recommendations = reviewer_recommendations[cols]

    return reviewer_recommendations

find_recom = [1,100,200]
for i in find_recom:
    print('The list of recommendations for the reviewer_id: %d\n' %(i))
    print(recommend(i))

The list of recommendations for the reviewer_id: 1

   reviewer_id  item_id   count  rank
4            1      5.0  475124   1.0
3            1      4.0  202266   2.0
2            1      3.0  112957   3.0
1            1      2.0   52608   4.0
0            1      1.0   47606   5.0
The list of recommendations for the reviewer_id: 100

   reviewer_id  item_id   count  rank
4          100      5.0  475124   1.0
3          100      4.0  202266   2.0
2          100      3.0  112957   3.0
1          100      2.0   52608   4.0
0          100      1.0   47606   5.0
The list of recommendations for the reviewer_id: 200

   reviewer_id  item_id   count  rank
4          200      5.0  475124   1.0
3          200      4.0  202266   2.0
2          200      3.0  112957   3.0
1          200      2.0   52608   4.0
0          200      1.0   47606   5.0
