# Model Building

In this notebook we will build a few different models for our recommendation system. We will look into content-based filtering, collaborative filtering, and a combined model (content-based collaborative filtering). We will split our data into training and testing sets.

**The steps are as follows:** 
1. Import Train and Test data
2. 

#### Import libraries/modules below:

In [36]:
import pickle
import re
import pandas as pd
import surprise
from surprise import KNNWithMeans
from surprise import SVD
from surprise import SVDpp
from surprise import NMF
from surprise import NormalPredictor
from surprise import KNNBaseline
from surprise import KNNBasic
from surprise import KNNWithZScore
from surprise import BaselineOnly
from surprise import CoClustering
from surprise import SlopeOne
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise.reader import Reader
from surprise.model_selection import KFold
from Mod_5_functions import pickle_file,open_pickle

#### Import the data:
- Save it as a DataFrame. 
- For our __baseline Collaborative-Filtering Model__, let's create a brand new __DataFrame__ with *only* the **userUrl, rev_company_name, and the star rating (rev_comp_rating).**

In [34]:
user_reviews_train = open_pickle('Data/train_data')
user_reviews_test = open_pickle('Data/test_data')

In [3]:
bus_reviews_df

| | rev_company_name | rev_comp_url | company_loc | rev_comp_rating | rev_comp_reviews | userUrl | comapny_source |
|---|---|---|---|---|---|---|---|
| 0 | Planet Fitness - Manhattan - Canal St - NY | https://www.yelp.com/biz/planet-fitness-manhat... | 370 Canal St New York, NY 10013 | 3.0 | Planet Fitness is an affordable, no frills gym... | https://www.yelp.com/user_details?userid=exPhu... | Peloton |
| 1 | Montauk Salt Cave | https://www.yelp.com/biz/montauk-salt-cave-new... | 90 E 10th St New York, NY 10003 | 2.0 | I purchased a Groupon for a friend and I. When... | https://www.yelp.com/user_details?userid=exPhu... | Peloton |
| 2 | Pure Barre - New York Columbus Circle - 60th &... | https://www.yelp.com/biz/pure-barre-new-york-c... | 1841 Broadway New York, NY 11023 | 3.0 | I enjoyed my class, but this was one of my lea... | https://www.yelp.com/user_details?userid=exPhu... | Peloton |
| 3 | Return To Life Center - Pilates and Functional... | https://www.yelp.com/biz/return-to-life-center... | 19 W 45th St New York, NY 10036 | 4.0 | I came in for their Pilates Mat Fundamental cl... | https://www.yelp.com/user_details?userid=exPhu... | Peloton |
| 4 | Peloton | https://www.yelp.com/biz/peloton-new-york | 140 W 23rd St New York, NY 10011 | 4.0 | I came in for my first Peloton class awhile ba... | https://www.yelp.com/user_details?userid=exPhu... | Peloton |
| 5 | Sonic Yoga | https://www.yelp.com/biz/sonic-yoga-new-york | 944 8th Ave New York, NY 10019 | 4.0 | I found Sonic on Class Pass, it was a donation... | https://www.yelp.com/user_details?userid=exPhu... | Peloton |
| 6 | Daya Yoga Studio | https://www.yelp.com/biz/daya-yoga-studio-bush... | 360 Jefferson St Bushwick, NY 11237 | 5.0 | This is what I think of when I think of a ster... | https://www.yelp.com/user_details?userid=exPhu... | Peloton |
| 7 | Simply Fit Astoria | https://www.yelp.com/biz/simply-fit-astoria-as... | 37-20 Astoria Blvd Astoria, NY 11103 | 5.0 | Jesus! I signed up for their Burn the Barre cl... | https://www.yelp.com/user_details?userid=exPhu... | Peloton |
| 8 | Exhale Upper East Side | https://www.yelp.com/biz/exhale-upper-east-sid... | 980 Madison Ave New York, NY 10075 | 5.0 | Wow, this place is really gorgeous. I came for... | https://www.yelp.com/user_details?userid=exPhu... | Peloton |
| 9 | Physique 57 | https://www.yelp.com/biz/physique-57-new-york-2 | 180 6th Ave New York, NY 10013 | 5.0 | I think this was my favorite Barre class yet! ... | https://www.yelp.com/user_details?userid=exPhu... | Peloton |


In [8]:
cf_rec_start_df = bus_reviews_df[['userUrl','rev_company_name','rev_comp_rating']]
cf_rec_user_start_df = user_reviews_df[['userUrl','rev_company_name','rev_comp_rating']]

#### To make a user ID out of the users' URLs, we will use a regex to keep only the text after 'userid=':

In [9]:
s = r'userid=(.*)'  # raw string; captures everything after 'userid='
# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
cf_rec_start_df = cf_rec_start_df.assign(
    user_id=cf_rec_start_df.userUrl.apply(lambda url: re.search(s, url).group(1)))
cf_rec_user_start_df = cf_rec_user_start_df.assign(
    user_id=cf_rec_user_start_df.userUrl.apply(lambda url: re.search(s, url).group(1)))
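The extraction step can be tried in isolation with a small sketch. The helper name `extract_user_id` and the example URL are made up for illustration; unlike the notebook cell, this version returns `None` rather than raising when a URL has no `userid=` part.

```python
import re

# Hypothetical helper: pull the id out of a Yelp-style user URL.
USER_ID_PATTERN = re.compile(r"userid=(.*)")


def extract_user_id(url):
    """Return the text after 'userid=', or None when the URL has no such key."""
    match = USER_ID_PATTERN.search(url)
    return match.group(1) if match else None


example_url = "https://www.yelp.com/user_details?userid=abc123"  # made-up id
print(extract_user_id(example_url))
print(extract_user_id("https://www.yelp.com/biz/peloton-new-york"))
```

Returning `None` for non-matching URLs makes it easy to spot and drop malformed rows before building the Surprise dataset.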


In [10]:
reader = Reader(rating_scale=(1, 5))
data_1 = Dataset.load_from_df(cf_rec_start_df[['user_id', 'rev_company_name','rev_comp_rating']], reader)
data_2 = Dataset.load_from_df(cf_rec_user_start_df[['user_id', 'rev_company_name','rev_comp_rating']], reader)

In [8]:
# define a cross-validation iterator
kf = KFold(n_splits=4)

algo = SlopeOne()

RMSE = []
for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute, print, and store the Root Mean Squared Error for this fold
    RMSE.append(accuracy.rmse(predictions, verbose=True))
print(f'Average RMSE: {sum(RMSE)/len(RMSE)}')

RMSE: 1.3918
RMSE: 1.4134
RMSE: 1.3921
RMSE: 1.4226
Average RMSE: 1.4049471617782279
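To make the cross-validation iterator less of a black box, here is a minimal sketch of how a K-fold split partitions row indices. The function `k_fold_indices` is a hypothetical stand-in, not Surprise's implementation; in particular, Surprise's `KFold` shuffles the ratings by default, which this sketch does not do.

```python
def k_fold_indices(n_items, n_splits):
    """Yield (train_idx, test_idx) lists; fold sizes differ by at most one."""
    indices = list(range(n_items))
    base, extra = divmod(n_items, n_splits)
    # The first `extra` folds get one additional item.
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_items=10, n_splits=4))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Every row lands in exactly one test fold, so each rating is used for evaluation once and for training in the remaining folds.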


We have now run a very basic SlopeOne model and received an average Root Mean Squared Error (RMSE) of about 1.40. Roughly speaking, this means that our predictions were off by almost a star and a half on average. We can now try different combinations of parameters to improve our model.
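As a reminder of what the metric measures, here is a small worked example of RMSE on made-up (true rating, predicted rating) pairs. The function and the toy numbers are illustrative only and are not taken from the dataset.

```python
import math


def rmse(pairs):
    """Root Mean Squared Error over (true_rating, predicted_rating) pairs."""
    squared_errors = [(true - pred) ** 2 for true, pred in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy data: three reviews with made-up predictions.
toy_pairs = [(5.0, 4.0), (3.0, 4.0), (1.0, 3.0)]
print(round(rmse(toy_pairs), 4))  # squared errors 1, 1, 4 -> sqrt(6 / 3) = 1.4142
```

Because the errors are squared before averaging, the single two-star miss contributes more than the two one-star misses combined, so RMSE penalizes large errors more heavily than a plain average of absolute errors would.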

In [14]:
param_grid = {'n_epochs': [5, 10],'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(S, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.1923823868939172
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}


In [15]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV


param_grid = {'n_epochs': [5, 10,20], 'lr_all': [0.002, 0.005,0.006,0.1],
              'reg_all': [0.4, 0.6,0.2]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.1702228906310412
{'n_epochs': 20, 'lr_all': 0.006, 'reg_all': 0.2}
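It helps to see how many candidate models a grid like the one above implies. The sketch below expands the same `param_grid` into explicit combinations with `itertools.product`; it illustrates the idea of an exhaustive grid and is not Surprise's internal code.

```python
from itertools import product

param_grid = {
    "n_epochs": [5, 10, 20],
    "lr_all": [0.002, 0.005, 0.006, 0.1],
    "reg_all": [0.4, 0.6, 0.2],
}

# Every combination of one value per parameter.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))
print(combinations[0])
```

With 3 x 4 x 3 = 36 combinations and `cv=3`, the search fits 108 models, which is why widening a grid quickly becomes expensive.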


In [17]:


benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])  # Series.append was removed in pandas 2.0
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse') 

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


| Algorithm | test_rmse | fit_time | test_time |
|---|---|---|---|
| SVDpp | 1.167599 | 0.982693 | 0.064427 |
| SVD | 1.169148 | 0.398672 | 0.019652 |
| BaselineOnly | 1.179409 | 0.011756 | 0.014861 |
| KNNBaseline | 1.246702 | 0.187133 | 0.095442 |
| KNNBasic | 1.312642 | 0.17197 | 0.097058 |
| CoClustering | 1.338433 | 0.35475 | 0.019541 |
| KNNWithMeans | 1.372905 | 0.205766 | 0.093389 |
| KNNWithZScore | 1.393994 | 0.224914 | 0.096295 |
| NMF | 1.415371 | 0.487227 | 0.019075 |
| SlopeOne | 1.41569 | 0.045959 | 0.027149 |


In [19]:
results_df = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse') 

In [22]:
results_df['test_rmse'].sort_values(ascending=True)

Algorithm
SVDpp              1.167599
SVD                1.169148
BaselineOnly       1.179409
KNNBaseline        1.246702
KNNBasic           1.312642
CoClustering       1.338433
KNNWithMeans       1.372905
KNNWithZScore      1.393994
NMF                1.415371
SlopeOne           1.415690
NormalPredictor    1.688078
Name: test_rmse, dtype: float64
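The ranking above can also be reproduced without pandas. The sketch below copies the `test_rmse` values from the output into a plain dictionary (the variable names are illustrative) and picks the three lowest-error algorithms.

```python
# test_rmse values copied from the benchmark output above.
test_rmse = {
    "SVDpp": 1.167599,
    "SVD": 1.169148,
    "BaselineOnly": 1.179409,
    "KNNBaseline": 1.246702,
    "KNNBasic": 1.312642,
    "CoClustering": 1.338433,
    "KNNWithMeans": 1.372905,
    "KNNWithZScore": 1.393994,
    "NMF": 1.415371,
    "SlopeOne": 1.415690,
    "NormalPredictor": 1.688078,
}

# Lower RMSE is better, so sort ascending and keep the first three.
ranked = sorted(test_rmse, key=test_rmse.get)
top_three = ranked[:3]
print(top_three)
```

`NormalPredictor` draws ratings at random from a fitted normal distribution, so its score is a useful floor: any algorithm worth tuning should beat it clearly.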

Based on the RMSE, the __SVDpp, SVD and BaselineOnly__ models performed the best. So, next we will use a grid search to determine the best parameters for each.

#### SVDpp:

In [27]:
param_grid = {'n_epochs': [10,20,30,35,37], 'lr_all': [0.005,0.006,0.0001],
              'reg_all': [0.4, 0.6,0.2,0.1]}
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.1475706648976265
{'n_epochs': 35, 'lr_all': 0.006, 'reg_all': 0.2}


In [28]:
param_grid = {'n_epochs': [35,37], 'lr_all': [0.005,0.006,0.0001],
              'reg_all': [0.4, 0.6,0.3,0.2,0.1]}
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.148258447023538
{'n_epochs': 37, 'lr_all': 0.006, 'reg_all': 0.2}


In [38]:
param_grid = {'n_epochs': [35,37], 'lr_all': [0.006,0.007,0.008],
              'reg_all': [0.02,0.2,0.25]}
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.1454172362858182
{'n_epochs': 35, 'lr_all': 0.007, 'reg_all': 0.2}


In [40]:
param_grid = {'n_epochs': [35,37], 'lr_all': [0.006,0.007,0.008],
              'reg_all': [0.02,0.2,0.25]}
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.1489518678888262
{'n_epochs': 37, 'lr_all': 0.007, 'reg_all': 0.2}


#### SVD:

In [42]:
param_grid = {'n_epochs': [35,37], 'lr_all': [0.006,0.007,0.008],
              'reg_all': [0.02,0.2,0.25]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.146241602056299
{'n_epochs': 37, 'lr_all': 0.008, 'reg_all': 0.2}


#### BaselineOnly:

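We will break this into the two methods used to estimate the baselines: Alternating Least Squares (ALS), the default, and Stochastic Gradient Descent (SGD). We do this because the two methods take different hyperparameters with very different defaults. Note that Surprise reads the number of epochs from the `n_epochs` key of `bsl_options`; the grids below spell it `n_epocha`, so that entry is ignored and the default number of epochs is used in every run.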
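For reference, the baseline model that both methods fit can be written as below. This is the standard formulation, shown here with a single regularization weight $\lambda$ as a simplifying assumption; the ALS variant uses separate weights for users and items (`reg_u`, `reg_i`).

$$
\hat{r}_{ui} = \mu + b_u + b_i,
\qquad
\min_{b_u,\, b_i} \sum_{(u,i)} \left( r_{ui} - \mu - b_u - b_i \right)^2 + \lambda \left( \sum_u b_u^2 + \sum_i b_i^2 \right)
$$

Here $\mu$ is the global mean rating, $b_u$ is the bias of user $u$, and $b_i$ is the bias of item $i$.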

_First with ALS:_

In [66]:
param_grid = {'bsl_options': {'method': ['als'],
                              'reg_i': [2,5,7,10],
                             'reg_u':[2,5,7,15],
                             'n_epocha':[2,5, 7,10]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=5)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using als...

After a few iterations the best parameters were: 
- RMSE: 1.1496574591737345 -- {'method': 'als', 'reg_i': 5, 'reg_u': 5, 'n_epocha': 2}
- RMSE: 1.1462057653758768 -- {'method': 'als', 'reg_i': 3, 'reg_u': 3, 'n_epocha': 2}
- RMSE: 1.1440607458981593 -- {'method': 'als', 'reg_i': 3, 'reg_u': 4, 'n_epocha': 2}
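To show what `reg_i` and `reg_u` control, here is a minimal pure-Python sketch of ALS-style baseline estimation: item biases and user biases are recomputed in turn as shrunken averages of the residuals. The function, the toy ratings, and the default values are assumptions for illustration, not Surprise's code.

```python
def fit_baselines_als(ratings, n_epochs=10, reg_i=10, reg_u=15):
    """Alternate between item and user biases; larger reg shrinks them toward 0."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_user = {u: 0.0 for u, _, _ in ratings}
    b_item = {i: 0.0 for _, i, _ in ratings}
    for _ in range(n_epochs):
        for item in b_item:
            residuals = [r - mu - b_user[u] for u, i, r in ratings if i == item]
            b_item[item] = sum(residuals) / (reg_i + len(residuals))
        for user in b_user:
            residuals = [r - mu - b_item[i] for u, i, r in ratings if u == user]
            b_user[user] = sum(residuals) / (reg_u + len(residuals))
    return mu, b_user, b_item


# Made-up (user, item, rating) triples.
toy_ratings = [
    ("ann", "gym", 5), ("ann", "yoga", 4),
    ("bob", "gym", 3), ("bob", "yoga", 1),
    ("cat", "gym", 4),
]

mu, strong_user, _ = fit_baselines_als(toy_ratings, reg_i=10, reg_u=15)
_, weak_user, _ = fit_baselines_als(toy_ratings, reg_i=3, reg_u=3)
print(mu, strong_user["ann"], weak_user["ann"])
```

With few ratings per user, a large regularization term dominates the denominator and keeps the biases close to zero, which is one reason smaller values such as 3 to 5 can do better on sparse review data.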



*Now, with SGD:*

In [74]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [0.01,0.02,0.1],
                             'learning_rate':[0.0005,0.005,0.05],
                             'n_epocha':[5,10,20]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=5)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using sgd...

In [75]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [0.01,0.001, 0.02,0.1],
                             'learning_rate':[0.0005,0.005,0.0055,0.05],
                             'n_epocha':[5,7,10]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=5)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using sgd...

In [76]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [0.01, 0.02,0.03],
                             'learning_rate':[0.0005,0.005,0.0055,0.0065],
                             'n_epocha':[5,7,10]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=5)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using sgd...

In [78]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [0.02,0.03,0.04],
                             'learning_rate':[0.0055,0.0065,0.01,0.02],
                             'n_epocha':[5,7,10]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=5)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using sgd...

In [89]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [0.06,0.07,0.08],
                             'learning_rate':[0.095, 0.01,0.015],
                             'n_epocha':[3,4,5,6]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=10)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using sgd...

In [None]:
1.1483207108165303
{'bsl_options': {'method': 'sgd', 'reg': 0.06, 'learning_rate': 0.01, 'n_epocha': 3}}

1.1481766687458612
{'bsl_options': {'method': 'sgd', 'reg': 0.07, 'learning_rate': 0.01, 'n_epocha': 3}}

In [90]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [0.02,0.03,0.04],
                             'learning_rate':[0.0055,0.0065,0.01,0.02],
                             'n_epocha':[5,7,10]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=10)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using sgd...

In [None]:
1.142727250064263
{'bsl_options': {'method': 'sgd', 'reg': 0.08, 'learning_rate': 0.01, 'n_epocha': 3}}
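For intuition about the `reg` and `learning_rate` values being tuned above, here is a minimal pure-Python sketch of estimating the same baselines with stochastic gradient descent. The update rule is the usual regularized SGD step; the function, the toy ratings, and the default values are assumptions for illustration rather than Surprise's implementation.

```python
def fit_baselines_sgd(ratings, n_epochs=20, learning_rate=0.005, reg=0.02):
    """Estimate the global mean, user biases, and item biases with plain SGD."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_user = {u: 0.0 for u, _, _ in ratings}
    b_item = {i: 0.0 for _, i, _ in ratings}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_user[u] + b_item[i])
            # Move each bias along the error, pulled back toward 0 by reg.
            b_user[u] += learning_rate * (err - reg * b_user[u])
            b_item[i] += learning_rate * (err - reg * b_item[i])
    return mu, b_user, b_item


# Made-up (user, item, rating) triples.
toy_ratings = [
    ("ann", "gym", 5), ("ann", "yoga", 4),
    ("bob", "gym", 3), ("bob", "yoga", 1),
    ("cat", "gym", 4),
]

mu, b_user, b_item = fit_baselines_sgd(toy_ratings)
print(mu, b_user, b_item)
```

A user who rates above the global mean ends up with a positive bias and a harsh rater with a negative one; the learning rate sets how fast the biases move and `reg` limits how far they can drift.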

In [16]:
# define a cross-validation iterator
kf = KFold(n_splits=4)

algo = BaselineOnly(bsl_options = {'method': 'sgd', 'reg': 0.04, 'learning_rate': 0.01, 'n_epocha': 5})

RMSE = []
for trainset, testset in kf.split(data_2):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute, print, and store the Root Mean Squared Error for this fold
    RMSE.append(accuracy.rmse(predictions, verbose=True))
print(f'Average RMSE: {sum(RMSE)/len(RMSE)}')

Estimating biases using sgd...
RMSE: 1.1714
Estimating biases using sgd...
RMSE: 1.1496
Estimating biases using sgd...
RMSE: 1.1493
Estimating biases using sgd...
RMSE: 1.1987
Average RMSE: 1.1672278350387413


In [15]:
# define a cross-validation iterator
kf = KFold(n_splits=4)

algo = BaselineOnly(bsl_options = {'method': 'sgd', 'reg': 0.08, 'learning_rate': 0.01, 'n_epocha': 3})

RMSE = []
for trainset, testset in kf.split(data_1):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute, print, and store the Root Mean Squared Error for this fold
    RMSE.append(accuracy.rmse(predictions, verbose=True))
print(f'Average RMSE: {sum(RMSE)/len(RMSE)}')

Estimating biases using sgd...
RMSE: 1.1288
Estimating biases using sgd...
RMSE: 1.1448
Estimating biases using sgd...
RMSE: 1.1964
Estimating biases using sgd...
RMSE: 1.2002
Average RMSE: 1.1675393216037895


In [21]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data_1, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])  # Series.append was removed in pandas 2.0
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse') 

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


| Algorithm | test_rmse | fit_time | test_time |
|---|---|---|---|
| SVDpp | 1.183018 | 1.356832 | 0.087181 |
| SVD | 1.192734 | 0.513061 | 0.05005 |
| BaselineOnly | 1.197564 | 0.024373 | 0.025542 |
| KNNBaseline | 1.261663 | 0.298065 | 0.14357 |
| CoClustering | 1.359149 | 0.472371 | 0.021606 |
| KNNBasic | 1.359378 | 0.266807 | 0.119924 |
| KNNWithMeans | 1.420546 | 0.263939 | 0.128129 |
| KNNWithZScore | 1.426252 | 0.380006 | 0.136302 |
| NMF | 1.447343 | 0.687014 | 0.024335 |
| SlopeOne | 1.452812 | 0.061289 | 0.037372 |


In [22]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data_2, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])  # Series.append was removed in pandas 2.0
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse') 

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


Unnamed: 0_level_0,test_rmse,fit_time,test_time
Algorithm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
SVDpp,1.188804,1.40114,0.083129
SVD,1.192334,0.487798,0.04039
BaselineOnly,1.19752,0.021505,0.024233
KNNBaseline,1.260038,0.316333,0.134988
KNNBasic,1.357664,0.285997,0.122714
CoClustering,1.36318,0.482785,0.023675
KNNWithMeans,1.41217,0.36758,0.149964
KNNWithZScore,1.422411,0.436025,0.15547
SlopeOne,1.432825,0.06554,0.033384
NMF,1.451937,0.75706,0.038958
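
The `test_rmse` column is the root-mean-square error between true and estimated ratings, averaged over the cross-validation folds. As a reminder of what a single fold's value measures, here is a minimal sketch in plain Python. The `rmse` helper and the toy ratings are illustrative assumptions, not part of the notebook or of `surprise`.

```python
import math


def rmse(predictions):
    """Root-mean-square error over (true_rating, estimated_rating) pairs."""
    if not predictions:
        raise ValueError("need at least one prediction")
    squared_errors = [(true - est) ** 2 for true, est in predictions]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings, for illustration only (not taken from the review data).
toy_predictions = [(5.0, 4.0), (3.0, 3.0), (1.0, 3.0), (4.0, 4.0)]
toy_rmse = rmse(toy_predictions)
print(round(toy_rmse, 4))
```

Lower is better, and the value is in the same units as the ratings, so an RMSE of about 1.19 means predictions are typically off by a little more than one star.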


### Previously we augmented the rating by using sentiment analysis

We will now import the updated DataFrames and run our best models on the new ratings to see how they perform. The top methods from the benchmark above were SVDpp, SVD, and BaselineOnly. We start by rerunning the benchmark loop from earlier to check whether the baselines are the same:
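
The top three can be read straight off the previous benchmark table. A minimal sketch of that selection, with the `test_rmse` values copied from the cv=3 table above; the `top_algorithms` helper is a hypothetical name introduced here for illustration:

```python
# test_rmse values copied from the cv=3 benchmark table above.
benchmark_rmse = {
    "SVDpp": 1.188804,
    "SVD": 1.192334,
    "BaselineOnly": 1.19752,
    "KNNBaseline": 1.260038,
    "KNNBasic": 1.357664,
    "CoClustering": 1.36318,
    "KNNWithMeans": 1.41217,
    "KNNWithZScore": 1.422411,
    "SlopeOne": 1.432825,
    "NMF": 1.451937,
}


def top_algorithms(rmse_by_name, n=3):
    """Return the n algorithm names with the lowest RMSE, best first."""
    return sorted(rmse_by_name, key=rmse_by_name.get)[:n]


top_three = top_algorithms(benchmark_rmse)
print(top_three)
```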

In [39]:
user_reviews_train = open_pickle('Data/train_data')
user_reviews_test = open_pickle('Data/test_data')

In [47]:
cf_rec_user_start_df = user_reviews_train[['userUrl','rev_company_name','new_rating']]
reader = Reader(rating_scale=(0, 10))
data = Dataset.load_from_df(cf_rec_user_start_df , reader)

In [52]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=5, verbose=False)
    
    # Average the per-fold results, then attach the algorithm's class name.
    # pd.concat is used because Series.append was removed in pandas 2.0.
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse') 

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.

Algorithm,fit_time,test_rmse,test_time
SVDpp,2.164058,1.749014,0.064982
SVD,0.738831,1.75169,0.021339
BaselineOnly,0.078493,1.766573,0.018132
KNNBaseline,0.431993,1.868722,0.112297
KNNBasic,0.265244,2.002901,0.105674
CoClustering,0.571131,2.099096,0.014933
KNNWithMeans,0.34457,2.100195,0.082406
KNNWithZScore,0.398659,2.123475,0.115931
SlopeOne,0.11216,2.181165,0.026636
NMF,1.032596,2.283111,0.015797


In [53]:
new_df = open_pickle('Data/newest_sentiment_rating')

In [54]:
new_df.columns

Index(['comapny_source', 'company_loc', 'rev_comp_rating', 'rev_comp_reviews',
       'rev_comp_url', 'rev_company_name', 'userUrl', 'new_rating'],
      dtype='object')

In [58]:
cf_rec_user_start_df_new = new_df[['userUrl','rev_company_name','new_rating']]

In [59]:
cf_rec_user_start_df_new.new_rating.sort_values()

10640    0.0035
10747    0.0036
8387     0.0038
11193    0.0044
7671     0.0051
9034     0.0058
12040    0.0062
2791     0.0065
1481     0.0076
12601    0.0079
4754     0.0100
7041     0.0104
806      0.0108
1985     0.0110
12862    0.0110
11064    0.0113
12220    0.0120
10225    0.0122
1470     0.0138
1471     0.0140
8143     0.0154
7130     0.0160
355      0.0165
3129     0.0171
3594     0.0172
7643     0.0177
5102     0.0182
11000    0.0183
1293     0.0183
12267    0.0184
          ...  
11686    5.9989
11394    5.9989
8381     5.9989
3382     5.9989
12568    5.9990
25       5.9990
1126     5.9991
7806     5.9991
1991     5.9991
3818     5.9992
7929     5.9992
13484    5.9992
13509    5.9992
12884    5.9992
1510     5.9993
9243     5.9993
11509    5.9993
7629     5.9993
12080    5.9994
9925     5.9994
3329     5.9994
3383     5.9995
6023     5.9995
6022     5.9996
13434    5.9996
6025     5.9996
6024     5.9996
3085     5.9997
5190     5.9998
6020     5.9998
Name: new_rating, Length
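
The sorted output shows the new ratings running from just above 0 to just below 6, which is what motivates the rating scale used in the next cell. Rather than eyeballing the sorted values, the bounds could be derived from the data. The sketch below is an assumption-laden illustration: `infer_rating_scale` is a hypothetical helper, and the sample values are a handful copied from the output above.

```python
import math


def infer_rating_scale(ratings):
    """Return (floor(min), ceil(max)) as integer bounds for a rating scale."""
    ratings = list(ratings)
    if not ratings:
        raise ValueError("ratings must not be empty")
    return (math.floor(min(ratings)), math.ceil(max(ratings)))


# A few values copied from the sorted output above (smallest and largest ends).
sample_ratings = [0.0035, 0.0184, 5.9989, 5.9998]
scale = infer_rating_scale(sample_ratings)
print(scale)
```

In the notebook, the same helper could be called on `cf_rec_user_start_df_new.new_rating` and the result passed as `Reader(rating_scale=scale)`.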

In [60]:
reader = Reader(rating_scale=(0, 6))
data = Dataset.load_from_df(cf_rec_user_start_df_new, reader)

benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=5, verbose=False)
    
    # Average the per-fold results, then attach the algorithm's class name.
    # pd.concat is used because Series.append was removed in pandas 2.0.
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse') 

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.

Algorithm,fit_time,test_rmse,test_time
SVDpp,2.080849,1.50901,0.058513
SVD,0.746251,1.50912,0.024904
BaselineOnly,0.072853,1.516467,0.020585
KNNBaseline,0.497887,1.60442,0.138762
KNNBasic,0.348816,1.706469,0.201522
KNNWithMeans,0.54316,1.802333,0.108449
SlopeOne,0.112953,1.820477,0.041446
KNNWithZScore,0.583278,1.832913,0.153016
CoClustering,0.656039,1.846187,0.019009
NMF,1.147156,1.974831,0.019104
