# Model Building

In this notebook we will be building a few different models for our recommendation system. We will look into content-based, collaborative filtering, and a combined model (content-based collaborative filtering). We will spilt our data into training and testing sets. 

**The steps are as follows:** 
1. Import Train and Test data
2. 

#### Import libraries/modules below:

In [1]:
import pickle
import re
import pandas as pd
import surprise
from surprise import KNNWithMeans
from surprise import SVD
from surprise import SVDpp
from surprise import NMF
from surprise import NormalPredictor
from surprise import KNNBaseline
from surprise import KNNBasic
from surprise import KNNWithZScore
from surprise import BaselineOnly
from surprise import CoClustering
from surprise import SlopeOne
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise.reader import Reader
from surprise.model_selection import KFold
from surprise.model_selection import GridSearchCV
from Mod_5_functions import pickle_file,open_pickle

#### Import the data:
- Save it as a DataFrame. 
- For our __baseline Collaborative-Filtering Model__, lets create a brand new __DataFrame__ with *only* the **userUrl, rev_company_name, rev_company_url, and the star rating.**

In [34]:
user_reviews_train = open_pickle('Data/train_data')
user_reviews_test = open_pickle('Data/test_data')

In [3]:
bus_reviews_df

Unnamed: 0,rev_company_name,rev_comp_url,company_loc,rev_comp_rating,rev_comp_reviews,userUrl,comapny_source
0,Planet Fitness - Manhattan - Canal St - NY,https://www.yelp.com/biz/planet-fitness-manhat...,"370 Canal St New York, NY 10013",3.0,"Planet Fitness is an affordable, no frills gym...",https://www.yelp.com/user_details?userid=exPhu...,Peloton
1,Montauk Salt Cave,https://www.yelp.com/biz/montauk-salt-cave-new...,"90 E 10th St New York, NY 10003",2.0,I purchased a Groupon for a friend and I. When...,https://www.yelp.com/user_details?userid=exPhu...,Peloton
2,Pure Barre - New York Columbus Circle - 60th &...,https://www.yelp.com/biz/pure-barre-new-york-c...,"1841 Broadway New York, NY 11023",3.0,"I enjoyed my class, but this was one of my lea...",https://www.yelp.com/user_details?userid=exPhu...,Peloton
3,Return To Life Center - Pilates and Functional...,https://www.yelp.com/biz/return-to-life-center...,"19 W 45th St New York, NY 10036",4.0,I came in for their Pilates Mat Fundamental cl...,https://www.yelp.com/user_details?userid=exPhu...,Peloton
4,Peloton,https://www.yelp.com/biz/peloton-new-york,"140 W 23rd St New York, NY 10011",4.0,I came in for my first Peloton class awhile ba...,https://www.yelp.com/user_details?userid=exPhu...,Peloton
5,Sonic Yoga,https://www.yelp.com/biz/sonic-yoga-new-york,"944 8th Ave New York, NY 10019",4.0,"I found Sonic on Class Pass, it was a donation...",https://www.yelp.com/user_details?userid=exPhu...,Peloton
6,Daya Yoga Studio,https://www.yelp.com/biz/daya-yoga-studio-bush...,"360 Jefferson St Bushwick, NY 11237",5.0,This is what I think of when I think of a ster...,https://www.yelp.com/user_details?userid=exPhu...,Peloton
7,Simply Fit Astoria,https://www.yelp.com/biz/simply-fit-astoria-as...,"37-20 Astoria Blvd Astoria, NY 11103",5.0,Jesus! I signed up for their Burn the Barre cl...,https://www.yelp.com/user_details?userid=exPhu...,Peloton
8,Exhale Upper East Side,https://www.yelp.com/biz/exhale-upper-east-sid...,"980 Madison Ave New York, NY 10075",5.0,"Wow, this place is really gorgeous. I came for...",https://www.yelp.com/user_details?userid=exPhu...,Peloton
9,Physique 57,https://www.yelp.com/biz/physique-57-new-york-2,"180 6th Ave New York, NY 10013",5.0,I think this was my favorite Barre class yet! ...,https://www.yelp.com/user_details?userid=exPhu...,Peloton


In [8]:
cf_rec_start_df = bus_reviews_df[['userUrl','rev_company_name','rev_comp_rating']]
cf_rec_user_start_df = user_reviews_df[['userUrl','rev_company_name','rev_comp_rating']]

#### To make a userId out of the Users' urls we will use regex to remove everything before the 'userid=':

In [9]:
s = 'userid\=(.*)'
cf_rec_start_df['user_id'] = cf_rec_start_df.userUrl.apply(lambda url: re.search(s, url).group(1))
cf_rec_user_start_df['user_id'] = cf_rec_user_start_df.userUrl.apply(lambda url: re.search(s, url).group(1))\

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [10]:
reader = Reader(rating_scale=(1, 5))
data_1 = Dataset.load_from_df(cf_rec_start_df[['user_id', 'rev_company_name','rev_comp_rating']], reader)
data_2 = Dataset.load_from_df(cf_rec_user_start_df[['user_id', 'rev_company_name','rev_comp_rating']], reader)

In [8]:
# define a cross-validation iterator
kf = KFold(n_splits=4)

algo = SlopeOne()

RMSE = []
for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    RMSE.append(accuracy.rmse(predictions))
    accuracy.rmse(predictions, verbose=True)
print(f'Average RSME: {sum(RMSE)/4}')

RMSE: 1.3918
RMSE: 1.3918
RMSE: 1.4134
RMSE: 1.4134
RMSE: 1.3921
RMSE: 1.3921
RMSE: 1.4226
RMSE: 1.4226
Average RSME: 1.4049471617782279


Now, that we have run a very basic SlopeOne model and recieved an averge Root Mean Squared Error (RMSE) of 1.41. This basiclly means that on averge we were almost a star and a half rating off. We can now try different combinations of the parameters to improve our model.

In [14]:
param_grid = {'n_epochs': [5, 10],'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(S, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.1923823868939172
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}


In [15]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV


param_grid = {'n_epochs': [5, 10,20], 'lr_all': [0.002, 0.005,0.006,0.1],
              'reg_all': [0.4, 0.6,0.2]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.1702228906310412
{'n_epochs': 20, 'lr_all': 0.006, 'reg_all': 0.2}


In [17]:


benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = tmp.append(pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm']))
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse') 

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


Unnamed: 0_level_0,test_rmse,fit_time,test_time
Algorithm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
SVDpp,1.167599,0.982693,0.064427
SVD,1.169148,0.398672,0.019652
BaselineOnly,1.179409,0.011756,0.014861
KNNBaseline,1.246702,0.187133,0.095442
KNNBasic,1.312642,0.17197,0.097058
CoClustering,1.338433,0.35475,0.019541
KNNWithMeans,1.372905,0.205766,0.093389
KNNWithZScore,1.393994,0.224914,0.096295
NMF,1.415371,0.487227,0.019075
SlopeOne,1.41569,0.045959,0.027149


In [19]:
results_df = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse') 

In [22]:
results_df['test_rmse'].sort_values(ascending=True)

Algorithm
SVDpp              1.167599
SVD                1.169148
BaselineOnly       1.179409
KNNBaseline        1.246702
KNNBasic           1.312642
CoClustering       1.338433
KNNWithMeans       1.372905
KNNWithZScore      1.393994
NMF                1.415371
SlopeOne           1.415690
NormalPredictor    1.688078
Name: test_rmse, dtype: float64

Based on the RSME the __SVDpp, SVD and BaselineOnly__ models performed the best. So, next we will use a grid search to determine the best paarameters for each.

#### SVDpp:

In [27]:
param_grid = {'n_epochs': [10,20,30,35,37], 'lr_all': [0.005,0.006,0.0001],
              'reg_all': [0.4, 0.6,0.2,0.1]}
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.1475706648976265
{'n_epochs': 35, 'lr_all': 0.006, 'reg_all': 0.2}


In [28]:
param_grid = {'n_epochs': [35,37], 'lr_all': [0.005,0.006,0.0001],
              'reg_all': [0.4, 0.6,0.3,0.2,0.1]}
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.148258447023538
{'n_epochs': 37, 'lr_all': 0.006, 'reg_all': 0.2}


In [38]:
param_grid = {'n_epochs': [35,37], 'lr_all': [0.006,0.007,0.008],
              'reg_all': [0.02,0.2,0.25]}
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.1454172362858182
{'n_epochs': 35, 'lr_all': 0.007, 'reg_all': 0.2}


In [40]:
param_grid = {'n_epochs': [35,37], 'lr_all': [0.006,0.007,0.008],
              'reg_all': [0.02,0.2,0.25]}
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.1489518678888262
{'n_epochs': 37, 'lr_all': 0.007, 'reg_all': 0.2}


__SVD:__

In [42]:
param_grid = {'n_epochs': [35,37], 'lr_all': [0.006,0.007,0.008],
              'reg_all': [0.02,0.2,0.25]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.146241602056299
{'n_epochs': 37, 'lr_all': 0.008, 'reg_all': 0.2}


**BaselineOnly**

We will break this into the two separate methods of how our baseline is estimated. The two methods are Alternating Least Squares (ALS), the default, and Stochastic Gradient Descent (SGD). The reason we want to do this is beacause the hyperparameters that are default for each method are very different.

_First with ALS:_

In [66]:
param_grid = {'bsl_options': {'method': ['als'],
                              'reg_i': [2,5,7,10],
                             'reg_u':[2,5,7,15],
                             'n_epocha':[2,5, 7,10]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=5)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimati

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimati

After a few iterations the best parameters were: 
- RMSE: 1.1496574591737345 -- {'method': 'als', 'reg_i': 5, 'reg_u': 5, 'n_epocha': 2}
- RMSE: 1.1462057653758768 -- {'method': 'als', 'reg_i': 3, 'reg_u': 3, 'n_epocha': 2}
- RMSE: 1.1440607458981593 -- {'method': 'als', 'reg_i': 3, 'reg_u': 4, 'n_epocha': 2}



*Now, with SGD:*

In [74]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [0.01,0.02,0.1],
                             'learning_rate':[0.0005,0.005,0.05],
                             'n_epocha':[5,10,20]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=5)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimati

In [75]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [0.01,0.001, 0.02,0.1],
                             'learning_rate':[0.0005,0.005,0.0055,0.05],
                             'n_epocha':[5,7,10]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=5)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimati

In [76]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [0.01, 0.02,0.03],
                             'learning_rate':[0.0005,0.005,0.0055,0.0065],
                             'n_epocha':[5,7,10]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=5)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimati

In [78]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [0.02,0.03,0.04],
                             'learning_rate':[0.0055,0.0065,0.01,0.02],
                             'n_epocha':[5,7,10]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=5)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimati

In [89]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [0.06,0.07,0.08],
                             'learning_rate':[0.095, 0.01,0.015],
                             'n_epocha':[3,4,5,6]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=10)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimati

Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimati

In [None]:
1.1483207108165303
{'bsl_options': {'method': 'sgd', 'reg': 0.06, 'learning_rate': 0.01, 'n_epocha': 3}}

1.1481766687458612
{'bsl_options': {'method': 'sgd', 'reg': 0.07, 'learning_rate': 0.01, 'n_epocha': 3}}

In [90]:
param_grid = {'bsl_options': {'method': ['sgd'],
                              'reg': [0.02,0.03,0.04],
                             'learning_rate':[0.0055,0.0065,0.01,0.02],
                             'n_epocha':[5,7,10]}}
gs = GridSearchCV(BaselineOnly, param_grid, cv=10)
# ['n_epochs': [5,10,15], 'reg_u': [10,15,20],'reg_i': [5,10,15],
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimati

Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimati

In [None]:
1.142727250064263
{'bsl_options': {'method': 'sgd', 'reg': 0.08, 'learning_rate': 0.01, 'n_epocha': 3}}

In [16]:
# define a cross-validation iterator
kf = KFold(n_splits=4)

algo = BaselineOnly(bsl_options = {'method': 'sgd', 'reg': 0.04, 'learning_rate': 0.01, 'n_epocha': 5})

RMSE = []
for trainset, testset in kf.split(data_2):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    RMSE.append(accuracy.rmse(predictions))
    accuracy.rmse(predictions, verbose=True)
print(f'Average RSME: {sum(RMSE)/4}')

Estimating biases using sgd...
RMSE: 1.1714
RMSE: 1.1714
Estimating biases using sgd...
RMSE: 1.1496
RMSE: 1.1496
Estimating biases using sgd...
RMSE: 1.1493
RMSE: 1.1493
Estimating biases using sgd...
RMSE: 1.1987
RMSE: 1.1987
Average RSME: 1.1672278350387413


In [15]:
# define a cross-validation iterator
kf = KFold(n_splits=4)

algo = BaselineOnly(bsl_options = {'method': 'sgd', 'reg': 0.08, 'learning_rate': 0.01, 'n_epocha': 3})

RMSE = []
for trainset, testset in kf.split(data_1):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    RMSE.append(accuracy.rmse(predictions))
    accuracy.rmse(predictions, verbose=True)
print(f'Average RSME: {sum(RMSE)/4}')

Estimating biases using sgd...
RMSE: 1.1288
RMSE: 1.1288
Estimating biases using sgd...
RMSE: 1.1448
RMSE: 1.1448
Estimating biases using sgd...
RMSE: 1.1964
RMSE: 1.1964
Estimating biases using sgd...
RMSE: 1.2002
RMSE: 1.2002
Average RSME: 1.1675393216037895


In [21]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data_1, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = tmp.append(pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm']))
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse') 

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


Unnamed: 0_level_0,test_rmse,fit_time,test_time
Algorithm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
SVDpp,1.183018,1.356832,0.087181
SVD,1.192734,0.513061,0.05005
BaselineOnly,1.197564,0.024373,0.025542
KNNBaseline,1.261663,0.298065,0.14357
CoClustering,1.359149,0.472371,0.021606
KNNBasic,1.359378,0.266807,0.119924
KNNWithMeans,1.420546,0.263939,0.128129
KNNWithZScore,1.426252,0.380006,0.136302
NMF,1.447343,0.687014,0.024335
SlopeOne,1.452812,0.061289,0.037372


In [22]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data_2, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = tmp.append(pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm']))
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse') 

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


Unnamed: 0_level_0,test_rmse,fit_time,test_time
Algorithm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
SVDpp,1.188804,1.40114,0.083129
SVD,1.192334,0.487798,0.04039
BaselineOnly,1.19752,0.021505,0.024233
KNNBaseline,1.260038,0.316333,0.134988
KNNBasic,1.357664,0.285997,0.122714
CoClustering,1.36318,0.482785,0.023675
KNNWithMeans,1.41217,0.36758,0.149964
KNNWithZScore,1.422411,0.436025,0.15547
SlopeOne,1.432825,0.06554,0.033384
NMF,1.451937,0.75706,0.038958


TF-IDF Cosine Similarity Recommedation:

In [2]:
tfidf_matrix = open_pickle('Data/tfidf_reviews_matrix')
reviews_df = open_pickle('Data/reviews_w_new_sentiment')

In [50]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim_2 = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [51]:
sim_scores_2[89],sim_scores[6]

IndexError: list index out of range

In [10]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(reviews_df.index, index=reviews_df['rev_company_name']).drop_duplicates()

In [47]:
from scipy.spatial.distance import minkowski
idx = indices['Peloton']
sim_scores_2 = list(enumerate(cosine_sim_2[idx]))
sim_scores_2 = sorted(sim_scores, key=lambda x: sum(x[1]), reverse=True)

In [48]:
sim_scores_2 = sim_scores[1:11]

In [49]:
fit_indices = [i[0] for i in sim_scores_2]
reviews_df['rev_company_name'].iloc[fit_indices]

72                   LA Fitness
91            TITLE Boxing Club
52                  WOOM CENTER
42               CorePower Yoga
80          Equinox Bryant Park
22    New York Pilates - CLOSED
69                    Solidcore
5                    Sonic Yoga
49                SKY TING YOGA
Name: rev_company_name, dtype: object

In [13]:
# Function that takes in movie title as input and outputs most similar movies
from scipy.spatial.distance import minkowski
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: sum(x[1]), reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    fit_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return reviews_df['rev_company_name'].iloc[fit_indices]

In [149]:
# from scipy.stats import rankdata
# from numpy.random import rand
# from scipy.sparse import coo_matrix
from scipy.sparse import csr_matrix

cosine_df = pd.DataFrame(csr_matrix.todok(cosine_similarities))

KeyboardInterrupt: 

In [124]:
cosine_similarities[0].indece

AttributeError: indece not found

In [94]:
# # import pandas as pd
# import numpy as np
# from numpy import argsort
# # from sklearn.feature_extraction.text import TfidfVectorizer
# # from sklearn.metrics.pairwise import cosine_similarity
# # reviews_df['rev_company_name'] #names of companies

from itertools import izip

def sort_coo(m):
    tuples = izip(m.row, m.col, m.data)
    return sorted(tuples, key=lambda x: (x[0], x[2]))

# cosine_similarities = cosine_similarity(tfidf_matrix,Y=None,dense_output=False)
results = {} # dictionary created to store the result in a dictionary format (ID : (Score,item_id))#
for idx, row in reviews_df.iterrows(): #iterates through all the rows

# the below code 'similar_indice' stores similar ids based on cosine similarity. sorts them in ascending order. [:-5:-1] is then used so that the indices with most similarity are got. 0 means no similarity and 1 means perfect similarity#
    similar_indices = sort_coo(cosine_similarities[idx])

    #stores 5 most similar books, you can change it as per your needs
    similar_items = [(cosine_similarities[idx][i], reviews_df['rev_company_name'][i]) for i in similar_indices]
    results[row['rev_company_name']] = similar_items[1:]
    
#below code 'function item(id)' returns a row matching the id along with Book Title. Initially it is a dataframe, then we convert it to a list#
def item(name):
    return reviews_df.loc[reviews_df['rev_company_name'] == name]['rev_company_name'].tolist()[0]
def recommend(name, num):
    if (num == 0):
        print("Unable to recommend any book as you have not chosen the number of book to be recommended")
    elif (num==1):
        print("Recommending " + str(num) + " book similar to " + item(name))
        
    else :
        print("Recommending " + str(num) + " books similar to " + item(name))
        
    print("----------------------------------------------------------")
    recs = results[name][:num]
    for rec in recs:
        print("You may also like to read: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

#the first argument in the below function to be passed is the id of the book, second argument is the number of books you want to be recommended#
recommend('Peloton',2)

IndexError: index returns 3-dim structure

In [160]:
indices = pd.Series(reviews_df.index)

#  defining the function that takes in movie title 
# as input and returns the top 10 recommended movies
def recommendations(comp_name, cosine_sim = cosine_sim):
    
    # initializing the empty list of recommended movies
    recommended_fit = []
    
    # gettin the index of the movie that matches the title
    idx = indices[indices == comp_name.index]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)

    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_fit.append(list( reviews_df.index)[i])
        
    return recommended_fit

recommendations('Peloton', cosine_similarities)

TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

In [None]:
list(zip(cosine_similarities[0].indices,cosine_similarities[0].data)).sort(key=lambda tup: tup[1])


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarities = cosine_similarity(tfidf_matrix,Y=None,dense_output=False)

In [33]:
# results = {} 
# for idx, row in reviews_df.iterrows(): #iterates through all the rows
#     similar_zip = list(zip(cosine_similarities[0].indices,cosine_similarities[0].data)).sort(key=lambda tup: tup[1])
#     similar_indices = [i[0] for i in sorted(test,key = lambda tupe: tupe[1], reverse=True)][1:]
    
#     similar_items = [reviews_df['rev_company_name'][i] for i in similar_indices]
#     results[row['rev_company_name']] = similar_items
    

KeyboardInterrupt: 

In [None]:
def item(name):
    return reviews_df.loc[reviews_df['rev_company_name'] == name]['rev_company_name'].tolist()[0]
def recommend(name, num):
    if (num == 0):
        print("Unable to recommend any book as you have not chosen the number of book to be recommended")
    elif (num==1):
        print("Recommending " + str(num) + " book similar to " + item(name))
        
    else :
        print("Recommending " + str(num) + " books similar to " + item(name))
        
    print("----------------------------------------------------------")
    recs = results[name][:num]
    for rec in recs:
        print("You may also like to read: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")


recommend('Peloton',2)

In [8]:
here = list(zip(cosine_similarities[0].indices,cosine_similarities[0].data)).sort(key=lambda tup: tup[1])

In [14]:
list(zip(list(cosine_similarities[0].indices),list(cosine_similarities[0].data)))

[(9698, 0.009278498439621207),
 (14993, 0.008635076627653037),
 (14145, 0.0045644651284599954),
 (13492, 0.003700166578920376),
 (13055, 0.0061381977254084605),
 (12948, 0.0031281606449373423),
 (11102, 0.00177881918106044),
 (6570, 0.003557991556167144),
 (6176, 0.003429729180667422),
 (4912, 0.0025100424298534194),
 (3867, 0.005529410684540945),
 (10020, 0.04042510227107501),
 (10366, 0.023092322355937604),
 (8215, 0.015323992475893075),
 (5420, 0.010397902580397888),
 (2103, 0.010807292628480666),
 (15598, 0.034135300857713054),
 (15196, 0.015591518724695965),
 (14962, 0.009394063402076233),
 (14107, 0.021453736061494293),
 (13742, 0.004162730546881709),
 (13409, 0.0069122954801555045),
 (13346, 0.02001752663590023),
 (13314, 0.012236943067076678),
 (12670, 0.01205647073549531),
 (11948, 0.01595031779000039),
 (6487, 0.020728324590659562),
 (5801, 0.027562512677642297),
 (4307, 0.022328463387850633),
 (2732, 0.03984194059827591),
 (1658, 0.007648940369692417),
 (14267, 0.02060412197

In [15]:
test = list(zip(list(cosine_similarities[0].indices),list(cosine_similarities[0].data)))

In [23]:
[i[0] for i in sorted(test,key = lambda tupe: tupe[1], reverse=True)][1:]

[12632,
 6955,
 70,
 8860,
 14438,
 10548,
 15356,
 10114,
 9803,
 5306,
 4014,
 1197,
 3068,
 6178,
 11573,
 1794,
 6011,
 6007,
 1799,
 7030,
 14838,
 7033,
 8667,
 10502,
 14431,
 2770,
 10239,
 1203,
 8350,
 6901,
 9423,
 12331,
 10676,
 15343,
 2907,
 7934,
 2983,
 9832,
 6004,
 7925,
 10704,
 13638,
 14003,
 3899,
 11750,
 5720,
 11922,
 9432,
 1843,
 14422,
 9529,
 11485,
 7196,
 9695,
 8228,
 2982,
 8669,
 14840,
 15662,
 4441,
 113,
 544,
 10549,
 421,
 11889,
 890,
 1233,
 8668,
 8304,
 10547,
 8064,
 3427,
 11768,
 4644,
 3330,
 1814,
 10812,
 9971,
 1449,
 15200,
 1807,
 2647,
 9408,
 7032,
 10746,
 13478,
 2667,
 6781,
 6527,
 15413,
 12363,
 10648,
 190,
 11753,
 6484,
 7817,
 12033,
 4343,
 11737,
 1231,
 10838,
 3372,
 10791,
 6098,
 13445,
 7368,
 12713,
 5744,
 12387,
 15181,
 8656,
 15201,
 13222,
 7794,
 5361,
 8208,
 10013,
 9045,
 10473,
 14516,
 6526,
 3100,
 7370,
 6009,
 6588,
 2899,
 8851,
 13899,
 7283,
 125,
 9938,
 10466,
 12368,
 10052,
 5672,
 13789,
 127

In [27]:
reviews_df['rev_company_name'][12632]

'Blink Fitness - Chelsea'

In [28]:
similar_items

IndexError: index (12632) out of range