<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Data" data-toc-modified-id="Import-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Data</a></span></li><li><span><a href="#Simple-Recommender-(Pittsburgh-Data)" data-toc-modified-id="Simple-Recommender-(Pittsburgh-Data)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Simple Recommender (Pittsburgh Data)</a></span><ul class="toc-item"><li><span><a href="#Scikit-Learn-Surprise-SVD" data-toc-modified-id="Scikit-Learn-Surprise-SVD-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Scikit-Learn Surprise SVD</a></span><ul class="toc-item"><li><span><a href="#Using-GridSearchCV" data-toc-modified-id="Using-GridSearchCV-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Using GridSearchCV</a></span></li></ul></li><li><span><a href="#Recommendation-using-means" data-toc-modified-id="Recommendation-using-means-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Recommendation using means</a></span></li><li><span><a href="#Using-similarities-in-Categories" data-toc-modified-id="Using-similarities-in-Categories-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Using similarities in Categories</a></span></li><li><span><a href="#User-profiles-using-TF-IDF" data-toc-modified-id="User-profiles-using-TF-IDF-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>User profiles using TF-IDF</a></span><ul class="toc-item"><li><span><a href="#Gensim-TF-IDF" data-toc-modified-id="Gensim-TF-IDF-2.4.1"><span class="toc-item-num">2.4.1&nbsp;&nbsp;</span>Gensim TF-IDF</a></span></li></ul></li><li><span><a href="#User-profiles-using-LDA-Topics" data-toc-modified-id="User-profiles-using-LDA-Topics-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>User profiles using LDA Topics</a></span></li><li><span><a href="#Predicting-using-similarities" data-toc-modified-id="Predicting-using-similarities-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Predicting using similarities</a></span></li><li><span><a href="#Matrix-Factorization-Collaborative-Based-Filtering" data-toc-modified-id="Matrix-Factorization-Collaborative-Based-Filtering-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Matrix Factorization Collaborative Based Filtering</a></span></li><li><span><a href="#Restaurant-profiles-using-TF-IDF" data-toc-modified-id="Restaurant-profiles-using-TF-IDF-2.8"><span class="toc-item-num">2.8&nbsp;&nbsp;</span>Restaurant profiles using TF-IDF</a></span></li></ul></li></ul></div>

# Import Data

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style(style='whitegrid')
import matplotlib.pyplot as plt

In [3]:
#sets the default options for viewing pandas dataframes
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 200)
#pd.set_option('display.width', 100)
pd.set_option('display.max_info_columns', 50)

In [4]:
path = '/Users/dmitriykats/Documents/SpringBoard/Springboard/Capstone2/true_review/data/'

In [5]:
pit_power = pd.read_csv(f'{path}/interim/pit_data.csv', parse_dates=['date'])

In [6]:
pit_power = pit_power.drop(columns='Unnamed: 0')

In [7]:
print(f'Number of Users: {pit_power.user_id.unique().shape[0]}')
print(f'Number of Restaurants: {pit_power.business_id.unique().shape[0]}')
print(f'Number of Reviews: {pit_power.text.unique().shape[0]}')

Number of Users: 1887
Number of Restaurants: 1789
Number of Reviews: 19363


The dataset includes only Non Fast-food restaurants in Pittsburgh. It has also been pre-processed to only include users with more than 200 reviews and with at least 1 friend.

# Simple Recommender (Pittsburgh Data)

In [8]:
n_users = pit_power.user_id.unique().shape[0]
n_rests = pit_power.business_id.unique().shape[0]

In [9]:
pit_power_encoded = pit_power.copy()

In [10]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
pit_power_encoded['user_id'] = le.fit_transform(pit_power_encoded.user_id.values)
pit_power_encoded['business_id'] = le.fit_transform(pit_power_encoded.business_id.values)

## Scikit-Learn Surprise SVD

In [11]:
pit_power_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19363 entries, 0 to 19362
Columns: 102 entries, level_0 to string_categories
dtypes: datetime64[ns](1), float64(4), int64(88), object(9)
memory usage: 15.1+ MB


In [12]:
len(pit_power.business_id.unique())

1789

In [13]:
pit_power_encoded.head()

Unnamed: 0,level_0,index,user_id,business_id,rev_stars,date,text,useful,funny,cool,name,neighborhood,address,city,state,postal_code,latitude,longitude,bus_stars,review_count,is_open,categories,weekday,text length,year,split_categories,Bars,Pizza,American (Traditional),American (New),Sandwiches,Italian,Breakfast & Brunch,Chinese,Cafes,Burgers,Salad,Mexican,Coffee & Tea,Seafood,Diners,Event Planning & Services,Chicken Wings,Delis,Sushi Bars,Japanese,Mediterranean,Cocktail Bars,Sports Bars,Thai,Caterers,Barbeque,Pubs,Steakhouses,Asian Fusion,Desserts,Specialty Food,Vegetarian,Soup,Bakeries,Middle Eastern,Indian,Wine Bars,Food Trucks,Lounges,Greek,Vegan,Beer,Wine & Spirits,Food Delivery Services,Arts & Entertainment,Bagels,Hot Dogs,Soul Food,Juice Bars & Smoothies,Ice Cream & Frozen Yogurt,Beer Bar,Gluten-Free,Dive Bars,Latin American,French,Buffets,Gastropubs,Comfort Food,Grocery,Noodles,Korean,Tapas/Small Plates,Ethnic Food,Turkish,Hotels & Travel,Vietnamese,Venues & Event Spaces,Tapas Bars,Shopping,Music Venues,Caribbean,Imported Food,Taiwanese,Local Flavor,Tacos,string_categories
0,0,272367,168,371,3,2011-12-22,I ate here for dinner last Thursday evening wi...,0,0,0,BRGR,Shadyside,"""5997 Centre Ave""",Pittsburgh,PA,15206.0,40.459915,-79.925664,3.5,401,1,Restaurants;Burgers;American (Traditional),3,1314,2011,"['Burgers', 'American (Traditional)']",0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Burgers American (Traditional)
1,1,272368,512,972,2,2013-12-26,"Maybe we just ordered the wrong items, but the...",2,0,0,Thai Cuisine,Bloomfield,"""4627 Liberty Ave""",Pittsburgh,PA,15224.0,40.46255,-79.94976,4.0,213,1,Restaurants;Thai,3,527,2013,['Thai'],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Thai
2,2,272377,39,1126,5,2016-10-28,Craving Mexican? Check out this tasty spot in ...,6,2,3,Los Cabos Mexican Restaurant,Bloomfield,"""4108-10 Penn Ave""",Pittsburgh,PA,15224.0,40.465539,-79.954581,3.5,115,1,Restaurants;Mexican,4,1012,2016,['Mexican'],0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Mexican
3,3,272380,507,1726,1,2014-09-20,Back one more time to see if anything had chan...,1,0,0,Point Brugge Café,Point Breeze,"""401 Hastings St""",Pittsburgh,PA,15206.0,40.450042,-79.913901,4.5,514,1,Belgian;French;German;Restaurants,5,812,2014,"['Belgian', 'French', 'German']",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Belgian French German
4,4,272381,1613,1689,2,2015-06-24,"Stopped in here on a drive home from Sandusky,...",0,0,0,Primanti Bros,Downtown,"""2 S Market Sq""",Pittsburgh,PA,15222.0,40.440287,-80.002585,3.5,604,1,Sandwiches;American (New);Nightlife;Restaurant...,2,776,2015,"['Sandwiches', 'American (New)', 'Bars', 'Spor...",1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Sandwiches American (New) Bars Sports Bars Chi...


In [53]:
from surprise import Reader, Dataset
from surprise import SVD, accuracy, KNNBasic, BaselineOnly, KNNBaseline, SVDpp, CoClustering
from surprise.model_selection import train_test_split
from surprise import accuracy
reader = Reader()
data = Dataset.load_from_df(pit_power_encoded[['user_id', 'business_id', 'rev_stars']], reader)
#split the data int train/test
trainset, testset = train_test_split(data, test_size=0.1)
#use SVD to train the model
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x123b8c278>

In [54]:
predictions[0:5]

[Prediction(uid=688, iid=558, r_ui=2.0, est=3.4086073972297046, details={'was_impossible': False}),
 Prediction(uid=1617, iid=1507, r_ui=4.0, est=3.7262710891770916, details={'was_impossible': False}),
 Prediction(uid=567, iid=831, r_ui=4.0, est=4.150625699994919, details={'was_impossible': False}),
 Prediction(uid=528, iid=782, r_ui=3.0, est=4.503245734135035, details={'was_impossible': False}),
 Prediction(uid=631, iid=1217, r_ui=3.0, est=3.290894156370758, details={'was_impossible': False})]

In [55]:
predictions = algo.test(testset)
accuracy.rmse(predictions)

RMSE: 0.8956


0.8956356305670693

### Using GridSearchCV

In [56]:
from surprise.model_selection import GridSearchCV

data = Dataset.load_from_df(pit_power_encoded[['user_id', 'business_id', 'rev_stars']], reader)
param_grid = {'n_epochs': [20, 30], 'lr_all': [0.009, 0.011],
              'reg_all': [0.2, 0.25]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5)
gs.fit(data)
# best RMSE score
print(gs.best_score['rmse'])
# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.9042683972621649
{'n_epochs': 30, 'lr_all': 0.011, 'reg_all': 0.2}


## Recommendation using means

In [262]:
#simple recommender using collaborative mean (collaborative mean should be very close to Yelp Rating)
def collab_mean(user_id, business_id):
    #make sure not to consider input user
    user_condition = phx.user_id != user_id
    #index into all ratings for the business_id
    rest_condition = phx.business_id == business_id
    
    ratings_by_others = phx.loc[user_condition & rest_condition]
    if ratings_by_others.empty:
        return 3.0
    else: 
        return ratings_by_others.rev_stars.mean()
    
    
#test on single user / restaurant combination   
user = 'sSxVSRgH1nXTijHdeApynw'
rest = 'prdA1r8XP03oD-PYvZJ5AA'     
print(f"Predicted Rating: {collab_mean(user, rest)}")
print(f"User Actual Rating: {(phx[(phx.user_id == user) & (phx.business_id == rest)]).rev_stars.iloc[0]}")
print(f"Yelp Rating: {phx[phx.business_id == rest].bus_stars.iloc[0]}")


Predicted Rating: 2.917525773195876
User Actual Rating: 1
Yelp Rating: 3.0


In [261]:
#simple recommender using content mean (content mean should be same as user's average rating)
def content_mean(user_id, business_id):
    user_condition = phx.user_id == user_id
    return phx.loc[user_condition, 'rev_stars'].mean()


#test on single user / restaurant combination
user = 'sSxVSRgH1nXTijHdeApynw'
rest = 'prdA1r8XP03oD-PYvZJ5AA' 
print(f"Predicted Rating: {content_mean(user, rest)}")
print(f"User Actual Rating: {(phx[(phx.user_id == user) & (phx.business_id == rest)]).rev_stars.iloc[0]}")
print(f"Yelp Rating: {phx[phx.business_id == rest].bus_stars.iloc[0]}")


Predicted Rating: 2.5
User Actual Rating: 1
Yelp Rating: 3.0


## Using similarities in Categories

In [308]:
#separate the categories into a single string so it can be vectorized
pit_power['string_categories'] = pit_power.split_categories.apply(lambda x: " ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [450]:
pit_power.head(10)

Unnamed: 0,level_0,index,user_id,business_id,rev_stars,date,text,useful,funny,cool,name,neighborhood,address,city,state,postal_code,latitude,longitude,bus_stars,review_count,is_open,categories,weekday,text length,year,split_categories,Bars,Pizza,American (Traditional),American (New),Sandwiches,Italian,Breakfast & Brunch,Chinese,Cafes,Burgers,Salad,Mexican,Coffee & Tea,Seafood,Diners,Event Planning & Services,Chicken Wings,Delis,Sushi Bars,Japanese,Mediterranean,Cocktail Bars,Sports Bars,Thai,Caterers,Barbeque,Pubs,Steakhouses,Asian Fusion,Desserts,Specialty Food,Vegetarian,Soup,Bakeries,Middle Eastern,Indian,Wine Bars,Food Trucks,Lounges,Greek,Vegan,Beer,Wine & Spirits,Food Delivery Services,Arts & Entertainment,Bagels,Hot Dogs,Soul Food,Juice Bars & Smoothies,Ice Cream & Frozen Yogurt,Beer Bar,Gluten-Free,Dive Bars,Latin American,French,Buffets,Gastropubs,Comfort Food,Grocery,Noodles,Korean,Tapas/Small Plates,Ethnic Food,Turkish,Hotels & Travel,Vietnamese,Venues & Event Spaces,Tapas Bars,Shopping,Music Venues,Caribbean,Imported Food,Taiwanese,Local Flavor,Tacos,string_categories
0,0,272367,4m9NXICYBC5i9t4aTt-I6w,CFtZH4Skp9z3o4ToSywI4w,3,2011-12-22,I ate here for dinner last Thursday evening wi...,0,0,0,BRGR,Shadyside,"""5997 Centre Ave""",Pittsburgh,PA,15206,40.459915,-79.925664,3.5,401,1,Restaurants;Burgers;American (Traditional),3,1314,2011,"[Burgers, American (Traditional)]",0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Burgers American (Traditional)
1,1,272368,H7pj7sbXY3N-WSEwa-JfpA,YzqV61exMv__mjobBV2g7g,2,2013-12-26,"Maybe we just ordered the wrong items, but the...",2,0,0,Thai Cuisine,Bloomfield,"""4627 Liberty Ave""",Pittsburgh,PA,15224,40.46255,-79.94976,4.0,213,1,Restaurants;Thai,3,527,2013,[Thai],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Thai
2,2,272377,0EhPIlDozxGKpbbHRr6vZg,dAcaMSkrpYUZocqrLvOzRQ,5,2016-10-28,Craving Mexican? Check out this tasty spot in ...,6,2,3,Los Cabos Mexican Restaurant,Bloomfield,"""4108-10 Penn Ave""",Pittsburgh,PA,15224,40.465539,-79.954581,3.5,115,1,Restaurants;Mexican,4,1012,2016,[Mexican],0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Mexican
3,3,272380,GsFr15zUG8wO9fsSHwvZUw,xcmmTXhuMx2fZF2Bt69F4w,1,2014-09-20,Back one more time to see if anything had chan...,1,0,0,Point Brugge Café,Point Breeze,"""401 Hastings St""",Pittsburgh,PA,15206,40.450042,-79.913901,4.5,514,1,Belgian;French;German;Restaurants,5,812,2014,"[Belgian, French, German]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Belgian French German
4,4,272381,r4_XcrRt08sADOdIT5ex3A,w_UCGMgok7N9p0XdYBx1VQ,2,2015-06-24,"Stopped in here on a drive home from Sandusky,...",0,0,0,Primanti Bros,Downtown,"""2 S Market Sq""",Pittsburgh,PA,15222,40.440287,-80.002585,3.5,604,1,Sandwiches;American (New);Nightlife;Restaurant...,2,776,2015,"[Sandwiches, American (New), Bars, Sports Bars...",1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Sandwiches American (New) Bars Sports Bars Chi...
5,5,272421,H7pj7sbXY3N-WSEwa-JfpA,tIjoGxi2r6QvREucvQwZpA,3,2015-07-06,Decent bbq--you can tell they're legit from th...,2,1,0,The Dream BBQ,Homewood,"""7600 N Braddock Ave""",Pittsburgh,PA,15208,40.453038,-79.891173,4.5,26,1,Barbeque;Restaurants;Chicken Wings;Soul Food,0,628,2015,"[Barbeque, Chicken Wings, Soul Food]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Barbeque Chicken Wings Soul Food
6,6,272423,AJ6oDR8G5eVbOLMvlZpKeQ,YF3MsPifKWOMZ75TBFE13A,3,2014-04-16,I can't believe how big this place is. We went...,1,0,0,Don Pablo's,,"""140 Andrew Dr""",Pittsburgh,PA,15275,40.448227,-80.177578,2.5,34,0,Restaurants;Tex-Mex,2,816,2014,[Tex-Mex],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Tex-Mex
7,7,272437,DHe3NxeJJjwXdeMMFMpD_w,gHngt6zpP683GKe1i23LUg,3,2012-07-19,Beautiful view of the river and the skyline. I...,0,0,0,Grand Concourse,South Side,"""100 W Station Square Dr""",Pittsburgh,PA,15219,40.433077,-80.003773,3.5,328,1,Seafood;Restaurants;Breakfast & Brunch;America...,3,455,2012,"[Seafood, Breakfast & Brunch, American (Tradit...",0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Seafood Breakfast & Brunch American (Traditional)
8,8,272452,fmzIm7RxEdii5Jz44PtO7g,zZ7KDK3GAkBUZzsaqB1A4Q,4,2017-09-25,I returned here to try out their menu having t...,14,6,11,honeygrow,East Liberty,"""105 S Highland Ave""",Pittsburgh,PA,15206,40.461076,-79.924726,4.0,38,1,Restaurants;Noodles;Salad;American (Traditional),0,3020,2017,"[Noodles, Salad, American (Traditional)]",0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Noodles Salad American (Traditional)
9,9,272480,2otXDEiKDqUbox4BS6uK0g,Xt6xo-UmJhYtHiscwRnw0A,3,2017-02-09,Southern Tier is actually a nice addition to t...,20,10,14,Southern Tier Brewing,North Side,"""316 N Shore Dr""",Pittsburgh,PA,15212,40.446114,-80.010342,4.0,119,1,Food;Pubs;Nightlife;Bars;Restaurants;Gastropub...,3,1454,2017,"[Pubs, Bars, Gastropubs, Breweries]",1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Pubs Bars Gastropubs Breweries


In [335]:
pit_power['name'] = pit_power.name.astype(str)

In [328]:
pit_power['name'] = pit_power.name.apply(lambda x:x.strip('\"'))

In [311]:
pit_power = pit_power.reset_index()

In [441]:
rest_cats = pit_power.drop_duplicates('name')

In [442]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
rest_cats['string_categories'] = rest_cats['string_categories'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(rest_cats['string_categories'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


(1591, 294)

In [445]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [446]:
#Construct a reverse map of indices and restaurant names
indices = pd.Series(rest_cats.index, index=rest_cats['name']).drop_duplicates()

#indices = pd.Series(pit_power.name, index=pit_power.index).drop_duplicates()

In [447]:
indices['BRGR']

0

In [448]:
# Function that takes in restaurant name as input and outputs most similar restaurants
def get_recommendations(name, cosine_sim=cosine_sim):
    # Get the index of the restaurant that matches the name
    idx = indices[name]
    print(idx)
    # Get the pairwsie similarity scores of all restaurants with that restaurant
    sim_scores = list(enumerate(cosine_sim[idx]))
    #print(sim_scores)
    # Sort the restaurants based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    #print(sim_scores)
    # Get the scores of the 20 most similar restaurants
    sim_scores = sim_scores[1:21]
    print(sim_scores)
    # Get the restaurants indices
    rest_indices = [i[0] for i in sim_scores]
    #print(f'rest index{rest_indices}')
    # Return the top 20 most similar restaurants
    return pit_power['name'].iloc[rest_indices]

In [451]:
get_recommendations("Southern Tier Brewing")

9
[(257, 0.7462210281050788), (287, 0.6924505358772142), (233, 0.6383979237844243), (669, 0.5806453663813673), (185, 0.5384413538603344), (95, 0.5246973964358125), (377, 0.5246973964358125), (1513, 0.5246973964358125), (386, 0.5186518705946366), (94, 0.5123948887795814), (150, 0.48494923484686314), (77, 0.4690120088087303), (244, 0.46483022644594313), (405, 0.4558158317071028), (1175, 0.4342722394226629), (65, 0.4297766586771205), (552, 0.4283839306387778), (776, 0.4283839306387778), (1137, 0.4283839306387778), (1374, 0.4283839306387778)]


257        Cracker Barrel Old Country Store
287                            Hello Bistro
233                       Bella Donne Pizza
669                   Original Hot Dog Shop
185                          Primanti Bros.
95                              Mediterrano
377                    P&G's Pamela's Diner
1513              Knossos Gyros & Sis-Kabob
386                 The Dor-Stop Restaurant
94                          Meat & Potatoes
150     Quiet Storm Vegetarian & Vegan Cafe
77                       D's Six Pax & Dogz
244                        Aladdin's Eatery
405                              China Star
1175                       Carhops Sub Shop
65                                Burgatory
552                           Revel + Roost
776                         Royal Caribbean
1137                           Il Pizzaiolo
1374                            Piper's Pub
Name: name, dtype: object

## User profiles using TF-IDF

In [179]:
pit_power.groupby('user_id').size().sort_values(ascending=False)[0:10]

user_id
rCWrxuRC8_pfagpchtHp6A    438
4wp4XI9AxKNqJima-xahlg    361
Lfv4hefW1VbvaC2gatTFWA    327
6Ki3bAL0wx9ymbdJqbSWMA    322
4m9NXICYBC5i9t4aTt-I6w    240
135DbbQnr3BEkQbBzZ9T1A    240
8AwcaBJjiMpQ__FPxktwwQ    235
d0D7L-vfQDIADolnPAcb9A    211
2jKzO_01d12oiu-2bOYcYg    206
H7pj7sbXY3N-WSEwa-JfpA    190
dtype: int64

In [180]:
user_profile = pit_power[pit_power.user_id == 'rCWrxuRC8_pfagpchtHp6A']

In [181]:
user_profile.head()

Unnamed: 0,user_id,business_id,rev_stars,date,text,useful,funny,cool,name,neighborhood,address,city,state,postal_code,latitude,longitude,bus_stars,review_count,is_open,categories,weekday,text length,year,split_categories,Bars,Pizza,American (Traditional),American (New),Sandwiches,Italian,Breakfast & Brunch,Chinese,Cafes,Burgers,Salad,Mexican,Coffee & Tea,Seafood,Diners,Event Planning & Services,Chicken Wings,Delis,Sushi Bars,Japanese,Mediterranean,Cocktail Bars,Sports Bars,Thai,Caterers,Barbeque,Pubs,Steakhouses,Asian Fusion,Desserts,Specialty Food,Vegetarian,Soup,Bakeries,Middle Eastern,Indian,Wine Bars,Food Trucks,Lounges,Greek,Vegan,Beer,Wine & Spirits,Food Delivery Services,Arts & Entertainment,Bagels,Hot Dogs,Soul Food,Juice Bars & Smoothies,Ice Cream & Frozen Yogurt,Beer Bar,Gluten-Free,Dive Bars,Latin American,French,Buffets,Gastropubs,Comfort Food,Grocery,Noodles,Korean,Tapas/Small Plates,Ethnic Food,Turkish,Hotels & Travel,Vietnamese,Venues & Event Spaces,Tapas Bars,Shopping,Music Venues,Caribbean,Imported Food,Taiwanese,Local Flavor,Tacos
274413,rCWrxuRC8_pfagpchtHp6A,Yf74t_bR1mhRXY03IU6OhA,4,2011-07-27,Note: Getaway Cafe also serves breakfast from ...,10,8,9,"""The Getaway Cafe""",,"""3049 Sussex Ave""",Pittsburgh,PA,15226,40.384579,-80.015711,3.0,46,1,Restaurants;Bars;Nightlife;American (New);Barb...,2,4174,2011,"[Bars, American (New), Barbeque, American (Tra...",1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
275133,rCWrxuRC8_pfagpchtHp6A,wNnPxpAOCBk2N1KTJ2PDCw,2,2010-08-04,I'm with Bourdain: Down with cupcakes. They're...,12,13,7,"""Dozen Bake Shop""",Lawrenceville,"""3511 Butler St""",Pittsburgh,PA,15201,40.464398,-79.966333,3.0,63,0,Bakeries;Food;Restaurants;Cafes,2,4537,2010,"[Bakeries, Cafes]",0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
275134,rCWrxuRC8_pfagpchtHp6A,v3rXLmTCX6ZFR6kIYTY2fg,5,2012-05-29,Note: Istanbul Grille does accept credit cards...,13,13,12,"""Istanbul Grille""",Downtown,"""673 Liberty Ave""",Pittsburgh,PA,15222,40.442362,-80.000202,4.5,48,0,Turkish;Middle Eastern;Restaurants,1,1636,2012,"[Turkish, Middle Eastern]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
275528,rCWrxuRC8_pfagpchtHp6A,C5zOHpwyA-snVGppkbUpQg,4,2012-10-07,"This month, EnP is offering up their Buffalo C...",17,13,15,"""Eat'n Park Restaurant""",,"""5100 Clairton Blvd""",Pittsburgh,PA,15236,40.345859,-79.973588,3.5,21,1,American (Traditional);Restaurants;Diners;Brea...,6,669,2012,"[American (Traditional), Diners, Breakfast & B...",0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
276469,rCWrxuRC8_pfagpchtHp6A,xUsLyzWQId6nXwNDVq08qQ,5,2012-12-27,One of Yelp's competitor's is showing that Lot...,16,4,18,"""Lotus Garden""",,"""3911 Saw Mill Run Blvd""",Pittsburgh,PA,15227,40.368626,-79.983,3.5,5,0,Chinese;Restaurants,3,2048,2012,[Chinese],0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [183]:
user_profile.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 438 entries, 274413 to 3043026
Columns: 99 entries, user_id to Tacos
dtypes: datetime64[ns](1), float64(3), int64(84), object(11)
memory usage: 342.2+ KB


### Gensim TF-IDF

In [202]:
from gensim.utils import simple_preprocess
from gensim.models.tfidfmodel import TfidfModel
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim.corpora.dictionary import Dictionary
from collections import defaultdict
import itertools

wordnet_lemmatizer = WordNetLemmatizer()

In [203]:
english_stops = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

In [204]:
def custom_preprocess(input_string):
    '''
    This function will take a string as an input (in this case an individual review)
    and return a pre-processed list of tokens based on below processing methods
    '''
    doc_words = word_tokenize(input_string) #tokenize words
    lower_tokens = [t.lower() for t in doc_words] #let's convert to lowercase 
    alpha_only = [t for t in lower_tokens if t.isalpha()] #keep only alphabetical characters
    no_stops = [t for t in alpha_only if t not in english_stops] #and let's remove all the stop words
    lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops] # Lemmatize all tokens into a new list: lemmatized
    
    return lemmatized

In [245]:
#
documents = list(user_profile.text)

# Create the Dictionary and Corpus
mydict = Dictionary([custom_preprocess(review) for review in documents])
corpus = [mydict.doc2bow(custom_preprocess(review)) for review in documents]

# Create the TF-IDF model
tfidf = TfidfModel(corpus, smartirs='ntc')

# Show the TF-IDF weights
tfidf_weights = []
for doc in tfidf[corpus]:

    weight=[[mydict[id], np.around(freq, decimals=2)] for id, freq in doc]
    
    tfidf_weights.append(weight)

In [246]:
from operator import itemgetter

In [247]:
print('Top 20 freq words in first review: ')
sorted(tfidf_weights[1], key=itemgetter(1), reverse=True)[0:20]

Top 20 freq words in first review: 


[['dozen', 0.42],
 ['gob', 0.21],
 ['k', 0.19],
 ['blondies', 0.15],
 ['circle', 0.15],
 ['icing', 0.13],
 ['quiche', 0.12],
 ['cupcake', 0.11],
 ['cheaper', 0.1],
 ['na', 0.1],
 ['bad', 0.08],
 ['block', 0.08],
 ['credit', 0.08],
 ['sell', 0.08],
 ['affluent', 0.07],
 ['bakery', 0.07],
 ['broadsword', 0.07],
 ['confectionary', 0.07],
 ['dreadlocked', 0.07],
 ['endorsing', 0.07]]

## User profiles using LDA Topics

In [236]:
from gensim.models import LdaModel, LdaMulticore

In [237]:
#These are all the reviews by one user
all_documents = list(user_profile.text)

In [238]:
# Create the Dictionary and Corpus
all_mydict = Dictionary([custom_preprocess(review) for review in all_documents])
all_corpus = [all_mydict.doc2bow(custom_preprocess(review)) for review in all_documents]

In [239]:
all_lda_model = LdaMulticore(corpus=all_corpus,
                            id2word=all_mydict,
                            random_state=42,
                            num_topics=10,
                            passes=100,
                            chunksize=10,
                            batch=False,
                            alpha='asymmetric',
                            decay=0.5,
                            offset=64,
                            eta=None,
                            eval_every=0,
                            iterations=100,
                            gamma_threshold=0.001,
                            per_word_topics=True)

In [241]:
for c in all_lda_model[corpus[0:5]]:
    print("Document Topics      : ", c[0])      # [(Topics, Perc Contrib)]
    print("Word id, Topics      : ", c[1][:3])  # [(Word id, [Topics])]
    print("Phi Values (word id) : ", c[2][:2])  # [(Word id, [(Topic, Phi Value)])]
    print("Word, Topics         : ", [(all_mydict[wd], topic) for wd, topic in c[1][:2]])   # [(Word, [Topics])]
    print("Phi Values (word)    : ", [(all_mydict[wd], topic) for wd, topic in c[2][:2]])  # [(Word, [(Topic, Phi Value)])]
    print("------------------------------------------------------\n")

Document Topics      :  [(2, 0.92914164), (3, 0.0688784)]
Word id, Topics      :  [(0, [2]), (1, [2]), (2, [2])]
Phi Values (word id) :  [(0, [(2, 0.99960935)]), (1, [(2, 0.99947727)])]
Word, Topics         :  [('accompanied', [2]), ('act', [2])]
Phi Values (word)    :  [('accompanied', [(2, 0.99960935)]), ('act', [(2, 0.99947727)])]
------------------------------------------------------

Document Topics      :  [(0, 0.0356632), (2, 0.96267104)]
Word id, Topics      :  [(7, [2]), (9, [2, 0]), (17, [2])]
Phi Values (word id) :  [(7, [(2, 2.999814)]), (9, [(0, 0.017408386), (2, 1.982507)])]
Word, Topics         :  [('almost', [2]), ('also', [2, 0])]
Phi Values (word)    :  [('almost', [(2, 2.999814)]), ('also', [(0, 0.017408386), (2, 1.982507)])]
------------------------------------------------------

Document Topics      :  [(0, 0.124043696), (2, 0.87140894)]
Word id, Topics      :  [(9, [2, 0]), (17, [2]), (39, [2])]
Phi Values (word id) :  [(9, [(0, 0.03296257), (2, 0.9669916)]), (17,

In [242]:
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

In [243]:
pyLDAvis.enable_notebook()

In [244]:
vis = pyLDAvis.gensim.prepare(all_lda_model, all_corpus, all_mydict)
vis

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


## Predicting using similarities

In [161]:
pit_power_encoded.head()

Unnamed: 0,user_id,business_id,rev_stars,date,text,useful,funny,cool,name,neighborhood,address,city,state,postal_code,latitude,longitude,bus_stars,review_count,is_open,categories,weekday,text length,year,split_categories,Bars,Pizza,American (Traditional),American (New),Sandwiches,Italian,Breakfast & Brunch,Chinese,Cafes,Burgers,Salad,Mexican,Coffee & Tea,Seafood,Diners,Event Planning & Services,Chicken Wings,Delis,Sushi Bars,Japanese,Mediterranean,Cocktail Bars,Sports Bars,Thai,Caterers,Barbeque,Pubs,Steakhouses,Asian Fusion,Desserts,Specialty Food,Vegetarian,Soup,Bakeries,Middle Eastern,Indian,Wine Bars,Food Trucks,Lounges,Greek,Vegan,Beer,Wine & Spirits,Food Delivery Services,Arts & Entertainment,Bagels,Hot Dogs,Soul Food,Juice Bars & Smoothies,Ice Cream & Frozen Yogurt,Beer Bar,Gluten-Free,Dive Bars,Latin American,French,Buffets,Gastropubs,Comfort Food,Grocery,Noodles,Korean,Tapas/Small Plates,Ethnic Food,Turkish,Hotels & Travel,Vietnamese,Venues & Event Spaces,Tapas Bars,Shopping,Music Venues,Caribbean,Imported Food,Taiwanese,Local Flavor,Tacos
272367,168,371,3,2011-12-22,I ate here for dinner last Thursday evening wi...,0,0,0,"""BRGR""",Shadyside,"""5997 Centre Ave""",Pittsburgh,PA,15206,40.459915,-79.925664,3.5,401,1,Restaurants;Burgers;American (Traditional),3,1314,2011,"[Burgers, American (Traditional)]",0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
272368,512,972,2,2013-12-26,"Maybe we just ordered the wrong items, but the...",2,0,0,"""Thai Cuisine""",Bloomfield,"""4627 Liberty Ave""",Pittsburgh,PA,15224,40.46255,-79.94976,4.0,213,1,Restaurants;Thai,3,527,2013,[Thai],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
272377,39,1126,5,2016-10-28,Craving Mexican? Check out this tasty spot in ...,6,2,3,"""Los Cabos Mexican Restaurant""",Bloomfield,"""4108-10 Penn Ave""",Pittsburgh,PA,15224,40.465539,-79.954581,3.5,115,1,Restaurants;Mexican,4,1012,2016,[Mexican],0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
272380,507,1726,1,2014-09-20,Back one more time to see if anything had chan...,1,0,0,"""Point Brugge Café""",Point Breeze,"""401 Hastings St""",Pittsburgh,PA,15206,40.450042,-79.913901,4.5,514,1,Belgian;French;German;Restaurants,5,812,2014,"[Belgian, French, German]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
272381,1613,1689,2,2015-06-24,"Stopped in here on a drive home from Sandusky,...",0,0,0,"""Primanti Bros""",Downtown,"""2 S Market Sq""",Pittsburgh,PA,15222,40.440287,-80.002585,3.5,604,1,Sandwiches;American (New);Nightlife;Restaurant...,2,776,2015,"[Sandwiches, American (New), Bars, Sports Bars...",1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [162]:
data_matrix = np.zeros((n_users, n_rests))
for line in pit_power_encoded.itertuples():
    data_matrix[line[1]-1, line[2]-1] = line[3]

In [163]:
data_matrix.shape

(1887, 1789)

In [65]:
import scipy

In [66]:
from sklearn.metrics.pairwise import pairwise_distances 

In [67]:
data_matrix = scipy.sparse.csr_matrix(data_matrix)

In [68]:
user_similarity = pairwise_distances(data_matrix, metric='cosine')

In [69]:
item_similarity = pairwise_distances(data_matrix.T, metric='cosine')

In [110]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        #We use np.newaxis so that mean_user_rating has same format as ratings
        ratings_diff = (ratings - mean_user_rating)
        pred = mean_user_rating + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred

In [111]:
user_prediction = predict(data_matrix, user_similarity, type='user')
item_prediction = predict(data_matrix, item_similarity, type='item')

In [112]:
user_prediction

matrix([[-0.00604296,  0.00102874, -0.00916227, ..., -0.00478679,
         -0.00932404,  0.01346877],
        [-0.01271642, -0.00591573, -0.01576146, ..., -0.01164287,
         -0.01576146,  0.00644187],
        [-0.0106107 , -0.00339377, -0.01372728, ..., -0.00931164,
         -0.0137308 ,  0.0088357 ],
        ...,
        [-0.01134016, -0.00465068, -0.01461257, ..., -0.01059934,
         -0.01461257,  0.00713665],
        [-0.00240888,  0.00393899, -0.00557697, ..., -0.00136883,
         -0.00568935,  0.01596604],
        [-0.00941093, -0.00232537, -0.01258957, ..., -0.00832027,
         -0.01258957,  0.00984359]])

In [18]:
item_prediction

array([[0.0049565 , 0.00483171, 0.00484942, ..., 0.00493934, 0.00484975,
        0.00484793],
       [0.00499293, 0.0050227 , 0.0050071 , ..., 0.00492922, 0.00494886,
        0.00496885],
       [0.0038854 , 0.00389904, 0.00378498, ..., 0.00382682, 0.00395909,
        0.00394747],
       ...,
       [0.00049929, 0.00050227, 0.00050071, ..., 0.00049585, 0.00047998,
        0.00049863],
       [0.0029849 , 0.00299372, 0.00293935, ..., 0.00292053, 0.00296591,
        0.00298507],
       [0.00499293, 0.00497066, 0.00490997, ..., 0.00480152, 0.0049274 ,
        0.00498633]])

## Matrix Factorization Collaborative Based Filtering

In [74]:
class MF():

    # Initializing the user-restaurant rating matrix, no. of latent features, alpha and beta.
    def __init__(self, R, K, alpha, beta, iterations):
        self.R = R
        self.num_users, self.num_items = R.shape
        self.K = K
        self.alpha = alpha
        self.beta = beta
        self.iterations = iterations

    # Initializing user-feature and restaurant-feature matrix 
    def train(self):
        self.P = np.random.normal(scale=1./self.K, size=(self.num_users, self.K))
        self.Q = np.random.normal(scale=1./self.K, size=(self.num_items, self.K))

        # Initializing the bias terms
        self.b_u = np.zeros(self.num_users)
        self.b_i = np.zeros(self.num_items)
        self.b = np.mean(self.R[np.where(self.R != 0)])

        # List of training samples
        self.samples = [
        (i, j, self.R[i, j])
        for i in range(self.num_users)
        for j in range(self.num_items)
        if self.R[i, j] > 0
        ]

        # Stochastic gradient descent for given number of iterations
        training_process = []
        for i in range(self.iterations):
            np.random.shuffle(self.samples)
            self.sgd()
            mse = self.mse()
            training_process.append((i, mse))
        if (i+1) % 20 == 0:
            print("Iteration: %d ; error = %.4f" % (i+1, mse))

        return training_process

    # Computing total mean squared error
    def mse(self):
        xs, ys = self.R.nonzero()
        predicted = self.full_matrix()
        error = 0
        for x, y in zip(xs, ys):
            error += pow(self.R[x, y] - predicted[x, y], 2)
        return np.sqrt(error)

    # Stochastic gradient descent to get optimized P and Q matrix
    def sgd(self):
        for i, j, r in self.samples:
            prediction = self.get_rating(i, j)
            e = (r - prediction)

            self.b_u[i] += self.alpha * (e - self.beta * self.b_u[i])
            self.b_i[j] += self.alpha * (e - self.beta * self.b_i[j])

            self.P[i, :] += self.alpha * (e * self.Q[j, :] - self.beta * self.P[i,:])
            self.Q[j, :] += self.alpha * (e * self.P[i, :] - self.beta * self.Q[j,:])

    # Ratings for user i and restaurant j
    def get_rating(self, i, j):
        prediction = self.b + self.b_u[i] + self.b_i[j] + self.P[i, :].dot(self.Q[j, :].T)
        return prediction

    # Full user-movie rating matrix
    def full_matrix(self):
        return mf.b + mf.b_u[:,np.newaxis] + mf.b_i[np.newaxis:,] + mf.P.dot(mf.Q.T)

In [75]:
#Create the user / restaurant rating matrix
R = np.array(pit_power_encoded.pivot(index = 'user_id', columns ='business_id', values = 'rev_stars').fillna(0))

In [76]:
R.shape

(1887, 1789)

In [124]:
mf = MF(R, K=50, alpha=0.01, beta=0.00001, iterations=1000)
training_process = mf.train()
print()
print("P x Q:")
print(mf.full_matrix())
print()

Iteration: 1000 ; error = 0.0088

P x Q:
[[4.18215769 3.18249064 3.80229509 ... 3.74964235 3.7263592  4.28531456]
 [3.44621349 2.32826794 3.03937789 ... 2.63952314 2.83340661 3.35642304]
 [3.36636043 2.4454021  3.0009473  ... 2.96320365 2.86025801 3.40255338]
 ...
 [3.63406323 2.72558749 3.33670497 ... 3.27822393 3.16007859 3.67614625]
 [4.0402026  3.07945207 3.68072841 ... 3.68033708 3.49283406 4.03824131]
 [3.94264862 3.02769672 3.94501595 ... 3.37498266 3.23562816 3.81027696]]



In [95]:
mf.full_matrix().shape

(1887, 1789)

In [126]:
mf.get_rating(16, 45)

2.0000677137317826

    R – The user-movie rating matrix
    K – Number of latent features
    alpha – Learning rate for stochastic gradient descent
    beta – Regularization parameter for bias
    iterations – Number of iterations to perform stochastic gradient descent

## Restaurant profiles using TF-IDF