<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Yelp-Data-Challenge---Restaurant-Recommender" data-toc-modified-id="Yelp-Data-Challenge---Restaurant-Recommender-1">Yelp Data Challenge - Restaurant Recommender</a></span><ul class="toc-item"><li><span><a href="#1.-Clean-data-and-get-rating-data" data-toc-modified-id="1.-Clean-data-and-get-rating-data-1.1">1. Clean data and get rating data</a></span></li><li><span><a href="#2.-define-and-select-active-users" data-toc-modified-id="2.-define-and-select-active-users-1.2">2. define and select active users</a></span></li><li><span><a href="#3.-colleborative-filtering-recommender" data-toc-modified-id="3.-colleborative-filtering-recommender-1.3">3. colleborative filtering recommender</a></span></li><li><span><a href="#4.-Recommend-with-Pearsons'-R-correlations" data-toc-modified-id="4.-Recommend-with-Pearsons'-R-correlations-1.4">4. Recommend with Pearsons' R correlations</a></span></li></ul></li></ul></div>

# Yelp Data Challenge - Restaurant Recommender 

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
% matplotlib inline
plt.style.use("ggplot")

In [23]:
df = pd.read_csv('mydata/last_2_years_restaurant_reviews.csv')

In [24]:
df.head()

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,type,useful,user_id
0,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-06-26,0,nCqdz-NW64KazpxqnDr0sQ,1,I mainly went for the ceasar salad prepared ta...,review,0,0XVzm4kVIAaH4eQAxWbhvw
1,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-06-29,0,iwx6s6yQxc7yjS7NFANZig,4,Nice atmosphere and wonderful service. I had t...,review,0,2aeNFntqY2QDZLADNo8iQQ
2,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-04-05,0,2HrBENXZTiitcCJfzkELgA,2,To be honest it really quit aweful. First the ...,review,0,WFhv5pMJRDPWSyLnKiWFXA
3,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2016-02-16,0,6YNPXoq41qTMZ2TEi0BYUA,2,"The food was decent, but the service was defin...",review,0,2S6gWE-K3DHNcKYYSgN7xA
4,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2016-02-08,1,4bQrVUiRZ642odcKCS0OhQ,2,If you're looking for craptastic service and m...,review,1,rCTVWx_Tws2jWi-K89iEyw


In [60]:
df_title = df[['business_id','name']]
df_title.set_index('business_id', inplace = True)
df_title.shape

(329080, 1)

In [64]:
df_title = df_title.drop_duplicates()

## 1. Clean data and get rating data 

#### Select relevant columns in the original dataframe

In [25]:
# Get business_id, user_id, stars for recommender
selected_features = ['user_id', 'business_id', 'stars']
df_sel = df[selected_features]
df_sel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 329080 entries, 0 to 329079
Data columns (total 3 columns):
user_id        329080 non-null object
business_id    329080 non-null object
stars          329080 non-null int64
dtypes: int64(1), object(2)
memory usage: 7.5+ MB


In [26]:
df_sel.head(10)

Unnamed: 0,user_id,business_id,stars
0,0XVzm4kVIAaH4eQAxWbhvw,--9e1ONYQuAa-CB_Rrw7Tw,1
1,2aeNFntqY2QDZLADNo8iQQ,--9e1ONYQuAa-CB_Rrw7Tw,4
2,WFhv5pMJRDPWSyLnKiWFXA,--9e1ONYQuAa-CB_Rrw7Tw,2
3,2S6gWE-K3DHNcKYYSgN7xA,--9e1ONYQuAa-CB_Rrw7Tw,2
4,rCTVWx_Tws2jWi-K89iEyw,--9e1ONYQuAa-CB_Rrw7Tw,2
5,TU5j2S_Ub__ojLOpD_UepQ,--9e1ONYQuAa-CB_Rrw7Tw,5
6,GQWk8vgYGlN9hp0XP0V05w,--9e1ONYQuAa-CB_Rrw7Tw,5
7,G7ISuG8XlSd4rNsEcCG2dw,--9e1ONYQuAa-CB_Rrw7Tw,5
8,OC_WdUmY2fK-c1SD4JqSsw,--9e1ONYQuAa-CB_Rrw7Tw,5
9,ymSVFNfDzSVedxOuASOHXA,--9e1ONYQuAa-CB_Rrw7Tw,4


## 2. define and select active users
#### There are many users that haven't given many reviews, exclude these users from the item-item similarity recommender

According to the following analysis, we totally have 155423 users in the data, and almost two third of them (101861 users) only wrote one review, one seventh of them (25052 users) only wrote two reviews. I decide to exclude users with only one review before building item based recommendor though it means cutting a large portion of data. I assume that users with more than one rating records are active users. 

For those who only has one review, I would recommend based on the popularity of businesses, or use content based recommendor. For example, I would recommend them popular restaurants near the one she or he rated. 

In [27]:
df_1 = df_sel.groupby('user_id', as_index = False).count()
df_1.shape
df_1.rename(columns={'stars': '# of reviews'}, inplace=True)
del df_1['business_id']

In [28]:
df_1.head()

Unnamed: 0,user_id,# of reviews
0,---1lKK3aKOuomHnwAkAow,4
1,--0sXNBv6IizZXuV-nl0Aw,1
2,--2bpE5vyR-2hAP7sZZ4lA,1
3,--2vR0DIsmQ6WfcSzKWigw,2
4,--3WaS23LcIXtxyFULJHTA,3


In [29]:
df_2 = df_1.groupby('# of reviews').count()
df_2.rename(columns = {'user_id': '# of users'})

Unnamed: 0_level_0,# of users
# of reviews,Unnamed: 1_level_1
1,101861
2,25052
3,10880
4,5509
5,3354
6,2131
7,1368
8,1018
9,754
10,541


In [30]:
cond_count = df_1['# of reviews'] > 1
df_rec = df_1[cond_count]
df_rec.head()

Unnamed: 0,user_id,# of reviews
0,---1lKK3aKOuomHnwAkAow,4
3,--2vR0DIsmQ6WfcSzKWigw,2
4,--3WaS23LcIXtxyFULJHTA,3
5,--56mD0sm1eOogphi2FFLw,2
13,--LUapetRSkZpFZ2d-MXLQ,7


In [31]:
df_rec.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53562 entries, 0 to 155419
Data columns (total 2 columns):
user_id         53562 non-null object
# of reviews    53562 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.2+ MB


In [32]:
active_users = df_rec['user_id']
active_users.head()

0     ---1lKK3aKOuomHnwAkAow
3     --2vR0DIsmQ6WfcSzKWigw
4     --3WaS23LcIXtxyFULJHTA
5     --56mD0sm1eOogphi2FFLw
13    --LUapetRSkZpFZ2d-MXLQ
Name: user_id, dtype: object

In [33]:
df_active = df_sel[df_sel['user_id'].isin(active_users)]

In [34]:
df_active.head(3)

Unnamed: 0,user_id,business_id,stars
0,0XVzm4kVIAaH4eQAxWbhvw,--9e1ONYQuAa-CB_Rrw7Tw,1
1,2aeNFntqY2QDZLADNo8iQQ,--9e1ONYQuAa-CB_Rrw7Tw,4
3,2S6gWE-K3DHNcKYYSgN7xA,--9e1ONYQuAa-CB_Rrw7Tw,2


In [35]:
df_active.shape

(227219, 3)

After selection, we will use df_active in the recommendor. We are using around two third of the original data, which sounds fine.

## 3. colleborative filtering recommender

In [36]:
#!pip install surprise

In [37]:
#!pip install seaborn

In [38]:
from surprise import Reader, Dataset, SVD, evaluate
import seaborn as sns
sns.set_style("darkgrid")
reader = Reader()

# get just top 100K rows for faster run time
data = Dataset.load_from_df(df_active, reader)
data.split(n_folds=3)

svd = SVD()
evaluate(svd, data, measures=['RMSE', 'MAE'])



Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 1.2126
MAE:  0.9675
------------
Fold 2
RMSE: 1.2165
MAE:  0.9727
------------
Fold 3
RMSE: 1.2109
MAE:  0.9646
------------
------------
Mean RMSE: 1.2134
Mean MAE : 0.9683
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'mae': [0.96745574613419671,
                             0.97272043481280557,
                             0.96464270725507884],
                            'rmse': [1.2126407476138954,
                             1.21654063799371,
                             1.2108689021887236]})

In [67]:
#pick a lucky user
lucky_user_id = '2S6gWE-K3DHNcKYYSgN7xA'

In [68]:
df_lucky

business_id
--9e1ONYQuAa-CB_Rrw7Tw    Delmonico Steakhouse
G-5kEa6E6PD5fkBRuA7k9Q                   Giada
KskYqH1Bi7Z_61pH6Om8pg           Lotus of Siam
Name: name, dtype: object

#### all restaurants reviewed by the lucky user

In [69]:
df_lucky = df[df['user_id'] == lucky_user_id][['business_id','user_id','stars']]
#df_4 = df[(df['user_id'] == 4) & (df['Rating'] == 5)]
df_lucky = df_lucky.set_index('business_id')
df_lucky = df_lucky.join(df_title)['name']
print(df_lucky)

business_id
--9e1ONYQuAa-CB_Rrw7Tw    Delmonico Steakhouse
G-5kEa6E6PD5fkBRuA7k9Q                   Giada
KskYqH1Bi7Z_61pH6Om8pg           Lotus of Siam
Name: name, dtype: object


#### let's predict what restaurants this user will love

In [71]:
user_lucky = df_title.copy()
user_lucky = user_lucky.reset_index()
#user_785314 = user_785314[~user_785314['Movie_Id'].isin(drop_movie_list)]

# getting full dataset
data = Dataset.load_from_df(df[['user_id', 'business_id', 'stars']], reader)

trainset = data.build_full_trainset()
svd.train(trainset)

user_lucky['Estimate_Score'] = user_lucky['business_id'].apply(lambda x: svd.predict(4, x).est)   # I know the index of this lucky user is 4

user_lucky = user_lucky.drop('business_id', axis = 1)

user_lucky = user_lucky.sort_values('Estimate_Score', ascending=False)
print(user_lucky.head())




                            name  Estimate_Score
1890   Lip Smacking Foodie Tours        4.931848
1456                 Cafe Breizh        4.902774
893                 Brew Tea Bar        4.876292
2000  El Frescos Cocina Mexicana        4.851532
243                J Karaoke Bar        4.842445


## 4. Recommend with Pearsons' R correlations

In [72]:
df_p = pd.pivot_table(df,values='stars',index='user_id',columns='business_id')

print(df_p.shape)


(155423, 3753)


In [74]:
f = ['count','mean']

df_business_summary = df.groupby('business_id')['stars'].agg(f)
df_business_summary.index = df_business_summary.index.map(str)

In [82]:
def recommend(business_title, min_count):
    print("For business ({})".format(business_title))
    print("- Top 10 restaurants recommended based on Pearsons'R correlation - ")
    i = str(df_title.index[df_title['name'] == business_title][0])
    target = df_p[i]
    similar_to_target = df_p.corrwith(target)
    corr_target = pd.DataFrame(similar_to_target, columns = ['PearsonR'])
    corr_target.dropna(inplace = True)
    corr_target = corr_target.sort_values('PearsonR', ascending = False)
    corr_target.index = corr_target.index.map(int)
    corr_target = corr_target.join(df_title).join(df_business_summary)[['PearsonR', 'Name', 'count', 'mean']]
    print(corr_target[corr_target['count']>min_count][:10].to_string(index=False))

In [83]:
df_title.index[df_title['name'] == "Delmonico Steakhouse"][0]                                                     

'--9e1ONYQuAa-CB_Rrw7Tw'

In [85]:
recommend("Lip Smacking Foodie Tours", 0)

For business (Lip Smacking Foodie Tours)
- Top 10 restaurants recommended based on Pearsons'R correlation - 


  c = cov(x, y, rowvar)
  c *= 1. / np.float64(fact)


ValueError: invalid literal for int() with base 10: 'iBPyahdJRP5y0t25fF2W9w'