<h3> In this notebook, I will try to build a ranking, a content based, and a collaborative filtering recommendation engines using the Yelp data for restaurants. We will try user/user and item/item methods and see how they perform in terms of recommending restaurants

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from collections import defaultdict
from itertools import islice
from IPython.display import display
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.externals import joblib

%config InlineBackend.figure_format = 'svg'
plt.style.use('seaborn')
pd.set_option('display.width', 1000)
pd.set_option('max_columns', 60)

In [2]:
# Load data
restaurants=pd.read_csv('yelp/restaurants_reviews_clean.csv', index_col=0, parse_dates=True)

In [200]:
restaurants.head(2)

Unnamed: 0_level_0,business_id,cool,date.1,funny,review_id,stars_x,text,useful,user_id,address,Alcohol,Ambience casual,Ambience classy,Ambience divey,Ambience hipster,Ambience intimate,Ambience romantic,Ambience touristy,Ambience trendy,Ambience upscale,BikeParking,BusinessAcceptsCreditCards,BusinessParking garage,BusinessParking lot,BusinessParking street,BusinessParking valet,BusinessParking validated,Caters,DogsAllowed,GoodForKids,...,name,neighborhood,postal_code,review_count,stars_y,state,Cuisine,Total hours Friday,Total hours Monday,Total hours Saturday,Total hours Sunday,Total hours Thursday,Total hours Tuesday,Total hours Wednesday,Total hours weekdays,Total hours weekends,Friday open at,Friday close at,Monday open at,Monday close at,Saturday open at,Saturday close at,Sunday open at,Sunday close at,Thursday open at,Thursday close at,Tuesday open at,Tuesday close at,Wednesday open at,Wednesday close at
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1
2004-07-22,uz7UbvVUwsg68Rok6kbqRg,0,2004-07-22,0,PbIY2aIyszb6he6J-ey67w,5,"Sehr gutes Restaurant, leckeres essen und nett...",0.0,le_brG6cwrzvWdKEGqA7YA,Leonberger Str. 97,full_bar,False,False,,False,False,False,False,False,False,False,True,False,False,False,False,False,,False,False,...,Kashmir,,71229,41.0,4.5,BW,Indian Restaurants,5.5,5.5,5.5,5.5,5.5,5.5,5.5,27.5,11.0,17.3,23.0,17.3,23.0,17.3,23.0,17.3,23.0,17.3,23.0,17.3,23.0,17.3,23.0
2004-09-15,9X-43jnj6-6ZBuBdFm7BLA,0,2004-09-15,0,03B9-gqbeGoMmPJbNzNT5w,2,Viel Auswahl täuscht über die wahre Tatsache h...,0.0,w_6miJytUt6z8oRkGjVG-A,Rosensteinstr. 22,full_bar,True,False,,False,False,False,False,False,False,True,True,True,False,True,False,False,True,True,True,...,Woody's,,70191,91.0,3.0,BW,American Bars Cocktail Nightlife Restaurants T...,10.0,8.0,11.0,11.0,8.0,8.0,8.0,42.0,22.0,17.0,27.0,17.0,25.0,16.0,27.0,14.0,25.0,17.0,25.0,17.0,25.0,17.0,25.0


<b style='font-size:200%;'>1. A new ranking based on average star ratings and number of reviews.</b>

I will take two factors into account: average star rating and number of reviews for a particular restaurant. First I will adjust star rating based on two things: date and the user evaluated by how many reviews the user has made when he made that star rating. The reasonings are: if a rating is too old, it won't have as much impact as a rating that is more recent. For the user, I will give more weight to ones who have made more reviews than ones who have made less reviews. 
The number of reviews a restaurant has received is also a reflection of the restaurants popularity so we will factor it in when calculating final ranking.

In [201]:
# First adjusted each rating by the date and the user who made the rating
# Most current year will receive a 1 as a cofactor and other years will receive a value between 0 and 1 based on how far away
# they are from the current year
rank1_df=restaurants.copy()[['name', 'stars_x', 'review_count', 'address', 'postal_code']]

In [202]:
rank1_df.head(3)

Unnamed: 0_level_0,name,stars_x,review_count,address,postal_code
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2004-07-22,Kashmir,5,41.0,Leonberger Str. 97,71229
2004-09-15,Woody's,2,91.0,Rosensteinstr. 22,70191
2004-10-12,La Bamba,5,18.0,606 S 6th St,61820


In [203]:
len(rank1_df)

3540134

In [62]:
#6-23-18
rank1_df.to_csv('rank_df.csv')

In [204]:
rank1_df['year']=rank1_df.index.year

In [205]:
rank1_df.head(3)

Unnamed: 0_level_0,name,stars_x,review_count,address,postal_code,year
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2004-07-22,Kashmir,5,41.0,Leonberger Str. 97,71229,2004
2004-09-15,Woody's,2,91.0,Rosensteinstr. 22,70191,2004
2004-10-12,La Bamba,5,18.0,606 S 6th St,61820,2004


In [206]:
max_year=rank1_df['year'].max()

In [207]:
max_year

2017

In [208]:
# I will subtract the 1/10th of the difference between the year and the most recent year from each star rating to reflect weights
# of time
year_weights=(max_year-rank1_df['year'])/10

In [209]:
year_weights[0:10]

date
2004-07-22    1.3
2004-09-15    1.3
2004-10-12    1.3
2004-10-19    1.3
2004-10-19    1.3
2004-10-19    1.3
2004-10-19    1.3
2004-10-19    1.3
2004-12-19    1.3
2004-12-19    1.3
Name: year, dtype: float64

In [210]:
rank1_df['year penalty']=year_weights.values

In [211]:
rank1_df.head(3)

Unnamed: 0_level_0,name,stars_x,review_count,address,postal_code,year,year penalty
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2004-07-22,Kashmir,5,41.0,Leonberger Str. 97,71229,2004,1.3
2004-09-15,Woody's,2,91.0,Rosensteinstr. 22,70191,2004,1.3
2004-10-12,La Bamba,5,18.0,606 S 6th St,61820,2004,1.3


In [212]:
rank1_df['star adjusted']=rank1_df['stars_x']-rank1_df['year penalty']

In [213]:
rank1_df.head()

Unnamed: 0_level_0,name,stars_x,review_count,address,postal_code,year,year penalty,star adjusted
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2004-07-22,Kashmir,5,41.0,Leonberger Str. 97,71229,2004,1.3,3.7
2004-09-15,Woody's,2,91.0,Rosensteinstr. 22,70191,2004,1.3,0.7
2004-10-12,La Bamba,5,18.0,606 S 6th St,61820,2004,1.3,3.7
2004-10-19,Papa John's Pizza,4,29.0,106 E Green St,61820,2004,1.3,2.7
2004-10-19,Papa Del's Pizza,5,371.0,1201 S Neil St,61820,2004,1.3,3.7


In [214]:
rank1_df['unique restaurants']=[str(x)+'-'+str(y) for x , y in zip(rank1_df['name'], rank1_df['address'])]

In [215]:
# We now need to consider the popularity of each restaurant by how many total review it has received
# First get average star rating of each restaurant, use name/address pair to find unique restaurants
avg_rating=rank1_df.groupby('unique restaurants')['star adjusted'].mean().reset_index()

In [216]:
avg_rating.head()

Unnamed: 0,unique restaurants,star adjusted
0,#1 Fried Rice-9310 W Van Buren St,3.371053
1,"#1 Hawaiian Barbecue-5905 S Eastern Ave, Ste 105",3.534375
2,#1 Pho-7778 W 130th St,3.142857
3,#1 Sushi-9617 N Metro Pkwy,4.009091
4,"#1Brothers Pizza-16995 W Greenway Rd, Ste 104",2.77


In [217]:
avg_rating.columns=['unique restaurants', 'average star']

In [218]:
len(avg_rating)

68943

In [220]:
rank1_df.head(3)

Unnamed: 0_level_0,name,stars_x,review_count,address,postal_code,year,year penalty,star adjusted,unique restaurants
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2004-07-22,Kashmir,5,41.0,Leonberger Str. 97,71229,2004,1.3,3.7,Kashmir-Leonberger Str. 97
2004-09-15,Woody's,2,91.0,Rosensteinstr. 22,70191,2004,1.3,0.7,Woody's-Rosensteinstr. 22
2004-10-12,La Bamba,5,18.0,606 S 6th St,61820,2004,1.3,3.7,La Bamba-606 S 6th St


In [233]:
# Get the total number of reviews for each restaurant
new_ranking=rank1_df.drop(['stars_x', 'year', 'year penalty', 'star adjusted', 'unique restaurants'], axis=1).drop_duplicates()

In [234]:
new_ranking.reset_index(drop=True, inplace=True)

In [235]:
new_ranking.head()

Unnamed: 0,name,review_count,address,postal_code
0,Kashmir,41.0,Leonberger Str. 97,71229
1,Woody's,91.0,Rosensteinstr. 22,70191
2,La Bamba,18.0,606 S 6th St,61820
3,Papa John's Pizza,29.0,106 E Green St,61820
4,Papa Del's Pizza,371.0,1201 S Neil St,61820


In [236]:
len(new_ranking)

69067

In [237]:
new_ranking['unique restaurants']=[str(x)+'-'+str(y) for x , y in zip(new_ranking['name'], new_ranking['address'])]

In [238]:
new_ranking=new_ranking.merge(avg_rating, how='left', on='unique restaurants')

In [239]:
new_ranking.head()

Unnamed: 0,name,review_count,address,postal_code,unique restaurants,average star
0,Kashmir,41.0,Leonberger Str. 97,71229,Kashmir-Leonberger Str. 97,3.960976
1,Woody's,91.0,Rosensteinstr. 22,70191,Woody's-Rosensteinstr. 22,2.261538
2,La Bamba,18.0,606 S 6th St,61820,La Bamba-606 S 6th St,2.855556
3,Papa John's Pizza,29.0,106 E Green St,61820,Papa John's Pizza-106 E Green St,1.993103
4,Papa Del's Pizza,371.0,1201 S Neil St,61820,Papa Del's Pizza-1201 S Neil St,3.02252


In [240]:
len(new_ranking)

69067

I will use IMDB's method for calculating weighted star rating as follows: <br>
<b>weighted rating</b> = $\frac{V}{V+M}$&sdot;$R$ + $\frac{M}{V+M}$&sdot;$C$ <br>
Here: <br>
V=number of reviews for the restaurant<br>
M=minimum number reviews required to be put on the recommendation list<br>
R=average star rating of the restaurant<br>
C=mean star rating of all restaurants

In [241]:
# To define the minimum number of reviews, I will use the 50 as the minimum number of reviews to be listed

In [242]:
min_reviews=50

In [243]:
weighted_ranking=new_ranking.copy()

In [244]:
weighted_ranking=weighted_ranking[weighted_ranking['review_count']>=50]

In [245]:
len(weighted_ranking)

16074

In [246]:
mean_star_rating=np.mean(weighted_ranking['average star'])

In [247]:
mean_star_rating

3.4360065426756634

In [248]:
# Calculate weighted ratings
weighted_ratings=[(x*weighted_ranking['average star'].values[idx]+min_reviews*mean_star_rating)/(x+min_reviews) \
                  for idx, x in enumerate(weighted_ranking['review_count'])]

In [249]:
weighted_ratings[0:5]

[2.678016504494917,
 3.071627759903334,
 2.1837704910622673,
 2.8000464907941134,
 2.881178394904609]

In [250]:
weighted_ranking['Weighted star rating']=weighted_ratings

In [251]:
weighted_ranking.head(3)

Unnamed: 0,name,review_count,address,postal_code,unique restaurants,average star,Weighted star rating
1,Woody's,91.0,Rosensteinstr. 22,70191,Woody's-Rosensteinstr. 22,2.261538,2.678017
4,Papa Del's Pizza,371.0,1201 S Neil St,61820,Papa Del's Pizza-1201 S Neil St,3.02252,3.071628
9,LVH - Las Vegas Hotel & Casino,942.0,3000 Paradise Rd,89109,LVH - Las Vegas Hotel & Casino-3000 Paradise Rd,2.117304,2.18377


In [252]:
weighted_ranking.columns=['Name', 'Total Reviews', 'Address', 'Postal Code', 'unique', 'Avg', 'Weighted Star Rating']

In [253]:
weighted_ranking.drop(['unique', 'Avg'], axis=1, inplace=True)

In [261]:
# 6-23-18
weighted_ranking.to_csv('final weighted ranking.csv')

In [254]:
# 6-23-18 save the file
new_ranking.to_csv('restaurants weighted rating.csv')

In [255]:
# Make a function to recommend top n restaurants based on postal code using new weighted star ratings
def recommend(postal, n):
    # If user enters a postal code, return the top n restaurants. If not found, return no restaurants found
    postal=str(postal)
    if postal in weighted_ranking['Postal Code'].values:
        all_res=weighted_ranking[weighted_ranking['Postal Code']==postal].sort_values('Weighted Star Rating', 
                                                                                       ascending=False).copy()
        top_n=all_res.copy().head(n)
        nrows=len(top_n)
        top_n.index=np.arange(nrows)+1
        top_n.columns=['Restaurants', 'Total Reviews', 'Address', 'Postal Code', 'Star Rating']
        top_n['Total Reviews']=top_n['Total Reviews'].astype(int)
        top_n['Star Rating']=top_n['Star Rating'].apply(lambda x: round(x, 1))
    else:
        top_n='Sorry, no restaurants were found in the postal code you specified'
    return(print('Recommended restaurants are: \n \n', top_n))

In [256]:
recommend(15222, 5)

Recommended restaurants are: 
 
                  Restaurants  Total Reviews        Address Postal Code  Star Rating
1  Gaucho Parrilla Argentina           1267  1601 Penn Ave       15222          4.4
2           DiAnoia's Eatery            219  2549 Penn Ave       15222          4.2
3            Smallman Galley            339     54 21st St       15222          4.2
4                       täkō            806     214 6th St       15222          4.2
5                Bakersfield            457   940 Penn Ave       15222          4.1


In [257]:
recommend(11364, 5)

Recommended restaurants are: 
 
 Sorry, no restaurants were found in the postal code you specified


<b style='font-size:200%;'>2. Content based recommendation system.</b>

The second recommendation system I am going to build is content based. I will use the cuisine and all the attributes columns to find similarities between restaurants and recommend other restaurants based on these similarities.

In [5]:
columns_to_keep=['name','Alcohol','Ambience casual','Ambience classy','Ambience divey','Ambience hipster','Ambience intimate',
                 'Ambience romantic','Ambience touristy','Ambience trendy','Ambience upscale','BikeParking', 'address',
                 'BusinessAcceptsCreditCards','BusinessParking garage','BusinessParking lot','BusinessParking street',
                 'BusinessParking valet','BusinessParking validated','Caters','DogsAllowed','GoodForKids','GoodForMeal breakfast',
                 'GoodForMeal brunch','GoodForMeal dessert','GoodForMeal dinner','GoodForMeal latenight','GoodForMeal lunch',
                 'HasTV','NoiseLevel','OutdoorSeating','RestaurantsAttire','RestaurantsDelivery','RestaurantsGoodForGroups',
                 'RestaurantsPriceRange2','RestaurantsReservations','RestaurantsTableService','RestaurantsTakeOut',
                 'WheelchairAccessible','WiFi', 'Cuisine']

In [259]:
rec_df=restaurants.copy()[columns_to_keep]

In [260]:
rec_df.info(null_counts=True)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3540134 entries, 2004-07-22 to 2017-12-11
Data columns (total 41 columns):
name                          3540133 non-null object
Alcohol                       3142933 non-null object
Ambience casual               3092102 non-null object
Ambience classy               3092102 non-null object
Ambience divey                2515635 non-null object
Ambience hipster              3091947 non-null object
Ambience intimate             3092102 non-null object
Ambience romantic             3092102 non-null object
Ambience touristy             3092102 non-null object
Ambience trendy               3092102 non-null object
Ambience upscale              3092102 non-null object
BikeParking                   3283085 non-null object
address                       3530054 non-null object
BusinessAcceptsCreditCards    3446709 non-null object
BusinessParking garage        3432378 non-null object
BusinessParking lot           3432378 non-null object
Business

In [262]:
# Drop ambience divey , wheelchairaccessible, and dogs allowed columns then drop na
rec_df.drop(['Ambience divey', 'DogsAllowed', 'WheelchairAccessible'], axis=1, inplace=True)
rec_df.fillna('unknown', inplace=True)

In [263]:
rec_df.drop_duplicates(inplace=True)

In [264]:
rec_df.info(null_counts=True)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 69070 entries, 2004-07-22 to 2017-12-08
Data columns (total 38 columns):
name                          69070 non-null object
Alcohol                       69070 non-null object
Ambience casual               69070 non-null object
Ambience classy               69070 non-null object
Ambience hipster              69070 non-null object
Ambience intimate             69070 non-null object
Ambience romantic             69070 non-null object
Ambience touristy             69070 non-null object
Ambience trendy               69070 non-null object
Ambience upscale              69070 non-null object
BikeParking                   69070 non-null object
address                       69070 non-null object
BusinessAcceptsCreditCards    69070 non-null object
BusinessParking garage        69070 non-null object
BusinessParking lot           69070 non-null object
BusinessParking street        69070 non-null object
BusinessParking valet         69070 non-nu

In [267]:
# Combine ranking data frame with this data frame by unique restaurants
rec_df['unique restaurants']=[str(x)+'-'+str(y) for x , y in zip(rec_df['name'], rec_df['address'])]
new_ranking['unique restaurants']=[str(x)+'-'+str(y) for x , y in zip(new_ranking['name'], new_ranking['address'])]

In [268]:
rec_df.to_csv('rec_df.csv')
new_ranking.to_csv('new_ranking.csv')

In [2]:
# 6-24-18
rec_df=pd.read_csv('rec_df.csv', index_col=0)
new_ranking=pd.read_csv('new_ranking.csv', index_col=0)

In [8]:
# Stopped here on 6-23-18
content_df=new_ranking.merge(rec_df, how='inner', on='unique restaurants')

In [9]:
content_df.head(3)

Unnamed: 0,name_x,review_count,address_x,postal_code,unique restaurants,average star,name_y,Alcohol,Ambience casual,Ambience classy,Ambience hipster,Ambience intimate,Ambience romantic,Ambience touristy,Ambience trendy,Ambience upscale,BikeParking,address_y,BusinessAcceptsCreditCards,BusinessParking garage,BusinessParking lot,BusinessParking street,BusinessParking valet,BusinessParking validated,Caters,GoodForKids,GoodForMeal breakfast,GoodForMeal brunch,GoodForMeal dessert,GoodForMeal dinner,GoodForMeal latenight,GoodForMeal lunch,HasTV,NoiseLevel,OutdoorSeating,RestaurantsAttire,RestaurantsDelivery,RestaurantsGoodForGroups,RestaurantsPriceRange2,RestaurantsReservations,RestaurantsTableService,RestaurantsTakeOut,WiFi,Cuisine
0,Kashmir,41.0,Leonberger Str. 97,71229,Kashmir-Leonberger Str. 97,3.960976,Kashmir,full_bar,False,False,False,False,False,False,False,False,False,Leonberger Str. 97,True,False,False,False,False,False,unknown,False,False,False,False,True,False,False,False,quiet,False,casual,False,True,2.0,True,True,True,no,Indian Restaurants
1,Woody's,91.0,Rosensteinstr. 22,70191,Woody's-Rosensteinstr. 22,2.261538,Woody's,full_bar,True,False,False,False,False,False,False,False,True,Rosensteinstr. 22,True,True,False,True,False,False,True,True,False,False,False,False,False,False,True,average,True,casual,False,True,2.0,True,True,True,free,American Bars Cocktail Nightlife Restaurants T...
2,La Bamba,18.0,606 S 6th St,61820,La Bamba-606 S 6th St,2.855556,La Bamba,none,unknown,unknown,unknown,unknown,unknown,unknown,unknown,unknown,unknown,606 S 6th St,True,False,False,False,False,False,unknown,True,False,False,False,True,True,True,unknown,unknown,False,casual,False,True,1.0,False,False,True,unknown,Mexican Restaurants


In [10]:
content_df.drop(['review_count', 'unique restaurants', 'name_y', 'address_y'], axis=1, inplace=True)

In [11]:
content_df.info(null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68639 entries, 0 to 68638
Data columns (total 40 columns):
name_x                        68639 non-null object
address_x                     68639 non-null object
postal_code                   68574 non-null object
average star                  68639 non-null float64
Alcohol                       68639 non-null object
Ambience casual               68639 non-null object
Ambience classy               68639 non-null object
Ambience hipster              68639 non-null object
Ambience intimate             68639 non-null object
Ambience romantic             68639 non-null object
Ambience touristy             68639 non-null object
Ambience trendy               68639 non-null object
Ambience upscale              68639 non-null object
BikeParking                   68639 non-null object
BusinessAcceptsCreditCards    68639 non-null object
BusinessParking garage        68639 non-null object
BusinessParking lot           68639 non-null object
Busine

In [12]:
content_df.dropna(inplace=True)

In [13]:
# First transform Cuisine column to tfidf matrix
tfidf=TfidfVectorizer()

In [14]:
tfidf_cuisine_matrix=tfidf.fit_transform(content_df['Cuisine'])

In [15]:
tfidf_cuisine_matrix.shape

(68574, 888)

In [17]:
# Now use labelencoder to transform other columns into numbers also standardize the star rating column
le=LabelEncoder()
scaler=StandardScaler()

In [18]:
attr_columns=[x for x in columns_to_keep if x not in ['name', 'Cuisine', 'Ambience divey', 'DogsAllowed', 'address',
                                                      'WheelchairAccessible']]

In [19]:
for col in attr_columns:
    content_df[col]=le.fit_transform(content_df[col].astype(str))

In [21]:
content_df['average star']=scaler.fit_transform(content_df['average star'].reshape(-1,1))

  """Entry point for launching an IPython kernel.


In [22]:
content_df.head(3)

Unnamed: 0,name_x,address_x,postal_code,average star,Alcohol,Ambience casual,Ambience classy,Ambience hipster,Ambience intimate,Ambience romantic,Ambience touristy,Ambience trendy,Ambience upscale,BikeParking,BusinessAcceptsCreditCards,BusinessParking garage,BusinessParking lot,BusinessParking street,BusinessParking valet,BusinessParking validated,Caters,GoodForKids,GoodForMeal breakfast,GoodForMeal brunch,GoodForMeal dessert,GoodForMeal dinner,GoodForMeal latenight,GoodForMeal lunch,HasTV,NoiseLevel,OutdoorSeating,RestaurantsAttire,RestaurantsDelivery,RestaurantsGoodForGroups,RestaurantsPriceRange2,RestaurantsReservations,RestaurantsTableService,RestaurantsTakeOut,WiFi,Cuisine
0,Kashmir,Leonberger Str. 97,71229,0.892481,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,2,0,0,0,1,1,1,1,1,1,Indian Restaurants
1,Woody's,Rosensteinstr. 22,70191,-1.138865,1,1,0,0,0,0,0,0,0,1,1,1,0,1,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,1,1,1,1,1,0,American Bars Cocktail Nightlife Restaurants T...
2,La Bamba,606 S 6th St,61820,-0.428834,2,2,2,2,2,2,2,2,2,2,1,0,0,0,0,0,2,1,0,0,0,1,1,1,2,3,0,0,0,1,0,0,0,1,3,Mexican Restaurants


In [24]:
content_df[attr_columns].shape

(68574, 35)

In [26]:
content_df['average star'].shape

(68574,)

In [35]:
attr_columns.remove('average star')

In [36]:
attr_columns

['Alcohol',
 'Ambience casual',
 'Ambience classy',
 'Ambience hipster',
 'Ambience intimate',
 'Ambience romantic',
 'Ambience touristy',
 'Ambience trendy',
 'Ambience upscale',
 'BikeParking',
 'BusinessAcceptsCreditCards',
 'BusinessParking garage',
 'BusinessParking lot',
 'BusinessParking street',
 'BusinessParking valet',
 'BusinessParking validated',
 'Caters',
 'GoodForKids',
 'GoodForMeal breakfast',
 'GoodForMeal brunch',
 'GoodForMeal dessert',
 'GoodForMeal dinner',
 'GoodForMeal latenight',
 'GoodForMeal lunch',
 'HasTV',
 'NoiseLevel',
 'OutdoorSeating',
 'RestaurantsAttire',
 'RestaurantsDelivery',
 'RestaurantsGoodForGroups',
 'RestaurantsPriceRange2',
 'RestaurantsReservations',
 'RestaurantsTableService',
 'RestaurantsTakeOut',
 'WiFi']

In [30]:
# Now build a matrix for calculating cosine similarities
feature_matrix=scipy.sparse.hstack((tfidf_cuisine_matrix, content_df[attr_columns].values), format='csr')

In [31]:
feature_matrix.shape

(68574, 924)

In [32]:
type(feature_matrix)

scipy.sparse.csr.csr_matrix

In [None]:
cos_matrix=cosine_similarity(feature_matrix, dense_output=False)

In [168]:
# 6-24-18
content_df.to_csv('content_df.csv')

In [163]:
# Define a function that's similar to the first recommender function but output list by cosine similarities
def content_recommender(restaurant_name, postal, n):
    # Output restaurants based on similarities
    tfidf=TfidfVectorizer()
    #scaler=StandardScaler()
    postal=str(postal)
    if restaurant_name in content_df['name_x'].values and postal in content_df['postal_code'].values :
        df=content_df.copy()
        df.reset_index(inplace=True, drop=True)
        df=df[df['postal_code']==postal]
        df.drop_duplicates('name_x', inplace=True)
        df.reset_index(inplace=True, drop=True)
        cuisine_matrix=tfidf.fit_transform(df['Cuisine'])
        for col in attr_columns:
            df[col]=le.fit_transform(df[col].astype(str))
        #df['average star']=scaler.fit_transform(df['average star'].values.reshape(-1,1))
        all_attr_cols=attr_columns.append('average star')
        feature_matrix=scipy.sparse.hstack((cuisine_matrix, df[attr_columns].values), format='csr')
        cos_matrix=cosine_similarity(feature_matrix)
        cos_df=pd.DataFrame(cos_matrix, index=range(cos_matrix.shape[0]), columns=df['name_x'].values)
        cos_df=pd.concat([cos_df, df[['name_x', 'address_x', 'Cuisine']]], axis=1)
        #top_res=np.array(cos_df.index)[[np.argsort(cos_df[restaurant_name].values)[::-1]]]
        top_res=cos_df.sort_values(restaurant_name, ascending=False)
        top_res.drop_duplicates('name_x', inplace=True)
        top_n_res=top_res.iloc[1:n+1, :]
        top_n_res=top_n_res[['name_x', 'address_x', 'Cuisine']]
        top_n_res.reset_index(drop=True, inplace=True)
        top_n_res.columns=['Name', 'Address', 'Cuisine']
        top_n_res.index=range(1, n+1)
        top_n_res.index.name=None
    else:
        top_n_res='The restaurant you entered was not found in our database. We will add it in the future'
    return(print('Recommended restaurants are: \n'), display(top_n_res))

In [164]:
# Test the recommender
content_recommender('Kashmir', 71229, 5)

Recommended restaurants are: 



Unnamed: 0,Name,Address,Cuisine
1,Valentin's Bistro,Marktplatz 5,Bistros Restaurants
2,nah und gut Alsadi,In den Ziegelwiesen 7,Food Grocery Shopping
3,China-Restaurant Jasmin,Eltinger Str. 5,Chinese Restaurants
4,Don Giovanni,Niederhofenstr. 64,Italian Pizza Restaurants
5,Restaurant Glemseck,Glemseck 1,German Restaurants


(None, None)

In [165]:
# Test again as a sanity check (pizza place is easy to find similar ones)
content_recommender("Papa John's Pizza", 61820, 10)

Recommended restaurants are: 



Unnamed: 0,Name,Address,Cuisine
1,Pizza Hut,411 E Green St,Chicken Italian Pizza Restaurants Wings
2,Wood N' Hog Barbecue,"904 N 4th St, Ste B",Barbeque Chicken Restaurants Wings
3,Sliders,616 E Green St,Burgers Restaurants
4,Pancheros Mexican Grill,2009 S Neil St,Mexican Restaurants
5,Chopstix,"202 E Green St, Ste 1",Chinese Restaurants
6,Drew's Pizza,508 E Green St,Pizza Restaurants
7,Layalina Mediterranean Grill,40 E Springfield Ave,Mediterranean Restaurants
8,Domino's Pizza,102 E Green St,Chicken Italian Pizza Restaurants Sandwiches W...
9,Prime Time Pizza,505 E University Ave,Pizza Restaurants
10,M Sushi & Grill,"715 S Neil St, Ste A",Asian Bars Fusion Japanese Korean Restaurants ...


(None, None)

In [167]:
# Test again.
content_recommender('Starbucks', 44118, 10)

Recommended restaurants are: 



Unnamed: 0,Name,Address,Cuisine
1,Phoenix Coffee,2287 Lee Rd,Coffee Food Tea
2,Zero Below,1844 Coventry Rd,Cream Desserts Food Frozen Ice Yogurt
3,Piccadilly Artisan Yogurt,1767 Coventry Rd,Cream Food Frozen Ice Yogurt
4,Phoenix Coffee Company,1854 Coventry Rd,Coffee Food Rooms Tea
5,Walgreens,3020 Mayfield Rd,Beauty Convenience Cosmetics Drugstores Food P...
6,KiwiSpoon Frozen Yogurt,"1854 Coventry Rd, Ste C",Cream Food Frozen Ice Yogurt
7,Mitchell's Fine Chocolates,2285 Lee Rd,Candy Chocolatiers Food Shops Specialty Stores
8,The Sweet Fix Bakery,2307 Lee Rd,Bakeries Food
9,Bialy's Bagels,2267 Warrensville Ctr Rd,Bagels Food
10,Ben & Jerry's,20650 N Park Blvd,Cream Desserts Food Frozen Ice Yogurt


(None, None)

<b style='font-size:200%;'>3. Collaborative Filtering (CF) recommendation system.</b>

The third recommendation system I am going to build is collaborative filtering. I will use restaurant/restaurant and user/restaurant methods to find recommendations.

In [3]:
# First transform the data into user rows and restaurant columns format
user_item_rating=restaurants.copy()[['user_id', 'business_id', 'stars_x']]

In [4]:
user_item_rating.head()

Unnamed: 0_level_0,user_id,business_id,stars_x
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2004-07-22,le_brG6cwrzvWdKEGqA7YA,uz7UbvVUwsg68Rok6kbqRg,5
2004-09-15,w_6miJytUt6z8oRkGjVG-A,9X-43jnj6-6ZBuBdFm7BLA,2
2004-10-12,sE3ge33huDcNJGW3V4obww,PD2MAlYYi9HCqPH7IBKwTg,5
2004-10-19,nkN_do3fJ9xekchVC-v68A,oYMsq2Xvzw6UbrIlMWjb-A,4
2004-10-19,c6HT44PKCaXqzN_BdgKPCw,u8C8pRvaHXg3PgDrsUHJHQ,5


In [5]:
user_item_rating.info(null_counts=True)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3540134 entries, 2004-07-22 to 2017-12-11
Data columns (total 3 columns):
user_id        3540133 non-null object
business_id    3540134 non-null object
stars_x        3540134 non-null int64
dtypes: int64(1), object(2)
memory usage: 108.0+ MB


In [5]:
user_item_rating.dropna(inplace=True)
user_item_rating.reset_index(inplace=True, drop=True)

In [6]:
user_item_rating['User Restaurant Pair']=[x+y for x, y in zip(user_item_rating['user_id'].values, 
                                                              user_item_rating['business_id'].values)]

In [7]:
user_item_rating['User Restaurant Pair'].value_counts()

n9LehSQoQRnYk_g6IfKVFgUmZdQID7QJoyg2R92mK3HA    1
dExVcEIfWkaqYKkMsRYxfAxdD3EXvF_p9WUiwwKpol_w    1
wxu6RAQqre73_id5lALttAq2PUogtjncgsNn-pLzw5HQ    1
EDe9TDxjQulNGKICLs4uwQCXFARDJEpOreI9Xyxzl8Yw    1
MS5160dww4tX3x03cfk3jALs89UV4aSRE8eW2Pj_LaWg    1
GVFSUXBPUzUu6BU37dmixQb_xFN9mtFWHUbny-GoiErQ    1
pv6iucL4vcT4Bm69nd477gD7Wv_B7GBbrx-EDfgxDwRA    1
sKe7VxOXyAr3IZ69SyHtuQ_w5hBpkjHs5_Hv3pLeHtIw    1
g9BfS5P5OZe8UtSLdVFQ0Q7LBVI9euGV3ugO0-efZHvg    1
RUxmQJCEiQT4b5JV9dVccwebL8hN0iieaa3rClnDfLlw    1
fQonrHZqMX_jjide5080Jwgx2yPrOJSwF1ApJYdGBWIw    1
gre9OK5iWNnDWt7AIis_kQL5VPrCwatTgGUrfU9y4ItA    1
MDc00oCXZWo78vIP9ikdQQaTVssbSnSHOUitXjfyCWZg    1
ePfhvRM-deQgMduESHRWDgmnwRtuVQEsIUomBchu0gwg    1
-0PZxOXJcG6brIiRhjkungRU2VtVZG8to-XVEHD-2qag    1
84F_UxRVby3WqFIxt3Wshg1dhPgc7E7IzzpxjHM2LphQ    1
Z6DDKkU0Zlh2bFv420A1rQ1aVqiz43klXaFJUUx0H5fw    1
-FDKN7C_nD5-6O0AJzR1_wo13eH93qmWVNFZogkjhd9w    1
M0egdMEYFdz50HJ-DJwWxgeLFfWcdb7VkqNyTONksHiQ    1
_6Zg4ukwS0kst9UtkfVw3wVxZQKw6fJzJUW14v4dWKXg    1


In [8]:
user_item_rating.drop('User Restaurant Pair', axis=1, inplace=True)

In [9]:
user_item_rating.to_csv('yelp/user_item_rating.csv')

Creating a large pivot table for the user/item format will cause memory error and no sparse pivot table currently exists. I will use data from on state and demonstrate the method.

In [14]:
# 7-5-18
user_item_rating=pd.read_csv('yelp/user_item_rating.csv', index_col=0)

  mask |= (ar1 == a)


In [6]:
user_item_rating.head()

Unnamed: 0,user_id,business_id,stars_x
0,le_brG6cwrzvWdKEGqA7YA,uz7UbvVUwsg68Rok6kbqRg,5
1,w_6miJytUt6z8oRkGjVG-A,9X-43jnj6-6ZBuBdFm7BLA,2
2,sE3ge33huDcNJGW3V4obww,PD2MAlYYi9HCqPH7IBKwTg,5
3,nkN_do3fJ9xekchVC-v68A,oYMsq2Xvzw6UbrIlMWjb-A,4
4,c6HT44PKCaXqzN_BdgKPCw,u8C8pRvaHXg3PgDrsUHJHQ,5


In [4]:
user_item_rating.info(null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3540133 entries, 0 to 3540132
Data columns (total 3 columns):
user_id        3540133 non-null object
business_id    3540133 non-null object
stars_x        3540133 non-null int64
dtypes: int64(1), object(2)
memory usage: 108.0+ MB


In [15]:
# I will use surprise package to build callaborative filtering models
from surprise import NormalPredictor, Dataset, Reader, SVD, accuracy, KNNBasic
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split as tts

In [16]:
# The full data set will cause a memory problem when buiding training and test data sets, so I will take a subset for this.
sub_set=user_item_rating.sample(5000, axis=0)

In [3]:
# I need to make a dictionary to map the business ids to actual restaurant names
rest_name_dict=dict(zip(restaurants['business_id'], restaurants['name']))

In [14]:
joblib.dump(rest_name_dict, 'yelp/rest_name_dict.pkl')

['yelp/rest_name_dict.pkl']

In [7]:
rest_name_dict=joblib.load('yelp/rest_name_dict.pkl')

In [17]:
# Convert business_id to restaurant names
sub_set['business_id']=sub_set['business_id'].map(rest_name_dict)

In [18]:
sub_set.head()

Unnamed: 0,user_id,business_id,stars_x
420160,-5KiEoe-mb4sU2VwnVwrYw,Cucina by Wolfgang Puck,5
2828521,LPWSfa9mGf59P7KZg_eOdA,My Thai,5
3322,EyLVCFOKltmlMg7XcRxU9Q,Rumbi Island Grill,5
1206300,-tBJZffckiQC6zhPlwVpVQ,ZuZu,4
1239484,TPa6ZGNRjH2Zfrwq4vsUPw,SAS Cupcakes,3


In [19]:
reader = Reader(rating_scale=(1, 5))

In [20]:
data = Dataset.load_from_df(sub_set[['user_id', 'business_id', 'stars_x']], reader)

In [21]:
# Use the full data frame for training our recommender
train_set=data.build_full_trainset()

In [22]:
# Get a test set by findindg all restaurants that users have not rated
test_set = train_set.build_anti_testset()

In [12]:
#train_set, test_set=tts(data, test_size=0.25)

In [23]:
# Use singular value decomposition algorithm
svd=SVD()

In [24]:
svd.fit(train_set)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x11559169470>

In [25]:
preds=svd.test(test_set)

In [26]:
len(preds)

17521082

In [27]:
preds[0]

Prediction(uid='-5KiEoe-mb4sU2VwnVwrYw', iid='My Thai', r_ui=3.6734, est=4.0751192598239, details={'was_impossible': False})

In [43]:
# Check RMSE as a metric for measuring model accuracy
accuracy.rmse(preds)

RMSE: 0.2072


0.20721945717756124

The RMSE is very low indicating a good model.

In [28]:
top_rec=defaultdict(list)

In [29]:
# Get all recommendations for all users
for uid, iid, true_r, est, _ in preds:
    top_rec[uid].append((iid, est))

In [31]:
# Get top 5 recommedations based on predicted ratings
for uid, user_ratings in top_rec.items():
    user_ratings.sort(key = lambda x: x[1], reverse = True)
    top_rec[uid] = user_ratings[:5]

In [33]:
joblib.dump(top_rec, 'yelp/top_rec.pkl')

['yelp/top_rec.pkl']

In [2]:
top_rec=joblib.load('yelp/top_rec.pkl')

In [34]:
top_rec_df=pd.DataFrame.from_dict(top_rec, orient='index')

In [35]:
top_rec_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4886 entries, -5KiEoe-mb4sU2VwnVwrYw to lX_j1NZA0xpJ9E1tu7SsIQ
Data columns (total 5 columns):
0    4886 non-null object
1    4886 non-null object
2    4886 non-null object
3    4886 non-null object
4    4886 non-null object
dtypes: object(5)
memory usage: 229.0+ KB


In [36]:
top_rec_df.to_csv('yelp/top_rec_df.csv')

In [10]:
top_rec_df=pd.read_csv('yelp/top_rec_df.csv', index_col=0)

In [37]:
top_rec_df=top_rec_df.T

In [41]:
# To show recommended restaurants for the first 5 users
recommended=top_rec_df.iloc[:, 0:5].T

In [42]:
recommended.index.name='Users'
recommended.columns=['1st Restaurant', '2nd Restaurant', '3rd Restaurant', '4th Restaurant', '5th Restaurant']

In [44]:
recommended

Unnamed: 0_level_0,1st Restaurant,2nd Restaurant,3rd Restaurant,4th Restaurant,5th Restaurant
Users,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
-5KiEoe-mb4sU2VwnVwrYw,"(Yardbird Southern Table & Bar, 4.448143511632...","(Rollin Smoke Barbeque, 4.361078514298695)","(Jean Philippe Patisserie, 4.338873641980125)","(Yard House, 4.325951264322038)","(Hiroba Sushi, 4.30917748011664)"
LPWSfa9mGf59P7KZg_eOdA,"(Mon Ami Gabi, 4.560510815225531)","(Yard House, 4.3325320086542085)","(Yardbird Southern Table & Bar, 4.273418535906...","(The Combine Eatery, 4.237205892813327)","(Brew Tea Bar, 4.236163737902843)"
EyLVCFOKltmlMg7XcRxU9Q,"(Yard House, 4.420580330439966)","(Mon Ami Gabi, 4.390624184521719)","(Lola Coffee, 4.281053068868802)","(Brew Tea Bar, 4.278227307282902)","(Yardbird Southern Table & Bar, 4.275357509866..."
-tBJZffckiQC6zhPlwVpVQ,"(Yardbird Southern Table & Bar, 4.30498559007849)","(Yard House, 4.251247996344009)","(Mon Ami Gabi, 4.236412309252682)","(District One, 4.2217447790537825)","(Gelatology, 4.21387119440429)"
TPa6ZGNRjH2Zfrwq4vsUPw,"(Gordon Ramsay Steak, 4.239030581165461)","(Brew Tea Bar, 4.192484834780216)","(Mon Ami Gabi, 4.177138518433449)","(Yardbird Southern Table & Bar, 4.168899445418...","(Millie's Homemade Ice Cream, 4.147826665191982)"


<b>The columns are the recommended top 5 restaurants. The first part of the tuple is the name of the restaurant and the second part of the tuple is the predicted rating the user would give to that restaurant.<br>
I didn't convert user_id to actual names for the following reasons:<br>
-In the data set there are only first names so I would get identical names a lot<br>
-User ids are unique so that's what identify each user<br>
<br>