# Category classifier
The goal of this project is to build a restaurant recommender for users. The initial approach is to use unsupervised learning methods to vectorize restaurants, so that similar restaurants sit close together in the vector space. This may be visualized through PCA.

The restaurant vectors will come from the reviews of the various restaurants.

A user's feeling about a given restaurant will be taken from either A) their star rating, or B) sentiment analysis of their reviews. I will start with A given its simplicity, with the caveat that not all 5-star reviews are alike (e.g. I could rate a place 5 stars but not necessarily want to go to a similar place again).

**Customer/Use Case:** A potential user would be Yelp, which could use the recommender to increase the value of the platform to its users, thereby improving customer acquisition, usage, and retention.

**Approach:** 
1) Data curation and EDA (accomplished in separate notebooks)
2) Data cleaning
    * Reducing feature and data scope (**Initially PA only**)
    * We would likely only want to categorize restaurants that have a certain number of reviews in order to avoid noisy data.
3) Review aggregation and cutoff selection

    * All review data will be combined into a single field, with an initial df something like this:
| RestaurantId | RestaurantName | AllCombinedReviews |
| ------------ | -------------- | ------------------ |
| abcde...     | John's Place   | loved it was good, etc. | 

5) Featurize the review data
    * Review data will be featurized using **tf-idf**, but additional embeddings could be used as time permits.
    * Dimensionality reduction will be performed via non-negative matrix factorization
      * This will output a reduced feature set for the restaurants. The initial feature set will have 40 features, but this could be tuned as time permits.
      * Matrix H (features × terms) will contain the cluster centroids. Matrix W (restaurants × features) will contain the cluster membership indicators.
6)  Building out recommender
    * **Initial POC:** using the business_id from a user's 5-star review, calculate a similarity score to the other restaurants and return the three restaurants with the highest similarity that do not have the same name (in order to avoid recommending a different Starbucks to someone who likes Starbucks).
    * Could test with various similarity scores to see what works best.
7) Evaluate recommender
   * Evaluation will likely be a manual review given the unsupervised nature of the model
8) Deployment
   * This is a stretch goal. Would be cool to host on AWS for online input
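
Steps 5 and 6 can be sketched end to end on a toy corpus. This is a minimal sketch under stated assumptions, not the notebook's pipeline: the four review strings, the two latent features, and the helper `most_similar` are all hypothetical, and the lemmatizing tokenizer and custom stop words are left out.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the aggregated reviews (one string per restaurant); hypothetical data.
toy_reviews = {
    "pizza_a": "great pizza crust cheese slice pizza sauce",
    "pizza_b": "pizza slice cheese crust good pie",
    "sushi_a": "fresh sushi roll tuna salmon rice",
    "sushi_b": "sushi tuna roll spicy salmon fresh fish",
}
names = list(toy_reviews)

# Step 5: tf-idf features, then NMF down to a handful of latent features.
tfidf_toy = TfidfVectorizer().fit_transform(toy_reviews.values())
nmf_toy = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W_toy = nmf_toy.fit_transform(tfidf_toy)   # restaurants x latent features
H_toy = nmf_toy.components_                # latent features x terms

# Step 6: cosine similarity between the restaurant vectors.
sim_toy = cosine_similarity(W_toy)

def most_similar(name):
    """Return the other toy restaurant with the highest similarity to `name`."""
    i = names.index(name)
    scores = sim_toy[i].copy()
    scores[i] = -np.inf  # never recommend the restaurant itself
    return names[int(np.argmax(scores))]

print(most_similar("pizza_a"))
```

Fixing `random_state` keeps the factorization repeatable between runs, which helps when comparing recommendations by eye.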

In [386]:
# Importing all packages including NLTK downloads as necessary
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('ggplot')
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk.corpus
import string
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity 
from collections import defaultdict
from tabulate import tabulate
first_run = False
if first_run:
    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('wordnet')
import dataprep

## Importing and cleaning data

In [2]:
data_import = True
if data_import:
    business = pd.read_csv("yelp_dataset/yelp_academic_dataset_business.csv", low_memory=False)
    reviews = pd.read_csv("yelp_dataset/yelp_academic_dataset_review.csv")

### Filtering for PA restaurants only

In [40]:
# The review cleaning function takes a long time, so filter the dataset before going further.
clean_business = dataprep.clean_business_data(business)
# .copy() so that adding columns below modifies an independent DataFrame rather than a slice
PA_business = clean_business[clean_business['state'] == 'PA'].copy()

filtered_reviews = reviews[reviews['business_id'].isin(PA_business['business_id'])].copy()
PA_reviews = dataprep.clean_review_data(filtered_reviews)

In [48]:
# Filtering for only restaurants and slicing out only the columns that we may need going forward
PA_business['is_restaurant'] = PA_business.apply(lambda row: row['category_split'].count('restaurants') > 0, 
                                                 axis=1)
PA_restaurant = PA_business[PA_business['is_restaurant']].reset_index(drop=True)[['business_id', 'name']]
PA_reviews = PA_reviews[['user_id', 'business_id', 'text', 'stars']]


**Note:** Another way to evaluate this is to look at the star ratings users gave to the various businesses they have been to. For example, hold out a set of 5-star reviews for users and see whether the recommender, given the other places they liked, would have recommended the held-out places. This could be a stretch goal, but it would be a very cool way to validate the model.
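
The hold-out idea in the note could be scored with something like a hit rate at k. The sketch below is one possible formulation, not a settled evaluation design: the function `hit_rate_at_k`, the 4x4 similarity matrix, and the two users are hypothetical, and restaurants are referred to by integer position for brevity.

```python
import numpy as np

def hit_rate_at_k(similarity, liked, held_out, k=3):
    """Fraction of users whose held-out restaurant is among the top-k
    restaurants most similar to any restaurant they already liked."""
    hits = 0
    for user, seen in liked.items():
        # Score each candidate by its best similarity to any liked restaurant.
        scores = similarity[seen].max(axis=0).astype(float)
        scores[seen] = -np.inf  # do not recommend places the user already reviewed
        top_k = np.argsort(-scores)[:k]
        hits += int(held_out[user] in top_k)
    return hits / len(liked)

# Hypothetical similarity matrix: restaurants 0/1 are alike, as are 2/3.
toy_similarity = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
liked = {"u1": [0], "u2": [2]}   # 5-star reviews the recommender may use
held_out = {"u1": 1, "u2": 0}    # 5-star reviews hidden from the recommender

print(hit_rate_at_k(toy_similarity, liked, held_out, k=1))  # 0.5
```

Here user `u1` is a hit (restaurant 1 is the nearest neighbour of restaurant 0) and `u2` is a miss at k=1.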

## Joining datasets

In [49]:
PA_data = PA_restaurant.merge(PA_reviews, how='inner', on='business_id', validate='one_to_many')

In [58]:
# Generating a DataFrame with one row per business with all reviews aggregated into one column
PA_combined = PA_data.groupby('business_id', as_index=False).agg({'text':[' '.join, 'count'],
                                                                  'name': pd.Series.mode})
PA_combined.columns = ['business_id', 'reviews', 'num_reviews', 'name']

In [66]:
# Filtering for businesses that have at least min_reviews reviews in order to avoid noisy data.
min_reviews = 5
PA_combined_filtered = PA_combined[PA_combined['num_reviews'] >= min_reviews].reset_index(drop=True)[['business_id', 'reviews', 'name']]
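
One way to inform the cutoff selection is to check how much of the dataset survives each candidate `min_reviews` value. The sketch below assumes a hypothetical Series of review counts and a hypothetical helper `retained_share`; in the notebook the counts would presumably come from `PA_combined['num_reviews']`.

```python
import pandas as pd

# Hypothetical review counts per business, for illustration only.
num_reviews = pd.Series([1, 2, 3, 5, 5, 8, 13, 40, 120, 300])

def retained_share(counts, cutoffs):
    """Share of businesses kept for each candidate min_reviews cutoff."""
    return pd.Series({c: float((counts >= c).mean()) for c in cutoffs})

shares = retained_share(num_reviews, cutoffs=[1, 5, 10, 50])
print(shares)
```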

In [67]:
PA_combined_filtered.head()

|   | business_id | reviews | name |
| - | ----------- | ------- | ---- |
| 0 | --ZVrH2X2QXBFdCilbirsw | this place is sadly perm closed i was hoping n... | chriss sandwich shop |
| 1 | --epgcb7xHGuJ-4PUeSLAw | love their asiago roll that and a cup of coffe... | manhattan bagel |
| 2 | -0FX23yAacC4bbLaGPvyxw | it was our first visit to the restaurant under... | the grey stone fine food and spirits |
| 3 | -0M0b-XhtFagyLmsBtOe8w | review of paris flea market accidentally poppe... | paris wine bar |
| 4 | -0PN_KFPtbnLQZEeb23XiA | while there didnt seem to be anything wrong wi... | mr wongs chinese restaurant |


## Getting embeddings from tf-idf for featurization

In [69]:
# Identifying stopwords from multiple sources
my_stopwords = ['review']
nltk_stop_words = list(nltk.corpus.stopwords.words('english'))
nltk_stop_words = [word.translate(str.maketrans('', '', string.punctuation)) for word in nltk_stop_words]
stopwords = list(set(list(ENGLISH_STOP_WORDS) + my_stopwords + nltk_stop_words))
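
The cleaned review text appears to have punctuation removed (e.g. "didnt" in the preview above), so contractions in the stop-word list only match if they are stripped the same way. A small illustration with a hand-picked sample of words (not the full NLTK list):

```python
import string

# Translation table that deletes every ASCII punctuation character.
strip_punct = str.maketrans('', '', string.punctuation)

# A few contractions of the kind found in stop-word lists (illustrative sample).
sample_stop_words = ["don't", "isn't", "you're", "the"]
normalized = [w.translate(strip_punct) for w in sample_stop_words]
print(normalized)  # ['dont', 'isnt', 'youre', 'the']
```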

In [70]:
# Lemmatizing words
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
        
tf = TfidfVectorizer(strip_accents='unicode',
                     tokenizer=LemmaTokenizer(),
                     stop_words=stopwords,
                     max_features=500) # Setting at 500 for POC. Could be tuned further

In [None]:
# Could use train/test split to calculate reconstruction errors using k-fold cross validation

In [None]:
# X_train, X_test , y_train, y_test = train_test_split(PA_data['text'].values,
#                                                      PA_data['is_restaurant'].values, 
#                                                      test_size=0.25, 
#                                                      random_state=43)

In [71]:
# Could use n-grams here by passing ngram_range=(1, 2) to TfidfVectorizer above (uni- and bi-grams)
tfidf = tf.fit_transform(PA_combined_filtered['reviews'].values)



In [74]:
tfidf.shape

(12641, 500)

In [73]:
words = tf.get_feature_names_out()

## Fitting NMF 

In [78]:
nmf = NMF(n_components=40, max_iter=600) # n_components is being set arbitrarily but could be tuned as time permits
nmf.fit(tfidf)

H = nmf.components_
W = nmf.transform(tfidf)

In [84]:
H.shape

(40, 500)

In [85]:
W.shape

(12641, 40)
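
Since `n_components` is set arbitrarily, one rough way to tune it is to compare the reconstruction error across a few values. The sketch below uses a synthetic non-negative matrix as a stand-in for the tf-idf matrix; the candidate values and `init="nndsvda"` are assumptions chosen for illustration. It reports training error only, which generally keeps falling as components are added, so it is a guide rather than a selection rule.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Synthetic non-negative matrix with 3 underlying factors (stand-in for tf-idf).
X_toy = rng.random((60, 3)) @ rng.random((3, 25))

errors = {}
for k in (1, 2, 3, 5):
    model = NMF(n_components=k, init="nndsvda", max_iter=1000, random_state=0)
    model.fit(X_toy)
    # Frobenius norm of X - WH on the training data.
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(k, round(float(err), 4))
```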

### Examining NMF latent features

In [142]:
# Examining the top words for each latent feature
top_words_index = np.argsort(-H)[:,0:10]
most_common_words_per_topic = np.array(words)[top_words_index]
for i, items in enumerate(most_common_words_per_topic):
    print(i, items)

0 ['wa' 'ordered' 'got' 'came' 'went' 'really' 'nice' 'like' 'wanted'
 'looked']
1 ['pizza' 'crust' 'slice' 'pie' 'cheese' 'good' 'sauce' 'topping' 'great'
 'best']
2 ['coffee' 'shop' 'good' 'drink' 'great' 'work' 'friendly' 'nice' 'staff'
 'chocolate']
3 ['chinese' 'food' 'rice' 'shrimp' 'egg' 'good' 'takeout' 'roll' 'fried'
 'place']
4 ['sushi' 'roll' 'tuna' 'salmon' 'fish' 'spicy' 'fresh' 'good' 'rice'
 'great']
5 ['bar' 'drink' 'bartender' 'night' 'music' 'great' 'game' 'friend' 'good'
 'cocktail']
6 ['taco' 'mexican' 'food' 'fish' 'order' 'pork' 'margarita' 'chip' 'good'
 'great']
7 ['location' 'order' 'time' 'employee' 'customer' 'service' 'line' 'like'
 'manager' 'minute']
8 ['sandwich' 'bread' 'roll' 'lunch' 'cheese' 'meat' 'great' 'good' 'beef'
 'pork']
9 ['breakfast' 'egg' 'diner' 'toast' 'pancake' 'bacon' 'french' 'brunch'
 'sausage' 'great']
10 ['burger' 'bun' 'onion' 'bacon' 'guy' 'good' 'great' 'cheese' 'topping'
 'order']
11 ['indian' 'curry' 'dish' 'lamb' 'restaurant' '

Looking through the features above, most map to a clear category (e.g. #1 is pizza, #22 is barbecue, #34 is ice cream). Most features capture cuisine rather than service quality or location, which makes sense as cuisine is the most significant differentiator between restaurants.

There is some overlap (e.g. #1 is pizza and #37 is Italian; #31 and #7 are both service-related), which suggests that 40 features may be too many.
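
Overlap between latent features could also be checked numerically rather than by eye, for example by computing the cosine similarity between rows of `H`. A minimal sketch with a hypothetical helper `most_overlapping_pairs` and a made-up 3x4 matrix:

```python
import numpy as np

def most_overlapping_pairs(H, top_n=3):
    """Return (i, j, cosine) for the top_n most similar pairs of rows in H."""
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    unit = H / np.where(norms == 0, 1.0, norms)       # guard against all-zero rows
    sims = unit @ unit.T
    i_idx, j_idx = np.triu_indices(H.shape[0], k=1)   # each unordered pair once
    order = np.argsort(-sims[i_idx, j_idx])[:top_n]
    return [(int(i_idx[o]), int(j_idx[o]), float(sims[i_idx[o], j_idx[o]])) for o in order]

# Hypothetical 3-feature, 4-term matrix: rows 0 and 2 load on the same terms.
H_toy = np.array([
    [0.9, 0.8, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.7],
    [0.8, 0.9, 0.1, 0.0],
])
pairs = most_overlapping_pairs(H_toy, top_n=1)
print(pairs)
```

Pairs with a cosine close to 1 are candidates for merging, which would support lowering `n_components`.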

In [133]:
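# Listing restaurants by their dominant latent feature.
# Note: np.sort on the mixed [value, name] array sorts the names as strings, so the slice
# below returns the last five names alphabetically rather than the five highest weights.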
rest_dict = defaultdict(list)
for index, restaurant in enumerate(W):
    key = np.argmax(restaurant)
    value = restaurant[key]
    name = PA_combined_filtered['name'][index]
    rest_dict[key].append([value, name])
top_restaurants = defaultdict(list)
for feature in rest_dict:
    top_restaurants[feature] = list(np.sort(np.array(rest_dict[feature]).T)[1,-5:])
for key, value in sorted(top_restaurants.items()):
    print(key, value)

0 ['without a cue productions', 'wyndham alumnae house', 'zagafen', 'zahav', 'àrdana food  drink']
1 ['zesto pizza  grill', 'zio pizza palace  grill', 'zios brick oven pizzeria', 'zoe', 'zuzus kitchen']
2 ['odyssey coffee shop', 'reanimator coffee', 'richboro coffee', 'vagrant coffee', 'valerio coffee roasters']
3 ['panda pavilion', 'tea garden chinese restaurant', 'temple garden chinese restaurant', 'wing wah kitchen', 'yummi yummi']
4 ['zama', 'zento contemporary japanese cuisine', 'zhi izakaya', 'zushi', 'zw sushi land']
5 ['writers block rehab', 'ye olde meetinghouse tavern', 'yeats pub', 'yellobar', 'zincbar']
6 ['union taco', 'union taco  flourtown', 'unity taqueria', 'vida byob', 'wahoos fish taco']
7 ['wawa', 'wawa', 'wawa', 'wawa', 'wendys']
8 ['wawa', 'wawa', 'wolfs superior sandwiches', 'wursthaus schmitz', 'yumtown']
9 ['wrightstown country store', 'yannis family restaurant', 'yummy 2', 'yummy diner', 'zakes cafe']
10 ['wahlburgers', 'wawa', 'wayback burgers', 'wayback burg
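
The names listed above come out in alphabetical order within each feature, since the weights and names are sorted together as strings. A sketch of ranking by weight instead, keeping the weights numeric; `top_by_weight` and the toy data are hypothetical:

```python
import numpy as np
from collections import defaultdict

def top_by_weight(W, names, top_n=5):
    """For each latent feature, list the restaurants whose dominant feature it is,
    ordered by their weight on that feature (highest first)."""
    grouped = defaultdict(list)
    for row, name in zip(W, names):
        key = int(np.argmax(row))
        grouped[key].append((float(row[key]), name))  # keep weights numeric
    return {
        key: [name for _, name in sorted(pairs, key=lambda p: p[0], reverse=True)[:top_n]]
        for key, pairs in sorted(grouped.items())
    }

# Hypothetical weights and names, only to show the ordering.
W_toy = np.array([[0.2, 0.0], [0.9, 0.1], [0.5, 0.3], [0.0, 0.7]])
toy_names = ["zebra cafe", "alpha pizza", "mid grill", "sushi spot"]
print(top_by_weight(W_toy, toy_names, top_n=2))
# {0: ['alpha pizza', 'mid grill'], 1: ['sushi spot']}
```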

## Adding the W matrix back to the original dataset

In [148]:
columns = ['feature{}'.format(n) for n in range(0,40)]
W_df = pd.DataFrame(W, columns=columns)
PA_business_features = pd.concat([PA_combined_filtered[['business_id', 'name']], W_df], axis=1)

In [150]:
PA_business_features.head(2)

Unnamed: 0,business_id,name,feature0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,feature11,feature12,feature13,feature14,feature15,feature16,feature17,feature18,feature19,feature20,feature21,feature22,feature23,feature24,feature25,feature26,feature27,feature28,feature29,feature30,feature31,feature32,feature33,feature34,feature35,feature36,feature37,feature38,feature39
0,--ZVrH2X2QXBFdCilbirsw,chriss sandwich shop,0.009784,0.015632,0.001069,0.0,0.0,0.0,0.0,0.006217,0.056636,0.0,0.0,0.0,0.0,0.0,0.0,0.00013,0.000572,0.0,0.0,0.0,0.001303,0.0,0.0,0.0,0.0,0.001649,0.001569,0.000574,0.0,0.001077,0.106964,0.0,3e-06,0.014601,0.002408,0.00069,0.0,0.122642,0.0,0.001208
1,--epgcb7xHGuJ-4PUeSLAw,manhattan bagel,0.015611,0.0,0.006201,0.0,0.002165,0.001644,0.0,0.011028,0.011928,0.0,0.0,0.0,0.0,0.179764,0.0,0.000212,0.0,0.0,0.000219,0.0,0.000471,8.8e-05,0.0,0.000269,0.0,0.00821,0.003591,0.001059,0.0,0.005203,0.017922,0.008239,0.000665,0.004371,0.004898,0.002216,0.0,0.0,0.0,0.005459


## Calculating cosine similarity between restaurants

In [295]:
similarity_array = cosine_similarity(PA_business_features.iloc[:,2:])
similarity_df = pd.DataFrame(similarity_array, columns=PA_combined_filtered['business_id'])
PA_business_similarity = pd.concat([PA_combined_filtered[['business_id', 'name']], similarity_df], axis=1)

In [296]:
PA_business_similarity.shape

(12641, 12643)

In [298]:
PA_business_similarity.iloc[:5,:6]

|   | business_id | name | --ZVrH2X2QXBFdCilbirsw | --epgcb7xHGuJ-4PUeSLAw | -0FX23yAacC4bbLaGPvyxw | -0M0b-XhtFagyLmsBtOe8w |
| - | ----------- | ---- | ---------------------- | ---------------------- | ---------------------- | ---------------------- |
| 0 | --ZVrH2X2QXBFdCilbirsw | chriss sandwich shop | 1.0 | 0.09205 | 0.18183 | 0.314369 |
| 1 | --epgcb7xHGuJ-4PUeSLAw | manhattan bagel | 0.09205 | 1.0 | 0.113264 | 0.083329 |
| 2 | -0FX23yAacC4bbLaGPvyxw | the grey stone fine food and spirits | 0.18183 | 0.113264 | 1.0 | 0.594128 |
| 3 | -0M0b-XhtFagyLmsBtOe8w | paris wine bar | 0.314369 | 0.083329 | 0.594128 | 1.0 |
| 4 | -0PN_KFPtbnLQZEeb23XiA | mr wongs chinese restaurant | 0.007585 | 0.02932 | 0.100554 | 0.060637 |
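
Cosine similarity compares only the direction of two feature vectors, not their magnitude, and because NMF weights are non-negative the scores fall between 0 and 1. The sketch below computes it by hand with NumPy on made-up vectors, as a check on what `cosine_similarity` returns; the function name `cosine_similarity_matrix` is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)  # all-zero rows stay zero
    return unit @ unit.T

# Rows 0 and 1 point the same way (row 1 is twice row 0); row 2 is orthogonal to both.
features_toy = np.array([
    [0.1, 0.0, 0.2],
    [0.2, 0.0, 0.4],
    [0.0, 0.3, 0.0],
])
sims = cosine_similarity_matrix(features_toy)
print(np.round(sims, 3))
```

Rows 0 and 1 score 1.0 despite differing in scale, which is why two restaurants with the same mix of features but different overall weights are treated as near-identical.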


### Testing on user reviews

In [377]:
five_star_reviews = PA_data[PA_data['stars'] == 5]
sample_reviews = five_star_reviews.sample(5)
sample_reviews

|   | business_id | name | user_id | text | stars |
| - | ----------- | ---- | ------- | ---- | ----- |
| 190650 | qcguEeAMP0XwFLYqhwX2hg | sweet freedom bakery | G1Hyv0xkY60pgmMCCwC-mw | holy moly jesus help me i think my taste buds ... | 5.0 |
| 767153 | TwnzM8mJn_nT2PJf1x-9kQ | cafe lift | z8XOkJ9UneWaP_KJ-3XWTg | this place is great on a friends wonderful rec... | 5.0 |
| 294276 | ajGUFDANNSnqUoLvZPCcPw | maces crossing | 9YkdQop_BBykoCWZoGVZOg | maces crossing is a philadelphia landmark weve... | 5.0 |
| 58599 | aw5GN4yk6r0r9e_5TdiLFQ | carmines parkside pizza | P3xTJNQXxEuqsqc5UIs4AQ | this is our family go to for delivery everythi... | 5.0 |
| 896956 | k2YJkdLg25xlYjshpeEtkQ | volo coffeehouse | nLN7FJtreKs5IgadTOcuBA | while lodged in a nearby airbnb awaiting a wed... | 5.0 |


In [364]:
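def top_recommendations(business_id, similarity_matrix, top_n=3, name_filter=True):
    """Return the top_n businesses most similar to business_id.

    similarity_matrix has 'business_id' and 'name' columns followed by one similarity
    column per business. With name_filter=True, businesses sharing the query's name are
    excluded (e.g. other locations of the same chain).
    """
    # Single row holding the query business's similarity to every other business
    df = similarity_matrix[similarity_matrix['business_id'] == business_id]
    name_mapping = similarity_matrix[['business_id', 'name']]
    if name_filter:
        business_ids_to_filter = name_mapping[name_mapping['name'] == df['name'].values[0]]['business_id'].values
    else:
        business_ids_to_filter = business_id
    df = df.drop(business_ids_to_filter, axis=1)
    # Transpose, skip the 'business_id' and 'name' rows, and rank the remaining similarities
    output = df.T.iloc[2:,:].sort_values(by=df.T.columns[0], ascending=False).iloc[:top_n,:].reset_index()
    output.columns = ['business_id', 'similarity']
    output = output.merge(name_mapping, how='left', on='business_id')
    return output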
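
The effect of the name filter is easiest to see on a tiny example. The sketch below is a simplified variant, not the function above: it assumes a square similarity DataFrame indexed and labelled by business_id plus a separate id-to-name Series, and the ids, names, scores, and the helper `recommend` are all hypothetical.

```python
import pandas as pd

def recommend(business_id, sim, names, top_n=3):
    """sim: square DataFrame indexed and labelled by business_id.
    names: Series mapping business_id -> name."""
    scores = sim.loc[business_id]
    # Every business sharing the query's name, including the query itself.
    same_name = names.index[names == names[business_id]]
    return scores.drop(same_name).nlargest(top_n)

ids = ["a", "b", "c", "d"]
toy_names = pd.Series(["starbucks", "starbucks", "indie coffee", "pizza hut"], index=ids)
toy_sim = pd.DataFrame(
    [[1.00, 0.99, 0.80, 0.10],
     [0.99, 1.00, 0.75, 0.12],
     [0.80, 0.75, 1.00, 0.20],
     [0.10, 0.12, 0.20, 1.00]],
    index=ids, columns=ids,
)

# Without the filter the other Starbucks ("b") would rank first; with it, "c" does.
print(recommend("a", toy_sim, toy_names, top_n=1))
```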

In [390]:
for i in range(sample_reviews.shape[0]):
    print(sample_reviews.iloc[i]['name'])
    print(tabulate(top_recommendations(sample_reviews.iloc[i]['business_id'], PA_business_similarity, top_n=2, name_filter=True),
                  showindex=False,
                  headers='keys',
                  tablefmt='psql'))
    print()

sweet freedom bakery
+------------------------+--------------+--------------------------------------+
| business_id            |   similarity | name                                 |
|------------------------+--------------+--------------------------------------|
| cAuOcHxf2nzuTFIUsDfmsQ |     0.968997 | virago baking company                |
| wPSQ2EGGlpTpjC4fsICQsg |     0.955121 | batter  crumbs vegan bakery and cafe |
+------------------------+--------------+--------------------------------------+

cafe lift
+------------------------+--------------+-------------------+
| business_id            |   similarity | name              |
|------------------------+--------------+-------------------|
| 7NwFNLC0SwX1SwQYlfF5yw |     0.963631 | luna café         |
| oXr3EhnQCqA8SNWIZ3H4Fg |     0.95965  | little spoon cafe |
+------------------------+--------------+-------------------+

maces crossing
+------------------------+--------------+------------------------+
| business_id            | 