# Final Project - Aspect Sentiment Analysis
### DATA 620 ~ David Moste ~ Euclid Zhang ~ Samuel Reeves

### 7/10/2021

Presentation Video Link: https://youtu.be/NoEhssHZXDY

In [1]:
import json
import requests
import pandas as pd
import re
from bs4 import BeautifulSoup
import nltk
import numpy as np
import matplotlib.pyplot as plt

#conda install -c anaconda click=7.1.2
#conda install -c anaconda spacy
#conda install -c conda-forge spacy-model-en_core_web_sm
import spacy
from textblob import TextBlob

### Data
Data source: https://www.kaggle.com/c/yelp-recsys-2013/data?select=yelp_training_set.zip

A set of Yelp reviews of businesses in Arizona. We select the subset covering restaurants for our analysis.

Raw data and finished dataframes can be downloaded here:
https://drive.google.com/file/d/1ZNuUWfZm1t5_8QSOL2hHIeEGnfg7Lras/view?usp=sharing

### Objective: 
To build a recommender system that selects restaurants a reviewer may like, based on the aspects of the restaurants.

The overall rating of a restaurant tells us whether it is generally good or not. However, it does not tell us how much a specific person will like the restaurant.

Some people care more about the taste of the food, some care more about the service, and others care more about the atmosphere. In this project, we evaluate the restaurants’ aspects using sentiment analysis. Given a new review posted on Yelp, we capture the aspects mentioned in the review and recommend restaurants with high average sentiment scores on those aspects.

### Data Loading

Load the data, which is stored in JSON format in two files.

The business file contains the name and categories of each business. We select the businesses with 'Restaurants' in their categories.

The reviews file contains the full set of reviews, including the review text and the corresponding business ID.

In [3]:
import os
os.chdir(r'E:\SPS\DATA 620\assignments\data\yelp-recsys-2013\yelp_training_set\yelp_training_set')

In [9]:
business_url = r'yelp_training_set_business.json'
reviews_url = r'yelp_training_set_review.json'

First, we load the business file.

In [10]:
business = []
#each line of the file is one JSON record
with open(business_url,'r') as f:
    for line in f:
        business.append(json.loads(line))

Store the business ID, business name and city (may not be used) of each restaurant in a dataframe.

In [11]:
restaurant_df = pd.DataFrame(columns = ['business_name','city_name'])
for bus in business:
    if 'Restaurants' in bus['categories']:
        restaurant_df.loc[bus['business_id']] = [bus['name'], bus['city']]
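
The cell above adds one row at a time with `.loc`. As a minimal sketch of the same selection step without pandas, the snippet below parses JSON lines and keeps only the businesses whose `categories` list contains 'Restaurants'. The helper names `parse_json_lines` and `select_restaurants` and the two sample records are made up for illustration; only the field names come from the business file.

```python
import json

def parse_json_lines(lines):
    """Parse an iterable of JSON-encoded lines, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

def select_restaurants(businesses):
    """Map business_id -> (name, city) for businesses tagged 'Restaurants'."""
    return {
        b['business_id']: (b['name'], b['city'])
        for b in businesses
        if 'Restaurants' in b['categories']
    }

# Two made-up records in the same shape as the business file
sample_lines = [
    '{"business_id": "id1", "name": "Cafe A", "city": "Phoenix", "categories": ["Restaurants", "Cafes"]}',
    '{"business_id": "id2", "name": "Shop B", "city": "Tempe", "categories": ["Shopping"]}',
]
sample_restaurants = select_restaurants(parse_json_lines(sample_lines))
print(sample_restaurants)
```

Building the whole mapping first and converting it to a dataframe once is usually faster than growing a dataframe row by row, although we did not time it on this data set.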

In [54]:
print('restaurant_df')
display(restaurant_df.head())
print('number of restaurant: ' + str(len(restaurant_df)))

restaurant_df


|  | business_name | city_name |
|---|---|---|
| PzOqRohWw7F7YEPBz6AubA | Hot Bagels & Deli | Glendale Az |
| qarobAbxGSHI7ygf1f7a_Q | Jersey Mike's Subs | Gilbert |
| gA5CuBxF-0CnOpGnryWJdQ | La Paloma Mexican Food | Phoenix |
| JxVGJ9Nly2FFIs_WpJvkug | Sauce | Scottsdale |
| Jj7bcQ6NDfKoz4TXwvYfMg | Fuddruckers | Phoenix |


number of restaurant: 4503


Load the reviews file.

In [13]:
reviews = []
#each line of the file is one JSON record
with open(reviews_url,'r') as f:
    for line in f:
        reviews.append(json.loads(line))

Each item in the reviews file looks like the following text. We will extract the user_id, stars (may not be used), and review_text. We will not use the review_id from the file. Instead, we will simply use the index as our new review_id.

In [14]:
reviews[0]

{'votes': {'funny': 0, 'useful': 5, 'cool': 2},
 'user_id': 'rLtl8ZkDX5vH5nAx9C3q5Q',
 'review_id': 'fWKvX83p0-ka4JS3dc6E5A',
 'stars': 5,
 'date': '2011-01-26',
 'text': 'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It wa

Extract the information and store it in a dataframe.

In [32]:
reviews_df = pd.DataFrame(columns = ['user_id', 'business_id','business_name','stars','review_text',])
index = 0
for review in reviews:
    if review['business_id'] in restaurant_df.index:
        reviews_df.loc[index] = [review['user_id'],review['business_id'],
                                        restaurant_df.loc[review['business_id'],'business_name'],
                                       review['stars'],review['text']]
        index = index + 1

In [55]:
#may load the finished dataframe instead of running the above code
#reviews_df = pd.read_csv(r"reviews_df.csv", header=0, index_col=False)

print('reviews_df')
display(reviews_df.head())
print('number of reviews: ' + str(len(reviews_df)))

reviews_df


|  | user_id | business_id | business_name | stars | review_text |
|---|---|---|---|---|---|
| 0 | rLtl8ZkDX5vH5nAx9C3q5Q | 9yKzy9PApeiPPOUJEtnvkg | Morning Glory Cafe | 5 | My wife took me here on my birthday for breakf... |
| 1 | 0a2KyEL0d3Yb1V6aivbIuQ | ZRJwVLyzEJq1VAihDhYiow | Spinato's Pizzeria | 5 | I have no idea why some people give bad review... |
| 2 | 0hT2KtfLiobPvh6cDC8JQg | 6oRAC4uyJCsJl1X0WZpVSA | Haji-Baba | 4 | love the gyro plate. Rice is so good and I als... |
| 3 | sqYN3lNgvPbPCTRsMFu27g | -yxfBYGB6SEqszmxJxd97A | Quiessence Restaurant | 4 | Quiessence is, simply put, beautiful. Full wi... |
| 4 | wFweIWhv2fREZV_dYkz_1g | zp713qNhx8d9KCJJnrw1xA | La Condesa Gourmet Taco Shop | 5 | Drop what you're doing and drive here. After I... |


number of reviews: 158430


### Preprocessing

We will use the spaCy module to assign part-of-speech (POS) tags to the words in our review texts.

In [19]:
nlp = spacy.load("en_core_web_sm")

Let's take the first sentence from the first review and see how it works...

In [24]:
temp = nlp('My wife took me here on my birthday for breakfast and it was excellent.')

In [26]:
for token in temp:
    print(token.text,token.pos_)

My PRON
wife NOUN
took VERB
me PRON
here ADV
on ADP
my PRON
birthday NOUN
for ADP
breakfast NOUN
and CCONJ
it PRON
was VERB
excellent ADJ
. PUNCT


In this analysis, we will focus on the NOUNs and ADJs.

First, we define a few functions to clean up the text and divide it into sentences.

In [13]:
def clean_text(text):
    #remove html markups before separating the text into sentences
    
    #convert to lower case
    text = str(text).lower()   
    
    #remove hyperlinks
    text = re.sub(r'[^\s]+\.com.[^\s]+','',text)
    text = re.sub(r'http[^\s]+','',text)
    
    #clean the html markups
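    #name the parser explicitly to avoid the "no parser was explicitly specified" warning
    text = BeautifulSoup(text, 'html.parser').get_text()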
    
    return text

In [14]:
def clean_sentence(text):
    #additional clean up on the individual sentences 
    
    #remove some meaningless words that are classified as NOUN by spaCy
    text = re.sub(r' sure',r'',text)
    text = re.sub(r' other',r'',text)
    
    #Convert terms such as I'll into I will so we can remove the ' character
    text = re.sub(r'\b(he|she|it|this|that)\'(s)\b', r'\1 is', text)
    text = re.sub(r'\b(they|we)\'(re)\b', r'\1 are', text)
    text = re.sub(r'\'ve\b', ' have', text)
    text = re.sub(r'\'ll\b', ' will ', text)
    #"won't" is irregular, so it must be handled before the generic n't rule below
    text = re.sub(r'\bwon\'t\b', 'will not', text)
    text = re.sub(r'n\'t\b', ' not', text)
    
    #remove non letter characters
    text = re.sub(r'[^A-Za-z\s]+', ' ', text) 
    #remove single characters except I, since some single letters are classified as NOUN by spaCy
    text = re.sub(r'\b[^i]\b',' ',text)
    #change multiple space characters into one
    text = re.sub(r'\s{2,}', ' ', text)
    
    return text
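
The contraction rules in `clean_sentence` depend on their order: irregular forms have to be rewritten before the generic `n't` rule, otherwise "can't" becomes "ca not" (as in the sample output further down). The snippet below is a small standalone sketch of that idea. The function name `expand_contractions` and the extra rules for "can't" and "you're" are our own additions for illustration and are not part of the pipeline above.

```python
import re

# Ordered (pattern, replacement) pairs; irregular forms come first.
CONTRACTION_RULES = [
    (r"\bwon't\b", "will not"),
    (r"\bcan't\b", "can not"),
    (r"\b(he|she|it|this|that)'s\b", r"\1 is"),
    (r"\b(they|we|you)'re\b", r"\1 are"),
    (r"'ve\b", " have"),
    (r"'ll\b", " will"),
    (r"n't\b", " not"),
]

def expand_contractions(text):
    """Expand common English contractions in lower-cased text."""
    for pattern, replacement in CONTRACTION_RULES:
        text = re.sub(pattern, replacement, text)
    return text

examples = ["i won't go", "i can't wait", "it's good", "they've left", "we didn't eat"]
expanded = [expand_contractions(s) for s in examples]
print(expanded)
```

The input is assumed to be lower case already, as it is after `clean_text`.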

In [15]:
def sentencizer(text):
    #Separating text into sentences. Sentences are separated by . ! ? or \n
    #raw string avoids the invalid escape sequence warning for \.
    sentences = re.split(r'[.!?\n]+',text)
    sentences = [sentence.strip() for sentence in sentences if len(sentence) > 0]
    return sentences
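
Note that `sentencizer` checks the length before stripping, so a fragment that contains only spaces survives as an empty string. A possible variant, sketched below under the hypothetical name `split_sentences`, strips first and then drops the empty fragments. It is not the function used in the rest of this notebook.

```python
import re

def split_sentences(text):
    """Split on . ! ? or newline, then drop fragments that are empty after stripping."""
    fragments = re.split(r'[.!?\n]+', text)
    stripped = [fragment.strip() for fragment in fragments]
    return [fragment for fragment in stripped if fragment]

example_sentences = split_sentences("I loved it! \nGreat food. Will return")
print(example_sentences)
```

Empty strings are harmless for the aspect extraction (they contain no tokens), so this only tidies the intermediate output.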

Define a function to extract the aspects and corresponding descriptives from a sentence.

In [16]:
# we will lemmatize the aspects, since 'breads' and 'bread' are the same aspect
wnl = nltk.WordNetLemmatizer()

In a sentence, find all the descriptives (terms with POS 'ADJ'). For each of the descriptives, find the nearest aspect (term with POS 'NOUN') in the sentence. If there is no aspect found for a descriptive, disregard the descriptive term.

In [17]:
def aspects_extract(sentences):
    aspects = []
    descriptives = []
    for sentence in sentences:
        temp = nlp(sentence)
        tags = [word.pos_ for word in temp]
        for i in range(len(tags)):
            if tags[i] == 'ADJ':
                last_noun_index = -1
                next_noun_index = -1
                j = i - 1
                while j >= 0:
                    if tags[j] == 'NOUN':
                        last_noun_index = j
                        break           
                    j = j - 1
                j = i + 1
                while j < len(tags):
                    if tags[j] == 'NOUN':
                        next_noun_index = j
                        break           
                    j = j + 1
                #use boolean operators; the original `x == -1 | (...)` parsed as `x == (-1 | (...))`
                if last_noun_index == -1 and next_noun_index == -1:
                    #no noun in the sentence, disregard the descriptive
                    continue
                elif last_noun_index == -1:
                    noun_index = next_noun_index
                elif next_noun_index == -1:
                    noun_index = last_noun_index
                elif (next_noun_index - i) < (i - last_noun_index):
                    noun_index = next_noun_index
                else:
                    #on a tie, the preceding noun is used
                    noun_index = last_noun_index
                aspects.append(wnl.lemmatize(temp[noun_index].text))
                descriptives.append(temp[i].text)
    return aspects, descriptives
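
To make the nearest-noun rule easier to follow without loading spaCy, here is a simplified sketch that works on a list of already tagged `(word, POS)` pairs. The function name `nearest_noun_pairs` is hypothetical, lemmatization is left out, and the tags are copied from the spaCy output shown earlier for the first sentence.

```python
def nearest_noun_pairs(tagged):
    """Pair every ADJ with the closest NOUN in the same sentence.

    `tagged` is a list of (word, pos) tuples. On a tie the preceding noun wins,
    because min() returns the first minimal element and positions are ascending.
    """
    noun_positions = [k for k, (_, pos) in enumerate(tagged) if pos == 'NOUN']
    pairs = []
    for i, (word, pos) in enumerate(tagged):
        if pos != 'ADJ' or not noun_positions:
            continue  # not a descriptive, or no aspect available in the sentence
        nearest = min(noun_positions, key=lambda k: abs(k - i))
        pairs.append((tagged[nearest][0], word))
    return pairs

tagged_sentence = [
    ('My', 'PRON'), ('wife', 'NOUN'), ('took', 'VERB'), ('me', 'PRON'),
    ('here', 'ADV'), ('on', 'ADP'), ('my', 'PRON'), ('birthday', 'NOUN'),
    ('for', 'ADP'), ('breakfast', 'NOUN'), ('and', 'CCONJ'), ('it', 'PRON'),
    ('was', 'VERB'), ('excellent', 'ADJ'), ('.', 'PUNCT'),
]
example_pairs = nearest_noun_pairs(tagged_sentence)
print(example_pairs)
```

For this sentence the only adjective is 'excellent' and the closest noun is 'breakfast'.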

Let's check the performance of our aspect extraction.

In [38]:
num_review = 1
for review in reviews_df['review_text'][:num_review]:
    sentences = sentencizer(clean_text(review))
    sentences = [clean_sentence(sentence) for sentence in sentences]
    print(sentences)
    print('----------------')

['my wife took me here on my birthday for breakfast and it was excellent', 'the weather was perfect which made sitting outside overlooking their grounds an absolute pleasure', 'our waitress was excellent and our food arrived quickly on the semi busy saturday morning', 'it looked like the place fills up pretty quickly so the earlier you get here the better', 'do yourself favor and get their bloody mary', 'it was phenomenal and simply the best i have ever had', 'i pretty they only use ingredients from their garden and blend them fresh when you order it', 'it was amazing', 'while everything on the menu looks excellent i had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious', 'it came with pieces of their griddled bread with was amazing and it absolutely made the meal complete', 'it was the best toast i have ever had', 'anyway i ca not wait to go back']
----------------


In [39]:
for i in range(num_review):
    review = reviews_df['review_text'][i]
    sentences = sentencizer(clean_text(review))
    sentences = [clean_sentence(sentence) for sentence in sentences]
    aspects, descriptives = aspects_extract(sentences)
    print(list(zip(aspects,descriptives)))
    print('----------------')

[('breakfast', 'excellent'), ('weather', 'perfect'), ('pleasure', 'absolute'), ('waitress', 'excellent'), ('semi', 'busy'), ('place', 'earlier'), ('place', 'better'), ('mary', 'bloody'), ('garden', 'fresh'), ('menu', 'excellent'), ('truffle', 'white'), ('skillet', 'tasty'), ('skillet', 'delicious'), ('bread', 'amazing'), ('meal', 'complete'), ('toast', 'best')]
----------------


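We can see that our method pairs most descriptives with a sensible aspect, for example ('waitress', 'excellent') and ('bread', 'amazing'). Some pairs are off, such as ('semi', 'busy') and ('mary', 'bloody'), because the nearest noun is not always the word that the adjective describes.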

We will apply the method to all of our reviews.

This time, we will use the TextBlob module to calculate the sentiment score (1 for positive, 0 for neutral and -1 for negative) for the descriptives.

We will store the average sentiment score for each aspect for each review ID.

In [None]:
sentiment_df = pd.DataFrame(columns = ['review_id', 'aspect','sentiment',])
index = 0

for i in range(len(reviews_df['review_text'])):
    review = reviews_df['review_text'][i]
    sentences = sentencizer(clean_text(review))
    sentences = [clean_sentence(sentence) for sentence in sentences]
    aspects, descriptives = aspects_extract(sentences)
    temp = pd.DataFrame(columns = ['aspect','sentiment'],dtype=int)
    for j in range(len(aspects)):
        sentiment = TextBlob(descriptives[j]).sentiment.polarity
        if sentiment > 0:
            sentiment = 1
        elif sentiment < 0:
            sentiment = -1
        else:
            sentiment = 0      
        temp.loc[index] = [aspects[j], sentiment]
        index = index + 1
    temp = temp.groupby('aspect',as_index = False).mean()
    temp = pd.concat([pd.DataFrame(i, index=np.arange(len(temp)), columns = ['review_id']), temp], axis=1, ignore_index=True)
    temp.columns = ['review_id', 'aspect','sentiment']
    #pd.concat works in all pandas versions; DataFrame.append was removed in pandas 2.0
    sentiment_df = pd.concat([sentiment_df, temp], ignore_index = True)
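
The scoring step for a single review can be summarized in plain Python as follows. This is a sketch only: the helper names are hypothetical and the polarity values are made up for illustration rather than taken from TextBlob.

```python
from collections import defaultdict

def polarity_sign(polarity):
    """Collapse a polarity in [-1, 1] to 1 (positive), 0 (neutral) or -1 (negative)."""
    if polarity > 0:
        return 1
    if polarity < 0:
        return -1
    return 0

def mean_sign_by_aspect(pairs):
    """Average the polarity signs of all descriptives attached to each aspect."""
    signs = defaultdict(list)
    for aspect, polarity in pairs:
        signs[aspect].append(polarity_sign(polarity))
    return {aspect: sum(values) / len(values) for aspect, values in signs.items()}

# (aspect, polarity) pairs for one review; the polarity values are invented
example_pairs = [('place', 0.0), ('place', 0.5), ('toast', 1.0), ('drink', -0.8)]
example_scores = mean_sign_by_aspect(example_pairs)
print(example_scores)
```

An aspect with one neutral and one positive descriptive ends up at 0.5, which is how fractional scores arise in the table.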

In [9]:
#may load the finished dataframe instead of running the above code
#sentiment_df = pd.read_csv(r"sentiment_df.csv", header=0, index_col=False)

print('sentiment_df')
display(sentiment_df.head(20))

sentiment_df


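|  | review_id | aspect | sentiment |
|---|---|---|---|
| 0 | 0 | bread | 1.0 |
| 1 | 0 | breakfast | 1.0 |
| 2 | 0 | garden | 1.0 |
| 3 | 0 | mary | -1.0 |
| 4 | 0 | meal | 1.0 |
| 5 | 0 | menu | 1.0 |
| 6 | 0 | place | 0.5 |
| 7 | 0 | pleasure | 1.0 |
| 8 | 0 | semi | 1.0 |
| 9 | 0 | skillet | 0.5 |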


We will filter out the infrequent terms with a threshold. Aspects that are mentioned in only a small number of reviews are considered not important.

In [6]:
threshold = 100
temp = sentiment_df['aspect'].value_counts()
selected_aspects = temp.loc[temp > threshold].index.tolist()
sentiment_df = sentiment_df.loc[sentiment_df['aspect'].isin(selected_aspects)]
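
Because `sentiment_df` holds one row per review and aspect, counting rows per aspect is the same as counting the reviews that mention it. The sketch below shows the same filter on a toy list of `(review_id, aspect)` rows. The function name `frequent_aspects` and the sample rows are made up, and the threshold is lowered to 1 so that the toy data is enough to show the effect.

```python
from collections import Counter

def frequent_aspects(rows, threshold):
    """Return the aspects that occur in more than `threshold` (review_id, aspect) rows."""
    counts = Counter(aspect for _, aspect in rows)
    return {aspect for aspect, n in counts.items() if n > threshold}

example_rows = [
    (0, 'food'), (1, 'food'), (2, 'food'),
    (0, 'semi'),
    (1, 'staff'), (2, 'staff'),
]
kept = frequent_aspects(example_rows, threshold=1)
filtered_rows = [row for row in example_rows if row[1] in kept]
print(sorted(kept), len(filtered_rows))
```

Here 'semi' appears in only one review and is dropped, while 'food' and 'staff' are kept.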

We will then convert the dataframe from long to wide format for distance calculation.

In [None]:
sentiment_df2 = sentiment_df.pivot_table('sentiment', ['review_id'], 'aspect',fill_value = 0)
sentiment_df2.reset_index(level=0, inplace=True)
sentiment_df2.columns.name = None

In [52]:
#may load the finished dataframe instead of running the above code
#sentiment_df2 = pd.read_csv(r"sentiment_df2.csv", header=0, index_col=False)

print('sentiment_df2')
display(sentiment_df2.loc[sentiment_df2['accompaniment'] != 0])

sentiment_df2


Unnamed: 0,review_id,accent,accompaniment,addition,adovada,adult,advantage,adventure,advice,afternoon,...,yelp,yelper,yelpers,yesterday,yogurt,york,yorker,yr,yum,zucchini
873,889,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1640,1670,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2734,2783,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3203,3258,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6163,6267,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145525,148225,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
146913,149634,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
147018,149740,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
151429,154231,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We then merge the sentiment scores data frame with the reviews dataframe.

In [9]:
reviews_df2 = reviews_df.reset_index(level=0)
reviews_df2.rename(columns = {'index':'review_id'}, inplace = True)
reviews_df2 = reviews_df2.merge(sentiment_df2, on='review_id', how='left').fillna(0)

In [5]:
#may load the finished dataframe instead of running the above code
#reviews_df2 = pd.read_csv(r"reviews_df2.csv", header=0, index_col=False)

print('reviews_df2')
display(reviews_df2.head())

reviews_df2


Unnamed: 0,review_id,user_id,business_id,business_name,stars,review_text,accent,accompaniment,addition,adovada,...,yelp,yelper,yelpers,yesterday,yogurt,york,yorker,yr,yum,zucchini
0,0,rLtl8ZkDX5vH5nAx9C3q5Q,9yKzy9PApeiPPOUJEtnvkg,Morning Glory Cafe,5,My wife took me here on my birthday for breakf...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0a2KyEL0d3Yb1V6aivbIuQ,ZRJwVLyzEJq1VAihDhYiow,Spinato's Pizzeria,5,I have no idea why some people give bad review...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,0hT2KtfLiobPvh6cDC8JQg,6oRAC4uyJCsJl1X0WZpVSA,Haji-Baba,4,love the gyro plate. Rice is so good and I als...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,sqYN3lNgvPbPCTRsMFu27g,-yxfBYGB6SEqszmxJxd97A,Quiessence Restaurant,4,"Quiessence is, simply put, beautiful. Full wi...",0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,wFweIWhv2fREZV_dYkz_1g,zp713qNhx8d9KCJJnrw1xA,La Condesa Gourmet Taco Shop,5,Drop what you're doing and drive here. After I...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We calculate the average sentiment scores of all aspects for each restaurant.

In [4]:
restaurants = reviews_df2.drop(columns=['review_id', 'user_id','business_name','review_text'])
restaurants = restaurants.groupby('business_id').mean()

In [6]:
#may load the finished dataframe instead of running the above code
#restaurants = pd.read_csv(r"restaurants.csv", header=0, index_col=0)

print('restaurants')
display(restaurants.head())

restaurants


business_id,stars,accent,accompaniment,addition,adovada,adult,advantage,adventure,advice,afternoon,...,yelp,yelper,yelpers,yesterday,yogurt,york,yorker,yr,yum,zucchini
--5jkZ3-nUPZxUvtcbr8Uw,4.545455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--BlvDO_RG2yElKu9XA1_g,4.162162,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-0QBrNvhrPQCaeo7mTo0zQ,4.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-0bUDim5OGuv8R0Qqq6J4A,2.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-1bOb2izeJBZjHC7NWxiPA,3.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Recommender System

List of aspects that are included in our sentiment score table:

In [7]:
selected_aspects = list(reviews_df2.columns[6:])

We also keep the 100 aspects that are mentioned most often in the reviews.

Given a new review, if none of its aspects are included in the sentiment score table, we use these popular aspects to find recommendations for the reviewer.

In [10]:
most_popular_aspects = sentiment_df['aspect'].value_counts()[:100]

Here's a review I picked from Yelp that is not included in our data set:

In [11]:
review = '''
I loved this place! 
Was looking for a hookah bar to go to and stumbled across this place on Instagram and based off of the pictures, 
I decided to go. Without reservations, we were seated with no issues. 
The staff were great, attentive and friendly. 
The ambience of the place was very nice and the music was great! 
Blue Hawaiian, curly fries, fried calamari, and chicken and waffles were all good. 
The hookah was on point as well. I will definitely return to this establishment again! 
Loved that it was low key with a nice vibe!
'''

Let's see what aspects are captured by our application.

In [20]:
sentences = sentencizer(clean_text(review))
sentences = [clean_sentence(sentence) for sentence in sentences]
print(sentences)
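aspects, descriptives = aspects_extract(sentences)
#keep only the aspects that are in the sentiment score table
aspects = set([aspect for aspect in aspects if aspect in selected_aspects])
if len(aspects) == 0:
    #fall back to the names of the popular aspects when none of the captured aspects are in the table
    aspects = set(most_popular_aspects.index)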
    
print("")
print("Captured aspects:")
print(aspects)

['i loved this place', '', 'was looking for hookah bar to go to and stumbled across this place on instagram and based off of the pictures ', 'i decided to go', 'without reservations we were seated with no issues', '', 'the staff were great attentive and friendly', '', 'the ambience of the place was very nice and the music was great', '', 'blue hawaiian curly fries fried calamari and chicken and waffles were all good', '', 'the hookah was on point as well', 'i will definitely return to this establishment again', '', 'loved that it was low key with nice vibe']

Captured aspects:
{'waffle', 'staff', 'key', 'fry', 'place', 'vibe', 'music', 'bar'}


For each restaurant, we select the sentiment scores of the captured aspects and calculate the Euclidean distance to the perfect sentiment scores ([1,1,1,….,1,1]).

We then rank the restaurants by the distance. The smaller the distance, the better the restaurant!

In [21]:
num_aspects = len(aspects)
perfect_scores = pd.Series([1]*num_aspects)

#index with a list, since recent pandas versions reject a set as a column indexer
restaurant_scores = [np.linalg.norm(pd.Series(row).reset_index(drop = True) - perfect_scores) for index, row in restaurants[list(aspects)].iterrows()]
ranked_restaurants = pd.DataFrame(restaurant_scores, index =restaurants.index, columns =['Distance']).sort_values(by = 'Distance')
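
The ranking idea can also be written without pandas. The sketch below uses `math.dist` (Python 3.8 or later) to measure how far each restaurant is from the all-ones vector on the chosen aspects and sorts by that distance. The function name `rank_by_distance`, the restaurant names and the scores are invented for illustration; a missing aspect is treated as 0, which mirrors the `fillna(0)` used when the tables were merged.

```python
import math

def rank_by_distance(scores, aspects):
    """Rank restaurants by Euclidean distance to a perfect score of 1 on every aspect.

    `scores` maps restaurant name -> {aspect: average sentiment}.
    Returns (name, distance) tuples, closest first.
    """
    ideal = [1.0] * len(aspects)
    distances = {
        name: math.dist([row.get(aspect, 0.0) for aspect in aspects], ideal)
        for name, row in scores.items()
    }
    return sorted(distances.items(), key=lambda item: item[1])

example_scores = {
    'Restaurant A': {'staff': 1.0, 'music': 1.0},
    'Restaurant B': {'staff': 0.0, 'music': 1.0},
    'Restaurant C': {},  # no sentiment recorded for either aspect
}
ranking = rank_by_distance(example_scores, ['staff', 'music'])
print(ranking)
```

A restaurant with no recorded sentiment on any of the k chosen aspects sits at distance sqrt(k) from the ideal, so it ranks below restaurants with positive scores and above those with mostly negative ones.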

In [67]:
pd.merge(ranked_restaurants, restaurant_df, left_index=True, right_index=True).drop(columns = ['city_name'])

|  | Distance | business_name |
|---|---|---|
| Qq90BiOx_FWRUDq6afvghw | 2.409472 | Grotto Cafe |
| xU10GeaLiXcozY8PDMSlKA | 2.439112 | Last Exit Bar & Grill |
| ytKoF3d0XQGt5Va8ru0GMQ | 2.449490 | Cactus Moon |
| OQshbYutpgM8LcBK4mbFdw | 2.449490 | Indigo Joe's |
| z0hUNe3YiSknxs1HJZZDtg | 2.462214 | Mary Elaine's |
| ... | ... | ... |
| Hp8RwtOHmBjUpPWsDfPnKw | 3.020761 | TGI Friday's |
| SPqPt-K6qzTutj_vWOgKGA | 3.021589 | Marcia's Long Wongs |
| 4TEiMrW7OmpUgBXONg1tQw | 3.069162 | Hometown Buffet |
| shUiQtVZpk_YCUmMYJZf-g | 3.074593 | Union Wine Bar & Grill |


We will show the top 2 restaurants and the top 3 reviews for each restaurant.

The aspects that are captured in the new review and shown in the recommended reviews are marked in upper case and placed in [ ].

In [28]:
num_recommendation = 2
num_reviews = 3
for i in range(num_recommendation):
    reviews = reviews_df2.loc[reviews_df2['business_id'] == ranked_restaurants.index[i]]
    print(reviews.iloc[0]['business_name'])
    print('')
    
    sentiment_scores = restaurants.loc[restaurants.index == ranked_restaurants.index[i]].drop(columns=['stars'])
    print('pros')
    pros = sentiment_scores.loc[:,sentiment_scores.iloc[0] > 0].reset_index(drop=True)
    print(pros.columns.tolist())
    print('cons')
    cons = sentiment_scores.loc[:,sentiment_scores.iloc[0] < 0].reset_index(drop=True)
    print(cons.columns.tolist())
    print('')
    
    #index with a list, since recent pandas versions reject a set as a column indexer
    review_scores = [np.linalg.norm(pd.Series(row).reset_index(drop = True) - perfect_scores) for index, row in reviews[list(aspects)].iterrows()]
    ranked_reviews = pd.DataFrame(review_scores, index = reviews.index, columns =['Distance']).sort_values(by = 'Distance')
    
    #use k for the inner loop so it does not shadow the outer loop variable i
    for k in range(min(num_reviews,len(ranked_reviews))):
        print('----------------- review ' + str(k+1) + ' -----------------')
        print('')
        review = reviews.loc[reviews['review_id'] == ranked_reviews.index[k]].iloc[0]['review_text']
        review = review.lower()
        for aspect in aspects:
            review = re.sub(aspect, '\033[1m[' + aspect.upper() + ']\033[0m', review)
        print(review)    
        print('')
        
    print('')
    print('----------------------------------')
    print('')
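
The plain `re.sub(aspect, ...)` call above also matches inside longer words, which is why the output contains strings such as "[MUSIC]ian". If whole-word highlighting is preferred, one option is to add word boundaries, as in this sketch. The function name `highlight_aspects` is hypothetical, and the terminal bold codes are left out to keep the example short.

```python
import re

def highlight_aspects(text, aspects):
    """Wrap whole-word occurrences of each aspect in [ ] and upper-case them."""
    for aspect in aspects:
        pattern = r'\b' + re.escape(aspect) + r'\b'
        text = re.sub(pattern, '[' + aspect.upper() + ']', text)
    return text

example_text = "they have live music and the house musician is great"
highlighted = highlight_aspects(example_text, ['music'])
print(highlighted)
```

The trade-off is that plural forms such as "places" are no longer highlighted, since the aspects are stored in their lemmatized singular form.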

Grotto Cafe

pros
['bruschetta', 'cafe', 'dining', 'food', 'friend', 'furniture', 'music', 'opportunity', 'patio', 'place', 'price', 'sandwich', 'service', 'staff', 'venue', 'way', 'work']
cons
[]

----------------- review 1 -----------------

one of my new favorite [PLACE]s! i love coming here for lunch and sitting outside in the patio area. the sandwiches are delicious, i especially recommend the grilled cheese with added bacon! the grotto cafe has a great atmosphere, friendly [STAFF] and great food with very reasonable prices. i always recommend friends to come here for lunch or breakfast :) they have live [MUSIC] now too on weekends and i am very looking forward to it, i'm sure i won't be disappointed!

----------------- review 2 -----------------

i had such a fabulous time with my friends last night that i just have to tell you about it. my very good friend david sheehy got a new gig. he is now the house [MUSIC]ian at the grotto cafe in cave creek.

### Conclusion

We can see that the recommended restaurants have good sentiment on the aspects 'VIBE', 'BAR', 'PLACE', 'MUSIC' and 'STAFF', so most of the aspects captured in the new review are covered by the recommendations. For this example, our recommender system behaves as intended.