# Creating a Sentiment Analyzer using Yelp Business Reviews

We're going to walk through how to get data off the web using BeautifulSoup and Python's requests module. After that, we'll analyze and clean the data.

## Scraping the Data from the URL

In [167]:
import requests
from bs4 import BeautifulSoup

In [168]:
# creating a request 
r = requests.get("https://www.yelp.com/biz/angelina-paris")

In [169]:
r.status_code

200

In [170]:
# display the html result using .text attribute
r.text

'<!DOCTYPE html><html lang="en-US" prefix="og: http://ogp.me/ns#" style="margin: 0;padding: 0; border: 0; font-size: 100%; font: inherit; vertical-align: baseline;"><head><script>document.documentElement.className=document.documentElement.className.replace(/\x08no-js\x08/,"js");</script><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta http-equiv="Content-Language" content="en-US" /><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><link rel="mask-icon" sizes="any" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b2bb2fb0ec9c/assets/img/logos/yelp_burst.svg" content="#FF1A1A"><link rel="shortcut icon" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/dcfe403147fc/assets/img/logos/favicon.ico"><script> window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;window.ygaPageStartTime=new Date().getTime();</script><script>\n            window.yelp = window.yelp || {};\

The raw HTML is not readable, so we will use BeautifulSoup to parse it.

In [54]:
soup = BeautifulSoup(r.text, 'html.parser')

In [171]:
# find_all collects every review paragraph: <p class="comment__09f24__D0cxf css-qgunke"></p>
# (find_all is the current name; findAll is the older alias for the same method)
paragraphs = soup.find_all(class_="comment__09f24__D0cxf css-qgunke")

In [172]:
type(paragraphs)

bs4.element.ResultSet
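
Class names such as `comment__09f24__D0cxf` look auto-generated, so they may change when the site is updated. Below is a minimal sketch of a more tolerant selector, assuming (this is not confirmed anywhere in the page itself) that the `comment__` prefix stays stable. The HTML snippet is made up for illustration and is not real Yelp markup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page (not real Yelp markup).
sample_html = """
<div>
  <p class="comment__09f24__D0cxf css-qgunke"><span lang="en">Great hot chocolate.</span></p>
  <p class="comment__abc12__XyZ css-other"><span lang="en">Too sweet for me.</span></p>
  <p class="footer-text"><span>Not a review.</span></p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Match any <p> that has a class starting with "comment__" (assumed stable prefix).
review_paragraphs = sample_soup.find_all("p", class_=re.compile(r"^comment__"))

# Skip paragraphs without a <span> so a layout change does not raise AttributeError.
sample_reviews = [p.find("span").get_text(strip=True)
                  for p in review_paragraphs if p.find("span") is not None]

print(sample_reviews)
```

If the prefix ever changes too, the list simply comes back empty, which is easier to detect than a crash halfway through the loop.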

In [173]:
# get each review from the <span> tag inside the <p> tag
reviews = []

for p in paragraphs:
    reviews.append(p.find('span').text) # saving it in a list

In [68]:
reviews

['What amazing experiences I did in Paris The food and drink are amazing service %100 perfect All employees have infos about menus \xa0and food ingredient The best drink, ever the hot chocolate you should try it for one \xa0time, because that flavour tastes unforgettable',
 "Chestnut cake -no good at all. The toppings tasted just like frosting. Way too sweet. I don't understand why is their signature cake. Way Over priced . € 10 a piece",
 "We waited in line for less than an hour to get seated. The hostess keep the line moving and tables are rotated fairly quickly. Hot chocolate is worth the hype. It's thick and not extremely sweet, which gives you incentive to add more whipped cream to your cup. We also ordered the mount blanc, which was a bit chewy on the outside and generally lacked flavor. I wouldn't recommend it and try something else. Glad we tried the hot chocolate, but with so many other cafes in Paris, not sure if it's worth going back.",
 "TLDR; angelina offers delicious hot 

In [70]:
reviews[0]

'What amazing experiences I did in Paris The food and drink are amazing service %100 perfect All employees have infos about menus \xa0and food ingredient The best drink, ever the hot chocolate you should try it for one \xa0time, because that flavour tastes unforgettable'

## Analyzing The Data

In [72]:
import pandas as pd
import numpy as np

In [174]:
# converting the list of reviews into a dataframe (a dict of lists works directly)
df = pd.DataFrame({'review': reviews})

In [76]:
df.head()

Unnamed: 0,review
0,What amazing experiences I did in Paris The fo...
1,Chestnut cake -no good at all. The toppings ta...
2,We waited in line for less than an hour to get...
3,TLDR; angelina offers delicious hot chocolate ...
4,You can expect to wait in line to get in to An...


In [77]:
len(df['review'])
# df.shape

10

We have 10 reviews.

## Word count - length of each review

In [81]:
df['word_count'] = df['review'].apply(lambda x: len(x.split()))

In [82]:
df.head()

Unnamed: 0,review,word_count
0,What amazing experiences I did in Paris The fo...,44
1,Chestnut cake -no good at all. The toppings ta...,31
2,We waited in line for less than an hour to get...,95
3,TLDR; angelina offers delicious hot chocolate ...,288
4,You can expect to wait in line to get in to An...,114
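
`str.split()` with no arguments splits on any run of whitespace, including the non-breaking spaces (`\xa0`) visible in the raw reviews, and it leaves punctuation attached to the neighbouring word. A small sketch on a made-up sentence (not taken from the scraped data) shows what `word_count` is actually counting:

```python
# Made-up sample text; "\xa0" is a non-breaking space like the ones in the scraped reviews.
sample = "Hot chocolate, worth \xa0the hype!"

tokens = sample.split()   # splits on any whitespace run, including "\xa0"
print(tokens)             # punctuation stays attached to 'chocolate,' and 'hype!'
print(len(tokens))        # the value word_count would hold for this text
```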


## Character count - number of characters in each review

In [84]:
df['character_count'] = df['review'].apply(lambda x: len(x))

In [85]:
df.head()

Unnamed: 0,review,word_count,character_count
0,What amazing experiences I did in Paris The fo...,44,263
1,Chestnut cake -no good at all. The toppings ta...,31,164
2,We waited in line for less than an hour to get...,95,514
3,TLDR; angelina offers delicious hot chocolate ...,288,1584
4,You can expect to wait in line to get in to An...,114,594


## Average word length

In [87]:
def average_words(x):
    # mean number of characters per whitespace-separated word
    words = x.split()
    return sum(len(word) for word in words)/len(words)

In [89]:
df['avg_word_length'] = df['review'].apply(average_words)

In [90]:
df.head()

Unnamed: 0,review,word_count,character_count,avg_word_length
0,What amazing experiences I did in Paris The fo...,44,263,4.954545
1,Chestnut cake -no good at all. The toppings ta...,31,164,4.322581
2,We waited in line for less than an hour to get...,95,514,4.421053
3,TLDR; angelina offers delicious hot chocolate ...,288,1584,4.503472
4,You can expect to wait in line to get in to An...,114,594,4.219298
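
`average_words` divides by the number of words, so an empty or whitespace-only review would raise `ZeroDivisionError`. None of the 10 reviews here is empty, but a guarded variant is sketched below. The name `safe_average_word_length` and the choice to return 0.0 for empty text are assumptions made for illustration.

```python
def safe_average_word_length(text):
    """Mean characters per whitespace-separated word; 0.0 when there are no words."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

# 26 letters spread over 6 words
print(safe_average_word_length("Hot chocolate is worth the hype"))
# whitespace only, so the guard kicks in
print(safe_average_word_length("   "))
```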


## Stopwords in each review

In [93]:
# importing stopwords list from nltk
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pnwaekwu\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [94]:
# creating the stopwords list
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [95]:
len(stop_words)

179

In [96]:
df['stopwords_count'] = df['review'].apply(lambda x: len([word for word in x.split() if word.lower() in stop_words]))

In [98]:
df.head(3)

Unnamed: 0,review,word_count,character_count,avg_word_length,stopwords_count
0,What amazing experiences I did in Paris The fo...,44,263,4.954545,19
1,Chestnut cake -no good at all. The toppings ta...,31,164,4.322581,11
2,We waited in line for less than an hour to get...,95,514,4.421053,43


## Stopwords rate - proportion of the words in each review that are stopwords

In [99]:
df['stopwords_rate'] = df['stopwords_count']/df['word_count']

In [101]:
df.head(2)

Unnamed: 0,review,word_count,character_count,avg_word_length,stopwords_count,stopwords_rate
0,What amazing experiences I did in Paris The fo...,44,263,4.954545,19,0.431818
1,Chestnut cake -no good at all. The toppings ta...,31,164,4.322581,11,0.354839
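
The count and the rate can be traced by hand on a single string. This sketch uses a tiny hand-picked stopword set rather than the 179-word NLTK list, so the numbers are illustrative only. Storing the stopwords in a `set` also makes each membership test constant-time, which matters more as the number of reviews grows.

```python
# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list).
sample_stop_words = {"the", "is", "and", "to", "a", "it"}

text = "The hot chocolate is thick and worth the wait"
words = text.split()

# Lower-case each word before the lookup, as the notebook does.
stopword_count = sum(1 for word in words if word.lower() in sample_stop_words)
stopword_rate = stopword_count / len(words)

print(stopword_count, len(words), round(stopword_rate, 3))
```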


In [102]:
df.sort_values(by = 'stopwords_rate')

Unnamed: 0,review,word_count,character_count,avg_word_length,stopwords_count,stopwords_rate
1,Chestnut cake -no good at all. The toppings ta...,31,164,4.322581,11,0.354839
9,Worth the wait! We of course had the hot choco...,32,173,4.4375,12,0.375
3,TLDR; angelina offers delicious hot chocolate ...,288,1584,4.503472,119,0.413194
0,What amazing experiences I did in Paris The fo...,44,263,4.954545,19,0.431818
2,We waited in line for less than an hour to get...,95,514,4.421053,43,0.452632
6,Expect to wait in line for at least 45 minutes...,183,995,4.442623,83,0.453552
4,You can expect to wait in line to get in to An...,114,594,4.219298,52,0.45614
5,It's the best hot chocolate I've had and the b...,87,439,4.057471,40,0.45977
7,We went on a Sunday afternoon and were lucky t...,140,774,4.535714,66,0.471429
8,"Angelina is famous for hot chocolate, so make ...",161,845,4.254658,76,0.47205


In [104]:
df.describe()

Unnamed: 0,word_count,character_count,avg_word_length,stopwords_count,stopwords_rate
count,10.0,10.0,10.0,10.0,10.0
mean,117.5,634.5,4.414892,52.1,0.434042
std,79.862729,437.420024,0.239256,34.668109,0.040712
min,31.0,164.0,4.057471,11.0,0.354839
25%,54.75,307.0,4.271639,24.25,0.41785
50%,104.5,554.0,4.429276,47.5,0.453092
75%,155.75,827.25,4.48826,73.5,0.458863
max,288.0,1584.0,4.954545,119.0,0.47205


## Data Cleaning and Preprocessing Stage

Separating each word and transforming them to lowercase

In [107]:
df['review_lowercase'] = df['review'].apply(lambda x: " ".join(word.lower() for word in x.split()))

In [108]:
df.head()

Unnamed: 0,review,word_count,character_count,avg_word_length,stopwords_count,stopwords_rate,review_lowercase
0,What amazing experiences I did in Paris The fo...,44,263,4.954545,19,0.431818,what amazing experiences i did in paris the fo...
1,Chestnut cake -no good at all. The toppings ta...,31,164,4.322581,11,0.354839,chestnut cake -no good at all. the toppings ta...
2,We waited in line for less than an hour to get...,95,514,4.421053,43,0.452632,we waited in line for less than an hour to get...
3,TLDR; angelina offers delicious hot chocolate ...,288,1584,4.503472,119,0.413194,tldr; angelina offers delicious hot chocolate ...
4,You can expect to wait in line to get in to An...,114,594,4.219298,52,0.45614,you can expect to wait in line to get in to an...


## Removing Punctuation

In [115]:
import string

def remove_punctuation(text):
    # Create a translation table to remove punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    # Remove punctuation marks using the translation table
    text = text.translate(translator)
    return text

In [116]:
df['punctuations'] = df['review_lowercase'].apply(remove_punctuation)

In [117]:
df['punctuations'].head(10)

0    what amazing experiences i did in paris the fo...
1    chestnut cake no good at all the toppings tast...
2    we waited in line for less than an hour to get...
3    tldr angelina offers delicious hot chocolate i...
4    you can expect to wait in line to get in to an...
5    its the best hot chocolate ive had and the bes...
6    expect to wait in line for at least 45 minutes...
7    we went on a sunday afternoon and were lucky t...
8    angelina is famous for hot chocolate so make s...
9    worth the wait we of course had the hot chocol...
Name: punctuations, dtype: object
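
`string.punctuation` contains only ASCII punctuation, so symbols such as the euro sign or typographic (curly) quotes pass through `remove_punctuation` unchanged. A quick check on made-up text, using the same translation-table approach:

```python
import string

def remove_punctuation(text):
    # Same approach as the notebook: delete every character in string.punctuation.
    return text.translate(str.maketrans('', '', string.punctuation))

# "\u20ac" is the euro sign; ASCII '.' and '!' are removed, the euro sign is not.
ascii_case = remove_punctuation("way over priced. \u20ac 10 a piece!")

# "\u2019", "\u201c" and "\u201d" are curly quotes; none of them is in string.punctuation.
curly_case = remove_punctuation("it\u2019s \u201crich\u201d")

print(ascii_case)
print(curly_case)
```

If such characters matter for a later step, they would need to be added to the translation table explicitly.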

## Removing Stopwords

In [119]:
stop_words[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [121]:
df['stopwords_removed'] = df['punctuations'].apply(lambda x: " ".join(word for word in x.split() if word not in stop_words))

In [123]:
df['stopwords_removed'].head()

0    amazing experiences paris food drink amazing s...
1    chestnut cake good toppings tasted like frosti...
2    waited line less hour get seated hostess keep ...
3    tldr angelina offers delicious hot chocolate i...
4    expect wait line get angelina line moves fairl...
Name: stopwords_removed, dtype: object
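
Because punctuation was stripped before this step, a contraction such as `don't` has already become `dont` and no longer matches NLTK's entries, which keep the apostrophe (as in `"you're"` above). That is one likely reason `dont` still shows up in the frequency counts. The sketch below compares the two orderings on a made-up sentence with a small illustrative stopword set, not the full NLTK list.

```python
import string

# Small illustrative stopword set; the apostrophes mirror how NLTK stores contractions.
sample_stop_words = {"i", "the", "don't", "you're"}
strip_punct = str.maketrans('', '', string.punctuation)

text = "i don't like the cake"

# Order used in the notebook: strip punctuation, then drop stopwords.
punct_first = " ".join(w for w in text.translate(strip_punct).split()
                       if w not in sample_stop_words)

# Alternative order: drop stopwords, then strip punctuation.
stop_first = " ".join(w for w in text.split()
                      if w not in sample_stop_words).translate(strip_punct)

print(punct_first)  # "dont" survives because it no longer matches "don't"
print(stop_first)
```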

## Reviewing Recurring Words by Frequency Count

In [130]:
pd.Series(" ".join(df['stopwords_removed']).split()).value_counts()[:30]

chocolate    28
hot          27
line         11
wait          9
angelina      8
best          7
rich          7
drink         7
cream         7
quickly       6
sweet         6
time          6
one           6
good          5
outside       5
also          5
cup           5
get           5
paris         5
enough        4
blanc         4
ordered       4
two           4
whipped       4
ive           4
worth         4
anything      4
brunch        4
gift          4
try           4
Name: count, dtype: int64
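
The same tally can also be produced with `collections.Counter` from the standard library, without building an intermediate Series. A sketch on two made-up cleaned reviews standing in for `df['stopwords_removed']`:

```python
from collections import Counter

# Made-up cleaned reviews standing in for df['stopwords_removed'].
cleaned = [
    "hot chocolate rich hot chocolate",
    "line long hot chocolate worth",
]

# Join everything into one string, split into words, and count each word.
word_counts = Counter(" ".join(cleaned).split())

# most_common(n) returns the n highest counts as (word, count) pairs.
print(word_counts.most_common(3))
```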

In [131]:
other_stopwords = [
    "line", "wait", "best", "rich", "drink", "cream", "quickly", "one", "good",
    "outside", "also", "cup", "get", "enough", "ordered", "two", "ive", "worth",
    "anything", "brunch", "try"
]

In [132]:
len(other_stopwords)

21

## Removing Additional Stopwords

In [134]:
df['cleaned_reviews'] = df['stopwords_removed'].apply(lambda x: " ".join(word for word in x.split() if word not in other_stopwords))

In [137]:
pd.Series(" ".join(df['cleaned_reviews']).split()).value_counts()[:30]

chocolate     28
hot           27
angelina       8
time           6
sweet          6
paris          5
whipped        4
gift           4
ever           4
blanc          4
got            3
way            3
small          3
pot            3
visit          3
eggs           3
mont           3
come           3
go             3
order          3
buy            3
dont           3
share          3
amazing        3
experience     3
tasted         3
service        3
shop           3
store          2
probably       2
Name: count, dtype: int64

You can go on to remove additional stopwords if you want to.

In [138]:
df.head()

Unnamed: 0,review,word_count,character_count,avg_word_length,stopwords_count,stopwords_rate,review_lowercase,punctuations,stopwords_removed,cleaned_reviews
0,What amazing experiences I did in Paris The fo...,44,263,4.954545,19,0.431818,what amazing experiences i did in paris the fo...,what amazing experiences i did in paris the fo...,amazing experiences paris food drink amazing s...,amazing experiences paris food amazing service...
1,Chestnut cake -no good at all. The toppings ta...,31,164,4.322581,11,0.354839,chestnut cake -no good at all. the toppings ta...,chestnut cake no good at all the toppings tast...,chestnut cake good toppings tasted like frosti...,chestnut cake toppings tasted like frosting wa...
2,We waited in line for less than an hour to get...,95,514,4.421053,43,0.452632,we waited in line for less than an hour to get...,we waited in line for less than an hour to get...,waited line less hour get seated hostess keep ...,waited less hour seated hostess keep moving ta...
3,TLDR; angelina offers delicious hot chocolate ...,288,1584,4.503472,119,0.413194,tldr; angelina offers delicious hot chocolate ...,tldr angelina offers delicious hot chocolate i...,tldr angelina offers delicious hot chocolate i...,tldr angelina offers delicious hot chocolate i...
4,You can expect to wait in line to get in to An...,114,594,4.219298,52,0.45614,you can expect to wait in line to get in to an...,you can expect to wait in line to get in to an...,expect wait line get angelina line moves fairl...,expect angelina moves fairly hot chocolate eve...


## Lemmatization

In [143]:
from textblob import Word
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\pnwaekwu\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pnwaekwu\AppData\Roaming\nltk_data...


True

In [145]:
df['lemmatized'] = df['cleaned_reviews'].apply(lambda x: " ".join(Word(word).lemmatize() for word in x.split()))

In [146]:
df['lemmatized']

0    amazing experience paris food amazing service ...
1    chestnut cake topping tasted like frosting way...
2    waited le hour seated hostess keep moving tabl...
3    tldr angelina offer delicious hot chocolate in...
4    expect angelina move fairly hot chocolate ever...
5    hot chocolate mont blanc macarons croissant ba...
6    expect least 45 minute arriving anytime openin...
7    went sunday afternoon lucky lineup 15mins tabl...
8    angelina famous hot chocolate make sure visit ...
9    course hot chocolate never like egg benedict n...
Name: lemmatized, dtype: object

In [147]:
df['cleaned_reviews']

0    amazing experiences paris food amazing service...
1    chestnut cake toppings tasted like frosting wa...
2    waited less hour seated hostess keep moving ta...
3    tldr angelina offers delicious hot chocolate i...
4    expect angelina moves fairly hot chocolate eve...
5    hot chocolate mont blanc macarons croissant ba...
6    expect least 45 minutes arriving anytime openi...
7    went sunday afternoon lucky lineup 15mins tabl...
8    angelina famous hot chocolate make sure visit ...
9    course hot chocolate never like eggs benedict ...
Name: cleaned_reviews, dtype: object

## Sentiment Analysis

In [148]:
from textblob import TextBlob

When you run sentiment analysis with TextBlob you get two metrics back: polarity and subjectivity.

Polarity is probably the one we're more interested in. It ranges from -1 to 1 and measures how negative or positive a review is, with -1 being the most negative and 1 the most positive.

Subjectivity ranges from 0 to 1 and measures how much of the text is personal opinion rather than factual information, with 0 being fully objective and 1 fully subjective. A really constructive review is probably mostly objective with a little personal opinion mixed in, and neither too negative nor too positive.

If instead you're looking for your biggest critics, look for reviews that are highly subjective and reasonably negative, because those get right down to the worst bits about your business.

In [152]:
df['polarity'] = df['lemmatized'].apply(lambda x: TextBlob(x).sentiment.polarity)

In [153]:
df['subjectivity'] = df['lemmatized'].apply(lambda x: TextBlob(x).sentiment.subjectivity)

In [158]:
df.drop(['review_lowercase', 'punctuations',
       'stopwords_removed', 'cleaned_reviews', 'lemmatized'], axis=1, inplace=True)

In [159]:
df.head(2)

Unnamed: 0,review,word_count,character_count,avg_word_length,stopwords_count,stopwords_rate,polarity,subjectivity
0,What amazing experiences I did in Paris The fo...,44,263,4.954545,19,0.431818,0.65,0.93
1,Chestnut cake -no good at all. The toppings ta...,31,164,4.322581,11,0.354839,0.35,0.65


In [160]:
df.sort_values(by='polarity')

Unnamed: 0,review,word_count,character_count,avg_word_length,stopwords_count,stopwords_rate,polarity,subjectivity
5,It's the best hot chocolate I've had and the b...,87,439,4.057471,40,0.45977,0.105952,0.65873
7,We went on a Sunday afternoon and were lucky t...,140,774,4.535714,66,0.471429,0.218056,0.594444
8,"Angelina is famous for hot chocolate, so make ...",161,845,4.254658,76,0.47205,0.219416,0.63825
2,We waited in line for less than an hour to get...,95,514,4.421053,43,0.452632,0.233333,0.634877
4,You can expect to wait in line to get in to An...,114,594,4.219298,52,0.45614,0.241667,0.633333
3,TLDR; angelina offers delicious hot chocolate ...,288,1584,4.503472,119,0.413194,0.246,0.64
6,Expect to wait in line for at least 45 minutes...,183,995,4.442623,83,0.453552,0.283333,0.67
9,Worth the wait! We of course had the hot choco...,32,173,4.4375,12,0.375,0.3125,0.7875
1,Chestnut cake -no good at all. The toppings ta...,31,164,4.322581,11,0.354839,0.35,0.65
0,What amazing experiences I did in Paris The fo...,44,263,4.954545,19,0.431818,0.65,0.93
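
One way to act on the "biggest critics" idea is to keep the more subjective reviews and rank them from least to most positive. The sketch below copies the polarity and subjectivity values from the table above into plain Python so it runs without the notebook's DataFrame. The 0.65 subjectivity cut-off is an arbitrary assumption, not a recommended threshold.

```python
# (review index, polarity, subjectivity) copied from the sorted table above.
scores = [
    (5, 0.105952, 0.658730), (7, 0.218056, 0.594444), (8, 0.219416, 0.638250),
    (2, 0.233333, 0.634877), (4, 0.241667, 0.633333), (3, 0.246000, 0.640000),
    (6, 0.283333, 0.670000), (9, 0.312500, 0.787500), (1, 0.350000, 0.650000),
    (0, 0.650000, 0.930000),
]

SUBJECTIVITY_CUTOFF = 0.65  # arbitrary threshold chosen for illustration

# Keep the more opinionated reviews, then list the least positive first.
critics = sorted((row for row in scores if row[2] >= SUBJECTIVITY_CUTOFF),
                 key=lambda row: row[1])

polarities = [row[1] for row in scores]
print("polarity range:", min(polarities), "to", max(polarities))
print("most critical candidates:", [row[0] for row in critics[:3]])
```

With only 10 reviews, all of them scoring above zero, this ranking is relative: the "most critical" candidates are simply the least positive of an overall positive set.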


In [166]:
print(df['review'].head(1).tolist())

['What amazing experiences I did in Paris The food and drink are amazing service %100 perfect All employees have infos about menus \xa0and food ingredient The best drink, ever the hot chocolate you should try it for one \xa0time, because that flavour tastes unforgettable']


## Conclusion

From the data we see that the polarity values range from 0.105952 to 0.650000. The higher the polarity value, the more positive the sentiment expressed in the corresponding review. Every review scored above zero, which suggests that customers generally had positive experiences at the restaurant, and the highest value of 0.650000 indicates a strong positive sentiment in one of the reviews.

The subjectivity values range from 0.594444 to 0.930000. Higher subjectivity values indicate more opinionated text, so the reviews here are relatively subjective: customers expressed personal opinions and emotions rather than purely objective statements.

It's important to note that the polarity and subjectivity values depend on the specific sentiment analysis tool used (TextBlob here), and the interpretation may vary depending on the scale and context of the analysis. For example, review 1 is plainly critical ("no good at all", "Way too sweet") yet scored 0.35, the second-highest polarity in the table.

Overall, it seems that the restaurant received favorable feedback from customers based on the sentiment analysis results. However, these conclusions are based solely on the polarity and subjectivity metrics computed for 10 reviews. Additional context and the specific criteria used to determine the sentiment may provide a more comprehensive understanding of the restaurant's overall reputation.