# Yelp Fusion API Call and Naive Bayes Classifier Sentiment Analysis 

Prepared by Denisa Bani

This Jupyter file will explore how to make a Yelp Fusion API call in order to download and analyze reviews and ratings for restaurants in the Toronto area. A naive bayes classifier will be used to predict whether the rating from a review will be positive or negative, where a positive rating is defined as a rating greater than 3 and a negative rating is 3 or lower. We will then find that using Yelp Fusion has many limitations and then explore how to web scrape using BeautifulSoup. 

In [28]:
import requests
import json
import numpy as np
import matplotlib.pyplot as plt
import numpy as np

import random
from pprint import pprint

from requests.auth import HTTPBasicAuth

Go to the Yelp Fusion page: https://www.yelp.com/developers/v3/manage_app and create an app in order to obtain a Yelp token. My Yelp token was obtained and saved as the variable: ` YELP_TOKEN ` but this line of code is hidden for security purposes.

After the token is obtained, we can make a request for restaurants in  pages 1-50 within the Toronto area and then verify whether the request was valid (aka, make sure there were no issues with the token verification). We want to see the status code '200' to verify that this is the case.

In [30]:
r = requests.get("https://api.yelp.com/v3/businesses/search?location=Toronto&limit=50", headers={"Authorization": "Bearer %s" % YELP_TOKEN})
print(r.status_code, r.reason)

200 OK


## Training the Classifier using the Yelp Fusion API

After verifiying that the verification token is functional, we need to scrape *just* the reviews and ratings as that is all we need for our sentiment analysis.There are a couple limitations to using Yelp's API, as documented on this [page](https://www.yelp.ca/developers/faq). The limitations that will specifically impact the classifier is the restriction to 1000 businesses and only being able to retrieve review excerpts that are 160 characters long. This will limit the performance of the naive bayes classifier as a lot of information is potentially being lost.  

The reviews will be compliled into a list of tuples where a tuple contains the review as the first entry and the rating as the second entry. The list will be the variable `review_labels`. Afterwards, the review in the tuple will be broken down into a list and the rating will be converted into a binary "postive" or "negative" rating - this will be stored as `review_features`. 

In [31]:
review_labels = []
pages = np.arange(51,1001,50)
for page in pages:
    r = requests.get("https://api.yelp.com/v3/businesses/search?location=Toronto&offset="+str(page), headers={"Authorization": "Bearer %s" % YELP_TOKEN})
    for business in r.json()['businesses']:
        reviews = requests.get("https://api.yelp.com/v3/businesses/%s/reviews" % business['id'], headers={"Authorization": "Bearer %s" % YELP_TOKEN}).json()
        for review in reviews['reviews']:
            rev = review['text'].rstrip('.')
            review_labels.append((rev.replace('\n\n',''), review['rating']))

In [32]:
review_features = [(x.split(' '), 'positive' if y > 3 else 'negative') for (x, y) in review_labels]

In [33]:
from nltk.sentiment import SentimentAnalyzer
import nltk.sentiment.util
from nltk.classify import NaiveBayesClassifier

random.shuffle(review_features)
training_docs = review_features[:int(len(review_features)*2/3)]
test_docs = review_features[int(len(review_features)*2/3):]

print("Training: %d, Testing: %d" % (len(training_docs), len(test_docs)))

sentim_analyzer = SentimentAnalyzer()

all_words_neg = sentim_analyzer.all_words([nltk.sentiment.util.mark_negation(doc) for doc in training_docs])

Training: 760, Testing: 380


In [34]:
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
sentim_analyzer.add_feat_extractor(nltk.sentiment.util.extract_unigram_feats, unigrams=unigram_feats)

training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(test_docs)

trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)
for key,value in sorted(sentim_analyzer.evaluate(test_set).items()):
     print('{0}: {1}'.format(key, value))

Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.7210526315789474
F-measure [negative]: 0.32911392405063294
F-measure [positive]: 0.8239202657807309
Precision [negative]: 0.40625
Precision [positive]: 0.7848101265822784
Recall [negative]: 0.2765957446808511
Recall [positive]: 0.8671328671328671


72% accuracy isn't the best. Let's see if we can do better by trying other methods that won't limit our web scraping like Yelp Fusion does. 

## Training the Classifier using BeautifulSoup for Review Retrieval

Instead of being limited by the number of characters in a review from Yelp's API, let's try using BeautifulSoup and see the impact on our classification results. A really great and relevant blog post on the subject can be found [here](https://www.octoparse.com/blog/web-scraping-using-python) and [here](https://www.youtube.com/watch?v=r3-v81c2Oew).

Note that Yelp uses Javascipt to display its webpages, which means that simply inspecting the page and using BeautifulSoup's `find_all` won't work since BeautifulSoup doesn't run Javascripts. Some work-arounds that were researched was to use Selenium with BeautifulSoup, but this fancy technique isn't necessary as the information we need can be obtained without Javascripts. Here is a [link](https://towardsdatascience.com/web-scraping-using-selenium-and-beautifulsoup-99195cd70a58) to explore the Selenium option if you want.

Instead, an easier alternative is to forego using "Inspect" on your web browser, and use "View Page Source". Note that I'm using Firefox so this may have another name on other browsers. Now we can refer to the correct tags that BeautifulSoup would also recognize. A loop will be used to click through and parse through the different pages to collect as many reviews and ratings as possible. 

In [35]:
from bs4 import BeautifulSoup 
import urllib.request

In [36]:
page = 0
reviews = []
ratings = []
combo =[]

while page <= 280:
    
    url = 'https://www.yelp.com/biz/frescos-fish-and-chips-toronto?start=' + str(page)
    
    ourURL = urllib.request.urlopen(url)
    
    soup = BeautifulSoup(ourURL, 'html.parser')
    
    #print(soup.prettify())
    
    reviews_html = soup.find_all("p", {"itemprop":"description"})
    ratings_html = soup.find_all("div", {"itemprop":"reviewRating"})
    
    #Time to clean up 
    
    for review in reviews_html:
        rev= str(review).replace('\n\n','')
        reviews.append(rev[26:-12])
        
    
    for rating in ratings_html:
        ratings.append(float(rating.meta.get('content')))
       
    
    for i in range(len(reviews)):
        combo.append((reviews[i],ratings[i]))
        
    page += 20

In [37]:
review_features= [(x.split(' '), 'positive' if y > 3 else 'negative') for (x, y) in combo]

In [38]:
random.shuffle(review_features)
training_docs = review_features[:int(len(review_features)*2/3)]
test_docs = review_features[int(len(review_features)*2/3):]

print("Training: %d, Testing: %d" % (len(training_docs), len(test_docs)))

sentim_analyzer = SentimentAnalyzer()

all_words_neg = sentim_analyzer.all_words([nltk.sentiment.util.mark_negation(doc) for doc in training_docs])

Training: 1600, Testing: 800


In [39]:
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
sentim_analyzer.add_feat_extractor(nltk.sentiment.util.extract_unigram_feats, unigrams=unigram_feats)

training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(test_docs)

trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)
for key,value in sorted(sentim_analyzer.evaluate(test_set).items()):
     print('{0}: {1}'.format(key, value))

Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.9875
F-measure [negative]: 0.9514563106796117
F-measure [positive]: 0.9928263988522238
Precision [negative]: 1.0
Precision [positive]: 0.9857549857549858
Recall [negative]: 0.9074074074074074
Recall [positive]: 1.0


This is a major improvement over using the Yelp Fusion API as we are no longer limited by the length of the reviews that we can retrieve nor by the number of reviews we can obtain. Our accuracy is now a solid 99%. 