# Web Scraping and Sentiment Analysis of Food Reviews

In this project I scrape the reviews of 8 MilePi Pizza from Yelp, a website for finding restaurants, home services, and other local businesses. First, I scrape the reviews using BeautifulSoup. Then, for sentiment analysis, I pass them through BERT (Bidirectional Encoder Representations from Transformers), a widely used NLP model pretrained by Google. The checkpoint used here, `nlptown/bert-base-multilingual-uncased-sentiment`, is already trained to rate text from 1 to 5, so we use it as-is to predict the sentiment of the reviews scraped from Yelp.com. This is a good example of transfer learning.

## 1. Install and Import Dependencies

In [1]:
!pip install torch torchvision torchaudio



In [2]:
!pip install transformers requests beautifulsoup4 pandas numpy



In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

## 2. Instantiate Model

In [2]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')



## 3. Encode and Calculate Sentiment

In [3]:
tokens = tokenizer.encode('I loved it, the pizza is very delicious', return_tensors = 'pt')

In [4]:
tokens

tensor([[  101,   151, 46747, 10197,   117, 10103, 59371, 10127, 12495, 27254,
         47838,   102]])

In [5]:
# Optional: decode the token IDs back into a string to inspect them
tokenizer.decode(tokens[0])

'[CLS] i loved it, the pizza is very delicious [SEP]'

In [6]:
result = model(tokens)
result

SequenceClassifierOutput(loss=None, logits=tensor([[-2.6552, -2.2700, -0.3486,  1.8020,  2.7059]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

The model returns logits: one raw, unnormalized score for each of the five rating classes (1 to 5). These are not one-hot encoded. The position with the highest score is the predicted sentiment rating.

In [7]:
result.logits

tensor([[-2.6552, -2.2700, -0.3486,  1.8020,  2.7059]],
       grad_fn=<AddmmBackward0>)

`torch.argmax` returns the position of the highest value in the tensor. Because positions are counted from 0, I add 1 to convert the index into a rating from 1 to 5.

In [8]:
int(torch.argmax(result.logits))+1



5

Now we have a number from 1 to 5. The higher the number, the more positive the sentiment.
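
The argmax keeps only the top class and discards how confident the model is. As a minimal sketch, a softmax turns the logits printed above into probabilities that sum to 1. The version below uses only the standard library and hard-codes the logits from the first example so that it runs on its own; inside the notebook, `torch.softmax(result.logits, dim=-1)` should give the same numbers.

```python
import math

# Logits copied from the output for 'I loved it, the pizza is very delicious'
logits = [-2.6552, -2.2700, -0.3486, 1.8020, 2.7059]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
rating = probs.index(max(probs)) + 1  # same +1 shift as with torch.argmax

for stars, p in enumerate(probs, start=1):
    print(f"rating {stars}: {p:.3f}")
print("predicted rating:", rating)
```

A prediction where the top probability is only slightly ahead of its neighbour is worth treating as less certain than one where a single class dominates.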

In [9]:
# Let's try this on one more review
tokens_a = tokenizer.encode('It was the worst thing i have ever had', return_tensors = 'pt')
result_a = model(tokens_a)
int(torch.argmax(result_a.logits))+1

1

## 4. Import the Review Dataset

In [10]:
r = requests.get('https://www.yelp.com/biz/8milepi-detroit-style-pizza-san-francisco-3')
soup = BeautifulSoup(r.text,'html.parser')
regex = re.compile('.*comment.*')
results = soup.find_all('p',{'class':regex})
reviews = [result.text for result in results]
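
To see what the class filter is doing, here is a small standard-library sketch that applies the same pattern to the class names that appear on the `<p>` and `<span>` tags in the scraped output. It assumes BeautifulSoup tests a compiled pattern against each class name with `search`; that is my understanding of its behavior, not something this notebook demonstrates.

```python
import re

regex = re.compile('.*comment.*')  # same pattern as in the scraping cell

# Class names taken from the <p> and <span> tags in the scraped output
class_names = ['comment__09f24__D0cxf', 'y-css-1wfz87z', 'raw__09f24__T4Ezm']

# Keep only the class names that contain the substring 'comment'
matching = [name for name in class_names if regex.search(name)]
print(matching)
```

The suffix after `comment` looks auto-generated, so matching on the `comment` substring is likely to survive small markup changes better than matching the full class name.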

In [11]:
results

[<p class="comment__09f24__D0cxf y-css-1wfz87z"><span class="raw__09f24__T4Ezm" lang="en">lol never coming here again they gave me food poisioning and that salad is unholy. also dont talk to jessica. i came here a couple years ago, on 4/5/2031. hope it changes soon</span></p>,
 <p class="comment__09f24__D0cxf y-css-1wfz87z"><span class="raw__09f24__T4Ezm" lang="en">In the past year and a half, I've fallen in love with Detroit-style pizza and been on the hunt to try every place SF, which is what lead me to 8 MilePi! <br/><br/>8 MilePI offers both Detroit-style and Sicilian pizza, but of course I had to go with the Detroit. I love bbq chicken pizzas, so that one immediately got my attention and placed my order for that. The bbq sauce was on the bitter side, but just wish it was a little sweeter and tangy because the bitterness of the sauce overpowered the rest of ingredients and was the prominent flavor. The crust fell short of expectations as well. It wasn't the crispy, cheesy edges tha

In [12]:
results[0].text

'lol never coming here again they gave me food poisioning and that salad is unholy. also dont talk to jessica. i came here a couple years ago, on 4/5/2031. hope it changes soon'

In [13]:
reviews

['lol never coming here again they gave me food poisioning and that salad is unholy. also dont talk to jessica. i came here a couple years ago, on 4/5/2031. hope it changes soon',
 "In the past year and a half, I've fallen in love with Detroit-style pizza and been on the hunt to try every place SF, which is what lead me to 8 MilePi! 8 MilePI offers both Detroit-style and Sicilian pizza, but of course I had to go with the Detroit. I love bbq chicken pizzas, so that one immediately got my attention and placed my order for that. The bbq sauce was on the bitter side, but just wish it was a little sweeter and tangy because the bitterness of the sauce overpowered the rest of ingredients and was the prominent flavor. The crust fell short of expectations as well. It wasn't the crispy, cheesy edges that i've had with previous Detroit-style pizzas. The crust as a whole had more of the sicilian texture rather than the Detroit.  The pizza itself is huge! The regular size was $28, and I was full of

## 5. Load Reviews into a DataFrame and Score Them

In [14]:
df = pd.DataFrame(np.array(reviews),columns = ['review'])

In [15]:
df['review'].iloc[0]

'lol never coming here again they gave me food poisioning and that salad is unholy. also dont talk to jessica. i came here a couple years ago, on 4/5/2031. hope it changes soon'

In [16]:
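# Wrap the steps carried out earlier in a reusable function
def sentiment_score(review):
    # Truncate at the tokenizer level so inputs never exceed BERT's 512-token limit
    tokens = tokenizer.encode(review, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():  # inference only, so skip gradient tracking
        result = model(tokens)
    # argmax gives a 0-based class index; +1 converts it to a rating from 1 to 5
    return int(torch.argmax(result.logits)) + 1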


The function above encapsulates the sentiment pipeline, which makes it easy to score many strings. We apply it to each review in the DataFrame.

In [17]:
sentiment_score(df['review'].iloc[0])

1

In [18]:
# Score every review; x[:512] keeps the first 512 characters (not tokens) as a rough length cap
df['sentiment'] = df['review'].apply(lambda x: sentiment_score(x[:512]))

In [19]:
df

|   | review | sentiment |
|---|--------|-----------|
| 0 | lol never coming here again they gave me food ... | 1 |
| 1 | In the past year and a half, I've fallen in lo... | 4 |
| 2 | 3.5 starsOrdered via door dash and some of the... | 3 |
| 3 | 8MilePi in Detroit serves up a slice of perfec... | 5 |
| 4 | The ultimate Detroit style pizzas with that de... | 5 |
| 5 | Hi Sonam, thank you for sharing this great fee... | 5 |
| 6 | I got firehouse special and Motown Meat lover-... | 5 |
| 7 | Hi Alicia, great to hear that you enjoyed our ... | 4 |
| 8 | If you love thick pizza, Detroit style pizza f... | 5 |
| 9 | Hi Farrah, great to hear that you enjoy our SM... | 4 |
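
With the scores in place, a quick summary shows how the ratings are distributed. The sketch below hard-codes the ten rows shown above instead of reading `df`, so it runs on its own. Rows 5, 7, and 9 open with a greeting and appear to be replies from the business, not customer reviews; the `startswith("Hi ")` filter is a rough heuristic I am assuming for illustration, not part of the original pipeline.

```python
from collections import Counter

# (review prefix, sentiment) pairs copied from the output; '...' marks truncation
rows = [
    ("lol never coming here again they gave me food ...", 1),
    ("In the past year and a half, I've fallen in lo...", 4),
    ("3.5 starsOrdered via door dash and some of the...", 3),
    ("8MilePi in Detroit serves up a slice of perfec...", 5),
    ("The ultimate Detroit style pizzas with that de...", 5),
    ("Hi Sonam, thank you for sharing this great fee...", 5),
    ("I got firehouse special and Motown Meat lover-...", 5),
    ("Hi Alicia, great to hear that you enjoyed our ...", 4),
    ("If you love thick pizza, Detroit style pizza f...", 5),
    ("Hi Farrah, great to hear that you enjoy our SM...", 4),
]

def summarize(pairs):
    """Return (count per rating, mean rating) for (text, score) pairs."""
    scores = [score for _, score in pairs]
    return Counter(scores), sum(scores) / len(scores)

all_counts, all_mean = summarize(rows)

# Hypothetical heuristic: treat rows that open with "Hi " as business replies
customer_rows = [row for row in rows if not row[0].startswith("Hi ")]
customer_counts, customer_mean = summarize(customer_rows)

print("all rows:", dict(all_counts), "mean:", all_mean)
print("customer reviews only:", dict(customer_counts), "mean:", customer_mean)
```

In the notebook itself, `df['sentiment'].value_counts()` and `df['sentiment'].mean()` give the same kind of summary directly from the DataFrame.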


We can run the same script for other restaurants or businesses by copying the page link from Yelp and replacing the URL passed to `requests.get` in section 4.

Caution: if Yelp changes its page structure in the future, the class-name pattern may stop matching, and the script can return no reviews or throw an error.
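
One way to make that failure easier to diagnose is to validate the scrape before scoring. The helper below is a hypothetical sketch (the name `check_scrape` is mine, not from the original notebook); in the notebook it would be called as `check_scrape(r.status_code, reviews)` right after the scraping cell.

```python
def check_scrape(status_code, reviews):
    """Fail early with a descriptive message if the scrape looks wrong."""
    if status_code != 200:
        raise RuntimeError(f"Request failed with HTTP status {status_code}")
    if not reviews:
        raise ValueError("No reviews found; the page markup may have changed")
    return reviews

# A successful scrape passes through unchanged
print(check_scrape(200, ["great pizza"]))

# An empty result is reported before it reaches the DataFrame step
try:
    check_scrape(200, [])
except ValueError as err:
    print("scrape problem:", err)
```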