<a href="https://colab.research.google.com/github/dton24/PortfolioProjects/blob/main/BERT_CIS_4680_Sentiment_Yelp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Install and Import Dependencies**

In [None]:
# Installing pytorch
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

In [None]:
# transformers provides our NLP model, including the sentiment analysis model for product reviews (scale from 1-5)
# Beautiful Soup parses the HTML returned from the page so we can extract just the data we need. It's frequently used in web scraping.
# pandas structures the data in an easy-to-read table
# NumPy provides additional data transformation
!pip install transformers requests beautifulsoup4 pandas numpy



In [None]:
# Bring in our model classes from Hugging Face. AutoTokenizer loads the tokenizer that matches the pre-trained model and converts text into token IDs the model can read.
# AutoModelForSequenceClassification, also from the Transformers library, downloads and loads a pre-trained model for sequence classification tasks such as sentiment analysis.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# torch provides argmax, which we use to pick the class with the highest score
import torch

# requests fetches the Yelp web page
import requests

# Creates parse trees. Each node in the tree corresponds to a tag, attribute, or piece of text in the document. This tree structure allows you to navigate and extract specific parts of the document efficiently.
from bs4 import BeautifulSoup

# re (regular expressions) lets us pick out the specific comments we want
import re

##**Instantiate our Model**

In [None]:
# Load our pre-trained tokenizer from Hugging Face
# This tokenizer processes text in multiple languages and prepares it for the matching BERT-based sentiment model.
# The argument is the model ID, which is the last part of the model's Hugging Face URL
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

In [None]:
# Initiate our pre-trained model from Hugging Face
# Designed to handle sentiment analysis across multiple languages. This model is capable of classifying sequences of text, such as sentences or paragraphs, into predefined categories like sentiment ratings.
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

##**Encode and Calculate Sentiment**

In [None]:
# tokenizer.encode(...): This function converts the input text into a list of token IDs.
# 'WOW THIS IS AMAZING!': This is the text input that is being tokenized.
# return_tensors='pt': Specifies that the token IDs should be returned as a PyTorch tensor.

tokens = tokenizer.encode('WOW THIS IS AMAZING!', return_tensors = 'pt')

In [None]:
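# Display the token IDs. Each number is the vocabulary ID of one token (a word, word piece, or punctuation mark); 101 and 102 are the special [CLS] and [SEP] markers.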
tokens

tensor([[  101, 94608, 10372, 10127, 39854,   106,   102]])

In [None]:
# Decode the token IDs back into text. tokens has shape (1, 7), a batch holding one sequence, so we pass the first row, tokens[0], to decode.
tokenizer.decode(tokens[0])

'[CLS] wow this is amazing! [SEP]'

In [None]:
# In order to perform our sentiment analysis, we have to put the tokens into our model
result = model(tokens)

In [None]:
result.logits

tensor([[-2.0214, -2.7252, -1.0128,  0.9251,  4.0122]],
       grad_fn=<AddmmBackward0>)

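*Output interpretation: The logits are raw, unnormalized scores, one per class, not probabilities. A higher logit means the model favors that class more, and applying softmax would turn the scores into probabilities. There are 5 classes (1 is very negative and 5 is very positive). For example, the score for 1 out of 5 is -2.0214, while the score for 5 out of 5 is 4.0122, the highest of the five.*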
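To read these scores as probabilities, a softmax can be applied. The sketch below uses plain Python on the five logit values copied from the output above; inside the notebook, `torch.softmax(result.logits, dim=1)` should give the same numbers. The helper name `softmax` and the variables `probs` and `best_stars` are illustrative, not part of the original notebook.

```python
import math

# Logits copied from the result.logits output above (class index 0-4 = 1-5 stars).
logits = [-2.0214, -2.7252, -1.0128, 0.9251, 4.0122]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
for stars, p in enumerate(probs, start=1):
    print(f"{stars} star(s): {p:.3f}")

# The most likely rating is the index of the largest probability, plus 1.
best_stars = probs.index(max(probs)) + 1
print("Predicted rating:", best_stars)
```

Softmax preserves the ordering of the logits, so the class with the highest logit is also the class with the highest probability.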

In [None]:
# The model outputs one logit per class, indexed 0 to 4. Adding 1 converts the class index into a sentiment rating from 1 to 5, so index 0 becomes a rating of 1.
# torch.argmax() returns the index of the highest value, here index 4, which is the 5th class because Python indexing starts at 0.
int(torch.argmax(result.logits))+1

5

*Output interpretation: The input text is very positive.*

##**Collect Reviews**

In [None]:
# This line sends an HTTP GET request to the specified Yelp page for a business called "Mejico Sydney" and stores the response in the variable r. It essentially grabs the webpage
r = requests.get('https://yelp.com/biz/mejico-sydney-2')

In [None]:
# This is the html text of the webpage
r.text

'<!DOCTYPE html><html lang="en-US" prefix="og: http://ogp.me/ns#" style="margin: 0;padding: 0; border: 0; font-size: 100%; font: inherit; vertical-align: baseline;"><head><script>document.documentElement.className=document.documentElement.className.replace(/\x08no-js\x08/,"js");</script><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta http-equiv="Content-Language" content="en-US" /><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><link rel="mask-icon" sizes="any" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b2bb2fb0ec9c/assets/img/logos/yelp_burst.svg" content="#FF1A1A"><link rel="shortcut icon" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/dcfe403147fc/assets/img/logos/favicon.ico"><script> window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;window.ygaPageStartTime=new Date().getTime();</script><script>\n            window.yelp = window.yelp || {};\

In [None]:
# BeautifulSoup parses the HTML (the language web pages are written in) into a structure that the Python program can easily work with.
soup = BeautifulSoup(r.text, 'html.parser')

In [None]:
# This line creates a search pattern to find pieces of text that have the word "comment" somewhere in them.
regex = re.compile('.*comment.*')

In [None]:
# This line looks through the parsed page and collects every paragraph (<p>) whose CSS class name contains "comment".
# We are returning all of the comments in the Yelp reviews
results = soup.find_all('p', {'class':regex})

In [None]:
# Return the first comment
results[0]

<p class="comment__09f24__D0cxf css-qgunke"><span class="raw__09f24__T4Ezm" lang="en">The food is fresh and tasty.  The scallop ceviche started the lunch. The scallops were tender with a great acidity and use of mango and peppers. The steak was tender and I got the hint of tequila in the sauce. I enjoyed a watermelon salad that complimented the the steak. The portions are good, but a stretch if you are sharing. My only down point is the service. They really only showed up to present my next plate and never checked to see if I wanted another drink (which I did).<br/><br/>Enjoyed the food.</span></p>
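The `class` attribute above holds two class names, and only one of them contains "comment". The sketch below checks each name against the same pattern using only the `re` module. It assumes Beautiful Soup tests class names with the pattern's `search` method, which is my understanding of its behavior rather than something shown in this notebook. The variable names `class_names` and `matches` are illustrative.

```python
import re

# Same pattern as in the notebook.
regex = re.compile('.*comment.*')

# Class names copied from the first review paragraph shown above.
class_names = ['comment__09f24__D0cxf', 'css-qgunke']

for name in class_names:
    # search() succeeds if the pattern is found anywhere in the string.
    print(name, '->', bool(regex.search(name)))

# Keep only the class names that the pattern accepts.
matches = [name for name in class_names if regex.search(name)]
print(matches)
```

Because `search` looks anywhere in the string, the simpler pattern `re.compile('comment')` would select the same class names.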

In [None]:
# We don't want to see the HTML tags, only the text,
# so we extract just the text
results[0].text

'The food is fresh and tasty.  The scallop ceviche started the lunch. The scallops were tender with a great acidity and use of mango and peppers. The steak was tender and I got the hint of tequila in the sauce. I enjoyed a watermelon salad that complimented the the steak. The portions are good, but a stretch if you are sharing. My only down point is the service. They really only showed up to present my next plate and never checked to see if I wanted another drink (which I did).Enjoyed the food.'
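Note that `.text` drops the `<br/>` tags without leaving any whitespace, which is why the output reads "(which I did).Enjoyed the food." One possible adjustment, sketched here on a shortened copy of the paragraph, is `get_text` with a `separator`. The HTML snippet and the variable names `paragraph`, `joined`, and `spaced` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the review paragraph, keeping the two <br/> tags.
html = ('<p class="comment"><span lang="en">(which I did).'
        '<br/><br/>Enjoyed the food.</span></p>')
paragraph = BeautifulSoup(html, 'html.parser').find('p')

# .text concatenates the text pieces with nothing between them.
joined = paragraph.text

# get_text can insert a separator between pieces and strip extra whitespace.
spaced = paragraph.get_text(separator=' ', strip=True)

print(repr(joined))
print(repr(spaced))
```

Whether the missing space matters for the sentiment score is not tested here; the change mainly makes the extracted text easier to read.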

In [None]:
# We are storing each comment into a list.
reviews = [result.text for result in results]

In [None]:
reviews

['The food is fresh and tasty.  The scallop ceviche started the lunch. The scallops were tender with a great acidity and use of mango and peppers. The steak was tender and I got the hint of tequila in the sauce. I enjoyed a watermelon salad that complimented the the steak. The portions are good, but a stretch if you are sharing. My only down point is the service. They really only showed up to present my next plate and never checked to see if I wanted another drink (which I did).Enjoyed the food.',
 'The food was decent not great..  We had the guacamole which was bland and came with some type of plantain chips.. The chicken and steak tacos were good.. But the service was poor. We had a waitress with an attitude. She seemed upset whenever we asked for anything.  She would walk by and just stick up her hand and say " just wait ".  She spilled the ingredients to make the guacamole all over the table but never apologized. The waitress didn\'t come by at all, not even once to check on us.. I

##**Load Reviews into Dataframe and Score**

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Put the reviews into a dataframe for easier analysis
df = pd.DataFrame(np.array(reviews), columns = ['review'])

In [None]:
# Display the review column
df['review']

0    The food is fresh and tasty.  The scallop cevi...
1    The food was decent not great..  We had the gu...
2    Food was okay, guacamole was below average. Se...
3    The food and service here was really good.  It...
4    Visiting from Texas and decided to give this r...
5    Don't come here expecting legit Mexican food b...
6    Out of all the restaurants that I tried in Syd...
7    Great atmosphere, attentive service, solid mar...
8    We came here on a Thursday night @ 5pm and by ...
9    Have been here twice and have absolutely loved...
Name: review, dtype: object

In [None]:
# Return first row of the dataframe
df['review'].iloc[0]

'The food is fresh and tasty.  The scallop ceviche started the lunch. The scallops were tender with a great acidity and use of mango and peppers. The steak was tender and I got the hint of tequila in the sauce. I enjoyed a watermelon salad that complimented the the steak. The portions are good, but a stretch if you are sharing. My only down point is the service. They really only showed up to present my next plate and never checked to see if I wanted another drink (which I did).Enjoyed the food.'

In [None]:
# Tokenize, classify using the model, and return the classification based on the highest logit for the review.
def sentiment_score(review):
  tokens = tokenizer.encode(review, return_tensors = 'pt')
  result = model(tokens)
  return int(torch.argmax(result.logits))+1

In [None]:
# Input the first row into the function
sentiment_score(df['review'].iloc[0])

4

In [None]:
# Use apply with a lambda to score every review and store the results in a new 'sentiment' column of df.
# The model accepts at most 512 tokens. Slicing with x[:512] keeps the first 512 characters, not tokens; it is a rough guard that keeps typical English reviews under that limit.
df['sentiment'] = df['review'].apply(lambda x: sentiment_score(x[:512]))
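Slicing with `x[:512]` cuts the review by characters, while the model's limit is counted in tokens. An alternative, sketched below, is to let the tokenizer truncate at the token level with `truncation=True` and `max_length`. The function name `sentiment_score_truncated` and its `max_tokens` parameter are hypothetical; the tokenizer and model are passed in as arguments so the function does not depend on notebook globals.

```python
def sentiment_score_truncated(review, tokenizer, model, max_tokens=512):
    """Score a review from 1 to 5, truncating the input at the token level.

    tokenizer and model are expected to be the Hugging Face objects
    loaded earlier in this notebook (an assumption of this sketch).
    """
    tokens = tokenizer.encode(
        review,
        return_tensors='pt',
        truncation=True,        # drop any tokens beyond max_length
        max_length=max_tokens,  # limit counted in tokens, special tokens included
    )
    result = model(tokens)
    # argmax gives a class index from 0 to 4; add 1 for the 1-5 rating.
    return int(result.logits.argmax()) + 1
```

In the notebook this could be applied with `df['review'].apply(lambda x: sentiment_score_truncated(x, tokenizer, model))`, which uses more of each long review than the 512-character slice does.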

In [None]:
# New df now has a sentiment score for each review
df

                                              review  sentiment
0  The food is fresh and tasty.  The scallop cevi...          4
1  The food was decent not great..  We had the gu...          2
2  Food was okay, guacamole was below average. Se...          2
3  The food and service here was really good.  It...          5
4  Visiting from Texas and decided to give this r...          5
5  Don't come here expecting legit Mexican food b...          3
6  Out of all the restaurants that I tried in Syd...          5
7  Great atmosphere, attentive service, solid mar...          3
8  We came here on a Thursday night @ 5pm and by ...          4
9  Have been here twice and have absolutely loved...          5


**To collect new reviews, replace the current Yelp link with a new Yelp link.**
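Swapping links is easier when the parsing steps live in one function. The sketch below wraps the notebook's `BeautifulSoup`, regex, and `.text` steps into a hypothetical helper named `extract_reviews`, and runs it on a small made-up HTML sample (not real Yelp data) so it works without a network request. It assumes the new page uses the same "comment" class naming as the one scraped above.

```python
import re

from bs4 import BeautifulSoup

# Same pattern the notebook uses to find review paragraphs.
COMMENT_CLASS = re.compile('.*comment.*')


def extract_reviews(html):
    """Return the text of every <p> whose class name contains 'comment'."""
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = soup.find_all('p', {'class': COMMENT_CLASS})
    return [p.text for p in paragraphs]


# Made-up sample markup that mimics the structure seen in the notebook output.
sample_html = """
<div>
  <p class="comment__09f24__D0cxf css-qgunke"><span lang="en">Great tacos.</span></p>
  <p class="css-other">Not a review.</p>
  <p class="comment__09f24__D0cxf css-qgunke"><span lang="en">Slow service.</span></p>
</div>
"""

sample_reviews = extract_reviews(sample_html)
print(sample_reviews)
```

With a live page, the call would be `extract_reviews(requests.get(url).text)`, where `url` is the new Yelp link.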