# Sentiment Analysis Using BERT

## Installing and Importing Dependencies

PyTorch is required and is not installed by the `pip` command below. If you are using Jupyter Notebook, install it first.

Pick a PyTorch build that suits your platform from the [official installation guide](https://pytorch.org/get-started/locally/).

In [1]:
!pip install transformers requests beautifulsoup4 pandas numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.1-py3-none-any.whl (4.9 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.22.1


In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
import pandas as pd


## Instantiate Model

In [3]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')


## Encoding and Calculating Sentiment

In [4]:
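tokens = tokenizer.encode('It was good but couldve been better. Great', return_tensors='pt')    # encode the text as a PyTorch tensor of token IDs
result = model(tokens)      # pass the token IDs through the model
result.logits               # raw scores, one per star rating (1 to 5)
int(torch.argmax(result.logits))+1      # index of the highest score (0-4), shifted to a 1-5 star rating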

4
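The model returns five logits, one for each star rating from 1 to 5, and `argmax` picks the position of the largest one. A minimal pure-Python sketch of that mapping follows. The logit values are hypothetical and chosen only for illustration (they are not output from the real model), and `softmax` and `star_rating` are helper names introduced here.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def star_rating(logits):
    """Index of the largest logit (0-4), shifted to a 1-5 star rating."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return best_index + 1

# Hypothetical logits for illustration only; not produced by the real model.
example_logits = [-2.1, -1.0, 0.4, 2.3, 1.1]

probs = softmax(example_logits)
rating = star_rating(example_logits)
print(rating)
print([round(p, 3) for p in probs])
```

The probabilities are not needed for the final rating, since the largest logit always has the largest probability, but they give a rough sense of how confident the model is.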

## Collecting Reviews

In [12]:
r = requests.get('https://www.yelp.com/biz/rintaro-san-francisco-4')    # use requests to fetch the web page
soup = BeautifulSoup(r.text, 'html.parser')                             # parse the HTML
regex = re.compile('.*comment.*')                                       # match any class name that contains 'comment'
results = soup.find_all('p', {'class':regex})                           # find every <p> tag with a matching class
reviews = [result.text for result in results]                           # keep only the text of each review
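The pattern `.*comment.*` matches any class name that contains `comment`, not only names that begin with it. The sketch below shows the difference using `re` alone, on the understanding that Beautiful Soup applies a compiled pattern with its `search()` method. The class names are hypothetical, since the real class names on the page may differ.

```python
import re

contains_comment = re.compile('.*comment.*')   # the pattern used in the cell above
starts_with_comment = re.compile('^comment')   # stricter alternative: must begin with 'comment'

# Hypothetical class names for illustration; the real page may use different ones.
class_names = ['comment__abc123', 'review-comment', 'user-passport', 'raw__xyz']

matched_anywhere = [c for c in class_names if contains_comment.search(c)]
matched_prefix = [c for c in class_names if starts_with_comment.search(c)]

print(matched_anywhere)
print(matched_prefix)
```

If only class names beginning with `comment` should match, the anchored pattern is the safer choice.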

In [14]:
reviews[0]

"A beautiful space, impeccable food, and great service.The space is cozy and transports you back to an izakaya in Japan. Fun fact - the owner's father is a carpenter and built the bar. There is indoor and outdoor seating, but I prefer the booth right in front of the kitchen. You'll definitely need to make reservations because it is very popular. The food was delicious. We had the silken tofu, chicken skewers (including the special inner-thigh), mushrooms, sashimi (including uni...yum!), mochi pockets (my favorite), and gyoza. Everything was tasty and a delight to eat."

## Loading the Reviews into a DataFrame and Scoring Them

In [21]:
df = pd.DataFrame(np.array(reviews), columns=['review'])      # load the reviews into a one-column DataFrame

In [31]:
df['review']

0    A beautiful space, impeccable food, and great ...
1    Dining at Rintaro feels like you've escaped fr...
2    Solid 4.5 stars for the food, rounding up sinc...
3    Overall an amazing experience with a generous ...
4    Food is a big part of my travel plans, and usu...
5    Rintaro is another restaurant that's been on t...
6    FYI: Courtyard seating is still inside the res...
7    Food ok, service top notchThis is my 3rd time ...
8    Rintaro is my favorite restaurant in San Franc...
9    Let's just agree that this may not be the best...
Name: review, dtype: object

In [24]:
def sentiment_score(review):
    tokens = tokenizer.encode(review, return_tensors='pt')
    result = model(tokens)
    return int(torch.argmax(result.logits))+1

In [28]:
df['review'].iloc[2]    # third row of reviews

'Solid 4.5 stars for the food, rounding up since I love the decor and the service was on point. We got the tasting menus and everything on there was delicious, and some are outstanding like the udon, the fried chicken, and the panna cotta dessert. Sashimi was very fresh and super yummy. The poke bowl at the end was average. Next time we will order a la carte to try more dishes. Overall a good experience.'

In [29]:
sentiment_score(df['review'].iloc[2])     # scoring the review

4

In [32]:
df['sentiment'] = df['review'].apply(lambda x: sentiment_score(x[:512]))    # score every review, keeping only its first 512 characters

# NOTE - The model accepts a limited number of tokens per input (512 for BERT), so each review is cut to its first 512 characters as a rough safeguard
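The slice `x[:512]` keeps 512 characters, which is usually far fewer than 512 tokens. The sketch below illustrates the gap. It uses whitespace splitting as a crude stand-in for the real WordPiece tokenizer, so the counts are illustrative only, and `crude_tokenize`, `truncate_by_tokens`, and the sample review are hypothetical names and data introduced here.

```python
MAX_TOKENS = 512  # BERT's maximum sequence length, special tokens included

def crude_tokenize(text):
    """Whitespace split; an assumed stand-in for the real WordPiece tokenizer."""
    return text.split()

def truncate_by_tokens(text, max_tokens=MAX_TOKENS):
    """Keep at most max_tokens tokens instead of a fixed number of characters."""
    return ' '.join(crude_tokenize(text)[:max_tokens])

# Hypothetical long review for illustration.
long_review = 'delicious ' * 1000

by_chars = long_review[:512]                 # what x[:512] does: 512 characters
by_tokens = truncate_by_tokens(long_review)  # 512 (crude) tokens

print(len(by_chars), len(crude_tokenize(by_chars)))
print(len(crude_tokenize(by_tokens)))
```

With the actual tokenizer, one option is to truncate by token count directly, for example `tokenizer.encode(review, return_tensors='pt', truncation=True, max_length=512)`, which avoids discarding text the model could still have read.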

In [33]:
df

|   | review | sentiment |
|---|--------|-----------|
| 0 | A beautiful space, impeccable food, and great ... | 5 |
| 1 | Dining at Rintaro feels like you've escaped fr... | 5 |
| 2 | Solid 4.5 stars for the food, rounding up sinc... | 4 |
| 3 | Overall an amazing experience with a generous ... | 4 |
| 4 | Food is a big part of my travel plans, and usu... | 3 |
| 5 | Rintaro is another restaurant that's been on t... | 5 |
| 6 | FYI: Courtyard seating is still inside the res... | 4 |
| 7 | Food ok, service top notchThis is my 3rd time ... | 4 |
| 8 | Rintaro is my favorite restaurant in San Franc... | 5 |
| 9 | Let's just agree that this may not be the best... | 4 |
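
Once every review has a score, a quick summary helps in reading the result. The sketch below copies the ten scores from the table above into a plain list and computes their average and distribution with the standard library. In the notebook itself, `df['sentiment'].mean()` and `df['sentiment'].value_counts()` should give the same information.

```python
from collections import Counter
from statistics import mean

# Scores copied from the sentiment column above.
scores = [5, 5, 4, 4, 3, 5, 4, 4, 5, 4]

average = mean(scores)
distribution = Counter(scores)

print(f'average: {average:.1f}')
for stars in sorted(distribution, reverse=True):
    print(f'{stars} stars: {distribution[stars]} review(s)')
```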
