<a href="https://colab.research.google.com/github/colivarese/Sentiment-Analysis-with-BERT-and-Web-Scrapping/blob/main/Sentiment_Analysis_using_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> Sentiment Analysis using BERT and Web Scraping 🤬 -> 😄

In this notebook, we first load a pre-trained BERT model from NLPTown.<br>
The selected model takes a string as input and outputs an integer from 1 to 5, where 1 is the most negative sentiment and 5 is the most positive. ‼️

In [None]:
!pip install -q gwpy



# Install the transformers library

In [None]:
%%capture
!pip install transformers

# Import dependencies

* We will load BERT into a PyTorch model (**import torch**)
* From the transformers library we will import **AutoTokenizer** to convert a string into tokens (numeric IDs that the model can interpret).
* From the transformers library we will import **AutoModelForSequenceClassification** to load the pre-trained model with its weights.
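
To give a rough idea of what "converting a string into tokens" means, here is a toy sketch. The vocabulary and IDs below are made up for illustration; the real tokenizer uses a learned WordPiece vocabulary and splits unknown words into sub-word pieces, which this sketch does not do.

```python
# Toy vocabulary: these words and IDs are hypothetical, not the real BERT vocabulary.
toy_vocab = {'[CLS]': 0, '[SEP]': 1, '[UNK]': 2, 'just': 3, 'fine': 4}

def toy_encode(text):
    # Lowercase and split on whitespace (the real tokenizer is more sophisticated)
    words = text.lower().split()
    # Map each word to its ID, falling back to the "unknown" token
    ids = [toy_vocab.get(word, toy_vocab['[UNK]']) for word in words]
    # BERT-style models wrap every sequence in special start and end tokens
    return [toy_vocab['[CLS]']] + ids + [toy_vocab['[SEP]']]

print(toy_encode('Just fine'))  # [0, 3, 4, 1]
```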





In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Initialize BERT Model

In [None]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')


# <center> The model is now loaded and ready to use! ✅

## Define some strings to test the model, each expressing a different sentiment.


1.   Create a list of strings.
2.   Convert the list into a Pandas DataFrame to make it easier to work with.
3.   Use the model on each string (row).
4.   Add the result to each row.



In [None]:
strings = ['I hate this show, it is awful',
           'This is bad, but it could be worst',
           'It was kind of good',
           'Just fine',
           'This is the best show I have ever seen']

# Import dependencies to create the Pandas DataFrame 🐼

In [None]:
import numpy as np
import pandas as pd

# Create the DataFrame from the list of strings 📝

In [None]:
df = pd.DataFrame(np.array(strings), columns=['Examples'])
df.head()

| | Examples |
|---|---|
| 0 | I hate this show, it is awful |
| 1 | This is bad, but it could be worst |
| 2 | It was kind of good |
| 3 | Just fine |
| 4 | This is the best show I have ever seen |


# Create a function that uses the model to predict the sentiment of a single review; we will apply it to each row 🤖

In [None]:
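def review_score(review):
  # Convert the text into a tensor of token IDs (PyTorch format)
  tokens = tokenizer.encode(review, return_tensors = 'pt')
  # Run the model; the logits hold one score per rating class
  sentiment = model(tokens)
  # argmax gives a class index from 0 to 4, so add 1 to get a 1-5 rating
  return int(torch.argmax(sentiment.logits)) + 1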
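
To make the `argmax(...) + 1` step concrete, here is a small sketch in plain Python. The logits are made-up numbers, not real model output; a softmax is included only to show how the same scores could be read as probabilities.

```python
import math

# Hypothetical logits for one review: one score per rating class (1 to 5 stars)
logits = [-2.1, -1.3, 0.2, 1.9, 3.4]

# Index of the largest logit (0 to 4), shifted by 1 to get a 1-5 rating
rating = logits.index(max(logits)) + 1

# Softmax turns the logits into probabilities that sum to 1
# (subtracting the maximum first keeps the exponentials numerically stable)
exps = [math.exp(value - max(logits)) for value in logits]
probs = [e / sum(exps) for e in exps]

print(rating)
print([round(p, 3) for p in probs])
```

The rating is simply the position of the highest score, so the softmax does not change the prediction; it only makes the model's confidence easier to read.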

# Use the function on the DataFrame!

In [None]:
df['Predicted Sentiment'] = df['Examples'].apply(lambda x: review_score(x[:512]))
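
Note that `x[:512]` cuts the string by characters, while BERT's 512 limit applies to tokens. One possible alternative, sketched here, is to let the tokenizer truncate at the token level. The function name `review_score_truncated` and the `max_tokens` parameter are hypothetical; the tokenizer and model are passed in as arguments and are assumed to be the ones loaded earlier.

```python
def review_score_truncated(review, tokenizer, model, max_tokens=512):
    # Let the tokenizer cut the input at max_tokens tokens instead of slicing characters
    encoded = tokenizer(review, return_tensors='pt',
                        truncation=True, max_length=max_tokens)
    # The tokenizer returns a dict-like object (input_ids, attention_mask, ...)
    outputs = model(**encoded)
    # Class index 0-4 becomes a 1-5 rating
    return int(outputs.logits.argmax()) + 1
```

It could then be applied with `df['Examples'].apply(lambda x: review_score_truncated(x, tokenizer, model))`.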

In [None]:
df.head()

| | Examples | Predicted Sentiment |
|---|---|---|
| 0 | I hate this show, it is awful | 1 |
| 1 | This is bad, but it could be worst | 2 |
| 2 | It was kind of good | 3 |
| 3 | Just fine | 4 |
| 4 | This is the best show I have ever seen | 5 |


# Each sentence represents a different level of sentiment, and thus gets a different integer.

## Let's now use the model on some real-world sentences.
---
### For this we will use the requests library to scrape reviews of a movie from Rotten Tomatoes 🍅 and then check the predicted sentiments



# Import dependencies for web scraping 🌐


*   Import requests to get the HTML of a webpage.
*   Import BeautifulSoup to parse the retrieved HTML and make it easier to search.
*   Import re (Regular Expressions) to match the class name of the elements we want.



In [None]:
import requests
from bs4 import BeautifulSoup
import re

# Check the website

![image](https://drive.google.com/uc?export=view&id=1E8tde63nCez_7DWsKXTNtrwq8BsH9Yr6)

# Inspect the page

![image](https://drive.google.com/uc?export=view&id=1aFjkgE9Td0k6N9aa3R4v5uTESi1nd0Hb)

# By inspecting a review, we can find the element that contains the text of the review. 🔍

![image](https://drive.google.com/uc?export=view&id=1lQXal8cZ4ccYwrk8xnuuzfp3jZcI--Et)

## The review is inside a div with the class "the_review"; we will use this class name in our regex to retrieve only the reviews.

In [None]:
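# Download the reviews page
r = requests.get("https://www.rottentomatoes.com/m/spider_man_no_way_home/reviews")
# Parse the HTML so it can be searched by tag and class
soup = BeautifulSoup(r.text, 'html.parser')
# Match any class name containing "the_review"
regex = re.compile(".*the_review.*")
results = soup.find_all('div', {'class': regex})
# Keep only the text of each matching div
reviews = [result.text for result in results]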
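
To see how the class lookup works without hitting the network, here is a minimal sketch on a small hand-written HTML snippet. The snippet, its review texts, and the `review_date` class are made up for illustration; only the `the_review` class name comes from the page inspected above. The sketch uses BeautifulSoup's `class_` keyword as a simpler alternative to the regex.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the structure seen in the inspector
html = """
<div class="review_table_row">
  <div class="the_review">  Full of heart.  </div>
  <div class="review_date">Jan 1</div>
  <div class="the_review">  Just fine.  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# class_ matches every div that has "the_review" among its CSS classes
results = soup.find_all('div', class_='the_review')

# get_text(strip=True) also removes the surrounding whitespace
reviews = [result.get_text(strip=True) for result in results]
print(reviews)  # ['Full of heart.', 'Just fine.']
```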

## Check the reviews list. 📝

In [None]:
reviews

['\r\n                    Who knows where the character would go next, but as far as Spidey films go, this seems unbeatable.\r\n                ',
 "\r\n                    Ultimately, it's pulled together in a way that will make you fall in love with Spider-Man all over again.\r\n                ",
 '\r\n                    Full of heart and everything we love in superhero movies.\r\n                ',
 '\r\n                    As fun as it is strained in equal parts. [Full review in Spanish]\r\n                ',
 "\r\n                    An entertaining story that manages to meet the public's expectations. [Full review in Spanish]\r\n                ",
 '\r\n                    One of the most emotional adventures ever produced by the Marvel Cinematic Universe. [Full review in Spanish]\r\n                ',
 "\r\n                    No Way Home ultimately winds up setting Spidey atop the pantheon as the MCU's greatest and most complete trilogy.\r\n                ",
 '\r\n          

## The list contains only the reviews of the movie!
### We could clean the strings, but let's keep them like this for now. 🤓
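
If we did want to clean the strings, a minimal sketch could look like the following. The helper name `clean_review` is hypothetical, and removing the "[Full review in ...]" note is an optional choice rather than something the notebook requires.

```python
import re

def clean_review(text):
    # Drop notes such as "[Full review in Spanish]"
    text = re.sub(r'\[Full review in [^\]]*\]', '', text)
    # Collapse runs of whitespace (including \r\n) into single spaces and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

raw = ('\r\n                    As fun as it is strained in equal parts. '
       '[Full review in Spanish]\r\n                ')
print(clean_review(raw))  # As fun as it is strained in equal parts.
```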

# Now, let's reuse the previous function to create a DataFrame of the reviews and predict the sentiment of each one.

In [None]:
df = pd.DataFrame(np.array(reviews), columns=['review'])
df.head()

| | review |
|---|---|
| 0 | \r\n Who knows where the ch... |
| 1 | \r\n Ultimately, it's pulle... |
| 2 | \r\n Full of heart and ever... |
| 3 | \r\n As fun as it is strain... |
| 4 | \r\n An entertaining story ... |


In [None]:
df['rating'] = df['review'].apply(lambda x: review_score(x[:512]))
df.head()

| | review | rating |
|---|---|---|
| 0 | \r\n Who knows where the ch... | 2 |
| 1 | \r\n Ultimately, it's pulle... | 4 |
| 2 | \r\n Full of heart and ever... | 5 |
| 3 | \r\n As fun as it is strain... | 5 |
| 4 | \r\n An entertaining story ... | 4 |


## Great! Each review now has an assigned sentiment. Let's count how many reviews received each rating.

In [None]:
df.groupby('rating').count()

| rating | review (count) |
|---|---|
| 2 | 1 |
| 3 | 4 |
| 4 | 8 |
| 5 | 7 |


## Most of the reviews (15 of 20) fall in the 4-5 range, which suggests the reviewers liked the movie! 😄 <br> Spider-Man is cool. 🕷🎬
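
The counts in the table above can be summarized with a little arithmetic. This sketch copies the four counts by hand into a plain dictionary; the variable names are illustrative.

```python
# Counts copied from the groupby output above: rating -> number of reviews
counts = {2: 1, 3: 4, 4: 8, 5: 7}

total = sum(counts.values())
positive = counts[4] + counts[5]

# Share of reviews rated 4 or 5, and the average predicted rating
share = positive / total
mean = sum(rating * n for rating, n in counts.items()) / total

print(f'{positive}/{total} reviews rated 4-5 ({share:.0%}), mean rating {mean:.2f}')
# 15/20 reviews rated 4-5 (75%), mean rating 4.05
```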

# The code can be adapted to other websites by changing the URL and the class name used in the regex. Feel free to try!