#### **Install Dependencies**

**Install PyTorch**: Go to `pytorch.org`, select your configuration (in my case: Stable, Windows, Pip, Python, CPU), copy the generated command, and run it in a terminal or, prefixed with `!`, in a notebook cell.

In [2]:
!pip3 install torch torchvision torchaudio
# !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118     # gpu version

**Install Other Dependencies**

In [4]:
#pip install transformers requests beautifulsoup4 pandas numpy

* `transformers`: the Hugging Face library from which we will load the model `bert-base-multilingual-uncased-sentiment` to calculate the sentiment score.
> `bert-base-multilingual-uncased-sentiment`: This is a bert-base-multilingual-uncased model fine-tuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish and Italian. It predicts the sentiment of the review as a number of stars (between 1 and 5). [`For more details click here`](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment)
* `requests`: sends an HTTP request to a website and retrieves its content
* `beautifulsoup4`: parses the returned HTML so we can extract the data we need (imported as `bs4`)
* `pandas`: holds the data in a DataFrame for easier viewing and manipulation
* `numpy`: converts the list of reviews into an array before it is loaded into the DataFrame
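
Before moving on, it can help to confirm that these packages are importable in the current environment. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical, and the list uses import names, so `beautifulsoup4` appears as `bs4`.

```python
from importlib.util import find_spec

def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]

# Import names used in this notebook (beautifulsoup4 is imported as bs4)
required = ["torch", "transformers", "requests", "bs4", "pandas", "numpy"]
print("Missing packages:", missing_packages(required))
```

An empty list means every package was found; any name that is printed still needs to be installed.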

In [6]:
# Import modules
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests
from bs4 import BeautifulSoup
import re

#### **Instantiate Model**

We will now instantiate the tokenizer and the model and load its weights. The first load downloads the model (about 669 MB), which can take a few minutes depending on your internet connection.

In [7]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model     = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

Downloading (…)okenizer_config.json: 100%|██████████| 39.0/39.0 [00:00<00:00, 2.05kB/s]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Downloading (…)lve/main/config.json: 100%|██████████| 953/953 [00:00<00:00, 95.3kB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████| 872k/872k [00:00<00:00, 981kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 16.0kB/s]
Downloading pytorch_model.bin: 100%|██████████| 669M/669M [01:15<00:00, 8.90MB/s] 


#### **Encode and Calculate Sentiment Score**

**Encode** 

`Encode` converts the sentence's tokens into numeric IDs, because the model only accepts numeric input. The model then calculates the sentiment score from what it learned during training (it is a pretrained model). We can also `Decode` the encoded tensor back into text.

In [12]:
tokens = tokenizer.encode("He could not do it to me, never!", return_tensors='pt')
tokens

tensor([[  101, 10191, 12296, 10497, 10154, 10197, 10114, 10525,   117, 13362,
           106,   102]])

`return_tensors='pt'` is a parameter of the Hugging Face Transformers tokenizers. It tells the tokenizer to return PyTorch tensors instead of a list of Python integers.

The sentence has 10 tokens (8 words and 2 punctuation marks), and the tokenizer encoded each of them as a numeric ID. The first and last values of the tensor (101 and 102) are the start and end tokens respectively.
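
The count can be checked directly against the tensor printed above. This is a small sketch in plain Python, with the IDs copied from that output; the variable names are illustrative only. In BERT tokenizers the start and end tokens are written `[CLS]` and `[SEP]`.

```python
# Token IDs copied from the tensor printed above
ids = [101, 10191, 12296, 10497, 10154, 10197, 10114, 10525, 117, 13362, 106, 102]

cls_id, sep_id = ids[0], ids[-1]   # start and end tokens added by the tokenizer
content_ids = ids[1:-1]            # IDs that belong to the sentence itself

print(len(ids))          # 12 IDs in total
print(len(content_ids))  # 10 IDs once the start and end tokens are removed
```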

**Check `Decode`**

In [14]:
print(tokenizer.decode(tokens[0]))

[CLS] he could not do it to me, never! [SEP]


**Calculate Sentiment Score**

Before that, we have to understand the model's scoring system. The model rates a sentence from 1 to 5 stars: `1=negative`, `5=positive`, and the values in between indicate the intensity from negativity to positivity. We can consider `3=neutral`. The model outputs five scores, one for each rating, and the rating with the highest score is taken as the sentiment.

`Why 5 Scores?` The model is a five-class classifier, so it outputs one raw score (a logit) per star rating. Applying softmax to these logits turns them into probabilities that sum to 1, each one being the conditional probability of a rating given the input sentence.

`Conditional Probability`: P(A|B) = probability of A given that B has already happened. For example: P(fever|sick) = probability that a person has a fever given that they are sick.
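
As a quick numeric illustration of the definition, a conditional probability can be computed as a ratio of counts. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical counts, chosen only to illustrate the formula
sick = 40             # people who are sick
sick_with_fever = 10  # people who are sick and also have a fever

# P(fever | sick) = P(fever and sick) / P(sick); the population size cancels out
p_fever_given_sick = sick_with_fever / sick
print(p_fever_given_sick)  # 0.25
```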

In [15]:
probability_scores = model(tokens)
probability_scores

SequenceClassifierOutput(loss=None, logits=tensor([[ 2.6982,  1.5826, -0.0619, -1.9971, -1.7371]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

`Here, we only need the logits (the raw, unnormalized scores; despite the variable name, they are not yet probabilities)`

In [19]:
print(probability_scores.logits)

tensor([[ 2.6982,  1.5826, -0.0619, -1.9971, -1.7371]],
       grad_fn=<AddmmBackward0>)
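
The logits above are raw scores, so they can be negative and do not sum to 1. To read them as probabilities, softmax can be applied. The sketch below does this in plain Python with the values copied from the output above; the `softmax` helper is written here for illustration and is not part of the notebook's code.

```python
import math

# Logits copied from the output above, one per star rating (1 to 5)
logits = [2.6982, 1.5826, -0.0619, -1.9971, -1.7371]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
for stars, p in enumerate(probs, start=1):
    print(f"{stars} star(s): {p:.3f}")
```

Softmax preserves the ordering of the scores, so the largest probability lands on the same class as the largest logit, here the 1-star rating.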


`To see which rating has the highest value:`

In [22]:
print(int(torch.argmax(probability_scores.logits) + 1))
# Here,
# `torch.argmax()` - returns the index of the highest score. In this case it is 0 (indexing starts from 0)
# `+ 1` - shifts the index so that it matches the ratings, which run from 1 to 5
# `int` - converts the one-element tensor to a plain Python integer

1


`So, it is a negative sentiment`

**Check other Sentences**

In [26]:
sentence = "I just loved the documentation"
tokens = tokenizer.encode(sentence, return_tensors='pt')
probability_scores = model(tokens)
print(f"Logits: {probability_scores.logits}")
print(f"Ratings: {int(torch.argmax(probability_scores.logits)+1)}")

Logits: tensor([[-2.0370, -1.4861,  0.1390,  1.3571,  1.5039]],
       grad_fn=<AddmmBackward0>)
Ratings: 5


In [27]:
sentence = "There is a possibility but their attitude might destroy it."
tokens = tokenizer.encode(sentence, return_tensors='pt')
probability_scores = model(tokens)
print(f"Logits: {probability_scores.logits}")
print(f"Ratings: {int(torch.argmax(probability_scores.logits)+1)}")

Logits: tensor([[-0.1769,  0.8476,  1.2969,  0.1629, -1.7914]],
       grad_fn=<AddmmBackward0>)
Ratings: 3


#### **Collect Data from Online**
Let's collect some reviews from a website. We will classify these reviews using the model we used earlier. To collect the reviews we need a few things:
* `requests`: to send a request to the site
* `beautifulsoup`: to extract the information from the returned page
* `re` (regex): to match the class names of the elements that hold the review text

In [28]:
r = requests.get('https://www.yelp.com/biz/social-brew-cafe-pyrmont')
# Send a request to the site and get its content
soup = BeautifulSoup(r.text, 'html.parser')
# Parse the HTML with Beautiful Soup; 'r.text' is the response body as text
regex = re.compile('.*comment.*')
# Match any class name that contains 'comment', which is where the review text lives
results = soup.find_all('p', {'class': regex})
# Find all paragraph ('p') tags whose class matches the regex
reviews = [result.text for result in results]
# Build a list with the text of every matching paragraph
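
The position of `*` in the class-name pattern matters. The sketch below compares `.*comment.*` with the look-alike `.*comment*.`, in which `*` applies only to the final `t`. The class names are hypothetical and not taken from the real page. It uses `search`, which is also how Beautiful Soup applies a compiled pattern to each class name.

```python
import re

# Hypothetical class names, for illustration only (not taken from the real page)
class_names = ["comment__abc123", "commen-box", "review-text"]

loose = re.compile('.*comment*.')    # 't*' makes the final 't' optional, then any one character
strict = re.compile('.*comment.*')   # any text, the word 'comment', any text

for name in class_names:
    print(name, bool(loose.search(name)), bool(strict.search(name)))
```

Only the second pattern rejects `commen-box`, so it is the safer choice when the goal is to match class names that contain the full word `comment`.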


In [29]:
reviews

['Some of the best Milkshakes me and my daughter ever tasted. MMMMMM HMMMMMMMM.',
 "Six of us met here for breakfast before our walk to Manly. We were enjoying visiting with each other so much that I apologize for not taking any photos. We all enjoyed our food, as well as our coffee and tea drinks.We were greeted immediately by a friendly server asking if we would like to sit inside or out. We said we would like inside, but weren't exactly sure how many were joining us yet- at least 4. We were told this was no problem, the more the merrier. A few minutes later when 4 more joined our party and we explained to the server we had 6, he just quickly switched our table. I really enjoyed my serenity tea, just what I needed after a long flight in from Sfo that morning. Everyone else were more interested in the lattes for expresso drinks. All said they were hot and delicious. 2 of us ordered the avo on toast. So yummy with the beetroot... I will start adding this to mine now at home, and have f

#### **Load Data into DataFrame**
We will load the data into a pandas DataFrame, first converting the list of reviews into a NumPy array. A DataFrame makes the data easier to view and to modify.

In [39]:
import pandas as pd
import numpy as np

# Convert data into DataFrame
data_frame = pd.DataFrame(np.array(reviews), columns=['review'])
# This creates a dataframe with a single column named 'review' that holds all the reviews.
data_frame

| | review |
|---|---|
| 0 | Some of the best Milkshakes me and my daughter... |
| 1 | Six of us met here for breakfast before our wa... |
| 2 | Great place with delicious food and friendly s... |
| 3 | Great food amazing coffee and tea. Short walk ... |
| 4 | It was ok. Had coffee with my friends. I'm new... |
| 5 | Ricotta hot cakes! These were so yummy. I ate ... |
| 6 | Great staff and food. Must try is the pan fri... |
| 7 | We came for brunch twice in our week-long visi... |
| 8 | I came to Social brew cafe for brunch while ex... |
| 9 | It was ok. The coffee wasn't the best but it w... |


In [33]:
# Let's check the first and the last review
print(data_frame['review'].iloc[0])     # 1st review
print(data_frame['review'].iloc[-1])    # Last review

Some of the best Milkshakes me and my daughter ever tasted. MMMMMM HMMMMMMMM.
It was ok. The coffee wasn't the best but it was fine. The relish on the breakfast roll was yum which did make it sing. So perhaps I just got a bad coffee but the food was good on my visit.


#### **Calculate Sentiment Score**
We can loop over the dataframe and calculate the sentiment of every review using the model we used earlier. That means repeating all the earlier steps: tokenizing and encoding, calculating the scores, and converting them to a rating. We can write a function that performs all of these steps and returns only the rating.

**Make a Rating Function**

In [40]:
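def calculate_rating(review):
    # Encode the review as a tensor of token IDs
    tokens = tokenizer.encode(review, return_tensors='pt')
    # Run the model to get one raw score (logit) per star rating
    probability_scores = model(tokens)
    # The index of the highest score, plus 1, is the rating (1 to 5)
    return int(torch.argmax(probability_scores.logits) + 1)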

In [41]:
print(calculate_rating(data_frame['review'].iloc[0]))   # For 1st review
print(calculate_rating(data_frame['review'].iloc[-2]))   # For 2nd last review

5
5


**Make a Dataframe with ratings** A dataframe lets us add a new column, `rating`, alongside the reviews. For this, we will use the `apply()` method and a `lambda` function.
* `apply()`: applies a function to every value in the column
* `lambda`: defines an anonymous (unnamed) function inline; its body can be any single expression

In [42]:
data_frame['rating'] = data_frame['review'].apply(lambda x: calculate_rating(x[:512]))
# Here, we add a 'rating' column to the dataframe that holds the rating returned by the lambda function.
# The lambda takes each review as x and passes it to calculate_rating.
# 'x[:512]' keeps only the first 512 characters of each review. The model accepts at most 512 tokens
# per input, and slicing the characters is a rough way to stay under that limit.
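
Note that `x[:512]` limits characters, not tokens. As an alternative, the tokenizer can truncate at the token level through the `truncation` and `max_length` arguments of `encode`. The following is a minimal sketch, assuming the `tokenizer` and `model` objects created earlier are passed in; the function name `calculate_rating_truncated` is hypothetical.

```python
def calculate_rating_truncated(review, tokenizer, model, max_length=512):
    """Rate a review from 1 to 5, truncating it to at most max_length tokens."""
    tokens = tokenizer.encode(
        review,
        return_tensors='pt',
        truncation=True,        # drop tokens beyond max_length
        max_length=max_length,  # the limit counts special tokens such as [CLS] and [SEP]
    )
    logits = model(tokens).logits
    return int(logits.argmax()) + 1   # index 0..4 maps to 1..5 stars
```

With this variant the column could be built as `data_frame['review'].apply(lambda x: calculate_rating_truncated(x, tokenizer, model))`, so that long reviews are cut at a token boundary instead of at an arbitrary character.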

In [43]:
data_frame

| | review | rating |
|---|---|---|
| 0 | Some of the best Milkshakes me and my daughter... | 5 |
| 1 | Six of us met here for breakfast before our wa... | 4 |
| 2 | Great place with delicious food and friendly s... | 5 |
| 3 | Great food amazing coffee and tea. Short walk ... | 5 |
| 4 | It was ok. Had coffee with my friends. I'm new... | 3 |
| 5 | Ricotta hot cakes! These were so yummy. I ate ... | 5 |
| 6 | Great staff and food. Must try is the pan fri... | 5 |
| 7 | We came for brunch twice in our week-long visi... | 4 |
| 8 | I came to Social brew cafe for brunch while ex... | 5 |
| 9 | It was ok. The coffee wasn't the best but it w... | 3 |


`Here, we see that the dataframe now has a rating column that holds the corresponding rating for each review`
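
To get a quick overview of the results, the ratings can be counted and averaged. The sketch below uses plain Python with the values copied from the `rating` column shown above; in the notebook itself, something like `data_frame['rating'].value_counts()` would likely serve the same purpose.

```python
from collections import Counter

# Ratings copied from the 'rating' column shown above
ratings = [5, 4, 5, 5, 3, 5, 5, 4, 5, 3]

counts = Counter(ratings)               # how many reviews received each rating
average = sum(ratings) / len(ratings)   # mean star rating across all reviews

for stars in sorted(counts, reverse=True):
    print(f"{stars} stars: {counts[stars]} review(s)")
print(f"Average rating: {average:.1f}")
```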