# Fine-tuned MultilingualBERT Sentiment Analyser (SA) for CX web app tool

#### Author: Felipe Valencia - Data Scientist at Dataplicada


## Project Introduction

This program is one component of a larger initiative to improve the accuracy of sentiment analysis in customer experience tools. We are comparing four sentiment analysis models: VADER (Valence Aware Dictionary and sEntiment Reasoner), TextBlob, a fine-tuned MultilingualBERT uncased-sentiment model, and a fine-tuned checkpoint of the DistilBERT-base-uncased model. Our goal is to identify the most effective approach for implementing sentiment analysis in a web-based feedback tool that lets businesses upload multiple comments and reviews for evaluation.

## Fine-tuned MultilingualBERT for SA Introduction

The program analyses customer feedback using the fine-tuned BERT-base-multilingual-uncased-sentiment* model, classifying each review on a 1-to-5 star scale that can then be used for deeper data analysis.

_*[Fine-tuned MultilingualBERT](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment) is a bert-base-multilingual-uncased model finetuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish, and Italian. It predicts the sentiment of the review as a number of stars (between 1 and 5)._ [MIT License](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md).


**Note:** _As we progress, we will rigorously test VADER, TextBlob, MultilingualBERT, and DistilBERT-base-uncased for this classification task, prioritising accuracy while also considering factors such as server storage, speed, and CPU usage. This comprehensive analysis will ensure we choose the best sentiment analysis option for our users, ultimately enhancing their understanding of customer feedback and improving overall service quality._

In [1]:
# Load libraries
import pandas as pd


In [2]:
# Read CSV

data_file = pd.read_csv("raw_datasets/Datafiniti_Hotel_Reviews.csv")

In [3]:
# Convert ratings from float to integer

data_file['reviews.rating'] = data_file['reviews.rating'].astype(int)

# Convert text to string

data_file['reviews.text'] = data_file['reviews.text'].astype(str)

In [4]:
# Simplify the dataframe (copy it so later column assignments do not trigger SettingWithCopyWarning)

data = data_file[['id', 'reviews.rating', 'reviews.text']].copy()

In [5]:
data

|  | id | reviews.rating | reviews.text |
|---|---|---|---|
| 0 | AVwc252WIN2L1WUfpqLP | 5 | Our experience at Rancho Valencia was absolute... |
| 1 | AVwc252WIN2L1WUfpqLP | 5 | Amazing place. Everyone was extremely warm and... |
| 2 | AVwc252WIN2L1WUfpqLP | 5 | We booked a 3 night stay at Rancho Valencia to... |
| 3 | AVwdOclqIN2L1WUfti38 | 2 | Currently in bed writing this for the past hr ... |
| 4 | AVwdOclqIN2L1WUfti38 | 5 | I live in Md and the Aloft is my Home away fro... |
| ... | ... | ... | ... |
| 9995 | AVwd4TMv_7pvs4fz-Ers | 3 | It is hard for me to review an oceanfront hote... |
| 9996 | AVwdRp4DIN2L1WUfuGZZ | 4 | I live close by, and needed to stay somewhere ... |
| 9997 | AVwd1TbkByjofQCxs6FH | 4 | Rolled in 11:30 laid out heads down woke up to... |
| 9998 | AVwdHbizIN2L1WUfsXto | 1 | Absolutely terrible..I was told I was being gi... |


In [9]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm
import time
import numpy as np

In [10]:
# Load model and tokenizer
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float32
)



In [11]:
# Move to GPU if available
if torch.cuda.is_available():
    model = model.cuda()

# Ensure model is in evaluation mode
model.eval()

def process_in_batches(texts, batch_size=64):
    """Return a 1-5 star prediction for each text in a pandas Series, processed in batches."""
    sentiments = []
    texts_list = texts.tolist()
    
    for i in tqdm(range(0, len(texts_list), batch_size)):
        batch = texts_list[i:i + batch_size]
        inputs = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
        
        if torch.cuda.is_available():
            inputs = {k: v.cuda() for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = model(**inputs)
            # Logits cover five classes (index 0-4); argmax picks the most likely class
            predictions = torch.argmax(outputs.logits, dim=1)
            scores = predictions.cpu().numpy() + 1  # Shift class index 0-4 to the 1-5 star scale
            sentiments.extend(scores.tolist())
    
    return sentiments
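
The `argmax(...) + 1` step in `process_in_batches` can be illustrated without loading the model. The sketch below uses plain Python and made-up logits (an assumption for illustration, not real model output) to show how five class scores map to a star rating, and how a softmax would give the probability attached to that rating. The helper names `softmax` and `logits_to_stars` are hypothetical and are not part of the notebook.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_stars(logits):
    """Return (stars, probability) for one review's five class scores."""
    probs = softmax(logits)
    best_index = max(range(len(probs)), key=probs.__getitem__)  # argmax, in 0-4
    return best_index + 1, probs[best_index]  # shift to the 1-5 star scale


# Made-up logits for a single review (illustrative values only)
example_logits = [-2.1, -1.3, 0.2, 1.9, 3.4]
stars, confidence = logits_to_stars(example_logits)
print(stars, round(confidence, 2))
```

The notebook keeps only the predicted star value. Keeping the probability as well, as sketched here, may help flag low-confidence predictions for manual review.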

In [12]:
# Process the reviews
print("Starting sentiment analysis...")
start_time = time.time()

valid_reviews = data['reviews.text'].dropna()
valid_indices = valid_reviews.index
sentiments = process_in_batches(valid_reviews)

# Update the DataFrame; assignment aligns on the index, so rows without a prediction stay missing (NaN)
data['multilingual.sentiment'] = pd.Series(sentiments, index=valid_indices)

print(f"Processing completed in {(time.time() - start_time) / 60:.2f} minutes")

Starting sentiment analysis...


100%|██████████| 157/157 [26:22<00:00, 10.08s/it]

Processing completed in 26.37 minutes
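
The progress bar and the reported run time are consistent with the batch size of 64 used in `process_in_batches`. As a quick arithmetic check, the sketch below uses only the figures printed in the output (157 batches, 26:22 elapsed) to recover the range of row counts that would need exactly 157 batches and the average time per batch.

```python
import math

batch_size = 64            # default in process_in_batches
num_batches = 157          # from the tqdm progress bar
elapsed_s = 26 * 60 + 22   # 26:22 from the progress bar

# Smallest and largest row counts that need exactly 157 batches
min_rows = (num_batches - 1) * batch_size + 1
max_rows = num_batches * batch_size
assert math.ceil(min_rows / batch_size) == num_batches
assert math.ceil(max_rows / batch_size) == num_batches

seconds_per_batch = elapsed_s / num_batches
print(min_rows, max_rows)            # 9985 10048
print(round(seconds_per_batch, 2))   # 10.08
```

At roughly 10 seconds per batch of 64, this run spends about 0.16 seconds per review, a useful baseline when comparing speed against the other models.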





In [13]:
data

|  | id | reviews.rating | reviews.text | multilingual.sentiment |
|---|---|---|---|---|
| 0 | AVwc252WIN2L1WUfpqLP | 5 | Our experience at Rancho Valencia was absolute... | 5 |
| 1 | AVwc252WIN2L1WUfpqLP | 5 | Amazing place. Everyone was extremely warm and... | 5 |
| 2 | AVwc252WIN2L1WUfpqLP | 5 | We booked a 3 night stay at Rancho Valencia to... | 5 |
| 3 | AVwdOclqIN2L1WUfti38 | 2 | Currently in bed writing this for the past hr ... | 1 |
| 4 | AVwdOclqIN2L1WUfti38 | 5 | I live in Md and the Aloft is my Home away fro... | 5 |
| ... | ... | ... | ... | ... |
| 9995 | AVwd4TMv_7pvs4fz-Ers | 3 | It is hard for me to review an oceanfront hote... | 4 |
| 9996 | AVwdRp4DIN2L1WUfuGZZ | 4 | I live close by, and needed to stay somewhere ... | 5 |
| 9997 | AVwd1TbkByjofQCxs6FH | 4 | Rolled in 11:30 laid out heads down woke up to... | 5 |
| 9998 | AVwdHbizIN2L1WUfsXto | 1 | Absolutely terrible..I was told I was being gi... | 1 |
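
The output places the reviewer's own `reviews.rating` next to the model's `multilingual.sentiment`, so agreement between the two can be measured directly. The sketch below computes exact-match accuracy, within-one-star accuracy, and mean absolute error in plain Python. It uses only the nine rows visible in the truncated display above, so the numbers illustrate the calculation and are not results for the full dataset. The function name `agreement_metrics` is hypothetical.

```python
def agreement_metrics(actual, predicted):
    """Compare two equal-length lists of 1-5 star ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(a - p) for a, p in zip(actual, predicted)]
    n = len(diffs)
    return {
        "exact_accuracy": sum(d == 0 for d in diffs) / n,
        "within_one_star": sum(d <= 1 for d in diffs) / n,
        "mean_absolute_error": sum(diffs) / n,
    }


# The nine rows shown in the display above (not the full dataset)
ratings = [5, 5, 5, 2, 5, 3, 4, 4, 1]
predictions = [5, 5, 5, 1, 5, 4, 5, 5, 1]

metrics = agreement_metrics(ratings, predictions)
print(metrics)
```

On the full DataFrame, the same function could be called with `data['reviews.rating'].tolist()` and `data['multilingual.sentiment'].tolist()`, assuming every row received a prediction.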


In [14]:
# Save DataFrame to CSV
data.to_csv('output_MultilingualBert.csv', index=False)  # Set index=False to avoid saving row indices