# Ensemble Sentiment Analyser

## Author: Felipe Valencia

This project tests the accuracy of several sentiment analysis libraries and combines them into an ensemble-style model, with the aim of getting the best possible classification of review sentiment on a 5-star scale.
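The notebook does not spell out how the individual models are combined or which accuracy metric is used, so the following is only a minimal sketch under two assumptions: each model outputs an integer rating from 1 to 5, and the ensemble takes the mean of those ratings rounded half up. The function names `ensemble_stars` and `accuracy` and the sample values are hypothetical.

```python
def ensemble_stars(predictions):
    """Combine per-model star ratings (1-5) by rounding their mean half up."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    mean = sum(predictions) / len(predictions)
    # int(x + 0.5) rounds half up for positive x; clamp to the 1-5 range.
    return min(5, max(1, int(mean + 0.5)))


def accuracy(true_ratings, predicted_ratings):
    """Fraction of reviews whose predicted rating matches the true rating exactly."""
    if not true_ratings or len(true_ratings) != len(predicted_ratings):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(t == p for t, p in zip(true_ratings, predicted_ratings))
    return matches / len(true_ratings)


# Illustrative values only, not results from this project.
model_a = [5, 4, 2]
model_b = [5, 3, 1]
model_c = [4, 5, 3]

# One combined rating per review, built from the three models' votes.
combined = [ensemble_stars(votes) for votes in zip(model_a, model_b, model_c)]
```

Exact-match accuracy is strict for a 5-class problem; a tolerance-based metric (for example, counting predictions within one star) may be a useful complement, though that choice is also an assumption here.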

In [2]:
# Install libraries
#!pip install torch
#!pip install transformers

Collecting transformers
  Using cached transformers-4.46.0-py3-none-any.whl.metadata (44 kB)
Collecting tokenizers<0.21,>=0.20 (from transformers)
  Using cached tokenizers-0.20.1-cp311-none-win_amd64.whl.metadata (6.9 kB)
Using cached transformers-4.46.0-py3-none-any.whl (10.0 MB)
Using cached tokenizers-0.20.1-cp311-none-win_amd64.whl (2.4 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.13.3
    Uninstalling tokenizers-0.13.3:
      Successfully uninstalled tokenizers-0.13.3
Successfully installed tokenizers-0.20.1 transformers-4.46.0





In [3]:
# Load libraries
import pandas as pd


In [4]:
# Read CSV

data_file = pd.read_csv("Datafiniti_Hotel_Reviews.csv")

In [5]:
# Convert ratings from float to integer

data_file['reviews.rating'] = data_file['reviews.rating'].astype(int)

# Convert text to string

data_file['reviews.text'] = data_file['reviews.text'].astype(str)
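Two details of these conversions are worth keeping in mind: `astype(int)` raises an error if the column contains missing values, and `astype(str)` has traditionally turned a missing value into the literal string `"nan"`. A minimal sketch of dropping incomplete rows before converting, assuming that discarding such rows is acceptable for this analysis (the tiny DataFrame below is made up for illustration):

```python
import pandas as pd

# Hypothetical sample with one missing rating and one missing review text.
reviews = pd.DataFrame(
    {
        "reviews.rating": [5.0, None, 3.0],
        "reviews.text": ["Great stay", "Fine", None],
    }
)

# Drop rows lacking either field, then convert types on an explicit copy.
clean = reviews.dropna(subset=["reviews.rating", "reviews.text"]).copy()
clean["reviews.rating"] = clean["reviews.rating"].astype(int)
clean["reviews.text"] = clean["reviews.text"].astype(str)
```

Only the first row survives here, because each of the other two rows is missing a field.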

In [6]:
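# Keep only the columns needed for the analysis.
# .copy() makes this an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning.

data = data_file[['id', 'reviews.rating', 'reviews.text']].copy()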

In [7]:
data

| | id | reviews.rating | reviews.text |
|---|---|---|---|
| 0 | AVwc252WIN2L1WUfpqLP | 5 | Our experience at Rancho Valencia was absolute... |
| 1 | AVwc252WIN2L1WUfpqLP | 5 | Amazing place. Everyone was extremely warm and... |
| 2 | AVwc252WIN2L1WUfpqLP | 5 | We booked a 3 night stay at Rancho Valencia to... |
| 3 | AVwdOclqIN2L1WUfti38 | 2 | Currently in bed writing this for the past hr ... |
| 4 | AVwdOclqIN2L1WUfti38 | 5 | I live in Md and the Aloft is my Home away fro... |
| ... | ... | ... | ... |
| 9995 | AVwd4TMv_7pvs4fz-Ers | 3 | It is hard for me to review an oceanfront hote... |
| 9996 | AVwdRp4DIN2L1WUfuGZZ | 4 | I live close by, and needed to stay somewhere ... |
| 9997 | AVwd1TbkByjofQCxs6FH | 4 | Rolled in 11:30 laid out heads down woke up to... |
| 9998 | AVwdHbizIN2L1WUfsXto | 1 | Absolutely terrible..I was told I was being gi... |


In [9]:
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

In [10]:
# Load the tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Set the model to evaluation mode
model.eval()

# Define the sentiment function using DistilBERT
def sentiment_distilbert(text):
    """
    Estimate the polarity of a text string with DistilBERT (fine-tuned on SST-2).
    The positive-class probability is mapped to a rating from 1 to 5 stars.
    """
    inputs = tokenizer(text, return_tensors="pt")
    
    with torch.no_grad():  # Disable gradient calculation
        logits = model(**inputs).logits
    
    probabilities = torch.nn.functional.softmax(logits, dim=-1)  # Get probabilities
    positive_prob = probabilities[0][1].item()  # Probability of positive sentiment
    negative_prob = probabilities[0][0].item()  # Probability of negative sentiment

    # Map probabilities to a 1-5 star rating.
    # The two probabilities sum to 1, so the cut-offs below split the range evenly.
    # You can adjust these thresholds as necessary.
    if positive_prob > 0.8:
        return 5  # Very positive
    elif positive_prob > 0.6:
        return 4  # Positive
    elif positive_prob > 0.4:
        return 3  # Neutral
    elif negative_prob < 0.8:  # equivalent to positive_prob > 0.2
        return 2  # Negative
    else:
        return 1  # Very negative
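
Because the two softmax outputs sum to 1, the star rating is fully determined by the positive-class probability. A minimal sketch of the mapping as a threshold lookup, assuming evenly spaced cut-offs at 0.2, 0.4, 0.6 and 0.8 (the name `probability_to_stars` and the lowest cut-off are illustrative, not taken from the notebook):

```python
from bisect import bisect_left

# Cut-offs between 1|2, 2|3, 3|4 and 4|5 stars (assumed, evenly spaced).
STAR_THRESHOLDS = (0.2, 0.4, 0.6, 0.8)


def probability_to_stars(positive_prob, thresholds=STAR_THRESHOLDS):
    """Return 1 plus the number of thresholds strictly below positive_prob."""
    if not 0.0 <= positive_prob <= 1.0:
        raise ValueError("positive_prob must lie in [0, 1]")
    # bisect_left counts the thresholds that are strictly less than the value,
    # which mirrors a chain of `positive_prob > threshold` comparisons.
    return bisect_left(thresholds, positive_prob) + 1
```

Keeping the cut-offs in one tuple makes them easy to tune, for instance against the true `reviews.rating` values.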




In [11]:
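# Apply the sentiment function to the DataFrame.
# Missing reviews became the string "nan" after astype(str); mark them as pd.NA rather than an empty string.
data['distilbert.sentiment'] = data['reviews.text'].apply(lambda x: sentiment_distilbert(x) if x != "nan" else pd.NA)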

# Display the results
data

Token indices sequence length is longer than the specified maximum sequence length for this model (633 > 512). Running this sequence through the model will result in indexing errors


RuntimeError: The size of tensor a (633) must match the size of tensor b (512) at non-singleton dimension 1
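
The error above occurs because DistilBERT has only 512 position embeddings, while at least one review tokenizes to 633 tokens. The simplest remedy is to pass `truncation=True, max_length=512` to the tokenizer call, which drops everything after the first 512 tokens. An alternative that keeps the whole review is to split the token IDs into overlapping windows, score each window, and average the probabilities. The sketch below shows only the windowing step on a plain list of token IDs; the helper name, the 510-token window (leaving room for the `[CLS]` and `[SEP]` tokens) and the 50-token overlap are assumptions for illustration.

```python
def split_into_windows(token_ids, max_len=510, overlap=50):
    """Split token IDs into windows of at most max_len with the given overlap."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]

    step = max_len - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(list(token_ids[start:start + max_len]))
        # Stop once a window has reached the end of the sequence.
        if start + max_len >= len(token_ids):
            break
    return windows


# A 633-token review, as in the warning above, represented by dummy IDs.
windows = split_into_windows(list(range(633)))
```

Each window would then be wrapped with the special tokens and passed through the model separately. Averaging the per-window positive probabilities is one possible way to get a single score per review.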

In [10]:
data

| | id | reviews.rating | reviews.text | vader.sentiment |
|---|---|---|---|---|
| 0 | AVwc252WIN2L1WUfpqLP | 5 | Our experience at Rancho Valencia was absolute... | 5 |
| 1 | AVwc252WIN2L1WUfpqLP | 5 | Amazing place. Everyone was extremely warm and... | 5 |
| 2 | AVwc252WIN2L1WUfpqLP | 5 | We booked a 3 night stay at Rancho Valencia to... | 5 |
| 3 | AVwdOclqIN2L1WUfti38 | 2 | Currently in bed writing this for the past hr ... | 3 |
| 4 | AVwdOclqIN2L1WUfti38 | 5 | I live in Md and the Aloft is my Home away fro... | 5 |
| ... | ... | ... | ... | ... |
| 9995 | AVwd4TMv_7pvs4fz-Ers | 3 | It is hard for me to review an oceanfront hote... | 5 |
| 9996 | AVwdRp4DIN2L1WUfuGZZ | 4 | I live close by, and needed to stay somewhere ... | 5 |
| 9997 | AVwd1TbkByjofQCxs6FH | 4 | Rolled in 11:30 laid out heads down woke up to... | 5 |
| 9998 | AVwdHbizIN2L1WUfsXto | 1 | Absolutely terrible..I was told I was being gi... | 3 |
