# Part 3 - Sentiment Analysis

In this notebook, we explore two models from the BERT family to perform sentiment analysis on reviews. 
- The first model has already been fine-tuned for this task and we can run inference on it
- We fine-tune the second model on our dataset to try to get better results

## Imports

In [20]:
from transformers import AutoTokenizer
import torch
from datetime import datetime
from tqdm import tqdm
tqdm.pandas()  # Initialize tqdm with pandas
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# import helpers module
import helpers
import importlib
importlib.reload(helpers)

SEP = 100 * '-'

## Load dataset

In [8]:
# load dataset
data = helpers.load_pickled_dataset('pickle/data_processed.pkl')

helpers.print_random_product_sheet(data)

Dataset loaded from pickle/data_processed.pkl.
----------------------------------------------------------------------------------------------------
[name] All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi, 16 GB - Includes Special Offers, Magenta
----------------------------------------------------------------------------------------------------
[brand] Amazon
----------------------------------------------------------------------------------------------------
[categories] Electronics,iPad & Tablets,All Tablets,Fire Tablets,Tablets,Computers & Tablets
----------------------------------------------------------------------------------------------------
[reviews.rating] 5.0
----------------------------------------------------------------------------------------------------
[review] Amazon - Fire Kids Edition
I got this for my kid, because it is tailored for kids. They liked it and it worked.


## Map sentiment based on rating
The review sentiment can take 3 values: negative, neutral or positive. Let's add a column and map the sentiment corresponding to the rating, for future inference, fine-tuning and comparison with the predictions. For that, we need to reduce the 5 star-labels to 3 sentiments (negative, neutral, positive), thus generating some imprecisions.

Since most people usually rate a product when they are happy or not, we can assume that the neutral sentiment is not very common and we will map it to 3 stars only, assuming that 1-2 starts is a negative comment and 4-5 stars a positive one.

In [9]:
# most people rate a product when they are happy or not, so we can assume that the neutral sentiment is not very common
review_mapping = {
    1: 'negative',
    2: 'negative',
    3: 'neutral',
    4: 'positive',
    5: 'positive'
}

data['reviews.sentiment'] = data['reviews.rating'].map(review_mapping)

print(data.info())
data['reviews.sentiment'].value_counts(normalize=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50109 entries, 0 to 50108
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               50109 non-null  object 
 1   brand              50109 non-null  object 
 2   categories         50109 non-null  object 
 3   reviews.rating     50109 non-null  float64
 4   review             50109 non-null  object 
 5   reviews.sentiment  50109 non-null  object 
dtypes: float64(1), object(5)
memory usage: 2.3+ MB
None


reviews.sentiment
positive    0.915365
neutral     0.045920
negative    0.038716
Name: proportion, dtype: float64

Again, we notice how unbalanced our dataset is, with 91% of positive reviews.

## METHOD 1: Use a fine-tuned model
In this first approach, we use a HuggingFace model from the BERT family, that has already been fine-tuned for sentiment analysis. We can therefore run inference immediately, once our reviews are tokenized.

I did some research and decided to use [bert-base-multilingual-uncased-sentiment](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment?text=I+like+you.+I+love+you), wich is a BERT-family model fine-tuned on product reviews.

- This model has been fine-tuned on product reviews, which makes it more suitable for handling the nuances and specific vocabulary commonly found in reviews (e.g., product features, usability, satisfaction).

- Multilingual Support: Since it's multilingual, it can handle a wide range of languages if your Amazon reviews dataset includes non-English text. This makes it versatile if you encounter reviews in German, Spanish, or other languages.

- Performance: BERT models are known for their good performance on sentiment analysis tasks, and this particular model is specifically optimized for product review sentiment, likely making it more accurate in distinguishing subtle sentiment shifts in review data.

### Tokenize input and run inference on model

In [5]:
# Load pre-trained model and tokenizer
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to("cuda")

# Example function to predict sentiment for a single review
def predict_sentiment(review_text):
    inputs = tokenizer(review_text, return_tensors="pt", truncation=True, padding=True).to("cuda")
    outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    sentiment = torch.argmax(probabilities).item()
    
    # we want to output 3 sentiments: negative, neutral, positive
    # for sentiments 1 and 3, we check side probabilities to decide
    if sentiment == 0:
        return 'negative'
    elif sentiment == 1:
        return 'negative' if probabilities[0][0] > probabilities[0][2] else 'neutral'
    elif sentiment == 2:
        return 'neutral'
    elif sentiment == 3:
        return 'positive' if probabilities[0][4] > probabilities[0][2] else 'neutral'
    else:
        return 'positive'


tokenizer_config.json:   0%|          | 0.00/39.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/953 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/872k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/669M [00:00<?, ?B/s]

In [6]:
# Example usage
review = "This product is good."
print(f"Sentiment: {predict_sentiment(review)}")

review = "This is the worst product ever."
print(f"Sentiment: {predict_sentiment(review)}")

Sentiment: positive
Sentiment: negative


Running the model on the full dataset is very time consuming. Let's start with a subset of the dataset first.

In [None]:
# create subset of the dataset
test_data = data.head(1000)

# predict on the dataset's reviews and store the result into a new column
test_data['predicted.sentiment'] = test_data['review'].progress_apply(predict_sentiment)

print(test_data.info())
test_data.head(10)


In [32]:
# let's compute an accuracy score to compare the predicted sentiment with the actual review
accuracy = (test_data['reviews.sentiment'] == test_data['predicted.sentiment']).mean()
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.76


This is not so bad. Let's run the accuracy on the whole dataset.

In [7]:
# predict on the dataset's reviews and store the result into a new column
data['predicted.sentiment'] = data['review'].progress_apply(predict_sentiment)

# print the accuracy score
accuracy = (test_data['reviews.sentiment'] == test_data['predicted.sentiment']).mean()
print(f"Accuracy: {accuracy:.2f}")


100%|██████████| 50174/50174 [06:06<00:00, 137.04it/s]

Accuracy: 0.76





The accuracy is sensibly the same.

## METHOD 2: Fine-Tune a Transformer on our dataset to increase accuracy

Our reviews are very specific to Amazon and amazon products. Let's fine tune a transformer model to do sentiment analysis on our dataset. This way it will be able to classify reviews without the star-rating.

### Model choice
As said, the star-rating system is not very accurate, due to the reduction of 5 star labels to 3 sentiments (negative, neutral, positive). So we should priorize a model that is fast (if we want it to run in real time for the demo) and don't need a high accuracy. Let's use [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) for that, which is a lightweight version of the BERT-family models.

## Load model

In [11]:
# Load DistilBERT tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3).to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Split the dataset

In [12]:
# Prepare features and labels
X = data['review'].to_list()
y = data['reviews.sentiment'].to_list()

# convert labels to integers, 0 for negative, 1 for neutral, 2 for positive
y = [0 if sentiment == 'negative' else 1 if sentiment == 'neutral' else 2 for sentiment in y]
print(set(y))  # check result

# Split data into train and test sets (80-20 split)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# print the number of samples in each set
print(len(X_train), len(y_train))
print(len(X_val), len(y_val))

{0, 1, 2}
40087 40087
10022 10022


In [13]:
# Tokenize the data
train_encodings = tokenizer(X_train, truncation=True, padding="max_length", max_length=128)
val_encodings = tokenizer(X_val, truncation=True, padding="max_length", max_length=128)

print(train_encodings[0])

Encoding(num_tokens=128, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


In [14]:
# Create a Dataset Class
class EncodedDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Convert each tokenized item to tensor and add label as well
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        # item = {key: torch.tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        # item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long).to(device)  # Adjust labels as needed
        return item

    def __len__(self):
        return len(self.labels)

In [15]:
# Create the train and validation datasets
train_dataset = EncodedDataset(train_encodings, y_train)
val_dataset = EncodedDataset(val_encodings, y_val)

In [16]:
# Define Training Arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    report_to="none"
)

In [17]:
# Define metrics to track
def compute_metrics(pred):
    logits, labels = pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy}

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# Fine-Tune the Model
# trainer.train()

# Save model
# timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
# model_path = f"models/distilbert_sa_{timestamp}"
# model.save_pretrained(model_path)
# print(f"Model saved to {model_path}")

### Evaluate Model

In [18]:
# load model
model_path = "models/distilbert_sa_20241015_143753"
model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)

In [19]:
eval_results = trainer.evaluate()
print("Evaluation Results:", eval_results)


Evaluation Results: {'eval_loss': 1.0426701307296753, 'eval_accuracy': 0.11454799441229295, 'eval_runtime': 37.7486, 'eval_samples_per_second': 265.493, 'eval_steps_per_second': 16.61}


### Run batch prediction on reviews and store the results

In [15]:
# Function to perform inference in batches
def predict_in_batches(model, texts, batch_size=32):
    predictions = []
    model.eval()  # Set model to evaluation mode

    with torch.no_grad():  # Disable gradient tracking
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i : i + batch_size]
            # Tokenize the batch of texts
            encodings = tokenizer(batch_texts, truncation=True, padding="max_length", max_length=128, return_tensors="pt").to(device)
            
            # Run inference and get logits
            outputs = model(**encodings)
            batch_preds = torch.argmax(outputs.logits, dim=1)  # Get the predicted class
            predictions.extend(batch_preds.cpu().numpy())  # Move to CPU to avoid GPU memory overload

    return predictions

# Run the prediction function on your review column
data['reviews.ft'] = predict_in_batches(model, data['review'].tolist(), batch_size=32)


                                              review  reviews.ft
0  Great device for reading. Definately pricey.\n...           2
1  Excellent Kindle\nThe best Kindle ever, for me...           2
2  Love it\nI absolutely love this reader. The bi...           2
3  Good kindle\nI always use it when i read ebook...           2
4  So much to love, but slippery\nLove bigger scr...           2


In [17]:
data['reviews.ft.sentiment'] = data['reviews.ft'].map({0: 'negative', 1: 'neutral', 2: 'positive'})

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50109 entries, 0 to 50108
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   name                  50109 non-null  object 
 1   brand                 50109 non-null  object 
 2   categories            50109 non-null  object 
 3   reviews.rating        50109 non-null  float64
 4   review                50109 non-null  object 
 5   reviews.sentiment     50109 non-null  object 
 6   reviews.ft            50109 non-null  int64  
 7   reviews.ft.sentiment  50109 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 3.1+ MB


### Compare predictions over real data

In [19]:
accuracy = (data['reviews.sentiment'] == data['reviews.ft.sentiment']).mean()
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.92


Obviously we are getting a much higher accuracy, considering that we run the model on the same data we used for training.

## Pickle Sentiment Analysis Results

In [20]:
data.to_pickle('pickle/data_sentiment_analysis.pkl')