# 👩‍💻 Classical vs. Transformer Sentiment Models: A Head-to-Head Comparison

## 📋 Overview
In this hands-on activity, you'll pit classical machine learning models such as Naive Bayes against transformer-based models such as BERT on the same sentiment analysis task. Comparing the two approaches shows how each performs, where each is strongest, and when one is preferable to the other.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:

- ✅ Implement and evaluate classical machine learning models for sentiment analysis
- ✅ Fine-tune and assess transformer-based models for the same task
- ✅ Compare the performance and suitability of classical versus transformer models in various scenarios

## Task 1: Data Exploration and Preparation

**Context:** Proper data preparation ensures the sentiment dataset is clean and consistent for both models.

**Steps:**

1. A collection of product reviews or social media comments has been provided.
2. Clean and preprocess the data consistently for both models, including tokenization and removal of unwanted characters.

```python
# Task 1: Data Exploration and Preparation
# Handle Imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch

# Data Preparation
reviews = ["I love this movie! Amazing storyline and acting.",
           "Horrible experience, I disliked this product.",
           "Incredibly moving and inspiring film.",
           "Waste of time, not recommended."]
labels = [1, 0, 1, 0]
```

💡 **Tip:** Use `nltk` for tokenization and cleaning tasks.

⚙️ **Test Your Work:**

- Print the cleaned and preprocessed version of each text entry (the starter dataset contains four reviews).

**Expected output:** Cleaned and standardized text ready for modeling.
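
The cleaning step could be sketched with the standard library alone. The `clean_text` and `tokenize` helpers and their regex rules are assumptions for illustration, not part of the provided starter code; an `nltk` tokenizer, as the tip suggests, could replace the whitespace split.

```python
import re

def clean_text(text):
    """Lowercase, drop punctuation and symbols, and collapse whitespace (illustrative rules)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace anything that is not a letter, digit, or space
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

def tokenize(text):
    """Whitespace tokenization of the cleaned text (a simple stand-in for an nltk tokenizer)."""
    return clean_text(text).split()

reviews = ["I love this movie! Amazing storyline and acting.",
           "Horrible experience, I disliked this product.",
           "Incredibly moving and inspiring film.",
           "Waste of time, not recommended."]

cleaned_reviews = [clean_text(review) for review in reviews]
for original, cleaned in zip(reviews, cleaned_reviews):
    print(f"{original!r} -> {cleaned!r}")
```

Applying the same `clean_text` function before both the TF-IDF step and the BERT tokenizer keeps the preprocessing consistent across the two models.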

## Task 2: Implementing Classical Machine Learning Models

**Context:** Classical machine learning models like Naive Bayes use vectorized text data for classification.

**Steps:**

1. Use the TF-IDF vectorization method to transform the text data into numerical vectors.
2. Train a classical machine learning model such as Naive Bayes or SVM on the TF-IDF features.
3. Evaluate the model's performance using metrics like accuracy, precision, recall, and F1-score.
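
To make step 1 concrete, the following sketch computes TF-IDF weights by hand. It follows the smoothed IDF formula and L2 normalization documented for scikit-learn's default settings, but the `tfidf_vectors` helper, the whitespace tokenization, and the three toy documents are assumptions for illustration, so the values may differ from `TfidfVectorizer` output on real text.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0  # guard against empty documents
        vectors.append([w / norm for w in row])
    return vocab, vectors

# Toy documents (made up for this sketch)
docs = ["great film great cast", "terrible film", "great story"]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    weights = {term: round(w, 3) for term, w in zip(vocab, vec) if w > 0}
    print(doc, "->", weights)
```

Terms that appear in fewer documents receive a larger IDF, which is why a rare word such as "terrible" outweighs a common one such as "film" within the same review.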

```python
# Task 2: Implementing Classical Machine Learning Models
```

💡 **Tip:** Use `TfidfVectorizer` and `MultinomialNB` from sklearn.

⚙️ **Test Your Work:**

- Print the classification metrics for the Naive Bayes model.

**Expected output:** Accuracy, precision, recall, and F1-score for the Naive Bayes model.
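
The four metrics can be derived directly from the predictions, which helps when reading a classification report. This is a minimal sketch: the `binary_metrics` helper is hypothetical, and the label lists are made-up values rather than results from any model.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels and predictions, for illustration only
y_true = [1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```

Precision and recall can diverge sharply on small or imbalanced test sets, so reporting all four values gives a fairer comparison than accuracy alone.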

## Task 3: Exploring Transformer Models

**Context:** Transformer models like BERT require tokenization and fine-tuning for sentiment analysis.

**Steps:**

1. Tokenize the data using the tokenizer that matches your chosen checkpoint from the Hugging Face `transformers` library.
2. Fine-tune a pre-trained transformer model such as BERT on the same dataset.
3. Evaluate the model's performance with the same metrics used for the classical model.

```python
# Task 3: Exploring Transformer Models
```

💡 **Tip:** Use `BertTokenizer` and `BertForSequenceClassification` from the `transformers` library.

⚙️ **Test Your Work:**

- Print the classification metrics for the BERT model.

**Expected output:** Accuracy, precision, recall, and F1-score for the BERT model.
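
One way to get metrics out of a fine-tuned model is a metrics function that converts logits into predicted labels. The sketch below is written in plain Python so it runs without a model; the `compute_metrics` name follows the hook that `Trainer` accepts, while the logits and labels are stand-in values invented for this example.

```python
def compute_metrics(eval_pred):
    """Turn a (logits, labels) pair into an accuracy score."""
    logits, labels = eval_pred
    # The index of the largest logit in each row is the predicted class
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(pred == int(label)) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Stand-in logits and labels (made-up values) to exercise the function
fake_logits = [[2.1, -0.3], [-1.0, 0.8], [0.2, 0.1]]
fake_labels = [0, 1, 1]
result = compute_metrics((fake_logits, fake_labels))
print(result)
```

Assuming a held-out `eval_dataset` is available, the function could be passed as `Trainer(..., compute_metrics=compute_metrics)` and triggered with `trainer.evaluate()`; precision, recall, and F1 can be added to the returned dictionary in the same way.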

### ✅ Success Checklist

- Successfully obtained and preprocessed the sentiment dataset
- Implemented and evaluated classical machine learning models for sentiment analysis
- Fine-tuned and assessed transformer-based models for the same task
- Compared the performance of classical vs. transformer models
- Provided reflections and recommendations based on findings

### 🔍 Common Issues & Solutions

**Problem:** Text data not cleaning properly.   
**Solution:** Print a few entries before and after cleaning to confirm that each tokenization and cleaning step is applied, and in the intended order.

**Problem:** Tokenization errors for transformer models.   
**Solution:** Load the tokenizer from the same checkpoint as the model (for example, `bert-base-uncased` for both) using the Hugging Face library.

**Problem:** Differences in model performance not noticeable.   
**Solution:** Use a dataset with enough size, variety, and complexity to showcase the strengths of transformer models; the four starter reviews are too few to show a meaningful gap.

### 🔑 Key Points

- Classical models train quickly and need little computing power, but may struggle with nuanced language such as negation or sarcasm.
- Transformer models capture context and are often more accurate, but require more resources and longer training times.
- Choosing the right model depends on the specific requirements and constraints of the task.
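
The speed trade-off in these points can be measured rather than assumed. The sketch below shows a small timing helper; `time_call` and `toy_workload` are hypothetical names, and the toy workload only stands in for a real call such as fitting Naive Bayes or running `trainer.train()`.

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Run fn several times and return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result

def toy_workload(n):
    """Placeholder computation standing in for model training."""
    return sum(i * i for i in range(n))

seconds, value = time_call(toy_workload, 100_000)
print(f"toy workload: {seconds:.4f}s")
```

For expensive steps such as fine-tuning, `repeats=1` keeps the measurement practical; timing both models on the same machine and dataset makes the comparison fair.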

## 💻 Exemplar Solution

<details>    
<summary><strong>Click HERE to see an exemplar solution</strong></summary>    

```python
# Task 1: Data Exploration and Preparation
import torch
import numpy as np
from torch.utils.data import Dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

# Data Preparation
reviews = ["I love this movie! Amazing storyline and acting.",
           "Horrible experience, I disliked this product.",
           "Incredibly moving and inspiring film.",
           "Waste of time, not recommended."]
labels = np.array([1, 0, 1, 0])

# Task 2: Classical Machine Learning Model - Naive Bayes
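# Split first, then fit TF-IDF on the training split only so test vocabulary does not leak into training
train_texts, test_texts, y_train, y_test = train_test_split(reviews, labels, test_size=0.2, random_state=42)

# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)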

# Train Naive Bayes
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

# Evaluate Naive Bayes
y_pred_nb = nb_classifier.predict(X_test)
nb_accuracy = accuracy_score(y_test, y_pred_nb)
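# The test split holds a single review, so some classes are absent; zero_division=0 avoids undefined-metric warnings
nb_report = classification_report(y_test, y_pred_nb, zero_division=0)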
print(f"Naive Bayes Accuracy: {nb_accuracy:.2f}")
print("Classification Report:\n", nb_report)

# Task 3: Transformer-Based Model - BERT
# Tokenization and Encoding
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
train_encodings = tokenizer(reviews, truncation=True, padding=True, return_tensors='pt')
train_labels = torch.tensor(labels)

# Create class for wrapping training_dataset
class SentimentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

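    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors='pt'), so index them directly;
        # torch.as_tensor avoids the copy warning that torch.tensor raises on tensor input.
        item = {key: torch.as_tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.as_tensor(self.labels[idx])
        return item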

    def __len__(self):
        return len(self.labels)
        
train_dataset = SentimentDataset(train_encodings, train_labels)            

# Fine-tune BERT
bert_model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
training_args = TrainingArguments(output_dir='./', num_train_epochs=2, per_device_train_batch_size=4)
trainer = Trainer(model=bert_model, args=training_args, train_dataset=train_dataset)

trainer.train()

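# Evaluate BERT (sanity check on one review; note that this review was also seen during training)
bert_model.eval()  # disable dropout for inference
# Move the inputs to the model's device, since the Trainer may have placed the model on a GPU
test_review = tokenizer(reviews[1], return_tensors='pt').to(bert_model.device)
with torch.no_grad():
    test_outputs = bert_model(**test_review)
test_prediction = torch.argmax(test_outputs.logits, dim=-1)
print(f"BERT Predicted Sentiment for '{reviews[1]}': {test_prediction.item()}")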

# Compare
print("Comparison Insights:")
print("Classical models like Naive Bayes are fast and require less computing power but may struggle with nuanced language.")
print("Transformers like BERT are more nuanced and accurate but require more resources and longer training times.")
```

</details>