# 👩‍💻 Compare an LSTM Text Classifier with a Pre-trained Transformer

## 📋 Overview
In this lab, you'll build and compare two powerful text classification approaches: a custom LSTM model and a pre-trained transformer model using the Hugging Face library. You'll work with movie review sentiment analysis - a practical application found in product recommendation systems, social media monitoring, and customer feedback analysis. By the end of this lab, you'll understand the trade-offs between model complexity, performance, and implementation effort for these two popular NLP approaches.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:

- Implement an LSTM-based text classification model using PyTorch
- Utilize a pre-trained transformer model for sentiment analysis using Hugging Face
- Compare the performance metrics (accuracy, training time, inference speed) between LSTM and transformer models
- Make informed decisions about which model type to use for different NLP applications

## 🚀 Starting Point
Access the starter code below to begin your implementation:

In [None]:
!pip install transformers
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check for GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# --- Simplified Data Preparation ---
# Using a small, synthetic dataset for demonstration.
# For a real application, you would load a dataset like IMDb from a CSV or similar.
print("Generating synthetic dataset...")
texts = [
    "This movie was fantastic and I loved it.",
    "The acting was terrible, completely ruined the film.",
    "It was an okay film, nothing special but not bad.",
    "Absolutely brilliant cinematography and a compelling story.",
    "I hated every minute, a true waste of time.",
    "A decent watch, worth seeing if you have nothing else.",
    "Simply the best film I've seen all year!",
    "So boring, I fell asleep multiple times.",
    "Good plot, but the characters were underdeveloped.",
    "An inspiring and emotional journey."
]
labels = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1] # 1 = positive (including mildly positive reviews), 0 = negative

# Split into train and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)


Required tools/setup:

- Python 3.8+
- PyTorch 1.10+
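- Transformers library
- scikit-learn, NumPy, and pandas (imported by the starter code)
- Matplotlib for visualization
- Internet connection to download the pre-trained models (the dataset in this lab is defined inline)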

## Task 1: Prepare Data for LSTM Model
**Context:** Data preparation is crucial for NLP tasks. For an LSTM model, we need to tokenize text, build a vocabulary, and convert text to numerical sequences of consistent length.

**Steps:**

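1. Create a tokenization function, for example with `get_tokenizer` from torchtext (the starter code does not install or import torchtext, so add it yourself or write a simple tokenizer in plain Python)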

    - The 'basic_english' tokenizer is suitable for this task
    - Consider how different tokenization strategies might affect your model

2. Build a vocabulary from the training data

    - Use `build_vocab_from_iterator` function
    - Set a reasonable vocabulary size (e.g., 25000 most common words)
    - Include special tokens like `<unk>` for unknown words

3. Create functions to convert tokenized text to numerical sequences

    - Map each token to its index in the vocabulary
    - Implement padding to ensure all sequences are the same length
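    - How will you handle sequences longer than your max length?

4. Create PyTorch datasets and dataloaders

    - Wrap the encoded sequences and labels in a `Dataset`
    - Batch them with a `DataLoader`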
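
The tokenization, vocabulary, and encoding steps above can be sketched without any extra dependency. This is a minimal illustration in plain Python, not the torchtext-based approach the steps name: `simple_tokenize`, `build_vocab`, and `encode` are hypothetical helper names, and the `<pad>` token and `max_len=8` are assumptions chosen for the example.

```python
import re
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens for this sketch

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())

def build_vocab(texts, max_size=25000):
    """Map the most frequent tokens to indices; 0 and 1 are reserved."""
    counts = Counter(tok for t in texts for tok in simple_tokenize(t))
    vocab = {PAD: 0, UNK: 1}
    for tok, _ in counts.most_common(max_size - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab, max_len=8):
    """Convert text to a fixed-length list of indices (truncate, then pad)."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in simple_tokenize(text)]
    ids = ids[:max_len]                               # truncate long sequences
    return ids + [vocab[PAD]] * (max_len - len(ids))  # right-pad short ones

train = ["This movie was fantastic.", "The acting was terrible."]
vocab = build_vocab(train)
encoded = encode("The movie was boring.", vocab)
print(len(vocab), encoded)
```

Here the vocabulary is built from the training texts only, so a word that appears only at test time (such as "boring") maps to `<unk>`. Truncation answers the question about over-long sequences in the simplest way; keeping the end of the review instead of the start is an equally valid choice.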

In [None]:
# TASK 1: Data Preparation
# Step 1: Create tokenizer

# Step 2: Build vocabulary function

# Step 3: Text to numerical sequences function

# Step 4: Create PyTorch datasets and dataloaders

**💡 Tip:** Consider using `torch.nn.utils.rnn.pad_sequence` to handle varying sequence lengths efficiently.
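
As a rough illustration of the tip, the sketch below pads three sequences of different lengths into one batch tensor. The token indices are made up, and using `0` as the padding index is an assumption that should match whatever index your vocabulary reserves for padding.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three already-encoded reviews of different lengths (indices are made up).
sequences = [
    torch.tensor([5, 8, 2]),
    torch.tensor([7, 3]),
    torch.tensor([4, 9, 6, 1, 2]),
]

# batch_first=True gives [batch_size, seq_length]; 0 is the assumed pad index.
batch = pad_sequence(sequences, batch_first=True, padding_value=0)
lengths = torch.tensor([len(s) for s in sequences])

print(batch.shape)
print(batch)
```

Note that `pad_sequence` pads only up to the longest sequence in the batch and does not truncate, so any maximum length still has to be enforced separately. One common pattern is to call it inside a `collate_fn` passed to `DataLoader`.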

**⚙️ Test Your Work:**

- Print the vocabulary size
- Tokenize and convert a sample review
- Verify the dimensions of your batched data (should be [batch_size, seq_length])

## Task 2: Design and Implement LSTM Model
**Context:** LSTMs are specialized recurrent neural networks that excel at capturing sequential patterns in text. They were the standard choice before transformers became prevalent and are still valuable for many applications.

**Steps:**

1. Create an LSTM model class that inherits from `nn.Module`

    - Include an embedding layer (`nn.Embedding`) to convert token indices to vector representations
    - Add LSTM layers (`nn.LSTM`) with appropriate hidden dimensions
    - Implement dropout for regularization
    - Add a final linear layer (`nn.Linear`) for classification

2. Initialize your model with appropriate hyperparameters

    - Consider vocabulary size, embedding dimension, hidden dimension, etc.
    - How many LSTM layers will you use?

In [None]:
# TASK 2: LSTM Model Implementation 
    # Define model initialization
    
    # Define forward pass

# Initialize model

**💡 Tip:** Using a bidirectional LSTM (the `bidirectional=True` parameter) can improve performance by capturing context from both directions.
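
To make the shapes concrete, here is a small sketch of what a bidirectional `nn.LSTM` returns. The dimensions are arbitrary values picked for the example, and the random tensor stands in for the output of an embedding layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, emb_dim, hidden_dim, n_layers = 3, 5, 8, 16, 2

lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=n_layers,
               bidirectional=True, batch_first=True)
x = torch.randn(batch_size, seq_len, emb_dim)  # stand-in for embedded tokens

output, (h_n, c_n) = lstm(x)
# output: [batch_size, seq_len, 2 * hidden_dim]
# h_n:    [n_layers * 2, batch_size, hidden_dim]

# The last two entries of h_n are the top layer's forward and backward states.
final = torch.cat((h_n[-2], h_n[-1]), dim=1)  # [batch_size, 2 * hidden_dim]
print(output.shape, h_n.shape, final.shape)
```

Because the two directions are concatenated, a linear layer placed after the LSTM needs `2 * hidden_dim` input features when `bidirectional=True`.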

**⚙️ Test Your Work:**

- Create a small batch of dummy data and pass it through your model
- Verify the output shape (should be [batch_size, num_classes], or [batch_size, 1] if you use a single output logit with `nn.BCEWithLogitsLoss`)

## Task 3: Train the LSTM Model
**Context:** Training deep learning models requires careful monitoring of metrics and hyperparameter tuning to achieve optimal performance.

**Steps:**

1. Define training hyperparameters

    - Choose an appropriate optimizer (Adam is often a good choice)
    - Set learning rate, batch size, number of epochs
    - Define a loss function (`nn.BCEWithLogitsLoss` for binary classification)

2. Implement the training loop

    - Iterate through batches from the training dataloader
    - Calculate loss and perform backpropagation
    - Track and report training metrics
    - Use `model.eval()` when evaluating to disable dropout

3. Evaluate model performance on test data

    - Calculate accuracy, precision, recall or other relevant metrics
    - Track the time taken for training and inference
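
For the evaluation step, accuracy, precision, recall, and F1 can all be derived from the four confusion-matrix counts. The helper below is a minimal sketch in plain Python; `binary_metrics` is a hypothetical name, and the label lists are invented for illustration rather than taken from any model run.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 0, 1, 1, 0]   # invented ground-truth labels
y_pred = [1, 1, 1, 0, 0]   # invented predictions
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With a test set as small as the one produced by the starter split, each of these metrics moves in large steps, so treat the numbers as a sanity check rather than a benchmark.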

In [None]:
# TASK 3: LSTM Model Training
# Define loss and optimizer

# Training loop function

# Evaluation function

# Execute training

**💡 Tip:** Use `torch.no_grad()` context manager during evaluation to conserve memory and speed up inference.
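
The following sketch shows the two evaluation switches together on a toy model. The small `nn.Sequential` network is only a stand-in for your classifier; its layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.Dropout(0.5), nn.Linear(8, 1))
x = torch.randn(6, 4)

model.eval()               # dropout is disabled in evaluation mode
with torch.no_grad():      # no autograd graph is recorded inside this block
    logits_a = model(x)
    logits_b = model(x)

# In eval mode two passes over the same input give identical outputs,
# and tensors produced under no_grad do not track gradients.
print(torch.equal(logits_a, logits_b), logits_a.requires_grad)
```

`model.eval()` and `torch.no_grad()` do different jobs: the first changes layer behavior (dropout, batch norm), the second turns off gradient tracking. Remember to call `model.train()` again before resuming training.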

**⚙️ Test Your Work:**

- Training loss should decrease over epochs
- Test accuracy should improve as training progresses

## Task 4: Implement Sentiment Analysis with a Pre-trained Transformer
**Context:** Pre-trained transformer models have revolutionized NLP by providing powerful, ready-to-use models that capture complex language patterns.

**Steps:**

1. Load a pre-trained sentiment analysis model from Hugging Face

    - Use the `pipeline` function for the simplest implementation
    - Alternatively, load a specific model like "distilbert-base-uncased-finetuned-sst-2-english"

2. Process the test dataset with the transformer model

    - Be mindful of input formatting requirements
    - Consider batching for efficiency
    - Track the time required for inference
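
The post-processing around the model call can be sketched independently of the model itself. In the example below, `fake_pipeline` is a hypothetical stand-in that only mimics the list-of-dicts output of a sentiment-analysis pipeline so the snippet runs offline; `timed_predictions` is likewise an illustrative helper name. The `POSITIVE`/`NEGATIVE` label strings are assumed to match the checkpoint you load.

```python
import time

def fake_pipeline(texts, batch_size=8):
    """Hypothetical stand-in returning results shaped like a sentiment pipeline."""
    return [
        {"label": "NEGATIVE" if "terrible" in t.lower() else "POSITIVE",
         "score": 0.99}
        for t in texts
    ]

def timed_predictions(classifier, texts, batch_size=8):
    """Run the classifier on all texts; return 0/1 predictions and elapsed seconds."""
    start = time.perf_counter()
    results = classifier(texts, batch_size=batch_size)  # one batched call
    elapsed = time.perf_counter() - start
    # Map string labels to the integer labels used by the test set.
    preds = [1 if r["label"] == "POSITIVE" else 0 for r in results]
    return preds, elapsed

reviews = ["I loved it.", "The acting was terrible."]
preds, seconds = timed_predictions(fake_pipeline, reviews)
print(preds, f"{seconds:.4f}s")
```

To use this with a real model, pass the object returned by `pipeline("sentiment-analysis", ...)` in place of `fake_pipeline`. Passing the whole list in one call lets the pipeline batch inputs instead of processing one review at a time.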

In [None]:
# TASK 4: Pre-trained Transformer Implementation
# Load pre-trained model

# Define inference function

# Run inference on test data

**💡 Tip:** The `pipeline` function abstracts away much of the complexity, but loading the model and tokenizer separately gives you more control over the process.

**⚙️ Test Your Work:**

- Try the model on a few sample reviews
- Verify predictions match expected sentiment (positive/negative)

## Task 5: Compare Model Performance
**Context:** Understanding model trade-offs is crucial for choosing the right approach for a given application and resource constraints.

**Steps:**

1. Calculate accuracy for both models

    - Use the same test set for fair comparison
    - Consider additional metrics like F1 score if appropriate

2. Compare computational efficiency

    - Record and compare training time (LSTM only)
    - Measure and compare inference time for both models
    - Calculate model size (number of parameters)

3. Create visualizations to illustrate the comparison

    - Bar charts for accuracy and time metrics
    - Consider visualizing specific examples where models differ
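
The efficiency measurements can be sketched with two small helpers. `count_parameters` and `time_inference` are hypothetical names, and the two toy networks below stand in for the LSTM and the transformer purely to show the mechanics.

```python
import time
import torch
import torch.nn as nn

def count_parameters(model):
    """Total number of parameters (trainable and frozen) in a module."""
    return sum(p.numel() for p in model.parameters())

def time_inference(model, inputs, repeats=10):
    """Average seconds per forward pass over several repeats."""
    model.eval()
    with torch.no_grad():
        model(inputs)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(repeats):
            model(inputs)
        elapsed = time.perf_counter() - start
    return elapsed / repeats

small = nn.Linear(10, 5)
large = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), nn.Linear(100, 5))
batch = torch.randn(4, 10)

for name, net in [("small", small), ("large", large)]:
    print(name, count_parameters(net), f"{time_inference(net, batch):.6f}s")
```

Averaging over several repeats and discarding a warm-up run makes the timing less noisy than a single measurement. On a GPU, CUDA work is launched asynchronously, so call `torch.cuda.synchronize()` before reading the clock to get meaningful numbers.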

In [None]:
# TASK 5: Model Comparison
# Calculate performance metrics

# Compare computational efficiency

# Create visualization

**💡 Tip:** Consider the trade-offs beyond just accuracy - deployment requirements, inference speed, and explainability are important factors in real-world applications.

**⚙️ Test Your Work:**

- Verify that your comparison metrics are calculated correctly
- Ensure visualizations clearly communicate the key differences

## ✅ Success Checklist
- LSTM model correctly implemented and trained
- Pre-trained transformer model successfully applied to test data
- Accuracy metrics calculated for both models
- Computational efficiency (time, parameters) compared between models
- Clear understanding of trade-offs between both approaches demonstrated
- Program runs without errors

## 🔍 Common Issues & Solutions
**Problem:** LSTM training is extremely slow **Solution:** Make sure you're using a GPU if one is available, shorten the maximum sequence length, or increase the batch size if memory allows (fewer, larger batches usually mean faster epochs).

**Problem:** Out-of-memory errors when using transformer models **Solution:** Reduce batch size or use a smaller model like DistilBERT instead of BERT.

**Problem:** Poor LSTM performance compared to reported benchmarks **Solution:** Check preprocessing steps, increase model capacity, or try bidirectional LSTMs.

**Problem:** Hugging Face models download/load slowly **Solution:** Ensure good internet connection or download models once and save locally.

## 🔑 Key Points
- LSTMs require more manual implementation but give you full control over the architecture
- Pre-trained transformers provide powerful out-of-the-box performance but are less flexible
- The performance gap between custom LSTMs and pre-trained transformers highlights the value of transfer learning
- Consider computational requirements when choosing between models for production applications

## 💻 Reference Solution

<details>

<summary><strong>Click HERE to see a reference solution</strong></summary>    
    
```python
# TASK 1: Data Preparation
# Using a Hugging Face tokenizer for consistency and simplicity
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
max_len = 128 # Reduced max_len for smaller dataset and faster processing

# Text processing function - now uses the Hugging Face tokenizer
def process_text_hf(text, tokenizer, max_len):
    # This will pad and truncate automatically
    encoding = tokenizer(
        text,
        max_length=max_len,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    return encoding['input_ids'].squeeze(0) # Remove batch dimension

# Custom Dataset
class SimplifiedIMDBDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.data = []
        for text, label in zip(texts, labels):
            text_tensor = process_text_hf(text, self.tokenizer, self.max_len)
            label_tensor = torch.tensor([float(label)], dtype=torch.float)
            self.data.append((text_tensor, label_tensor))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Create datasets
train_dataset = SimplifiedIMDBDataset(train_texts, train_labels, tokenizer, max_len)
test_dataset = SimplifiedIMDBDataset(test_texts, test_labels, tokenizer, max_len)

# Create dataloaders
batch_size = 2 # Reduced batch size for very small dataset
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

# Determine vocab_size and pad_idx from the Hugging Face tokenizer
# Note: For LSTM, we would typically build a custom vocab if not using pretrained embeddings
# For simplification here, we'll use a large enough vocab size and HF's pad_token_id
vocab_size = tokenizer.vocab_size
pad_idx = tokenizer.pad_token_id

print(f"Vocab size (from HF tokenizer): {vocab_size}")
print(f"Pad token ID (from HF tokenizer): {pad_idx}")

# TASK 2: LSTM Model
class LSTMTextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers,
                 bidirectional, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                            bidirectional=bidirectional, dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text shape: [batch_size, seq_len]
        embedded = self.embedding(text)
        # embedded shape: [batch_size, seq_len, embedding_dim]

        output, (hidden, cell) = self.lstm(embedded)
        # output shape: [batch_size, seq_len, hidden_dim * n_directions]

        if self.lstm.bidirectional:
            # Concatenate the last two hidden states (forward and backward)
            hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        else:
            hidden = hidden[-1,:,:]
        # hidden shape: [batch_size, hidden_dim * n_directions]

        hidden = self.dropout(hidden)
        return self.fc(hidden)

# Initialize LSTM model
embedding_dim = 100
hidden_dim = 128 # Reduced hidden_dim for smaller model
output_dim = 1
n_layers = 2
bidirectional = True
dropout = 0.5

lstm_model = LSTMTextClassifier(
    vocab_size, embedding_dim, hidden_dim, output_dim,
    n_layers, bidirectional, dropout, pad_idx
).to(device)

# TASK 3: LSTM Training
optimizer_lstm = optim.Adam(lstm_model.parameters(), lr=0.001)
criterion = nn.BCEWithLogitsLoss()

def train_model(model, dataloader, optimizer, criterion, epochs=3): # Reduced epochs for faster demo
    model.train()
    start_time = time.time()

    epoch_losses = []
    for epoch in range(epochs):
        epoch_loss = 0
        for batch_idx, (text, labels) in enumerate(dataloader):
            text, labels = text.to(device), labels.to(device)

            optimizer.zero_grad()
            predictions = model(text)
            loss = criterion(predictions, labels)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()

            if batch_idx % 1 == 0: # Print more frequently for tiny dataset
                print(f"Epoch {epoch+1}/{epochs}, Batch {batch_idx}/{len(dataloader)}, Loss: {loss.item():.4f}")

        avg_epoch_loss = epoch_loss / len(dataloader)
        epoch_losses.append(avg_epoch_loss)
        print(f"Epoch {epoch+1}/{epochs}, Average Loss: {avg_epoch_loss:.4f}")

    training_time = time.time() - start_time
    print(f"Training completed in {training_time:.2f} seconds")

    return epoch_losses, training_time

def evaluate_model(model, dataloader):
    model.eval()
    correct = 0
    total = 0
    start_time = time.time()

    with torch.no_grad():
        for text, labels in dataloader:
            text, labels = text.to(device), labels.to(device)
            outputs = model(text)
            predicted = (torch.sigmoid(outputs) > 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    inference_time = time.time() - start_time
    accuracy = 100 * correct / total

    print(f"Accuracy: {accuracy:.2f}%")
    print(f"Inference time: {inference_time:.2f} seconds")

    return accuracy, inference_time

# Train and evaluate LSTM
print("\n--- Training LSTM Model ---")
lstm_losses, lstm_train_time = train_model(lstm_model, train_dataloader, optimizer_lstm, criterion, epochs=3)
print("\n--- Evaluating LSTM Model ---")
lstm_accuracy, lstm_inference_time = evaluate_model(lstm_model, test_dataloader)

# TASK 4: Pre-trained Transformer
# Load transformer model
transformer_model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer_transformer = AutoTokenizer.from_pretrained(transformer_model_name)
model_transformer = AutoModelForSequenceClassification.from_pretrained(transformer_model_name).to(device)

# Wrap the loaded model and tokenizer in a pipeline for easy sentiment analysis
sentiment_pipeline = pipeline("sentiment-analysis", model=model_transformer, tokenizer=tokenizer_transformer, device=0 if torch.cuda.is_available() else -1)

print("\n--- Evaluating Pre-trained Transformer Model ---")
def evaluate_transformer(texts, labels, pipeline_model):
    correct = 0
    total = 0
    start_time = time.time()

    for text, true_label in zip(texts, labels):
        # The pipeline handles tokenization and inference internally
        result = pipeline_model(text)[0]
        predicted_sentiment = result['label']

        # Map 'POSITIVE' to 1, 'NEGATIVE' to 0 for comparison
        predicted_class = 1 if predicted_sentiment == 'POSITIVE' else 0

        if predicted_class == true_label:
            correct += 1
        total += 1

    inference_time = time.time() - start_time
    accuracy = 100 * correct / total

    print(f"Transformer Accuracy: {accuracy:.2f}%")
    print(f"Transformer Inference time: {inference_time:.2f} seconds")

    return accuracy, inference_time

transformer_accuracy, transformer_inference_time = evaluate_transformer(test_texts, test_labels, sentiment_pipeline)

# TASK 5: Model Comparison
def compare_models():
    # Accuracy comparison
    models = ['LSTM', 'Transformer']
    accuracies = [lstm_accuracy, transformer_accuracy]

    plt.figure(figsize=(10, 5))
    plt.bar(models, accuracies, color=['blue', 'orange'])
    plt.title('Model Accuracy Comparison')
    plt.ylabel('Accuracy (%)')
    plt.ylim(0, 100)
    for i, v in enumerate(accuracies):
        plt.text(i, v + 1, f"{v:.2f}%", ha='center')
    plt.savefig('accuracy_comparison_simplified.png')
    plt.close() # Close plot to prevent display issues in some environments

    # Inference time comparison
    times = [lstm_inference_time, transformer_inference_time]

    plt.figure(figsize=(10, 5))
    plt.bar(models, times, color=['blue', 'orange'])
    plt.title('Inference Time Comparison')
    plt.ylabel('Time (seconds)')
    for i, v in enumerate(times):
        plt.text(i, v, f"{v:.2f}s", ha='center', va='bottom')  # label sits just above the bar, whatever its height
    plt.savefig('time_comparison_simplified.png')
    plt.close()

    # Print comparison table
    print("\n--- Model Comparison Summary ---")
    print("-" * 60)
    print(f"{'Metric':<20} | {'LSTM':<15} | {'Transformer':<15}")
    print("-" * 60)
    # Format each value together with its unit so the unit stays attached to the number
    print(f"{'Accuracy':<20} | {f'{lstm_accuracy:.2f}%':<15} | {f'{transformer_accuracy:.2f}%':<15}")
    print(f"{'Inference Time':<20} | {f'{lstm_inference_time:.2f}s':<15} | {f'{transformer_inference_time:.2f}s':<15}")
    print(f"{'Training Time':<20} | {f'{lstm_train_time:.2f}s':<15} | {'N/A (pre-trained)':<15}")

    # Parameter count
    lstm_params = sum(p.numel() for p in lstm_model.parameters())
    transformer_params = sum(p.numel() for p in model_transformer.parameters())
    print(f"{'Parameters':<20} | {lstm_params:<15,d} | {transformer_params:<15,d}")
    print("-" * 60)

compare_models()

# Specific example comparison
def compare_specific_examples():
    sample_texts = [
        "This movie was absolutely fantastic! I loved every minute of it.",
        "The film was neither good nor bad, just mediocre overall.",
        "What a terrible waste of time and money. Worst movie ever."
    ]

    print("\n--- Example Predictions ---")
    print("-" * 80)
    print(f"{'Text':<40} | {'LSTM Prediction':<20} | {'Transformer Prediction':<20}")
    print("-" * 80)

    lstm_model.eval()
    for text in sample_texts:
        # LSTM prediction
        processed = process_text_hf(text, tokenizer, max_len).unsqueeze(0).to(device)
        with torch.no_grad():
            lstm_output = torch.sigmoid(lstm_model(processed)).item()
        lstm_sentiment = "Positive" if lstm_output > 0.5 else "Negative"
        lstm_confidence = max(lstm_output, 1 - lstm_output) * 100

        # Transformer prediction using the pipeline
        transformer_result = sentiment_pipeline(text)[0]
        transformer_sentiment = transformer_result['label']
        transformer_confidence = transformer_result['score'] * 100

        print(f"{text[:37] + '...':<40} | {lstm_sentiment} ({lstm_confidence:.1f}%) | {transformer_sentiment} ({transformer_confidence:.1f}%)")

    print("-" * 80)

compare_specific_examples()
```

</details>