# Mini Workshop 2: Sentiment Classification
Welcome to your second AI mini-workshop! In this notebook, you'll learn how to classify the sentiment of text: determining whether a sentence is **positive** or **negative**. You'll work with a real dataset, use the tokenizer of a pre-trained language model, train three simple classifiers, and visualize their sentence representations.

Don't worry — everything will be explained step-by-step, and you'll have space to try things out on your own throughout.

## Part 1: What is Sentiment Analysis?
Sentiment analysis is a type of text classification that identifies the emotional tone of text. It's used to analyze:
- Product reviews
- Movie ratings
- Social media posts

Today we'll use a real dataset (Yelp Polarity) and explore how a computer can learn to classify these sentiments.

## Part 2: Load the Yelp Polarity Dataset
We'll use the Yelp Polarity dataset, which contains Yelp reviews labeled as negative (`0`) or positive (`1`). We load only a small subset so that training is fast and the results are easy to inspect.

In [None]:
import tensorflow as tf

# Check if GPU is available
gpu_available = tf.config.list_physical_devices('GPU')

if not gpu_available:
  print("WARNING: GPU runtime is not enabled. Please go to 'Runtime' > 'Change Runtime Type' > 'GPU' to enable it for faster training.")
else:
  print("GPU runtime is enabled.")
  # Print the name of the first GPU device (reuse the list we already fetched)
  print("GPU Device Name:", gpu_available[0].name)

In [None]:
!pip install -U datasets -q # pip may report dependency conflicts here; they are usually harmless (last checked 5/2025)

In [None]:
# This section loads a dataset of Yelp reviews labeled as positive or negative.
from datasets import load_dataset
# Load a small sample (first 200) from the Yelp Polarity training split for speed
small_dataset = load_dataset("yelp_polarity", split="train[:200]")

# Print the first example to see the structure of the data (dictionary with 'text' and 'label')
print(small_dataset[0])

### Let's Explore
Print the first 5 examples and their labels.

In [None]:
# Loop over the first 5 examples in the small_dataset
for i in range(5):
    # Print the i-th example from the dataset (a dictionary with 'text' and 'label')
    print(small_dataset[i])

## Part 3: Tokenization with DistilBERT
Before a model can work with text, the text must be converted into numbers. We’ll use a **tokenizer** to convert each sentence into tokens that represent subword units. These tokens are then mapped to unique IDs that the model can understand.
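
To make the idea of subword tokens concrete, here is a minimal sketch of greedy longest-match splitting, loosely in the style of WordPiece. The tiny vocabulary `subword_vocab` and the helper `toy_tokenize` are made up for illustration; the real DistilBERT tokenizer has a vocabulary of roughly 30,000 entries and extra rules that this sketch leaves out.

```python
# Hypothetical toy vocabulary: maps each known piece to an integer ID.
# Pieces starting with "##" can only continue a word, as in WordPiece.
subword_vocab = {
    "[UNK]": 0, "the": 1, "food": 2, "was": 3, "amaz": 4,
    "##ing": 5, "un": 6, "##believ": 7, "##able": 8,
}

def toy_tokenize(sentence, vocab):
    """Split each lowercased word into the longest pieces found in vocab."""
    tokens = []
    for word in sentence.lower().split():
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            piece = None
            # Try the longest remaining substring first, then shrink it.
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                # No piece matches: the whole word becomes unknown.
                pieces = ["[UNK]"]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens

toy_tokens = toy_tokenize("The food was amazing", subword_vocab)
toy_ids = [subword_vocab[t] for t in toy_tokens]
print(toy_tokens)  # ['the', 'food', 'was', 'amaz', '##ing']
print(toy_ids)     # [1, 2, 3, 4, 5]
```

Notice that "amazing" is not in the toy vocabulary, yet it can still be represented by two known pieces. That is the main benefit of subword tokenization.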

In [None]:
# Before using text in a model, we must convert it to numbers (tokens).
# We use a pre-trained tokenizer from Hugging Face's transformers library.
from transformers import AutoTokenizer

# Create a tokenizer object for the 'distilbert-base-uncased' model.
# It lowercases the text, splits it into subword tokens,
# and maps each token to the numeric ID the model expects.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Get the text of the first review in the dataset
example = small_dataset[0]['text']

# Tokenize the text: convert it to input IDs (numbers) and truncate it if it exceeds the model's maximum length.
# (padding=True only has an effect when several texts are tokenized together.)
tokens = tokenizer(example, padding=True, truncation=True)

# Print the original text and the list of token IDs
print(f"Original: {example}\nToken IDs: {tokens['input_ids']}")

### ✏️ Try it Yourself
Use the tokenizer on your own short sentence.

In [None]:
# Try tokenizing your own sentence
your_sentence = ""

# Tokenize your sentence (convert to input IDs, pad/truncate as needed)
tokens = tokenizer(your_sentence, padding=True, truncation=True)

# Print the list of token IDs for your sentence
print(tokens['input_ids'])

## Part 4: Three Approaches to Text Classification
Now that you've learned how to tokenize text, it's time to train some actual classifiers!

In this section, we'll explore **three different ways** to build a sentiment classifier from scratch. You'll see how different models represent text and how well they perform, and we'll visualize each model's sentence representations using t-SNE. To keep things simple, these models split text on whitespace instead of using the DistilBERT tokenizer from Part 3.

Each method has a different strength, and we'll compare them side by side.

## Model Overview
| Model Type                        | Learns Word Order? | Custom Embeddings? | Fast to Train? |
|----------------------------------|--------------------|---------------------|----------------|
| 1. Bag-of-Words + Linear         | ❌ No              | ❌ No               | ✅ Yes         |
| 2. Embeddings + Mean Pooling     | ❌ No              | ✅ Yes              | ✅ Yes         |
| 3. GRU-Based Sequential          | ✅ Yes             | ✅ Yes              | ⚠️ Slower      |

### 📌 Teaching Goals:
- **Model 1**: Understand that you can classify text just by seeing which words are present.
- **Model 2**: Learn how models can learn word meanings via embeddings.
- **Model 3**: See how sequential models capture the order of words.

We'll train all three. To stay on schedule, the hands-on mini-challenges are limited to Models 2 and 3.

## 🧪 Model 1: Bag-of-Words + Linear Classifier
This is the simplest kind of text classifier. We convert each sentence into a binary vector that marks whether a word is present or not. Then we feed this vector into a linear layer that makes a prediction.

#### Model Objective:
`This is like checking which words are present. No word order, but fast and intuitive.`
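
As a small worked example of the idea above, the sketch below builds Bag-of-Words vectors by hand and pushes one through a linear score and a sigmoid. The four-word vocabulary `bow_demo_vocab`, the weights, and the helper names are all hypothetical and hand-picked for illustration; the real model learns its weights from the data.

```python
import math

# Hypothetical four-word vocabulary and hand-picked weights (not learned).
bow_demo_vocab = {"great": 0, "terrible": 1, "food": 2, "service": 3}
bow_demo_weights = [2.0, -2.0, 0.0, 0.0]
bow_demo_bias = 0.0

def toy_bow(sentence, vocab):
    """Return a 0/1 list marking which vocabulary words occur."""
    vector = [0.0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:
            vector[vocab[word]] = 1.0
    return vector

def toy_predict(vector, weights, bias):
    """Linear score followed by a sigmoid, giving a value between 0 and 1."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-score))

v1 = toy_bow("great food terrible service", bow_demo_vocab)
v2 = toy_bow("terrible food great service", bow_demo_vocab)
print(v1)        # [1.0, 1.0, 1.0, 1.0]
print(v1 == v2)  # True: both sentences get the same vector, word order is lost

prob = toy_predict(toy_bow("great food", bow_demo_vocab), bow_demo_weights, bow_demo_bias)
print(f"{prob:.3f}")  # 0.881
```

The two sentences mean different things, but the model sees identical inputs. That limitation is what Model 3 addresses.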

In [None]:
# needed imports for the next section
from torch.utils.data import DataLoader, TensorDataset
import torch
import torch.nn as nn
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Build a basic vocabulary from the training data
# This function counts the most common words in the dataset and assigns each a unique index.
def build_vocab(texts, max_vocab=1000):
    from collections import Counter
    word_counts = Counter()  # Create a counter for word frequencies
    for text in texts:
        word_counts.update(text.lower().split())  # Count each word in the text
    most_common = word_counts.most_common(max_vocab)  # Get the most frequent words
    vocab = {word: idx for idx, (word, _) in enumerate(most_common)}  # Map word to index
    return vocab

# This function converts a sentence into a Bag-of-Words vector (1 if word is present, 0 otherwise)
def bow_vector(sentence, vocab):
    vector = torch.zeros(len(vocab))  # Start with a vector of zeros (torch is imported at the top of this cell)
    for word in sentence.lower().split():
        if word in vocab:
            vector[vocab[word]] = 1.0  # Set to 1 if the word is in the vocab
    return vector

# Create the vocabulary from the training data
vocab = build_vocab(small_dataset['text'])


# Prepare Bag-of-Words features for all training sentences
X_bow = torch.stack([bow_vector(text, vocab) for text in small_dataset['text']])  # Shape: (num_examples, vocab_size)
# Prepare labels as a tensor (shape: num_examples x 1)
y_bow = torch.tensor(small_dataset['label'], dtype=torch.float32).unsqueeze(1) # Add a dimension for labels

# Set up mini-batch DataLoader for Bag-of-Words features
batch_size = 32  # Number of examples per batch
bow_dataset = TensorDataset(X_bow, y_bow)  # Pair features and labels
bow_loader = DataLoader(bow_dataset, batch_size=batch_size, shuffle=True)  # Shuffle for training

In [None]:
# Define a simple neural network: linear layer + sigmoid for binary classification
model1 = nn.Sequential(
    nn.Linear(len(vocab), 1),  # Input size = vocab size, output = 1 (probability)
    nn.Sigmoid()               # Output between 0 and 1
)

# Set up loss function (binary cross-entropy) and optimizer (Adam)
loss_fn = nn.BCELoss() # Binary Cross-Entropy Loss for binary classification
optimizer = torch.optim.Adam(model1.parameters(), lr=0.01, weight_decay=1e-4) # Adam optimizer with weight decay

# Train the model using mini-batch gradient descent
for epoch in range(100):
    total_loss = 0
    for X_batch, y_batch in bow_loader:
        y_pred = model1(X_batch)  # Forward pass: predict probabilities
        loss = loss_fn(y_pred, y_batch)  # Compute loss
        optimizer.zero_grad()  # Clear gradients
        loss.backward()        # Backpropagation
        optimizer.step()       # Update weights
        total_loss += loss.item() * X_batch.size(0)  # Accumulate loss
    if epoch % 10 == 0:
        avg_loss = total_loss / len(bow_loader.dataset)
        print(f"Epoch {epoch}, Loss: {avg_loss:.4f}")

In [None]:
# Visualize the Bag-of-Words vectors using t-SNE (2D projection)
bow_embeds = TSNE(n_components=2, random_state=42).fit_transform(X_bow.numpy())  # Reduce to 2D
colors = ['red' if label == 0 else 'blue' for label in small_dataset['label']]  # Color by label
plt.figure(figsize=(8, 6))
plt.scatter(bow_embeds[:, 0], bow_embeds[:, 1], c=colors, alpha=0.6)  # Plot points
plt.title("t-SNE of BOW Sentence Representations")
plt.grid(True)
plt.show()

## 🧪 Model 2: Trainable Embeddings + Mean Pooling
This model maps each word to a **learned embedding vector** (like a trainable word meaning). We take the average of all word vectors in a sentence and use that as our sentence representation.

### 📌 Teaching Goal:
`Now the model learns how to represent words in vector space.`
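
Here is a minimal sketch of mean pooling on a padded sequence, written in plain Python so the arithmetic is easy to follow. The 2-dimensional embeddings in `demo_embeddings` and the helper `mean_pool` are hypothetical. The sketch also assumes ID `0` is reserved for padding, which is a common convention but not how this notebook's vocabulary is set up.

```python
# Hypothetical 2-dimensional embeddings for a three-entry vocabulary.
# ID 0 is reserved here for padding (an assumption made for this sketch).
demo_embeddings = {
    0: [0.0, 0.0],  # padding
    1: [1.0, 3.0],
    2: [3.0, 1.0],
}

def mean_pool(ids, embeddings, ignore_id=None):
    """Average the embedding vectors of ids, optionally skipping one ID."""
    kept = [i for i in ids if i != ignore_id]
    dim = len(next(iter(embeddings.values())))
    if not kept:
        return [0.0] * dim
    return [sum(embeddings[i][d] for i in kept) / len(kept) for d in range(dim)]

padded_ids = [1, 2, 0, 0]  # two real words followed by two padding slots

plain = mean_pool(padded_ids, demo_embeddings)                # averages all 4 positions
masked = mean_pool(padded_ids, demo_embeddings, ignore_id=0)  # averages only real words

print(plain)   # [1.0, 1.0]
print(masked)  # [2.0, 2.0]
```

The plain average is pulled toward the padding vector as sentences get shorter. The model in this notebook uses the plain average over all positions; the masked version is shown only as a possible refinement.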

In [None]:
# Each word in the vocab gets a unique integer ID for embedding models.
# Dicts keep insertion order, so these IDs match the indices already stored in vocab.
token_to_id = {token: idx for idx, token in enumerate(vocab)}

# This function converts a sentence to a list of word IDs (only words in vocab)
def encode_sentence(sentence, vocab):
    return [vocab[word] for word in sentence.lower().split() if word in vocab]

# Convert all training sentences to lists of word IDs
encoded_data = [encode_sentence(text, token_to_id) for text in small_dataset['text']]
# Find the length of the longest sentence in the dataset
max_len = max(len(seq) for seq in encoded_data)
# Pad all sequences to the same length with zeros.
# Note: ID 0 is also the ID of the most common word, so the model cannot tell padding apart from that word.
padded_data = [seq + [0]*(max_len - len(seq)) for seq in encoded_data]

# Convert to tensors for PyTorch
X_tensor = torch.tensor(padded_data)  # Shape: (num_examples, max_len)
y_tensor = torch.tensor(small_dataset['label'], dtype=torch.float32).unsqueeze(1)  # Shape: (num_examples, 1)

# Set up mini-batch DataLoader for sequence models
batch_size = 32
seq_dataset = TensorDataset(X_tensor, y_tensor)  # Pair features and labels
seq_loader = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True)  # Shuffle for training

#### Mini-challenge! Loss function

In [None]:
# Define a simple mean-pooling embedding model
class MeanPoolModel(nn.Module):
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # Embedding layer
        self.fc = nn.Linear(emb_dim, 1)  # Output layer
    def forward(self, x):
        embeds = self.embedding(x)         # Shape: (batch, seq_len, emb_dim)
        pooled = embeds.mean(dim=1)        # Mean over sequence length (padding positions included)
        return torch.sigmoid(self.fc(pooled))  # Output probability

model2 = MeanPoolModel(vocab_size=len(vocab), emb_dim=50)

optimizer = torch.optim.Adam(model2.parameters(), lr=0.01, weight_decay=1e-4)

# mini-challenge! What type of loss function should we use for this model?
loss_fn = #### YOUR CODE HERE ####

Once you have entered your loss function in the code cell above, try training your model below!

In [None]:
# Train the mean-pooling model using mini-batch gradient descent
for epoch in range(100):
    total_loss = 0
    for X_batch, y_batch in seq_loader:
        y_pred = model2(X_batch)  # Forward pass: predict probabilities
        loss = loss_fn(y_pred, y_batch)  # Compute loss
        optimizer.zero_grad()  # Clear gradients
        loss.backward()        # Backpropagation
        optimizer.step()       # Update weights
        total_loss += loss.item() * X_batch.size(0)  # Accumulate loss
    if epoch % 10 == 0:
        avg_loss = total_loss / len(seq_loader.dataset)
        print(f"Epoch {epoch}, Loss: {avg_loss:.4f}")

### ✏️ Showcase: Predict on a New Sentence
Below is a function that:
1. Tokenizes your input sentence
2. Converts it to tensor format
3. Runs it through the model
4. Outputs the sentiment prediction

Try it with your own statement! (results may not be fantastic with our small dataset...)

In [None]:
def predict_sentiment(sentence):
    # Convert the input sentence to word IDs
    encoded = encode_sentence(sentence, token_to_id)
    # Pad the sequence to match training length
    padded = encoded + [0]*(max_len - len(encoded))
    x = torch.tensor([padded])  # Shape: (1, max_len)
    with torch.no_grad():       # Disable gradient calculation for inference
        prob = model2(x).item()  # Get the model's output (probability)
    label = "Positive" if prob >= 0.5 else "Negative"
    confidence = prob if prob >= 0.5 else 1 - prob
    print(f"\n📝 Input: {sentence}")
    print(f"🤖 Prediction: {label} ({confidence*100:.2f}% confidence)")
    print(f"📊 Raw probability (positive): {prob:.4f}")


# Enter your own sentence to test the model below!
predict_sentiment("I love this workshop!")

### Visualize the mean-pooled embeddings using t-SNE

In [None]:
# t-SNE on pooled embeddings (mean-pooled sentence representations)
with torch.no_grad():
    pooled_vectors = model2.embedding(X_tensor).mean(dim=1)  # Get mean embedding for each sentence
# Reduce the high-dimensional embeddings to 2D for visualization
reduced = TSNE(n_components=2, random_state=42).fit_transform(pooled_vectors.numpy())
# Assign colors based on sentiment label (red=negative, blue=positive)
colors = ['red' if label == 0 else 'blue' for label in small_dataset['label']]

# Plot the t-SNE visualization
plt.figure(figsize=(8, 6))
plt.scatter(reduced[:, 0], reduced[:, 1], c=colors, alpha=0.6)  # Plot points
plt.title("t-SNE of Mean Pooled Embeddings")
plt.grid(True)
plt.show()

## 🧪 Model 3: GRU-Based Sequential Classifier
This model introduces a **GRU (Gated Recurrent Unit)** to handle word order. Instead of treating words as a bag or averaging their embeddings, the GRU processes them **in sequence**.

### 📌 Teaching Goal:
`Now we can use the order of words to inform predictions.`
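
To see why a recurrent model is sensitive to word order, here is a minimal sketch of a GRU with a single scalar input and a single scalar hidden state. The weights in `params` are hand-picked and hypothetical, and bias terms are left out for brevity; `nn.GRU` applies the same kind of update with vectors, matrices, and learned values.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h, p):
    """One GRU update for scalar input x and scalar hidden state h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h)          # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h)          # reset gate
    n = math.tanh(p["wn"] * x + r * (p["un"] * h))  # candidate state
    return (1.0 - z) * n + z * h                    # blend candidate and old state

def run_gru(sequence, p):
    """Feed the sequence one element at a time and return the final state."""
    h = 0.0
    for x in sequence:
        h = gru_step(x, h, p)
    return h

# Hand-picked (hypothetical) weights; a real model would learn these.
params = {"wz": 0.5, "uz": 0.5, "wr": 0.5, "ur": 0.5, "wn": 1.0, "un": 1.0}

forward = [1.0, -1.0]               # toy 1-D "embeddings" of two words
backward = list(reversed(forward))  # the same words in the opposite order

same_mean = sum(forward) / len(forward) == sum(backward) / len(backward)
h_forward = run_gru(forward, params)
h_backward = run_gru(backward, params)

print(same_mean)              # True: mean pooling cannot tell the two apart
print(h_forward, h_backward)  # two different final hidden states
```

Because each step depends on the hidden state left by the previous one, swapping the inputs changes the result, while the average stays the same.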

#### Mini-Challenge! Optimizer

Complete the code below by choosing an optimizer, which updates the model's weights in response to the training data.

In [None]:
# Define a GRU-based model
class GRUModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # Word embeddings
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)  # GRU layer
        self.fc = nn.Linear(hidden_dim, 1)      # Output layer

    def forward(self, x):
        embeds = self.embedding(x)              # (batch, seq_len, emb_dim)
        _, h_n = self.gru(embeds)               # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.fc(h_n.squeeze(0)))  # Output probability

model3 = GRUModel(vocab_size=len(vocab), emb_dim=50, hidden_dim=64)
loss_fn = nn.BCELoss()

# Mini-challenge! What type of optimizer will you choose for this model?
# What hyperparameter values should we try?
optimizer = ### YOUR CODE HERE ###

# Use the same seq_loader as for mean pooling
# Mini-batch training loop
for epoch in range(100):
    total_loss = 0
    for X_batch, y_batch in seq_loader:
        y_pred = model3(X_batch) # Forward pass: predict probabilities
        loss = loss_fn(y_pred, y_batch.view(-1, 1).float()) # Compute loss
        optimizer.zero_grad() # Clear gradients
        loss.backward() # Backpropagation
        optimizer.step() # Update weights
        total_loss += loss.item() * X_batch.size(0) # Accumulate loss
    if epoch % 10 == 0:
        avg_loss = total_loss / len(seq_loader.dataset) # Average loss over the dataset
        print(f"Epoch {epoch}, Loss: {avg_loss:.4f}")

In [None]:
# t-SNE on final GRU hidden states
with torch.no_grad():
    embeds = model3.embedding(X_tensor)
    _, h_n = model3.gru(embeds)
    vectors = h_n.squeeze(0)

tsne = TSNE(n_components=2, random_state=42)
reduced = tsne.fit_transform(vectors.numpy())
colors = ['red' if label == 0 else 'blue' for label in small_dataset['label']]
plt.figure(figsize=(8, 6))
plt.scatter(reduced[:, 0], reduced[:, 1], c=colors, alpha=0.6)
plt.title("t-SNE of GRU Hidden State Representations")
plt.grid(True)
plt.show()

## Part 4b: Comparing Model Performance
Now that you've trained three different sentiment classifiers, let's compare how they perform on 200 reviews from the Yelp Polarity test split. We'll look at both their accuracy and F1-score, and discuss the strengths and weaknesses of each approach.
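
If accuracy and F1-score are new to you, the sketch below computes both by hand on six made-up predictions. The helper `accuracy_and_f1` and the toy labels are hypothetical; for binary labels it is meant to agree with scikit-learn's `accuracy_score` and `f1_score` at their default settings, where `1` is the positive class.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 for binary labels, treating 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1

toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]

toy_acc, toy_f1 = accuracy_and_f1(toy_true, toy_pred)
print(f"Accuracy={toy_acc:.3f}, F1={toy_f1:.3f}")  # Accuracy=0.667, F1=0.667
```

Accuracy counts every correct prediction equally, while F1 balances precision and recall for the positive class, so the two can diverge when the classes are imbalanced.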

### Model Comparison Table
| Model Type                        | Learns Word Order? | Custom Embeddings? | Fast to Train? | Typical Strengths                |
|-----------------------------------|--------------------|--------------------|---------------|----------------------------------|
| 1. Bag-of-Words + Linear          | ❌ No              | ❌ No              | ✅ Yes        | Simple, interpretable, fast      |
| 2. Embeddings + Mean Pooling      | ❌ No              | ✅ Yes             | ✅ Yes        | Learns word meaning, compact     |
| 3. GRU-Based Sequential           | ✅ Yes             | ✅ Yes             | ⚠️ Slower     | Captures word order and context  |

### Discussion
- **Bag-of-Words**: Fastest to train and easy to interpret, but ignores word order and subtle meaning.
- **Mean Pooling Embeddings**: Learns word representations, but still ignores order. Often a strong baseline.
- **GRU**: Can capture the order of words, which is important for nuanced sentiment, but takes longer to train.

Let's see how they compare on the test set!

In [None]:
# This section loads a sample of the official test set and evaluates all three models.
from sklearn.metrics import accuracy_score, f1_score

# Load 200 test samples from the Yelp Polarity test split
hf_test_dataset = load_dataset("yelp_polarity", split="test[:200]")
X_test = [ex['text'] for ex in hf_test_dataset]
y_test = [ex['label'] for ex in hf_test_dataset]

# --- Bag-of-Words ---
# Convert test sentences to Bag-of-Words vectors
X_test_bow = torch.stack([bow_vector(text, vocab) for text in X_test])
with torch.no_grad():
    y_pred_bow = (model1(X_test_bow) > 0.5).int().view(-1).numpy()
acc_bow = accuracy_score(y_test, y_pred_bow)
f1_bow = f1_score(y_test, y_pred_bow)

# --- Mean Pooling ---
# Convert test sentences to padded token IDs
encoded_test = [encode_sentence(text, token_to_id) for text in X_test]
padded_test = [seq + [0]*(max_len - len(seq)) if len(seq) < max_len else seq[:max_len] for seq in encoded_test]
X_test_tensor = torch.tensor(padded_test)
with torch.no_grad():
    y_pred_mean = (model2(X_test_tensor) > 0.5).int().view(-1).numpy()
acc_mean = accuracy_score(y_test, y_pred_mean)
f1_mean = f1_score(y_test, y_pred_mean)

# --- GRU ---
# Use the same padded test tensor as for mean pooling
with torch.no_grad():
    y_pred_gru = (model3(X_test_tensor) > 0.5).int().view(-1).numpy()
acc_gru = accuracy_score(y_test, y_pred_gru)
f1_gru = f1_score(y_test, y_pred_gru)

# Print results for all models
print(f"Bag-of-Words:     Accuracy={acc_bow:.3f}, F1={f1_bow:.3f}")
print(f"Mean Pooling:     Accuracy={acc_mean:.3f}, F1={f1_mean:.3f}")
print(f"GRU Sequential:   Accuracy={acc_gru:.3f}, F1={f1_gru:.3f}")

### Summary of Results
- **Bag-of-Words**: Quick and interpretable, but may miss subtle sentiment cues because it ignores word order and context.
- **Mean Pooling Embeddings**: Can capture more nuance by learning word meanings, but still ignores order.
- **GRU**: Designed to handle word order and complex sentences, but requires more computation and typically more training data.

Keep in mind that we trained on only 200 reviews, so the scores above are noisy and the ranking you observe may not match these general expectations.

In practice, the best model depends on your needs: for speed and simplicity, Bag-of-Words or Mean Pooling may suffice; for more nuanced understanding, sequential models like GRU are often preferred.

## Part 5: Try Another Dataset (Optional Challenge)
Hugging Face provides many sentiment datasets you can explore. Here are a few suggestions:

| Dataset         | Description                              | Load Command                              |
|-----------------|------------------------------------------|-------------------------------------------|
| IMDb            | Full movie reviews (binary)              | `load_dataset("imdb")`                    |
| Rotten Tomatoes | Positive/negative movie review sentences | `load_dataset("rotten_tomatoes")`         |
| TweetEval       | Tweets (3-class sentiment)               | `load_dataset("tweet_eval", "sentiment")` |

Try loading one and tokenizing the first few entries!

In [None]:
# This cell shows how to load a different sentiment dataset from Hugging Face
alt_dataset = load_dataset("imdb")
print(alt_dataset['train'][0])

In [None]:
# Disconnect the Colab runtime to free resources once you are finished

from google.colab import runtime
runtime.unassign()