# Text Classification Lab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HassanAlgoz/B5/blob/main/W5_NLP/M3/labs/01_Text_Classification.ipynb)

## Overview

This notebook explores three different approaches to text classification using pre-trained models:
1. **Task-specific models**: Using models fine-tuned for sentiment analysis
2. **Embedding models + Classifier**: Using general-purpose embeddings with a trained classifier
3. **Embedding models + Cosine Similarity**: Zero-shot classification without labeled data

We'll work with the Rotten Tomatoes movie review dataset to classify reviews as positive or negative.

## Learning Objectives

By the end of this notebook, you will be able to:
- Load and explore text classification datasets
- Use pre-trained task-specific models for classification via Hugging Face pipelines
- Generate embeddings using sentence-transformers models
- Train a classifier on top of embeddings
- Perform zero-shot classification using cosine similarity
- Evaluate classification performance using appropriate metrics
- Understand the trade-offs between different classification approaches

## Glossary of Terms

- **Task-specific Model**: A model that has been fine-tuned for a specific task (e.g., sentiment analysis)
- **Embedding**: A dense vector representation of text that captures semantic meaning
- **Zero-shot Classification**: Classifying data without training on labeled examples
- **Cosine Similarity**: A measure of similarity between two vectors based on the angle between them
- **Pipeline**: A high-level API that combines model loading, tokenization, and inference
- **Sentence Transformers**: Models specifically designed to generate sentence-level embeddings

## Outline

1. **Section A**: Using a Task-specific Model
   - Predict → Run → Investigate → Modify
2. **Section B**: Using an Embedding Model + Classifier Head
   - Predict → Run → Investigate → Modify
3. **Section C**: Using Embedding Model + Cosine Similarity (Zero-shot)
   - Predict → Run → Investigate → Modify
4. **Make Phase**: Create your own classification system
5. **Summary and Next Steps**

## Getting Started: Loading the Dataset

Let's start by loading the Rotten Tomatoes dataset. This dataset contains movie reviews labeled as positive or negative.

In [None]:
from datasets import load_dataset

data = load_dataset("rotten_tomatoes")
data

### Investigate: Explore the Dataset

**Exercise**: Before running the code below, predict what you think the structure of the data will be:
- What keys will be in each example?
- What will the labels look like (what values will they have)?
- How many examples are in train vs test?

Now let's examine the data:
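
One way to examine a split is to count its labels and measure how long the reviews are. The sketch below is a minimal illustration: `summarize_split` is a hypothetical helper, and the three hand-written reviews are stand-ins so the block runs without downloading anything. It assumes each example has a `text` field and an integer `label` field.

```python
from collections import Counter

def summarize_split(texts, labels):
    """Return the number of examples, label counts, and mean review length in words."""
    label_counts = Counter(labels)
    mean_words = sum(len(t.split()) for t in texts) / len(texts)
    return {
        "num_examples": len(texts),
        "label_counts": dict(label_counts),
        "mean_words": mean_words,
    }

# Hypothetical sample that mimics the dataset's two fields: "text" and "label"
sample_texts = [
    "a gorgeous , witty , seductive movie .",
    "it's a tedious mess .",
    "funny and touching .",
]
sample_labels = [1, 0, 1]

summary = summarize_split(sample_texts, sample_labels)
print(summary)
```

In the notebook, the same helper could be called with `data["train"]["text"]` and `data["train"]["label"]`.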

---

## Section A: Using a Task-specific Model

### Introduction to Hugging Face Transformers and Pipelines

Before we dive into classification, let's get familiar with **Hugging Face Transformers** - one of the most popular libraries for working with pre-trained language models.

#### What is Hugging Face Transformers?

**Hugging Face Transformers** is a Python library that provides easy access to thousands of pre-trained models for Natural Language Processing (NLP). These models have been trained on massive amounts of text data and can understand language patterns, making them incredibly powerful for various tasks like:
- Text classification (sentiment analysis, spam detection, etc.)
- Question answering
- Text generation
- Translation
- And much more!

#### What is a Pipeline?

A **pipeline** is Hugging Face's high-level API that makes it incredibly easy to use pre-trained models. Think of it as a "one-stop shop" that handles all the complex steps for you:

1. **Loading the model**: Downloads and loads the pre-trained model
2. **Tokenization**: Converts text into numbers the model can understand
3. **Inference**: Runs the model to make predictions
4. **Post-processing**: Formats the output in a readable way
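
To make the last step more concrete, here is a small sketch of post-processing only: turning raw model scores (logits) into the list-of-dicts format that a classification pipeline prints. The logits and the `id2label` mapping are made-up values, not output from a real model, and `postprocess` is a hypothetical helper rather than part of the Transformers API.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label):
    """Pair each class label with its probability, mimicking pipeline-style output."""
    probs = softmax(logits)
    return [{"label": id2label[i], "score": p} for i, p in enumerate(probs)]

# Hypothetical values for illustration only
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-2.0, 3.0]

scores = postprocess(logits, id2label)
best = max(scores, key=lambda d: d["score"])
print(scores)
print(best["label"])
```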

**Why use pipelines?**
- **Simplicity**: You can classify text in just a few lines of code
- **No deep learning knowledge required**: The pipeline handles all the technical details
- **Consistent interface**: Same API for different models and tasks
- **Production-ready**: Optimized for real-world use

#### A Simple Example

Here's what using a pipeline looks like (we'll see this in action soon):

```python
from transformers import pipeline

# Create a sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")

# Use it!
result = classifier("I love this movie!")
print(result)
# Example output (the exact score may differ): [{'label': 'POSITIVE', 'score': 0.9998}]
```

That's it! No model architecture knowledge, no tokenization code, no manual inference - just simple, powerful text classification.

Now let's use this powerful tool to classify our movie reviews!

### Predict Phase

**Before running the code below, think about:**
1. What do you think `pipeline` does? What are its advantages?
2. What does `return_all_scores=True` mean?
3. Why might we specify `device="cuda"`?
4. What will the output format look like?

### Run Phase

Now let's create our pipeline. We'll use a specific model that's been trained on Twitter data for sentiment analysis:

In [None]:
from transformers import pipeline

# Path to our Hugging Face model
# This model was trained on Twitter data for sentiment analysis
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Create a pipeline for sentiment analysis
# - model: specifies which pre-trained model to use
# - tokenizer: converts text to numbers (usually same as model name)
# - return_all_scores: returns scores for all classes, not just the top one
# - device: "cuda" for GPU (faster), "cpu" for CPU (works everywhere)
pipe = pipeline(
    "sentiment-analysis",  # The task we want to perform
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda"  # Change to "cpu" if you don't have a GPU
)

**Investigate Further**: 
- What is the distribution of labels in the training set? (Hint: use `data["train"]["label"]`)
- How long are the reviews on average? (Hint: check the length of text strings)
- Are there any patterns you notice between positive and negative reviews?

Now let's run inference on the entire test set:

In [None]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)

**Investigate**: 
- Why do we use `output[0]` and `output[2]`? What is `output[1]`?
- What does `np.argmax` do? Why do we use it here?
- What are the possible values in `y_pred`? How do they map to positive/negative?
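
The index-based lookup above depends on the order in which the model lists its classes. A sketch of a name-based alternative follows. The `output` list is a hand-written example in the shape the pipeline returns when all scores are requested, and the label names `negative`, `neutral`, and `positive` are assumed to match this model's configuration (inspect `pipe.model.config.id2label` to confirm). `to_binary` is a hypothetical helper.

```python
def to_binary(output):
    """Collapse three-class sentiment scores to 0 (negative) or 1 (positive), ignoring neutral."""
    scores = {item["label"].lower(): item["score"] for item in output}
    return int(scores["positive"] > scores["negative"])

# Hand-written example output (not produced by a real model run)
output = [
    {"label": "negative", "score": 0.05},
    {"label": "neutral", "score": 0.15},
    {"label": "positive", "score": 0.80},
]
print(to_binary(output))
```

Because the scores are looked up by name, the result does not change if the classes arrive in a different order.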

In [None]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

Now let's evaluate the performance:

In [None]:
evaluate_performance(data["test"]["label"], y_pred)
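
The report lists precision, recall, and F1 for each class. As a rough sketch of where those numbers come from, the pure-Python helper below computes them for one class from toy labels. `binary_metrics` is a hypothetical helper, not part of scikit-learn, and the toy labels are invented for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only
y_true_toy = [1, 1, 1, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0]

precision, recall, f1 = binary_metrics(y_true_toy, y_pred_toy)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```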

### Investigate Phase

**Exercise 1**: Test the pipeline on a single example. What does the output structure look like?
```python
# Try this:
test_text = "This movie is absolutely fantastic!"
result = pipe(test_text)
print(result)
```

**Exercise 2**: What are the different scores returned? What do they represent?
- Try running the pipeline on different texts (positive, negative, neutral)
- What do the scores tell you about the model's confidence?

**Exercise 3**: Why do we use `KeyDataset`? What would happen if we passed the data directly?
- `KeyDataset` is a helper that extracts the "text" field from each example
- It makes the pipeline work seamlessly with Hugging Face datasets
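
Conceptually, `KeyDataset` is a thin wrapper that returns one field of each example. The class below is a simplified, hypothetical stand-in (`SimpleKeyDataset`) written only to show the idea; it is not the library's actual implementation.

```python
class SimpleKeyDataset:
    """Expose a single field of each example in a list-like dataset."""

    def __init__(self, dataset, key):
        self.dataset = dataset
        self.key = key

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        return self.dataset[index][self.key]

# Toy rows shaped like the dataset's examples
rows = [
    {"text": "a moving film", "label": 1},
    {"text": "flat and lifeless", "label": 0},
]
texts_only = SimpleKeyDataset(rows, "text")
print(len(texts_only), texts_only[0])
```

Without such a wrapper, each item handed to the pipeline would be a whole dictionary (text and label together) rather than the plain string it expects.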

**Note**: The model we selected was trained on tweets, not movie reviews. One way to improve performance is to choose a model trained on data from our domain, such as DistilBERT base uncased fine-tuned on SST-2, a sentiment dataset built from movie reviews.

---

## Section B: Using an Embedding Model + Classifier Head

### Introduction to Sentence Transformers

What if we cannot find a model that was already trained for this specific task? Do we need to fine-tune a representation model ourselves? The answer is no!

Fine-tuning the model yourself can be worthwhile when you have enough compute available, but not everyone does. This is where general-purpose embedding models come in.

#### What is Sentence Transformers?

**Sentence Transformers** is a Python library built on top of Hugging Face Transformers that specializes in creating **embeddings** - numerical representations of text that capture semantic meaning.

#### What are Embeddings?

Think of embeddings as a way to convert text into a list of numbers (a vector) that captures the **meaning** of the text. For example:
- Similar texts get similar numbers (vectors close together)
- Different texts get different numbers (vectors far apart)
- The numbers capture semantic relationships (e.g., "king" and "queen" are closer than "king" and "car")

**Why are embeddings useful?**
- **Universal representation**: Any text can be converted to the same format (a vector of numbers)
- **Semantic understanding**: The numbers capture meaning, not just words
- **Flexibility**: You can use embeddings for many different tasks (classification, search, clustering, etc.)
- **Efficiency**: Once you have embeddings, you can use simple, fast classifiers on top

#### How Sentence Transformers Works

1. **Input**: Your text (e.g., "This movie is fantastic!")
2. **Processing**: The model converts it to a vector of numbers (e.g., 768 numbers)
3. **Output**: A dense vector that represents the semantic meaning

The model `sentence-transformers/all-mpnet-base-v2` we'll use:
- Maps sentences & paragraphs to a **768-dimensional** dense vector space
- Each dimension captures some aspect of the text's meaning
- Can be used for tasks like clustering, semantic search, or (as we'll see) classification

#### The Strategy: Embeddings + Classifier

Instead of using a task-specific model, we'll:
1. **Convert text to embeddings** using Sentence Transformers (frozen, no training needed)
2. **Train a simple classifier** (like Logistic Regression) on top of these embeddings

This approach gives us:
- ✅ Flexibility to adapt to any classification task
- ✅ Fast training (only the classifier needs training, not the embedding model)
- ✅ Good performance with less computational resources
- ✅ Ability to reuse embeddings for multiple tasks

### Predict Phase

**Before running the code below, think about:**
1. What is an embedding? What does "768 dimensional dense vector space" mean?
2. Why would we use embeddings + a classifier instead of a task-specific model?
3. What type of classifier do you think would work well on top of embeddings? Why?
4. What are the advantages and disadvantages of this approach compared to Section A?


### Run Phase

Let's load a Sentence Transformer model and convert our text to embeddings:

In [None]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained Sentence Transformer model
# This model converts text into 768-dimensional vectors
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert our text data to embeddings
# Each review becomes a vector of 768 numbers
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

Let's check the shape of our embeddings to understand what we've created:

In [None]:
train_embeddings.shape

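Now let's train a simple classifier on top of these embeddings. We'll use Logistic Regression, a fast, interpretable classifier that works well with embeddings. Once it is trained, we predict on the test embeddings and evaluate:

In [None]:
from sklearn.linear_model import LogisticRegression

# Train a logistic regression classifier on the training embeddings.
# Assumption: the notebook never defines `clf`, so this step is reconstructed with default hyperparameters.
clf = LogisticRegression(random_state=42).fit(train_embeddings, data["train"]["label"])

# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)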

### Investigate Phase

**Exercise 1**: Examine the shape of the embeddings. What does each dimension represent?
- The shape tells us: `(number of examples, embedding dimension)`
- Each row is one review, each column is one dimension of meaning

**Exercise 2**: Try encoding a single sentence and examine its embedding. What do you notice about the values?
```python
# Try this:
single_embedding = model.encode("This is a test sentence")
print(f"Shape: {single_embedding.shape}")
print(f"Sample values: {single_embedding[:10]}")  # a single sentence yields a 1-D vector
print(f"Min: {single_embedding.min()}, Max: {single_embedding.max()}")
```

**Exercise 3**: Compare embeddings of similar vs different sentences. What patterns do you see?
```python
# Try this:
similar1 = model.encode("I love this movie")
similar2 = model.encode("This film is amazing")
different = model.encode("The weather is nice today")

# Calculate cosine similarity (we'll learn about this in Section C)
from sklearn.metrics.pairwise import cosine_similarity
print("Similar sentences:", cosine_similarity([similar1], [similar2])[0][0])
print("Different sentences:", cosine_similarity([similar1], [different])[0][0])
```

**Investigate**: 
- Why did we choose Logistic Regression? What other classifiers could we use?
- How long did training take compared to using a task-specific model?
- What are the advantages of this approach?

### Modify Phase (Section A: Task-specific Model)

**Exercise 1**: Try a different task-specific model. Replace `"cardiffnlp/twitter-roberta-base-sentiment-latest"` with `"distilbert-base-uncased-finetuned-sst-2-english"`. How does the performance compare?

**Exercise 2**: Modify the code to handle cases where the model might return scores in a different order. Make the code more robust by finding the label names dynamically.

**Exercise 3**: Add code to show some example predictions (both correct and incorrect) to understand where the model struggles.

**Result**: By training a classifier on top of our embeddings, we managed to get an F1 score of 0.85! This demonstrates the possibilities of training a lightweight classifier while keeping the underlying embedding model frozen.

### Modify Phase

**Exercise 1**: Try different classifiers from scikit-learn (e.g., `RandomForestClassifier`, `SVC`, `GaussianNB`). Compare their performance and training time.

**Exercise 2**: Experiment with different embedding models. Try `sentence-transformers/all-MiniLM-L6-v2` (smaller, faster) and compare it with `sentence-transformers/all-mpnet-base-v2` (the current model). How do they compare?

**Exercise 3**: Tune the regularization of the Logistic Regression. Try different `C` values (a smaller `C` means stronger regularization) and see how they affect performance.

> TIP: In this example, we used sentence-transformers to extract our embeddings, which benefits from a GPU to speed up inference. However, we can remove this GPU dependency by using an external API to create the embeddings. Popular choices for generating embeddings are Cohere’s and OpenAI’s offerings. As a result, this would allow the pipeline to run entirely on the CPU.

---

## Section C: Using Just the Embedding Model (Headless) + Cosine Similarity (Zero-shot)

**What If We Do Not Have Labeled Data?**

Getting labeled data is a resource-intensive task that can require significant human labor. Moreover, is it actually worthwhile to collect these labels?

To perform **zero-shot classification** with embeddings, there is a neat trick that we can use. We can describe our labels based on what they should represent. For example, a negative label for movie reviews can be described as “This is a negative movie review.” By describing and embedding the labels and documents, we have data that we can work with. This process, as illustrated in Figure 4-14, allows us to generate our own target labels without the need to actually have any labeled data.

<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098150952/files/assets/holl_0414.png" alt="Figure 4-14. To embed the labels, we first need to give them a description, such as “a negative movie review.” This can then be embedded through sentence-transformers.">

Figure 4-14. To embed the labels, we first need to give them a description, such as “a negative movie review.” This can then be embedded through sentence-transformers.


In [None]:
# Create embeddings for our labels
label_embeddings = model.encode([
    "A negative review",
    "A positive review"
])

<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098150952/files/assets/holl_0415.png">

Figure 4-15. The cosine similarity is the cosine of the angle between two vectors or embeddings. In this example, we calculate the similarity between a document and the two possible labels, positive and negative.


<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098150952/files/assets/holl_0416.png" />

Figure 4-16. After embedding the label descriptions and the documents, we can use cosine similarity for each label document pair.
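
For intuition, the similarity-then-argmax step can be written out by hand. The sketch below uses made-up 3-dimensional vectors in place of real 768-dimensional embeddings, and `cosine` is a hypothetical helper; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only
label_vectors = [
    [1.0, 0.0, 0.2],  # stands in for the "negative" label embedding (index 0)
    [0.0, 1.0, 0.2],  # stands in for the "positive" label embedding (index 1)
]
document_vector = [0.1, 0.9, 0.3]

similarities = [cosine(document_vector, label) for label in label_vectors]
predicted = similarities.index(max(similarities))  # plays the role of np.argmax
print(similarities, predicted)
```

The predicted class is simply the index of the label description whose vector points in the most similar direction to the document's vector.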


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

And that is it! We only needed to come up with names for our labels to perform our classification tasks. Let’s see how well this method works:

In [None]:
evaluate_performance(data["test"]["label"], y_pred)

### Modify Phase

#### Improve our label embeddings

Let's try improving our label embeddings by:
1. making the descriptions more polar by adding the word "very", and
2. making them more specific by adding the word "movie".

In [None]:
# Create embeddings for our labels
label_embeddings = model.encode([
    "A very negative movie review",
    "A very positive movie review"
])

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [None]:
evaluate_performance(data["test"]["label"], y_pred)

Do you notice the performance increase?

> [The author](https://jalammar.github.io/) notes that using NLI-based [zero-shot classification](https://huggingface.co/tasks/zero-shot-classification) **is better than using embedding models**. However, this was done to illustrate the **versatility of embedding models**. We will look at **Natural Language Inference (NLI)** in the next notebook Inshallah.

### Investigate Phase

**Exercise 1**: What is the shape of `label_embeddings`? Why does it have this shape?

**Exercise 2**: Calculate the cosine similarity between the two label embeddings. What does this tell you about how similar "negative review" and "positive review" are in the embedding space?

**Exercise 3**: Try different label descriptions. How do they affect the embeddings?