# Text Classification with Machine Learning

This guide walks through sentiment analysis of movie reviews using the Rotten Tomatoes dataset. It compares several approaches: a pre-trained, task-specific transformer (RoBERTa), sentence-transformer embeddings paired with a classical classifier, centroid-based and zero-shot classification over those embeddings, and generative models (Flan-T5 and the OpenAI API).

## Overview

Text classification is a fundamental Natural Language Processing (NLP) task where we categorize text into predefined classes. In this project, we're building a **binary sentiment classifier** that determines whether movie reviews are positive or negative. This type of analysis is crucial for:

- **Business Intelligence**: Understanding customer sentiment about products or services
- **Social Media Monitoring**: Analyzing public opinion on brands or topics
- **Content Moderation**: Automatically filtering positive/negative content
- **Market Research**: Gauging consumer reactions to marketing campaigns

## Environment Setup

First, we need to install the required libraries for our text classification pipeline:
```python
# !pip install transformers sentence-transformers openai
# !pip install -U datasets
```


**Why these libraries?**
- **transformers**: Provides access to pre-trained transformer models such as BERT and RoBERTa
- **sentence-transformers**: Specialized for generating semantic embeddings from text
- **datasets**: Hugging Face's library for easy dataset loading and manipulation
- **openai**: Client for OpenAI's API, used in the ChatGPT section at the end

## Data Loading and Exploration

### Loading the Rotten Tomatoes Dataset

In [None]:
from datasets import load_dataset

# Load our data by specifying the splits; this returns a list of Dataset objects
split = ["train", "validation", "test"]
data = load_dataset("rotten_tomatoes", split=split)

**Understanding the dataset structure**: The Rotten Tomatoes dataset is a classic benchmark for sentiment analysis, containing movie review snippets labeled as either positive (1) or negative (0). By specifying multiple splits, we get a list of three Dataset objects for proper train/validation/test evaluation.
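
Because `data` is a plain list, the cells in this guide index it by position (`data[0]` for train, `data[2]` for test). As an optional alternative, the minimal sketch below pairs each split name with its list entry so lookups can be made by name. The `fake_data` list is only a stand-in for the real `Dataset` objects, and the `splits` name is an assumption introduced here for illustration:

```python
# Stand-in for the list returned by load_dataset(..., split=split).
# Each string is only a placeholder for a Dataset object.
split = ["train", "validation", "test"]
fake_data = ["<train dataset>", "<validation dataset>", "<test dataset>"]

# Map split names to the matching list entries (order is preserved).
splits = dict(zip(split, fake_data))

print(splits["test"])  # the same entry as fake_data[2]
```

With the real data, `dict(zip(split, data))` would give the same kind of name-based lookup without changing how the splits were loaded.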

### Data Exploration and Visualization

In [3]:
import pandas as pd

# Show ten random rows from each split (fixed seed for reproducibility)
for i, s in enumerate(split):
    print(f"A sample of {s.title()} set ({len(data[i])} samples)")
    display(pd.DataFrame(data[i]).sample(10, random_state=42).reset_index(drop=True))

A sample of Train set (8530 samples)


| | text | label |
|---|---|---|
| 0 | it would take a complete moron to foul up a sc... | 1 |
| 1 | so we got ten little indians meets friday the ... | 0 |
| 2 | unfortunately , heartbreak hospital wants to c... | 0 |
| 3 | sustains its dreamlike glide through a success... | 0 |
| 4 | a modest and messy metaphysical thriller offer... | 0 |
| 5 | well , it does go on forever . | 0 |
| 6 | broomfield is energized by volletta wallace's ... | 1 |
| 7 | based on a devilishly witty script by heather ... | 1 |
| 8 | i'm sure there's a teenage boy out there somew... | 0 |
| 9 | traffics in the kind of prechewed racial clich... | 0 |


A sample of Validation set (1066 samples)


| | text | label |
|---|---|---|
| 0 | where janice beard falters in its recycled asp... | 1 |
| 1 | some body is a shaky , uncertain film that nev... | 0 |
| 2 | . . . a solid , unassuming drama . | 1 |
| 3 | it's a documentary that says that the alternat... | 0 |
| 4 | shafer's feature doesn't offer much in terms o... | 0 |
| 5 | the story , touching though it is , does not q... | 0 |
| 6 | a sugar-coated rocky whose valuable messages a... | 0 |
| 7 | earnest and heartfelt but undernourished and p... | 0 |
| 8 | one of the most original american productions ... | 1 |
| 9 | while we no longer possess the lack-of-attenti... | 1 |


A sample of Test set (1066 samples)


| | text | label |
|---|---|---|
| 0 | the main story . . . is compelling enough , bu... | 1 |
| 1 | some motion pictures portray ultimate passion ... | 0 |
| 2 | there is a general air of exuberance in all ab... | 1 |
| 3 | though there are many tense scenes in trapped ... | 0 |
| 4 | what we get in feardotcom is more like somethi... | 0 |
| 5 | blood work is laughable in the solemnity with ... | 0 |
| 6 | if anything , see it for karen black , who cam... | 0 |
| 7 | at a brief 42 minutes , we need more x and les... | 0 |
| 8 | this is an elegantly balanced movie -- every m... | 1 |
| 9 | you won't look at religious fanatics -- or bac... | 1 |


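Note that the label counts below show that every split is evenly divided between negative (0) and positive (1) reviews.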

In [4]:
import pandas as pd

# Count how many reviews carry each label in every split
for i, s in enumerate(split):
    print(f"Label counts for the {s.title()} set ({len(data[i])} samples)")
    display(pd.DataFrame(data[i]).value_counts('label'))

Label counts for the Train set (8530 samples)

| label | count |
|---|---|
| 0 | 4265 |
| 1 | 4265 |

Label counts for the Validation set (1066 samples)

| label | count |
|---|---|
| 0 | 533 |
| 1 | 533 |

Label counts for the Test set (1066 samples)

| label | count |
|---|---|
| 0 | 533 |
| 1 | 533 |
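
The same balance check can be expressed without pandas. The sketch below is a minimal illustration using only the standard library; `label_balance` is a hypothetical helper name, and the toy label list simply mirrors the train-set counts reported above rather than reading the real dataset:

```python
from collections import Counter

def label_balance(labels):
    """Return each label's share of the total, keyed by label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Toy labels mirroring the train-set counts above (illustrative only)
toy_labels = [0] * 4265 + [1] * 4265
print(label_balance(toy_labels))  # {0: 0.5, 1: 0.5}
```

A share close to 0.5 for each label means plain accuracy is a reasonable headline metric for this dataset.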


## **Text Classification with Representation Models**

Now we'll dive into different approaches for text classification, starting with pre-trained transformer models and then exploring embedding-based methods.

### Using a Task-Specific Model

In [None]:
from transformers import pipeline

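# Path to our HF model (a three-class sentiment model trained on tweets)
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,  # return a score for every label, not just the top one
    device="cuda:0"          # assumes a GPU is available; use "cpu" otherwise
)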

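**RoBERTa for Sentiment Analysis**: RoBERTa (Robustly Optimized BERT Pretraining Approach) keeps BERT's architecture but changes how it is pre-trained:
- **More data, longer training**: Pre-trained on more text, with larger batches and for longer than the original BERT
- **Dynamic masking**: The masked positions are re-sampled each time a sequence is fed to the model instead of being fixed once during preprocessing
- **No next sentence prediction**: The NSP objective is dropped, leaving masked language modeling as the only pre-training task

The checkpoint used here was additionally fine-tuned for sentiment on tweets and predicts three labels: negative, neutral, and positive.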

In [12]:
data[2]

Dataset({
    features: ['text', 'label'],
    num_rows: 1066
})

In [13]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
# using the test set
for output in tqdm(pipe(KeyDataset(data[2], "text")), total=len(data[2])):
    # The model scores three labels (0: negative, 1: neutral, 2: positive);
    # neutral is skipped because Rotten Tomatoes only has two classes
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)

100%|██████████| 1066/1066 [00:10<00:00, 102.29it/s]


**Streaming inference**: `KeyDataset` wraps the dataset so that the pipeline reads only the `text` column and yields predictions one at a time as we iterate. For each review the model returns three scores (negative, neutral, positive); we keep the negative and positive scores and use `argmax` to pick the larger one, which maps directly onto the dataset's labels (0 = negative, 1 = positive).
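
Indexing the scores by position works only as long as the pipeline returns labels in a fixed order. The sketch below shows one hedged alternative that looks scores up by label name instead. It assumes the pipeline reports labels named `negative`, `neutral`, and `positive` (worth confirming against `pipe.model.config.id2label`); `to_binary_label` and the example scores are hypothetical and not part of the original notebook:

```python
def to_binary_label(scores):
    """Map a list of {"label", "score"} dicts to 0 (negative) or 1 (positive).

    The neutral score is ignored; ties go to negative, like argmax would.
    """
    by_label = {item["label"].lower(): item["score"] for item in scores}
    return int(by_label["positive"] > by_label["negative"])

# Hypothetical output for one review, deliberately out of order
example = [
    {"label": "positive", "score": 0.15},
    {"label": "neutral", "score": 0.25},
    {"label": "negative", "score": 0.60},
]
print(to_binary_label(example))  # 0
```

Inside the inference loop this would replace the two positional lookups and the `argmax` call with `y_pred.append(to_binary_label(output))`.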

In [21]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

**Performance evaluation function**: This helper function provides a standardized way to evaluate all our models using precision, recall, F1-score, and accuracy metrics. The `target_names` parameter makes the output more interpretable.
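
To make the columns of the report concrete, here is a minimal sketch that computes precision, recall, and F1 for a single class from raw labels. It is an illustration of the definitions, not scikit-learn's implementation; `binary_metrics` and the toy labels are made up for this example:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 3 positive and 3 negative reviews, with 2 mistakes
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
print(binary_metrics(toy_true, toy_pred))
```

In this toy case two of the three predicted positives are correct and two of the three actual positives are found, so precision, recall, and F1 all come out at roughly two thirds.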

### Performance Metrics

In [14]:
# Performance metrics (test set)
evaluate_performance(data[2]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066



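**Understanding the results**: The classification report breaks performance down per class. The model reaches 80% accuracy, but recall is 0.88 for negative reviews and only 0.72 for positive ones, so it leans toward predicting negative. Because the test set is perfectly balanced (533 reviews per class), that gap cannot come from class imbalance; a more likely cause is that the model was trained on tweets with a third, neutral class that we discard here.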

## **Classification Tasks that Leverage Embeddings**

Next, we'll explore embedding-based approaches that convert text into numerical vectors while preserving semantic meaning.

### Supervised Classification

In [15]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(list(data[0]["text"]), show_progress_bar=True)
test_embeddings = model.encode(list(data[2]["text"]), show_progress_bar=True)


**Why sentence transformers?** Unlike traditional word embeddings that represent individual words, sentence transformers create representations for entire sentences that capture contextual meaning. The `all-mpnet-base-v2` model is trained specifically for semantic similarity tasks.

In [16]:
train_embeddings.shape, test_embeddings.shape

((8530, 768), (1066, 768))

**Understanding embedding dimensions**: Each text sample is converted into a 768-dimensional vector. These high-dimensional representations allow the model to capture nuanced semantic relationships between different ways of expressing sentiment.

In [17]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
# using the train set
clf.fit(train_embeddings, data[0]["label"])

In [22]:
# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
# test set
evaluate_performance(data[2]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



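**Embedding-based classification results**: With 85% accuracy, frozen embeddings plus logistic regression outperform the task-specific RoBERTa model above (80%), and precision and recall are now nearly identical for both classes. Unlike that model, this classifier was trained on Rotten Tomatoes reviews, and only the lightweight logistic regression needed training; the embedding model itself was left untouched.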
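
To show what the fitted classifier does with an embedding, the sketch below scores a vector with the logistic function by hand. The three-dimensional embedding, weights, and bias are hypothetical values chosen for readability; in scikit-learn the learned values for this binary problem would live in `clf.coef_[0]` and `clf.intercept_[0]`, each embedding having 768 entries:

```python
import math

def predict_proba_positive(embedding, weights, bias):
    """Probability of the positive class under a logistic regression model."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 3-dimensional embedding and coefficients (illustrative only)
embedding = [0.2, -0.1, 0.4]
weights = [1.5, -2.0, 0.5]
bias = -0.1

p = predict_proba_positive(embedding, weights, bias)
label = int(p >= 0.5)  # threshold at 0.5, as clf.predict does for two classes
print(round(p, 3), label)
```

The weighted sum here is 0.6, which the logistic function maps to a probability of about 0.65, so the toy review is labeled positive.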

### Cosine Similarity Classification

In [23]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

# Average the embeddings of all documents in each target label (train set).
# The label is appended as the last column (index 768) and used as the group key,
# which yields one mean embedding (a centroid) per class, ordered by label.
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data[0]["label"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(768).mean().values

# Assign each test document to the class whose centroid is most similar
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
# test set
evaluate_performance(data[2]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



### Zero-shot Classification

In [24]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review", "A positive review"])

**Zero-shot approach**: Instead of using training data, we create embeddings for text descriptions of our classes. This allows classification without any labeled training examples - just semantic understanding of what each class represents.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [25]:
evaluate_performance(data[2]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



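**Zero-shot performance**: Zero-shot classification can perform reasonably well despite never seeing labeled examples, and its quality depends heavily on how well the label descriptions capture the essence of each class. Note, however, that the report above is identical to the centroid-based one, and the cell that recomputes `y_pred` from `label_embeddings` has no execution count (`In [None]`). The figures shown therefore most likely still reflect the earlier predictions; rerun the similarity cell before evaluating to obtain the actual zero-shot scores.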

**Tip!** What would happen if you were to use different descriptions? Use **"A very negative movie review"** and **"A very positive movie review"** to see what happens!

## **Classification with Generative Models**

Now we'll explore how generative models can be used for classification tasks through text generation.

### Encoder-decoder Models

In [26]:
# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)


**T5 for classification**: T5 (Text-to-Text Transfer Transformer) treats all NLP tasks as text generation problems. We use `google/flan-t5-small`, an instruction-tuned T5 variant, and prompt it to generate the class label as text rather than predict class probabilities.


In [27]:
prompt = "Is the following sentence positive or negative? "
# Add a "t5" column holding the prompt plus the review text to every split
data = [ds.map(lambda example: {"t5": prompt + example["text"]}) for ds in data]


**Prompt engineering**: Adding the instruction "Is the following sentence positive or negative?" helps the model understand the task. This prompt engineering approach is crucial for getting good performance from generative models.

In [28]:
# Run inference on the test set
y_pred = []
for output in tqdm(pipe(KeyDataset(data[2], "t5")), total=len(data[2])):
    text = output[0]["generated_text"]
    # Anything other than the exact string "negative" is counted as positive
    y_pred.append(0 if text == "negative" else 1)

100%|██████████| 1066/1066 [01:04<00:00, 16.63it/s]


**Text-to-prediction conversion**: Since T5 generates text, we convert its outputs back to numerical labels for evaluation: "negative" becomes 0 and anything else becomes 1. This means malformed or unexpected outputs are also counted as positive.
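
If silently treating unexpected outputs as positive is a concern, the conversion can be made stricter. The sketch below is one possible approach, not the notebook's method: `parse_sentiment` is a hypothetical helper that normalizes case, whitespace, and trailing punctuation, and returns `None` when the text is neither label so the caller can decide what to do:

```python
def parse_sentiment(generated_text):
    """Return 0 for negative, 1 for positive, or None if the text is neither."""
    cleaned = generated_text.strip().lower().rstrip(".!")
    if cleaned == "negative":
        return 0
    if cleaned == "positive":
        return 1
    return None  # unexpected output; the caller decides how to handle it

outputs = ["negative", " Positive.", "I am not sure"]
print([parse_sentiment(text) for text in outputs])  # [0, 1, None]
```

Counting how many predictions come back as `None` is also a quick way to see how often the model strays from the expected answer format.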

In [29]:
evaluate_performance(data[2]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



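**Generative model performance**: At 84% accuracy, Flan-T5 small is on par with the embedding-based methods without any task-specific training. In this run it was also slower (about 17 reviews per second versus about 102 for the RoBERTa pipeline), and the prediction depends on the model emitting exactly the expected word, which makes generative models more flexible but potentially less reliable for straightforward classification.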

### ChatGPT for Classification

In [None]:
import openai

# Create client
client = openai.OpenAI(api_key="YOUR_KEY_HERE")

**API-based classification**: Using OpenAI's API allows us to leverage state-of-the-art models like GPT-3.5 or GPT-4, but comes with API costs and latency considerations.


In [None]:
def chatgpt_generation(prompt, document, model="gpt-3.5-turbo-0125"):
    """Generate an output based on a prompt and an input document."""
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": prompt.replace("[DOCUMENT]", document)
        }
    ]
    chat_completion = client.chat.completions.create(
        messages=messages,
        model=model,
        temperature=0
    )
    return chat_completion.choices[0].message.content

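**Chat-based prompting**: The chat format allows us to provide system instructions and user queries. Setting `temperature=0` makes the model favor its most likely tokens, so outputs are far more consistent between runs than with creative sampling, although they are not strictly guaranteed to be identical.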

In [None]:
# Define a prompt template as a base
prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""

# Predict the target using GPT
document = "unpretentious , charming , quirky , original"
chatgpt_generation(prompt, document)

**Structured prompting**: The clear, specific instructions help ensure the model returns exactly what we need (0 or 1) rather than explanatory text that would complicate parsing.
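
Even with strict instructions, a model may occasionally wrap the digit in extra words. The sketch below shows the template substitution on its own and one defensive way to read the answer. `build_prompt` and `parse_label` are hypothetical helpers introduced for illustration; the template text is copied from the cell above:

```python
import re

# Template copied from the cell above; [DOCUMENT] marks where the review goes
TEMPLATE = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""

def build_prompt(document, template=TEMPLATE):
    """Insert the review text into the template."""
    return template.replace("[DOCUMENT]", document)

def parse_label(response):
    """Return 0 or 1 if the response contains exactly one such digit, else None."""
    digits = re.findall(r"\b[01]\b", response)
    return int(digits[0]) if len(digits) == 1 else None

print(parse_label("1"), parse_label("Answer: 0"), parse_label("It could be 0 or 1"))
# 1 0 None
```

Returning `None` for ambiguous responses keeps them from being silently converted into a label or raising an error partway through a long evaluation run.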

The next step would be to run one of OpenAI's models against the entire evaluation dataset. Only run this when you have sufficient credits, as it calls the API once for each of the 1,066 records in the test set.

In [None]:
# You can skip this if you want to save your (free) credits
# `data` is a list of splits, so the test set is data[2]
predictions = [chatgpt_generation(prompt, doc) for doc in tqdm(data[2]["text"])]

In [None]:
# Extract predictions
y_pred = [int(pred) for pred in predictions]

# Evaluate performance
evaluate_performance(data[2]["label"], y_pred)

**Cost considerations**: This approach can be very effective but comes with API costs. Consider the trade-off between performance, cost, and latency when choosing between local models and API-based solutions for production use.

## Summary and Next Steps

You've now explored multiple approaches to text classification:

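1. **Task-specific transformer models** (RoBERTa) - Work out of the box with no training; 80% accuracy here, the lowest of the reported results
2. **Embedding + traditional ML** - Best reported result (85% accuracy) with only a lightweight classifier to train
3. **Prototype-based classification** - Simple and interpretable; needs labeled examples to compute class centroids but no classifier training (84% accuracy)
4. **Zero-shot classification** - Works without labeled training data
5. **Generative models** (Flan-T5) - Flexible and prompt-driven (84% accuracy), but slower and dependent on parsing the generated text
6. **API-based solutions** - Access to large hosted models, with cost and latency trade-offs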