## Text Classification with Generative Models

In [3]:
import os
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

In [4]:
import warnings
warnings.filterwarnings("ignore")

In [5]:
from datasets import load_dataset

# Load our dataset
data = load_dataset("rotten_tomatoes")
display(data)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

### Using a task-specific model

Using **task-specific models** is the most straightforward way to tackle this problem. We just need to choose a model that matches our task, download it, and plug it into a pipeline to try it on our own data.

For this example, we will use a RoBERTa model to perform text classification on our dataset. To familiarize ourselves with the `pipeline` object, we can check the official [Hugging Face documentation](https://huggingface.co/docs/transformers/v4.24.0/main_classes/pipelines) for a quick overview and examples.

In [38]:
from transformers import pipeline
import torch

model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# load model into pipeline
pipeline = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device=0 if torch.cuda.is_available() else -1,
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Next, we run inference over the test split. The pipeline returns one score for each of the model's three classes (negative, neutral, positive). Because our dataset is binary, we keep only the negative and positive scores and predict whichever is higher.

In [39]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Get predictions
y_pred = []

for output in tqdm(pipeline(KeyDataset(data["test"], "text")),
                   total=len(data["test"])):
    
    # Extract scores for negative (index 0) and positive (index 2) classes
    scores = [output[i]["score"] for i in (0, 2)]
    
    # Predict the class with the highest score
    y_pred.append(int(np.argmax(scores)))

100%|██████████| 1066/1066 [00:48<00:00, 22.07it/s]
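To make the index selection concrete, here is a minimal sketch of what happens for a single review. The `output` list is a hypothetical pipeline result with made-up scores; only the class order (negative, neutral, positive) is taken from the loop above.

```python
import numpy as np

# Hypothetical pipeline output for one review; the scores are made up.
output = [
    {"label": "negative", "score": 0.15},
    {"label": "neutral", "score": 0.60},
    {"label": "positive", "score": 0.25},
]

# Keep only the negative (index 0) and positive (index 2) scores
scores = [output[i]["score"] for i in (0, 2)]

# argmax over [negative, positive] gives 0 or 1, matching the dataset labels
prediction = int(np.argmax(scores))
print(prediction)  # 1
```

Even though "neutral" has the highest score here, the review is labeled positive because the neutral class is ignored.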


We then define an evaluation function that quantifies the model’s performance by comparing its predicted labels with the ground-truth annotations. To this end, we employ the `classification_report` utility from scikit-learn to compute standard classification metrics.

In [79]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):

    performance = classification_report(
        y_true, 
        y_pred,
        target_names=["Negative Review", "Positive Review"])
    
    return performance

In [80]:
# Evaluate performance
print(evaluate_performance(data["test"]["label"], y_pred))

                 precision    recall  f1-score   support

Negative Review       0.81      0.85      0.83       533
Positive Review       0.84      0.80      0.82       533

       accuracy                           0.82      1066
      macro avg       0.82      0.82      0.82      1066
   weighted avg       0.82      0.82      0.82      1066



The report shows a slight **negative bias** in the model. Negative reviews have higher recall (0.85) than precision (0.81): the model catches most negative reviews but sometimes flags positive ones as negative, for an F1-score of 0.83. Positive reviews show the opposite pattern, with higher precision (0.84) but lower recall (0.80): when the model predicts “positive” it is usually correct, yet it misses about a fifth of the true positive reviews, for an F1-score of 0.82. Accuracy and the macro and weighted averages are all 0.82, which indicates solid, mildly asymmetric performance on a perfectly balanced test set (533 examples per class).
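As a quick sanity check, the F1-scores in the report can be recomputed from precision and recall, since F1 is their harmonic mean. The sketch below uses the rounded values from the table, so it can only be expected to agree to two decimals; `f1_from` is a helper name introduced here for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall values copied from the report above
negative_f1 = f1_from(0.81, 0.85)
positive_f1 = f1_from(0.84, 0.80)

print(f"{negative_f1:.2f} {positive_f1:.2f}")  # 0.83 0.82
```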

### Performing classification with embedding representations

#### Supervised classification with embeddings

Rather than fine-tuning a task-specific model, we adopt an **embedding-based approach** for supervised classification. We first compute fixed embeddings for all texts using a `sentence-transformers` model and then train a downstream classifier on these features, obviating the need to update the embedding model. This strategy is widely used for its strong performance and efficiency on text classification tasks.

In [None]:
#os.environ["OMP_NUM_THREADS"] = "2"
#os.environ["MKL_NUM_THREADS"] = "2"

In [None]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Function to encode a list of texts
def encode_texts(texts, batch_size=8):
    return model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=True
    )

# Encode train and test sets
train_embeddings = encode_texts(data["train"]["text"])
test_embeddings = encode_texts(data["test"]["text"])

Batches:   0%|          | 0/1067 [00:00<?, ?it/s]

Batches:   0%|          | 0/134 [00:00<?, ?it/s]

In [16]:
# Example usage
sentences = ["This is an example sentence", "Each sentence is converted"]

In [17]:
# Get embeddings
embeddings = model.encode(sentences)
print(embeddings)

[[ 0.02250258 -0.07829181 -0.02303076 ... -0.00827928  0.02652692
  -0.00201897]
 [ 0.04170238  0.0010974  -0.01553418 ... -0.02181626 -0.0635936
  -0.00875283]]


In [18]:
train_embeddings.shape

(8530, 768)

Now we train a simple logistic regression classifier on top of the embeddings.

In [19]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
log_reg = LogisticRegression(random_state=42)
log_reg.fit(train_embeddings, data["train"]["label"])

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | 42 |
| solver | 'lbfgs' |
| max_iter | 100 |


In [20]:
# Predict previously unseen instances
y_pred = log_reg.predict(test_embeddings)

In [21]:
# Evaluate performance
print(evaluate_performance(data["test"]["label"], y_pred))

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



The model now shows **balanced** and **consistent performance** across both classes. Negative reviews are predicted with precision 0.85 and recall 0.86, so the model is usually right when it predicts a negative review and recovers most true negatives, for an F1-score of 0.85. Positive reviews mirror this behavior with precision 0.86 and recall 0.85, again yielding an F1-score of 0.85, so there is no clear bias toward either class. Overall accuracy is 0.85 on a perfectly balanced test set (533 examples per class), and the macro and weighted averages are identical (0.85), confirming stable performance across both labels.

This demonstrates that it is possible to train a lightweight classifier while keeping the **embedding model frozen**.

#### Unsupervised classification with embeddings

What if no classifier were used at all? Instead, one could compute the mean embedding for each class and then use cosine similarity to assign each document to the class whose average embedding is most similar to its representation.

In [23]:
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

embedding_dim = train_embeddings.shape[1]

# Average the embeddings of all documents in each target label
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(embedding_dim).mean().values

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [24]:
# Evaluate performance
print(evaluate_performance(data["test"]["label"], y_pred))

                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



The model achieves **balanced performance** with this classifier-free approach, with both classes around 0.84 on precision, recall, and F1-score. Negative reviews have precision 0.85 and recall 0.84, while positive reviews have precision 0.84 and recall 0.85, indicating no strong bias toward either class and very similar error behavior across labels. Overall accuracy is 0.84 on a perfectly balanced test set (533 examples per class), and the macro and weighted averages are identical (0.84), so assigning each document to the nearest class mean comes within one point of the logistic regression (0.85) without training any classifier. Note that the method is not strictly unsupervised: the training labels are still needed to compute the class means.
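The same nearest-class-mean idea can be written without the DataFrame detour. The following is a minimal NumPy sketch on toy two-dimensional vectors; `nearest_centroid_predict` and the toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def nearest_centroid_predict(train_X, train_y, test_X):
    """Assign each test vector to the class whose mean embedding is most similar."""
    classes = np.unique(train_y)
    # One mean embedding (centroid) per class
    centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    # Cosine similarity = dot product of unit-length vectors
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    test_unit = test_X / np.linalg.norm(test_X, axis=1, keepdims=True)
    sims = test_unit @ centroids.T
    return classes[np.argmax(sims, axis=1)]

# Toy data: class 0 points along the first axis, class 1 along the second
train_X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
train_y = np.array([0, 0, 1, 1])
test_X = np.array([[1.0, 0.0], [0.0, 1.0]])

preds = nearest_centroid_predict(train_X, train_y, test_X)
print(preds)  # [0 1]
```

With the real data, `train_X` would be `train_embeddings`, `train_y` the training labels as a NumPy array, and `test_X` would be `test_embeddings`.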

#### Zero-shot classification

Zero-shot classification refers to the setting where a model assigns text to categories it has never been explicitly trained on, by exploiting the semantic relationship between the input and natural-language descriptions of the candidate labels. Here we assume that no labeled data is available, so we describe each label in plain language and let the embedding model decide which description best matches each input text.

In [25]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review",  "A positive review"])

To assign a label to each document, we compute the cosine similarity between the document embedding and each label embedding, and then select the label with the highest similarity score.

In [26]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)
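One detail worth checking: `encode_texts` normalizes the document embeddings, while the label embeddings come from a plain `model.encode` call. Cosine similarity divides by the vector norms, so the scale of either input should not change the result. The sketch below illustrates this on toy vectors; `cosine_sim_matrix` is a helper written here for illustration and is only assumed to mirror what `cosine_similarity` computes.

```python
import numpy as np

def cosine_sim_matrix(A, B):
    """Pairwise cosine similarity between the rows of A and the rows of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T

docs = np.array([[0.6, 0.8], [0.8, -0.6]])    # already unit length
labels = np.array([[3.0, 4.0], [4.0, -3.0]])  # same directions, length 5

sims = cosine_sim_matrix(docs, labels)
preds = sims.argmax(axis=1)

print(np.round(sims, 2))
print(preds)  # [0 1]
```

Rescaling `labels` by any positive factor leaves `sims` unchanged, which is why mixing normalized and unnormalized embeddings is harmless here.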

In [27]:
# Evaluate performance
print(evaluate_performance(data["test"]["label"], y_pred))

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



In the zero-shot setting, the model achieves **reasonably balanced performance**, with both classes around 0.78 on precision, recall, and F1-score. Negative reviews have precision 0.78 and recall 0.77, while positive reviews have precision 0.77 and recall 0.79, indicating similar error patterns and no strong bias toward one class. Overall accuracy is 0.78 on a balanced test set (533 examples per class), and the macro and weighted averages are identical (0.78). Even without any task-specific training, the zero-shot classifier reaches solid performance, although it remains below the logistic-regression and class-mean approaches above.

This is a perfect illustration of how powerful and useful **embeddings** can be as a tool in practice.

### Text classification with generative models

Generative language models such as GPT approach classification differently from traditional discriminative models. They operate as sequence-to-sequence systems: given text as input, they generate text as output, which can be formatted to represent labels or rationales.

To obtain reliable classifications, the model must be guided with clear task context and instructions. This guidance is provided through carefully designed prompts that define the label set, specify output format, and include examples or constraints when needed.
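As an illustration of these three ingredients (label set, output format, optional examples), the sketch below assembles a list of chat messages. The function name `build_messages`, the prompt wording, and the example reviews are all assumptions made up for this illustration; they are not the prompt used later in this notebook.

```python
def build_messages(review, examples=()):
    """Assemble chat messages for sentiment classification (illustrative only)."""
    system_prompt = (
        "You are a sentiment classifier for movie reviews. "   # task context
        "Allowed labels: negative, positive. "                 # label set
        "Respond with the label only."                         # output format
    )
    messages = [{"role": "system", "content": system_prompt}]
    # Optional few-shot examples: each is a (review text, label) pair
    for text, label in examples:
        messages.append({"role": "user", "content": f"Review: {text}"})
        messages.append({"role": "assistant", "content": label})
    # The review we actually want classified goes last
    messages.append({"role": "user", "content": f"Review: {review}"})
    return messages

# Made-up few-shot examples
few_shot = [
    ("a waste of two hours .", "negative"),
    ("a warm , funny delight .", "positive"),
]
messages = build_messages("an uneven but charming debut .", few_shot)
print(len(messages))  # 6
```

Passing no examples yields a zero-shot prompt with just the system and user messages.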

In [33]:
from groq import Groq
from dotenv import load_dotenv

load_dotenv()

client = Groq(
    api_key=os.getenv("GROQ_API_KEY"), 
)

In [63]:
# Usage example
sample_text = data["test"]["text"][0]
print(f"Review: {sample_text}\n")

Review: lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .



In [64]:
def groq_generation(prompt, model="meta-llama/llama-4-scout-17b-16e-instruct"):

    system_prompt = (
        "You are a sentiment classifier. "
        "Rate the sentiment as a number between 0 (negative) and 1 (positive), "
        "and also as a label. "
        "Respond only in the format: `<score> <label>`"
    )

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Rate the sentiment of this movie review: {prompt}"}
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
        max_tokens=10
    )
    
    return response.choices[0].message.content


In [65]:
print(groq_generation(sample_text))

0.8 positive


In [43]:
#predictions = [groq_generation(doc) for doc in tqdm(data["test"]["text"])]

In [44]:
#y_pred = [1 if "positive" in pred.lower() else 0 for pred in predictions]  # replies look like "0.8 positive", so int(pred) would fail
#print(evaluate_performance(data["test"]["label"], y_pred))
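Because the model replies in the `<score> <label>` format (for example `0.8 positive`), the raw responses have to be parsed before they can be compared with the integer labels. The following is a minimal sketch of such a parser. The name `parse_prediction`, the 0.5 score threshold, and the choice to return `None` for unparseable replies are assumptions for illustration.

```python
def parse_prediction(response):
    """Map a `<score> <label>` reply to 0 (negative), 1 (positive), or None."""
    # Lowercase, split on whitespace, and strip stray punctuation or backticks
    parts = [p.strip(".,!`") for p in response.strip().lower().split()]

    # Prefer the explicit label when the model provides one
    if "positive" in parts:
        return 1
    if "negative" in parts:
        return 0

    # Otherwise fall back to the numeric score (assumed threshold: 0.5)
    try:
        return int(float(parts[0]) >= 0.5)
    except (IndexError, ValueError):
        return None  # unparseable reply; handle separately

print(parse_prediction("0.8 positive"))  # 1
print(parse_prediction("0.1 negative"))  # 0
print(parse_prediction("unclear"))       # None
```

Replies that come back as `None` would need to be counted or re-queried before computing the classification report.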

#### Text-to-Text Transfer Transformers

**Text-to-Text Transfer Transformers** (T5) are a family of models that cast all NLP problems into a unified text-in/text-out format. The architecture follows the original Transformer design, with stacked encoder and decoder blocks processing input and generating output sequences.

T5 reframes tasks like translation, summarization, classification, and question answering as “text → text”, which simplifies the overall pipeline and naturally supports multitask learning within a single model. The models are pre-trained on the [Colossal Clean Crawled Corpus](https://github.com/allenai/c4-documentation) using a self-supervised “span corruption” objective, where random text spans are masked and the model learns to reconstruct them, leading to strong generalization across diverse NLP tasks.
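To give a feel for span corruption, the sketch below builds one input/target pair at the word level. It is a simplified illustration: the spans are chosen by hand rather than sampled, words stand in for subword tokens, and `span_corrupt` is a helper written for this example. The `<extra_id_N>` strings follow the sentinel-token naming used by T5 tokenizers.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; the target lists the dropped spans.

    `spans` must be sorted, non-overlapping, with `end` exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # mask the span in the input
        targets.append(sentinel)             # announce the span in the target
        targets.extend(tokens[start:end])    # followed by the original words
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])

print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

During pre-training the model receives the corrupted text as input and learns to generate the target sequence.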

Since T5 always outputs text tokens (including for labels), it needs no custom classification head or task-specific architecture. Instruction-tuned variants such as Flan-T5, which we use below, are particularly well suited to zero-shot, few-shot, and instruction-style tasks.

In [83]:
from transformers import pipeline

model_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device=-1) # CPU

Device set to use cpu


In [84]:
# Prepare data for T5 text2text-generation
prompt = "Classify sentiment: Positive or Negative? "
data = data.map(lambda x: {"t5": prompt + x['text']})

print(data)

Map:   0%|          | 0/8530 [00:00<?, ? examples/s]

Map:   0%|          | 0/1066 [00:00<?, ? examples/s]

Map:   0%|          | 0/1066 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})


Since this model generates text labels, we first need to map its outputs to numeric values (0 for negative, 1 for positive) before running our evaluation.

In [85]:
# Run inference; any output other than "negative" is counted as positive (1)
y_pred = [
    0 if output[0]["generated_text"].strip().lower() == "negative" else 1
    for output in tqdm(model_pipeline(KeyDataset(data["test"], "t5")), 
                       total=len(data["test"]))
]

100%|██████████| 1066/1066 [01:54<00:00,  9.29it/s]


In [87]:
# Evaluate performance
print(evaluate_performance(data["test"]["label"], y_pred))

                 precision    recall  f1-score   support

Negative Review       0.78      0.91      0.84       533
Positive Review       0.89      0.74      0.81       533

       accuracy                           0.83      1066
      macro avg       0.84      0.83      0.83      1066
   weighted avg       0.84      0.83      0.83      1066



The model shows a clear **asymmetry** in behavior between negative and positive reviews. Negative reviews have moderate precision (0.78) but very high recall (0.91), meaning the model catches most negative reviews but sometimes incorrectly labels positive ones as negative, leading to an F1-score of 0.84. Positive reviews display the opposite trade-off, with high precision (0.89) but lower recall (0.74), so predicted positives are usually correct, yet the model misses more than a quarter of true positive reviews, for an F1-score of 0.81. Overall accuracy is 0.83 on a balanced dataset (533 examples per class), and macro and weighted averages around 0.83 confirm solid yet slightly negatively biased performance.