# Text classification with Generative Models

Generative models are everywhere now, and it's tempting to use them for classification tasks. But are they good at it, and how can we measure the success of a classification model? Let's find out 🤔

For this example we will use the rotten_tomatoes dataset. It contains 10,662 movie reviews (8,530 for training, 1,066 for validation and 1,066 for testing), each with its corresponding sentiment (positive or negative).

In [1]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

Generating train split: 100%|██████████| 8530/8530 [00:00<00:00, 641264.22 examples/s]
Generating validation split: 100%|██████████| 1066/1066 [00:00<00:00, 374027.78 examples/s]
Generating test split: 100%|██████████| 1066/1066 [00:00<00:00, 455540.30 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

## Using a task-specific model
Using a task-specific model is the easiest way to solve our problem: we just need to find a model that fits our needs, download it, and use it in a pipeline to test it on our data.

For this example we will use a RoBERTa model fine-tuned for sentiment analysis to classify our data.

We will use a pipeline object. If you are not familiar with it, read the official Transformers documentation.



In [4]:
from transformers import pipeline
import torch

model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


Let's run an inference loop to get the predictions for our dataset.

In [5]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
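    # This model has three classes: negative (0), neutral (1), positive (2).
    # Our dataset is binary, so we ignore the neutral score.
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]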
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)


100%|██████████| 1066/1066 [00:22<00:00, 48.26it/s]
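
The loop above picks the negative and positive scores by position. A slightly more defensive variant, sketched below, looks the scores up by label name instead. The helper name `binary_sentiment` and the sample scores are made up for illustration, and the label names `negative`, `neutral` and `positive` are an assumption about how the model reports its classes.

```python
def binary_sentiment(scores):
    """Map a list of {"label", "score"} dicts to 0 (negative) or 1 (positive).

    The neutral class is ignored, mirroring the positional version.
    """
    by_label = {item["label"].lower(): item["score"] for item in scores}
    return int(by_label["positive"] > by_label["negative"])


# Hypothetical output for a single review (scores invented for the example)
sample_output = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.55},
    {"label": "positive", "score": 0.35},
]
prediction = binary_sentiment(sample_output)
print(prediction)
```

Looking up by name keeps the mapping correct even if the pipeline returns the scores in a different order, for example sorted by score.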


## Evaluation
Next, we define a function that evaluates how well the model performed by comparing its predictions to the actual labels. For this we will use classification_report from sklearn.

In [6]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

In [7]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066
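
To make the columns of the report concrete, here is a small sketch that computes precision, recall and F1 for one class by counting true positives, false positives and false negatives. The helper name `per_class_metrics` and the toy labels are invented for this example; in practice sklearn does this for us.

```python
def per_class_metrics(y_true, y_pred, target_class):
    """Compute precision, recall and F1 for one class from raw predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target_class and p == target_class)
    fp = sum(1 for t, p in pairs if t != target_class and p == target_class)
    fn = sum(1 for t, p in pairs if t == target_class and p != target_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels invented for the example: 0 = negative, 1 = positive
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 0, 0, 1, 1]

for cls, name in enumerate(["Negative Review", "Positive Review"]):
    precision, recall, f1 = per_class_metrics(toy_true, toy_pred, cls)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "of everything predicted as this class, how much was right?", recall answers "of everything that truly is this class, how much did we find?", and F1 is their harmonic mean.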



# Classification tasks with embeddings
Now let's see how we can use embeddings to classify our data.

What happens if we cannot find a model that perfectly fits our needs?

Then we need to fine-tune a model for our specific task, but that can be long, hard and costly... 😅

So what's the solution?

## Use embeddings!

## Supervised classification with embeddings
Instead of using a model pre-trained for our specific task, we will use an embedding model for feature generation. Those features are then used to train a classifier. This method is called supervised classification with embeddings: the embedding model stays frozen (no fine-tuning), and we only train a lightweight classifier on the labeled features 🧙

For this example we will use a sentence-transformers model to generate embeddings for our data. These models are very popular and perform well on this kind of task.

In [8]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

Batches: 100%|██████████| 267/267 [00:33<00:00,  7.89it/s]
Batches: 100%|██████████| 34/34 [00:04<00:00,  8.33it/s]


In [9]:
# Quick look at what embeddings are: each sentence becomes a fixed-size vector
sentences = ["This is an example sentence", "Each sentence is converted"]

# Reuse the model loaded in the previous cell instead of loading it again
embeddings = model.encode(sentences)
print(embeddings)


[[ 0.0225026  -0.07829174 -0.02303072 ... -0.00827929  0.02652687
  -0.00201898]
 [ 0.04170236  0.0010974  -0.01553416 ... -0.02181631 -0.06359357
  -0.00875285]]


In [10]:
train_embeddings.shape

(8530, 768)

Now let's train a very simple logistic regression on our embeddings 🤓

In [11]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

LogisticRegression(random_state=42)

Parameters:
    penalty            'l2'
    dual               False
    tol                0.0001
    C                  1.0
    fit_intercept      True
    intercept_scaling  1
    class_weight       None
    random_state       42
    solver             'lbfgs'
    max_iter           100


In [12]:
# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



That's a better F1 score than before: 0.85 versus 0.80 for the task-specific model!

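## What if we do not train a classifier: averaging embeddings per class
What would happen if we did not use a classifier at all? Instead, we can average the training embeddings per class and apply cosine similarity to predict which class matches each document best. Note that this still uses the training labels to compute the class averages; only the trained classifier is removed 🧙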

In [13]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

# Average the embeddings of all documents in each target label
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(768).mean().values

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
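
The averaging step above can also be written with boolean masks instead of a DataFrame, which avoids relying on the label sitting in column 768. The sketch below shows the idea on toy 2-D vectors; the function name `nearest_centroid_predict` and the data are invented for illustration, and cosine similarity is computed by hand as the dot product of L2-normalized vectors.

```python
import numpy as np


def nearest_centroid_predict(train_vectors, train_labels, test_vectors):
    """Assign each test vector to the class whose mean training vector is most similar."""
    train_vectors = np.asarray(train_vectors, dtype=float)
    train_labels = np.asarray(train_labels)
    test_vectors = np.asarray(test_vectors, dtype=float)

    # One averaged vector (centroid) per class, in sorted label order
    classes = np.unique(train_labels)
    centroids = np.stack(
        [train_vectors[train_labels == c].mean(axis=0) for c in classes]
    )

    # Cosine similarity = dot product of L2-normalized vectors
    test_unit = test_vectors / np.linalg.norm(test_vectors, axis=1, keepdims=True)
    centroid_unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    similarities = test_unit @ centroid_unit.T
    return classes[np.argmax(similarities, axis=1)]


# Toy 2-D "embeddings" invented for the example
toy_train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
toy_test = [[0.8, 0.2], [0.2, 0.7]]
toy_pred = nearest_centroid_predict(toy_train, toy_labels, toy_test)
print(toy_pred)
```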



### Zero-shot classification
Zero-shot classification is when a model classifies text into categories it has never been explicitly trained on, simply by relying on the semantic relationship between the input text and candidate label descriptions.

In our case we assume we have no labeled data, and we try to predict the labels of the input text even though the model was not trained on them 🥷

To perform zero-shot classification with embeddings, there is a little trick that we can use: we describe our labels based on what they should represent. For example, a negative label for movie reviews can be described as "This is a negative movie review." By describing and embedding both the labels and the documents, we have data that we can work with.
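
To see what "matching" a document to a label description means, here is a minimal sketch of cosine similarity written out by hand. The 3-D vectors are invented stand-ins for real embeddings, and the names `cosine`, `label_vectors` and `review_vector` are hypothetical; real embeddings from the model used in this notebook have 768 dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for the embedded label descriptions
label_vectors = {
    "negative": [0.9, 0.1, 0.0],
    "positive": [0.1, 0.9, 0.0],
}
# Toy vector standing in for one embedded review
review_vector = [0.2, 0.7, 0.1]

# The predicted label is the description whose vector points in the most similar direction
scores = {label: cosine(review_vector, vec) for label, vec in label_vectors.items()}
best_label = max(scores, key=scores.get)
print(best_label)
```

Because cosine similarity only depends on direction, the length of the review or of the label description does not matter, only how close their meanings sit in the embedding space.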

In [14]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review",  "A positive review"])

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [16]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



We get an F1 score of 0.78, which is pretty impressive considering we haven't used any labeled data.

### Text classification with generative models
Generative language models like OpenAI's GPT differ fundamentally in their approach to classification compared to traditional methods.

Rather than following conventional classification paradigms, these models function as sequence-to-sequence systems. In short: they receive text as input and produce text as output.



While these generative models undergo training across diverse tasks, they typically cannot handle specialized use cases immediately. Consider feeding a movie review to such a model without additional guidance: the model would lack direction on how to process it.

To achieve meaningful results, we must provide context and steer the model toward our desired outcomes. This guidance occurs primarily through carefully crafted instructions, known as prompts 😎

For our demo we will use the Groq API, because OpenAI does not offer free API keys 😅

In [None]:
# Note: Set your GROQ_API_KEY as an environment variable or in a .env file
# Example: export GROQ_API_KEY="your_key_here"
# Or use: load_dotenv() to load from .env file

In [18]:
sample_text = data["test"]["text"][0]
print(f"Review: {sample_text}\n")


Review: lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .



In [21]:
import os
from groq import Groq
from dotenv import load_dotenv

load_dotenv()

client = Groq(
    api_key=os.getenv("GROQ_API_KEY"), 
)

chat_completion = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a sentiment classifier. Respond with only 'positive' or 'negative'."
        },
        {
            "role": "user",
            "content": f"Classify the sentiment of this movie review: {sample_text}"
        }
    ],
    temperature=0,
    max_tokens=10

)
print(chat_completion.choices[0].message.content)

positive


In [22]:

def groq_generation(prompt, model="meta-llama/llama-4-scout-17b-16e-instruct"):
    """Ask the model to rate a review's sentiment; returns the raw reply text."""
    messages = [
        {
            "role": "system",
            "content": "You are a sentiment classifier. Rate the sentiment as a number between 0 (negative) and 1 (positive). Respond with only the number."
        },
        {
            "role": "user",
            "content": f"Rate the sentiment of this movie review: {prompt}"
        }
    ]
    chat_completion = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
        max_tokens=10
    )
    return chat_completion.choices[0].message.content

In [23]:
groq_generation(sample_text)

'0.8'
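
A generative model returns free text, so before we can compare its answers with the dataset labels we need to turn each reply into 0 or 1. The helper below is a hypothetical sketch (the name `parse_sentiment` and the 0.5 threshold are assumptions, not part of the Groq API): it accepts either a word reply such as 'positive' or a numeric reply such as '0.8', and returns None when the reply cannot be interpreted.

```python
def parse_sentiment(reply, threshold=0.5):
    """Turn a model reply into 0 (negative), 1 (positive) or None if unparseable."""
    # Drop surrounding whitespace, quotes and a trailing period, then lowercase
    cleaned = reply.strip().strip("'\"").rstrip(".").lower()
    if cleaned in ("negative", "positive"):
        return int(cleaned == "positive")
    try:
        score = float(cleaned)
    except ValueError:
        return None
    # Reject values outside the requested 0..1 range (this also rejects NaN)
    if not 0.0 <= score <= 1.0:
        return None
    return int(score >= threshold)


print(parse_sentiment("0.8"))
print(parse_sentiment("positive"))
print(parse_sentiment("I am not sure"))
```

Returning None for unexpected replies makes failures visible, rather than silently counting them as one of the two classes.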

### Text-to-Text Transfer
Let's explore a final technique: Text-to-Text Transfer Transformer, or T5, models. 👀 The architecture is similar to the original Transformer, with encoder and decoder parts stacked together.

T5 reframes every common NLP task, such as translation, summarization, classification and question answering, as input text → output text, which simplifies model design and enables multitask learning.

T5 was trained on the Colossal Clean Crawled Corpus, with a self-supervised objective called span corruption, giving it strong generalization across NLP tasks.
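
To give an intuition for span corruption, here is a simplified sketch: some spans of the input are replaced by sentinel tokens, and the target is the sequence of dropped spans, each introduced by its sentinel. This is an illustration only. The function name `span_corrupt` is made up, the spans are chosen by hand rather than sampled at random, and it works on whitespace-split words instead of real tokenizer tokens; the `<extra_id_N>` sentinel naming follows the convention of the Hugging Face T5 tokenizer.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the matching target.

    `spans` must be sorted and non-overlapping; `end` is exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # hide the span in the input
        targets.append(sentinel)             # announce the span in the target
        targets.extend(tokens[start:end])    # the model must reproduce it
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(corrupted)
print(target)
```

The model reads the corrupted text and learns to generate the target, which is already a text-to-text task and needs no labels.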

Because T5 generates text tokens for answers and labels, it needs no task-specific heads or architectures. Instruction-tuned variants such as Flan-T5, which we use below, handle zero-shot, few-shot and instruction-based tasks well 😎

In [26]:
# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small"
)

Device set to use mps:0


In [27]:
# Prepare our data
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})
data

Map: 100%|██████████| 8530/8530 [00:00<00:00, 62881.57 examples/s]
Map: 100%|██████████| 1066/1066 [00:00<00:00, 71171.37 examples/s]
Map: 100%|██████████| 1066/1066 [00:00<00:00, 68534.59 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

In [29]:
# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
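    # Normalize the generated text so stray whitespace or capitalization does not flip the label
    text = output[0]["generated_text"].strip().lower()
    y_pred.append(0 if text == "negative" else 1)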

100%|██████████| 1066/1066 [01:18<00:00, 13.60it/s]


In [30]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066

