#Daily Challenge: Fine-Tuning GPT-2 for SMS Spam Classification (Legacy transformers API)

In this daily challenge, you’ll fine-tune a pre-trained GPT-2 model to classify SMS messages as spam or ham (not spam). We’ll work through loading the dataset, inspecting its schema, tokenizing examples, adapting to an older transformers version, and running training and evaluation with the classic do_train/do_eval flags.


👩‍🏫 👩🏿‍🏫 What You’ll learn
How to load and explore a custom text-classification dataset
Inspecting and aligning column names for tokenization
Tokenizing text for GPT-2 (with its peculiar padding setup)
Initializing GPT2ForSequenceClassification
Defining and computing multiple evaluation metrics
Configuring TrainingArguments for transformers < 4.4 (using do_train, eval_steps, etc.)
Running fine-tuning with Trainer and interpreting results
Common pitfalls when using legacy APIs

🛠️ What you will create
By the end of this challenge, you will have built:

A tokenized SMS dataset compatible with GPT-2’s requirements, including custom padding and truncation.
A fine-tuned GPT2ForSequenceClassification model that can accurately label incoming SMS messages as spam or ham.
A complete training pipeline using the legacy do_train/do_eval flags in TrainingArguments, with periodic checkpointing, logging, and evaluation.
A set of evaluation metrics (accuracy, precision, recall, F1) computed at each validation step and summarized after training.
A reusable Jupyter notebook that ties everything together—from dataset loading and inspection, through model initialization and tokenization, to training, evaluation, and results interpretation.


💼 Prerequisites
Python 3.7+
Installed packages: datasets, evaluate, transformers>=4.0.0,<4.4.0
Basic familiarity with Hugging Face’s datasets and transformers libraries
GitHub or Colab access for executing the notebook
A Hugging Face API token and a Weights & Biases (wandb) API key

Task
We will guide you through fine-tuning a GPT-2 model to classify SMS messages as spam or ham using an older version of transformers (<4.4). Follow the steps below and complete each “TODO” in the code.

1. Setup : Install required packages datasets, evaluate and transformers[sentencepiece].

%pip install --quiet datasets evaluate transformers[sentencepiece]

2. Load & Inspect Dataset :

from datasets import TODO  # import load_dataset
TODO  # import pandas

# Load the UCI SMS Spam dataset (sms_spam) from Hugging Face hub
raw = TODO

# We'll use 4,000 for train, 1,000 for validation
train_ds = TODO
val_ds   = TODO

TODO  # print the features of the train dataset. It should show 'sms' and 'label'

3. Tokenization :

from transformers import TODO  # import GPT2Tokenizer

model_name = TODO  # name of the checkpoint to load; we will use GPT-2
tokenizer  = TODO  # load the tokenizer
# GPT-2 has no pad token by default, so set it to eos
tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(examples):
    # returns input_ids, attention_mask; keep max_length small for SMS
    return tokenizer(
        examples["sms"],
        padding="max_length",
        truncation=True,
        max_length=64
    )

train_tok = TODO #apply the tokenization by loading the subset using .map function
val_tok   = TODO #apply the tokenization by loading the subset using .map function



4. Model Initialization

import torch
TODO  #import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained( # Load GPT-2 with sequence classification head
    model_name,
    num_labels=TODO,           # spam vs. ham
    pad_token_id=tokenizer.eos_token_id
)

5. Metrics Definition

import evaluate
import numpy as np

accuracy  = evaluate.load("accuracy")
precision = TODO  # same call as for accuracy, but load "precision"
recall    = TODO  # same call as for accuracy, but load "recall"
f1        = TODO  # same call as for accuracy, but load "f1"

def compute_metrics(pred):
    logits, labels = pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy":  accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "precision": TODO,  # same pattern as accuracy, but for precision
        "recall":    TODO,  # same pattern as accuracy, but for recall
        "f1":        TODO   # same pattern as accuracy, but for F1
    }


In an imbalanced dataset like SMS spam (often more “ham” than “spam”), why is it important to track precision and recall alongside accuracy?
How would you interpret a model that achieves high accuracy but low recall on the spam class?

6. TrainingArguments Configuration

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=TODO,
    do_train=True,                 # turn on training
    do_eval=True,                  # turn on evaluation
    eval_steps=TODO,                # run .evaluate() every 500 steps
    save_steps=TODO,                # save a checkpoint every 500 steps
    logging_dir="./logs",
    logging_steps=TODO,             # log metrics every 500 steps

    per_device_train_batch_size=TODO,
    per_device_eval_batch_size=TODO,
    num_train_epochs=TODO,
    learning_rate=TODO,
    weight_decay=TODO,

    report_to=None,                # None keeps the default integrations (e.g. wandb) in many releases; pass "none" to disable them
    save_total_limit=1,            # only keep last checkpoint
)

What effect does weight_decay have during fine-tuning? When might you choose a higher or lower value?

7. Train & Evaluate

# Train
from transformers import Trainer
# have your wandb API key ready to paste when prompted
trainer = Trainer(
    model=TODO,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    compute_metrics=compute_metrics,
)
trainer.train()

# Evaluate
metrics = TODO
print(metrics)
# Expect something like: {"eval_loss": ..., "eval_accuracy": 0.98, ...}


#1. Setup

Install required packages: datasets, evaluate and transformers[sentencepiece].

In [None]:
!pip install --quiet datasets evaluate transformers[sentencepiece]
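The install command above is unpinned, while the challenge targets transformers >=4.0.0,<4.4.0, so it can help to check which release actually got installed. The sketch below is only an illustration: `parse_version` and `is_legacy_transformers` are hypothetical helper names, not part of any library, and the parsing is deliberately crude (it looks only at the leading digits of the first three version components).

```python
import re

def parse_version(version):
    """Turn a string such as '4.3.3' or '4.4.0.dev0' into a (major, minor, patch) tuple."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:  # pad short versions such as '4.0'
        parts.append(0)
    return tuple(parts)

def is_legacy_transformers(version):
    """True if the version falls in the range this challenge targets (>=4.0.0,<4.4.0)."""
    return (4, 0, 0) <= parse_version(version) < (4, 4, 0)

print(is_legacy_transformers("4.3.3"))   # inside the targeted range
print(is_legacy_transformers("4.40.1"))  # a much newer release
```

In the notebook you would pass `transformers.__version__` after `import transformers`.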

#2. Load & Inspect Dataset

In [None]:
import pandas as pd
from datasets import load_dataset
import datasets
# Load the UCI SMS Spam dataset (sms_spam) from Hugging Face hub
raw = load_dataset("sms_spam")

The Hugging Face dataset "sms_spam" is now downloaded and loaded. It contains SMS messages to be classified as spam or ham (not spam).

In [None]:
print(raw)

DatasetDict({
    train: Dataset({
        features: ['sms', 'label'],
        num_rows: 5574
    })
})


The dataset: raw is a DatasetDict (a dictionary of splits) with a single train split of 5574 rows; its features are ['sms', 'label'].

In [None]:
# The challenge suggests 4,000 for train and 1,000 for validation (GPU values); reduced here to 500 and 100 because the process was too slow
train_ds = raw["train"].select(range(500))           # first 500 rows for training
val_ds   = raw["train"].select(range(500, 600))      # next 100 rows for validation
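Because both subsets are taken as consecutive rows rather than a shuffled sample, it is worth checking how many spam and ham messages each one contains. Here is a minimal sketch using only the standard library; `label_share` is a hypothetical helper and the toy labels are made up for illustration (they are not the real dataset).

```python
from collections import Counter

def label_share(labels):
    """Return the fraction of examples carrying each label value."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}

# Toy labels for illustration only: 8 x label 0 and 2 x label 1.
toy_labels = [0] * 8 + [1] * 2
print(label_share(toy_labels))
```

In the notebook, something like `label_share(train_ds["label"])` and `label_share(val_ds["label"])` should give the real proportions. If they differ a lot, shuffling before selecting (for example with `raw["train"].shuffle(seed=42)`) is one option.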

In [None]:
print(train_ds)

Dataset({
    features: ['sms', 'label'],
    num_rows: 500
})


In [None]:
print(val_ds)

Dataset({
    features: ['sms', 'label'],
    num_rows: 100
})


#3. Tokenization

In [None]:
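from transformers import GPT2Tokenizer  # needed for the tokenizer below

model_name = "gpt2"  # checkpoint name; we will use GPT-2
tokenizer  = GPT2Tokenizer.from_pretrained(model_name)  # load the tokenizer
# GPT-2 has no pad token by default, so reuse eos as the pad token
tokenizer.pad_token = tokenizer.eos_token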

def tokenize_fn(examples):
    # returns input_ids, attention_mask; keep max_length small for SMS
    return tokenizer(
        examples["sms"],
        padding="max_length",
        truncation=True,
        max_length=32 #64 if GPU
    )

# Apply the tokenization
train_tok = train_ds.map(tokenize_fn, batched=True)
val_tok   = val_ds.map(tokenize_fn, batched=True)
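To make the effect of padding="max_length" and truncation=True concrete, here is a toy re-implementation in plain Python. It is a conceptual sketch, not the tokenizer's actual code: `pad_or_truncate` is a hypothetical helper, the input ids are made up, and it assumes right-side padding and truncation. The pad id 50256 is the id of GPT-2's end-of-text token, which the cell above reuses as the pad token.

```python
GPT2_EOS_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary

def pad_or_truncate(token_ids, max_length, pad_id=GPT2_EOS_ID):
    """Cut the sequence to max_length, then right-pad it up to max_length."""
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

short = pad_or_truncate([10, 11, 12], max_length=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short)
print(long_)
```

Every example ends up with exactly max_length entries, which is why a small value (32 or 64) keeps training on short SMS messages cheap.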


#4. Model Initialization

In [None]:
import torch
from transformers import GPT2ForSequenceClassification  # ✅ import the model class

model = GPT2ForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # ✅ Two classes: spam vs ham
    pad_token_id=tokenizer.eos_token_id
)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
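Why pass pad_token_id to the model? GPT2ForSequenceClassification classifies from the last token of each sequence, so in a padded batch it needs to know where the real text ends. The sketch below illustrates that idea in plain Python; it is not the library's implementation, `last_real_token_index` is a hypothetical helper, and the token ids are made up.

```python
PAD_ID = 50256  # GPT-2's eos id, reused as the pad id in this notebook

def last_real_token_index(batch, pad_id=PAD_ID):
    """For each row, return the position of the last token that is not padding."""
    return [sum(1 for token in row if token != pad_id) - 1 for row in batch]

batch = [
    [5, 6, 7, PAD_ID, PAD_ID],       # 3 real tokens -> index 2
    [8, 9, PAD_ID, PAD_ID, PAD_ID],  # 2 real tokens -> index 1
]
print(last_real_token_index(batch))
```

One caveat of reusing eos as the pad token: a genuine eos inside the text would be counted as padding by this kind of logic.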


#5. Metrics Definition

In [None]:
import evaluate
import numpy as np

accuracy  = evaluate.load("accuracy")
precision =  evaluate.load("precision")
recall    = evaluate.load("recall")
f1        = evaluate.load("f1")

def compute_metrics(pred):
    logits, labels = pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy":  accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "precision": precision.compute(predictions=preds, references=labels)["precision"],
        "recall":    recall.compute(predictions=preds, references=labels)["recall"],
        "f1":        f1.compute(predictions=preds, references=labels)["f1"],
    }
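To see what compute_metrics receives, here is a tiny stand-alone sketch with made-up logits for three messages. The argmax step is the same one used above; the assumption that column 1 corresponds to "spam" and column 0 to "ham" is mine and should be checked against the dataset's label names.

```python
import numpy as np

# Made-up logits: one row per message, one column per class (assumed 0 = ham, 1 = spam).
logits = np.array([
    [2.1, -0.3],
    [0.2, 1.7],
    [1.0, 0.9],
])

# Pick the class with the highest score in each row.
preds = np.argmax(logits, axis=-1)
print(preds)
```

By default the precision, recall and f1 metrics from evaluate score the class labelled 1, so under the assumption above the reported values describe the spam class.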


- In an imbalanced dataset like SMS spam (often more “ham” than “spam”), why is it important to track precision and recall alongside accuracy?
- How would you interpret a model that achieves high accuracy but low recall on the spam class?

Why track precision and recall alongside accuracy?

In imbalanced datasets, where one class (like "ham") appears much more frequently than the other (like "spam"), accuracy can be misleading.

Why is accuracy not enough in imbalanced datasets?
- In the SMS spam dataset, there are many more "ham" messages than "spam", so the dataset is imbalanced.

📉 Problem:
If a model always predicts "ham", it still gets high accuracy, yet it catches no spam at all, which defeats the purpose of a spam filter.

✅ Why Use Precision & Recall?
Precision (spam):
Of all messages the model said were spam, how many really were spam?

Recall (spam):
Of all real spam messages, how many did the model catch?

Example:
If your model has:
- High accuracy, but
- Low recall for spam → it misses most spam, so it’s not useful as a spam filter.

In Short:
- Accuracy alone is not enough on imbalanced data.
- You need high recall to catch spam.
- Precision helps avoid false alarms (ham flagged as spam).
- F1 is the harmonic mean of precision and recall, so it balances both.
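The "always predict ham" problem is easy to reproduce. The sketch below uses a made-up validation set of 87 ham and 13 spam messages (hypothetical numbers chosen for illustration, not the notebook's real split) and a hypothetical helper `accuracy_and_recall`.

```python
def accuracy_and_recall(labels, preds, positive=1):
    """Accuracy over all examples, and recall for the positive (spam) class."""
    correct = sum(1 for l, p in zip(labels, preds) if l == p)
    true_pos = sum(1 for l, p in zip(labels, preds) if l == positive and p == positive)
    positives = sum(1 for l in labels if l == positive)
    recall = true_pos / positives if positives else 0.0
    return correct / len(labels), recall

labels = [0] * 87 + [1] * 13   # toy data: 0 = ham, 1 = spam
always_ham = [0] * len(labels)  # a "model" that never flags anything

baseline_accuracy, baseline_recall = accuracy_and_recall(labels, always_ham)
print(baseline_accuracy, baseline_recall)
```

The baseline reaches 0.87 accuracy while catching none of the spam, which is why recall has to be read next to accuracy.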

#6. TrainingArguments Configuration

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./gpt2-sms-spam",
    do_train=True,                 # turn on training
    do_eval=True,                  # turn on evaluation
    eval_steps=500,                # run .evaluate() every 500 steps
    save_steps=500,                # save a checkpoint every 500 steps
    logging_dir="./logs",
    logging_steps=500,             # log metrics every 500 steps

    per_device_train_batch_size=2, #8 if GPU
    per_device_eval_batch_size=2, #8 if GPU
    num_train_epochs=1, # 3 if GPU
    learning_rate=5e-5,
    weight_decay=0.01,

    report_to=None,                # None keeps the default integrations (e.g. wandb) in many releases; pass "none" to disable them
    save_total_limit=1,            # only keep last checkpoint
)
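Step-based intervals such as logging_steps=500 and save_steps=500 only fire if the run actually reaches that many steps. A quick back-of-the-envelope check, assuming a single device and no gradient accumulation (`total_steps` and `times_interval_fires` are hypothetical helpers):

```python
import math

def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps in a run, assuming one device and no gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs

def times_interval_fires(total, interval):
    """How many multiples of `interval` fall within steps 1..total."""
    return total // interval

cpu_steps = total_steps(n_examples=500, batch_size=2, epochs=1)   # reduced settings used here
gpu_steps = total_steps(n_examples=4000, batch_size=8, epochs=3)  # settings suggested for GPU

print(cpu_steps, times_interval_fires(cpu_steps, 500))
print(gpu_steps, times_interval_fires(gpu_steps, 500))
```

With the reduced settings the run has 250 steps, so a 500-step interval never triggers during training; a smaller interval (for example 50) would be needed to see intermediate logs. With the GPU settings the same interval triggers three times.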

- What effect does weight_decay have during fine-tuning?
- When might you choose a higher or lower value?

What is weight_decay?
- It is a regularization setting that helps prevent overfitting.
- It works by shrinking the weights slightly toward zero at each update, which discourages very large weights.

It matters when fine-tuning a big model like GPT-2 on a small dataset (like SMS spam), because
- The model can easily memorize the training data.
- weight_decay helps it generalize better to new data.

✅ When to use:

| Situation | What to set |
|---|---|
| Small dataset (like this one) | weight_decay=0.01 (good default) |
| Model overfits (train accuracy high, val low) | Try a higher value like 0.05 |
| Model underfits (not learning at all) | Try a lower value or 0 |

🧠 In this case:
Use weight_decay=0.01, a common starting point for fine-tuning; change it only if the validation metrics point to overfitting or underfitting.
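To get a feel for the size of the effect: Trainer's default optimizer is AdamW, where weight decay multiplies each decayed weight by (1 - learning_rate * weight_decay) at every step, separately from the gradient update. The sketch below isolates just that shrink factor. It is a simplification: it assumes a constant learning rate (Trainer decays it by default) and ignores the gradient term entirely, and `decay_only_factor` is a hypothetical helper.

```python
def decay_only_factor(learning_rate, weight_decay, steps):
    """Total shrink factor applied to a weight by decoupled weight decay alone."""
    return (1.0 - learning_rate * weight_decay) ** steps

# Settings from the cell above, over the 250 steps of the reduced run.
default_factor = decay_only_factor(learning_rate=5e-5, weight_decay=0.01, steps=250)
stronger_factor = decay_only_factor(learning_rate=5e-5, weight_decay=0.05, steps=250)

print(default_factor)
print(stronger_factor)
```

Both factors are very close to 1, so on a run this short weight decay is a gentle nudge rather than a strong constraint; its influence grows with more steps or a larger learning rate.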

#7. Train & Evaluate

In [None]:
# Train
from transformers import Trainer
# you need to have your wandb api key ready to paste in the command line
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    compute_metrics=compute_metrics,
)
trainer.train()

# Evaluate
metrics = trainer.evaluate()
print(metrics)
# Expect something like: {"eval_loss": ..., "eval_accuracy": 0.98, ...}



{'eval_loss': 0.580434262752533, 'eval_accuracy': 0.94, 'eval_precision': 1.0, 'eval_recall': 0.6470588235294118, 'eval_f1': 0.7857142857142857, 'eval_runtime': 13.9352, 'eval_samples_per_second': 7.176, 'eval_steps_per_second': 3.588, 'epoch': 1.0}


Here are the evaluation results (rounded):
eval_loss: 0.58
eval_accuracy: 0.94
eval_precision: 1.0
eval_recall: 0.65
eval_f1: 0.79
- eval_precision of 1.0 means there are no false positives: every message the model flagged as spam really was spam.
- eval_accuracy of 0.94 looks fine in general, but with an imbalanced dataset it can be misleading, because a model can score high mostly by getting the majority "ham" class right.
- eval_recall of 0.65 means the model caught about 65% of the real spam messages and missed the remaining 35% or so (the false negatives).
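The reported metrics are enough to work back to a confusion matrix. The counts below are inferred, not logged by the notebook: they assume the 100-message validation set used here and that label 1 is spam. Precision of 1.0 forces zero false positives, 0.94 accuracy on 100 messages leaves 6 errors, and recall of about 0.647 then fits 11 caught spam out of 17.

```python
# Inferred counts (an assumption derived from the reported metrics, not notebook output).
tp, fp, fn, tn = 11, 0, 6, 83

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Read this way, the model never raised a false alarm but let 6 of 17 spam messages through, which is the weakness that accuracy alone hides.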