<a href="https://colab.research.google.com/github/amoukrim/AI/blob/main/Week6/DailyChallenge/dailyChallengew_6_d3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# @Author Adil MOUKRIM
@ Daily Challenge: Fine-Tuning GPT-2 for SMS Spam Classification (Legacy transformers API)


In this daily challenge, you'll fine-tune a pre-trained GPT-2 model to classify SMS messages as spam or ham (not spam). We'll work through loading the dataset, inspecting its schema, tokenizing examples, adapting to an older transformers version, and running training and evaluation with the classic do_train/do_eval flags.



👩‍🏫 👩🏿‍🏫 What You'll Learn
* How to load and explore a custom text-classification dataset
* Inspecting and aligning column names for tokenization
* Tokenizing text for GPT-2 (with its peculiar padding setup)
* Initializing GPT2ForSequenceClassification
* Defining and computing multiple evaluation metrics
* Configuring TrainingArguments for transformers < 4.4 (using do_train, eval_steps, etc.)
* Running fine-tuning with Trainer and interpreting results
* Common pitfalls when using legacy APIs


🛠️ What You Will Create
By the end of this challenge, you will have built:

* A tokenized SMS dataset compatible with GPT-2's requirements, including custom padding and truncation.
* A fine-tuned GPT2ForSequenceClassification model that can accurately label incoming SMS messages as spam or ham.
* A complete training pipeline using the legacy do_train/do_eval flags in TrainingArguments, with periodic checkpointing, logging, and evaluation.
* A set of evaluation metrics (accuracy, precision, recall, F1) computed at each validation step and summarized after training.
* A reusable Jupyter notebook that ties everything together, from dataset loading and inspection, through model initialization and tokenization, to training, evaluation, and results interpretation.


💼 Prerequisites
* Python 3.7+
* Installed packages: datasets, evaluate, transformers>=4.0.0,<4.4.0
* Basic familiarity with Hugging Face's datasets and transformers libraries
* GitHub or Colab access for executing the notebook
* A Hugging Face API token and a Weights & Biases (wandb) API key


Task
We will guide you through fine-tuning a GPT-2 model to classify SMS messages as spam or ham using an older version of transformers (<4.4). Follow the steps below and complete each "TODO" in the code.

1. Setup: Install the required packages: datasets, evaluate, and transformers[sentencepiece].

%pip install --quiet datasets evaluate transformers[sentencepiece]


2. Load & Inspect Dataset :

from datasets import TODO #import load_dataset
TODO # import pandas

# Load the UCI SMS Spam dataset (sms_spam) from Hugging Face hub
raw = TODO

# We'll use 4,000 for train, 1,000 for validation
train_ds = TODO
val_ds   = TODO

TODO  # print the features of the train dataset. It should show 'sms' and 'label'


3. Tokenization :

from transformers import TODO # import GPT2Tokenizer


model_name = TODO  # name of the pretrained checkpoint; we will use GPT-2
tokenizer  = TODO  # load the tokenizer for that checkpoint
# GPT-2 has no pad token by default, so reuse the eos token
tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(examples):
    # returns input_ids, attention_mask; keep max_length small for SMS
    return tokenizer(
        examples["sms"],
        padding="max_length",
        truncation=True,
        max_length=64
    )

train_tok = TODO  # tokenize the training subset with its .map method
val_tok   = TODO  # tokenize the validation subset with its .map method



4. Model Initialization

import torch
TODO  #import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained( # Load GPT-2 with sequence classification head
    model_name,
    num_labels=TODO,           # spam vs. ham
    pad_token_id=tokenizer.eos_token_id
)


5. Metrics Definition

import evaluate
import numpy as np

accuracy  = evaluate.load("accuracy")
precision = TODO  # load the "precision" metric the same way as accuracy
recall    = TODO  # load the "recall" metric the same way as accuracy
f1        = TODO  # load the "f1" metric the same way as accuracy

def compute_metrics(pred):
    logits, labels = pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy":  accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "precision": TODO,  # same pattern as accuracy, using the precision metric
        "recall":    TODO,  # same pattern as accuracy, using the recall metric
        "f1":        TODO   # same pattern as accuracy, using the F1 metric
    }


In an imbalanced dataset like SMS spam (often more "ham" than "spam"), why is it important to track precision and recall alongside accuracy?
How would you interpret a model that achieves high accuracy but low recall on the spam class?
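
As a small worked illustration of these two questions, the sketch below scores a hypothetical classifier that labels every message as ham. The label counts (900 ham, 100 spam) are made up for the example and are not taken from the dataset, and `binary_scores` is an illustrative helper, not a library function.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical validation set: 900 ham (0) and 100 spam (1).
y_true = [0] * 900 + [1] * 100
# A degenerate "classifier" that always answers ham.
y_pred = [0] * 1000

accuracy, precision, recall = binary_scores(y_true, y_pred)
print(accuracy, precision, recall)  # 0.9 0.0 0.0
```

Accuracy alone would make this useless model look acceptable, while recall on the spam class shows that it catches nothing.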


6. TrainingArguments Configuration

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=TODO,
    do_train=True,                 # turn on training
    do_eval=True,                  # turn on evaluation
    eval_steps=TODO,                # run .evaluate() every 500 steps
    save_steps=TODO,                # save a checkpoint every 500 steps
    logging_dir="./logs",
    logging_steps=TODO,             # log metrics every 500 steps

    per_device_train_batch_size=TODO,
    per_device_eval_batch_size=TODO,
    num_train_epochs=TODO,
    learning_rate=TODO,
    weight_decay=TODO,

    report_to=[],                  # disable integrations (an empty list; None does not disable them)
    save_total_limit=1,            # only keep last checkpoint
)


What effect does weight_decay have during fine-tuning? When might you choose a higher or lower value?
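
To make the question concrete, here is a minimal sketch of a decoupled weight-decay update on a single parameter. It uses plain SGD rather than the AdamW optimizer that Trainer uses by default, so it only illustrates the shrinking effect; the function name `sgd_step_with_decay` and all numeric values are assumptions chosen for the example.

```python
def sgd_step_with_decay(weight, grad, lr, weight_decay):
    """One SGD step in which the weight is also shrunk toward zero."""
    # Decoupled decay: shrink the weight first, then apply the gradient step.
    weight = weight - lr * weight_decay * weight
    return weight - lr * grad

w_plain = w_decay = 2.0
for _ in range(100):
    # A zero gradient isolates the effect of the decay term.
    w_plain = sgd_step_with_decay(w_plain, grad=0.0, lr=0.1, weight_decay=0.0)
    w_decay = sgd_step_with_decay(w_decay, grad=0.0, lr=0.1, weight_decay=0.01)

print(w_plain)            # 2.0
print(w_decay < w_plain)  # True
```

A larger value pulls weights toward zero more strongly, which may help when the model overfits a small dataset; a smaller value leaves the pretrained weights freer to move.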


7. Train & Evaluate

# Train
from transformers import Trainer
# if wandb reporting is enabled, have your wandb API key ready to paste when prompted
trainer = Trainer(
    model=TODO,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    compute_metrics=compute_metrics,
)
trainer.train()

#Evaluate
metrics = TODO
print(metrics)
# Expect something like: {"eval_loss": ..., "eval_accuracy": 0.98, ...}



Interpret your results.


# Step 1 – Environment setup

In [11]:
!pip install -U datasets



In [12]:
# Install the dependencies
!pip install --quiet evaluate transformers[sentencepiece]


# Step 2 – Load and explore the SMS Spam dataset



In [13]:
from datasets import load_dataset
import pandas as pd

# Load the "sms_spam" dataset from the Hugging Face Hub
raw = load_dataset("ucirvine/sms_spam")

# Shuffle once with a fixed seed, then split (4,000 training, 1,000 validation)
shuffled = raw["train"].shuffle(seed=42)
train_ds = shuffled.select(range(4000))
val_ds = shuffled.select(range(4000, 5000))

# Display the schema
print(train_ds.features)



{'sms': Value('string'), 'label': ClassLabel(names=['ham', 'spam'])}


## Dataset loaded successfully:
📨 "sms" is the column containing the messages.

🏷️ "label" is the column containing the classes ham (0) and spam (1).
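
The split relies on a seeded shuffle followed by two non-overlapping index ranges. The sketch below reproduces that idea on plain row indices with Python's `random` module; it does not use the `datasets` library, `seeded_split` is a hypothetical helper, and the total of 5,574 rows is an assumption about the dataset size.

```python
import random

def seeded_split(n_rows, n_train, n_val, seed=42):
    """Shuffle row indices once, then cut two disjoint slices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    return train_idx, val_idx

# 5574 is an assumed row count, used only for illustration.
train_idx, val_idx = seeded_split(n_rows=5574, n_train=4000, n_val=1000)

print(len(train_idx), len(val_idx))   # 4000 1000
print(set(train_idx) & set(val_idx))  # set()
# Same seed, same permutation: the split is reproducible.
print(seeded_split(5574, 4000, 1000)[0] == train_idx)  # True
```

Because both slices come from the same permutation, no message can land in both the training and the validation subset.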





# Step 3 – Tokenization for GPT-2

In [14]:
from transformers import GPT2Tokenizer

# Name of the pretrained (base) model
model_name = "gpt2"

# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# GPT-2 has no padding token, so the end-of-text token is used instead
tokenizer.pad_token = tokenizer.eos_token

# Tokenization function: turns each message into input_ids + attention_mask
def tokenize_fn(examples):
    return tokenizer(
        examples["sms"],            # the SMS text
        padding="max_length",       # pad every example to the same length (required for batching here)
        truncation=True,            # cut messages that exceed max_length
        max_length=64               # short maximum length, since these are SMS messages
    )

# Apply the function to every message in the training and validation sets
train_tok = train_ds.map(tokenize_fn, batched=True)
val_tok   = val_ds.map(tokenize_fn, batched=True)



## Tokenization results:

| Item | Description |
| --- | --- |
| 📦 `tokenizer` | The **GPT-2 tokenizer**, pretrained with its own vocabulary. |
| ✅ Padding | I used `eos_token` as `pad_token` (GPT-2 has no native PAD token). |
| ✂️ Truncation | Each SMS is limited to 64 tokens at most (enough for typical SMS messages). |
| 🗂️ Tokenization applied | To the **4,000 training SMS** and the **1,000 validation SMS**. |
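
What `padding="max_length"` and `truncation=True` do to a single sequence can be sketched without the tokenizer. The helper `pad_or_truncate` is a hypothetical stand-in, not part of transformers; 50256 is GPT-2's end-of-text token id, reused here as the pad id, and padding is applied on the right.

```python
def pad_or_truncate(token_ids, max_length, pad_id=50256):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence is padded up to max_length.
ids, mask = pad_or_truncate([32945, 3296, 532], max_length=6)
print(ids)   # [32945, 3296, 532, 50256, 50256, 50256]
print(mask)  # [1, 1, 1, 0, 0, 0]

# A long sequence is cut down to max_length.
ids, mask = pad_or_truncate(list(range(10)), max_length=6)
print(ids)   # [0, 1, 2, 3, 4, 5]
print(mask)  # [1, 1, 1, 1, 1, 1]
```

The attention mask is what lets the model ignore the padded positions even though they carry a real token id.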


In [15]:
# Display one tokenized example
train_tok[0]


{'sms': 'sports fans - get the latest sports news str* 2 ur mobile 1 wk FREE PLUS a FREE TONE Txt SPORT ON to 8007 www.getzed.co.uk 0870141701216+ norm 4txt/120p \n',
 'label': 1,
 'input_ids': [32945,
  3296,
  532,
  651,
  262,
  3452,
  5701,
  1705,
  965,
  9,
  362,
  2956,
  5175,
  352,
  266,
  74,
  17189,
  48635,
  257,
  17189,
  309,
  11651,
  309,
  742,
  6226,
  9863,
  6177,
  284,
  10460,
  22,
  7324,
  13,
  1136,
  8863,
  13,
  1073,
  13,
  2724,
  657,
  5774,
  28645,
  1558,
  486,
  20666,
  10,
  2593,
  604,
  14116,
  14,
  10232,
  79,
  220,
  198,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
 

# Step 4 – Initialize the GPT-2 model with a classification head
The goal is to initialize GPT-2 with a binary classification head (spam or ham). GPT-2 is originally a text-generation model, so I need to adapt it for a classification task.

In [16]:
# Import the GPT-2-based classification model
from transformers import GPT2ForSequenceClassification
model_name = "gpt2"
# Initialize GPT-2 with a classification head
model = GPT2ForSequenceClassification.from_pretrained(
    model_name,             # 'gpt2', or a larger checkpoint if desired
    num_labels=2,           # 0 = ham, 1 = spam (binary classification)
    pad_token_id=tokenizer.eos_token_id  # needed to avoid errors with padded batches
)


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
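
GPT-2 has no dedicated classification token, so the sequence-classification head scores the hidden state of the last real token in each sequence; that is why the model needs to know `pad_token_id`. The sketch below shows that index lookup on plain lists. `last_token_index` is a hypothetical helper that only approximates the library's behavior for right-padded input, and the token ids are illustrative.

```python
def last_token_index(input_ids, pad_id=50256):
    """Index of the last non-padding token in a right-padded sequence."""
    for position in range(len(input_ids) - 1, -1, -1):
        if input_ids[position] != pad_id:
            return position
    return -1  # the sequence contains only padding

batch = [
    [32945, 3296, 532, 50256, 50256],  # three real tokens, two pads
    [651, 262, 3452, 5701, 1705],      # no padding at all
]
print([last_token_index(row) for row in batch])  # [2, 4]
```

The warning about `score.weight` refers to this new head: it starts from random values and only becomes useful after fine-tuning.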


In [17]:
print(model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)


## ✅ Step 5: Define the evaluation metrics


In [18]:
import evaluate        # Hugging Face library for standard metrics
import numpy as np     # needed to process the logits

# Load the metrics from evaluate
accuracy  = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall    = evaluate.load("recall")
f1        = evaluate.load("f1")

# Metric function, called automatically at each evaluation
def compute_metrics(pred):
    logits, labels = pred                     # raw model predictions (logits) and true labels
    preds = np.argmax(logits, axis=-1)        # convert logits to predicted classes (0 or 1)

    return {
        "accuracy":  accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "precision": precision.compute(predictions=preds, references=labels)["precision"],
        "recall":    recall.compute(predictions=preds, references=labels)["recall"],
        "f1":        f1.compute(predictions=preds, references=labels)["f1"]
    }
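
Inside `compute_metrics`, `np.argmax(logits, axis=-1)` picks the higher-scoring class for each message. The same step on plain Python lists looks roughly like this; the logit values are made up, `predict_labels` is a hypothetical helper, and the `id2label` mapping follows the `ClassLabel(names=['ham', 'spam'])` order printed earlier.

```python
id2label = {0: "ham", 1: "spam"}

def predict_labels(logits):
    """Map each row of logits to the label with the largest score."""
    predicted_ids = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return [id2label[i] for i in predicted_ids]

# Made-up logits for three messages: [score_ham, score_spam].
logits = [[2.1, -1.3], [-0.4, 3.2], [0.5, 0.2]]
print(predict_labels(logits))  # ['ham', 'spam', 'ham']
```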

## Step 6 – TrainingArguments

The goal is to configure the training hyperparameters, such as the batch size, the learning rate, the evaluation frequency, and so on.

In [19]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",        # where model checkpoints are saved

    do_train=True,                 # enable training
    do_eval=True,                  # enable evaluation

    eval_steps=500,                # evaluation interval in steps (used only with a step-based evaluation strategy)
    save_steps=500,                # save a checkpoint every 500 steps
    logging_steps=500,             # log metrics every 500 steps

    per_device_train_batch_size=8,     # examples per batch during training
    per_device_eval_batch_size=8,      # same for validation
    num_train_epochs=3,                # number of passes over the training set

    learning_rate=5e-5,            # learning rate
    weight_decay=0.01,             # weight decay, a regularizer that helps limit overfitting

    report_to=[],                  # disable WandB, TensorBoard, etc. (enable if needed)
    save_total_limit=1             # keep only the most recent checkpoint
)
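
With these settings the number of optimizer steps can be worked out by hand. The sketch assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook; `total_training_steps` is an illustrative helper.

```python
import math

def total_training_steps(n_examples, batch_size, epochs):
    """Optimizer steps for a run on one device without accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs

total = total_training_steps(n_examples=4000, batch_size=8, epochs=3)
print(total)  # 1500

# Steps at which a 500-step interval triggers logging and checkpointing.
print(list(range(500, total + 1, 500)))  # [500, 1000, 1500]
```

This agrees with the `global_step=1500` reported by the training run.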



## Step 7: Train and evaluate the model with Trainer

In [20]:
from transformers import Trainer

# Create the Trainer with all the required pieces
trainer = Trainer(
    model=model,                        # the GPT-2 model to train
    args=training_args,                # the hyperparameters defined earlier
    train_dataset=train_tok,           # tokenized training dataset
    eval_dataset=val_tok,              # tokenized validation dataset
    compute_metrics=compute_metrics    # function used to evaluate the model
)

# Start training
trainer.train()


| Step | Training Loss |
| --- | --- |
| 500 | 0.1187 |
| 1000 | 0.039 |
| 1500 | 0.0095 |


TrainOutput(global_step=1500, training_loss=0.0557409995396932, metrics={'train_runtime': 305.9798, 'train_samples_per_second': 39.218, 'train_steps_per_second': 4.902, 'total_flos': 391945125888000.0, 'train_loss': 0.0557409995396932, 'epoch': 3.0})
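
The progress table above lists only training loss, with no validation columns, which suggests that no evaluation ran during training. As far as the transformers documentation describes it, `do_eval=True` and `eval_steps` do not schedule evaluation on their own; a step-based strategy argument is also needed, named `evaluation_strategy` in older 4.x releases and `eval_strategy` in newer ones. The sketch below picks whichever name a given version accepts. `eval_schedule_kwargs` is a hypothetical helper, and the exact version boundaries should be checked against the installed release.

```python
def eval_schedule_kwargs(supported_params, every_n_steps=500):
    """Build keyword arguments that request evaluation every N steps."""
    if "eval_strategy" in supported_params:
        key = "eval_strategy"
    elif "evaluation_strategy" in supported_params:
        key = "evaluation_strategy"
    else:
        raise ValueError("no step-based evaluation argument found")
    return {key: "steps", "eval_steps": every_n_steps}

# Stand-in parameter sets; in practice they would come from
# inspect.signature(TrainingArguments.__init__).parameters
print(eval_schedule_kwargs({"eval_strategy", "eval_steps"}))
# {'eval_strategy': 'steps', 'eval_steps': 500}
print(eval_schedule_kwargs({"evaluation_strategy", "eval_steps"}))
# {'evaluation_strategy': 'steps', 'eval_steps': 500}
```

The returned dictionary could then be unpacked into the constructor, for example `TrainingArguments(output_dir="./results", **kwargs)`, so that validation metrics are computed every 500 steps.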

Evaluate on the validation set

In [21]:
# Final evaluation on the validation dataset
metrics = trainer.evaluate()

# Print the results
print(metrics)


{'eval_loss': 0.037382956594228745, 'eval_accuracy': 0.995, 'eval_precision': 0.9831932773109243, 'eval_recall': 0.975, 'eval_f1': 0.9790794979079498, 'eval_runtime': 4.0378, 'eval_samples_per_second': 247.658, 'eval_steps_per_second': 30.957, 'epoch': 3.0}


## Metric-by-metric interpretation:

| Metric | Result | Interpretation |
| --- | --- | --- |
| **Loss** | `0.037` | Very low; the model has learned to separate the two classes |
| **Accuracy** | `99.5%` | Very high; the model is reliable overall |
| **Precision** | `98.3%` | Few false positives: very few normal messages are wrongly flagged as spam |
| **Recall** | `97.5%` | Few spam messages are missed (few false negatives) |
| **F1-score** | `97.9%` | Strong balance between precision and recall |
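
The reported scores can be cross-checked with a little arithmetic. Assuming spam is the positive class and the validation set holds 1,000 messages, the confusion counts below are the ones consistent with the printed metrics; they are inferred from those numbers, not read from the notebook's output.

```python
# Counts inferred from the printed metrics (spam = positive class).
tp, fp, fn, tn = 117, 2, 3, 878

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(total)                # 1000
print(round(accuracy, 4))   # 0.995
print(round(precision, 4))  # 0.9832
print(round(recall, 4))     # 0.975
print(round(f1, 4))         # 0.9791
```

Under these assumptions, 2 ham messages were flagged as spam and 3 of the 120 spam messages were missed.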


--> GPT-2 was fine-tuned successfully for SMS classification, and the results are excellent.


# ✅ Conclusion
The fine-tuned GPT-2 model:

* Flags spam SMS with very high precision (98.3%)

* Misses very few spam messages (97.5% recall)

* Shows no obvious sign of overfitting on the validation set: the validation loss stays low (0.037) and precision and recall are balanced

