<a href="https://colab.research.google.com/github/arquansa/PSTB-exercises/blob/main/Week08/Day1/DC1/W8D1DC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Daily Challenge: Preprocess & fine-tune transformer-based models
👩‍🏫 👩🏿‍🏫 What You’ll learn

In this daily challenge, you will learn how to preprocess and fine-tune transformer-based models, specifically BERT and XLM-RoBERTa, for text classification tasks. You will gain an understanding of:

- How tokenization works for these models.
- How to properly format input data.
- How to fine-tune transformer models for classification tasks.
- How to perform cross-validation using k-fold splitting.

🛠️ What you will create

By the end of this challenge, you will have a fine-tuned transformer model (BERT or XLM-RoBERTa) capable of classifying text into different categories.
Additionally, you will
- structure the data for training,
- validate it using cross-validation, and
- understand how to optimize these models for better performance.

Dataset You can find the dataset for this exercise here

Task

Understanding BERT and XLM-RoBERTa

Objective:

Learn how transformer models work and their role in NLP tasks.

Instructions:

- Read through the descriptions of BERT and XLM-RoBERTa.
- Understand how these models process text using tokenization.
- Learn about different pre-trained versions of these models and their characteristics.

Functions to use:

from transformers import BertTokenizer, XLMRobertaTokenizer

Tokenizing Text
Objective:
- Understand how to tokenize text using pre-trained tokenizers.

Instructions:

- Use the BertTokenizer and XLMRobertaTokenizer to convert sentences into tokenized input.
- Explore the different tokenizer outputs, such as input_ids, attention_mask, and token_type_ids.
- Experiment with single-sentence and two-sentence tokenization.

Functions to use:

- tokenizer.encode_plus()
- tokenizer.decode()

Preparing Input Data for the Model

Objective:
- Format input data correctly for transformer models.

Instructions:
- Ensure that input sentences are padded and possibly truncated to max_length.
- Understand and set special tokens such as `[CLS]` and `[SEP]` (BERT) or `<s>` and `</s>` (XLM-RoBERTa).
- Learn about attention_mask and how it helps the model ignore padding tokens.

Functions to use:
- tokenizer.encode_plus()
- tokenizer.special_tokens_map
- tokenizer.vocab_size

Loading and Exploring the Dataset

Objective: Load the dataset and explore its structure.

Instructions:

- Load the training and testing data from CSV files.
- Display the first few rows to understand its structure.
- Identify the columns needed for training the model.

Functions to use:

- pd.read_csv()
- df.head()
- df.shape

Creating Cross-Validation Folds

Objective:
- Implement k-fold cross-validation for training.

Instructions:

- Use StratifiedKFold to create 5 training-validation splits.
- Ensure that each fold maintains the same label distribution.
- Store the training and validation splits in separate lists.

Functions to use:

- from sklearn.model_selection import StratifiedKFold
- kf.split()
- StratifiedKFold(shuffle=True)

Final objective
- Understand how tokenization works in BERT and XLM-RoBERTa
- Format text data for these models
- Fine-tune a transformer model on a classification task
- Use k-fold cross-validation for robust evaluation

**1. Understanding BERT and XLM-RoBERTa**

BERT and XLM-RoBERTa are two slightly different transformer-based models:

BERT
- Developed by Google.
- Trained on English text.
- Monolingual, bidirectional.

Example model: "bert-base-uncased".

XLM-RoBERTa
- Developed by Facebook AI.
- Trained on 100 languages (CommonCrawl).
- Multilingual, robust across non-English languages.

Example model: "xlm-roberta-base".

**2. Tokenizing Text**

Import Tokenizers

In [None]:
from transformers import BertTokenizer, XLMRobertaTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

**3. Encode a Single Sentence**

In [None]:
sentence = "Transformers are powerful models."

bert_tokens = bert_tokenizer.encode_plus(sentence, return_tensors="pt")
xlmr_tokens = xlmr_tokenizer.encode_plus(sentence, return_tensors="pt")

print("BERT Input IDs:", bert_tokens['input_ids'])
print("XLM-R Input IDs:", xlmr_tokens['input_ids'])

print("Decoded (BERT):", bert_tokenizer.decode(bert_tokens['input_ids'][0]))
print("Decoded (XLM-R):", xlmr_tokenizer.decode(xlmr_tokens['input_ids'][0]))

BERT Input IDs: tensor([[  101, 19081,  2024,  3928,  4275,  1012,   102]])
XLM-R Input IDs: tensor([[     0,  11062,  82772,      7,    621, 113138, 115774,      5,      2]])
Decoded (BERT): [CLS] transformers are powerful models. [SEP]
Decoded (XLM-R): <s> Transformers are powerful models.</s>
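
The two tokenizers split the same sentence differently: counting the IDs above, BERT produces five content tokens between `[CLS]` and `[SEP]`, while XLM-R produces seven between `<s>` and `</s>`. BERT's tokenizer is based on WordPiece, which breaks a word greedily into the longest vocabulary entries it can find. The sketch below illustrates that greedy matching only. The helper `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and not part of `transformers`; the real tokenizer also handles lowercasing, punctuation, and a much larger vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "##izer", "trans", "##former", "##s", "model"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("transformers", toy_vocab))  # ['trans', '##former', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

XLM-R relies on a SentencePiece model instead, so its pieces and IDs are not comparable with BERT's.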


**4. Token Types**

- input_ids: Tokenized numeric IDs.
- attention_mask: 1s for real tokens, 0s for padding.
- token_type_ids (optional): Used for two-sentence tasks (like question answering).
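
For a sentence pair, BERT lays the input out as `[CLS] A [SEP] B [SEP]` and uses token_type_ids to mark which segment each position belongs to. The sketch below builds that layout from plain lists so the three fields can be compared side by side. The helper `build_pair_inputs` is hypothetical; the IDs 101 and 102 for `[CLS]` and `[SEP]` and the word IDs are taken from the BERT output printed earlier. In practice, passing two strings to `bert_tokenizer.encode_plus(sentence_a, sentence_b)` produces these fields.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble BERT-style inputs for a sentence pair from already-tokenized IDs."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding here, so every position is a real token.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# "transformers are powerful models ." using the IDs printed by the BERT tokenizer
ids_a = [19081, 2024, 3928, 4275, 1012]
# "models are powerful ." reusing the same word IDs
ids_b = [4275, 2024, 3928, 1012]

pair = build_pair_inputs(ids_a, ids_b)
for name, values in pair.items():
    print(f"{name:15s} {values}")
```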


**5. Preparing Input Data**

Encoding with Padding, Truncation, Special Tokens

In [None]:
tokens = bert_tokenizer.encode_plus(
    sentence,
    max_length=16,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors="pt"
)

print("Input IDs:", tokens['input_ids'])
print("Attention Mask:", tokens['attention_mask'])


Input IDs: tensor([[  101, 19081,  2024,  3928,  4275,  1012,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0]])
Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
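
What padding='max_length' and truncation=True do can be reproduced with plain lists. The following is a minimal sketch, not the tokenizer's actual implementation: the helper `pad_or_truncate` is hypothetical, and the IDs 101, 102 and 0 for `[CLS]`, `[SEP]` and `[PAD]` are read off the output above.

```python
def pad_or_truncate(content_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Add special tokens, then truncate or pad to exactly max_length positions."""
    room = max_length - 2                 # two slots are reserved for [CLS] and [SEP]
    kept = content_ids[:room]             # truncation drops tokens from the right
    input_ids = [cls_id] + kept + [sep_id]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad          # 0 = padding the model should ignore
    return input_ids, attention_mask


content = [19081, 2024, 3928, 4275, 1012]  # "transformers are powerful models ."

ids, mask = pad_or_truncate(content, max_length=16)
print(ids)   # 7 real IDs followed by 9 zeros
print(mask)  # 7 ones followed by 9 zeros

short_ids, short_mask = pad_or_truncate(content, max_length=5)
print(short_ids)   # [101, 19081, 2024, 3928, 102]
print(short_mask)  # [1, 1, 1, 1, 1]
```

With max_length=16 the result matches the tensors printed above; with max_length=5 the last two content tokens are cut while `[SEP]` is kept at the end.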


**6. Explore Tokenizer Info**

In [None]:
print("Special Tokens:", bert_tokenizer.special_tokens_map)
print("Vocab Size:", bert_tokenizer.vocab_size)

Special Tokens: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
Vocab Size: 30522


Using the AG News sample to simulate a CSV-based workflow

- Load data from CSV
- View data with df.head() / df.shape
- Identify columns
- Create Stratified K-Folds

Convert the Dataset to a CSV (simulating a CSV workflow)

In [None]:
from datasets import load_dataset

# Load a small sample of AG News
dataset = load_dataset("ag_news", split='train[:1000]')

# Save to CSV (simulating external CSV loading)
dataset.to_csv("ag_news_sample.csv")

254914

 Load the CSV and Inspect It

In [None]:
import pandas as pd

# Load it back as if it's an external dataset
df = pd.read_csv("ag_news_sample.csv")

# Show first few rows
print(df.head())

# Show shape of the data
print("Shape:", df.shape)

# Check for columns
print("Columns:", df.columns)

# Check label distribution
print("Label distribution:\n", df["label"].value_counts())

                                                text  label
0  Wall St. Bears Claw Back Into the Black (Reute...      2
1  Carlyle Looks Toward Commercial Aerospace (Reu...      2
2  Oil and Economy Cloud Stocks' Outlook (Reuters...      2
3  Iraq Halts Oil Exports from Main Southern Pipe...      2
4  Oil prices soar to all-time record, posing new...      2
Shape: (1000, 2)
Columns: Index(['text', 'label'], dtype='object')
Label distribution:
 label
3    472
0    212
2    174
1    142
Name: count, dtype: int64


Clean DataFrame (drop unused columns)

In [None]:
# Drop unused index column if it exists
if "__index_level_0__" in df.columns:
    df.drop(columns=["__index_level_0__"], inplace=True)

Create Stratified K-Folds

In [None]:
from sklearn.model_selection import StratifiedKFold

X = df["text"].values
y = df["label"].values

# Set up 5-fold Stratified CV
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

folds = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    train_df = df.iloc[train_idx].reset_index(drop=True)
    val_df = df.iloc[val_idx].reset_index(drop=True)

    folds.append((train_df, val_df))

    print(f"\n📂 Fold {fold + 1}")
    print(f" - Train size: {train_df.shape}")
    print(f" - Val size: {val_df.shape}")
    print(" - Label distribution in Val:\n", val_df['label'].value_counts())



📂 Fold 1
 - Train size: (800, 2)
 - Val size: (200, 2)
 - Label distribution in Val:
 label
3    95
0    42
2    35
1    28
Name: count, dtype: int64

📂 Fold 2
 - Train size: (800, 2)
 - Val size: (200, 2)
 - Label distribution in Val:
 label
3    94
0    42
2    35
1    29
Name: count, dtype: int64

📂 Fold 3
 - Train size: (800, 2)
 - Val size: (200, 2)
 - Label distribution in Val:
 label
3    94
0    42
2    35
1    29
Name: count, dtype: int64

📂 Fold 4
 - Train size: (800, 2)
 - Val size: (200, 2)
 - Label distribution in Val:
 label
3    94
0    43
2    35
1    28
Name: count, dtype: int64

📂 Fold 5
 - Train size: (800, 2)
 - Val size: (200, 2)
 - Label distribution in Val:
 label
3    95
0    43
2    34
1    28
Name: count, dtype: int64
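
Whether the folds really preserve the label distribution can be checked from the counts printed above. The sketch below copies those counts into plain dictionaries (the variable names are illustrative) and compares each fold's label share with the share in the full 1,000-row sample.

```python
# Label counts in the full sample, as printed by df["label"].value_counts()
overall = {3: 472, 0: 212, 2: 174, 1: 142}

# Validation label counts for folds 1 to 5, copied from the output above
val_counts = [
    {3: 95, 0: 42, 2: 35, 1: 28},
    {3: 94, 0: 42, 2: 35, 1: 29},
    {3: 94, 0: 42, 2: 35, 1: 29},
    {3: 94, 0: 43, 2: 35, 1: 28},
    {3: 95, 0: 43, 2: 34, 1: 28},
]

total = sum(overall.values())

# Each row appears in exactly one validation fold, so per-label counts must add up.
for label, count in overall.items():
    assert sum(fold[label] for fold in val_counts) == count

# Largest difference between a fold's label share and the overall share.
max_gap = 0.0
for fold in val_counts:
    fold_size = sum(fold.values())
    for label, count in fold.items():
        gap = abs(count / fold_size - overall[label] / total)
        max_gap = max(max_gap, gap)

print(f"Largest gap between fold share and overall share: {max_gap:.3f}")  # 0.004
```

A gap of at most 0.4 percentage points is what stratification is expected to deliver: the counts per fold differ by no more than one example per class.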


Summary
All the steps have been done, using AG News as if it were loaded from a CSV:
- Load CSV          pd.read_csv()
- Explore data      df.head(), df.shape
- Identify columns  df.columns, df['label']
- Stratified K-fold StratifiedKFold(...).split(X, y)

 **Objective**

Implement Stratified K-Fold Cross-Validation using a dataset loaded via:

In [None]:
from datasets import load_dataset
dataset = load_dataset("ag_news", split='train[:1000]')

- Convert the Hugging Face dataset to a pandas DataFrame.
- Use StratifiedKFold to create 5 train-validation splits.
- Store the splits as lists of (train_df, val_df) pairs.

In [None]:
from datasets import load_dataset
from sklearn.model_selection import StratifiedKFold
import pandas as pd

# Step 1: Load AG News (small sample)
dataset = load_dataset("ag_news", split='train[:1000]')

# Step 2: Convert to pandas DataFrame
df = dataset.to_pandas()

# Optional: Inspect structure
print(df.head())
print("Shape:", df.shape)
print("Label distribution:\n", df['label'].value_counts())

# Step 3: Prepare StratifiedKFold
X = df["text"].values
y = df["label"].values

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Step 4: Store folds in list
cv_folds = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    train_df = df.iloc[train_idx].reset_index(drop=True)
    val_df = df.iloc[val_idx].reset_index(drop=True)

    cv_folds.append((train_df, val_df))

    print(f"\n📂 Fold {fold + 1}")
    print(f" - Train shape: {train_df.shape}")
    print(f" - Validation shape: {val_df.shape}")
    print(" - Validation label distribution:\n", val_df['label'].value_counts().to_dict())


                                                text  label
0  Wall St. Bears Claw Back Into the Black (Reute...      2
1  Carlyle Looks Toward Commercial Aerospace (Reu...      2
2  Oil and Economy Cloud Stocks' Outlook (Reuters...      2
3  Iraq Halts Oil Exports from Main Southern Pipe...      2
4  Oil prices soar to all-time record, posing new...      2
Shape: (1000, 2)
Label distribution:
 label
3    472
0    212
2    174
1    142
Name: count, dtype: int64

📂 Fold 1
 - Train shape: (800, 2)
 - Validation shape: (200, 2)
 - Validation label distribution:
 {3: 95, 0: 42, 2: 35, 1: 28}

📂 Fold 2
 - Train shape: (800, 2)
 - Validation shape: (200, 2)
 - Validation label distribution:
 {3: 94, 0: 42, 2: 35, 1: 29}

📂 Fold 3
 - Train shape: (800, 2)
 - Validation shape: (200, 2)
 - Validation label distribution:
 {3: 94, 0: 42, 2: 35, 1: 29}

📂 Fold 4
 - Train shape: (800, 2)
 - Validation shape: (200, 2)
 - Validation label distribution:
 {3: 94, 0: 43, 2: 35, 1: 28}

📂 Fold 5
 - Train

**Output Structure**

- cv_folds: a list of 5 tuples
- Each tuple: (train_df, val_df)
- Each train_df and val_df is a pandas.DataFrame with text and label columns.
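
A list of (train, validation) pairs like this is typically consumed by a loop that trains one fresh model per fold and averages the scores. The sketch below shows only that loop structure. The names `cross_validate` and `majority_class_baseline` are hypothetical, and the toy folds use lists of (text, label) tuples instead of DataFrames so the example runs without any downloads; with the real folds, the scoring function would tokenize each split and fine-tune a newly initialized model.

```python
from statistics import mean, stdev


def cross_validate(folds, fit_and_score):
    """Call fit_and_score(train, val) on every fold and summarize the scores."""
    scores = [fit_and_score(train, val) for train, val in folds]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {"scores": scores, "mean": mean(scores), "std": spread}


def majority_class_baseline(train, val):
    """Hypothetical stand-in for model training: always predict the most
    frequent training label and return validation accuracy."""
    train_labels = [label for _, label in train]
    majority = max(set(train_labels), key=train_labels.count)
    hits = sum(1 for _, label in val if label == majority)
    return hits / len(val)


# Toy folds for illustration only: (train, val) pairs of (text, label) tuples.
toy_folds = [
    ([("a", 0), ("b", 0), ("c", 1)], [("d", 0), ("e", 1)]),
    ([("a", 0), ("d", 0), ("e", 1)], [("b", 0), ("c", 1)]),
]

summary = cross_validate(toy_folds, majority_class_baseline)
print(summary)  # {'scores': [0.5, 0.5], 'mean': 0.5, 'std': 0.0}
```

Creating the model inside the scoring function matters: reusing one model across folds would let it see validation rows from earlier folds during training.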

Install dependencies

In [None]:
!pip install transformers datasets scikit-learn --quiet

In [None]:
# 📚 Import libraries
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import torch
from datasets import Dataset


 Load and Split the Dataset

In [None]:
from datasets import load_dataset

# Load the dataset (1,000 rows only for demo)
dataset = load_dataset("ag_news", split='train[:1000]')

# Split into train/test (80/20)
dataset = dataset.train_test_split(test_size=0.2)

# Optional: check dataset
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 200
    })
})


Choose Model and Tokenizer (BERT)

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=4)  # AG News has 4 classes


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Tokenize the Dataset

In [None]:
def tokenize(example):
    return tokenizer(
        example["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

# Tokenize
tokenized_dataset = dataset.map(tokenize, batched=True)

# Format for PyTorch
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])


 Define TrainingArguments and Trainer

In [None]:
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score
import numpy as np

# Compute metrics manually
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": accuracy_score(labels, predictions)}

training_args = TrainingArguments(
    output_dir="./ag_news_results",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    logging_steps=10,
    save_strategy="no",  # no checkpoint saving needed for this small demo
    logging_dir="./logs"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics
)
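
To make the metric concrete: compute_metrics receives one row of logits per example (four values here, one per AG News class), picks the index of the largest value as the prediction, and compares it with the label. The sketch below repeats that logic with plain lists and made-up logits; `argmax` and `accuracy_from_logits` are illustrative helpers, not part of the Trainer API.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Share of examples whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four examples and four classes
toy_logits = [
    [0.1, 2.3, -1.0, 0.4],   # predicted class 1
    [1.5, 0.2, 0.3, 0.1],    # predicted class 0
    [-0.5, 0.0, 0.2, 3.1],   # predicted class 3
    [0.9, 0.8, 1.0, 0.7],    # predicted class 2
]
toy_labels = [1, 0, 3, 0]    # the last example is misclassified

print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```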


Train the Model

In [None]:
trainer.train()


| Step | Training Loss |
|------|---------------|
| 10 | 0.9149 |
| 20 | 0.9496 |
| 30 | 0.7809 |
| 40 | 0.5917 |
| 50 | 0.4347 |
| 60 | 0.4016 |
| 70 | 0.595 |
| 80 | 0.5246 |
| 90 | 0.3004 |
| 100 | 0.4737 |


TrainOutput(global_step=300, training_loss=0.30440501193205516, metrics={'train_runtime': 3175.7077, 'train_samples_per_second': 0.756, 'train_steps_per_second': 0.094, 'total_flos': 161553088978944.0, 'train_loss': 0.30440501193205516, 'epoch': 3.0})
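
The step count in this output can be checked by hand. Assuming a single device and no gradient accumulation (neither is changed in the arguments above), 800 training rows at a batch size of 8 give 100 optimizer steps per epoch:

```python
import math

train_examples = 800         # size of the train split
batch_size = 8               # per_device_train_batch_size
reported_global_step = 300   # from the TrainOutput above

steps_per_epoch = math.ceil(train_examples / batch_size)
epochs_run = reported_global_step / steps_per_epoch

print(steps_per_epoch)  # 100
print(epochs_run)       # 3.0
```

The recorded TrainOutput reports global_step=300 and epoch 3.0, which corresponds to three passes over the data rather than the num_train_epochs=1 shown in the arguments, so the logged run was presumably made with a different epoch setting.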

Evaluate the Model

In [None]:
results = trainer.evaluate()
print(f"📊 Accuracy on test set: {results['eval_accuracy']:.4f}")

📊 Accuracy on test set: 0.8700


# Final Conclusion

In this project, we explored and applied the fundamentals of fine-tuning Transformer models for text classification, focusing on BERT and XLM-RoBERTa.

✅ The main objectives were met:

- Understanding the BERT and XLM-R architectures.
- Mastering tokenization with BertTokenizer and XLMRobertaTokenizer.
- Preparing inputs with attention to input_ids, attention_mask, padding and truncation.
- Making effective use of the AG News dataset.
- Setting up cross-validation with StratifiedKFold.
- Training a BERT model with the Hugging Face Trainer.
- Evaluating performance with a relevant metric (accuracy).
