# 🤗 Hugging Face Trainer - Detailed Training Lab
This notebook walks through fine-tuning a transformer model with Hugging Face's `Trainer` API. It explains each component in turn: dataset loading, tokenization, model selection, training argument configuration, custom optimizer integration, metric evaluation, and model saving.


## 📋 Overview
We will walk through the following steps:
1. **Load and explore a dataset**
2. **Tokenize and preprocess the data**
3. **Load a pre-trained model for classification**
4. **Configure training arguments**
5. **Customize the optimizer and learning rate scheduler**
6. **Define the Hugging Face `Trainer`**
7. **Train and evaluate the model**
8. **Save the fine-tuned model**

## 📦 Install Required Libraries
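Install the Hugging Face Transformers, Datasets, and Evaluate packages, plus Accelerate (required by the PyTorch `Trainer`) and scikit-learn (used by the `accuracy` metric).

In [None]:
!pip install transformers datasets evaluate accelerate scikit-learn -q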

## 📚 Import Libraries
We import the components we need from Transformers, Datasets, and Evaluate, along with PyTorch and NumPy.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from transformers import get_scheduler
from torch.optim import AdamW  # transformers.AdamW is deprecated and removed in recent releases
from datasets import load_dataset
import evaluate
import torch
import numpy as np

## 🗂️ Load and Explore the Dataset
We use the IMDb dataset, a binary sentiment classification dataset of movie reviews labeled negative (0) or positive (1), with 25,000 training and 25,000 test examples.

In [None]:
dataset = load_dataset("imdb")
dataset = dataset.shuffle(seed=42)
dataset['train'][0]

## ✂️ Tokenize the Dataset
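We tokenize the raw text with the tokenizer that matches our pre-trained checkpoint. Tokenization converts text into the input IDs and attention masks the model expects. We truncate long reviews to the model's maximum length here and leave padding to batch time, which the `Trainer` handles when it is given the tokenizer.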

In [None]:
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    # Truncate to the model's maximum length; padding is deferred to batch time
    return tokenizer(example["text"], truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
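
To make "input IDs and attention masks" concrete, here is a dependency-free sketch of what happens to a batch of variable-length sequences. The vocabulary, `toy_tokenize`, and `pad_batch` are made up for illustration. DistilBERT's real tokenizer uses WordPiece subwords and special tokens, and in this notebook the padding step is carried out at batch time by the `Trainer`'s default collator rather than by hand.

```python
# Toy vocabulary (hypothetical); id 0 is reserved for padding, 1 for unknown words.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "great": 2, "movie": 3, "boring": 4, "plot": 5, "a": 6}

def toy_tokenize(text, max_length=4):
    """Map whitespace-separated words to ids and truncate to max_length."""
    ids = [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]
    return ids[:max_length]

def pad_batch(sequences):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [VOCAB["[PAD]"]] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([toy_tokenize("a great movie"), toy_tokenize("boring")])
print(batch["input_ids"])       # [[6, 2, 3], [4, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

The attention mask tells the model which positions hold real tokens, so the padded zeros in the shorter sequence are ignored.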

## 🧠 Load Pre-trained Model
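We load a pre-trained DistilBERT model for sequence classification with a two-label head for binary classification. The classification head is newly initialized rather than pre-trained, so a warning about uninitialized weights is expected at this point.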

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
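
With `num_labels=2`, the model produces two logits per review. As a rough illustration of how those scores relate to class probabilities, the sketch below applies a softmax to a made-up pair of logits. The numbers are not model output, and the label order (0 = negative, 1 = positive) is assumed from the IMDb convention.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.0, 2.0]  # hypothetical scores for [negative, positive]
probs = softmax(example_logits)
predicted_label = probs.index(max(probs))  # index of the most probable class

print(probs)
print(predicted_label)  # 1
```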

## ⚙️ Configure Training Arguments
`TrainingArguments` is a configuration class to customize the training process. You can set batch sizes, learning rate, evaluation strategy, logging, weight decay, and other parameters.

In [None]:
training_args = TrainingArguments(
    output_dir="./results",          # Where to store model checkpoints
    eval_strategy="epoch",           # Evaluate after every epoch (older releases: evaluation_strategy)
    save_strategy="epoch",           # Save a checkpoint after every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,               # Weight decay coefficient (decoupled from the gradient in AdamW)
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True      # Reload the best checkpoint at the end (lowest eval loss by default)
)
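
It can help to work out how often these settings fire. The arithmetic below assumes the 2,000-example training subset this notebook passes to the `Trainer`, a single device, and no gradient accumulation. With different hardware or settings the counts change.

```python
import math

num_train_examples = 2000   # size of the training subset used in this notebook
batch_size = 16             # per_device_train_batch_size
num_epochs = 3              # num_train_epochs
logging_steps = 10

steps_per_epoch = math.ceil(num_train_examples / batch_size)  # a partial last batch still counts
total_steps = steps_per_epoch * num_epochs
log_events = total_steps // logging_steps  # step-based log entries (steps 10, 20, ...)

print(steps_per_epoch)  # 125
print(total_steps)      # 375
print(log_events)       # 37
```

With the epoch-based strategies above, evaluation and checkpointing each run three times, once at the end of every epoch.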

## 🛠 Custom Optimizer and Learning Rate Scheduler
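Instead of relying on the optimizer and scheduler that `Trainer` creates by default, we define our own:
- `AdamW` is Adam with decoupled weight decay, a common choice for fine-tuning transformers
- `get_scheduler` with `name="linear"` decays the learning rate linearly to zero over the training steps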

In [None]:
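import math

# Define custom optimizer
optimizer = AdamW(model.parameters(), lr=training_args.learning_rate, weight_decay=training_args.weight_decay)

# Setup learning rate scheduler.
# The step count must match the data the Trainer actually sees: the 2,000-example
# training subset selected when the Trainer is defined, not the full training split.
# Assumes a single device and no gradient accumulation.
num_train_examples = 2000
steps_per_epoch = math.ceil(num_train_examples / training_args.per_device_train_batch_size)
num_training_steps = int(steps_per_epoch * training_args.num_train_epochs)
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)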
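
The shape of the linear schedule is easy to reproduce by hand. The sketch below is a plain-Python approximation of the multiplier a linear schedule applies to the base learning rate: a ramp from 0 to 1 over the warmup steps, then a straight line down to 0 at the final step. `linear_multiplier` is a hypothetical helper written for this illustration, and the 375-step total assumes 2,000 training examples, batch size 16, and 3 epochs.

```python
def linear_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)  # linear warmup from 0 to 1
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return max(0.0, remaining / decay_span)     # linear decay from 1 to 0

base_lr = 2e-5
total_steps = 375  # assumed: ceil(2000 / 16) * 3

# Learning rate at the start, after each epoch boundary, and at the end
for step in (0, 125, 250, 375):
    print(step, base_lr * linear_multiplier(step, 0, total_steps))
```

Getting `num_training_steps` right matters. If the scheduler were told there are 4,686 steps (the count for the full 25,000-example split) while training stops after 375, the multiplier at the last step would still be about 0.92, so the learning rate would barely decay.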

## 📏 Define Evaluation Metrics
We use `accuracy` as the evaluation metric. The `compute_metrics` function will be called by the `Trainer` during evaluation.

In [None]:
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
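
As a sanity check on what `compute_metrics` returns, the same computation can be traced by hand on a tiny batch. The logits and labels below are invented for illustration, and the sketch uses plain Python in place of NumPy and the `evaluate` library so that it runs anywhere.

```python
def argmax(row):
    """Index of the largest value in a list (the first one wins on ties)."""
    return max(range(len(row)), key=lambda i: row[i])

def toy_compute_metrics(logits, labels):
    """Accuracy = fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

toy_logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.1]]  # made-up scores
toy_labels = [0, 1, 0, 0]

print(toy_compute_metrics(toy_logits, toy_labels))  # {'accuracy': 0.75}
```

Three of the four predictions match their labels, giving an accuracy of 0.75. The real function returns a dictionary of the same shape.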

## 🧪 Define the Trainer
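The `Trainer` class handles the training loop, evaluation, and checkpoint saving. You pass in the model, datasets, tokenizer, training arguments, metrics function, and optimizer/scheduler pair. To keep the run short, we train on a 2,000-example subset of the training split and evaluate on 1,000 test examples.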

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized_datasets["test"].select(range(1000)),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    optimizers=(optimizer, lr_scheduler)
)

## 🚀 Train the Model
We now train the model using the `train()` method of the `Trainer`.

In [None]:
trainer.train()

## 📊 Evaluate the Model
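Use the `evaluate()` method to get performance metrics on the evaluation dataset. It returns a dictionary whose keys carry an `eval_` prefix, such as `eval_loss` and `eval_accuracy`.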

In [None]:
trainer.evaluate()

## 💾 Save the Fine-tuned Model
After training, save the model to disk for later use or deployment. Because the tokenizer was passed to the `Trainer`, it is saved to the same directory as the model.

In [None]:
trainer.save_model("./fine-tuned-imdb")
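
To use the saved model later, one option is to reload it from the same directory. The helper below is a sketch, not part of the original notebook: `predict_sentiment` is a hypothetical name, it assumes the tokenizer files were saved alongside the model in `./fine-tuned-imdb`, and it assumes the IMDb label order (0 = negative, 1 = positive).

```python
def predict_sentiment(text, model_dir="./fine-tuned-imdb"):
    """Reload the saved model and classify one review (sketch under the assumptions above)."""
    # Imports live inside the function so that defining it does not require the libraries.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():  # no gradients needed for prediction
        logits = model(**inputs).logits

    label_id = int(logits.argmax(dim=-1).item())
    return "positive" if label_id == 1 else "negative"  # assumed label order
```

Once training has finished and the directory exists, calling `predict_sentiment("A wonderful film.")` returns either `"positive"` or `"negative"`.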