# Hugging Face Trainer — Deep Dive Tutorial

*A comprehensive, code‑along notebook with in‑depth explanations before each code cell.*

> **Last updated:** 2025-09-17 09:17 UTC
> 
> **What you’ll learn**
> - Conceptual background of the Trainer API
> - Preparing datasets and tokenization
> - Loading pre‑trained models for classification
> - Deep dive into `TrainingArguments`
> - What happens inside the training loop
> - Evaluation, prediction, and metrics
> - Saving and resuming checkpoints
> - Sharing models via the Hugging Face Hub
> - Advanced usage: custom collators and callbacks


## 0) Prerequisites & Environment

The Hugging Face **Trainer API** is a high‑level training abstraction for PyTorch models in `transformers`.  
It automates most of the boilerplate needed to train or fine‑tune Transformer models while still giving you hooks for customization.

**What the Trainer does for you:**
- Handles the full training loop (forward pass, loss computation, backpropagation, optimizer step).
- Manages evaluation at defined intervals.
- Saves and loads checkpoints (including optimizer/scheduler states).
- Supports distributed and mixed‑precision training out of the box.
- Logs metrics for you (to stdout, TensorBoard, WandB, etc.).

You’ll need:
- `transformers` (for Trainer, models, tokenizers)
- `datasets` (to load the IMDB dataset)
- `evaluate` (for the accuracy metric)
- `torch` (deep learning backend)
- `accelerate` (used by Trainer for device placement and distributed runs)

> If you don’t have them installed, run the cell below.


In [None]:
# (Optional) Install dependencies
# %pip install -U transformers datasets evaluate torch torchvision torchaudio accelerate

## 1) Load & Prepare a Dataset

We’ll use the **IMDB dataset** for binary sentiment classification (positive vs. negative).

### Key ideas
- `datasets.load_dataset` automatically downloads and caches datasets from the Hugging Face Hub.
- Each split (`train`, `test`) is a `Dataset` object containing rows of data.
- We must **tokenize** raw text into model‑readable IDs before feeding it into a Transformer.

### Tokenization details
- `AutoTokenizer` chooses the right tokenizer for the model architecture (e.g., WordPiece for BERT).
- We enable `truncation=True` so reviews longer than the model’s max length (512 for BERT‑like models) are truncated.
- Padding can be applied dynamically later (better for efficiency).


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized["train"][0]
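
To make the `batched=True` contract concrete: `map` hands the function a dict of columns (each a list of values) and expects a dict of lists back. The sketch below imitates that shape with a toy whitespace tokenizer. `toy_tokenize`, `MAX_LENGTH`, and the sample reviews are made up for illustration and are not part of `transformers` or `datasets`; a real tokenizer splits text into subwords rather than on spaces.

```python
# Toy stand-in for a real tokenizer: splits on whitespace and maps each
# word to an integer ID. It only mimics the input/output shape of a
# batched tokenizer call, not WordPiece itself.
MAX_LENGTH = 8  # hypothetical limit; BERT-like models typically use 512


def toy_tokenize(batch, vocab):
    input_ids, attention_mask = [], []
    for text in batch["text"]:
        ids = [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]
        ids = ids[:MAX_LENGTH]  # what truncation=True does: drop tokens past the limit
        input_ids.append(ids)
        attention_mask.append([1] * len(ids))
    # Return a dict of columns with one entry per input row.
    return {"input_ids": input_ids, "attention_mask": attention_mask}


vocab = {}
batch = {
    "text": ["A great movie", "Far too long and " + "very " * 10 + "dull"],
    "label": [1, 0],
}
encoded = toy_tokenize(batch, vocab)
print([len(ids) for ids in encoded["input_ids"]])  # [3, 8]
```

Note that the two rows end up with different lengths. No padding has happened yet, which is why a padding step is needed when batches are formed.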

## 2) Load a Pre‑Trained Model

We use **DistilBERT**, a smaller, faster variant of BERT.  
For classification, we need a **classification head** on top of the transformer encoder.

- `AutoModelForSequenceClassification` automatically attaches the right head.
- `num_labels=2` indicates binary classification (positive vs. negative).

> Under the hood, the model outputs **logits** (unnormalized scores). During training, the cross‑entropy loss is computed directly from the logits; apply softmax yourself when you need probabilities.


In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
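
As a small illustration of how logits relate to probabilities, here is a plain‑Python softmax. The logit values are invented, and `softmax` is a hand‑written helper rather than a library call; in practice you would use the tensor library's own implementation.

```python
import math


def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 2.3]  # made-up scores for (negative, positive)
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

Softmax preserves the ordering of the scores, so taking `argmax` over the logits gives the same class as taking it over the probabilities.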

## 3) Define TrainingArguments

The **TrainingArguments** class is the backbone of training configuration.  
It controls **when, how, and where** training happens.

### Under the hood
When passed to `Trainer`, these arguments:
- Configure the optimizer (`AdamW` by default) and learning rate scheduler.
- Define batch sizes, number of epochs, and gradient accumulation.
- Decide when evaluation and checkpoint saving occurs.
- Control device placement (CPU, GPU, TPU) and mixed precision (fp16/bf16).
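
To give a feel for the learning rate scheduler, here is a sketch of linear warmup followed by linear decay, assuming the linear schedule with no warmup that `TrainingArguments` selects unless told otherwise. `linear_lr` is an illustrative name, not a library function, and the 250‑step horizon is an arbitrary figure.

```python
def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at a given optimizer step for linear warmup + linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total = 250  # arbitrary number of optimizer steps for illustration
schedule = [linear_lr(s, total, 2e-5) for s in (0, 125, 250)]
print(schedule)
```

With no warmup, the rate starts at the configured `learning_rate`, is roughly halved at the midpoint, and reaches zero on the final step.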

### Important parameters
- `output_dir`: where checkpoints and logs are written.
- `evaluation_strategy`: when to run evaluation (`"no"`, `"steps"`, or `"epoch"`). Recent `transformers` releases rename this argument to `eval_strategy`.
- `save_strategy`: how often to save checkpoints.
- `per_device_train_batch_size`: batch size per GPU/CPU device.
- `num_train_epochs`: how many epochs to train for.
- `learning_rate`: optimizer learning rate.
- `weight_decay`: regularization to prevent overfitting.
- `logging_steps`: how often to log training progress.
- `push_to_hub`: whether to push the model to the Hub each time it is saved.

### Best practices
- Use small batch sizes if limited by GPU memory and scale up gradually.
- Combine `gradient_accumulation_steps` with small batch sizes to simulate larger ones.
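- Set `seed` explicitly for reproducibility (it defaults to 42, but stating it makes the choice visible).
- Use `load_best_model_at_end=True` with `metric_for_best_model` for automatic model selection; this requires the save strategy to match the evaluation strategy.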
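
The arithmetic behind gradient accumulation can be sketched as follows. `training_plan` is a hypothetical helper, and the step count is an approximation: how the Trainer treats a final, partial accumulation window can differ between versions. The numbers use the 2,000‑example subset and batch size of 16 from later in this tutorial.

```python
import math


def training_plan(num_examples, per_device_batch_size, num_devices=1,
                  gradient_accumulation_steps=1, num_epochs=1):
    # Examples that contribute to a single optimizer update.
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    # Approximation: one optimizer update per effective batch, rounding up.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


print(training_plan(2000, 16, num_epochs=2))                                # (16, 250)
print(training_plan(2000, 4, gradient_accumulation_steps=4, num_epochs=2))  # (16, 250)
```

Both configurations update the weights with 16 examples per step, but the second only holds 4 examples in memory at a time.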


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/mnt/data/imdb_trainer",
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="/mnt/data/logs",
    logging_steps=100,
    push_to_hub=False,
)

## 4) Define Evaluation Metrics

Trainer does not hardcode metrics. Instead, you provide a function (`compute_metrics`) that computes them from predictions.

### How it works
- During evaluation, the Trainer collects the raw logits and labels as NumPy arrays.
- It calls `compute_metrics(eval_pred)` with an `EvalPrediction` object, which unpacks like a tuple `(logits, labels)`.
- You return a dictionary of metrics to be logged and displayed.

We’ll use the `evaluate` library to compute **accuracy**.

> You can also compute F1, precision, recall, BLEU, ROUGE, etc.


In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return accuracy.compute(predictions=preds, references=labels)
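
If you want more than accuracy, the same hook can return several values at once. The sketch below computes accuracy, precision, recall, and F1 for the positive class by hand, using only built‑ins so that it runs without downloading a metric. `compute_metrics_with_f1` is a hypothetical name and the logits are invented; for real projects the `evaluate` metrics are the more robust choice.

```python
def compute_metrics_with_f1(eval_pred):
    logits, labels = eval_pred
    # argmax per row; iterating row by row also works on NumPy arrays.
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    accuracy = sum(1 for p, y in pairs if p == y) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


fake_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]  # made-up scores
fake_labels = [0, 1, 0, 0]
metrics = compute_metrics_with_f1((fake_logits, fake_labels))
print(metrics)
```

Every key in the returned dictionary shows up in the evaluation logs, prefixed with `eval_`.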

## 5) Create the Trainer

The `Trainer` object brings everything together:
- The model (with classification head).
- The dataset (train/eval splits).
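- The tokenizer (saved alongside the model for the Hub, and used to pad batches by default). Recent releases expect it as `processing_class` and deprecate the `tokenizer` argument.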
- The training arguments.
- The metric function.

Under the hood, `.train()` will:
1. Shuffle and batch the dataset.
2. Run forward passes and compute loss.
3. Backpropagate gradients and update weights.
4. Evaluate at specified intervals.
5. Save checkpoints.
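
The five steps above can be sketched in miniature. This is not the Trainer's implementation: it is a dependency‑free toy that trains a one‑feature logistic regression with plain SGD, purely to show the shape of the loop. All names (`train`, `mean_loss`, the synthetic `data`) are invented for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(data, w, b):
    # Binary cross-entropy averaged over the dataset (the "evaluate" step).
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train(data, epochs=3, batch_size=4, lr=0.5, seed=42):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)                               # 1. shuffle ...
        for start in range(0, len(order), batch_size):   #    ... and batch
            batch = order[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                p = sigmoid(w * x + b)                   # 2. forward pass
                grad_w += (p - y) * x                    # 3. gradient of the loss
                grad_b += p - y
            w -= lr * grad_w / len(batch)                #    weight update (plain SGD)
            b -= lr * grad_b / len(batch)
        history.append(mean_loss(data, w, b))            # 4. evaluate each epoch
        # 5. a real Trainer would also write a checkpoint at this point
    return w, b, history


# Synthetic data: positive inputs are labeled 1, negative inputs 0.
data = [(x / 4.0, 1 if x > 0 else 0) for x in range(-8, 9) if x != 0]
w, b, history = train(data)
print(history)
```

The real loop adds much more (gradient accumulation, mixed precision, distributed execution, logging), but the skeleton is the same.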


In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(1000)),
    tokenizer=tokenizer,  # recent releases: processing_class=tokenizer
    compute_metrics=compute_metrics,
)

## 6) Train the Model

Calling `.train()` launches the full training loop.

- Logs training loss periodically.
- Runs evaluation at the end of each epoch (per `evaluation_strategy`).
- Saves checkpoints into `output_dir`.
- Supports resuming if training is interrupted.

> Checkpoints contain: model weights, optimizer state, scheduler state, RNG state.


In [None]:
trainer.train()

## 7) Evaluate & Predict

After training, you’ll want to check performance and make predictions.

- `.evaluate()` runs the evaluation loop on a dataset and returns metrics.
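- `.predict()` returns a `PredictionOutput` with `predictions` (raw logits), `label_ids` (true labels, if the dataset has them), and `metrics` (if labels and `compute_metrics` are available).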

### Note
The logits must be converted to predictions (e.g., `argmax` for classification).


In [None]:
results = trainer.evaluate()
print(results)

preds = trainer.predict(tokenized["test"].select(range(10)))
preds.predictions.argmax(axis=-1), preds.label_ids

## 8) Save, Load, and Resume Checkpoints

Trainer automatically saves checkpoints, but you can also call `.save_model()`.

- **Manual save**: writes the model weights and config, plus the tokenizer files when a tokenizer was passed to the Trainer.
- **Resume**: `trainer.train(resume_from_checkpoint=True)` continues training from the last checkpoint in `output_dir`.

> Because checkpoints also store optimizer, scheduler, and RNG state, an interrupted run can pick up where it left off.


In [None]:
trainer.save_model("/mnt/data/imdb_trained_model")

# Reloading
from transformers import AutoModelForSequenceClassification
reloaded = AutoModelForSequenceClassification.from_pretrained("/mnt/data/imdb_trained_model")
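
Checkpoints land in `output_dir` as folders named `checkpoint-<step>`. The sketch below shows one way to find the most recent one, comparing step numbers numerically (a plain string sort would rank `checkpoint-250` after `checkpoint-1000`). `find_last_checkpoint` is a hand‑written illustration using a temporary directory; `transformers.trainer_utils` ships a `get_last_checkpoint` helper for the same purpose.

```python
import os
import re
import tempfile

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_PATTERN.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Simulate an output_dir that holds three checkpoints.
with tempfile.TemporaryDirectory() as output_dir:
    for step in (125, 250, 1000):
        os.makedirs(os.path.join(output_dir, f"checkpoint-{step}"))
    last = find_last_checkpoint(output_dir)
    print(os.path.basename(last))  # checkpoint-1000
```

The resulting path can be passed as `resume_from_checkpoint` instead of `True` when you want to resume from a specific folder.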

## 9) Push to the Hub

The Hugging Face Hub is the central place to share models.

### How it works
1. Authenticate with `huggingface-cli login` or `huggingface_hub.login()`.
2. Set `push_to_hub=True` in `TrainingArguments` **or** call `trainer.push_to_hub()`.
3. A new repo is created under your account if it doesn’t exist. It is named after `output_dir` unless you set `hub_model_id`.
4. Each push creates a new commit with model weights and config.

> Include a **Model Card** (README.md) to document dataset, metrics, and intended use.


In [None]:
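# The first positional argument of Trainer.push_to_hub is the commit message, not the repo name.
# To choose the repo name, set hub_model_id="my-imdb-distilbert" in TrainingArguments.
# trainer.push_to_hub(commit_message="Fine-tune DistilBERT on IMDB")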

## 10) Advanced: Custom Data Collator

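A **data collator** controls how individual examples are merged into a batch.  
Without a tokenizer, the Trainer falls back to a default collator that simply stacks features; when a tokenizer is passed (as in section 5), it already uses `DataCollatorWithPadding`. Passing the collator explicitly makes the **dynamic padding** visible and configurable.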

- `DataCollatorWithPadding` pads each batch to the length of the longest sequence in that batch.
- This is more memory efficient than static padding to a fixed length.

You can also write a fully custom collator (e.g., for multi‑task learning).


In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(1000)),
    tokenizer=tokenizer,  # recent releases: processing_class=tokenizer
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
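
To see why dynamic padding saves memory, compare the two strategies on a tiny batch. `pad_batch` is a simplified stand‑in written for this example, not the library collator, and the token IDs are made up. It assumes the sequences were already truncated to the target length.

```python
def pad_batch(sequences, pad_id=0, length=None):
    """Pad token-ID lists to `length`, or to the longest sequence if `length` is None."""
    target = length if length is not None else max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (target - len(s)) for s in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(s) + [0] * (target - len(s)) for s in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7592, 102], [101, 2023, 3185, 2001, 2307, 102]]  # made-up IDs
dynamic = pad_batch(batch)             # pad to the longest row in this batch
static = pad_batch(batch, length=512)  # pad every row to a fixed length

dynamic_tokens = sum(len(row) for row in dynamic["input_ids"])
static_tokens = sum(len(row) for row in static["input_ids"])
print(dynamic_tokens, static_tokens)  # 12 1024
```

For this batch the model processes 12 token positions instead of 1,024. The gap narrows when a batch contains a long review, but short batches stay cheap.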

## 11) Advanced: Custom Callbacks

`TrainerCallback` lets you hook into training events.  
Useful for:
- Early stopping
- Custom logging
- Learning rate adjustments
- Notifications (e.g., Slack/Discord alerts)

Callbacks are called at events like: `on_train_begin`, `on_epoch_end`, `on_log`, etc.


In [None]:
from transformers import TrainerCallback

class PrinterCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        print("Log:", logs)

trainer.add_callback(PrinterCallback)
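
Early stopping is a common callback use case, and the library ships an `EarlyStoppingCallback` for it. The core logic is small enough to sketch on its own. `EarlyStopper` below is a hypothetical, framework‑free helper with invented accuracy values; inside a real `TrainerCallback` you would run the same check in `on_evaluate` and set `control.should_training_stop = True`.

```python
class EarlyStopper:
    """Signals a stop after `patience` evaluations in a row without improvement."""

    def __init__(self, patience=2, greater_is_better=True):
        self.patience = patience
        self.greater_is_better = greater_is_better
        self.best = None
        self.bad_evals = 0

    def update(self, metric):
        if self.best is None:
            improved = True
        elif self.greater_is_better:
            improved = metric > self.best
        else:
            improved = metric < self.best
        if improved:
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        # True means "stop training now".
        return self.bad_evals >= self.patience


stopper = EarlyStopper(patience=2)
accuracies = [0.81, 0.85, 0.84, 0.85, 0.83]  # made-up per-epoch eval accuracy
stopped_at = None
for epoch, acc in enumerate(accuracies, start=1):
    if stopper.update(acc):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)  # 4 0.85
```

Matching the previous best (0.85 in epoch 4) does not count as an improvement here, so training stops after two non‑improving evaluations.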

## 12) Wrap‑Up & Next Steps

You’ve now seen the **Trainer API** in action, with both basic and advanced features.

### Key takeaways
- **Trainer** abstracts the boilerplate but is highly configurable.
- **TrainingArguments** is the central config object — learn its parameters well.
- Use `compute_metrics` to evaluate meaningfully.
- Save and resume checkpoints so long runs survive interruptions and results can be reproduced.
- Push to the Hub to share and collaborate.

### Where to go next
- Try multi‑GPU training with `accelerate` integration.
- Explore mixed‑precision training (`fp16`/`bf16`) for faster runs.
- Experiment with custom schedulers and optimizers.
- Build advanced callbacks (early stopping, metric‑based LR scheduling).

**Resources:**
- [Trainer documentation](https://huggingface.co/docs/transformers/main/en/main_classes/trainer)
- [TrainingArguments reference](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments)
- [Transformers examples](https://github.com/huggingface/transformers/tree/main/examples)
