# **Adding Advanced Features to the Training Loop**

This section covers several advanced techniques that are commonly used during pre-training and that apply to fine-tuning as well.

Specifically, we will cover:
- Learning rate warmup
- Cosine decay
- Gradient clipping
- A modified training loop that combines all three techniques
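
As a rough preview of how these pieces fit together, the sketch below computes a learning rate per step (linear warmup followed by cosine decay) and rescales a gradient vector by its L2 norm. It is a minimal, framework-free illustration, not the implementation used later in this notebook: `lr_at_step`, `clip_by_global_norm`, and all numeric values are hypothetical names and settings chosen for the example. In PyTorch, the clipping step is usually handled by `torch.nn.utils.clip_grad_norm_`.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, initial_lr, peak_lr, min_lr):
    """Learning rate for a 0-indexed optimizer step (hypothetical helper)."""
    if step < warmup_steps:
        # Linear warmup: ramp from initial_lr up to peak_lr.
        return initial_lr + (peak_lr - initial_lr) * (step + 1) / warmup_steps
    # Cosine decay: progress runs from 0 to 1 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# Illustrative values only (assumed, not taken from this notebook).
schedule = [
    lr_at_step(s, total_steps=100, warmup_steps=10,
               initial_lr=1e-5, peak_lr=5e-4, min_lr=1e-5)
    for s in range(100)
]
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

The schedule rises during the first 10 steps, reaches the peak learning rate, and then decays smoothly towards `min_lr`. The clipped gradient keeps its direction but has its norm capped at `max_norm`.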

## **Setup**

In [1]:
from importlib.metadata import version
import torch

print(f"Torch version: {version('torch')}")

Torch version: 2.7.0


In [None]:
from pathlib import Path
PROJECT_ROOT = Path().absolute().parent

In [2]:
from utils.components import GPTModel, create_dataloader_v1

GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(42)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)  # Move the model to the selected device (no-op on CPU)
model.eval();  # Evaluation mode: disables dropout

In [None]:
data = "data/the-law-bastiat.txt"
data_path = PROJECT_ROOT / data

with open(data_path, "r", encoding="utf-8") as file:
    text_data = file.read()

In [15]:
# Fraction of the text used for training; the remainder is used for validation
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))

torch.manual_seed(42)

train_loader = create_dataloader_v1(
    text_data[:split_idx],
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    text_data[split_idx:],
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)
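
With a small text file and a 90/10 split, it helps to estimate how many batches each loader yields, because a validation split shorter than one context window produces no batches at all. The sketch below assumes that `create_dataloader_v1` builds samples with a sliding window over token IDs, with targets shifted by one token. That is an assumption about the helper, not something shown in this notebook. `count_batches` and the token counts are hypothetical.

```python
import math

def count_batches(n_tokens, max_length, stride, batch_size, drop_last):
    """Estimate batches per epoch for a sliding-window dataset (hypothetical helper)."""
    # Each sample needs max_length input tokens plus one more token
    # for the shifted target, hence the upper bound n_tokens - max_length.
    n_samples = len(range(0, n_tokens - max_length, stride))
    if drop_last:
        # An incomplete final batch is discarded.
        return n_samples // batch_size
    # An incomplete final batch is kept.
    return math.ceil(n_samples / batch_size)

# Illustrative token counts only; the real counts depend on the tokenizer and the text.
train_batches = count_batches(n_tokens=5000, max_length=256, stride=256,
                              batch_size=2, drop_last=True)
val_batches = count_batches(n_tokens=600, max_length=256, stride=256,
                            batch_size=2, drop_last=False)
```

On the real loaders, `len(train_loader)` and `len(val_loader)` give the actual batch counts, so comparing them with an estimate like this is a quick sanity check before training.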

## 1. Learning Rate Warmup