## Understanding LLM Training Workloads

In the lecture we looked at how to calculate FLOPs (theoretically) and built a mental model of where most of the compute is spent. We also built intuition for how compute patterns differ substantially between LLM training and inference. Here we analyze some simple training and inference workloads on a Llama-like model.
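As a quick refresher on the theoretical FLOP accounting, here is a minimal sketch using the common rule of thumb for dense transformers: about 2 FLOPs per parameter per token for a forward pass, and about 3x that for forward plus backward. It ignores the attention-score term, and the function names and the 1B parameter count are placeholders chosen for illustration.

```python
def forward_flops(num_params: int, num_tokens: int) -> int:
    """Roughly 2 FLOPs (one multiply, one add) per parameter per token."""
    return 2 * num_params * num_tokens


def training_flops(num_params: int, num_tokens: int) -> int:
    """Backward costs about twice the forward pass, so ~6 FLOPs per parameter per token."""
    return 3 * forward_flops(num_params, num_tokens)


n_params = 1_000_000_000   # hypothetical 1B-parameter model
tokens_per_step = 1 * 256  # e.g. batch size 1 with 256 tokens

print(f"forward only : {forward_flops(n_params, tokens_per_step):.3e} FLOPs")
print(f"training step: {training_flops(n_params, tokens_per_step):.3e} FLOPs")
```

The same token count costs about three times more in training than in a forward-only pass, which is one reason the two workloads look so different in a trace.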

#### Install prerequisites

In [None]:
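# Quote the version specifiers so the shell does not treat `>` as output redirection.
# transformers>=4.41.0 is needed for the `eval_strategy` argument used below.
! pip install "torch>=2.0.0" "transformers>=4.41.0" "datasets>=2.10.0" "accelerate>=0.20.0" --quiet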

In [None]:
import torch
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM, 
    LlamaForCausalLM,
    LlamaConfig,
    TrainingArguments, 
    Trainer,
    DataCollatorForLanguageModeling,
    AutoConfig
)
from datasets import load_dataset
from typing import Tuple, List

In [2]:
import os
# HF Trainer complains and tries to initialize distributed training; we set these flags to make sure we train on one GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"

# Disable wandb and other logging services completely
os.environ["WANDB_DISABLED"] = "true"
os.environ["WANDB_MODE"] = "disabled"
os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "true"


In [None]:
!nvidia-smi

## Standard Hugging Face inference code for a pre-trained model

You've probably seen this many times by now :)

In [4]:
def load_model(model_id: str) -> AutoModelForCausalLM:
    return AutoModelForCausalLM.from_pretrained(model_id).cuda()

def load_tokenizer(model_id: str) -> AutoTokenizer:
    return AutoTokenizer.from_pretrained(model_id)

def inference(model, tokenizer, inputs: List[str]):
    model_inputs = tokenizer(inputs, return_tensors="pt").to("cuda")
    generated_ids = model.generate(**model_inputs, 
                max_new_tokens=30,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)

In [5]:
# model_id = "meta-llama/Llama-3.2-1B-Instruct"
# model = load_model(model_id)
# tokenizer = load_tokenizer(model_id)
# print (inference(model, tokenizer, ["What is the capital of Germany"]))

# # free up the GPU
# del model
# torch.cuda.empty_cache()

## Helper functions for model training

In [6]:
def prepare_dataset(tokenizer, max_length=512):
    """Load and prepare a simple dataset from HF Hub"""
    
    # Using a small, simple dataset
    dataset = load_dataset("roneneldan/TinyStories", split="train")
    
    # Take only a small subset for demo (first 1000 examples)
    dataset = dataset.select(range(1000))
    print(f"Dataset size: {len(dataset)} examples")
    
    def tokenize_function(examples):
        """Tokenize the text data"""
        # Tokenize and truncate to max_length
        tokenized = tokenizer(
            examples["text"], 
            truncation=True, 
            padding=False, 
            max_length=max_length,
            return_overflowing_tokens=False,
        )
        return tokenized
    
    # Tokenize the dataset
    print("Tokenizing dataset...")
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset.column_names,
        desc="Tokenizing"
    )
    
    # Split into train/eval (90/10 split)
    train_size = int(0.9 * len(tokenized_dataset))
    train_dataset = tokenized_dataset.select(range(train_size))
    eval_dataset = tokenized_dataset.select(range(train_size, len(tokenized_dataset)))
    
    print(f"Train examples: {len(train_dataset)}")
    print(f"Eval examples: {len(eval_dataset)}")
    
    return train_dataset, eval_dataset

def train(config: dict):
    print("🚀 Starting Configurable Qwen Training (Single GPU)")
    print("=" * 50)
    
    BATCH_SIZE = 1       # Adjust based on GPU memory
    MAX_LENGTH = 256     # Sequence length
    
    # config = LlamaConfig(**config)
    config = AutoConfig.for_model(**config)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    
    # 1. Load tokenizer (we use the Qwen tokenizer but create our own randomly initialized model)
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")
    
    # 2. Create model with random initialization
    print("Creating model with random initialization...")
    # model = LlamaForCausalLM(config).to(dtype=torch.bfloat16, device=device)
    model = AutoModelForCausalLM.from_config(config).cuda()
    
    # Add padding token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = tokenizer.eos_token_id
    
    print(f"📊 Actual model parameters: {model.num_parameters():,}")
    print(f"🎯 Model device: {next(model.parameters()).device}")
    
    # 3. Prepare dataset
    train_dataset, eval_dataset = prepare_dataset(tokenizer, max_length=MAX_LENGTH)
    
    # 4. Data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )
    
    # 5. Training arguments
    training_args = TrainingArguments(
        output_dir="./toy-llama-training",
        overwrite_output_dir=True,
        
        # Training setup
        max_steps=10,  # Only 10 steps for profiling
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        
        # Optimization
        learning_rate=5e-4,
        warmup_steps=50,
        weight_decay=0.01,
        
        # Logging and evaluation
        logging_steps=5,
        eval_steps=50,       # Ignored while eval_strategy="no"
        save_steps=50,       # Ignored while save_strategy="no"
        eval_strategy="no",  # Disable evaluation for profiling
        save_strategy="no",  # Disable saving for profiling
        
        # Single GPU settings
        bf16=True,
        dataloader_pin_memory=False,
        dataloader_num_workers=0,
        local_rank=-1,
        ddp_find_unused_parameters=False,
        
        # Misc
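        load_best_model_at_end=False,  # No eval or checkpoints in this run, so there is no "best" model to load
        report_to="none",              # Explicitly disable all logging integrations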
        remove_unused_columns=False,
    )
    
    print(f"\n🔧 Training Configuration:")
    print(f"   • Batch Size: {BATCH_SIZE}")
    print(f"   • Max Sequence Length: {MAX_LENGTH}")
    print(f"   • Learning Rate: {training_args.learning_rate}")
    print(f"   • Mixed Precision: {training_args.bf16}")
    
    # 6. Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
    )
    
    # 7. Train the model
    print("\n🏋️ Starting training...")
    trainer.train()
    print("✅ Training completed!")

## Exercise 1: Fit model on the GPU by playing with config parameters [20 min]

### 1A. Plot peak memory as a function of num_hidden_layers (number of transformer layers)

### 1B. Plot peak memory as a function of hidden_size (size of model embedding)

### 1C. Plot peak memory as a function of intermediate_size (This is the hidden size of MLP)

On Colab, the GPU is not large enough to train a 1B Llama model. Play with the parameters in the config so that the model weights, gradients, optimizer state, and activations all fit on one GPU.

Hint: recall where most of the parameter count lies
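To make this hint concrete, the sketch below counts parameters per component straight from a config dict, without instantiating the model. `count_params` is a hypothetical helper, not part of `transformers`. It assumes a Llama/Qwen2-style decoder (gated MLP with three projections, two RMSNorms per layer) and that only `qwen2` models add biases to the q/k/v projections. The memory line assumes fp32 weights, fp32 gradients, and two fp32 AdamW moments (16 bytes per parameter), and leaves out activations entirely.

```python
def count_params(cfg: dict) -> dict:
    """Approximate parameter count of a Llama/Qwen2-style decoder, split by component."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = head_dim * cfg["num_key_value_heads"]

    attn = h * h + 2 * h * kv_dim + h * h       # q, k, v, o projection weights
    if cfg.get("model_type") == "qwen2":        # assumption: only Qwen2 adds q/k/v biases
        attn += h + 2 * kv_dim
    mlp = 3 * h * cfg["intermediate_size"]      # gate, up, and down projections
    norms = 2 * h                               # two RMSNorm weight vectors per layer
    layers = cfg["num_hidden_layers"] * (attn + mlp + norms)

    embed = cfg["vocab_size"] * h
    lm_head = 0 if cfg.get("tie_word_embeddings", False) else embed
    final_norm = h
    return {
        "embeddings": embed,
        "lm_head": lm_head,
        "layers": layers,
        "final_norm": final_norm,
        "total": embed + lm_head + layers + final_norm,
    }


# Same values as the Qwen-style config used in this exercise.
qwen_cfg = {
    "vocab_size": 151936,
    "hidden_size": 2048,
    "intermediate_size": 11008,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "num_key_value_heads": 16,
    "model_type": "qwen2",
    "tie_word_embeddings": False,
}

counts = count_params(qwen_cfg)
for name, n in counts.items():
    print(f"{name:>10}: {n:>15,}")

# Assumption: fp32 weights (4 B) + fp32 grads (4 B) + two fp32 AdamW moments (8 B).
BYTES_PER_PARAM = 16
static_gib = counts["total"] * BYTES_PER_PARAM / 1024**3
print(f"weights + grads + optimizer state: {static_gib:.1f} GiB (activations not included)")
```

Under these assumptions the static state alone comes to roughly 39 GiB, so shrinking `num_hidden_layers`, `intermediate_size`, or `hidden_size` is the natural place to start.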

Hint: For peak memory, you can call `torch.cuda.max_memory_allocated() / 1024**3` at an appropriate place in the `train` function.
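One possible way to take this measurement is to wrap a call and reset the peak counter first, so each run reports its own peak rather than the maximum since the process started. `peak_memory_gib` is a hypothetical helper name; it returns `None` for the peak when no CUDA device is present.

```python
import torch


def peak_memory_gib(fn, *args, **kwargs):
    """Run fn and return (result, peak GPU memory in GiB). Peak is None without CUDA."""
    if not torch.cuda.is_available():
        return fn(*args, **kwargs), None
    torch.cuda.empty_cache()              # release cached blocks left over from earlier runs
    torch.cuda.reset_peak_memory_stats()  # start the peak counter from the current usage
    result = fn(*args, **kwargs)
    torch.cuda.synchronize()              # make sure all queued kernels have finished
    return result, torch.cuda.max_memory_allocated() / 1024**3


def toy_workload():
    """Small stand-in workload: one matrix multiply on the available device."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(512, 512, device=device)
    return (a @ a).sum().item()


_, peak = peak_memory_gib(toy_workload)
print("peak memory (GiB):", peak)

# In the notebook, a sweep could look like this (uses `train` and `config` from above):
# for n_layers in (2, 4, 8):
#     _, peak = peak_memory_gib(train, {**config, "num_hidden_layers": n_layers})
#     print(n_layers, peak)
```

Memory still held by objects from a previous run (for example a model that is still referenced) counts toward the new peak, so it helps to delete those references between sweep points.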

In [7]:
# # original llama 1b config
# config = {
#     "vocab_size": 128256,
#     "hidden_size": 2048,
#     "intermediate_size": 8192,
#     "num_hidden_layers": 16,
#     "num_attention_heads": 32,
#     "num_key_value_heads": 8,
#     "max_position_embeddings": 131072,
#     "rms_norm_eps": 1e-05,
#     "rope_theta": 500000.0,
#     "attention_dropout": 0.0,
#     "hidden_dropout": 0.0,
#     "hidden_act": "silu",
#     "mlp_bias": False,
#     "attention_bias": False,
# }

# Qwen 1.8B-style config (as written, it instantiates about 2.65B parameters; see the output below)
config = {
        "vocab_size": 151936,
        "hidden_size": 2048,
        "intermediate_size": 11008,
        "num_hidden_layers": 24,
        "num_attention_heads": 16,
        "num_key_value_heads": 16,
        "max_position_embeddings": 32768,
        "rms_norm_eps": 1e-6,
        "rope_theta": 1000000.0,
        "attention_dropout": 0.0,
        "hidden_dropout": 0.0,
        "hidden_act": "silu",
        "model_type": "qwen2",
        "tie_word_embeddings": False,
    }

In [8]:
train(config)

🚀 Starting Configurable Qwen Training (Single GPU)
Using device: cuda:0
Loading tokenizer...
Creating model with random initialization...


Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


📊 Actual model parameters: 2,648,426,496
🎯 Model device: cuda:0


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Dataset size: 1000 examples
Tokenizing dataset...
Train examples: 900
Eval examples: 100

🔧 Training Configuration:
   • Batch Size: 1
   • Max Sequence Length: 256
   • Learning Rate: 0.0005
   • Mixed Precision: True

🏋️ Starting training...


OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 1.81 MiB is free. Including non-PyTorch memory, this process has 23.64 GiB memory in use. Of the allocated memory 23.01 GiB is allocated by PyTorch, and 47.82 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [15]:
# your code here

## [Stretch] Use PyTorch Profiler to profile training and inference of the above config [20 min]

Generate a trace to analyze the workloads for training and inference of the same model. See documentation here: https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
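As a starting point, here is a minimal sketch of `torch.profiler` on a toy MLP rather than the full model, so it also runs on CPU. The toy model, the `run_profiled` helper, and the trace file names are placeholders; for the exercise you would wrap `trainer.train()` and `model.generate(...)` instead.

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.SiLU(), torch.nn.Linear(256, 64)
)
x = torch.randn(8, 64)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)


def train_step():
    """Forward and backward pass, as in training."""
    model.zero_grad()
    model(x).pow(2).mean().backward()


def inference_step():
    """Forward pass only, with autograd disabled."""
    with torch.no_grad():
        model(x)


def run_profiled(step_fn, trace_path):
    """Profile one call of step_fn and write a Chrome-format trace to trace_path."""
    with profile(activities=activities, record_shapes=True) as prof:
        step_fn()
    prof.export_chrome_trace(trace_path)
    return prof


train_prof = run_profiled(train_step, "toy_train_trace.json")
infer_prof = run_profiled(inference_step, "toy_inference_trace.json")

print(train_prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
print(infer_prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

The exported JSON files can be opened in a trace viewer such as `chrome://tracing` or Perfetto. The training trace should contain backward operators that are absent from the inference trace.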

In [17]:
# your code here

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


'Capital of Germany is Berlin\nBerlin is the capital and largest city of Germany, located in the central part of the country'

## Exercise 2: Activation Checkpointing [20 min]

Add activation checkpointing after every transformer layer in the `train` function and analyze its effect on peak memory.
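To illustrate the mechanism on something small, the sketch below applies `torch.utils.checkpoint.checkpoint` to every block of a toy residual MLP stack and checks that the gradients match the non-checkpointed run. `ToyBlock` and `ToyStack` are made-up stand-ins for transformer layers. For the Hugging Face model in `train`, the usual switches are `model.gradient_checkpointing_enable()` or `gradient_checkpointing=True` in `TrainingArguments`.

```python
import torch
from torch.utils.checkpoint import checkpoint


class ToyBlock(torch.nn.Module):
    """Residual MLP block standing in for a transformer layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.SiLU(), torch.nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.net(x)


class ToyStack(torch.nn.Module):
    def __init__(self, dim: int = 32, depth: int = 4, use_checkpointing: bool = False):
        super().__init__()
        self.blocks = torch.nn.ModuleList(ToyBlock(dim) for _ in range(depth))
        self.use_checkpointing = use_checkpointing

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpointing and self.training:
                # Keep only the block input; recompute the block's activations in backward.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


torch.manual_seed(0)
model = ToyStack()
x = torch.randn(8, 32)


def grads(use_checkpointing: bool):
    """Return parameter gradients for one forward/backward pass."""
    model.use_checkpointing = use_checkpointing
    model.zero_grad()
    model(x).pow(2).mean().backward()
    return [p.grad.clone() for p in model.parameters()]


plain, ckpt = grads(False), grads(True)
grads_match = all(torch.allclose(a, b, atol=1e-6) for a, b in zip(plain, ckpt))
print("gradients match:", grads_match)
```

Checkpointing trades extra forward compute for lower activation memory, so its benefit shows up in peak memory for activations, not in the weights, gradients, or optimizer state.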

In [None]:
# your code here

## Exercise 3: Gradient Accumulation [20 min]

So far we trained with a batch size of 1. Increase the (global) batch size to a larger value and make it run on a single GPU by passing gradient accumulation settings to `TrainingArguments` in the `train` function.
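The idea behind gradient accumulation can be checked on a toy model: summing the gradients of several micro-batches, each with its loss divided by the number of accumulation steps, gives the same gradient as one pass over the full batch. The linear model and the sizes below are arbitrary choices for illustration. In `train`, the corresponding knob is `gradient_accumulation_steps` in `TrainingArguments`.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 1)
loss_fn = torch.nn.MSELoss()
inputs, targets = torch.randn(8, 16), torch.randn(8, 1)

# Reference: one backward pass over the full batch of 8 examples.
model.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2 examples each, no optimizer step in between.
accum_steps = 4
model.zero_grad()
for micro_x, micro_y in zip(inputs.chunk(accum_steps), targets.chunk(accum_steps)):
    # Scale each micro-batch loss so the summed gradients equal the full-batch mean.
    loss = loss_fn(model(micro_x), micro_y) / accum_steps
    loss.backward()  # gradients add up in .grad across iterations
accumulated_grad = model.weight.grad.clone()

same = torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6)
print("accumulated gradient matches full batch:", same)
```

On a single GPU the effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`, while activation memory stays at the level of one micro-batch.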

In [None]:
# your code here