# Scaling to Multi-GPU Training

This notebook demonstrates techniques for scaling large language model training to multiple GPUs.

## 1. Setup and Imports

First, let's import the necessary libraries and check our versions:

In [1]:
import os
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
from accelerate import FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
import math

print(f"transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")

transformers version: 4.45.1
PyTorch version: 2.4.1+cu121
CUDA available: False
GPU count: 0


## 2. Load Pre-trained Model and Dataset

We'll use `gpt2-medium`, a larger GPT-2 checkpoint, together with the WikiText-103 dataset:

In [None]:
model_name = "gpt2-medium"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

dataset = load_dataset("wikitext", "wikitext-103-v1")

## 3. Implement Fully Sharded Data Parallel (FSDP) Training

We'll use PyTorch's FSDP for efficient multi-GPU training:

In [4]:
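# Gather full (unsharded) model and optimizer state dicts, offloaded to CPU, when saving.
# Note: this plugin object is never passed to the Trainer below, which reads its FSDP
# settings from TrainingArguments. It would apply to a manual
# accelerate.Accelerator(fsdp_plugin=fsdp_plugin) setup.
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)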

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    fp16=True,  # mixed precision; needs a supported accelerator such as a CUDA GPU
    save_steps=1000,
    logging_steps=100,
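    fsdp="full_shard",  # shard parameters, gradients, and optimizer state across ranks
    # Assumption to verify for your Transformers version: CPU offload may need to be
    # requested as fsdp="full_shard offload" rather than through fsdp_config.
    fsdp_config={"fsdp_offload_params": True, "fsdp_state_dict_type": "FULL_STATE_DICT"},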
)
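
Because every process runs its own micro-batches, the number of sequences behind one optimizer step depends on the GPU count as well as on `per_device_train_batch_size` and `gradient_accumulation_steps`. A minimal sketch of that arithmetic follows; `effective_batch_size` is a hypothetical helper, not part of Transformers, and the GPU counts are illustrative.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         world_size: int) -> int:
    """Sequences consumed per optimizer step across all data-parallel processes."""
    if min(per_device_batch_size, gradient_accumulation_steps, world_size) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_device_batch_size * gradient_accumulation_steps * world_size


# Batch settings taken from the TrainingArguments above; GPU counts are illustrative.
for n_gpus in (1, 2, 4, 8):
    print(f"{n_gpus} GPU(s): {effective_batch_size(4, 4, n_gpus)} sequences per step")
```

If the effective batch size changes when GPUs are added, the learning rate may need to be retuned.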

## 4. Use Mixed Precision Training

Mixed precision is already enabled by `fp16=True` in the training arguments above, so no extra code is needed in this section.
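
Whether `fp16` is the right flag depends on the hardware: it needs a GPU, and some GPUs also support bfloat16, which `TrainingArguments` exposes as `bf16`. The sketch below shows one simple way to choose between them. `precision_flags` is a hypothetical helper and the selection rule is an assumption, not a recommendation from the Transformers documentation.

```python
def precision_flags(cuda_available: bool, bf16_supported: bool) -> dict:
    """Return TrainingArguments-style precision flags for the detected hardware."""
    if not cuda_available:
        # No GPU detected: stay in full precision.
        return {"fp16": False, "bf16": False}
    if bf16_supported:
        # bfloat16 has the same exponent range as float32, so it typically
        # does not need loss scaling.
        return {"fp16": False, "bf16": True}
    return {"fp16": True, "bf16": False}


# Matches the environment printed in the setup cell (no CUDA device).
print(precision_flags(cuda_available=False, bf16_supported=False))
```

In the notebook, the two inputs could come from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`, and the resulting dictionary could be unpacked into `TrainingArguments` in place of the hard-coded `fp16=True`.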

## 5. Apply Per-Parameter FSDP for Flexible Sharding

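The `Trainer` applies FSDP wrapping itself when `fsdp` is set in the training arguments, so this notebook needs no manual wrapping code. Note that PyTorch's per-parameter sharding API (`fully_shard`, often called FSDP2) is a separate implementation from the `FullyShardedDataParallel` wrapper whose configuration classes are imported above; which of the two the `Trainer` uses depends on the installed Transformers and Accelerate versions.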
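
Per-parameter sharding means that each parameter tensor is split across ranks on its own, rather than being flattened together with its neighbours first. The sketch below illustrates the idea in plain Python, assuming even chunking along the first dimension in the style of `torch.chunk`. `dim0_shard_ranges` is a hypothetical helper, not a PyTorch or Transformers API, and real implementations also handle padding and communication.

```python
import math


def dim0_shard_ranges(num_rows: int, world_size: int) -> list[tuple[int, int]]:
    """Half-open row range [start, stop) that each rank owns for one parameter."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    chunk = math.ceil(num_rows / world_size)
    ranges = []
    for rank in range(world_size):
        start = min(rank * chunk, num_rows)
        stop = min(start + chunk, num_rows)
        ranges.append((start, stop))
    return ranges


# A weight with 1024 rows splits evenly across 4 ranks.
print(dim0_shard_ranges(1024, 4))
# With 5 rows and 4 ranks, the last rank ends up with an empty range.
print(dim0_shard_ranges(5, 4))
```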

## 6. Implement Efficient Data Loading

In [None]:
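def tokenize_function(examples):
    # Truncate to 512 tokens and pad every example to that length,
    # so all batches share one fixed shape.
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")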

tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=dataset["train"].column_names,
)

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
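
Padding every line to 512 tokens spends compute on padding whenever the text is short. A common alternative for causal language modelling is to tokenize without padding, concatenate the token sequences, and cut them into fixed-size blocks. The sketch below shows that alternative, which is not what this notebook does; `group_texts` and the toy token ids are hypothetical.

```python
from itertools import chain


def group_texts(examples: dict, block_size: int = 512) -> dict:
    """Concatenate tokenized sequences and cut them into equal-length blocks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the final partial block so every example has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    return {
        k: [seq[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, seq in concatenated.items()
    }


# Toy batch shaped like tokenizer output (token ids are made up).
batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
print(group_texts(batch, block_size=4))
```

Under these assumptions, the function would be applied with a second `dataset.map(group_texts, batched=True)` call after tokenizing without `padding` and `truncation`.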

## 7. Set Up the Trainer and Start Training

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
)

trainer.train()
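
FSDP shards state across processes, so this cell only uses several GPUs when the code is started by a distributed launcher such as `torchrun` or `accelerate launch`, which starts one process per GPU. The sketch below is one way to check that from inside the script, assuming the launcher exports the standard `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables; `distributed_env` is a hypothetical helper.

```python
import os


def distributed_env(environ=None) -> dict:
    """Read the rank variables that torchrun-style launchers export."""
    env = os.environ if environ is None else environ
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


info = distributed_env()
if info["world_size"] == 1:
    print("Single process: no distributed launcher detected.")
else:
    print(f"Process {info['rank']} of {info['world_size']} (local rank {info['local_rank']})")
```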

## 8. Monitor Multi-GPU Performance

To monitor performance across multiple GPUs, you can use tools like `nvidia-smi` or PyTorch's built-in profiling tools. Here's a simple way to check per-GPU memory usage during training:

In [None]:
import subprocess
import time

def get_gpu_memory_usage():
    """Return the memory used on each visible GPU, in MiB, as reported by nvidia-smi."""
    result = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,nounits,noheader"],
        encoding="utf-8",
    )
    # nvidia-smi prints one line per GPU.
    return [int(x) for x in result.strip().split("\n")]

# You can call this function periodically during training to check GPU memory usage
print(f"GPU Memory Usage: {get_gpu_memory_usage()} MiB")
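
Memory used is only part of the picture. `nvidia-smi` can also report total memory and compute utilization per GPU through the query `--query-gpu=index,memory.used,memory.total,utilization.gpu`. The sketch below parses output in that shape; `parse_gpu_stats` is a hypothetical helper, and the sample text is made up so that the example runs without a GPU.

```python
def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse `index, memory.used, memory.total, utilization.gpu` CSV rows."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, used, total, util = (int(field.strip()) for field in line.split(","))
        stats.append({
            "gpu": index,
            "memory_used_mib": used,
            "memory_total_mib": total,
            "memory_pct": round(100 * used / total, 1),
            "utilization_pct": util,
        })
    return stats


# Sample text in the shape printed with --format=csv,nounits,noheader (values are made up).
sample = "0, 10240, 40960, 97\n1, 10112, 40960, 95\n"
for row in parse_gpu_stats(sample):
    print(row)
```

Fields that a device reports as unavailable would not parse as integers, so a production version would need to handle that case.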

## 9. Evaluate the Model

After training, we can evaluate the model on the validation split and report its perplexity, the exponential of the mean evaluation loss:

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
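
Perplexity is the exponential of the mean per-token cross-entropy, so a loss of ln 20 (about 3.0) corresponds to a model that is, on average, as uncertain as a uniform choice among 20 tokens. The sketch below assumes the loss is measured in nats and guards against overflow for very large losses; `perplexity` is a hypothetical helper.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss measured in nats."""
    try:
        return math.exp(mean_loss)
    except OverflowError:
        # Very large losses (for example from a diverged run) exceed float range.
        return float("inf")


print(f"{perplexity(math.log(20.0)):.2f}")
print(perplexity(1e6))
```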