# Chapter 1: Introduction to Modern Distributed AI

This notebook contains all code examples from Chapter 1, organized into logical sections for hands-on experimentation.

## 1. Setup and CUDA Check

First, verify your GPU setup and CUDA availability.


In [3]:
import torch

def check_cuda():
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

check_cuda()


CUDA available: True
Number of GPUs: 8
GPU 0: NVIDIA H200
GPU 1: NVIDIA H200
GPU 2: NVIDIA H200
GPU 3: NVIDIA H200
GPU 4: NVIDIA H200
GPU 5: NVIDIA H200
GPU 6: NVIDIA H200
GPU 7: NVIDIA H200


## 2. GPU-Friendly Configuration

Example configuration for GPU-friendly training settings.


In [5]:
# GPU-friendly configuration example
config = {
    'batch_size': 32,  # Fits in GPU memory
    'sequence_length': 2048,  # Reasonable for most GPUs
    'precision': 'bf16',  # Half the memory of FP32; keeps FP32's exponent range, so less overflow-prone than FP16
    'gradient_checkpointing': True,  # Recompute activations during backward to save memory
    'gradient_accumulation_steps': 4  # Effective batch size = 32 * 4 = 128
}

def print_config():
    for k, v in config.items():
        print(f"{k}: {v}")

print_config()

batch_size: 32
sequence_length: 2048
precision: bf16
gradient_checkpointing: True
gradient_accumulation_steps: 4
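
The effective batch size quoted in the config comment is simple arithmetic. As a minimal sketch, assuming two hypothetical helpers (`effective_batch_size`, `tokens_per_step`) and a `world_size` argument that is not part of the original config, the per-step sample and token counts could be computed like this:

```python
def effective_batch_size(config, world_size=1):
    """Samples contributing to one optimizer step across all data-parallel ranks."""
    return config['batch_size'] * config['gradient_accumulation_steps'] * world_size

def tokens_per_step(config, world_size=1):
    """Tokens per optimizer step, assuming every sample has sequence_length tokens."""
    return effective_batch_size(config, world_size) * config['sequence_length']

# Same values as the configuration above (world_size is an illustrative assumption)
example_config = {'batch_size': 32, 'sequence_length': 2048, 'gradient_accumulation_steps': 4}

print(effective_batch_size(example_config))                # 128
print(effective_batch_size(example_config, world_size=8))  # 1024
print(tokens_per_step(example_config))                     # 262144
```

With data parallelism, each additional process multiplies the number of samples per optimizer step, which is worth keeping in mind when carrying a learning rate over from a single-GPU run.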


## 3. Single-GPU Baseline Training

Single-GPU training baseline for comparison with distributed training.


In [4]:
import torch
import torch.nn as nn
import time
from torch.utils.data import DataLoader, TensorDataset

class SimpleModel(nn.Module):
    def __init__(self, input_size=1000, hidden_size=512, output_size=10):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        return self.fc2(x)

def train_single_gpu():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = SimpleModel().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    
    # Create dummy dataset
    dataset = TensorDataset(
        torch.randn(1000, 1000),
        torch.randint(0, 10, (1000,))
    )
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    
    # Training loop
    model.train()
    start_time = time.time()
    
    for epoch in range(10):
        epoch_loss = 0.0
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(device), target.to(device)
            
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
        
        print(f"Epoch {epoch+1}/10, Loss: {epoch_loss/len(dataloader):.4f}")
    
    total_time = time.time() - start_time
    print(f"\nTotal training time: {total_time:.2f}s")
    if torch.cuda.is_available():
        print(f"Peak memory: {torch.cuda.max_memory_allocated()/1024**3:.2f} GB")

# Run the single-GPU baseline:
train_single_gpu()


Epoch 1/10, Loss: 2.3584
Epoch 2/10, Loss: 0.9866
Epoch 3/10, Loss: 0.2530
Epoch 4/10, Loss: 0.0666
Epoch 5/10, Loss: 0.0317
Epoch 6/10, Loss: 0.0204
Epoch 7/10, Loss: 0.0145
Epoch 8/10, Loss: 0.0111
Epoch 9/10, Loss: 0.0088
Epoch 10/10, Loss: 0.0072

Total training time: 0.48s
Peak memory: 0.07 GB


## 4. Distributed Basic Test

Basic distributed test to verify process group initialization works. This script tests the fundamental distributed setup: process group initialization, rank identification, and basic communication.

**Note:** This requires running with `torchrun` from the command line:
```bash
OMP_NUM_THREADS=8 torchrun --nproc_per_node=2 code/chapter1/ch01_distributed_basic_test.py
```


In [None]:
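import os
import torch
import torch.distributed as dist

def test_distributed_setup():
    """Test basic distributed process group initialization and communication"""
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # Bind each process to its own GPU (torchrun sets LOCAL_RANK)
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    print(f"Rank {rank} says hello.")

    # Basic communication: sum the rank ids across all processes
    value = torch.tensor([float(rank)], device="cuda")
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    expected = world_size * (world_size - 1) // 2
    print(f"Rank {rank}: all_reduce sum = {value.item():.0f} (expected {expected})")
    dist.destroy_process_group()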

# Note: This must be run with torchrun, not directly in notebook
# if __name__ == "__main__":
#     test_distributed_setup()
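
`torchrun` tells each worker who it is through environment variables such as `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`, and `init_process_group` reads them by default. The sketch below, built around a hypothetical `read_launch_env` helper, shows how a script might inspect them. The fallback values are an assumption added so the function also runs outside `torchrun`.

```python
import os

def read_launch_env(env=None):
    """Collect the rank information that torchrun exports to each worker.

    The defaults are illustrative fallbacks for a plain single-process run.
    """
    env = os.environ if env is None else env
    return {
        'rank': int(env.get('RANK', 0)),              # global rank across all nodes
        'local_rank': int(env.get('LOCAL_RANK', 0)),  # rank within this node
        'world_size': int(env.get('WORLD_SIZE', 1)),  # total number of processes
        'master_addr': env.get('MASTER_ADDR', 'localhost'),
        'master_port': int(env.get('MASTER_PORT', 29500)),
    }

info = read_launch_env()
print(f"rank {info['rank']} of {info['world_size']} (local rank {info['local_rank']})")
```

Printing these values at the top of a script is a quick way to confirm that the launcher, rather than the notebook kernel, started the process.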


## 5. Multi-GPU Simulation (Single GPU)

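Single-GPU simulation of multi-GPU distributed training. `torchrun` launches several real processes that all share one physical GPU, which lets you exercise process-group setup and rank logic without a multi-GPU machine. The example below only initializes the process group and reports each rank; NCCL collectives can fail when several ranks share a single device.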

**Note:** This requires running with `torchrun` from the command line:
```bash
# Option 1: With MPS (recommended): start the MPS control daemon first so both processes can use the GPU concurrently
sudo nvidia-cuda-mps-control -d
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=4 torchrun --nproc_per_node=2 code/chapter1/ch01_multi_gpu_simulation.py

# Option 2: Without MPS: same launch command, but the processes time-slice the GPU
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=4 torchrun --nproc_per_node=2 code/chapter1/ch01_multi_gpu_simulation.py
```


In [None]:
import torch
import torch.distributed as dist

def simulate_multi_gpu():
    """Simulate multi-GPU distributed training on a single GPU"""
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Rank {rank} says hello.")
    dist.destroy_process_group()

# Note: This must be run with torchrun, not directly in notebook
# if __name__ == "__main__":
#     simulate_multi_gpu()


## 6. Multi-GPU DDP Training

A first multi-GPU training run with PyTorch DistributedDataParallel (DDP). The example is complete: it covers process-group setup, data sharding with `DistributedSampler`, and cleanup.

**Note:** This requires running with `torchrun` from the command line:
```bash
OMP_NUM_THREADS=4 torchrun --nproc_per_node=2 code/chapter1/ch01_multi_gpu_ddp.py
# Or use the launch script:
bash code/chapter1/ch01_launch_torchrun.sh
```


In [None]:
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torch.utils.data import TensorDataset
import os

def setup():
    """Initialize the process group using torchrun environment variables"""
    # torchrun sets these environment variables automatically
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    torch.cuda.set_device(local_rank)
    return rank, dist.get_world_size(), local_rank

def cleanup():
    """Clean up the process group"""
    dist.destroy_process_group()

def train_ddp():
    """Run distributed training using PyTorch DDP"""
    rank, world_size, local_rank = setup()
    
    # Create model
    model = nn.Sequential(
        nn.Linear(1000, 512),
        nn.ReLU(),
        nn.Linear(512, 10)
    ).cuda()
    
    # Wrap with DDP
    model = DDP(model, device_ids=[local_rank])
    
    # Create dataset with distributed sampler
    dataset = TensorDataset(
        torch.randn(1000, 1000),
        torch.randint(0, 10, (1000,))
    )
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
    
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    
    # Training loop
    model.train()
    for epoch in range(10):
        sampler.set_epoch(epoch)  # Reseed the shuffle; without this, every epoch reuses the same order
        epoch_loss = 0.0
        
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.cuda(), target.cuda()
            
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
        
        # Report from rank 0 only; this is the mean loss over rank 0's shard, not a global average
        if rank == 0:
            print(f"Epoch {epoch+1}/10, Loss: {epoch_loss/len(dataloader):.4f}", flush=True)
    
    cleanup()

# Note: This must be run with torchrun, not directly in notebook
# if __name__ == "__main__":
#     train_ddp()
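
DDP keeps one model replica per process and averages gradients across processes during `backward()`, so every replica applies the same update. The toy sketch below uses plain Python rather than PyTorch, with a hypothetical one-parameter model and made-up data, to show why this matches single-process training when the shards are equally sized: the mean of the per-shard gradients equals the gradient over the combined batch.

```python
def grad_mse(w, xs, ys):
    """Gradient with respect to w of mean((w * x - y) ** 2) over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Illustrative data and weight (not taken from the training example above)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.5, 8.0]
w = 0.5

# Single-process gradient over the full batch
full_grad = grad_mse(w, xs, ys)

# Two "ranks", each holding an equally sized strided shard of the batch
world_size = 2
shard_grads = [grad_mse(w, xs[r::world_size], ys[r::world_size]) for r in range(world_size)]

# What an averaging all-reduce would leave on every rank
averaged_grad = sum(shard_grads) / world_size

print(full_grad, averaged_grad)
```

The equality relies on equal shard sizes and a mean-reduced loss. With uneven shards, a plain average of per-rank gradients would weight samples differently.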


## 7. Profiling and Performance Analysis

Memory and latency profiling for model training. This demonstrates how to use PyTorch's profiler to measure CUDA operations and memory usage.


In [None]:
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

def profile_model():
    if not torch.cuda.is_available():
        print("CUDA not available, skipping profiling")
        return
        
    model = nn.Sequential(
        nn.Linear(1000, 512),
        nn.ReLU(),
        nn.Linear(512, 256),
        nn.ReLU(),
        nn.Linear(256, 10)
    ).cuda()
    
    inputs = torch.randn(32, 1000).cuda()
    targets = torch.randint(0, 10, (32,)).cuda()
    
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())
    
    # Memory profiling
    torch.cuda.reset_peak_memory_stats()
    
    # Time profiling with PyTorch profiler
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    ) as prof:
        with record_function("forward"):
            outputs = model(inputs)
            loss = criterion(outputs, targets)
        
        with record_function("backward"):
            loss.backward()
        
        with record_function("optimizer_step"):
            optimizer.step()
    
    # Print results
    print("=" * 80)
    print("CUDA Time Summary:")
    print("=" * 80)
    print(prof.key_averages().table(
        sort_by="cuda_time_total",
        row_limit=20
    ))
    
    print("\n" + "=" * 80)
    print("Memory Summary:")
    print("=" * 80)
    print(prof.key_averages().table(
        sort_by="cuda_memory_usage",
        row_limit=20
    ))
    
    peak_memory = torch.cuda.max_memory_allocated() / 1024**3
    print(f"\nPeak GPU Memory: {peak_memory:.2f} GB")

# Uncomment to run:
# profile_model()


## 8. Common DDP Pitfalls

Examples of common mistakes and correct patterns in distributed training.


In [None]:
import os

def wrong_master_port(rank):
    # Wrong: each process uses a different port, so the ranks never find the same rendezvous point
    os.environ['MASTER_PORT'] = str(12355 + rank)  # ❌

def correct_master_port():
    # Correct: All processes use same port
    os.environ['MASTER_PORT'] = '12355'  # ✅

def wrong_dataloader(dataset):
    # Wrong: every process iterates over the full dataset, duplicating work across ranks
    from torch.utils.data import DataLoader
    dataloader = DataLoader(dataset, batch_size=32)  # ❌
    return dataloader

def correct_dataloader(dataset, world_size, rank):
    # Correct: Each process sees subset of data
    from torch.utils.data import DataLoader, DistributedSampler
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)  # ✅
    return dataloader

def set_epoch_for_shuffling(sampler, epoch):
    # Correct: call before each epoch so the sampler reshuffles; otherwise the order repeats
    sampler.set_epoch(epoch)  # ✅

print("DDP Pitfalls examples defined. See functions above for correct patterns.")
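
To make the sampler pattern concrete, here is a plain-Python sketch of the partitioning idea behind `DistributedSampler`: shuffle with a seed derived from the epoch, pad so the index list divides evenly, then give each rank a strided slice. The function `shard_indices` is hypothetical and uses Python's `random` module, so it illustrates the scheme rather than reproducing PyTorch's exact index order. It assumes a non-empty dataset.

```python
import math
import random

def shard_indices(dataset_len, world_size, rank, epoch=0, seed=0, shuffle=True):
    """Return the dataset indices one rank would process in a given epoch."""
    indices = list(range(dataset_len))
    if shuffle:
        # Every rank uses the same seed, so all ranks agree on the permutation
        random.Random(seed + epoch).shuffle(indices)
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Pad by repeating indices so every rank receives the same number of samples
    indices = (indices * math.ceil(total / dataset_len))[:total]
    # Strided slice: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total:world_size]

shards = [shard_indices(10, world_size=3, rank=r, epoch=0) for r in range(3)]
print([len(s) for s in shards])                    # [4, 4, 4]
print(sorted(set(i for s in shards for i in s)))   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Passing a different `epoch` changes the seed and therefore the permutation, which is the role `set_epoch` plays in the real sampler. With a 1000-sample dataset and two ranks, each rank would receive 500 indices.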


## 9. Measure Training Components

Helper function to measure and report the time breakdown of different training components.


In [None]:
def measure_components(measure_data_loading_time, measure_compute_time, measure_communication_time):
    """
    Measure and report the time breakdown of training components.
    
    Args:
        measure_data_loading_time: Function that returns data loading time
        measure_compute_time: Function that returns computation time
        measure_communication_time: Function that returns communication time
    """
    data_time = measure_data_loading_time()
    compute_time = measure_compute_time()
    comm_time = measure_communication_time()

    total_time = data_time + compute_time + comm_time
    print(f"Data loading: {data_time/total_time*100:.1f}%")
    print(f"Computation: {compute_time/total_time*100:.1f}%")
    print(f"Communication: {comm_time/total_time*100:.1f}%")

# Example usage:
# measure_components(
#     lambda: 0.2,  # data loading time
#     lambda: 0.6,  # computation time
#     lambda: 0.2   # communication time
# )
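
`measure_components` expects callables that return elapsed seconds. One way to produce such numbers, sketched here with hypothetical `timed` and `percent_breakdown` helpers and stand-in workloads, is to wrap each phase with `time.perf_counter()`. On a GPU, kernels are launched asynchronously, so a real measurement would also call `torch.cuda.synchronize()` before reading the clock; that call is left out to keep the sketch CPU-only.

```python
import time

def timed(fn, *args, **kwargs):
    """Return the wall-clock seconds taken by fn(*args, **kwargs)."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start

def percent_breakdown(times):
    """Convert a {phase: seconds} dict into {phase: percent of total}."""
    total = sum(times.values())
    if total <= 0:
        # Avoid dividing by zero when nothing was measured
        return {name: 0.0 for name in times}
    return {name: 100.0 * t / total for name, t in times.items()}

# Stand-in workloads; replace with real data loading and compute steps
times = {
    'data_loading': timed(lambda: [str(i) for i in range(50_000)]),
    'computation': timed(lambda: sum(i * i for i in range(200_000))),
}
for name, pct in percent_breakdown(times).items():
    print(f"{name}: {pct:.1f}%")
```

Averaging over several iterations, after a few warm-up steps, generally gives steadier numbers than timing a single step.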


## 10. Running Distributed Training in Notebook

While `torchrun` is designed for command-line use, there are ways to run distributed training in Jupyter notebooks. Here are the main approaches:


### Option 2: Using subprocess to call torchrun (Not Recommended)

You can call `torchrun` through `subprocess`, but this is not ideal: the workers run outside the notebook kernel, and with `capture_output=True` no output appears until the command has finished.


In [None]:
import subprocess
import sys
import os

def run_torchrun_in_notebook(script_path, nproc_per_node=2):
    """
    Run torchrun via subprocess.
    Note: This runs outside the notebook context.
    """
    # Set environment variables
    env = os.environ.copy()
    env['OMP_NUM_THREADS'] = '4'
    
    # Build command
    cmd = [
        sys.executable, '-m', 'torch.distributed.run',
        '--nproc_per_node', str(nproc_per_node),
        script_path
    ]
    
    # Run and capture output
    result = subprocess.run(
        cmd,
        env=env,
        capture_output=True,
        text=True
    )
    
    print("STDOUT:")
    print(result.stdout)
    if result.stderr:
        print("\nSTDERR:")
        print(result.stderr)
    
    return result.returncode == 0

# Example usage (uncomment to run):
# run_torchrun_in_notebook('code/chapter1/ch01_multi_gpu_ddp.py', nproc_per_node=2)
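
Because `capture_output=True` buffers everything until the child exits, a long training run shows no progress in the cell. A possible alternative, sketched below with a hypothetical `stream_command` helper, reads the merged stdout and stderr line by line using `subprocess.Popen`. The demonstration command is a short Python one-liner rather than `torchrun`, so the sketch runs without GPUs.

```python
import subprocess
import sys

def stream_command(cmd, env=None):
    """Run cmd, echo its output as it arrives, and return (returncode, lines)."""
    lines = []
    with subprocess.Popen(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout to keep ordering simple
        text=True,
        bufsize=1,                 # line-buffered on the reading side
    ) as proc:
        for line in proc.stdout:
            print(line, end="")
            lines.append(line.rstrip("\n"))
    # Leaving the with-block waits for the process, so returncode is set here
    return proc.returncode, lines

code, lines = stream_command([sys.executable, "-c", "print('step 1'); print('step 2')"])
print(f"exit code: {code}")
```

The same `cmd` list built in `run_torchrun_in_notebook` could be passed to this helper. Output only streams promptly if the child flushes it, for example with `print(..., flush=True)` as in the DDP training loop, or by setting `PYTHONUNBUFFERED=1` in `env`.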


### Option 3: Using IPython Magic Command (Recommended for Notebooks)

IPython's `!` prefix passes the rest of the line to the system shell, so a notebook cell can launch `torchrun` directly.


In [None]:
# Using IPython magic command to run torchrun
# This allows you to run torchrun commands directly in the notebook

# Example:
# !OMP_NUM_THREADS=4 torchrun --nproc_per_node=2 code/chapter1/ch01_multi_gpu_ddp.py

# Or with environment variable:
# import os
# os.environ['OMP_NUM_THREADS'] = '4'
# !torchrun --nproc_per_node=2 code/chapter1/ch01_multi_gpu_ddp.py

print("To run torchrun in notebook, use IPython magic command:")
print("!OMP_NUM_THREADS=4 torchrun --nproc_per_node=2 code/chapter1/ch01_multi_gpu_ddp.py")


### Option 4: Using Hugging Face Accelerate (Alternative Approach)

If you have `accelerate` installed, you can use `notebook_launcher` for distributed training in notebooks. This is a cleaner approach for notebook environments.

**Note:** This requires installing accelerate: `pip install accelerate`


In [None]:
# Example using Hugging Face accelerate (if installed)
# Uncomment and install accelerate first: pip install accelerate

# import torch
# import torch.nn as nn
# from accelerate import Accelerator, notebook_launcher
# from torch.utils.data import DataLoader, TensorDataset

# def training_function():
#     accelerator = Accelerator()
#     
#     model = nn.Sequential(
#         nn.Linear(1000, 512),
#         nn.ReLU(),
#         nn.Linear(512, 10)
#     )
#     
#     dataset = TensorDataset(
#         torch.randn(1000, 1000),
#         torch.randint(0, 10, (1000,))
#     )
#     dataloader = DataLoader(dataset, batch_size=32)
#     
#     optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
#     criterion = nn.CrossEntropyLoss()
#     
#     model, dataloader, optimizer = accelerator.prepare(model, dataloader, optimizer)
#     
#     model.train()
#     for epoch in range(10):
#         epoch_loss = 0.0
#         for data, target in dataloader:
#             optimizer.zero_grad()
#             output = model(data)
#             loss = criterion(output, target)
#             accelerator.backward(loss)
#             optimizer.step()
#             epoch_loss += loss.item()
#         
#         if accelerator.is_main_process:
#             print(f"Epoch {epoch+1}/10, Loss: {epoch_loss/len(dataloader):.4f}")

# notebook_launcher(training_function, num_processes=2)

print("Accelerate example code shown above. Install accelerate to use this approach.")


### Summary: Running Distributed Training in Notebooks

**Best Practices:**
1. **For quick command execution**: Use the IPython shell escape `!torchrun ...` (Option 3), the simplest approach
2. **For production-like setup**: Use the command line with `torchrun`, the most reliable approach
3. **For Hugging Face workflows**: Use Accelerate's `notebook_launcher` (Option 4), the cleanest approach inside notebooks

**Important Notes:**
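- Notebook kernels complicate multiprocessing: for example, once CUDA has been initialized in the kernel, forked worker processes cannot re-initialize it
- Some distributed operations may therefore behave differently, or fail, depending on the notebook environment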
- For production training, command-line `torchrun` is still recommended
- Make sure you have the required number of GPUs available
