# Lab 2: Multi-GPU Training with FSDP on SageMaker

## Overview
Learn how to scale medical image segmentation training across multiple GPUs using Fully Sharded Data Parallel (FSDP). This enables training larger models that don't fit on a single GPU.

## Learning Objectives
- Understand FSDP vs DDP for distributed training
- Configure multi-GPU SageMaker training jobs
- Train large models (SwinUNETR) with memory efficiency
- Monitor distributed training with TensorBoard, MLflow, and WandB

## Prerequisites
- Completed Lab 1 (Single GPU Training)
- Understanding of distributed training concepts
- MLflow and WandB accounts (optional)

**Estimated Time:** 45-60 minutes

## What is FSDP?

**Fully Sharded Data Parallel (FSDP)** shards model parameters, gradients, and optimizer states across GPUs:

| Feature | DDP | FSDP |
|---------|-----|------|
| Model Replication | Full copy per GPU | Sharded across GPUs |
| Memory Efficiency | Low | High |
| Best For | Models < 1B params | Models > 1B params |
| Communication | Gradients only | Parameters + Gradients |

**Use FSDP when:**
- Model doesn't fit in single GPU memory
- Training large transformer models (SwinUNETR, ViT)
- Need to maximize batch size
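
A rough way to judge whether a model needs FSDP is to estimate the per-GPU memory taken by model state alone. The sketch below assumes FP32 training with Adam: 4 bytes per parameter for weights, 4 for gradients, and 8 for the two optimizer moments, or 16 bytes in total. It ignores activations, buffers, and fragmentation. The function name and the 1B-parameter example are illustrative assumptions, not values taken from the lab's training script.

```python
def model_state_gb(num_params: int, num_gpus: int, sharded: bool,
                   bytes_per_param: int = 16) -> float:
    """Approximate per-GPU memory (GB) for weights, gradients, and Adam state.

    bytes_per_param=16 assumes FP32 weights (4) + FP32 gradients (4)
    + two FP32 Adam moments (8). Activations are not included.
    """
    total_bytes = num_params * bytes_per_param
    if sharded:
        # FSDP with full sharding splits model state evenly across GPUs.
        total_bytes /= num_gpus
    # DDP keeps a full replica on every GPU, so nothing is divided.
    return total_bytes / 1e9


# Hypothetical 1B-parameter model on 4 GPUs.
ddp_gb = model_state_gb(1_000_000_000, num_gpus=4, sharded=False)
fsdp_gb = model_state_gb(1_000_000_000, num_gpus=4, sharded=True)
print(f"DDP:  {ddp_gb:.1f} GB per GPU")
print(f"FSDP: {fsdp_gb:.1f} GB per GPU")
```

Under these assumptions the DDP replica needs four times the per-GPU memory of the sharded version, before any activations are counted.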

## Step 1: Setup Environment

In [None]:
%pip install sagemaker==2.200.0

In [None]:
import sagemaker
from sagemaker.pytorch.estimator import PyTorch
from sagemaker import get_execution_role
import boto3

sagemaker_session = sagemaker.Session(boto3.Session(region_name='us-east-1'))
# sagemaker_session = sagemaker.Session()
role = get_execution_role()
region = sagemaker_session.boto_region_name
bucket = sagemaker_session.default_bucket()
print(f"Region: {region}")
print(f"Bucket: {bucket}")

## Step 2: Configure Data and Output Paths

In [None]:
bucket = 'YOUR_SAGEMAKER_BUCKET_NAME'  # Replace with your actual bucket name
data_path = f's3://{bucket}/segmentation_data/'
output_path = f's3://{bucket}/segmentation_data/output'

print(f"Training data: {data_path}")
print(f"Output path: {output_path}")
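
Note that the output path configured here sits under the training data prefix. With SageMaker's default S3 prefix input, a later job that reads `data_path` would likely also download earlier model artifacts. The helper below is a minimal sketch for checking this; `is_under_prefix` and the `my-bucket` name are hypothetical and not part of the lab code.

```python
def is_under_prefix(uri: str, prefix: str) -> bool:
    """Return True if `uri` lies inside the S3 prefix `prefix`."""
    # Normalize so 's3://b/data' and 's3://b/data/' compare equal.
    norm_prefix = prefix.rstrip("/") + "/"
    norm_uri = uri.rstrip("/") + "/"
    return norm_uri.startswith(norm_prefix)


example_data = "s3://my-bucket/segmentation_data/"        # hypothetical bucket
example_output = "s3://my-bucket/segmentation_data/output"
print(is_under_prefix(example_output, example_data))
```

If the check returns `True`, consider writing outputs to a sibling prefix instead of one nested inside the data prefix.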

## Step 3: Configure Experiment Tracking (Optional)

Enable MLflow and/or WandB for advanced experiment tracking.

To use MLflow, first create an MLflow tracking server in SageMaker, or use a local path for testing.

In [None]:
# Set to True to enable tracking
use_mlflow = True
use_wandb = False

# MLflow configuration
# For a SageMaker-managed tracking server, set this to the server's ARN:
# mlflow_tracking_uri = "arn:aws:sagemaker:us-east-1:xxxxx"
# For a self-hosted server, use its URL, e.g. "http://mlflow-server:5000"
mlflow_tracking_uri = './mlruns'  # Local path for testing
mlflow_experiment_name = "medical-segmentation-fsdp"

# WandB configuration
wandb_project = "medical-segmentation-fsdp"
wandb_api_key = ""  # Your WandB API key

print(f"MLflow: {'Enabled' if use_mlflow else 'Disabled'}")
print(f"WandB: {'Enabled' if use_wandb else 'Disabled'}")
print("TensorBoard: Always Enabled")

## Step 4: Define Hyperparameters

Configure multi-GPU training with a larger model (SwinUNETR).

In [None]:
hyperparameters = {
    "model_name": "SegResNet",  # CNN model; use "SwinUNETR" for the larger transformer-based model
    "batch_size": 2,
    "epochs": 20,
    "lr": 1e-4,
    "use_mlflow": use_mlflow,
    "use_wandb": use_wandb,
}

if use_mlflow:
    hyperparameters["mlflow_tracking_uri"] = mlflow_tracking_uri
    hyperparameters["mlflow_experiment_name"] = mlflow_experiment_name

if use_wandb:
    hyperparameters["wandb_project"] = wandb_project
    hyperparameters["wandb_api_key"] = wandb_api_key

print("Hyperparameters:")
for key, value in hyperparameters.items():
    if 'api_key' not in key:
        print(f"  {key}: {value}")
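
In SageMaker script mode, hyperparameters reach the entry point as command-line arguments, and boolean values arrive as strings. The contents of `train_fsdp_all.py` are not shown in this lab, so the following is only a sketch of how such a script might parse the values defined in this cell. `str2bool` and `build_parser` are hypothetical helpers.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse boolean-like strings such as 'True', 'true', '1', 'False'."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Argument names mirror the keys of the hyperparameters dict.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, default="SegResNet")
    parser.add_argument("--batch_size", type=int, default=2)
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--use_mlflow", type=str2bool, default=False)
    parser.add_argument("--use_wandb", type=str2bool, default=False)
    return parser


# Simulate the arguments the container might pass for this job.
args, _unknown = build_parser().parse_known_args(
    ["--model_name", "SegResNet", "--lr", "0.0001", "--use_mlflow", "True"]
)
print(args.model_name, args.lr, args.use_mlflow, args.use_wandb)
```

A dedicated parser is used for the flags because `type=bool` would treat any non-empty string, including `"False"`, as true. `parse_known_args` tolerates extra keys such as the MLflow and WandB settings.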

## Step 5: Create Multi-GPU Estimator

Configure the job for 4 GPUs on an ml.g5.12xlarge instance:
- 4x NVIDIA A10G GPUs (24GB each, 96GB total GPU memory)
- 192GB system memory
- Ideal for large model training

In [None]:
estimator = PyTorch(
    entry_point="train_fsdp_all.py",
    source_dir="../code/training",  # keep requirements.txt in source_dir so it is installed automatically
    role=role,
    instance_count=1,
    instance_type="ml.g5.12xlarge",  # 4x NVIDIA A10G GPUs
    framework_version="2.1.0",
    py_version="py310",
    hyperparameters=hyperparameters,
    output_path=output_path,
    base_job_name="medical-seg-fsdp",
    keep_alive_period_in_seconds=1800,  # keep a warm pool for 30 minutes between jobs
    environment={
        "PIP_CACHE_DIR": "/opt/ml/sagemaker/warmpoolcache/pip",
        "NCCL_DEBUG": "INFO",  # verbose NCCL logs for debugging
    },
    # "pytorchddp" only selects the launcher (one process per GPU);
    # the FSDP wrapping itself is expected to happen inside train_fsdp_all.py.
    distribution={
        "pytorchddp": {
            "enabled": True
        }
    },
    # dependencies=['requirements.txt'] is not needed because the file is in source_dir
    sagemaker_session=sagemaker_session,
)

print("✓ Multi-GPU Estimator created!")
print(f"  Instance: {estimator.instance_type} (4 GPUs)")
print(f"  Model: {hyperparameters['model_name']}")
print("  Strategy: FSDP")

## Step 6: Launch Distributed Training

Start FSDP training across 4 GPUs.

**Expected Duration:** 30-60 minutes

In [None]:
estimator.fit({"training": data_path}, wait=True, logs="All")

## Step 7: Analyze Training Results

In [None]:
training_job_name = estimator.latest_training_job.name
model_data = estimator.model_data

print(f"Training Job: {training_job_name}")
print(f"Model Artifacts: {model_data}")
print("\nView logs in CloudWatch:")
print("  Log Group: /aws/sagemaker/TrainingJobs")
print(f"  Log Stream prefix: {training_job_name}/")

## Step 8: Download and Visualize Results

In [None]:
# Download model artifacts
!aws s3 cp {model_data} ./model.tar.gz
!tar -xzf model.tar.gz

print("✓ Model downloaded")
print("\nAvailable files:")
!ls -lh *.pth

print("\nLaunch TensorBoard:")
print("  tensorboard --logdir=./")

## Step 9: Compare with Single GPU Training

Analyze performance improvements.

In [None]:
import boto3

sm_client = boto3.client('sagemaker', region_name=region)

# Get training job details
job_details = sm_client.describe_training_job(TrainingJobName=training_job_name)

training_time = job_details['TrainingTimeInSeconds']
billable_time = job_details['BillableTimeInSeconds']

print(f"Training Time: {training_time/60:.1f} minutes")
print(f"Billable Time: {billable_time/60:.1f} minutes")
hourly_rate = 7.09  # USD/hour for ml.g5.12xlarge; check current pricing for your region
print("\nEstimated Cost:")
print(f"  ml.g5.12xlarge: ${(billable_time/3600) * hourly_rate:.2f}")

## Understanding FSDP Performance

### Memory Efficiency
```
Single GPU (24GB):  Model (20GB) + Batch (4GB) = OOM ❌
FSDP 4 GPUs:        Model (5GB/GPU) + Batch (4GB/GPU) = ✓
```

### Training Speed
- **Near-Linear Scaling**: 4 GPUs ≈ 3.5x faster (~88% efficiency)
- **Communication Overhead**: ~10-15% for FSDP
- **Optimal Batch Size**: 2-4 per GPU
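
The efficiency figure follows from dividing the observed speedup by the number of GPUs. The snippet below is a small sketch of that arithmetic. The function names are illustrative, and the timings passed to `speedup_from_times` are made-up numbers rather than measurements from this lab.

```python
def speedup_from_times(single_gpu_seconds: float, multi_gpu_seconds: float) -> float:
    """Observed speedup given wall-clock times for the same workload."""
    return single_gpu_seconds / multi_gpu_seconds


def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / num_gpus


eff = scaling_efficiency(3.5, 4)
print(f"Efficiency at 3.5x on 4 GPUs: {eff:.1%}")

# Hypothetical timings: 70 minutes on 1 GPU vs 20 minutes on 4 GPUs.
observed = speedup_from_times(70 * 60, 20 * 60)
print(f"Observed speedup: {observed:.2f}x, "
      f"efficiency: {scaling_efficiency(observed, 4):.1%}")
```

To apply this to your own runs, use the training time from this job and from the Lab 1 job, provided both trained the same model for the same number of epochs.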

### When to Use FSDP
✓ Model > 10GB  
✓ Need larger batch sizes  
✓ Training time > 1 hour  
✓ Budget allows multi-GPU instances  

## Key Takeaways

✓ **FSDP Benefits:**
- Train models that don't fit on single GPU
- Up to ~3.5x faster training with 4 GPUs
- Memory efficient for large models

✓ **Best Practices:**
- Start with 2-4 GPUs for cost efficiency
- Monitor GPU utilization (aim for >80%)
- Use gradient checkpointing for even larger models
- Enable mixed precision (FP16) for speed

✓ **Monitoring:**
- TensorBoard: Always available
- MLflow: Experiment tracking and model registry
- WandB: Real-time collaboration and visualization

## Next Steps

- **Lab 3**: Advanced experiment tracking with WandB
- Optimize hyperparameters with SageMaker Automatic Model Tuning
- Deploy model with SageMaker Inference

## Cost Comparison

| Instance | GPUs | Cost/Hour | 1hr Training |
|----------|------|-----------|-------------|
| ml.g5.xlarge | 1 | $1.41 | $1.41 |
| ml.g5.12xlarge | 4 | $7.09 | $7.09 |
| **4 GPUs vs 1** | **4x GPUs** | **~5x hourly cost** | **~1.4x cost per job (at 3.5x speedup)** |

**Recommendation**: Use multi-GPU for production training, single GPU for prototyping.
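
The per-job cost depends on how much of the higher hourly rate is recovered through speedup. The sketch below works through that arithmetic using the hourly rates from the table and the 3.5x speedup quoted earlier in this lab. `job_cost` is a hypothetical helper, and actual prices vary by region and over time.

```python
def job_cost(hourly_rate: float, single_gpu_hours: float, speedup: float = 1.0) -> float:
    """Cost of a job that takes `single_gpu_hours` on one GPU, run at `speedup`."""
    return hourly_rate * single_gpu_hours / speedup


G5_XLARGE_RATE = 1.41     # USD/hour, from the table
G5_12XLARGE_RATE = 7.09   # USD/hour, from the table

single = job_cost(G5_XLARGE_RATE, single_gpu_hours=1.0)
multi = job_cost(G5_12XLARGE_RATE, single_gpu_hours=1.0, speedup=3.5)
# Speedup at which the 4-GPU job would cost the same as the 1-GPU job.
break_even = G5_12XLARGE_RATE / G5_XLARGE_RATE

print(f"1 GPU:  ${single:.2f}")
print(f"4 GPUs: ${multi:.2f} ({multi / single:.2f}x the single-GPU cost)")
print(f"Break-even speedup: {break_even:.2f}x")
```

Because the break-even speedup exceeds the GPU count, the 4-GPU instance trades higher cost per job for shorter wall-clock time at these rates. It is also the only option when the model does not fit on a single GPU.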