# Distributed Training with DeepSpeed and Ray Train

This notebook demonstrates using Microsoft's DeepSpeed with Ray Train for memory-efficient distributed training.

**Learning Objectives:**
1. Configure DeepSpeed ZeRO for memory optimization
2. Use DeepSpeed's configuration-based approach
3. Compare with FSDP2

## What is DeepSpeed?

[DeepSpeed](https://www.deepspeed.ai/) is Microsoft's deep learning optimization library:

- **ZeRO**: Partitions optimizer states, gradients, and parameters across GPUs
- **Mixed Precision**: FP16/BF16 with automatic loss scaling
- **CPU Offloading**: Extends training beyond GPU memory

**When to use DeepSpeed:**
- You are training large models (billions of parameters)
- You are using HuggingFace Transformers
- You prefer a configuration-driven setup

## DeepSpeed ZeRO Stages

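ZeRO progressively partitions training state across the data-parallel GPUs. The reductions below follow the ZeRO paper's accounting for mixed-precision Adam, where unpartitioned model state costs about 16 bytes per parameter. Savings grow with the number of GPUs, N:

| Stage | Partitions | Memory Reduction (model states, per GPU) |
|-------|------------|------------------------------------------|
| ZeRO-1 | Optimizer states | Up to 4x |
| ZeRO-2 | + Gradients | Up to 8x |
| ZeRO-3 | + Parameters | About N times |
| ZeRO-Infinity | + CPU/NVMe offload | Bounded by host RAM or NVMe capacity rather than GPU memory |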
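
To make the stages concrete, the sketch below estimates per-GPU model-state memory using the ZeRO paper's accounting for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer state. The function name `zero_model_state_bytes` is illustrative, and the estimate ignores activations, buffers, and fragmentation. This notebook disables FP16, so its actual byte counts differ, but the partitioning pattern is the same.

```python
def zero_model_state_bytes(num_params: int, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state bytes for mixed-precision Adam."""
    weights = 2 * num_params   # FP16 parameters
    grads = 2 * num_params     # FP16 gradients
    optim = 12 * num_params    # FP32 weights + Adam momentum + Adam variance
    if stage >= 1:
        optim /= num_gpus      # ZeRO-1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus      # ZeRO-2 also partitions gradients
    if stage >= 3:
        weights /= num_gpus    # ZeRO-3 also partitions parameters
    return weights + grads + optim


# Example: a 7.5B-parameter model on 64 GPUs (stage 0 means no partitioning)
for stage in range(4):
    gb = zero_model_state_bytes(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For this example the estimate falls from 120 GB without partitioning to about 31.4 GB, 16.6 GB, and 1.9 GB for stages 1, 2, and 3.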

## Key Differences from FSDP2

| Aspect | FSDP2 | DeepSpeed |
|--------|-------|-----------|
| Setup | `fully_shard(model, ...)` | `deepspeed.initialize(model, config)` |
| Optimizer | User creates separately | Managed by DeepSpeed |
| Backward | `loss.backward()` | `model.backward(loss)` |
| Config | Python API | JSON/dict config |

## Step 1: Environment Setup

Check Ray cluster status and configure environment.

**Requirements:**
- Ray cluster with at least 2 GPU workers
- DeepSpeed installed on all workers
- CUDA runtime (nvcc not required)

In [1]:
# Check Ray cluster status
!ray status

Node status
---------------------------------------------------------------
Active:
 1 head
 1 1xL4:16CPU-64GB-2
Idle:
 1 1xL4:16CPU-64GB-1
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/32.0 CPU
 0.0/2.0 GPU
 0.0/2.0 anyscale/accelerator_shape:1xL4
 0.0/1.0 anyscale/cpu_only:true
 0.0/1.0 anyscale/node-group:1xL4:16CPU-64GB-1
 0.0/1.0 anyscale/node-group:1xL4:16CPU-64GB-2
 0.0/1.0 anyscale/node-group:head
 0.0/3.0 anyscale/provider:aws
 0.0/3.0 anyscale/region:us-west-2
 0B/160.00GiB memory
 16.30KiB/44.64GiB object_store_memory

From request_resources:
 (none)
Pending Demands:
 (no resource demands)

In [None]:
%%bash
pip install -q torch torchvision deepspeed

In [3]:
# Verify installation
import torch
import ray
import deepspeed

print(f"PyTorch version: {torch.__version__}")
print(f"Ray version: {ray.__version__}")
print(f"DeepSpeed version: {deepspeed.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

PyTorch version: 2.10.0+cu128
Ray version: 2.53.0
DeepSpeed version: 0.18.5
CUDA available: False


In [None]:
# Setup - configure environment for DeepSpeed
import os

# Ray Train V2 API
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

# DeepSpeed environment settings (avoid nvcc checks on clusters without CUDA toolkit)
os.environ["DS_BUILD_OPS"] = "0"
os.environ["DS_SKIP_CUDA_CHECK"] = "1"

import tempfile
import uuid
import torch

## Step 2: Model Definition

This is the same Vision Transformer used in the FSDP2 tutorial. DeepSpeed wraps an existing PyTorch model, so the model definition needs no changes.

In [None]:
from torchvision.models import VisionTransformer
from torchvision.datasets import FashionMNIST
from torchvision.transforms import ToTensor, Normalize, Compose

def init_model():
    """Initialize Vision Transformer for FashionMNIST."""
    model = VisionTransformer(
        image_size=28, patch_size=7, num_layers=10, num_heads=2,
        hidden_dim=128, mlp_dim=128, num_classes=10,
    )
    model.conv_proj = torch.nn.Conv2d(1, 128, kernel_size=7, stride=7)
    return model

# Verify model
test_model = init_model()
print(f"Model parameters: {sum(p.numel() for p in test_model.parameters()):,}")
del test_model
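
As a quick sanity check on the hyperparameters above, this sketch works out the token sequence length and per-head dimension implied by `image_size=28`, `patch_size=7`, `hidden_dim=128`, and `num_heads=2`. The helper name `vit_sequence_length` is hypothetical; it mirrors the usual ViT layout of one token per patch plus a class token.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Tokens seen by the encoder: one per image patch plus the class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1


seq_len = vit_sequence_length(image_size=28, patch_size=7)  # 4 x 4 patches + 1
head_dim = 128 // 2                                         # hidden_dim // num_heads
print(f"sequence length: {seq_len}, head dimension: {head_dim}")
```

Each 28x28 image becomes 16 patches of 7x7 pixels, so the encoder processes 17 tokens with two 64-dimensional attention heads.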

## Step 3: DeepSpeed Configuration

DeepSpeed is driven by a configuration dictionary rather than a Python API. Key sections:
- **optimizer**: DeepSpeed manages the optimizer internally
- **fp16**: Mixed precision settings
- **zero_optimization**: ZeRO stage and settings
- **train_micro_batch_size_per_gpu**: Batch size configuration

In [None]:
def get_deepspeed_config(batch_size=64, lr=0.001):
    """DeepSpeed ZeRO Stage 2 configuration."""
    return {
        "optimizer": {
            "type": "Adam",
            "params": {"lr": lr, "betas": [0.9, 0.999], "eps": 1e-8}
        },
        "fp16": {"enabled": False},  # Disabled for simplicity
        "zero_optimization": {
            "stage": 2,
            "allgather_bucket_size": 2e8,
            "reduce_bucket_size": 2e8,
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "train_micro_batch_size_per_gpu": batch_size,
        "gradient_accumulation_steps": 1,
        "gradient_clipping": 1.0,
        "steps_per_print": 1000,
    }

# Preview config
import json
print(json.dumps(get_deepspeed_config(), indent=2))
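
The batch-size keys in the config combine with the number of workers: DeepSpeed's global batch size is `train_micro_batch_size_per_gpu` times `gradient_accumulation_steps` times the number of GPUs. The sketch below applies that rule to this notebook's settings (two workers, FashionMNIST's 60,000 training images). The helper names are illustrative only.

```python
import math


def effective_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, world_size: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


def optimizer_steps_per_epoch(dataset_len: int, micro_batch_per_gpu: int,
                              grad_accum_steps: int, world_size: int) -> int:
    """Optimizer steps per epoch when each rank sees an equal shard of the data."""
    samples_per_rank = math.ceil(dataset_len / world_size)
    micro_batches = math.ceil(samples_per_rank / micro_batch_per_gpu)
    return math.ceil(micro_batches / grad_accum_steps)


global_batch = effective_batch_size(64, 1, 2)
steps = optimizer_steps_per_epoch(60_000, 64, 1, 2)
print(f"global batch: {global_batch}, steps per epoch: {steps}")
```

With the defaults used here, each optimizer step consumes 128 samples and one epoch takes 469 steps per worker.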

## Step 4: DeepSpeed Checkpointing

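DeepSpeed has built-in checkpointing methods, which are simpler than FSDP2's DCP approach. Every rank must call them, because each rank writes and reads its own shard of the partitioned state: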
- `model.save_checkpoint(dir)` - Save checkpoint
- `model.load_checkpoint(dir)` - Load checkpoint

In [None]:
import ray.train
import torch.distributed as dist

def save_checkpoint(model_engine, metrics, epoch):
    """Save DeepSpeed checkpoint and report to Ray Train."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        model_engine.save_checkpoint(tmp_dir, tag=f"epoch_{epoch}", client_state={"epoch": epoch})
        dist.barrier()
        ray.train.report(metrics, checkpoint=ray.train.Checkpoint.from_directory(tmp_dir))

def load_checkpoint(model_engine, ckpt):
    """Load the latest DeepSpeed checkpoint and return its saved epoch (0 if none)."""
    with ckpt.as_directory() as ckpt_dir:
        tags = [d for d in os.listdir(ckpt_dir) if d.startswith("epoch_")]
        if tags:
            # Compare epoch numbers, not strings, so epoch_10 ranks above epoch_9
            latest = max(tags, key=lambda t: int(t.split("_")[1]))
            _, client_state = model_engine.load_checkpoint(ckpt_dir, tag=latest)
            return client_state.get("epoch", 0) if client_state else 0
    return 0
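
Checkpoint tags such as `epoch_10` do not sort numerically when compared as plain strings, which matters whenever a directory holds more than one tag. Below is a minimal sketch of picking the latest tag by epoch number; the helper name `latest_epoch_tag` is hypothetical and not part of DeepSpeed.

```python
def latest_epoch_tag(tags):
    """Return the tag with the highest epoch number, or None if there is none."""
    prefix = "epoch_"
    epoch_tags = [t for t in tags if t.startswith(prefix) and t[len(prefix):].isdigit()]
    if not epoch_tags:
        return None
    return max(epoch_tags, key=lambda t: int(t[len(prefix):]))


tags = ["epoch_2", "epoch_10", "epoch_9"]
print(sorted(tags)[-1])        # string comparison picks epoch_9
print(latest_epoch_tag(tags))  # numeric comparison picks epoch_10
```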

In [None]:
def save_model_for_inference(model_engine, world_rank):
    """Save consolidated model for inference (rank 0 only).

    Assumes ZeRO stage 1 or 2, where every rank still holds the full parameters.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        ckpt = None
        if world_rank == 0:
            torch.save(model_engine.module.state_dict(), os.path.join(tmp_dir, "full-model.pt"))
            ckpt = ray.train.Checkpoint.from_directory(tmp_dir)
        dist.barrier()
        ray.train.report({}, checkpoint=ckpt, checkpoint_dir_name="full_model")

## Step 5: Training Function

Key differences from FSDP2:
- `deepspeed.initialize()` creates model engine and optimizer
- `model.backward(loss)` replaces `loss.backward()`
- `model.step()` replaces `optimizer.step()`

In [None]:
import ray.train.torch
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader

def train_func(config):
    """DeepSpeed training function."""
    import os
    
    # Set DeepSpeed environment on worker before import
    os.environ["DS_BUILD_OPS"] = "0"
    os.environ["DS_SKIP_CUDA_CHECK"] = "1"
    
    import deepspeed
    
    # Model setup with DeepSpeed
    model = init_model()
    ds_config = get_deepspeed_config(batch_size=config.get('batch_size', 64), lr=config.get('lr', 0.001))
    model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config, model_parameters=model.parameters())
    device = model_engine.device
    
    criterion = CrossEntropyLoss()
    
    # Resume from checkpoint if available
    start_epoch = 0
    if ray.train.get_checkpoint():
        start_epoch = load_checkpoint(model_engine, ray.train.get_checkpoint()) + 1
    
    # Data loading with DistributedSampler
    transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
    train_data = FashionMNIST(root=tempfile.gettempdir(), train=True, download=True, transform=transform)
    sampler = torch.utils.data.DistributedSampler(
        train_data,
        num_replicas=ray.train.get_context().get_world_size(),
        rank=ray.train.get_context().get_world_rank(),
        shuffle=True,
    )
    train_loader = DataLoader(train_data, batch_size=config.get('batch_size', 64), sampler=sampler)
    
    world_rank = ray.train.get_context().get_world_rank()
    
    # Training loop
    for epoch in range(start_epoch, config.get('epochs', 1)):
        sampler.set_epoch(epoch)
        total_loss, num_batches = 0.0, 0
        
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            
            # DeepSpeed forward/backward
            outputs = model_engine(images)
            loss = criterion(outputs, labels)
            model_engine.backward(loss)
            model_engine.step()
            
            total_loss += loss.item()
            num_batches += 1
        
        avg_loss = total_loss / num_batches
        save_checkpoint(model_engine, {"loss": avg_loss, "epoch": epoch}, epoch)
        if world_rank == 0:
            print(f"Epoch {epoch}: loss={avg_loss:.4f}")
    
    # Save final model
    save_model_for_inference(model_engine, world_rank)
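
The `DistributedSampler` in `train_func` gives each worker a different slice of the dataset, and `sampler.set_epoch(epoch)` changes the shuffle from one epoch to the next. The sketch below is a simplified pure-Python picture of that behaviour, not PyTorch's implementation: it uses Python's `random` module in place of PyTorch's generator, and the function name `shard_indices` is hypothetical.

```python
import math
import random


def shard_indices(dataset_len, num_replicas, rank, epoch, seed=0):
    """Simplified view of how one rank picks its sample indices for an epoch."""
    indices = list(range(dataset_len))
    # Every rank seeds with (seed + epoch), so all ranks agree on the order
    random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating leading indices so every rank gets the same count
    total = math.ceil(dataset_len / num_replicas) * num_replicas
    indices += indices[: total - dataset_len]
    # Each rank takes a strided slice of the shared order
    return indices[rank:total:num_replicas]


shards = [shard_indices(10, num_replicas=4, rank=r, epoch=0) for r in range(4)]
print([len(s) for s in shards])                    # equal-sized shards
print(sorted(set().union(*shards)) == list(range(10)))  # every sample is covered
```

Because the shuffle depends on the epoch, skipping `set_epoch` would make every epoch iterate in the same order.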

## Step 6: Launch Distributed Training

Ray Train setup is nearly identical to the FSDP2 tutorial. The ZeRO stage, optimizer, and batch size all come from the DeepSpeed config, so only the scaling and run configs are set here.

In [None]:
import ray.train.torch

# Configuration
experiment_name = f"deepspeed_{uuid.uuid4().hex[:8]}"
scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True)

# Persist results and checkpoints to shared cluster storage.
# (The DeepSpeed environment variables are set inside train_func on each worker.)
run_config = ray.train.RunConfig(
    storage_path="/mnt/cluster_storage/",
    name=experiment_name,
)
train_config = {"epochs": 1, "lr": 0.001, "batch_size": 64}

print(f"Experiment: {experiment_name}")

In [None]:
# Run training
trainer = ray.train.torch.TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=scaling_config,
    train_loop_config=train_config,
    run_config=run_config,
)
result = trainer.fit()
print(f"Training complete! Checkpoint: {result.checkpoint}")

## Step 7: Inspect Training Artifacts

Expected artifacts under the experiment directory:
- Ray Train checkpoint directories, each containing an `epoch_N/` DeepSpeed checkpoint with model and optimizer states
- `full_model/` - Consolidated model for inference

In [None]:
# List artifacts
storage_path = f"/mnt/cluster_storage/{experiment_name}/"
print(f"Artifacts in {storage_path}:")
for item in sorted(os.listdir(storage_path)):
    print(f"  {item}/" if os.path.isdir(os.path.join(storage_path, item)) else f"  {item}")
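
The listing above shows only the top level. To see what sits inside each checkpoint directory, a small depth-limited walk can help. This is a generic standard-library sketch; the helper name `list_tree` is hypothetical, and the demo builds a throwaway directory rather than reading real training output.

```python
import os
import tempfile


def list_tree(root, max_depth=2):
    """Return relative paths under root, descending at most max_depth levels."""
    root = os.path.abspath(root)
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        prefix = "" if rel == "." else rel + os.sep
        dirnames.sort()  # visit subdirectories in a stable order
        entries += [prefix + d + os.sep for d in dirnames]
        entries += [prefix + f for f in sorted(filenames)]
        if depth + 1 >= max_depth:
            dirnames[:] = []  # stop os.walk from descending further
    return entries


# Demo on a temporary directory with a made-up layout
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b"))
    open(os.path.join(tmp, "top.txt"), "w").close()
    open(os.path.join(tmp, "a", "b", "deep.txt"), "w").close()
    demo_entries = list_tree(tmp, max_depth=2)
    print(demo_entries)
```

In the notebook, `list_tree(storage_path)` would apply the same walk to the experiment directory.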

## Step 8: Load Model for Inference

Loading is identical to FSDP2 - we saved a standard PyTorch checkpoint.

In [None]:
# Load model for inference
model_path = f"/mnt/cluster_storage/{experiment_name}/full_model/full-model.pt"
print(f"Loading from: {model_path}")

In [None]:
inference_model = init_model()
inference_model.load_state_dict(torch.load(model_path, map_location='cpu', weights_only=True))
inference_model.eval()
print("Model loaded.")

In [None]:
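# Test inference on one sample, using the same preprocessing as training
test_data = FashionMNIST(root="/tmp", train=False, download=True, transform=Compose([ToTensor(), Normalize((0.5,), (0.5,))]))
with torch.no_grad():
    image, label = test_data[0]   # indexing applies the transform: shape (1, 28, 28)
    sample = image.unsqueeze(0)   # add a batch dimension: shape (1, 1, 28, 28)
    output = inference_model(sample)
print(f"Inference output shape: {output.shape}")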
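
The model returns ten raw logits per image. The sketch below turns them into a class name and probability with a softmax. It uses plain Python lists and made-up logits so it runs without a trained model; in the notebook you could pass `output[0].tolist()` instead. The helper name `predict_label` is hypothetical, and the class list is written out here on the assumption that it matches FashionMNIST's standard label order.

```python
import math

FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def predict_label(logits):
    """Return (class_name, probability) for one list of ten logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return FASHION_MNIST_CLASSES[best], probs[best]


# Made-up logits for illustration only
name, prob = predict_label([0.1, -1.2, 0.3, 0.0, 0.2, -0.5, 0.4, 2.5, 0.0, 1.1])
print(f"predicted: {name} (p={prob:.2f})")
```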

## Optional: TensorBoard Profiling

Add PyTorch profiler to the training loop:

```python
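from torch.profiler import profile, schedule, ProfilerActivity, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # Skip 1 step, warm up for 1, then record 3 steps once
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./tensorboard"),
) as prof:
    for images, labels in train_loader:
        ...  # forward, backward, and step as in train_func
        prof.step()  # tell the profiler that one iteration has finished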
```

View with: `tensorboard --logdir=./tensorboard`

## Step 9: Cleanup

Clean up GPU and CPU memory.

In [None]:
import gc

# Cleanup
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
print("Cleanup complete.")

## Summary

This tutorial covered:
1. **DeepSpeed ZeRO** - Memory optimization via state partitioning
2. **Ray Train integration** - Multi-GPU distributed training
3. **Built-in checkpointing** - Simpler than FSDP2's DCP

**Key differences from FSDP2:**
- Configuration-based (dict) vs Python API
- DeepSpeed manages optimizer internally
- `model.backward(loss)` and `model.step()` API

**Next Steps:**
- Try ZeRO Stage 3 for larger models
- Enable CPU offloading for memory-constrained scenarios
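
As a starting point for both next steps, here is a sketch of a ZeRO Stage 3 config with CPU offload. It uses DeepSpeed's documented `offload_optimizer` and `offload_param` sections, but it has not been run in this notebook, and the function name `get_zero3_offload_config` is hypothetical. Treat the values as assumptions to tune.

```python
def get_zero3_offload_config(batch_size=64, lr=0.001):
    """Sketch of a ZeRO Stage 3 config with CPU offload (untested here)."""
    return {
        "optimizer": {
            "type": "Adam",
            "params": {"lr": lr, "betas": [0.9, 0.999], "eps": 1e-8},
        },
        "fp16": {"enabled": False},
        "zero_optimization": {
            "stage": 3,  # partition parameters as well as gradients and optimizer state
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "offload_param": {"device": "cpu", "pin_memory": True},
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "train_micro_batch_size_per_gpu": batch_size,
        "gradient_accumulation_steps": 1,
        "gradient_clipping": 1.0,
    }


zero3_config = get_zero3_offload_config()
print(zero3_config["zero_optimization"]["stage"])
```

Two caveats apply. Under Stage 3 each rank holds only a shard of the parameters, so `save_model_for_inference` would need to gather the full weights before saving. CPU offload also typically relies on DeepSpeed's CPU Adam op, which may need to be compiled on the workers.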

**Resources:**
- [DeepSpeed Documentation](https://www.deepspeed.ai/)
- [Ray Train DeepSpeed Guide](https://docs.ray.io/en/latest/train/deepspeed.html)