# DeepSpeed API - Minimal Demonstration

This notebook provides a minimal demonstration of DeepSpeed APIs for distributed training.

## Overview

We'll cover:
1. DeepSpeed configuration creation
2. Loading a Vision Transformer model
3. Model initialization with DeepSpeed
4. ZeRO optimization stages
5. GPU memory monitoring
6. The basic training loop pattern

For complete examples, see `DeepSpeed_FSDP.example_experiments.ipynb`.


In [None]:
# Import required libraries
import json
import torch
import deepspeed
from transformers import ViTForImageClassification

# Import utility functions
from DeepSpeed_FSDP_utils import (
    load_vit_model,
    create_deepspeed_config,
    save_deepspeed_config,
    initialize_deepspeed_model,
    get_gpu_memory_stats
)

print("[OK] Imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")


## 1. Creating DeepSpeed Configurations

DeepSpeed uses JSON configuration files to specify optimization settings. We'll create configs for different ZeRO stages.
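
As a rough sketch of what such a configuration contains, the dictionary below uses the standard DeepSpeed keys that this notebook reads back later (`train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`, `zero_optimization.stage`, `bf16.enabled`). The function names `sketch_stage2_config` and `effective_batch_size`, the AdamW optimizer block, and the GPU count of 2 are illustrative assumptions; the actual output of `create_deepspeed_config` may differ.

```python
import json

def sketch_stage2_config(micro_batch_size=8, gradient_accumulation_steps=4,
                         learning_rate=1e-4, weight_decay=0.01):
    """Hypothetical stand-in for create_deepspeed_config (ZeRO Stage 2, bf16)."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": gradient_accumulation_steps,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 2},
        "optimizer": {
            "type": "AdamW",
            "params": {"lr": learning_rate, "weight_decay": weight_decay},
        },
    }

def effective_batch_size(config, world_size):
    """Global batch size = micro batch per GPU x accumulation steps x number of GPUs."""
    return (config["train_micro_batch_size_per_gpu"]
            * config["gradient_accumulation_steps"]
            * world_size)

sketch = sketch_stage2_config()
print(json.dumps(sketch, indent=2))
print("Effective batch size on 2 GPUs:", effective_batch_size(sketch, world_size=2))
```

The micro batch size and accumulation steps multiply with the number of GPUs, so the same config file yields a different global batch size when the world size changes.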


In [None]:
# Create ZeRO Stage 2 configuration
config_stage2 = create_deepspeed_config(
    zero_stage=2,
    micro_batch_size=8,
    gradient_accumulation_steps=4,
    offload_optimizer=False,
    use_bf16=True,
    learning_rate=1e-4,
    weight_decay=0.01
)

print("ZeRO Stage 2 Configuration:")
print(json.dumps(config_stage2, indent=2))


In [None]:
# Create ZeRO Stage 3 configuration (maximum memory optimization)
config_stage3 = create_deepspeed_config(
    zero_stage=3,
    micro_batch_size=8,
    gradient_accumulation_steps=4,
    offload_optimizer=False,
    offload_param=False,
    use_bf16=True,
    learning_rate=1e-4,
    weight_decay=0.01
)

print("ZeRO Stage 3 Configuration:")
print(json.dumps(config_stage3, indent=2))
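
The `offload_optimizer` and `offload_param` flags are both off in the call above. The sketch below shows how they might map onto the `zero_optimization` section if enabled, assuming the helper uses DeepSpeed's documented `device` and `pin_memory` keys. `sketch_zero_section` is a hypothetical name used only for illustration.

```python
def sketch_zero_section(zero_stage, offload_optimizer=False, offload_param=False):
    """Hypothetical builder for the "zero_optimization" block of a DeepSpeed config."""
    if offload_param and zero_stage != 3:
        # Parameter offload is only available with ZeRO Stage 3.
        raise ValueError("offload_param requires zero_stage=3")
    section = {"stage": zero_stage}
    if offload_optimizer:
        section["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    if offload_param:
        section["offload_param"] = {"device": "cpu", "pin_memory": True}
    return section

print(sketch_zero_section(3))
print(sketch_zero_section(3, offload_optimizer=True, offload_param=True))
```

Offloading trades GPU memory for host memory and extra CPU-GPU transfers, so it is usually enabled only when the model states do not fit otherwise.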


In [None]:
# Save configurations to files
save_deepspeed_config(config_stage2, "deepspeed_config_stage2.json")
save_deepspeed_config(config_stage3, "deepspeed_config_stage3.json")

print("[OK] Configurations saved!")
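
The `save_deepspeed_config` helper is not shown in this notebook. A minimal sketch, assuming it simply serializes the dictionary with `json.dump`, is given below together with a round-trip check. `sketch_save_config` and the small example config are hypothetical.

```python
import json
import os
import tempfile

def sketch_save_config(config, path):
    """Hypothetical stand-in for save_deepspeed_config: write the dict as JSON."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)

config = {"train_micro_batch_size_per_gpu": 8, "zero_optimization": {"stage": 2}}

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "deepspeed_config_stage2.json")
    sketch_save_config(config, path)
    with open(path) as f:
        reloaded = json.load(f)

print("Round trip preserved config:", reloaded == config)
```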


## 2. Loading Vision Transformer Model

We'll load a pre-trained ViT model and prepare it for DeepSpeed training.


In [None]:
# Load ViT model (using smaller base model for demonstration)
model = load_vit_model(
    model_name="google/vit-base-patch16-224",
    num_labels=101,  # Food-101 has 101 classes
    torch_dtype=torch.bfloat16,
    enable_gradient_checkpointing=True
)

print(f"[OK] Model loaded: {type(model)}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")


## 3. Initializing DeepSpeed Engine

DeepSpeed wraps the model in an engine object that handles distributed communication, mixed precision, and optimizer stepping.
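
The `initialize_deepspeed_model` helper used in this notebook is not shown. A minimal sketch of what such a helper might wrap, assuming it calls `deepspeed.initialize` directly, follows. `deepspeed.initialize` returns a four-element tuple (engine, optimizer, training dataloader, LR scheduler); dropping the dataloader to return three values is an assumption about the helper, and `sketch_initialize_engine` is a hypothetical name. The function is only defined here, not called, because it needs a launched distributed environment.

```python
def sketch_initialize_engine(model, config_path):
    """Hypothetical stand-in for initialize_deepspeed_model."""
    # Imported lazily so the function can be defined without a distributed setup.
    import deepspeed

    model_engine, optimizer, _dataloader, lr_scheduler = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=config_path,  # path to a JSON config file (a dict also works)
    )
    # lr_scheduler is None when the config defines no scheduler.
    return model_engine, optimizer, lr_scheduler
```

Scripts that call a function like this are typically started with the `deepspeed` launcher rather than plain `python`, so that each GPU gets its own process.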


In [None]:
# Note: DeepSpeed initialization requires a distributed environment.
# This cell only prints the pattern; actual initialization happens in training scripts.

print("DeepSpeed initialization pattern:")
print("""
# In distributed training script:
import deepspeed

# Initialize distributed
deepspeed.init_distributed()

# Initialize DeepSpeed engine
model_engine, optimizer, lr_scheduler = initialize_deepspeed_model(
    model=model,
    config_path="deepspeed_config_stage3.json"
)

# The model_engine exposes additional methods and attributes:
# - model_engine.backward(loss)
# - model_engine.step()
# - model_engine.device
""")


## 4. Understanding ZeRO Stages

Different ZeRO stages provide different levels of memory optimization:


In [None]:
import json

# Compare different ZeRO stages
stages = {
    "Stage 0": "No optimization (baseline DDP)",
    "Stage 1": "Partitions optimizer states (up to ~4x memory reduction)",
    "Stage 2": "Partitions optimizer states + gradients (up to ~8x reduction)",
    "Stage 3": "Partitions optimizer states + gradients + parameters (reduction scales with GPU count)"
}

print("ZeRO Optimization Stages:")
for stage, description in stages.items():
    print(f"  {stage}: {description}")

# Show configuration differences
print("\n" + "="*60)
print("Configuration Comparison:")
print("="*60)

configs = {
    "Stage 2": config_stage2,
    "Stage 3": config_stage3
}

for name, config in configs.items():
    print(f"\n{name}:")
    print(f"  ZeRO Stage: {config['zero_optimization']['stage']}")
    print(f"  Micro Batch Size: {config['train_micro_batch_size_per_gpu']}")
    print(f"  Gradient Accumulation: {config['gradient_accumulation_steps']}")
    print(f"  BF16 Enabled: {config.get('bf16', {}).get('enabled', False)}")
    if 'offload_optimizer' in config['zero_optimization']:
        print("  CPU Offload Optimizer: True")
    if 'offload_param' in config['zero_optimization']:
        print("  CPU Offload Parameters: True")
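
The reduction factors quoted above come from the ZeRO memory model for mixed-precision Adam training, where each parameter costs 2 bytes for 16-bit weights, 2 bytes for 16-bit gradients, and 12 bytes for optimizer states. The sketch below evaluates that model per GPU. It ignores activations and temporary buffers, and the figure of 86 million parameters for ViT-Base is an approximation used only for illustration.

```python
def zero_model_state_bytes(num_params, num_gpus, stage):
    """Per-GPU bytes for model states under the ZeRO memory model (mixed-precision Adam)."""
    params = 2 * num_params   # 16-bit weights
    grads = 2 * num_params    # 16-bit gradients
    optim = 12 * num_params   # fp32 weights + Adam momentum + Adam variance
    if stage >= 1:
        optim /= num_gpus     # Stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus     # Stage 2 also partitions gradients
    if stage >= 3:
        params /= num_gpus    # Stage 3 also partitions parameters
    return params + grads + optim

NUM_PARAMS = 86_000_000  # approximate ViT-Base size (assumption)
for stage in range(4):
    gb = zero_model_state_bytes(NUM_PARAMS, num_gpus=8, stage=stage) / 1024**3
    print(f"Stage {stage}: {gb:.2f} GB per GPU on 8 GPUs")
```

On 8 GPUs this model gives savings of roughly 2.9x for Stage 1 and 4.3x for Stage 2; the ~4x and ~8x figures are the limits approached as the GPU count grows, while the Stage 3 saving equals the number of GPUs.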


## 5. GPU Memory Statistics

Monitor GPU memory usage to understand DeepSpeed's memory optimization:


In [None]:
# Check GPU memory stats
if torch.cuda.is_available():
    memory_stats = get_gpu_memory_stats(device_id=0)
    print("GPU Memory Statistics:")
    for key, value in memory_stats.items():
        print(f"  {key}: {value:.2f} GB")
else:
    print("CUDA not available")


## 6. Training Pattern with DeepSpeed

Here's the typical training loop pattern with DeepSpeed:
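
Before the full pattern, one detail is worth isolating: the loop calls `backward` and `step` on every micro-batch, but DeepSpeed applies an optimizer update only once per `gradient_accumulation_steps` micro-batches. The pure-Python sketch below mimics that bookkeeping with a toy class. `ToyEngine` is hypothetical and does not reproduce DeepSpeed's internals.

```python
class ToyEngine:
    """Toy model of accumulation bookkeeping; not DeepSpeed's actual engine."""

    def __init__(self, gradient_accumulation_steps):
        self.accum_steps = gradient_accumulation_steps
        self.micro_steps = 0        # micro-batches seen so far
        self.optimizer_updates = 0  # weight updates actually applied
        self.pending_grad = 0.0     # stand-in for accumulated gradients

    def backward(self, loss):
        # Scale each micro-batch loss so the accumulated value is an average.
        self.pending_grad += loss / self.accum_steps

    def step(self):
        self.micro_steps += 1
        if self.micro_steps % self.accum_steps == 0:
            self.optimizer_updates += 1  # boundary reached: apply the update
            self.pending_grad = 0.0      # then clear the accumulated gradients

engine = ToyEngine(gradient_accumulation_steps=4)
for micro_batch in range(10):
    engine.backward(loss=1.0)
    engine.step()

print("Micro-batches:", engine.micro_steps)
print("Optimizer updates:", engine.optimizer_updates)
```

The practical consequence is that the training loop needs no manual accumulation counter; the full loop pattern follows.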


In [None]:
print("""
# Typical DeepSpeed Training Loop:

for epoch in range(num_epochs):
    # Set epoch for distributed sampler
    if hasattr(train_loader.sampler, 'set_epoch'):
        train_loader.sampler.set_epoch(epoch)
    
    # Training
    model_engine.train()
    for batch_idx, (images, labels) in enumerate(train_loader):
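        # Cast inputs to bf16 to match the model weights (assumes bf16 is enabled in the config)
        images = images.to(model_engine.device, dtype=torch.bfloat16)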
        labels = labels.to(model_engine.device)
        
        # Forward pass
        loss = model_engine(images, labels=labels).loss
        
        # Backward pass; step() applies the optimizer update only at gradient accumulation boundaries
        model_engine.backward(loss)
        model_engine.step()
    
    # Validation
    accuracy, metrics = evaluate_deepspeed(
        model_engine=model_engine,
        val_loader=val_loader,
        rank=rank
    )
""")


## Summary

This notebook demonstrated:
1. Creating DeepSpeed configurations for different ZeRO stages
2. Loading Vision Transformer models
3. Understanding DeepSpeed initialization
4. Comparing ZeRO optimization stages
5. Monitoring GPU memory usage
6. Training loop patterns

For a complete working example with all experiments, see `DeepSpeed_FSDP.example_experiments.ipynb`.
