# Lab 3: Experiment Tracking with WandB on SageMaker

## Overview
Learn how to integrate Weights & Biases (WandB) with SageMaker for advanced experiment tracking, visualization, and collaboration. This lab demonstrates multi-GPU training with comprehensive monitoring.

## Learning Objectives
- Set up WandB integration with SageMaker
- Track experiments across multiple training runs
- Visualize training metrics in real time
- Compare different model architectures and hyperparameters
- Share results with team members

## Prerequisites
- Completed Lab 1 and Lab 2
- WandB account (free tier available at wandb.ai)
- WandB API key

**Estimated Time:** 45-60 minutes

## Why Use WandB?

| Feature | TensorBoard | MLflow | WandB |
|---------|-------------|--------|-------|
| Real-time Tracking | ✓ | ✗ | ✓ |
| Cloud Hosting | ✗ | Self-hosted | ✓ |
| Collaboration | ✗ | Limited | ✓ |
| Hyperparameter Sweeps | ✗ | ✗ | ✓ |
| Model Registry | ✗ | ✓ | ✓ |
| Artifacts Tracking | Limited | ✓ | ✓ |

**Use WandB for:**
- Team collaboration and sharing
- Comparing multiple experiments
- Hyperparameter optimization
- Production model tracking

## Step 1: Set Up a WandB Account

1. Go to https://wandb.ai/signup
2. Create a free account
3. Get your API key from https://wandb.ai/authorize
4. Create a new project: "medical-segmentation-workshop"

In [None]:
# Store your WandB API key
import getpass

wandb_api_key = getpass.getpass("Enter your WandB API key: ")
wandb_project = "medical-segmentation-workshop"

print(f"✓ WandB configured for project: {wandb_project}")
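
The cell above always prompts for the key. A minimal sketch of a non-interactive alternative follows, assuming the key may already be exported as `WANDB_API_KEY` (the environment variable the WandB client reads). The helper names `resolve_api_key` and `mask_key` are hypothetical and not part of the lab code.

```python
import os
from typing import Callable, Mapping


def resolve_api_key(env: Mapping[str, str], prompt: Callable[[], str]) -> str:
    """Return the key from WANDB_API_KEY if set, otherwise ask the user."""
    key = env.get("WANDB_API_KEY", "").strip()
    if not key:
        # Fall back to the interactive prompt only when nothing is exported
        key = prompt().strip()
    if not key:
        raise ValueError("No WandB API key provided")
    return key


def mask_key(key: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the key is safe to print."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


# Typical notebook usage (commented out so the sketch has no side effects):
# import getpass
# wandb_api_key = resolve_api_key(
#     os.environ, lambda: getpass.getpass("Enter your WandB API key: ")
# )
# print(f"Using key {mask_key(wandb_api_key)}")
```

Passing the environment and the prompt as arguments keeps the helper easy to exercise without a terminal, and masking avoids leaving the full key in saved notebook output.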

## Step 2: Set Up the SageMaker Environment

In [None]:
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()
region = sagemaker_session.boto_region_name
bucket = sagemaker_session.default_bucket()

print(f"Region: {region}")
print(f"Bucket: {bucket}")

## Step 3: Configure Data Paths

In [None]:
data_path = f's3://{bucket}/medical-imaging/data'
output_path = f's3://{bucket}/medical-imaging/wandb-output'

print(f"Training Data: {data_path}")
print(f"Output Path: {output_path}")

## Step 4: Experiment 1 - SegResNet Baseline

Train a baseline model with SegResNet architecture.

In [None]:
hyperparameters_segresnet = {
    "model_name": "SegResNet",
    "batch_size": 4,
    "epochs": 20,
    "lr": 1e-4,
    "use_wandb": True,
    "use_mlflow": False,
    "wandb_project": wandb_project,
    "wandb_api_key": wandb_api_key  # Note: hyperparameters are stored in plain text with the training job
}

estimator_segresnet = PyTorch(
    entry_point="train_ddp_all.py",
    source_dir="../code",
    role=role,
    instance_count=1,
    instance_type="ml.g5.2xlarge",  # 1 GPU
    framework_version="2.5.1",
    py_version="py311",
    hyperparameters=hyperparameters_segresnet,
    output_path=output_path,
    base_job_name="segresnet-baseline",
    keep_alive_period_in_seconds=1800
)

print("✓ Experiment 1: SegResNet Baseline configured")

In [None]:
# Launch training
estimator_segresnet.fit({"training": data_path}, wait=True, logs="All")
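
In script mode, SageMaker passes each hyperparameter to the entry point as a command-line argument, with every value serialized as a string. The lab does not show `train_ddp_all.py`, so the following is only a sketch of how such a script might parse the dictionary above. The parser layout and the `str_to_bool` helper are assumptions, not the lab's actual code.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert 'True'/'False' strings to booleans (bool('False') would be True)."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv=None):
    """Parse the hyperparameters used in this lab (defaults are illustrative)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, default="SegResNet")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--use_wandb", type=str_to_bool, default=False)
    parser.add_argument("--use_mlflow", type=str_to_bool, default=False)
    parser.add_argument("--wandb_project", type=str, default="")
    parser.add_argument("--wandb_api_key", type=str, default="")
    # parse_known_args tolerates extra arguments the script does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--model_name", "SwinUNETR", "--use_wandb", "True"])
print(args.model_name, args.use_wandb, args.batch_size)
```

The explicit boolean converter matters because `type=bool` would treat the string `"False"` as true.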

## Step 5: Experiment 2 - SwinUNETR with Higher Learning Rate

Test a larger model with different hyperparameters.

In [None]:
hyperparameters_swin = {
    "model_name": "SwinUNETR",
    "batch_size": 2,
    "epochs": 20,
    "lr": 5e-4,  # Higher learning rate
    "use_wandb": True,
    "use_mlflow": False,
    "wandb_project": wandb_project,
    "wandb_api_key": wandb_api_key
}

estimator_swin = PyTorch(
    entry_point="train_ddp_all.py",
    source_dir="../code",
    role=role,
    instance_count=1,
    instance_type="ml.g5.2xlarge",
    framework_version="2.5.1",
    py_version="py311",
    hyperparameters=hyperparameters_swin,
    output_path=output_path,
    base_job_name="swinunetr-high-lr",
    keep_alive_period_in_seconds=1800
)

print("✓ Experiment 2: SwinUNETR configured")

In [None]:
# Launch training
estimator_swin.fit({"training": data_path}, wait=True, logs="All")
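
Experiment 2 differs from the baseline in more than the learning rate. As a sketch, deriving each experiment from a shared base dictionary makes the differences explicit. The helper `make_hyperparameters` is hypothetical, and the API key is left out of the example on purpose.

```python
def make_hyperparameters(base: dict, **overrides) -> dict:
    """Return a copy of `base` with overrides applied, rejecting unknown keys."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"Unknown hyperparameters: {sorted(unknown)}")
    return {**base, **overrides}


# Values copied from the Experiment 1 cell (API key omitted in this sketch)
BASE = {
    "model_name": "SegResNet",
    "batch_size": 4,
    "epochs": 20,
    "lr": 1e-4,
    "use_wandb": True,
    "use_mlflow": False,
    "wandb_project": "medical-segmentation-workshop",
}

segresnet = make_hyperparameters(BASE)
swin = make_hyperparameters(BASE, model_name="SwinUNETR", batch_size=2, lr=5e-4)

# Show which settings differ between the two experiments
changed = {k: (segresnet[k], swin[k]) for k in BASE if segresnet[k] != swin[k]}
print(changed)
```

Three settings change at once (architecture, batch size, and learning rate), so a difference in Dice score between the two runs cannot be attributed to the learning rate alone.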

## Step 6: Experiment 3 - Multi-GPU Training

Scale to 4 GPUs with DDP for faster training.

In [None]:
hyperparameters_multigpu = {
    "model_name": "SwinUNETR",
    "batch_size": 2,
    "epochs": 20,
    "lr": 1e-4,
    "use_wandb": True,
    "use_mlflow": False,
    "wandb_project": wandb_project,
    "wandb_api_key": wandb_api_key
}

estimator_multigpu = PyTorch(
    entry_point="train_ddp_all.py",
    source_dir="../code",
    role=role,
    instance_count=1,
    instance_type="ml.g5.12xlarge",  # 4 GPUs
    framework_version="2.5.1",
    py_version="py311",
    hyperparameters=hyperparameters_multigpu,
    output_path=output_path,
    base_job_name="swinunetr-4gpu",
    keep_alive_period_in_seconds=1800,
    distribution={
        "pytorchddp": {
            "enabled": True
        }
    }
)

print("✓ Experiment 3: Multi-GPU DDP configured")

In [None]:
# Launch training
estimator_multigpu.fit({"training": data_path}, wait=True, logs="All")
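
When comparing this run with the single-GPU runs, keep the effective batch size in mind. The sketch below assumes that `batch_size` in `train_ddp_all.py` is applied per process (one process per GPU), which is the usual DDP setup but is not confirmed by the lab. The linear learning-rate scaling shown is a common heuristic, not something the lab prescribes.

```python
def effective_batch_size(per_gpu_batch: int, gpus_per_instance: int,
                         instance_count: int = 1) -> int:
    """Samples processed per optimizer step across all DDP workers."""
    return per_gpu_batch * gpus_per_instance * instance_count


def linearly_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling heuristic: scale the learning rate with the batch size."""
    return base_lr * new_batch / base_batch


single_gpu = effective_batch_size(2, 1)   # ml.g5.2xlarge, 1 GPU
multi_gpu = effective_batch_size(2, 4)    # ml.g5.12xlarge, 4 GPUs
print(f"Effective batch size: {single_gpu} -> {multi_gpu}")
print(f"Heuristic learning rate: {linearly_scaled_lr(1e-4, single_gpu, multi_gpu):.1e}")
```

Under that assumption, Experiment 3 trains with four times the effective batch size of Experiment 2 at the configured `lr` of 1e-4, which is worth noting when reading the WandB loss curves side by side.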

## Step 7: Analyze Results in WandB

View and compare all experiments in the WandB dashboard.

In [None]:
import wandb

# Log in to WandB
wandb.login(key=wandb_api_key)

# Build the project URL from the default entity of the logged-in account
entity = wandb.Api().default_entity
project_url = f"https://wandb.ai/{entity}/{wandb_project}"
print("\n🎯 View your experiments:")
print(f"   {project_url}")
print("\n📊 Compare runs:")
print(f"   {project_url}/table")

## Step 8: Programmatic Analysis with WandB API

In [None]:
import pandas as pd

# Fetch runs from WandB (the entity comes from the public API client)
api = wandb.Api()
runs = api.runs(f"{api.default_entity}/{wandb_project}")

# Create comparison table
summary_list = []
for run in runs:
    summary_list.append({
        "name": run.name,
        "model": run.config.get("model_name"),
        "batch_size": run.config.get("batch_size"),
        "lr": run.config.get("lr"),
        "best_dice": run.summary.get("val/best_dice", 0),
        "duration": run.summary.get("_runtime", 0) / 60,  # minutes
        "state": run.state
    })

df = pd.DataFrame(summary_list)
df = df.sort_values("best_dice", ascending=False)

print("\n📈 Experiment Comparison:")
print(df.to_string(index=False))

# Find best model
best_run = df.iloc[0]
print("\n🏆 Best Model:")
print(f"   Name: {best_run['name']}")
print(f"   Model: {best_run['model']}")
print(f"   Dice Score: {best_run['best_dice']:.4f}")
print(f"   Training Time: {best_run['duration']:.1f} minutes")
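
The comparison above ranks every run in the project, including crashed runs and runs that never logged `val/best_dice` (which default to 0). A minimal sketch of a stricter ranking follows, using plain dictionaries shaped like the rows in `summary_list`. The function `rank_runs` is hypothetical and the sample values are placeholders, not results from this lab.

```python
def rank_runs(rows, metric="best_dice"):
    """Keep finished runs that logged the metric, sorted best first."""
    usable = [
        row for row in rows
        if row.get("state") == "finished" and row.get(metric) is not None
    ]
    return sorted(usable, key=lambda row: row[metric], reverse=True)


# Placeholder rows for illustration only (not real measurements)
sample_rows = [
    {"name": "run-a", "state": "finished", "best_dice": 0.80},
    {"name": "run-b", "state": "crashed", "best_dice": 0.90},
    {"name": "run-c", "state": "finished", "best_dice": None},
    {"name": "run-d", "state": "finished", "best_dice": 0.85},
]

ranked = rank_runs(sample_rows)
if ranked:
    print(f"Best finished run: {ranked[0]['name']}")
else:
    print("No finished runs with a logged metric yet")
```

To use this with the real data, collect `run.summary.get("val/best_dice")` without a default so that missing metrics stay `None`, and check for an empty result before indexing the first row.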

## Step 9: Download Best Model

In [None]:
# Get best run details
best_run_name = df.iloc[0]['name']
best_run_obj = [r for r in runs if r.name == best_run_name][0]

# Download artifacts
print(f"Downloading artifacts from: {best_run_name}")
for artifact in best_run_obj.logged_artifacts():
    artifact.download()
    print(f"  ✓ Downloaded: {artifact.name}")
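
The loop above downloads every artifact the run logged. If only model checkpoints are needed, the artifacts can be filtered by their `type` attribute first. The sketch below assumes the training script logs checkpoints with the artifact type `"model"`, which the lab does not confirm. The function name is hypothetical and works on any objects exposing `name`, `type`, and `download(root=...)`.

```python
from pathlib import Path


def download_artifacts_of_type(artifacts, wanted_type, root):
    """Download only artifacts of one type, each into its own subdirectory."""
    paths = []
    for artifact in artifacts:
        if artifact.type != wanted_type:
            continue
        # Artifact names look like "name:v0"; make them safe directory names
        safe_name = artifact.name.replace(":", "-").replace("/", "-")
        target = Path(root) / safe_name
        paths.append(artifact.download(root=str(target)))
    return paths


# Typical usage with the objects from the cell above (commented out):
# paths = download_artifacts_of_type(
#     best_run_obj.logged_artifacts(), "model", "artifacts"
# )
```

Giving each artifact its own directory prevents files from different artifact versions from overwriting one another.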

## WandB Features Demonstrated

### 1. Real-time Monitoring
- Live training metrics
- GPU utilization
- System metrics (CPU, memory)

### 2. Experiment Comparison
- Side-by-side metric plots
- Hyperparameter correlation
- Performance tables

### 3. Collaboration
- Share project links with team
- Comment on runs
- Create reports

### 4. Artifact Tracking
- Model checkpoints
- Training logs
- Predictions and visualizations

## Key Takeaways

✓ **WandB Benefits:**
- Zero-setup cloud hosting
- Real-time collaboration
- Comprehensive experiment tracking
- Easy hyperparameter comparison

✓ **Best Practices:**
- Tag runs with meaningful names
- Use groups for related experiments
- Log custom metrics and artifacts
- Create reports for stakeholders

✓ **Integration Tips:**
- Store API key in AWS Secrets Manager
- Use WandB sweeps for hyperparameter tuning
- Enable artifact versioning
- Set up alerts for failed runs
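
As a sketch of the first tip, the key can be read from AWS Secrets Manager instead of being typed into the notebook. The function below takes a client created with `boto3.client("secretsmanager")` and calls its `get_secret_value` method. The secret name `wandb/api-key` and the JSON field `api_key` are hypothetical, and the execution role is assumed to have `secretsmanager:GetSecretValue` permission on the secret.

```python
import json


def load_wandb_key(secrets_client, secret_id, json_field=None):
    """Fetch a secret string, optionally extracting one field from a JSON secret."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = response["SecretString"]
    if json_field is not None:
        # Secrets created as key/value pairs are stored as a JSON document
        secret = json.loads(secret)[json_field]
    return secret


# Typical usage (commented out; requires boto3 and AWS credentials):
# import boto3
# client = boto3.client("secretsmanager")
# wandb_api_key = load_wandb_key(client, "wandb/api-key", json_field="api_key")
```

Accepting the client as a parameter keeps the helper independent of region and credential setup, and the same function can run inside the training container so the key never has to travel as a hyperparameter.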

## Cost Analysis

| Experiment | Instance | Duration | Cost |
|------------|----------|----------|------|
| SegResNet | ml.g5.2xlarge | 30 min | $1.41 |
| SwinUNETR | ml.g5.2xlarge | 45 min | $2.11 |
| Multi-GPU | ml.g5.12xlarge | 20 min | $2.36 |
| **Total** | | | **$5.88** |
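
The arithmetic behind the table can be checked with a few lines. The sketch below only uses the durations and costs listed above and derives the hourly rate each row implies. It does not look up AWS prices, which vary by region and over time.

```python
# (experiment, instance type, minutes, cost in USD) copied from the table above
EXPERIMENTS = [
    ("SegResNet", "ml.g5.2xlarge", 30, 1.41),
    ("SwinUNETR", "ml.g5.2xlarge", 45, 2.11),
    ("Multi-GPU", "ml.g5.12xlarge", 20, 2.36),
]


def implied_hourly_rate(minutes: float, cost: float) -> float:
    """Hourly rate implied by a run's duration and cost."""
    return cost / (minutes / 60)


total = round(sum(cost for *_, cost in EXPERIMENTS), 2)

for name, instance, minutes, cost in EXPERIMENTS:
    rate = implied_hourly_rate(minutes, cost)
    print(f"{name:<10} {instance:<15} {rate:.2f} USD/hour")
print(f"Total: ${total:.2f}")
```

The multi-GPU run implies a higher hourly rate but finishes in less time, which is why its total cost stays close to the 45-minute single-GPU run.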

**WandB Cost:** Free tier (100 GB storage and unlimited runs as stated for this lab; check the WandB pricing page for current limits)

## Next Steps

- Set up automated hyperparameter sweeps
- Deploy best model to SageMaker Endpoint
- Create WandB reports for stakeholders
- Integrate with CI/CD pipeline
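
As a starting point for the first item, a sweep is described by a configuration dictionary with a search `method`, a target `metric`, and a set of `parameters`. The sketch below reuses the metric name and hyperparameter values from this lab. The choice of a grid search and the helper `grid_size` are assumptions for illustration.

```python
# Sweep definition in the dictionary format WandB sweeps expect
sweep_config = {
    "method": "grid",
    "metric": {"name": "val/best_dice", "goal": "maximize"},
    "parameters": {
        "model_name": {"values": ["SegResNet", "SwinUNETR"]},
        "lr": {"values": [1e-4, 5e-4]},
        "batch_size": {"values": [2, 4]},
    },
}


def grid_size(config: dict) -> int:
    """Number of runs a grid search over the listed values would launch."""
    size = 1
    for spec in config["parameters"].values():
        size *= len(spec["values"])
    return size


print(f"Grid search would launch {grid_size(sweep_config)} runs")
```

Registering the sweep is then a call to `wandb.sweep(sweep_config, project=wandb_project)`, and `wandb.agent` runs the trials. Counting the grid first helps estimate cost before launching, since each trial is a full training job.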

## Additional Resources

- [WandB Documentation](https://docs.wandb.ai/)
- [SageMaker + WandB Guide](https://docs.wandb.ai/guides/integrations/sagemaker)
- [Hyperparameter Sweeps](https://docs.wandb.ai/guides/sweeps)