## Ray Train Observability

This notebook walks through the observability features in Ray Train.

<div class="alert alert-block alert-info">

<b> Here is the roadmap for this notebook:</b>

<ul>
    <li><b>Part 1:</b> Starting with a sample distributed training loop</li>
    <li><b>Part 2:</b> Using the Ray dashboard</li>
    <li><b>Part 3:</b> Monitoring throughput</li>
    <li><b>Part 4:</b> Profiling the training loop 
        <ul>
            <li><b>Part 4.1:</b> Operator view</li>
            <li><b>Part 4.2:</b> Trace view</li>
            <li><b>Part 4.3:</b> Kernel view</li>
            <li><b>Part 4.4:</b> Memory view</li>
        </ul>
    </li>
    <li><b>Part 5:</b> Adding Ray Data to the mix</li>
</ul>
</div>

## Imports

In [None]:
import os   
import tempfile
import time

import numpy as np
import pandas as pd
import torch
import torchmetrics
from PIL import Image
from torch.nn import CrossEntropyLoss
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision.models import VisionTransformer
from torchvision.datasets import CIFAR10
from torchvision.transforms import ToTensor, Normalize, Compose

import ray
from ray.train import ScalingConfig, RunConfig
from ray.train.torch import TorchTrainer

## 1. Starting with a sample distributed training loop

Below is a sample distributed training loop that uses Ray Train and PyTorch. We will use this training loop to demonstrate the different observability features in Ray Train.

In [None]:
def train_loop_ray_train(config: dict):  # pass in hyperparameters in config
    criterion = CrossEntropyLoss()
    # Use Ray Train to wrap the model with DistributedDataParallel
    model = load_model_ray_train()
    optimizer = Adam(model.parameters(), lr=1e-3)
    
    # Calculate the batch size for each worker
    global_batch_size = config["global_batch_size"]
    batch_size = global_batch_size // ray.train.get_context().get_world_size()
    # Use Ray Train to prepare the data loader with a DistributedSampler
    data_loader = build_data_loader_ray_train(batch_size=batch_size) 
    
    acc = torchmetrics.Accuracy(task="multiclass", num_classes=10).to(model.device)

    for epoch in range(config["num_epochs"]):
        # Set the epoch on the sampler so the data is shuffled differently in each epoch
        data_loader.sampler.set_epoch(epoch)

        for images, labels in data_loader: # images, labels are now sharded across the workers
            outputs = model(images)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()  # DDP averages the gradients across the workers
            optimizer.step()
            acc(outputs, labels)

        accuracy = acc.compute() # accuracy is now aggregated across the workers

        # Collect the metrics and print them on the rank 0 worker
        metrics = print_metrics_ray_train(loss, accuracy, epoch)

        # Use Ray Train to save checkpoint and metrics
        save_checkpoint_and_metrics_ray_train(model, metrics)
        acc.reset() 

<div class="alert alert-info">

**On aggregating evaluation metrics from different workers**

Ray Train natively supports [TorchMetrics](https://lightning.ai/docs/torchmetrics/stable/), which provides a collection of machine learning metrics for distributed, scalable PyTorch models.

TorchMetrics follows these three steps:
1. Initialize
2. Compute
3. Reset

The Compute step performs a distributed gather of the individual metric states from the training workers.

</div>
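
To see why the metric state, rather than each worker's final accuracy, is what gets combined, here is a minimal pure-Python sketch. It is not how TorchMetrics is implemented internally; `WorkerAccuracyState` and `aggregate_accuracy` are hypothetical names used only for illustration, and the predictions and labels are made up.

```python
from dataclasses import dataclass


@dataclass
class WorkerAccuracyState:
    """Hypothetical per-worker metric state: running counts, not a ratio."""

    correct: int = 0
    total: int = 0

    def update(self, predictions: list[int], labels: list[int]) -> None:
        # Accumulate counts for one batch seen by this worker
        self.correct += sum(p == y for p, y in zip(predictions, labels))
        self.total += len(labels)

    def reset(self) -> None:
        # Clear the state at the end of an epoch
        self.correct = 0
        self.total = 0


def aggregate_accuracy(states: list[WorkerAccuracyState]) -> float:
    """Sum the counts from every worker, then divide once."""
    correct = sum(s.correct for s in states)
    total = sum(s.total for s in states)
    return correct / total if total else 0.0


worker_0 = WorkerAccuracyState()
worker_1 = WorkerAccuracyState()
worker_0.update([1, 2, 3, 4], [1, 2, 3, 0])  # 3 of 4 correct
worker_1.update([5, 6], [0, 0])              # 0 of 2 correct
global_accuracy = aggregate_accuracy([worker_0, worker_1])  # 3 / 6
```

Averaging the two per-worker accuracies (0.75 and 0.0) would give 0.375, which differs from the count-based result because the workers saw a different number of rows.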


Here is how to build and prepare the model:

In [None]:
def build_visual_transformer():
    model = VisionTransformer(
        image_size=32,   # CIFAR-10 image size is 32x32
        patch_size=4,    # Patch size is 4x4
        num_layers=12,   # Number of transformer layers
        num_heads=8,     # Number of attention heads
        hidden_dim=384,  # Hidden size (can be adjusted)
        mlp_dim=768,     # MLP dimension (can be adjusted)
        num_classes=10   # CIFAR-10 has 10 classes
    )
    return model

def load_model_ray_train() -> torch.nn.Module:
    model = build_visual_transformer()
    model = ray.train.torch.prepare_model(model)
    return model

Here is how to build and prepare the data loader:

In [None]:
def build_data_loader_ray_train(batch_size: int) -> DataLoader:
    # Scale each of the three RGB channels from [0, 1] to [-1, 1]
    transform = Compose([ToTensor(), Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    train_data = CIFAR10(root="./data", train=True, download=True, transform=transform)
    train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, drop_last=True, num_workers=2)

    # Add DistributedSampler to the DataLoader
    train_loader = ray.train.torch.prepare_data_loader(train_loader)
    return train_loader
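
To illustrate what a distributed sampler does conceptually, the sketch below splits dataset indices across workers in pure Python. It is a simplified approximation, not the actual `DistributedSampler` code; `shard_indices` is a hypothetical helper, and the seeding scheme is an assumption chosen to show why the epoch number affects the shuffle.

```python
import math
import random


def shard_indices(
    dataset_size: int, world_size: int, rank: int, epoch: int, seed: int = 0
) -> list[int]:
    """Return the dataset indices that worker `rank` reads in `epoch`."""
    # Every worker uses the same seed + epoch, so all workers agree on the permutation
    rng = random.Random(seed + epoch)
    indices = list(range(dataset_size))
    rng.shuffle(indices)

    # Pad by repeating samples so that every worker gets the same number of indices
    per_worker = math.ceil(dataset_size / world_size)
    total = per_worker * world_size
    indices = (indices * math.ceil(total / dataset_size))[:total]

    # Worker `rank` takes every `world_size`-th index, starting at its own offset
    return indices[rank:total:world_size]


shards = [shard_indices(dataset_size=10, world_size=4, rank=r, epoch=0) for r in range(4)]
```

Each of the four shards has three indices, and together they cover all ten samples (two samples appear twice because of padding). Passing a different `epoch` changes the permutation, which is the role `set_epoch` plays in the training loop.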

Here is a simple function that prints the metrics on the rank 0 worker and returns them:

In [None]:
def print_metrics_ray_train(
    loss: torch.Tensor, accuracy: torch.Tensor, epoch: int
) -> dict[str, float]:
    metrics = {"loss": loss.item(), "accuracy": accuracy.item(), "epoch": epoch}
    if ray.train.get_context().get_world_rank() == 0:
        print(metrics)
    return metrics

Here is how to store metrics and checkpoints using Ray Train's reporting:

In [None]:
def save_checkpoint_and_metrics_ray_train(
    model: torch.nn.Module, metrics: dict[str, float]
) -> None:
    with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
        checkpoint = None
        if ray.train.get_context().get_world_rank() == 0:
            torch.save(
                model.module.state_dict(), os.path.join(temp_checkpoint_dir, "model.pt")
            )
            checkpoint = ray.train.Checkpoint.from_directory(temp_checkpoint_dir)

        ray.train.report(
            metrics,
            checkpoint=checkpoint,
        )

Let's define the scaling config and run config:

In [None]:
scaling_config = ScalingConfig(num_workers=2, use_gpu=True)

storage_path = "/mnt/cluster_storage/distributed-training/"
run_config = RunConfig(storage_path=storage_path, name="distributed-cifar-vit")

We can now launch a distributed training job with a `TorchTrainer`.

In [None]:
trainer = TorchTrainer(
    train_loop_ray_train,
    scaling_config=scaling_config,
    run_config=run_config,
    train_loop_config={"num_epochs": 3, "global_batch_size": 512},
)

Calling `trainer.fit()` will start the run and block until it completes.

In [None]:
result = trainer.fit()

## 2. Using the Ray Dashboard

We can use the Ray Dashboard to monitor the performance of the training job. It exposes cluster-level metrics such as hardware utilization, network and disk IO, and memory usage, along with information about the running tasks and actors.

### Ray Dashboard - metrics-based monitoring

1. Look at overall metrics for the cluster

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/apple/cluster_util.png" width=400>

2. Inspect GPU utilization

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/apple/gpu_util_and_disk.png" width=400>

Note: the disk IO panel reflects the data being downloaded to the local disk of each worker node prior to training.

3. Inspect Cluster Network IO 

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/apple/network_io.png" width=400>

4. Inspect GPU memory usage and auxiliary resources (CPU and memory usage of the cluster)

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/apple/gpu_gram.png" width=400>



### Ray Train Dashboard - training-specific monitoring and debugging

Here is the Train Dashboard overview page:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/apple/train_dashboard.png" width=800>

Here is a typical workflow for debugging a training job:

1. View the Ray actor of the training worker

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/apple/train_dashboard_worker.png" width=700>

2. Inspect the stack trace of each training worker

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/apple/train_dashboard_stack_trace.png" width=700>

3. Perform CPU profiling of the training worker

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/apple/train_dashboard_cpu_profile.png" width=700>

4. Perform memory profiling of the training worker

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/apple/train_dashboard_memory_profile.png" width=700>


## 3. Monitoring throughput

It is important to monitor the throughput of the training job to ensure that the model is training at the desired speed.

You can do so either:
- By counting the number of rows processed and dividing by the total time of a training step or epoch.
- By using a higher-level API like PyTorch Lightning's [`ThroughputMonitor`](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.ThroughputMonitor.html).
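
The first option can be sketched in a few lines of plain Python. The helper names `rows_per_second` and `measure_throughput` are hypothetical and not part of Ray Train or PyTorch; the sketch also assumes each batch supports `len()`.

```python
import time
from typing import Callable, Iterable, Sequence


def rows_per_second(num_rows: int, elapsed_seconds: float) -> float:
    """Throughput is the number of rows processed divided by the elapsed wall time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_rows / elapsed_seconds


def measure_throughput(
    batches: Iterable[Sequence], process_batch: Callable[[Sequence], None]
) -> float:
    """Time a loop over `batches` and return rows/sec for this worker."""
    num_rows = 0
    start = time.perf_counter()
    for batch in batches:
        process_batch(batch)
        num_rows += len(batch)
    elapsed = time.perf_counter() - start
    return rows_per_second(num_rows, elapsed)


# Stand-in workload: three batches of eight rows, each taking about 1 ms to process
fake_batches = [[0] * 8 for _ in range(3)]
throughput = measure_throughput(fake_batches, lambda batch: time.sleep(0.001))
```

When the work runs on a GPU, CUDA kernels are launched asynchronously, so the timer should only be stopped after the pending GPU work has been synchronized.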

In [None]:
def train_loop_ray_train_monitored(config: dict):
    criterion = CrossEntropyLoss()
    model = load_model_ray_train()
    optimizer = Adam(model.parameters(), lr=1e-3)

    global_batch_size = config["global_batch_size"]
    batch_size = global_batch_size // ray.train.get_context().get_world_size()

    data_loader = build_data_loader_ray_train(batch_size=batch_size)

    acc = torchmetrics.Accuracy(task="multiclass", num_classes=10).to(model.device)

    for epoch in range(config["num_epochs"]):
        epoch_start_time = time.perf_counter()
        num_rows = 0
        num_steps = 0
        data_loader.sampler.set_epoch(epoch)

        for images, labels in data_loader:
            outputs = model(images)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            acc(outputs, labels)
            num_rows += images.size(0)
            num_steps += 1

        accuracy = acc.compute()

        # Ensure all relevant CUDA operations are complete before measuring time
        torch.cuda.synchronize()
        epoch_end_time = time.perf_counter()
        epoch_duration = epoch_end_time - epoch_start_time

        print(f"Epoch {epoch} completed in {epoch_duration:.2f} seconds and {num_steps} steps")

        worker_throughput = num_rows / epoch_duration
        print(f"Worker throughput: {worker_throughput:.2f} rows/sec")

        num_workers = ray.train.get_context().get_world_size()
        # Approximation: assumes every worker processes rows at the same rate
        global_throughput = worker_throughput * num_workers
        print(f"Global throughput: {global_throughput:.2f} rows/sec")

        metrics = print_metrics_ray_train(loss, accuracy, epoch)

        save_checkpoint_and_metrics_ray_train(model, metrics)
        acc.reset()

We can proceed to run the training loop with throughput monitoring:

In [None]:
TorchTrainer(
    train_loop_ray_train_monitored,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    run_config=RunConfig(storage_path=storage_path, name="distributed-cifar-vit-monitored"),
    train_loop_config={
        "num_epochs": 3,
        "global_batch_size": 512,
    },
).fit()

Let's re-run the cell above with four workers, for example by setting `ScalingConfig(num_workers=4, use_gpu=True)`, and check how the throughput scales.
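
One way to quantify how well throughput scales is to compare the measured throughput against the ideal linear projection. The sketch below uses a hypothetical helper, `scaling_efficiency`, and made-up throughput numbers; substitute the values printed by your own runs.

```python
def scaling_efficiency(
    base_throughput: float,
    base_workers: int,
    scaled_throughput: float,
    scaled_workers: int,
) -> float:
    """Ratio of measured throughput to the ideal, linearly scaled throughput."""
    ideal_throughput = base_throughput * (scaled_workers / base_workers)
    return scaled_throughput / ideal_throughput


# Placeholder numbers for illustration only (rows/sec), not measured results
efficiency = scaling_efficiency(
    base_throughput=1000.0,
    base_workers=2,
    scaled_throughput=1900.0,
    scaled_workers=4,
)
print(f"Scaling efficiency: {efficiency:.0%}")
```

An efficiency close to 100% indicates near-linear scaling; a much lower value suggests that data loading or communication overhead grows with the number of workers.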

<div class="alert alert-info">

**Important Note on Distributed Training Performance:**

1. **Throughput Scaling:**
   - We observed that the global throughput scales approximately linearly with the number of workers
   - For example, doubling the number of workers (from 2 to 4) resulted in approximately 2x the global throughput
   - Linear scaling is the ideal behavior for distributed training

2. **Hyperparameter Considerations:**
   - When increasing the number of workers, one option is to increase the effective global batch size. This helps maximize GPU utilization.
   - However, larger batch sizes typically require adjustments to hyperparameters, particularly:
     - Learning rate
     - Other training parameters
   - This is because the optimal hyperparameter values are sensitive to batch size changes

For detailed guidance on batch size selection and hyperparameter tuning, refer to the [Google Research Tuning Playbook](https://github.com/google-research/tuning_playbook?tab=readme-ov-file#choosing-the-batch-size).

</div>
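
As a starting point for adjusting the learning rate, two commonly used heuristics are to scale it linearly or with the square root of the batch size ratio. These heuristics are not taken from this notebook, and the helper `scale_learning_rate` is hypothetical; treat the result as an initial guess to be tuned, not as a guaranteed optimum.

```python
import math


def scale_learning_rate(
    base_lr: float, base_batch_size: int, new_batch_size: int, rule: str = "linear"
) -> float:
    """Heuristically rescale a learning rate for a new global batch size."""
    ratio = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"Unknown rule: {rule!r}")


# Doubling the global batch size from 512 to 1024
linear_lr = scale_learning_rate(1e-3, 512, 1024, rule="linear")
sqrt_lr = scale_learning_rate(1e-3, 512, 1024, rule="sqrt")
```

Which rule works better depends on the optimizer and the model, so it is worth validating the choice with a short sweep around the scaled value.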

## 4. Profiling the training loop with torch.profiler

PyTorch includes a simple profiler API that is useful when you need to identify the most expensive operators in the model.

Here is an example of how to use the profiler to profile your Ray Train training loop:

In [None]:
def train_loop_ray_train_profiled(config: dict):
    criterion = CrossEntropyLoss()
    model = load_model_ray_train()
    optimizer = Adam(model.parameters(), lr=1e-3)

    global_batch_size = config["global_batch_size"]
    batch_size = global_batch_size // ray.train.get_context().get_world_size()
    data_loader = build_data_loader_ray_train(batch_size=batch_size)

    acc = torchmetrics.Accuracy(task="multiclass", num_classes=10).to(model.device)
    world_rank = ray.train.get_context().get_world_rank()

    wait = 10
    warmup = 1
    active = 2
    repeat = 1

    for epoch in range(config["num_epochs"]):
        data_loader.sampler.set_epoch(epoch)

        with torch.profiler.profile(
            activities=[
                torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA,
            ],
            schedule=torch.profiler.schedule(
                wait=wait, warmup=warmup, active=active, repeat=repeat
            ),
            on_trace_ready=torch.profiler.tensorboard_trace_handler(
                "/mnt/cluster_storage/vit/distributed-cifar-vit-profiled",
                worker_name=f"rank={world_rank}",
            ),
            record_shapes=True,
            with_stack=True,
            profile_memory=True,
        ) as profiler:
            for step, (images, labels) in enumerate(data_loader):
                outputs = model(images)
                loss = criterion(outputs, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                profiler.step()  # Signal the end of a step so the profiler schedule advances
                acc(outputs, labels)

                if step >= wait + warmup + active:
                    # no need to profile further
                    break

        # Optionally export the memory timeline as a standalone HTML file
        profiler.export_memory_timeline(
            f"/mnt/cluster_storage/vit/distributed-cifar-vit-profiled/memory_{world_rank}.html"
        )

        accuracy = acc.compute()
        metrics = print_metrics_ray_train(loss, accuracy, epoch)
        save_checkpoint_and_metrics_ray_train(model, metrics)
        acc.reset()
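
To make the effect of the `wait`, `warmup`, `active`, and `repeat` settings concrete, the following pure-Python sketch mirrors the documented behavior of `torch.profiler.schedule` (ignoring its `skip_first` option). The function `profiler_phase` is a hypothetical illustration, not the library's implementation.

```python
def profiler_phase(step: int, wait: int, warmup: int, active: int, repeat: int = 0) -> str:
    """Return the profiler phase for a given step index (0-based)."""
    cycle_length = wait + warmup + active
    # With repeat > 0, profiling stops after that many full cycles
    if repeat > 0 and step // cycle_length >= repeat:
        return "NONE"
    position = step % cycle_length
    if position < wait:
        return "NONE"  # profiler is idle
    if position < wait + warmup:
        return "WARMUP"  # tracing runs but results are discarded
    if position < cycle_length - 1:
        return "RECORD"  # tracing runs and results are kept
    return "RECORD_AND_SAVE"  # last active step: the trace is handed to on_trace_ready


# Same settings as the profiled training loop in this notebook
phases = [profiler_phase(s, wait=10, warmup=1, active=2, repeat=1) for s in range(15)]
for step, phase in enumerate(phases):
    print(step, phase)
```

With these settings the first ten steps are skipped, step 10 is a warm-up, steps 11 and 12 are recorded, and nothing is profiled afterwards, which is why the training loop can stop early once those steps are done.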

Let's re-run the training loop with the profiler and inspect the generated traces:

In [None]:
trainer = TorchTrainer(
    train_loop_ray_train_profiled,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    run_config=RunConfig(
        storage_path=storage_path, name="distributed-cifar-vit-profiled"
    ),
    train_loop_config={"num_epochs": 1, "global_batch_size": 512},
)

result = trainer.fit()

### Launch TensorBoard

In [None]:
!ls /mnt/cluster_storage/vit/distributed-cifar-vit-profiled

In [None]:
# Copy and paste the following command in a terminal to start TensorBoard
# tensorboard --logdir /mnt/cluster_storage/vit/distributed-cifar-vit-profiled

### Torch-Profiler Output

Let's take a look at the different views available in the PyTorch profiler's TensorBoard output.

#### High-level overview
You should be able to view a high-level overview like this:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/apple/tensorboard_summary.png" width=900>

- The "GPU Summary" panel shows the GPU configuration and GPU usage metrics (Utilization, SM Efficiency, and Achieved Occupancy).
- The "Step breakdown" shows the distribution of time spent in each step over different categories of execution.


#### Operator view

The Operator view shows the time spent in each operator:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-train-deep-dive/operator-view-v2.png" width=900>

Note: The "Self" duration does not include the time spent in child operators, whereas the "Total" duration does.
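
The relationship between the two durations can be sketched with a small operator tree. The `OpNode` class and the timings below are hypothetical and only illustrate the arithmetic; they are not produced by the profiler.

```python
from dataclasses import dataclass, field


@dataclass
class OpNode:
    """Hypothetical operator call with its total duration and child calls."""

    name: str
    total_us: float
    children: list["OpNode"] = field(default_factory=list)

    @property
    def self_us(self) -> float:
        # Self time is the total time minus the time spent in direct children
        return self.total_us - sum(child.total_us for child in self.children)


# Made-up timings: a linear layer that calls a transpose and a matrix multiply
linear = OpNode(
    "aten::linear",
    total_us=100.0,
    children=[OpNode("aten::t", total_us=10.0), OpNode("aten::addmm", total_us=80.0)],
)

print(linear.name, "total:", linear.total_us, "self:", linear.self_us)
for child in linear.children:
    print(child.name, "total:", child.total_us, "self:", child.self_us)
```

Sorting by "Self" duration therefore points at the operators that do the work themselves, while sorting by "Total" duration highlights the expensive call paths.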


#### Trace view
The Trace view shows graphs of time spent in both CPU threads and GPU streams:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/apple/tensorboard_trace_torch_loader.png" width=900>

Note that in the sample trace above, the GPU is idle while waiting for the data to be loaded.



#### Kernel view

The Kernel view shows the time each kernel spends on the GPU.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-train-deep-dive/kernel-view.png" width=900>

- Tensor Cores Used: Whether this kernel uses Tensor Cores.
- "Mean Blocks per SM" = `Blocks of this kernel / SM number of this GPU`. If this number is less than 1, the GPU multiprocessors are not fully utilized. "Mean Blocks per SM" is a weighted average over all runs of this kernel name, using each run's duration as the weight.
- "Mean Est. Achieved Occupancy": In most cases, such as memory-bandwidth-bound kernels, higher is better. "Mean Est. Achieved Occupancy" is a weighted average over all runs of a given kernel, using each run's duration as the weight.
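
The duration-weighted averaging described above can be sketched as follows. The helper names and the kernel numbers (block counts, durations, and an 80-SM GPU) are made up for illustration.

```python
def blocks_per_sm(num_blocks: int, num_sms: int) -> float:
    """Blocks launched by one kernel run divided by the number of SMs on the GPU."""
    return num_blocks / num_sms


def duration_weighted_mean(values: list[float], durations_us: list[float]) -> float:
    """Average `values`, weighting each one by the duration of its run."""
    total_duration = sum(durations_us)
    if total_duration <= 0:
        raise ValueError("total duration must be positive")
    return sum(v * d for v, d in zip(values, durations_us)) / total_duration


# Two hypothetical runs of the same kernel on a GPU with 80 SMs
num_sms = 80
run_blocks = [40, 160]           # blocks launched by each run
run_durations_us = [100.0, 300.0]  # duration of each run

per_run = [blocks_per_sm(b, num_sms) for b in run_blocks]  # 0.5 and 2.0
mean_blocks_per_sm = duration_weighted_mean(per_run, run_durations_us)
```

The longer run dominates the average, so the result (1.625) sits closer to 2.0 than the unweighted mean of 1.25 would.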



#### Memory views

You can also view the memory timeline of the training job, either as a standalone HTML file or through the PyTorch profiler's memory view in TensorBoard.

Below is an example memory timeline as a standalone HTML file:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-train-deep-dive/torch-profile-memory-html-view.png" width=900>

It shows the memory usage of different components:
- Parameters
- Gradients
- Optimizer states
- Activations
- Other
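
For a rough sense of how the first three components relate, the sketch below estimates their size from the parameter count. It assumes 32-bit floating point values and an Adam-style optimizer that keeps two moment estimates per parameter; the helper `estimate_static_memory_bytes` and the parameter count are hypothetical, and activations and other allocations are not covered.

```python
def estimate_static_memory_bytes(num_parameters: int, bytes_per_value: int = 4) -> dict[str, int]:
    """Estimate memory for parameters, gradients, and Adam optimizer states."""
    parameters = num_parameters * bytes_per_value
    gradients = parameters             # one gradient value per parameter
    optimizer_states = 2 * parameters  # Adam keeps first and second moment estimates
    return {
        "parameters": parameters,
        "gradients": gradients,
        "optimizer_states": optimizer_states,
        "total": parameters + gradients + optimizer_states,
    }


# Placeholder parameter count for illustration only
estimate = estimate_static_memory_bytes(num_parameters=20_000_000)
for component, num_bytes in estimate.items():
    print(f"{component}: {num_bytes / 1e6:.0f} MB")
```

Comparing such an estimate with the timeline helps separate the fixed cost of the model from the activation memory, which varies with the batch size.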


## 5. Adding Ray Data to the mix

Instead of using torch data loaders, we can use Ray Data to load and preprocess the data in a distributed manner. To do so, load the data with the `ray.data` API and then call `iter_torch_batches` on each worker's dataset shard to get a torch-compatible iterator over batches.

In [None]:
dataset = CIFAR10(root="./data", train=True, download=True)
df = pd.DataFrame({"image": dataset.data.tolist(), "label": dataset.targets})
df.to_parquet("/mnt/cluster_storage/cifar10.parquet")

Here is the code to define the Ray Data pipeline:

In [None]:
train_ds = ray.data.read_parquet("/mnt/cluster_storage/cifar10.parquet")

def transform_images(row: dict):
    # Define the torchvision transform.
    transform = Compose([ToTensor(), Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    image_arr = np.array(row["image"], dtype=np.uint8)
    row["image"] = transform(Image.fromarray(image_arr))
    return row

train_ds = train_ds.map(transform_images)

Here is the updated training loop with Ray Data and the torch profiler: 

In [None]:
def train_loop_ray_train_ray_data(config: dict):
    # Same initialization as before
    criterion = CrossEntropyLoss()
    model = load_model_ray_train()
    optimizer = Adam(model.parameters(), lr=1e-3)

    # This time we use Ray Train's integration with Ray Data to load the data
    global_batch_size = config["global_batch_size"]
    batch_size = global_batch_size // ray.train.get_context().get_world_size()
    data_loader = build_data_loader_ray_train_ray_data(batch_size=batch_size)

    acc = torchmetrics.Accuracy(task="multiclass", num_classes=10).to(model.device)

    for epoch in range(config["num_epochs"]):
        with torch.profiler.profile(
            activities=[
                torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA,
            ],
            schedule=torch.profiler.schedule(wait=10, warmup=1, active=2, repeat=1),
            on_trace_ready=torch.profiler.tensorboard_trace_handler(
                "/mnt/cluster_storage/vit/ray_train_and_data/",
                worker_name=f"rank={ray.train.get_context().get_world_rank()}",
            ),
            with_stack=False,
        ) as profiler:
            for batch in data_loader:
                outputs = model(batch["image"])
                loss = criterion(outputs, batch["label"])
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                profiler.step()
                acc(outputs, batch["label"])

        accuracy = acc.compute()

        metrics = print_metrics_ray_train(loss, accuracy, epoch)
        save_checkpoint_and_metrics_ray_train(model, metrics)
        acc.reset()

Here is how to build the data loader using Ray Data. Note we are prefetching 4 batches to keep the GPU saturated.

In [None]:
from collections.abc import Iterable


def build_data_loader_ray_train_ray_data(
    batch_size: int, prefetch_batches: int = 4
) -> Iterable[dict[str, torch.Tensor]]:
    dataset_iterator = ray.train.get_dataset_shard("train")
    data_loader = dataset_iterator.iter_torch_batches(
        batch_size=batch_size, prefetch_batches=prefetch_batches
    )
    return data_loader
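
To illustrate why prefetching helps, the sketch below overlaps batch production with consumption using a background thread and a bounded queue. It is a conceptual illustration only, not how Ray Data implements `prefetch_batches`; the `prefetch` function is a hypothetical helper.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_SENTINEL = object()  # marks the end of the stream


def prefetch(batches: Iterable[T], prefetch_batches: int = 4) -> Iterator[T]:
    """Yield items from `batches` while a background thread loads ahead."""
    # The bounded queue caps how many batches are held in memory at once
    buffer: queue.Queue = queue.Queue(maxsize=max(1, prefetch_batches))

    def producer() -> None:
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            # Always signal completion so the consumer does not wait forever
            buffer.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()  # blocks only if no batch is ready yet
        if item is _SENTINEL:
            return
        yield item


# Batches arrive in their original order, loaded ahead of the consumer
loaded = list(prefetch(range(10), prefetch_batches=4))
```

While the consumer (the training step) works on one batch, the producer keeps filling the buffer, so the consumer rarely has to wait as long as data loading is not slower than training on average.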

Finally, we define the `TorchTrainer` and fit the model.

In [None]:
datasets = {"train": train_ds}

trainer = TorchTrainer(
    train_loop_ray_train_ray_data,
    train_loop_config={"num_epochs": 1, "global_batch_size": 512},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    run_config=RunConfig(storage_path=storage_path, name="dist-cifar-vit-ray-data"),
    datasets=datasets,
)

trainer.fit()

Let's inspect the generated traces:

In [None]:
!ls -lla /mnt/cluster_storage/vit/

We can now run TensorBoard to visualize the profiling data:

In [None]:
# Copy and paste the following command in a terminal to start TensorBoard
# tensorboard --logdir /mnt/cluster_storage/vit/ray_train_and_data/

Now we can inspect the trace in TensorBoard.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/apple/tensorboard_trace_data.png" width=900>


Note that in the trace above, the GPU idling during data ingest has been resolved.