# Ray Train - A Gentle Introduction to Ray Train
A library for distributed training for deep learning

© 2019-2022, Anyscale. All Rights Reserved

📖 [Back to Table of Contents](../ex_00_tutorial_overview.ipynb)<br>

### Learning Objectives:
In this introductory tutorial, you will:
 * understand the Ray Train library components and architecture
 * learn how to use its API to build a distributed trainer
 * walk through a FashionMNIST PyTorch example

### Introduction to Ray Train

Ray Train is a library that aims to simplify distributed deep learning. It abstracts away the coordination and configuration setup of distributed deep learning frameworks such as [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) and [TensorFlow Distributed](https://www.tensorflow.org/guide/distributed_training), so that users can focus on implementing the training logic for their framework. For example:
 * For PyTorch, Ray Train automatically handles the construction of the distributed process group.
 * For TensorFlow, Ray Train automatically handles the coordination of the `TF_CONFIG`. The current implementation assumes that the user will use a _MultiWorkerMirroredStrategy_, but this will change in the near future.
 * For Horovod, Ray Train automatically handles the construction of the Horovod runtime and [Rendezvous server](https://horovod.readthedocs.io/en/stable/_modules/horovod/ray/runner.html).
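
To make the list above concrete, here is a minimal sketch of the per-worker settings that would otherwise have to be prepared by hand. It is not Ray Train code: the helper names, addresses, and port are hypothetical. It only relies on two documented conventions: PyTorch's `env://` initialization reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE`, and TensorFlow's multi-worker strategies read a JSON `TF_CONFIG` with a `cluster` and a `task` entry.

```python
import json


def torch_env_for_worker(rank, world_size, master_addr="127.0.0.1", master_port=29500):
    """Environment variables that PyTorch's env:// init method reads for one worker."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def tf_config_for_worker(index, worker_addresses):
    """TF_CONFIG JSON string describing the cluster and this worker's place in it."""
    return json.dumps({
        "cluster": {"worker": list(worker_addresses)},
        "task": {"type": "worker", "index": index},
    })


# Hypothetical two-worker cluster; the addresses are placeholders.
workers = ["10.0.0.1:12345", "10.0.0.2:12345"]
for rank in range(len(workers)):
    print(torch_env_for_worker(rank, len(workers), master_addr="10.0.0.1"))
    print(tf_config_for_worker(rank, workers))
```

Every worker needs a consistent view of these values, and keeping them in sync by hand is the kind of bookkeeping that Ray Train takes over.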

Built for data scientists and ML practitioners, Ray Train supports the standard ML tools and features that practitioners love. For example:
 * Callbacks for early stopping, reducing the cost and time of training
 * Checkpointing at regular intervals, so that training can be restarted for fault tolerance
 * Integration with TensorBoard, Weights & Biases, and MLflow, providing extensibility for experimentation and observation of runs
 * Jupyter notebooks, giving developers familiar development tools for iteration and experimentation
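
As an illustration of the first item, the sketch below shows the idea behind early stopping in plain Python. `EarlyStopper` is a hypothetical helper written for this tutorial, not a Ray Train class, and the loss values are made up.

```python
class EarlyStopper:
    """Hypothetical helper: stop once the loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            # New best loss: remember it and reset the counter.
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement stalls after the third epoch.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.50]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best loss {stopper.best}")
```

With `patience=3` the loop stops at epoch 5 and never sees the later 0.50, which is the usual trade-off: a larger patience costs more compute but tolerates plateaus.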

More importantly, Ray Train integrates with the Ray Ecosystem. Distributed deep learning often comes with a lot of complexity, so you can:
 * Use [Ray Datasets](https://docs.ray.io/en/latest/data/dataset.html#datasets) with Ray Train to ingest, handle, or train on large amounts of data
 * Use [Ray Tune](https://docs.ray.io/en/latest/tune/index.html#tune-main) with Ray Train to leverage cutting-edge hyperparameter techniques and distribute both your training and tuning

> **NOTE**: Ray SGD has been renamed to Ray Train.

### Ray Train Architecture and concepts

<img src="https://docs.ray.io/en/latest/_images/train-arch.svg" width="70%" height="3%"> 

**Trainer**: The Trainer is the main class exposed in the [Ray Train API](https://docs.ray.io/en/latest/train/api.html) that users interact with. A user passes in a function that defines the training logic. In our case, the training function is `train_func` with `config` as its argument. The Trainer creates an Executor to run the distributed training. It also handles callbacks based on the results from the `BackendExecutor`. Read the Trainer [source here](https://github.com/ray-project/ray/blob/f1acabe9cf37d5d123017fb3f158c37fb36499a5/python/ray/train/trainer.py#L78).

**BackendExecutor**: The executor is an interface that handles execution of distributed training. It creates an actor group and initializes it in conjunction with a specific backend. Worker resources, number of workers, and placement strategy are passed to the `WorkerGroup`. Read the BackendExecutor [source here](https://github.com/ray-project/ray/blob/f1acabe9cf37d5d123017fb3f158c37fb36499a5/python/ray/train/backend.py#L102).

**Backend**: A backend is used in conjunction with the `Executor` to initialize and manage framework-specific communication protocols. Each communication library (Torch, Horovod, TensorFlow, etc.) will have a separate backend and will take a specific configuration value. In the diagram, they are labelled as `XBackend`, `XConfig`, `YBackend`, and `YConfig` respectively. Read the Backend [source here](https://github.com/ray-project/ray/blob/f1acabe9cf37d5d123017fb3f158c37fb36499a5/python/ray/train/trainer.py#L64).

**WorkerGroup**: The `WorkerGroup` is a generic utility class for managing a group of Ray Actors, regardless of the backend. Read the WorkerGroup [source here](https://github.com/ray-project/ray/blob/f1acabe9cf37d5d123017fb3f158c37fb36499a5/python/ray/train/worker_group.py#L84).
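
The division of labor described above can be mimicked with a toy model in plain Python. Everything here is hypothetical (`ToyTrainer`, `ToyBackend`, `ToyWorkerGroup` are not Ray classes): threads stand in for Ray actors, the executor is folded into the trainer for brevity, and the rank is passed explicitly instead of being looked up inside the training function.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyWorkerGroup:
    """Stand-in for a group of workers; each 'worker' is just a thread here."""

    def __init__(self, num_workers):
        self.num_workers = num_workers

    def execute(self, fn, *args):
        # Run fn(rank, *args) on every worker and return results ordered by rank.
        with ThreadPoolExecutor(max_workers=self.num_workers) as pool:
            futures = [pool.submit(fn, rank, *args) for rank in range(self.num_workers)]
            return [f.result() for f in futures]


class ToyBackend:
    """Stand-in for framework-specific setup, such as creating a process group."""

    def on_start(self, worker_group):
        return worker_group.execute(lambda rank: f"worker {rank} ready")


class ToyTrainer:
    """Owns a backend and a worker group, and runs a user function on all workers."""

    def __init__(self, backend, num_workers):
        self.backend = backend
        self.worker_group = ToyWorkerGroup(num_workers)

    def start(self):
        return self.backend.on_start(self.worker_group)

    def run(self, train_func, config):
        return self.worker_group.execute(train_func, config)


def toy_train_func(rank, config):
    # A real training function would build a model and train it here.
    return {"rank": rank, "epochs": config["epochs"]}


toy_trainer = ToyTrainer(ToyBackend(), num_workers=4)
toy_ready = toy_trainer.start()
toy_results = toy_trainer.run(toy_train_func, {"epochs": 2})
print(toy_ready)
print(toy_results)
```

The point of the sketch is the shape of the flow: setup is delegated to a backend, execution to a worker group, and the user supplies only the training function and its configuration.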


## PyTorch Fashion MNIST for Distributed Training

<img src="../images/fashion-mnist-sprite.jpeg" width="70%" height="35%"> 

We will use Ray Train to distribute training for a couple of models and evaluate which of the two gives us
the best accuracy and the lowest loss.

As an exercise, you can investigate further how to improve the model, for example via regularization techniques, CNN layers, or different loss functions.

The steps we will follow can be applied to any of your models too.

So let's go!

First, do the necessary imports, as before.

In [1]:
import os
from typing import Dict
import logging

import torch
import torch.nn.functional as F

import ray
import ray.train as train
from ray.train.trainer import Trainer
from ray.train.callbacks import JsonLoggerCallback, TBXLoggerCallback
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

### Step 1: Download Train and test datasets 

In [24]:
training_data = datasets.FashionMNIST(
    root="~/data",
    train=True,
    download=True,
    transform=ToTensor(),
)
# Download test data from open datasets.
test_data = datasets.FashionMNIST(
    root="~/data",
    train=False,
    download=True,
    transform=ToTensor(),
)

### Step 2: Define Neural Network Models

The first is a fairly simple NN model.

#### Model 1

In [25]:
# Define model-1
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512), nn.ReLU(), nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 10), nn.ReLU())

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

Define a second, narrower NN model architecture with dropout

<img src="https://miro.medium.com/max/1400/1*2SHOuTUK51_Up3D9JMAplA.png" width="70%" height="50%">

[source](https://medium.com/@aaysbt/fashion-mnist-data-training-using-pytorch-7f6ad71e96f4)

#### Model 2

In [26]:
# Define model-2
class Classifier(nn.Module):
  def __init__(self):
    super().__init__()
    self.fc1 = nn.Linear(784, 120)
    self.fc2 = nn.Linear(120, 120)
    self.fc3 = nn.Linear(120,10)
    self.dropout = nn.Dropout(0.2)

  def forward(self,x):
    x = x.view(x.shape[0],-1)
    x = self.dropout(F.relu(self.fc1(x)))
    x = self.dropout(F.relu(self.fc2(x)))
    x = F.log_softmax(self.fc3(x), dim=1)
    return x
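
To compare the size of the two models, the short calculation below counts their trainable parameters from the layer widths given in the class definitions above. The helper names are ours; the only assumption is the standard layout of a linear layer (one weight per input-output pair plus one bias per output). ReLU and dropout add no parameters.

```python
def linear_params(in_features, out_features):
    # One weight per (input, output) pair plus one bias per output unit.
    return in_features * out_features + out_features


def count_params(layer_sizes):
    # Sum over consecutive pairs of layer widths, e.g. 784 -> 512 -> 512 -> 10.
    return sum(linear_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:]))


model_1_params = count_params([28 * 28, 512, 512, 10])  # NeuralNetwork
model_2_params = count_params([784, 120, 120, 10])      # Classifier

print(model_1_params)  # 669706
print(model_2_params)  # 109930
```

Model 2 has roughly one sixth of the parameters of Model 1, so any difference in accuracy between the two reflects both the smaller capacity and the added dropout.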

In [27]:
# Define accuracy function
def accuracy_fn(y_pred, y_true):
    n_correct = torch.eq(y_pred, y_true).sum().item()
    acc = (n_correct / len(y_pred)) * 100
    return acc

### Step 3: Define per-epoch training and validation functions

In [28]:
def train_epoch(dataloader, model, loss_fn, optimizer, epoch):
    # Switch back to training mode: validate_epoch() leaves the model in eval mode,
    # which would otherwise keep dropout disabled from the second epoch onward.
    model.train()
    for X, y in dataloader:
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
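
The `optimizer.step()` call hides the actual parameter update. As a rough illustration, the sketch below applies the SGD-with-momentum rule (no dampening, no Nesterov) to a single scalar parameter. The function name and the toy objective are ours, chosen only for illustration.

```python
def sgd_momentum_step(param, grad, buf, lr=1e-3, momentum=0.9):
    """One update: accumulate the gradient into a velocity buffer, then step."""
    buf = momentum * buf + grad
    param = param - lr * buf
    return param, buf


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w, buf = 0.0, 0.0
for _ in range(200):
    grad = 2.0 * (w - 3.0)
    w, buf = sgd_momentum_step(w, grad, buf, lr=0.05, momentum=0.9)

print(f"w after 200 steps: {w:.4f}")
```

The buffer carries over a fraction of past gradients, so consecutive gradients that point the same way build up speed, while gradients that flip sign partly cancel.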

In [29]:
def validate_epoch(dataloader, model, loss_fn, epoch):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct, acc =  0, 0, 0.0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
            predictions = pred.max(dim=1)[1]
            acc += accuracy_fn(predictions, y)
    test_loss /= num_batches
    acc /= num_batches
    correct /= size
    if epoch > 0 and epoch % 50 == 0:
        print(f"Epoc: {epoch}, Avg validation loss: {test_loss:.2f}, Avg validation accuracy: {acc:.2f}%") 
        print("--" * 40)
    return test_loss
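
Note that `validate_epoch` tracks accuracy in two ways: `acc` is the mean of per-batch accuracies, while `correct` is the overall fraction of correct predictions. The two differ whenever the last batch is smaller than the others. A plain-Python sketch with made-up predictions shows the effect:

```python
def batch_accuracy(preds, labels):
    """Percentage of matching predictions within one batch."""
    n_correct = sum(p == t for p, t in zip(preds, labels))
    return 100.0 * n_correct / len(preds)


# Made-up (predictions, labels) batches; the last batch is smaller.
batches = [
    ([1, 2, 3, 4], [1, 2, 3, 4]),  # 4 of 4 correct
    ([1, 2, 3, 4], [1, 2, 0, 0]),  # 2 of 4 correct
    ([5], [0]),                    # 0 of 1 correct
]

# Mean of per-batch accuracies, as `acc` is computed.
mean_of_batches = sum(batch_accuracy(p, t) for p, t in batches) / len(batches)

# Overall accuracy, as `correct / size` is computed.
total_correct = sum(sum(a == b for a, b in zip(p, t)) for p, t in batches)
total_samples = sum(len(p) for p, _ in batches)
overall = 100.0 * total_correct / total_samples

print(f"mean of batch accuracies: {mean_of_batches:.2f}%")
print(f"overall accuracy:         {overall:.2f}%")
```

Here the small last batch counts as much as a full one in the mean, which pulls the figure down to 50% while the overall accuracy is about 66.67%. With large, nearly equal batches the gap is small, but the overall figure is the unbiased one.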

### Step 4: Define Ray Train Training function
This function will be passed to `trainer.run(...)`

In [31]:
def train_func(config: Dict):
    batch_size = config.get("batch_size", 64) 
    lr = config.get('lr', 1e-3)
    epochs = config.get("epochs", 20)
    momentum = config.get("momentum", 0.9)
    model_type = config.get('model_type', 0)
    loss_fn = config.get("loss_fn", nn.NLLLoss())

    # Create data loaders.
    train_dataloader = DataLoader(training_data, batch_size=batch_size)
    test_dataloader = DataLoader(test_data, batch_size=batch_size)

    # Prepare to use Ray integrated wrappers around PyTorch's Dataloaders
    train_dataloader = train.torch.prepare_data_loader(train_dataloader)
    test_dataloader = train.torch.prepare_data_loader(test_dataloader)

    # Create model.
    model = Classifier() if model_type else NeuralNetwork()
    # Prepare to use Ray integrated wrappers around PyTorch's model
    model = train.torch.prepare_model(model)
    
    # Create the optimizer; loss_fn was already read from config above.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)

    loss_results = []

    for e in range(epochs):
        train_epoch(train_dataloader, model, loss_fn, optimizer, e)
        loss = validate_epoch(test_dataloader, model, loss_fn, e)
        train.report(loss=loss)
        loss_results.append(loss)

    return loss_results
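
What does preparing the data loaders change? The sketch below assumes that `prepare_data_loader` attaches a PyTorch `DistributedSampler` (our understanding of the Ray Train version used here, not something shown in this notebook) and mirrors, in plain Python and without shuffling, how such a sampler splits dataset indices across workers. `shard_indices` is a hypothetical helper.

```python
import math


def shard_indices(dataset_size, num_replicas, rank):
    """Indices seen by one worker, mirroring DistributedSampler without shuffling."""
    indices = list(range(dataset_size))
    per_replica = math.ceil(dataset_size / num_replicas)
    total_size = per_replica * num_replicas
    # Pad by repeating indices from the start so every worker gets the same count.
    indices += indices[: total_size - len(indices)]
    # Each worker takes every num_replicas-th index, starting at its rank.
    return indices[rank:total_size:num_replicas]


# Small example: 10 samples over 4 workers.
for rank in range(4):
    print(rank, shard_indices(10, 4, rank))

# FashionMNIST has 60,000 training and 10,000 test images.
num_workers, batch_size = 8, 128
train_per_worker = len(shard_indices(60_000, num_workers, 0))
test_per_worker = len(shard_indices(10_000, num_workers, 0))
train_batches = math.ceil(train_per_worker / batch_size)
test_batches = math.ceil(test_per_worker / batch_size)
print(train_per_worker, train_batches)  # 7500 59
print(test_per_worker, test_batches)    # 1250 10
```

Under this assumption, each of the 8 workers trains on 7,500 images per epoch and validates on its own 1,250 test images, which is one plausible reason why workers can report slightly different validation accuracy for the same epoch.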

### Step 5: Wrap our Trainer around a main driver function

In [32]:
def train_fashion_mnist(num_workers=12, use_gpu=False):
    trainer = Trainer(
        backend="torch", num_workers=num_workers, use_gpu=use_gpu)
    trainer.start()
    result = trainer.run(
        train_func=train_func,
        config={
            "lr": 1e-3,
            "batch_size": 128,
            "epochs": 150,
            "momentum": 0.9,
            "model_type": 0,                     # Use 0 for Model-1 (NeuralNetwork) and 1 for Model-2 (Classifier)
            "loss_fn": nn.CrossEntropyLoss()     # or nn.NLLLoss() for Model-2, which outputs log-probabilities
        },
        callbacks=[JsonLoggerCallback(), TBXLoggerCallback()])
    trainer.shutdown() 
    return result
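
The value returned by the driver can be post-processed in plain Python. The sketch below assumes that `trainer.run` returns a list with one entry per worker, each entry being the `loss_results` list returned by `train_func`. The helper name and the sample numbers are made up for illustration.

```python
def mean_loss_per_epoch(results):
    """Average the per-epoch losses across workers.

    `results` is assumed to be a list with one list of losses per worker.
    """
    return [sum(epoch_losses) / len(epoch_losses) for epoch_losses in zip(*results)]


# Made-up results for 2 workers and 3 epochs.
example_results = [
    [0.9, 0.6, 0.5],  # worker 0
    [1.1, 0.8, 0.7],  # worker 1
]
averaged = mean_loss_per_epoch(example_results)
print([round(x, 3) for x in averaged])
```

Averaging across workers gives a single loss curve per run, which is easier to compare between the two models than eight separate curves.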

### Step 6: Define some parallelism parameters 
And a URL to connect to a Ray cluster if running on Anyscale

In [33]:
number_of_workers = 8
use_gpu = False                              # change to True if using a Ray cluster with GPUs

### Step 6 (continued): Connect to a Ray cluster

In [2]:
if ray.is_initialized():
    ray.shutdown()
ray.init(logging_level=logging.ERROR)

Python version: 3.8.13
Ray version: 3.0.0.dev0
Dashboard: http://127.0.0.1:8274


### Step 7: Run the main Trainer driver

In [36]:
%%time
results = train_fashion_mnist(num_workers=number_of_workers, use_gpu=use_gpu)

[2m[36m(BaseWorkerMixin pid=26052)[0m 2022-05-27 18:08:34,136	INFO torch.py:334 -- Setting up process group for: env:// [rank=2, world_size=8]
[2m[36m(BaseWorkerMixin pid=26054)[0m 2022-05-27 18:08:34,131	INFO torch.py:334 -- Setting up process group for: env:// [rank=4, world_size=8]
[2m[36m(BaseWorkerMixin pid=26051)[0m 2022-05-27 18:08:34,143	INFO torch.py:334 -- Setting up process group for: env:// [rank=1, world_size=8]
[2m[36m(BaseWorkerMixin pid=26055)[0m 2022-05-27 18:08:34,148	INFO torch.py:334 -- Setting up process group for: env:// [rank=5, world_size=8]
[2m[36m(BaseWorkerMixin pid=26057)[0m 2022-05-27 18:08:34,145	INFO torch.py:334 -- Setting up process group for: env:// [rank=7, world_size=8]
[2m[36m(BaseWorkerMixin pid=26053)[0m 2022-05-27 18:08:34,120	INFO torch.py:334 -- Setting up process group for: env:// [rank=3, world_size=8]
[2m[36m(BaseWorkerMixin pid=26056)[0m 2022-05-27 18:08:34,141	INFO torch.py:334 -- Setting up process group for: env:// [

[2m[36m(BaseWorkerMixin pid=26055)[0m Epoc: 50, Avg validation loss: 0.59, Avg validation accuracy: 79.10%
[2m[36m(BaseWorkerMixin pid=26055)[0m --------------------------------------------------------------------------------
[2m[36m(BaseWorkerMixin pid=26051)[0m Epoc: 50, Avg validation loss: 0.59, Avg validation accuracy: 78.95%
[2m[36m(BaseWorkerMixin pid=26051)[0m --------------------------------------------------------------------------------
[2m[36m(BaseWorkerMixin pid=26054)[0m Epoc: 50, Avg validation loss: 0.58, Avg validation accuracy: 79.97%
[2m[36m(BaseWorkerMixin pid=26054)[0m --------------------------------------------------------------------------------
[2m[36m(BaseWorkerMixin pid=26057)[0m Epoc: 50, Avg validation loss: 0.57, Avg validation accuracy: 80.59%
[2m[36m(BaseWorkerMixin pid=26057)[0m --------------------------------------------------------------------------------
[2m[36m(BaseWorkerMixin pid=26053)[0m Epoc: 50, Avg validation loss: 

[2m[36m(BaseWorkerMixin pid=26052)[0m E0527 18:12:43.200063000 123145542377472 chttp2_transport.cc:1103]     Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
[2m[36m(BaseWorkerMixin pid=26054)[0m E0527 18:12:43.200376000 123145394028544 chttp2_transport.cc:1103]     Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
[2m[36m(BaseWorkerMixin pid=26053)[0m E0527 18:12:43.200245000 123145448992768 chttp2_transport.cc:1103]     Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
[2m[36m(BaseWorkerMixin pid=26056)[0m E0527 18:12:43.200585000 123145501032448 chttp2_transport.cc:1103]     Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"


### Step 8: Observe metrics in TensorBoard

Substitute your own `train_path`, as printed by the cell below:

`!tensorboard --logdir <train_path>/ray_results/<train_func_path>`

In [37]:
!ls ~/ray_results/

[1m[36mtrain_2022-05-27_18-08-29[m[m


In [None]:
!tensorboard --logdir ~/ray_results/train_2022-05-27_18-08-29

In [None]:
ray.shutdown()

### Exercises

Have a go at this in your spare time and observe the results:

 1. Change the learning rate and batch size in `config`
 2. Try changing the number of workers to half the number of cores on your localhost or laptop
 3. Change the `batch_size` and `epochs`
 4. Try the second model by changing the `model_type` in `config` to 1
 5. Did it improve the accuracy or minimize the loss?

### Homework
1. Can you try some deep learning regularization techniques to bring the loss down?
2. Change the loss function and test whether that helps
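
Before swapping loss functions, it helps to recall how the two losses used in this notebook relate: `nn.CrossEntropyLoss` applies a log-softmax followed by the negative log-likelihood, while `nn.NLLLoss` expects log-probabilities as input. The plain-Python sketch below (helper names and numbers are ours) reproduces that relationship for a single sample.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


def cross_entropy(scores, target):
    """Cross-entropy on raw scores: log-softmax followed by NLL."""
    return nll_loss(log_softmax(scores), target)


logits = [2.0, 0.5, -1.0]  # made-up raw scores for 3 classes
target = 0

log_probs = log_softmax(logits)
loss_from_logits = cross_entropy(logits, target)

# Applying log-softmax to values that are already log-probabilities changes nothing,
# so cross-entropy on log-probabilities equals NLL on the same log-probabilities.
loss_ce_on_log_probs = cross_entropy(log_probs, target)
loss_nll_on_log_probs = nll_loss(log_probs, target)

print(loss_from_logits, loss_ce_on_log_probs, loss_nll_on_log_probs)
```

In practice this means Model 2, which already ends in `log_softmax`, gives the same loss with either function, whereas Model 1 returns raw scores and should not be paired with `nn.NLLLoss` directly.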
