# Intro to Ray Tune

This notebook will walk you through the basics of hyperparameter tuning with Ray Tune.

<div class="alert alert-block alert-info">
<p> Here is the roadmap for this notebook:</p>
<ul>
    <li> <b>Part 1:</b> Loading the data </li>
    <li> <b>Part 2:</b> Starting out with vanilla PyTorch </li>
    <li> <b>Part 3:</b> Hyperparameter tuning with Ray Tune </li>
</ul>
</div>


## Imports

In [None]:
import tempfile
import os
from typing import Any

import torch
import numpy as np
from torchvision.datasets import MNIST
from torchvision.transforms import Compose, ToTensor, Normalize
from torchvision.models import resnet18
from torch.utils.data import DataLoader
from torch.optim import Adam
from torch.nn import CrossEntropyLoss

import ray
from ray import tune, train
from ray.tune.search import optuna

## Loading the data

Our dataset is MNIST.

The MNIST dataset consists of 28x28 pixel grayscale images of handwritten digits (0-9).

Dataset details:
- Training set: 60,000 images
- Test set: 10,000 images
- Image size: 28x28 pixels
- Number of classes: 10 (digits 0-9)

Data format:
Each image is represented as a 2D array of pixel values, where each pixel is a grayscale intensity between 0 (black) and 255 (white).

Task:
The task is to classify each image into one of the 10 digit classes (0-9).
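
The raw 0-255 intensities are not fed to the model directly. The training code later in this notebook applies `Compose([ToTensor(), Normalize((0.5,), (0.5,))])`. As a rough sketch of the arithmetic involved (the helper names `to_unit_range` and `normalize` are hypothetical, written here in plain Python rather than taken from torchvision), the two steps map a pixel first to [0, 1] and then to [-1, 1]:

```python
def to_unit_range(pixel: int) -> float:
    """Scale an 8-bit intensity (0-255) to [0, 1], as ToTensor does for 8-bit images."""
    return pixel / 255.0


def normalize(value: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Apply (value - mean) / std, the per-channel formula used by Normalize."""
    return (value - mean) / std


# Black, mid-gray, and white pixels before and after the two steps.
for pixel in (0, 128, 255):
    scaled = to_unit_range(pixel)
    print(pixel, round(scaled, 4), round(normalize(scaled), 4))
```

With a mean and standard deviation of 0.5, black maps to -1 and white maps to 1, so the inputs are roughly centered around zero.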

## Starting out with vanilla PyTorch

Here is a high-level overview of the model training process:

- **Objective**: Classify handwritten digits (0-9)
- **Model**: ResNet-18 with its first convolution adapted to single-channel input, using PyTorch
- **Evaluation Metric**: Accuracy
- **Dataset**: MNIST

We'll start with a basic PyTorch implementation to establish a baseline before moving on to more advanced techniques. This will give us a good foundation for understanding the benefits of hyperparameter tuning and distributed training in later sections.
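
Since accuracy is listed as the evaluation metric, here is a minimal sketch of how it can be computed from per-class scores. It uses plain Python lists and made-up scores purely for illustration; with PyTorch tensors the same idea is usually written as `(outputs.argmax(dim=1) == labels).float().mean()`.

```python
def accuracy(logits: list[list[float]], labels: list[int]) -> float:
    """Fraction of samples whose highest-scoring class equals the true label."""
    correct = 0
    for scores, label in zip(logits, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
        correct += int(predicted == label)
    return correct / len(labels)


# Hypothetical scores for three samples over three classes (illustrative values only).
example_logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 3.0, 0.5]]
example_labels = [0, 2, 0]
print(accuracy(example_logits, example_labels))  # two of the three predictions match
```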

In [None]:
def train_loop_torch(num_epochs: int = 2, batch_size: int = 128, lr: float = 1e-5):
    criterion = CrossEntropyLoss()

    model = resnet18()
    model.conv1 = torch.nn.Conv2d(
        1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
    )
    model.to("cuda")

    optimizer = Adam(model.parameters(), lr=lr)
    transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
    train_data = MNIST(root="./data", train=True, download=True, transform=transform)
    data_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, drop_last=True)

    for epoch in range(num_epochs):
        for images, labels in data_loader:
            images, labels = images.to("cuda"), labels.to("cuda")
            outputs = model(images)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Print the loss of the last batch in this epoch
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

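We fit the model by calling the training function directly. It runs in the notebook's own process and requires a CUDA-capable GPU, since the model and data are moved to `"cuda"`.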

In [None]:
train_loop_torch(num_epochs=2)

**Can we do any better?** Let's see if we can tune the hyperparameters of our model to get a lower loss.

But hyperparameter tuning is a computationally expensive task, and it will take a long time to run sequentially.

[Ray Tune](https://docs.ray.io/en/master/tune/) is a distributed hyperparameter tuning library that can help us speed up the process!

## Hyperparameter tuning with Ray Tune

### Intro to Ray Tune

<img src="https://docs.ray.io/en/master/_images/tune_overview.png" width="250">

Tune is a Python library for experiment execution and hyperparameter tuning at any scale.

Let's take a look at a very simple example of how to use Ray Tune to tune the single hyperparameter of a toy model.

### Getting started

We start by defining our training function:

In [None]:
def my_simple_model(distance: np.ndarray, a: float) -> np.ndarray:
    return distance * a

# Step 1: Define the training function
def train_my_simple_model(config: dict[str, Any]) -> None: # Expected function signature for Ray Tune
    distances = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
    total_amts = distances * 10
    
    a = config["a"]
    predictions = my_simple_model(distances, a)
    rmse = np.sqrt(np.mean((total_amts - predictions) ** 2))

    train.report({"rmse": rmse}) # This is how we report the metric to Ray Tune

<div class="alert alert-block alert-warning">
<b>Note:</b> The training function must accept a config argument, because Ray Tune passes the hyperparameters to the training function as a dictionary.
</div>
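
To make the `config` contract concrete, here is a small sketch that performs the same RMSE computation outside of Ray Tune. The helper `rmse_for_config` is hypothetical: it takes the same kind of dictionary but returns the metric instead of reporting it, which makes it easy to check a configuration by hand.

```python
import math
from typing import Any

DISTANCES = [0.1, 0.2, 0.3, 0.4, 0.5]
TARGETS = [d * 10 for d in DISTANCES]  # same targets as in the training function


def rmse_for_config(config: dict[str, Any]) -> float:
    """Hypothetical helper: read `a` from the config dict and return the RMSE."""
    a = config["a"]
    squared_errors = [(t - d * a) ** 2 for d, t in zip(DISTANCES, TARGETS)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# The targets are exactly 10x the distances, so a = 10 gives zero error.
print(rmse_for_config({"a": 10}))
print(rmse_for_config({"a": 0}))
```

Each trial that Ray Tune launches receives one such dictionary, with a value for every key declared in the search space.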

Next, we define and run the hyperparameter tuning job by following these steps:

1. Create a `Tuner` object (in our case named `tuner`)
2. Call `tuner.fit`

In [None]:
# Step 2: Set up the Tuner
tuner = tune.Tuner(
    trainable=train_my_simple_model,  # Training function or class to be tuned
    param_space={
        "a": tune.randint(0, 20),  # Hyperparameter: a
    },
    tune_config=tune.TuneConfig(
        metric="rmse",  # Metric to optimize (minimize)
        mode="min",     # Minimize the metric
        num_samples=5,  # Number of samples to try
    ),
)

# Step 3: Run the Tuner and get the results
results = tuner.fit()

In [None]:
# Step 4: Get the best result
best_result = results.get_best_result()
best_result

In [None]:
best_result.config

Let's recap what actually happened here.

```python
tuner = tune.Tuner(
    trainable=train_my_simple_model,  # Training function or class to be tuned
    param_space={
        "a": tune.randint(0, 20),  # Hyperparameter: a
    },
    tune_config=tune.TuneConfig(
        metric="rmse",  # Metric to optimize (minimize)
        mode="min",     # Minimize the metric
        num_samples=5,  # Number of samples to try
    ),
)

results = tuner.fit()
```

A Tuner accepts:
- A training function or class which is specified by `trainable`
- A search space which is specified by `param_space`
- A metric to optimize which is specified by `metric` and the direction of optimization `mode`
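- `num_samples`, the number of configurations to sample from `param_space`, and so the number of trials to run here

`tuner.fit` then runs the trials, in parallel where resources allow, each with its own sampled set of hyperparameters, and returns a results object. Calling `get_best_result` on that object returns the trial that achieved the best value of `metric`.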
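
Conceptually, the random search in this toy example can be sketched in plain Python as below. This is an illustration of the idea only, not how Ray Tune is implemented: it runs sequentially, and the function `random_search` is a hypothetical stand-in. It also uses the fact that for this toy problem the RMSE simplifies to `|10 - a|` times the root-mean-square of the distances.

```python
import math
import random

# Root-mean-square of the distances used in the toy problem: sqrt(0.11).
RMS_DISTANCE = math.sqrt(sum(d ** 2 for d in [0.1, 0.2, 0.3, 0.4, 0.5]) / 5)


def rmse(a: int) -> float:
    """Closed form of the toy objective: sqrt(mean((10*d - a*d)**2)) = |10 - a| * rms(d)."""
    return abs(10 - a) * RMS_DISTANCE


def random_search(num_samples: int, seed: int = 0) -> list[dict]:
    """Sample `num_samples` configurations and evaluate each one (sequentially)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_samples):
        config = {"a": rng.randrange(0, 20)}  # integer in [0, 20), like tune.randint(0, 20)
        trials.append({"config": config, "rmse": rmse(config["a"])})
    return trials


trials = random_search(num_samples=5)
best = min(trials, key=lambda trial: trial["rmse"])  # mode="min"
print(best["config"], best["rmse"])
```

With only five samples, the search may or may not land on `a = 10`; more samples or a smarter search algorithm make that more likely.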


### Diving deeper into Ray Tune concepts

You might be wondering:
- How does the tuner allocate resources to trials?
- How does it decide how to tune - i.e. which trials to run next?
    - e.g. a random search, or a more sophisticated search algorithm such as Bayesian optimization.
- How does it decide when to stop - i.e. whether to kill a trial early?
    - e.g. If a trial is performing poorly compared to other trials, it perhaps makes sense to stop it early (successive halving, hyperband)

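By default:
- Each trial will run in a separate process and consume 1 CPU core.
- The search algorithm, which decides which trials to run next, is `BasicVariantGenerator`: a basic random/grid search over the parameter space.
- The scheduler, which decides if and when to stop trials or to prioritize certain trials over others, is `FIFOScheduler`: it runs all trials in submission order with no early stopping.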

Here is the same code with the default settings for Ray Tune *explicitly* specified.

In [None]:
tuner = tune.Tuner(
    trainable=tune.with_resources(train_my_simple_model, {"cpu": 1}), # this is how to specify resources for your trainable
    param_space={"a": tune.randint(0, 20)},
    tune_config=tune.TuneConfig(
        mode="min",
        metric="rmse",
        num_samples=5, 
        search_alg=tune.search.BasicVariantGenerator(), # performs a basic variation of random/grid search based on parameter space
        scheduler=tune.schedulers.FIFOScheduler(), # this is a simple scheduler - no early stopping, just run all trials in submission order
    ),
)
results = tuner.fit()
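
To give a feel for what an early-stopping scheduler does, here is a deliberately simplified sketch of the successive-halving idea mentioned above. It is not Ray Tune's implementation (Ray Tune ships its own schedulers, such as `ASHAScheduler`); the function name and the metric histories are made up for illustration.

```python
def successive_halving(histories: dict[str, list[float]], reduction_factor: int = 2) -> list[str]:
    """Return the trials that survive repeated pruning (lower metric is better).

    After each iteration, only the best 1/reduction_factor of the remaining
    trials are kept; the rest are treated as stopped early.
    """
    alive = list(histories)
    max_steps = min(len(history) for history in histories.values())
    step = 0
    while len(alive) > 1 and step < max_steps:
        alive.sort(key=lambda name: histories[name][step])
        keep = max(1, len(alive) // reduction_factor)
        alive = alive[:keep]
        step += 1
    return alive


# Hypothetical per-iteration losses for four trials.
example_histories = {
    "trial_a": [0.9, 0.5, 0.3],
    "trial_b": [0.8, 0.7, 0.65],
    "trial_c": [1.2, 1.1, 1.0],
    "trial_d": [1.5, 0.4, 0.1],
}
print(successive_halving(example_histories))
```

Note the trade-off this exposes: `trial_d` would have ended with the lowest loss, but it is pruned after the first iteration because it starts slowly. Early stopping saves compute at the risk of discarding late bloomers.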

Below is a diagram showing the relationship between the different Ray Tune components we have discussed.

<img src="https://docs.ray.io/en/latest/_images/tune_flow.png" width="800" />


Here is the experiment table from the run above, annotated:

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/attentive-ray-101/experiment_table_annotated.png" width="800" />




#### Exercise

<div class="alert alert-block alert-info">
    
__Lab activity: Tune a linear regression model.__
    

Given the code below, which trains a linear regression model from scratch:

```python
def train_linear_model(lr: float, epochs: int) -> None:
    x = np.array([0, 1, 2, 3, 4])
    y = x * 2
    w = 0
    for _ in range(epochs):
        loss = np.sqrt(np.mean((w * x - y) ** 2))
        dl_dw = np.mean(2 * x * (w * x - y)) 
        w -= lr * dl_dw
        print({"rmse": loss})

# Hint: Step 1 update the function signature

# Hint: Step 2 Create the tuner object
tuner = tune.Tuner(...)

# Hint: Step 3: Run the tuner
results = tuner.fit()
```

Use Ray Tune to tune the hyperparameters `lr` and `epochs`. 

Perform a search using the `optuna.OptunaSearch` search algorithm with 5 samples over the following ranges:
- `lr`: loguniform(1e-4, 1e-1)
- `epochs`: randint(1, 100)

</div>
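
As background for the `lr` range, a log-uniform distribution samples the exponent uniformly, so each order of magnitude is about equally likely. The sketch below shows the idea in plain Python; `sample_loguniform` is a hypothetical helper, not part of Ray Tune.

```python
import math
import random


def sample_loguniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_loguniform(1e-4, 1e-1, rng) for _ in range(1000)]

# Count samples per decade: [1e-4, 1e-3), [1e-3, 1e-2), [1e-2, 1e-1).
per_decade = [sum(10 ** -(k + 1) <= s < 10 ** -k for s in samples) for k in (3, 2, 1)]
print(per_decade)
```

A plain uniform distribution over the same interval would place about 90% of its samples in the top decade, which is rarely what you want for a learning rate.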


In [None]:
# Write your code here


<div class="alert alert-block alert-info">
<details>
<summary>Click here to view the solution</summary>

```python
def train_linear_model(config) -> None:
    epochs = config["epochs"]
    lr = config["lr"]
    x = np.array([0, 1, 2, 3, 4])
    y = x * 2
    w = 0
    for _ in range(epochs):
        loss = np.sqrt(np.mean((w * x - y) ** 2))
        dl_dw = np.mean(2 * x * (w * x - y)) 
        w -= lr * dl_dw
        train.report({"rmse": loss})

tuner = tune.Tuner(
    trainable=train_linear_model,  # Training function or class to be tuned
    param_space={
        "lr": tune.loguniform(1e-4, 1e-1),  # Hyperparameter: learning rate
        "epochs": tune.randint(1, 100),  # Hyperparameter: number of epochs
    },
    tune_config=tune.TuneConfig(
        metric="rmse",  # Metric to optimize (minimize)
        mode="min",     # Minimize the metric
        num_samples=5,  # Number of samples to try
        search_alg=optuna.OptunaSearch(), # Use Optuna for hyperparameter search
    ),
)

results = tuner.fit()
```

</details>
</div>

### Hyperparameter tune the PyTorch model using Ray Tune

The first step is to move all the PyTorch code into a function that we can pass as the `trainable` argument of `tune.Tuner`.

In [None]:
def train_pytorch(config): # we change the function so it accepts a config dictionary
    criterion = CrossEntropyLoss()

    model = resnet18()
    model.conv1 = torch.nn.Conv2d(
        1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
    )
    model.to("cuda")

    optimizer = Adam(model.parameters(), lr=config["lr"])
    transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
    train_data = MNIST(root="./data", train=True, download=True, transform=transform)
    data_loader = DataLoader(train_data, batch_size=config["batch_size"], shuffle=True, drop_last=True)

    for epoch in range(config["num_epochs"]):
        for images, labels in data_loader:
            images, labels = images.to("cuda"), labels.to("cuda")
            outputs = model(images)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Report the metrics using train.report instead of print
        train.report({"loss": loss.item()})

The second and third steps are the same as before. We define the tuner and run it by calling the fit method.

In [None]:
tuner = tune.Tuner(
    trainable=tune.with_resources(train_pytorch, {"gpu": 1}), # we will dedicate 1 GPU to each trial
    param_space={
        "num_epochs": 1,
        "batch_size": 128,
        "lr": tune.loguniform(1e-4, 1e-1),
    },
    tune_config=tune.TuneConfig(
        mode="min",
        metric="loss",
        num_samples=2,
        search_alg=tune.search.BasicVariantGenerator(),
        scheduler=tune.schedulers.FIFOScheduler(),
    ),
)

results = tuner.fit()
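
Because each trial requests a full GPU, the number of trials that can run at the same time depends on how many GPUs the cluster has. The following back-of-the-envelope sketch captures that arithmetic. It is a simplification that ignores placement across nodes and other scheduling details, and the function name and cluster sizes are hypothetical.

```python
def max_concurrent_trials(
    cluster: dict[str, float], per_trial: dict[str, float], num_samples: int
) -> int:
    """Upper bound on trials that fit at once, limited by the scarcest requested resource."""
    fits = [
        int(cluster.get(name, 0) // amount)
        for name, amount in per_trial.items()
        if amount > 0
    ]
    return min([num_samples, *fits])


# Hypothetical cluster sizes, chosen only for illustration.
print(max_concurrent_trials({"cpu": 8, "gpu": 1}, {"gpu": 1}, num_samples=2))
print(max_concurrent_trials({"cpu": 8, "gpu": 4}, {"gpu": 1}, num_samples=2))
```

With a single GPU the two trials run one after the other; with two or more GPUs they can run in parallel.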

Finally, we can get the best result and its configuration:

In [None]:
best_result = results.get_best_result()
best_result.config
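
When a training function reports a metric several times (here, once per epoch), it matters which reported value is used to rank trials. The sketch below assumes the ranking uses each trial's last reported value, which to my knowledge is the default behavior of `get_best_result`; check the API reference for your Ray version. The function name and loss histories are made up for illustration.

```python
def best_trial_by_last_value(histories: dict[str, list[float]], mode: str = "min") -> str:
    """Pick the trial whose final reported metric is best (assumed ranking rule)."""
    pick = min if mode == "min" else max
    return pick(histories, key=lambda name: histories[name][-1])


# Hypothetical per-epoch losses for two trials.
example_losses = {
    "trial_0": [0.9, 0.4],
    "trial_1": [0.3, 0.6],
}
print(best_trial_by_last_value(example_losses))
```

Under this rule `trial_0` wins because its final loss is lower, even though `trial_1` reached the lowest loss at an earlier epoch.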