# Ray Tune - A gentle introduction to hyperparameter optimization

© 2019-2022, Anyscale. All Rights Reserved

This lesson introduces the concepts of _Hyperparameter Tuning or Optimization_ (HPO) and works through a nontrivial example using Tune. 


### Learning Objective:
In this introductory tutorial, you will:
 * understand Ray Tune's concepts, components, and architecture
 * learn how to use its API for distributed hyperparameter optimization
 * walk through a short example

See also the [Hyperparameter Tuning References](References-Hyperparameter-Tuning.ipynb) notebook and the [Tune documentation](http://tune.io), in particular, the [API reference](https://docs.ray.io/en/latest/tune/api_docs/overview.html). 

## What Are Hyperparameters?

In _supervised learning_, we train a model with labeled data so the model can properly label new data values. Everything about the model is defined by a set of _parameters_, such as the weights in a linear regression. 

In contrast, the _hyperparameters_<sup>1</sup> define structural details about the kind of model itself, like whether or not we are using a linear regression or what architecture is best for a neural network, etc. Other quantities considered hyperparameters include learning rates, discount rates, etc. If we want our training process and resulting model to work well, we first need to determine the optimal or near-optimal set of hyperparameters.

How do we determine the optimal hyperparameters? The most straightforward approach is to perform a loop where we pick a candidate set of values from some reasonably inclusive list of possible values, train a model, compare the results achieved with previous loop iterations, and pick the set that performed best. This process is called _Hyperparameter Tuning_ or _Optimization_ (HPO).
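
The loop described above can be sketched in a few lines of plain Python. This is only an illustration: `train_and_score` is a hypothetical stand-in for a real training run, and the candidate values are made up.

```python
import itertools


def train_and_score(learning_rate, num_layers):
    """Hypothetical stand-in for a real training run; returns a loss (lower is better)."""
    # Made-up loss surface with its minimum at learning_rate=0.1, num_layers=3.
    return (learning_rate - 0.1) ** 2 + (num_layers - 3) ** 2


# Candidate values for each hyperparameter (assumed for illustration).
candidates = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "num_layers": [1, 2, 3, 4],
}

best_config, best_loss = None, float("inf")
# Try every combination, "train" a model, and remember the best one so far.
for values in itertools.product(*candidates.values()):
    config = dict(zip(candidates.keys(), values))
    loss = train_and_score(**config)
    if loss < best_loss:
        best_config, best_loss = config, loss

print(best_config, best_loss)
```

With 4 values per hyperparameter this takes 16 training runs; each additional hyperparameter multiplies that count, which is why the exhaustive loop becomes expensive so quickly.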

This simple algorithm can quickly become very expensive, however. Training a single neural network can be compute-intensive, and the space of all possible architectures is huge. Hence, much of the research in hyperparameter tuning, especially for neural networks, focuses on ways to optimize HPO, such as early stopping and pruning the search space when some combinations appear to perform poorly.

1. _Hyperparameter_ is often spelled _hyper parameter_ or _hyper-parameter_, but we'll use the spelling with no space or dash.

<img src="images/what-are-hyperparameters.png" height="25%" width="60%">

## A Simple Example: $k$-Means 

Let's start with a very simple example of HPO, finding $k$ in $k$-means. 

The $k$-means algorithm finds clusters in a data set. It's a canonical example of _unsupervised learning_, where information is extracted from a data set, rather than using labeled data to train a model for labelling new data, as in _supervised learning_. We won't discuss the algorithm details, but the essence of it involves a "guess" for the expected number of clusters, the $k$ value, then calculating $k$ centroids (the coordinates at the center), one per cluster, along with determining to which cluster each data point belongs. The details are in the [$k$-means Wikipedia article](https://en.wikipedia.org/wiki/K-means_clustering). The following animation shows the algorithm in action for a two-dimensional data set where three clusters are evident.

#### K-Means Convergence

<img src="images/K-means_convergence.gif">

(source: [Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering). [Larger Image](https://en.wikipedia.org/wiki/K-means_clustering#/media/File:K-means_convergence.gif))

While it is easy to see the clusters in this two-dimensional data set, that won't be the case for arbitrary datasets, especially those with more than three dimensions. Hence, we should determine the best $k$ value by trying many values and picking the value that appears to be best. In this case, "best" would mean that we minimize the distances between the datapoints and centroids. 

With just one hyperparameter, this problem is comparatively simple, and brute-force calculations to find the optimal $k$ are usually good enough.
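
As a rough sketch of that brute-force search, the following pure-Python code runs a tiny one-dimensional version of the algorithm for several values of $k$ and prints the total squared distance to the nearest centroid. The data set, the deterministic initialization, and the helper name `kmeans_1d` are assumptions made for illustration; a real project would use a library implementation.

```python
def kmeans_1d(points, k, iterations=20):
    """Tiny 1-D Lloyd's algorithm; returns the sum of squared distances to centroids."""
    pts = sorted(points)
    # Deterministic initialization: spread the starting centroids across the sorted data.
    centroids = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return sum(min((p - c) ** 2 for c in centroids) for p in pts)


# Made-up data with three visible groups around 1, 5, and 9.
data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
for k in range(1, 6):
    print(k, round(kmeans_1d(data, k), 3))
```

The cost keeps shrinking as $k$ grows, so in practice one looks for the value where the improvement levels off. For this data the cost drops sharply up to $k = 3$ and changes little afterwards.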

## HPO for Neural Networks

Where HPO really becomes a challenge is finding the right neural network architecture for your problem. Why are neural networks a challenge? Consider this image of a typical architecture:

<img src="images/hpo-neural-network-example.png" height="25%" width="60%">

Every number you see is a hyperparameter! So are the decisions about how many layers to have, what kind of layer to use for each layer, etc. The space of possible hyperparameters is enormous, too big to explore naively.

So-called _neural architecture search_ (NAS) has become a research field in its own right, along with general research in optimizing HPO. 

## Introduction to Ray Tune

[Ray Tune](http://tune.io) is the Ray-based native library for hyperparameter tuning. Tune makes it nearly as easy to run distributed, parallelized HPO as it is to run trials on a single machine manually, one after the other. 

Tune is built as an extensible, pluggable framework, with built-in integrations for many frameworks, including [PyTorch](https://pytorch.org), [TensorFlow](http://tensorflow.org), and recently, [scikit-learn](https://scikit-learn.org/stable/) (see [this recent blog post](https://medium.com/distributed-computing-with-ray/gridsearchcv-2-0-new-and-improved-ee56644cbabf)).

## How Tune Works

Before we get into using Tune, let's go over some definitions, terms, and components. With this understanding, you will gain insight into what happens when you use Tune to search your hyperparameter space and select the best model.

<img src="https://docs.ray.io/en/latest/_images/tune_flow.png" height="25%" width="60%">

## Definitions

Let's get an intuition of what those terms mean.  

#### Trainable
This is your training function, which contains an objective function. The [trainable](https://docs.ray.io/en/latest/tune/api_docs/trainable.html?highlight=trainable#ray.tune.Trainable) is one of the arguments to the `tune.run(...)` function. Tune offers two interface APIs for trainables: a function-based API and a class-based API.

#### Search Spaces
To optimize your hyperparameters, you have to define a search space. A search space defines valid values for your hyperparameters and can specify how these values are sampled (e.g., from a uniform distribution or a normal distribution).
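
To make the idea concrete, here is a minimal, library-free sketch of a search space as a dictionary of sampling functions. It is not Tune's implementation, and the hyperparameter names and ranges are hypothetical; Tune provides its own helpers for this, such as the `tune.uniform` calls used later in this lesson.

```python
import random

rng = random.Random(0)  # seeded so the sketch is reproducible

# Hypothetical search space: each entry is a zero-argument sampler.
search_space = {
    "momentum": lambda: rng.uniform(0.1, 0.9),            # uniform distribution
    "weight_init_std": lambda: abs(rng.gauss(0.0, 0.02)),  # (folded) normal distribution
    "batch_size": lambda: rng.choice([32, 64, 128]),       # categorical choice
}


def sample_config(space):
    """Draw one concrete hyperparameter configuration from the space."""
    return {name: sampler() for name, sampler in space.items()}


for _ in range(3):
    print(sample_config(search_space))
```

Each call to `sample_config` yields one candidate configuration, which is what a single trial would be run with.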

#### Search Algorithms
To optimize the hyperparameters of your training process, you use a Search Algorithm which suggests hyperparameter configurations. If you don’t specify a search algorithm, Tune will use random search by default, which can provide you with a good starting point for your hyperparameter optimization.

#### Schedulers
To make your training process more efficient, you can use a Trial Scheduler. For instance, a trainable that minimizes a function in a training loop can call `tune.report()` to report incremental results for the hyperparameter configuration selected by a search algorithm. Based on these reported results, a Tune scheduler can decide whether to stop the trial early or not. If you don’t specify a scheduler, Tune will use a first-in-first-out (FIFO) scheduler by default, which simply passes through the trials selected by your search algorithm in the order they were picked and does not perform any early stopping.

In short, schedulers can stop, pause, or tweak the hyperparameters of running trials, potentially making your hyperparameter tuning process much faster. Unlike search algorithms, Trial Schedulers do not select which hyperparameter configurations to evaluate.
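
The sketch below illustrates the scheduler idea with a simple median-based stopping rule written in plain Python. The rule, the `loss_curve` function, and the configurations are all assumptions for illustration; this is not how Tune's built-in schedulers are implemented. Note that the rule never picks new configurations; it only decides which running trials may continue.

```python
import statistics


def loss_curve(config, step):
    """Hypothetical training curve: loss decays toward a config-dependent floor."""
    return config["floor"] + 1.0 / (step + 1)


configs = [{"floor": f} for f in (0.1, 0.5, 0.9, 1.3)]
running = {i: None for i in range(len(configs))}  # trial id -> latest reported loss
stopped_at = {}                                   # trial id -> step at which it was stopped

for step in range(6):
    # Every running trial reports its latest intermediate result.
    for trial_id in running:
        running[trial_id] = loss_curve(configs[trial_id], step)
    # Every second step, stop the trials that are worse than the median.
    if step % 2 == 1 and len(running) > 1:
        median = statistics.median(running.values())
        for trial_id in [t for t, loss in running.items() if loss > median]:
            stopped_at[trial_id] = step
            del running[trial_id]

print("survivors:", sorted(running))
print("stopped early:", stopped_at)
```

Only the most promising trial runs for all six steps; the others stop after two or four steps, which is where the compute savings come from.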

#### Trial

A trial is an execution or run of a logical representation of a single hyperparameter configuration. Each trial is associated with an instance of a Trainable. A collection of trials makes up an experiment.


#### Lifecycle of a trial
A trial’s life cycle consists of 6 stages:

**Initialization (generation)**: A trial is first generated as a hyperparameter sample, and its parameters are configured according to what was provided in `tune.run` as part of the `config` argument. Trials are then placed into a queue to be executed (with status PENDING).

**PENDING**: A pending trial is a trial to be executed on the machine. Every trial is configured with resource values. Whenever the trial’s resource values are available, Tune will run the trial (by starting a Ray actor holding the config and the training function).

**RUNNING**: A running trial is assigned a Ray Actor. There can be multiple running trials in parallel.

**ERRORED**: If a running trial throws an exception, Tune will catch that exception and mark the trial as errored. Note that exceptions can be propagated from an actor to the main Tune driver process. If `max_retries` is set, Tune will set the trial back into “PENDING” and later start it from the last checkpoint.

**TERMINATED**: A trial is terminated if it is stopped or finished by a Stopper/Scheduler. If using the Function API, the trial is also terminated when the function stops.

**PAUSED**: A trial can be paused by a Trial scheduler. This means that the trial’s actor will be stopped too. A paused trial can later be resumed from the most recent checkpoint.
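
One way to picture these stages is as a small state machine. The sketch below is a simplified reading of the descriptions above, not Tune's internal code: the `TrialStatus` enum, the transition table, and the `advance` helper are illustrative assumptions (for example, it treats a resumed trial as going straight back to RUNNING).

```python
from enum import Enum


class TrialStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    ERRORED = "ERRORED"
    TERMINATED = "TERMINATED"


# Transitions read off the stage descriptions above (an illustrative assumption,
# not Tune's internal state table).
ALLOWED_TRANSITIONS = {
    TrialStatus.PENDING: {TrialStatus.RUNNING},
    TrialStatus.RUNNING: {TrialStatus.PAUSED, TrialStatus.ERRORED, TrialStatus.TERMINATED},
    TrialStatus.PAUSED: {TrialStatus.RUNNING},   # resumed from the latest checkpoint
    TrialStatus.ERRORED: {TrialStatus.PENDING},  # only when retries are configured
    TrialStatus.TERMINATED: set(),               # final state
}


def advance(current, new):
    """Return the new status, or raise if the move is not in the table."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.name} to {new.name}")
    return new


status = TrialStatus.PENDING
for nxt in (TrialStatus.RUNNING, TrialStatus.PAUSED, TrialStatus.RUNNING, TrialStatus.TERMINATED):
    status = advance(status, nxt)
print(status.name)  # TERMINATED
```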


#### Driver/worker process

The driver process is the Python process that calls `tune.run` (which calls `ray.init()` under the hood); therefore, you do not need to invoke `ray.init(...)` explicitly. Tune does it for you during its initial run. Tune's driver process runs on the node where you run your script (which calls `tune.run`), while Ray Tune trainable “actors” run on any node: either on the same node, across multiple cores, or on worker nodes in a distributed Ray cluster.

#### Ray Actors

Tune uses Ray Actors as worker processes to evaluate multiple Trainables in parallel.

[Ray Actors](https://docs.ray.io/en/latest/actors.html#actor-guide) allow you to parallelize an instance of a class in Python. When you instantiate a class that is a Ray actor, Ray will start an instance of that class in a separate process, either on the same machine or, if you are running a Ray cluster, on another machine. This actor can then asynchronously execute method calls and maintain its own internal state.

### The execution of a trainable
Tune uses Ray actors to parallelize the evaluation of multiple hyperparameter configurations. Each actor is a Python process that executes an instance of the user-provided Trainable. The definition of the user-provided Trainable will be [serialized via cloudpickle](https://docs.ray.io/en/latest/serialization.html#serialization-guide) and sent to each actor process. Each Ray actor will start an instance of the Trainable to be executed.

If the Trainable is a class, it will be executed iteratively by calling train/step. After each invocation, the driver is notified that a “result dict” is ready. The driver will then pull the result via `ray.get`.

If the trainable is a callable or a function, it will be executed on the Ray actor process on a separate execution thread. Whenever `tune.report` is called, the execution thread is paused and waits for the driver to pull a result. After pulling, the actor’s execution thread will automatically resume.
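
The pause-and-resume handoff for function trainables can be imitated inside a single Python process. The following is only an analogy built from the standard library: `report` is a hypothetical stand-in for `tune.report`, and a queue plus a semaphore replace the actor-to-driver communication that Tune performs across processes.

```python
import queue
import threading

results = queue.Queue(maxsize=1)  # holds one reported result at a time
resume = threading.Semaphore(0)   # released by the driver after it pulls a result


def report(**metrics):
    """Stand-in for tune.report: publish metrics, then wait for the driver."""
    results.put(metrics)
    resume.acquire()  # the training thread pauses here


def trainable(config):
    for step in range(config["steps"]):
        report(step=step, loss=1.0 / (step + 1))
    results.put(None)  # sentinel: training has finished


worker = threading.Thread(target=trainable, args=({"steps": 3},))
worker.start()

pulled = []
while True:
    item = results.get()  # the "driver" pulls the next result
    if item is None:
        break
    pulled.append(item)
    resume.release()      # let the training thread continue
worker.join()

print(pulled)
```

The training loop never runs ahead of the driver: each reported result must be pulled before the next step starts.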

The diagram below depicts how Tune launches trainables on the worker nodes as processes in which the trainables are run. 
Each trial has its own instance of a trainable, so trials and their respective configurations are parallelized across the cores of a worker. 

<img src="images/ray_tune_report_launch_trainables.png" height="25%" width="60%">

Whenever the trainable calls `tune.report`, the driver will pull the metrics via `ray.get`, as shown in the diagram below.

<img src="images/ray_tune_report_metrics.png" height="25%" width="60%">


Tune also integrates implementations of many state-of-the-art [search algorithms](https://docs.ray.io/en/latest/tune/key-concepts.html#search-algorithms) and [schedulers](https://docs.ray.io/en/latest/tune/key-concepts.html#schedulers), so it is easy to optimize your HPO process.

### Three simple steps to use Ray Tune

In [1]:
import os
import warnings
import time
import logging

import ray
from ray import tune

In [2]:
warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"

In [3]:
if ray.is_initialized():
    ray.shutdown()
ray.init(logging_level=logging.ERROR)

 * Python version: 3.8.13
 * Ray version: 3.0.0.dev0
 * Dashboard: http://127.0.0.1:8273


#### Step 1. Set up training using Trainable APIs

Let's define our objective function:

In [6]:
def evaluation_fn(step, width, height):
    time.sleep(0.1)
    return (0.1 + width * step / 100)**(-1) + height * 0.1
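
To get a feel for this objective before handing it to Tune, the sketch below evaluates the same formula for a few steps. `score_at` is a hypothetical copy of the formula without the artificial delay, and the `width` and `height` values are arbitrary.

```python
def score_at(step, width, height):
    # Same formula as evaluation_fn, minus the artificial delay.
    return (0.1 + width * step / 100) ** (-1) + height * 0.1


for step in range(5):
    print(step, round(score_at(step, width=10, height=-50), 3))
```

At step 0 the first term is always $1 / 0.1 = 10$, so the score starts at $10 + 0.1 \cdot \text{height}$ and falls as the step count grows. Since `height` shifts the whole curve, a search that minimizes this score should favor strongly negative `height` values.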

Next, we define a Trainable used by Tune using Tune's [Functional API](https://docs.ray.io/en/latest/tune/api_docs/trainable.html#function-api)

In [7]:
def easy_objective_fn(config):
    # Fetch the hyperparameters passed in through config
    width, height = config["width"], config["height"]
    # Iterate over number of steps
    for step in range(config["steps"]):
        # Iterative training function - can be any arbitrary training procedure
        # Here our objective function is the evaluation_fn
        intermediate_score = evaluation_fn(step, width, height)
        # Feed the score back to Tune.
        tune.report(iterations=step, mean_loss=intermediate_score)

#### Step 2. Use Tune's API to execute tuning

This will do a grid search over the `activation` parameter. This means that each of the two values (`relu` and `tanh`) will be sampled once for each sample (`num_samples`). We end up with `2 * N = 2N` samples, where N is `num_samples`. The `width` and `height` parameters are sampled randomly. `steps` is a constant parameter.

The `tune.run(...)` API returns a large [analysis](https://docs.ray.io/en/latest/tune/api_docs/analysis.html#analysis-tune-analysis) object.

In [8]:
analysis = tune.run(
    easy_objective_fn,
    metric="mean_loss",
    mode="min",
    num_samples=5,
    # Define our hyperparameter search space
    config={
        "steps": 5,
        "width": tune.uniform(0, 20),
        "height": tune.uniform(-100, 100),
        "activation": tune.grid_search(["relu", "tanh"]),
    },
    verbose=1
)
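
The trial count for this configuration can be reproduced with a short, Tune-free sketch. It mimics the expansion described above (every grid value repeated for each of the `num_samples` random draws); the seeded `random.Random` and the loop structure are assumptions for illustration, not Tune's actual variant generator.

```python
import random

rng = random.Random(42)  # seeded so the sketch is reproducible
num_samples = 5
grid_values = ["relu", "tanh"]

configs = []
for _ in range(num_samples):
    # Each sample repeats the full grid, with fresh random draws for width and height.
    for activation in grid_values:
        configs.append({
            "steps": 5,                        # constant
            "width": rng.uniform(0, 20),       # random
            "height": rng.uniform(-100, 100),  # random
            "activation": activation,          # grid
        })

print(len(configs))  # 2 * 5 = 10
```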

#### Step 3. Analyse the results

In [9]:
print("Best hyperparameters found were: ", analysis.best_config)

Best hyperparameters found were:  {'steps': 5, 'width': 4.576539242686842, 'height': -89.58469559984677, 'activation': 'tanh'}


In [10]:
analysis.results_df.head(5)

| trial_id | iterations | mean_loss | time_this_iter_s | done | timesteps_total | episodes_total | training_iteration | neg_mean_loss | experiment_id | date | ... | node_ip | time_since_restore | timesteps_since_restore | iterations_since_restore | warmup_time | experiment_tag | config.steps | config.width | config.height | config.activation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9436d_00000 | 4 | 3.329998 | 0.113018 | True | | | 5 | -3.329998 | a1813e135a554922b674ac1e8a0be4b9 | 2022-07-18_09-06-30 | ... | 127.0.0.1 | 0.557857 | 0 | 5 | 0.001975 | 0_activation=relu,height=20.4609,width=16.9717 | 5 | 16.971743 | 20.460862 | relu |
| 9436d_00001 | 4 | 4.620703 | 0.105743 | True | | | 5 | -4.620703 | 6c78ab82390740c198ec087eb0a932aa | 2022-07-18_09-06-31 | ... | 127.0.0.1 | 0.523544 | 0 | 5 | 0.002242 | 1_activation=tanh,height=21.4085,width=7.5812 | 5 | 7.581245 | 21.408504 | tanh |
| 9436d_00002 | 4 | 5.093847 | 0.106594 | True | | | 5 | -5.093847 | 86804ab54c54414297ff23614cf6edfd | 2022-07-18_09-06-31 | ... | 127.0.0.1 | 0.523724 | 0 | 5 | 0.001622 | 2_activation=relu,height=34.3086,width=12.5332 | 5 | 12.53316 | 34.308568 | relu |
| 9436d_00003 | 4 | -5.425669 | 0.104017 | True | | | 5 | 5.425669 | 7e028e93a65940bfb758a807ee537d2f | 2022-07-18_09-06-31 | ... | 127.0.0.1 | 0.523478 | 0 | 5 | 0.003478 | 3_activation=tanh,height=-89.5847,width=4.5765 | 5 | 4.576539 | -89.584696 | tanh |
| 9436d_00004 | 4 | -1.822946 | 0.107686 | True | | | 5 | 1.822946 | 1329f94462754285a89533df93b99b90 | 2022-07-18_09-06-31 | ... | 127.0.0.1 | 0.532685 | 0 | 5 | 0.001862 | 4_activation=relu,height=-35.5489,width=11.9347 | 5 | 11.934685 | -35.548858 | relu |


### Ray Tune's TuneGridSearchCV and Scikit-Learn
As a newcomer, you can get started with Ray Tune by following the same three-step pattern. Here we'll use `TuneGridSearchCV`, Tune's distributed drop-in replacement for scikit-learn's `GridSearchCV`.

See also the [Tune documentation](http://tune.io/), in particular, the [API reference](https://docs.ray.io/en/latest/tune/api_docs/overview.html).

In [11]:
# Import Tune's replacement
from ray.tune.sklearn import TuneGridSearchCV

# Other relevant imports
from sklearn.model_selection import train_test_split

# Use the stochastic gradient descent (SGD) classifier
from sklearn.linear_model import SGDClassifier

# import the classification dataset
from sklearn.datasets import make_classification
import numpy as np

### Create Feature Set
 * 250K rows
 * 250 features
 * 2 classes

In [12]:
from typing import Tuple

def create_classification_data() -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(
        n_samples=250000,
        n_features=250,
        n_informative=50,
        n_redundant=0,
        n_classes=2,
        class_sep=2.5)
    return X, y

X, y = create_classification_data()
# Split the dataset into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=10000)

#### Step 1. Define parameter search space

In [13]:
# Example parameters grid to tune from SGDClassifier
parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}

#### Step 2. Use Ray's Scikit-learn drop-in replacement TuneGridSearchCV

Use all cores on a Ray Cluster (or local host) to tune

The `early_stopping` parameter allows us to terminate unpromising configurations. If `early_stopping=True`, `TuneGridSearchCV` will default to using Tune’s `ASHAScheduler`. You can also pass in a custom algorithm; see Tune’s documentation on schedulers for a full list to choose from. 

`max_iters` is the maximum number of iterations a given hyperparameter set could run for; it may run for fewer iterations if it is early stopped.

In [14]:
# Now let's run the search with Tune's drop-in replacement
# Note: If early_stopping=True, TuneGridSearchCV will default to using Tune’s ASHAScheduler.
tune_sklearn = TuneGridSearchCV(SGDClassifier(), 
                    parameter_grid,
                    early_stopping=True,
                    max_iters=30,
                    n_jobs=-1,    # Use all cores if running on a cluster
                    mode="min",
                    verbose=True)

#### Step 3. Run Tune

Under the hood, `.fit()` calls Tune.

In [15]:
tune_sklearn.fit(x_train, y_train)

In [16]:
print(f"Ray Tune Scikit-learn TuneGridSearchCV Best params: {tune_sklearn.best_params}")

Ray Tune Scikit-learn TuneGridSearchCV Best params: {'alpha': 0.1, 'epsilon': 0.01}


In [17]:
ray.shutdown()

### Homework

1. Walk through the quick PyTorch tutorial with Ray Tune and convert it into a notebook
2. Try some Ray Tune [How-to guides](https://docs.ray.io/en/latest/tune/examples/index.html)