# Ray Architecture and Setup

Ray Cluster
-----------

A Ray cluster consists of a single [head node](https://docs.ray.io/en/latest/cluster/key-concepts.html#cluster-head-node) and any number of connected [worker nodes](https://docs.ray.io/en/latest/cluster/key-concepts.html#cluster-worker-nodes):

[![../_images/ray-cluster.svg](https://docs.ray.io/en/latest/_images/ray-cluster.svg)](https://docs.ray.io/en/latest/_images/ray-cluster.svg)

*A Ray cluster with two worker nodes. Each node runs Ray helper processes to facilitate distributed scheduling and memory management. The head node runs additional control processes (highlighted in blue).*

The number of worker nodes may be *autoscaled* with application demand as specified by your Ray cluster configuration. The head node runs the [autoscaler](https://docs.ray.io/en/latest/cluster/key-concepts.html#cluster-autoscaler).

> Note: Ray nodes are implemented as pods when [running on Kubernetes](https://docs.ray.io/en/latest/cluster/kubernetes/index.html#kuberay-index).

Users can submit jobs for execution on the Ray cluster, or can interactively use the cluster by connecting to the head node and running [`ray.init`](https://docs.ray.io/en/latest/ray-core/package-ref.html#ray.init "ray.init"). See [Ray Jobs](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/quickstart.html#jobs-quickstart) for more information.

Head Node
---------

Every Ray cluster has one node which is designated as the *head node* of the cluster. The head node is identical to other worker nodes, except that it also runs singleton processes responsible for cluster management such as the [autoscaler](https://docs.ray.io/en/latest/cluster/key-concepts.html#cluster-autoscaler) and the Ray driver processes [which run Ray jobs](https://docs.ray.io/en/latest/cluster/key-concepts.html#cluster-clients-and-jobs). Ray may schedule tasks and actors on the head node just like any other worker node, unless configured otherwise.

Worker Node
-----------

*Worker nodes* do not run any head node management processes, and serve only to run user code in Ray tasks and actors. They participate in distributed scheduling, as well as the storage and distribution of Ray objects in [cluster memory](https://docs.ray.io/en/latest/ray-core/scheduling/memory-management.html#memory).

Autoscaling
-----------

The *Ray autoscaler* is a process that runs on the [head node](https://docs.ray.io/en/latest/cluster/key-concepts.html#cluster-head-node) (or as a sidecar container in the head pod if [using Kubernetes](https://docs.ray.io/en/latest/cluster/kubernetes/index.html#kuberay-index)). When the resource demands of the Ray workload exceed the current capacity of the cluster, the autoscaler will try to increase the number of worker nodes. When worker nodes sit idle, the autoscaler will remove worker nodes from the cluster.

It is important to understand that the autoscaler only reacts to task and actor resource requests, and not application metrics or physical resource utilization. To learn more about autoscaling, refer to the user guides for Ray clusters on [VMs](https://docs.ray.io/en/latest/cluster/vms/index.html#cloud-vm-index) and [Kubernetes](https://docs.ray.io/en/latest/cluster/kubernetes/index.html#kuberay-index).
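
To make that distinction concrete, here is a toy model of demand-based scaling in plain Python. It is only an illustration, not Ray's actual algorithm: the function name `workers_to_add` and its parameters are hypothetical, and it ignores details such as fitting each individual request onto a single node. The point is that the input is the set of *requested* resources, not measured CPU utilization.

```python
import math


def workers_to_add(pending_cpu_requests, free_cpus, cpus_per_worker):
    """Toy estimate of how many workers are needed to satisfy pending requests.

    pending_cpu_requests: CPUs requested by tasks/actors that are not yet scheduled.
    free_cpus: CPUs currently unclaimed in the cluster.
    cpus_per_worker: CPUs provided by each new worker node (hypothetical, uniform).
    """
    shortfall = sum(pending_cpu_requests) - free_cpus
    if shortfall <= 0:
        return 0  # current capacity already covers the requests
    return math.ceil(shortfall / cpus_per_worker)


# Four CPUs requested, one free, two CPUs per new worker: two workers are needed.
needed = workers_to_add([1, 1, 2], free_cpus=1, cpus_per_worker=2)
```

A busy node with no pending requests would yield `0` here, which mirrors why high physical utilization alone does not trigger a scale-up.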

Ray Jobs
--------

A Ray job is a single application: it is the collection of Ray tasks, objects, and actors that originate from the same script. The worker that runs the Python script is known as the *driver* of the job.

There are three ways to run a Ray job on a Ray cluster:

1.  (Recommended) Submit the job using the [Ray Jobs API](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html#jobs-overview).

2.  Run the driver script directly on any node of the Ray cluster, for interactive development.

3.  Use [Ray Client](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html#ray-client-ref) to connect remotely to the cluster within a driver script.

For details on these workflows, refer to the [Ray Jobs API guide](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html#jobs-overview).
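
As a rough sketch of the recommended route, the Ray Jobs Python SDK can submit a script to a running cluster. The address below assumes a head node whose dashboard listens on the default port `8265` on the local machine, and `my_script.py` is a hypothetical driver script in the current directory; adjust both for your setup.

```python
def submit_script(address="http://127.0.0.1:8265", entrypoint="python my_script.py"):
    """Submit a driver script to a running Ray cluster via the Jobs API.

    Both defaults are assumptions for illustration: a local head node with the
    dashboard on port 8265, and a hypothetical script named my_script.py.
    """
    # Imported lazily so that defining this helper does not require a cluster.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    job_id = client.submit_job(
        entrypoint=entrypoint,
        # Upload the current directory so the cluster can see the script.
        runtime_env={"working_dir": "./"},
    )
    return job_id, client.get_job_status(job_id)
```

Calling `submit_script()` returns the job ID together with its status at submission time; the job itself keeps running on the cluster after the call returns.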

`pip install -U "ray[air]" # installs Ray + dependencies for Ray AI Runtime`

# Ray Dataset, Train, and Tune

To illustrate Ray AIR's capabilities, you will implement an end-to-end example: predicting big tips with New York City Taxi data. Each section will introduce a Ray AIR library before demonstrating its functionality with code examples.

|Ray AIR Component|Example Use Case|
|:--|:--|
|Ray Data|use `Preprocessor` to load and transform input data|
|Ray Train|use `Trainer` to scale XGBoost model training|
|Ray Tune|use `Tuner` for hyperparameter search|
|Ray AIR Predictor|use `BatchPredictor` to load model from best checkpoint for batch inference|
|Ray Serve|use `PredictorDeployment` for online inference|

For this example, you will train an [XGBoost](https://xgboost.readthedocs.io/en/stable/) model. XGBoost is a gradient boosted decision trees library, and you will set up a simple version for this classification task.

You will use the June 2021 [New York City Taxi & Limousine Commission's Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) which contains over 2 million samples to predict whether a trip may result in a tip over 20%.

**Key features**
- `passenger_count`
- `trip_distance` (in miles)
- `fare_amount` (including tax, tip, fees, etc.)
- `trip_duration` (in seconds)
- `hour` (hour that the trip started)
- `day_of_week`
- `is_big_tip` (the label: whether the tip amount was greater than 20%)

## Ray Data
---

First up, you will load in the taxi dataset and transform its raw input into features that will be given to our machine learning model.

|<img src="images/data_highlight.png" width="70%" loading="lazy">|
|:--|
|Ray AIR wraps Ray Data to provide distributed data ingestion and transformation during training, tuning, and inference.|

### Introduction to Ray Datasets

Backed by PyArrow, [Ray Datasets](https://docs.ray.io/en/latest/data/user-guide.html) parallelize loading and transforming data, and provide a standard way to pass references to data across Ray libraries and applications.

#### Key features

- **Flexibility**

    Compatible with a variety of file formats, data sources, and distributed frameworks, Datasets work seamlessly with library integrations like Dask on Ray and can be passed between Ray tasks and actors without copying data.

- **Performance for ML Workloads**

    Datasets offer important features like accelerator support, pipelining, and global random shuffles that accelerate ML training and inference workloads, along with basic distributed data transformations such as map, filter, sort, groupby, and repartition.

- **Persistent Preprocessor**

    The `Preprocessor` primitive explicitly captures and stores the transformations applied to convert inputs into features and is applied at both training and serving to keep the processing consistent across the pipeline.
    
- **Built on Ray Core**

    Inherits scalability to hundreds of nodes, efficient memory usage due to shared memory across processes on the same node, and object spilling and recovery to handle failures. Because Datasets are just lists of object references, they can be passed between tasks and actors without needing to make a copy of the data, which is crucial for making data-intensive applications and libraries scalable.

### Start Ray runtime

In [None]:
import ray

In [None]:
if ray.is_initialized():
    ray.shutdown()

ray.init()

Start a Ray cluster (check out these [instructions](https://docs.ray.io/en/latest/ray-overview/installation.html) if you haven't installed Ray) so that Ray can utilize all the cores available to you as workers.

- check `ray.is_initialized()` to ensure that you start with a fresh cluster
- use `ray.init()` to initialize a Ray context

### Create Ray Datasets

In [None]:
# read Parquet file to Ray Dataset
dataset = ray.data.read_parquet(
    "s3://anyscale-training-data/intro-to-ray-air/nyc_taxi_2021.parquet"
)

In [None]:
# split data into training and validation subsets
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

In [None]:
# split datasets into blocks for parallel preprocessing
# num_blocks should be lower than number of cores in the cluster
train_dataset = train_dataset.repartition(num_blocks=5)
valid_dataset = valid_dataset.repartition(num_blocks=5)
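
To build intuition for what the split and repartition steps do, the following plain-Python sketch mimics them on an in-memory list. The helpers `split_rows` and `to_blocks` are hypothetical stand-ins written for this illustration; they are not Ray APIs, and Ray's own implementation works on distributed blocks rather than Python lists.

```python
def split_rows(rows, test_size):
    """Hold out the last `test_size` fraction of rows as the validation subset."""
    n_test = round(len(rows) * test_size)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


def to_blocks(rows, num_blocks):
    """Divide rows into `num_blocks` contiguous blocks of near-equal size."""
    base, extra = divmod(len(rows), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first blocks
        blocks.append(rows[start:start + size])
        start += size
    return blocks


rows = list(range(10))  # stand-in for ten taxi trips
train_rows, valid_rows = split_rows(rows, test_size=0.3)
train_blocks = to_blocks(train_rows, num_blocks=5)
```

Each block is a unit of work that can be processed independently, which is why the number of blocks bounds how much of the preprocessing can run in parallel.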

Many [`Dataset` API elements](https://docs.ray.io/en/latest/data/api/dataset.html#) are available for common transformations and operations.

Take a look at the data:

1. Inspect the schema from the underlying Parquet metadata.
2. Count how many rows are in the training and validation datasets.
3. Inspect the first five samples of either dataset.
4. Compute the average `fare_amount` grouped by `passenger_count`.

In [None]:
print(f"Schema of training dataset: \n {train_dataset.schema()}")

In [None]:
print(f"Number of samples in training dataset: \n {train_dataset.count()}")
print(f"Number of samples in validation dataset: \n {valid_dataset.count()}")

In [None]:
train_dataset.show(5)

In [None]:
train_dataset.groupby("passenger_count").mean("fare_amount").show()
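
For readers new to grouped aggregations, this is what a group-by mean computes, written in plain Python on a few made-up rows. The helper `mean_by_key` and the sample values are purely illustrative and are not taken from the taxi dataset.

```python
from collections import defaultdict


def mean_by_key(rows, key, value):
    """Average `value` over all rows that share the same `key`."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for row in rows:
        acc = totals[row[key]]
        acc[0] += row[value]
        acc[1] += 1
    return {k: total / count for k, (total, count) in totals.items()}


# Made-up sample rows, not real trip records.
sample_rows = [
    {"passenger_count": 1, "fare_amount": 10.0},
    {"passenger_count": 1, "fare_amount": 20.0},
    {"passenger_count": 2, "fare_amount": 30.0},
]
means = mean_by_key(sample_rows, "passenger_count", "fare_amount")
```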

### Preprocess the dataset
To transform the raw data into features, you will define a `Preprocessor`. Ray AIR's `Preprocessor` captures the data transformations you apply and persists them across each stage of the pipeline:

- **During Training**

    `Preprocessor` is passed into a `Trainer` to `fit` and `transform` input `Dataset`s.
- **During Tuning**

    Each `Trial` creates its own copy of the `Preprocessor`, and the fitting and transformation logic occurs once per `Trial`.
- **During Checkpointing**

    The `Preprocessor` is saved in the `Checkpoint` if it was passed into the `Trainer`.
- **During Predicting**

    If the `Checkpoint` contains a `Preprocessor`, it is used to call `transform_batch` on input batches before performing inference.

In [None]:
from ray.data.preprocessors import MinMaxScaler

In [None]:
preprocessor = MinMaxScaler(columns=["trip_distance", "trip_duration"])

You define a `MinMaxScaler` preprocessor that will normalize the `trip_distance` and `trip_duration` columns by their range.
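
The idea behind min-max scaling can be sketched in a few lines of plain Python: each value is mapped to `(x - min) / (max - min)`, where `min` and `max` are learned from the training data. The `ToyMinMaxScaler` class below is a hypothetical stand-in for illustration only; it is not Ray's implementation and handles a single column held in a list.

```python
class ToyMinMaxScaler:
    """Minimal single-column min-max scaler (illustrative, not Ray's class)."""

    def fit(self, values):
        # Learn the range from the training data only.
        self.min_ = min(values)
        self.max_ = max(values)
        return self

    def transform(self, values):
        span = self.max_ - self.min_
        if span == 0:
            # Assumption for this sketch: a constant column maps to 0.0.
            return [0.0 for _ in values]
        return [(v - self.min_) / span for v in values]


scaler = ToyMinMaxScaler().fit([2.0, 4.0, 10.0])   # training distances
scaled = scaler.transform([2.0, 6.0, 12.0])        # unseen distances
```

Because the range is fitted once and then reused, an unseen value outside the training range (such as `12.0` here) lands outside `[0, 1]`. Reusing the fitted statistics is what keeps training and serving consistent.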

Ray AIR provides several [preprocessors out of the box](https://docs.ray.io/en/latest/ray-air/preprocessors.html#) as well as support for implementing custom preprocessors.

### Summary

#### Key concepts

`Dataset`

The standard way to load and exchange data in Ray AIR. In AIR, Datasets are used extensively for data loading, preprocessing, and batch inference.

`Preprocessors`

Preprocessors are primitives that can be used to transform input data into features. Preprocessors operate on Datasets, which makes them scalable and compatible with a variety of datasources and dataframe libraries.

Preprocessors persist:

- during training to fit and transform input data
- in each trial of hyperparameter tuning
- within a checkpoint
- on input batches for inference

AIR comes with a collection of built-in preprocessors, and you can also define your own with simple templates which you can read more about in the [user guide](https://docs.ray.io/en/latest/ray-air/preprocessors.html).

## Ray Train
***

Following data pre-processing, you can define the model for binary classification of big tip rides.

|<img src="images/train_highlight.png" width="70%" loading="lazy">|
|:--|
|Ray AIR wraps Ray Train to provide distributed model training.|

### Introduction to Ray Train

ML practitioners tend to run into a few common problems with training models that prompt them to consider distributed solutions:

1. training time is too long to be practical
2. the data is too large to fit on one machine
3. training many models sequentially doesn't utilize resources efficiently
4. the model itself is too large to fit on a single machine

[Ray Train](https://docs.ray.io/en/latest/ray-air/trainer.html) addresses these issues by cutting down runtime through distributed multi-node training with fault tolerance and leveraging Ray Data to distribute preprocessing and data ingestion.

Fully integrated into the Ray AIR ecosystem, `Trainer`s can plug into:

- Ray Data: to enable scalable data loading and preprocessing
- Ray Tune: for distributed hyperparameter tuning
- Ray AIR Predictor: as a checkpointed trained model to be applied during inference
- Popular ML training frameworks like:
    - PyTorch
    - TensorFlow
    - Horovod
    - XGBoost
    - HuggingFace Transformers
    - Scikit-Learn
    - and more

#### Key features

* callbacks for early stopping
* checkpointing
* integration with TensorBoard, Weights & Biases, and MLflow for observability
* export mechanisms for models

|<img src="images/train_code.png" width="70%" loading="lazy">|
|:--|
|Training comes in two major parts: defining the `Trainer` object and then fitting it to the training dataset. This code snippet uses a `TorchTrainer`, but it can be swapped out for any of the other [integrations](https://docs.ray.io/en/latest/ray-air/package-ref.html#trainer-and-predictor-integrations).|

Let's put these concepts into practice by applying them to our taxi problem.

### Define AIR `Trainer`

Ray AIR provides a variety of [`Trainer`s](https://docs.ray.io/en/latest/ray-air/trainer.html) (PyTorch, TensorFlow, HuggingFace, etc.). In the example below, you will use an `XGBoostTrainer` to perform binary classification on these NYC Taxi rides.

In [None]:
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

In [None]:
trainer = XGBoostTrainer(
    label_column="is_big_tip",
    num_boost_round=50,
    scaling_config=ScalingConfig(
        num_workers=5,
        use_gpu=False,
    ),
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "tree_method": "approx",
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    preprocessor=preprocessor,
)

To construct a `Trainer`, you provide:

- a `ScalingConfig` which specifies how many parallel training workers and what type of resources (CPUs/GPUs) to use per worker during training
- a dictionary of training and validation sets
- the `Preprocessor` used to transform the `Dataset`s

Optionally, you can choose to add `resume_from_checkpoint` which allows you to continue training from a saved checkpoint should the run be interrupted.

### Fit the Trainer

In [None]:
result = trainer.fit()

To invoke training, call `.fit()`. Trainer objects produce a `Result` object which gives you access to metrics, checkpoints, and errors.

You can check out the training results from the `Result` object with the following calls:

```python
# returns last saved checkpoint
result.checkpoint

# returns the `n` best saved checkpoints as configured in `RunConfig.CheckpointConfig`
result.best_checkpoints

# returns the final metrics as reported
result.metrics

# contains an Exception if training failed
result.error
```

Inspect your training result below. What is the reported accuracy for the training and validation runs? 

Note: `error` is the binary classification error rate in this case calculated as `#(wrong cases)/#(all cases)`
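
As a small worked example of that formula, the sketch below computes the error rate and the corresponding accuracy from predicted probabilities, treating predictions above 0.5 as positive. The helper `error_rate` and the sample labels and probabilities are made up for illustration.

```python
def error_rate(labels, probabilities, threshold=0.5):
    """Fraction of cases whose thresholded prediction disagrees with the label."""
    wrong = sum(
        1
        for label, prob in zip(labels, probabilities)
        if (prob > threshold) != bool(label)
    )
    return wrong / len(labels)


# Made-up labels and predicted probabilities for four trips.
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.6]

error = error_rate(labels, probabilities)  # two of four predictions are wrong
accuracy = 1 - error
```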

In [None]:
print(f"Result metrics: \n {result.metrics} \n")

In [None]:
print(f"Training accuracy: {1 - result.metrics['train-error']:.4f}")
print(f"Validation accuracy: {1 - result.metrics['valid-error']:.4f}")

### Summary

#### Key concepts

`Trainer`

Trainers are wrapper classes around third-party training frameworks such as XGBoost, PyTorch, and TensorFlow. They are built to help integrate with core Ray Actors (for distribution), Ray Datasets, and Ray Tune.

## Ray Tune
***

Now that you have a baseline XGBoost model trained, you may want to improve performance by running hyperparameter tuning experiments.

|<img src="images/tune_highlight.png" width="70%" loading="lazy">|
|:--|
|Ray AIR wraps Ray Tune to provide distributed hyperparameter optimization.|

### Introduction to Ray Tune

<div class="alert alert-info">
  <strong><a href="https://en.wikipedia.org/wiki/Hyperparameter_optimization" target="_blank">Hyperparameter tuning (or optimization) (HPO)</a></strong> is the process of choosing optimal hyperparameters for a machine learning model. Hyperparameters, in contrast to weights learned by the model, are parameters that you set to influence training.
</div>


Setting up and executing hyperparameter optimization (HPO) can be expensive in terms of compute resources and runtime with several complexities including:

- **Vast Search Space**

    Your model could have several hyperparameters, each with different data types, ranges, and possible correlations.
    Sampling good candidates from high-dimensional spaces is difficult.
- **Search Algorithms**

    Choosing hyperparameters strategically requires testing complex search algorithms to achieve good results.
- **Long Runtime**

    Even if you distribute tuning, training a complex model can take a long time per run, so it's best to be efficient at every stage of the pipeline.
- **Resource Allocation**

    You must have enough compute resources available during each trial so that scheduling mismatches do not slow down the search.
- **User Experience**

    Developer-facing features such as stopping bad runs early, saving intermediate results, restarting from checkpoints, and pausing/resuming runs make HPO easier.

Ray Tune is a distributed HPO library that addresses each of the challenges above. It provides a simplified interface for running trials and integrates with popular frameworks such as HyperOpt and Optuna.

|<img src="images/tune_code.png" width="70%" loading="lazy">|
|:--|
|General pattern for using AIR `Tuner`s which involves taking in a trainable, defining a search space, establishing a search algorithm, scheduling trials, and analyzing results.|

Let's see how to interact with Ray Tune to make some improvements to our big tip classifier.

### Use AIR `Tuner` for hyperparameter search

In [None]:
from ray import tune
from ray.tune.tuner import Tuner, TuneConfig

In [None]:
param_space = {
    "params": {
        "eta": tune.uniform(0.2, 0.4),
        "max_depth": tune.randint(1, 6),
        "min_child_weight": tune.uniform(0.8, 1.0),
    }
}

tuner = Tuner(
    trainer,
    param_space=param_space,
    tune_config=TuneConfig(num_samples=3, metric="train-logloss", mode="min"),
)

First define a search space with a few hyperparameters to tune:

- `eta` is the learning rate
- `max_depth` specifies how deep each tree is (default=6). A higher value leads to a more complex model.
- `min_child_weight` defines the minimum sum of weights of all observations in a child; used to control overfitting
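
To see what drawing from this search space looks like, the sketch below samples three configurations using only Python's standard `random` module. It is a stand-in for illustration and does not call Ray Tune: `sample_config` is a hypothetical helper, and it assumes the integer range excludes its upper bound, as `tune.randint` does.

```python
import random


def sample_config(rng):
    """Draw one hyperparameter configuration from the ranges used above."""
    return {
        "eta": rng.uniform(0.2, 0.4),
        "max_depth": rng.randrange(1, 6),  # integers 1 through 5; 6 is excluded
        "min_child_weight": rng.uniform(0.8, 1.0),
    }


rng = random.Random(0)  # fixed seed so the draws are repeatable
configs = [sample_config(rng) for _ in range(3)]  # one config per trial
```

Each sampled dictionary corresponds to the `params` that one trial would train with.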

To set up an AIR `Tuner`, you specify:

- `Trainer`: the training loop
- `search space`: a set of hyperparameters you wish to tune
- `search_algorithm`: (optional) to optimize parameter search; the example above does not set one and relies on Tune's default
- `scheduler`: (optional) to stop searches early and speed up experiments

### Execute hyperparameter search and analyze results

In [None]:
result_grid = tuner.fit()

In [None]:
best_result = result_grid.get_best_result()

Calling `tuner.fit()` executes tuning over `num_samples=3` trials, as set in `TuneConfig`. After tuning, you can query the `ResultGrid` object to see metrics, results, and checkpoints of each trial.

You can probe the `ResultGrid` for metrics using these calls:

```python

# checks if there have been errors
result_grid.errors

# gets the best result
best_result = result_grid.get_best_result()

# gets the best checkpoint
best_checkpoint = best_result.checkpoint

# gets the best metrics
best_metrics = best_result.metrics

```
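
Conceptually, picking the best result means comparing every trial on the configured `metric` in the direction given by `mode`. The plain-Python sketch below shows that selection on made-up trial records; `best_trial` is a hypothetical helper and the metric values are invented for illustration.

```python
def best_trial(trials, metric, mode):
    """Return the trial with the lowest (mode="min") or highest (mode="max") metric."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    pick = min if mode == "min" else max
    return pick(trials, key=lambda trial: trial["metrics"][metric])


# Invented trial records, one per sampled configuration.
trials = [
    {"name": "trial_a", "metrics": {"train-logloss": 0.52}},
    {"name": "trial_b", "metrics": {"train-logloss": 0.47}},
    {"name": "trial_c", "metrics": {"train-logloss": 0.49}},
]
best = best_trial(trials, metric="train-logloss", mode="min")
```

With `mode="min"` the trial with the smallest log loss wins, which matches the `metric="train-logloss", mode="min"` settings passed to `TuneConfig` earlier.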

Inspect your tuning results. What is the best result from these experiments? Is it better than the baseline model trained in the previous section?

In [None]:
best_result = result_grid.get_best_result()
print(f"Best result: \n {best_result} \n")

In [None]:
print(f"Training accuracy: {1 - best_result.metrics['train-error']:.4f}")
print(f"Validation accuracy: {1 - best_result.metrics['valid-error']:.4f}")

### Summary

#### Key concepts

`Tuner`

Provides an interface that works with AIR `Trainer`s to perform distributed hyperparameter tuning. You define a set of hyperparameters you wish to tune in a search space, specify a search algorithm, and the `Tuner` returns its results in a `ResultGrid` that contains metrics, results, and checkpoints for each `trial`.