# Introduction to Ray RLlib

<img src="../_static/assets/Generic/ray_logo.png" width="20%" loading="lazy">

## About this notebook

### Is it right for you?

This notebook is an example-based introduction to Ray RLlib.

You will go through an end-to-end example that covers data loading, training, hyper-parameter tuning, predicting and serving. Along the way you will learn about Ray AIR's specialized libraries that collectively form a unified API for scalable ML applications.

It is right for you if:

* you have basic familiarity with the Ray project
* you want to learn about Ray AIR: the unified API for scalable ML applications
* you have an existing ML application or workload and you are looking for tools that will let you scale it easily.

### Prerequisites

For this notebook you should have:

* practical Python and machine learning experience

You should also have completed:
* [Overview of Ray](https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Overview_of_Ray.ipynb)

### Learning objectives

Upon completion of this notebook, you will know about:

* high-level ML libraries that compose Ray AIR: Data, Train, Tune, Serve, and RLlib
* how to use Ray AIR as a unified toolkit to write an end-to-end ML application in Python as well as scale individual jobs
* problems and challenges that Ray AIR attempts to solve

### What will you do?

You will scale a reinforcement learning (RL) application with RLlib and practice key concepts relevant to the RL domain.

## Part 1: Overview of Ray AI Runtime (AIR)

<div class="alert alert-info">
  <strong><a href="https://docs.ray.io/en/latest/ray-air/getting-started.html" target="_blank">Ray AI Runtime (AIR)</a></strong> is an open-source, domain-specific Python library that equips ML engineers, data scientists, and researchers with a scalable and unified toolkit for ML applications.
</div>

Ray AIR is built on top of Ray Core. It supports distributed data processing, model training, tuning, model serving, and reinforcement learning, all in Python. As a result, both individual workloads and end-to-end use cases can be implemented in a single unified library.

### Machine learning workflow with Ray AIR

Each of the five native libraries that Ray AIR wraps is focused on a specific stage of the ML workflow. Because this abstraction layer is built on top of Ray Core, it is distributed and scalable. Ray AIR brings together an ever-growing ecosystem of integrations with your favorite machine learning frameworks.

|<img src="../_static/assets/Introduction_to_Ray_AIR/e2e_air.png" width="70%" loading="lazy">|
|:--|
|Ray AIR enables end-to-end ML development and provides multiple options to integrate with other tools and libraries from the MLOps ecosystem.|

1. [Ray Data](https://docs.ray.io/en/latest/data/dataset.html): scalable, framework-agnostic loading and transforming raw data
1. [Ray Train](https://docs.ray.io/en/latest/train/train.html): distributed multi-node and multi-core model training with fault tolerance that integrates with your favorite training libraries
1. [Ray Tune](https://docs.ray.io/en/latest/tune/index.html): scales experiment execution and hyper-parameter tuning to optimize model performance
1. [Ray Serve](https://docs.ray.io/en/latest/serve/index.html): deploys your model for online or batch inference
1. [Ray RLlib](https://docs.ray.io/en/latest/rllib/index.html): distributed reinforcement learning workloads that integrate with the other Ray AIR libraries above

## Part 2: Production ready reinforcement learning with RLlib

In addition to scaling end-to-end workflows with supervised learning problems, we can use Ray AIR to scale reinforcement learning workloads. Here, we will demonstrate this by training a reinforcement learning agent using online training.

**A Brief Primer on Reinforcement Learning Basics**

Reinforcement learning (RL) involves an **agent** learning what to do through **rewards** based on its interactions with its **environment**. Unlike other types of machine learning, the path to maximizing rewards is not prescribed; it must be learned through trial and feedback over time. To unpack this further, here are some key components of an RL problem:

- **Action Space** - all possible actions; could be discrete steps (left, right) or continuous (accelerate $F  \frac{m}{s^2}$)
- **State Space** - a complete description of the environment; a *value function* specifies the value of reward the agent can accumulate in the future starting from that state
- **Observation Space** - an observation by the agent of certain parts of the state
- **Reward** - feedback, positive or negative, after each action; defines the goal
- **Policy** - defines the learning agent's way of behaving based on its expected sum over all future rewards

![Agent/Env](https://www.kdnuggets.com/images/reinforcement-learning-fig1-700.jpg)
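
To make these components concrete, here is a minimal sketch of the agent-environment loop in plain Python. It does not use Gym or RLlib: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented for this illustration, and the reward values are arbitrary.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.actions = (0, 1)  # action space: 0 = left, 1 = right
        self.position = 0      # state: the agent's cell index

    def reset(self):
        self.position = 0
        return self.position   # observation (here it equals the full state)

    def step(self, action):
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # the reward signal defines the goal
        return self.position, reward, done


def random_policy(observation, actions, rng):
    """A policy maps an observation to an action; this one ignores the observation."""
    return rng.choice(actions)


def run_episode(env, policy, rng, max_steps=1000):
    """Run one episode and return (total reward, steps taken, goal reached)."""
    obs = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = policy(obs, env.actions, rng)
        obs, reward, done = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps, done


rng = random.Random(0)
total, steps, reached_goal = run_episode(CorridorEnv(), random_policy, rng)
print(f"return={total:.1f} steps={steps} reached_goal={reached_goal}")
```

A learning algorithm such as PPO would replace `random_policy` with a policy that is updated from the collected rewards; the loop itself keeps the same shape.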

We'll go into how to work with these components in the coding exercise. For now, let's chat about how to run reinforcement learning applications with Ray RLlib.


### Ray RLlib
***

![RLlib Highlight](../_static/assets/Introduction_to_Ray_AIR/rllib_highlight.png)

RLlib is an open-source library for reinforcement learning (RL), offering support for [production-level](https://www.anyscale.com/events/2021/06/23/applying-ray-and-rllib-to-real-life-industrial-use-cases), distributed RL workloads while maintaining unified and simple APIs for a large variety of industry applications. As part of the Ray ecosystem, RLlib integrates well with other Ray libraries like Ray Tune for checkpointing and Ray Serve for deploying models.

**Some Key Features of RLlib:**

- **[PyTorch](https://github.com/ray-project/ray/blob/master/rllib/examples/custom_torch_policy.py) and [TensorFlow](https://github.com/ray-project/ray/blob/master/rllib/examples/custom_tf_policy.py)** - available as backends, with the option to switch between them with one line of code
- **Highly Distributed** - inherits from Ray Core and allows you to configure `num_workers` to run on hundreds of nodes
- **Vectorized and Remote Environments** - batched and parallel environments; RLlib auto-vectorizes `gym.Env`s via the `num_envs_per_worker` config
- **Support for Multi-Agent** - convert custom `gym.Envs` into a multi-agent set-up to start training with cooperative policies, adversarial scenarios, and/or independent learning
- **External Simulators** - supports an external environment API and comes with a pluggable, off-the-shelf client/server setup that allows you to run hundreds of independent simulators on the "outside", connecting to a central RLlib Policy-Server that learns and serves actions
- **Offline Support** - comes with several offline algorithms (CQL, MARWIL, and DQfD) allowing either behavior-cloning an existing system or learning how to improve it
- [**Algorithms**](https://docs.ray.io/en/latest/rllib/rllib-algorithms.html) - a growing collection of 25+ algorithms covering offline, model-free on-policy, model-free off-policy, model-based, derivative-free, recommender system, contextual bandit, multi-agent, and other RL settings
- [**Environments**](https://docs.ray.io/en/latest/rllib/rllib-env.html) - support for several different types of environments including OpenAI Gym, user-defined, multi-agent, and batch environments
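
As a rough illustration of the backend switch and the scaling options above, the sketch below builds a plain Python dict using the config keys named in this list (`num_workers`, `num_envs_per_worker`) together with the `env` and `framework` keys used later in this notebook. The values are arbitrary, `sampling_envs` is a hypothetical helper, and its count is only an approximation of how many environment copies collect experience.

```python
# Illustrative RLlib-style config dict; the values are arbitrary.
base_config = {
    "env": "CartPole-v1",
    "framework": "torch",      # the one-line backend switch
    "num_workers": 4,          # rollout workers collecting experience in parallel
    "num_envs_per_worker": 8,  # environment copies stepped as a batch by each worker
}

# Switching backends changes a single key.
tf_config = {**base_config, "framework": "tf"}


def sampling_envs(config):
    """Rough count of environment copies used for sampling (hypothetical helper)."""
    workers = max(1, config.get("num_workers", 0))
    return workers * config.get("num_envs_per_worker", 1)


print(sampling_envs(base_config))  # 4 workers x 8 envs each
```

No Ray cluster is started here; the dict only shows which keys you would adjust.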

### Example: CartPole Training and Online Evaluation

For our example, we will run training on the [CartPole environment](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) from [OpenAI Gym](https://www.gymlibrary.dev/). The premise is essentially that there is a pole attached to a cart on a frictionless track and the agent's job is to balance this pendulum upright by moving left and right. The observation space consists of the cart position, cart velocity, pole angle, and pole angular velocity, and the goal is to keep the pole upright for as long as possible.

![Cartpole](https://www.gymlibrary.dev/_images/cart_pole.gif)

*Figure 8*

#### Importing Relevant Packages

In [None]:
import gym
import numpy as np

from ray.air import RunConfig
from ray.air import ScalingConfig

from ray.air import Checkpoint
from ray.air import Result

from ray.train.rl import RLTrainer
from ray.train.rl import RLPredictor

1. To begin, we'll use [OpenAI Gym](https://www.gymlibrary.dev/), a standard open source Python library for developing and comparing reinforcement learning algorithms that also provides a standard set of environments.
2. With Ray AIR's `RunConfig` and `ScalingConfig`, we can specify configurations for training/tuning runs and for scaling training, respectively, so that your settings are preserved in the pipeline.
3. Recall the `Checkpoint` and `Result` objects from previous sections, which store the state of your model and the result from a training or tuning trial.
4. Import the RL-specific trainer and predictor, which are able to take in Ray datasets and preprocessors from prior steps.

Note: We are using a Ray AIR wrapper for RLlib's trainable, which allows a smoother integration with the Ray ecosystem. For custom environments, preprocessors, or models, you can check out the [Training APIs for RLlib](https://docs.ray.io/en/latest/rllib/rllib-training.html).

#### Define a Training Function

In [None]:
def train_rl(num_workers, use_gpu):
    trainer = RLTrainer(
        run_config=RunConfig(stop={"training_iteration": 5}),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        algorithm="PPO",
        config={
            "env": "CartPole-v1",
            "framework": "torch",
        },
    )
    
    return trainer.fit()

We can define a training function, `train_rl()` that takes in:
- `num_workers`: int, the number of workers to start
- `use_gpu`: bool, whether to use a GPU

and creates an `RLTrainer`, similar to the `Trainer` object we encountered in the Ray Train section, which specifies:

- `RunConfig`: sets up how the training run should happen
- `ScalingConfig`: allows you to adjust settings for how to scale training
- `algorithm`: we use the [`PPO` algorithm](https://openai.com/blog/openai-baselines-ppo/)
- `config`: specifies our `CartPole-v1` environment and uses PyTorch as our backend

When we call `train_rl()`, it returns a `Result` object automatically created by `trainer.fit()` through which we can access a `Checkpoint` of the trained model.

#### Define an Evaluation Function

Next, we want to create a function to evaluate how well our model trained. Its performance will be evaluated on a reset version of the same environment. In this online evaluation technique, unlike supervised learning cases where we evaluate on a static test set, we probe how well the agent performs through a live simulation.

In [None]:
def evaluate(checkpoint, num_episodes):
    predictor = RLPredictor.from_checkpoint(checkpoint)

    env = gym.make("CartPole-v1")

    rewards = []
    for i in range(num_episodes):
        obs = env.reset()
        reward = 0.0
        done = False
        while not done:
            action = predictor.predict(np.array([obs]))
            obs, r, done, _ = env.step(action[0])
            reward += r
        rewards.append(reward)

    return rewards

We create an `evaluate()` function that takes in:
- `checkpoint`: the saved model state from the training `Result`
- `num_episodes`: the number of episodes to run, where each episode is one full sequence of agent-environment interactions from reset to a terminal state

To begin, we:

- Create an `RLPredictor` from the `checkpoint`
- Set the environment to `CartPole-v1`
- Create a list of rewards for each episode

For every episode:
- `env.reset()` - each component of the observation (cart position, cart velocity, pole angle, and pole angular velocity) is reset to a uniformly random value in `(-0.05, 0.05)`
- set `reward` to 0 and `done` flag to `False`

While we're not in a terminal state:
- take an action based on the trained model's best judgement of the current observation
- obtain a new observation, reward, and done flag after taking a step
- add the step's reward to the episode total; the environment gives a reward of `+1` for every step taken, including the termination step

<!-- **Example Terminal States**

1. Termination: Pole Angle is greater than $\pm 12 \degree$
2. Termination: Cart Position is greater than $\pm2.4$ (center of the cart reaches the edge of the display)
3. Truncation: Episode length is greater than 500 (200 for v0) -->
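
To make "terminal state" concrete, the sketch below classifies a CartPole observation using the limits documented for the Gym environment (pole angle beyond 12 degrees, cart position beyond 2.4, and an episode length limit of 500 for v1). The helper `episode_status` is a hypothetical name for this illustration; the real environment performs these checks internally and reports them through the `done` flag.

```python
import math

# Limits as documented for Gym's CartPole; treated as assumptions in this sketch.
MAX_POLE_ANGLE = math.radians(12)  # pole angle limit, in radians
MAX_CART_POSITION = 2.4            # cart position limit
MAX_EPISODE_STEPS = 500            # CartPole-v1 truncation limit (200 for v0)


def episode_status(obs, step_count):
    """Classify an observation [position, velocity, angle, angular velocity]."""
    cart_position, _, pole_angle, _ = obs
    if abs(cart_position) > MAX_CART_POSITION or abs(pole_angle) > MAX_POLE_ANGLE:
        return "terminated"
    if step_count >= MAX_EPISODE_STEPS:
        return "truncated"
    return "running"


print(episode_status([0.0, 0.0, 0.05, 0.0], step_count=10))   # running
print(episode_status([0.0, 0.0, 0.30, 0.0], step_count=10))   # terminated
print(episode_status([0.0, 0.0, 0.05, 0.0], step_count=500))  # truncated
```

Because every step earns `+1`, an episode's total reward equals the number of steps the pole stayed within these limits.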

#### Online Reinforcement Learning

Finally, let's put it all together: train the model, then evaluate the policy on a fresh environment (using the checkpoint from training) for `num_episodes`. For `CartPole-v1`, the reward threshold is set to `+475`, so let's see how we stack up!

In [None]:
num_episodes = 3

In [None]:
result = train_rl(num_workers=4, use_gpu=False)

In [None]:
rewards = evaluate(result.checkpoint, num_episodes=num_episodes)

In [None]:
print(f"Average Reward Over {num_episodes} Episodes: " f"{np.mean(rewards)}")
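
If you want more than the average, a small helper can compare the episode rewards against the `+475` threshold mentioned above. This is a minimal sketch: `summarize` is a hypothetical helper, and `example_rewards` contains made-up placeholder numbers, not results from an actual training run.

```python
from statistics import mean

REWARD_THRESHOLD = 475.0  # CartPole-v1 reward threshold quoted above


def summarize(rewards, threshold=REWARD_THRESHOLD):
    """Return basic statistics and whether the mean reward meets the threshold."""
    avg = mean(rewards)
    return {
        "mean": avg,
        "min": min(rewards),
        "max": max(rewards),
        "solved": avg >= threshold,
    }


example_rewards = [480.0, 500.0, 450.0]  # made-up placeholder values
print(summarize(example_rewards))
```

With only five training iterations, do not expect the real `rewards` list to reach the threshold; the helper simply makes the gap visible.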

**Coding Exercise**

We have mostly kept this example to focus on RLlib, but we can definitely apply our learnings from previous sections to extend this solution. RLlib integrates particularly well with Ray Tune.

Modify the function `train_rl(num_workers, use_gpu)` we created above to include a tuning step. Some things that will help you along the way:

- An `RLTrainer` can be passed as the first argument when instantiating a `Tuner` object.
- We want a `Checkpoint` at the end of tuning to access later on in our evaluation step, so set the `checkpoint_at_end` option to `True`.
- Remember that Tuner returns a `ResultGrid` that contains all the results from your training run. You can either elect to return the first result, or better yet, return the best result by querying `result_grid.get_best_result()`

In [None]:
from ray import tune

def train_rl(num_workers, use_gpu):
    trainer = RLTrainer(
        run_config=RunConfig(stop={"training_iteration": 5}),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        algorithm="PPO",
        config={
            "env": "CartPole-v1",
            "framework": "torch",
        },
    )
    
    ### MODIFY TRAIN_RL HERE ###

    return trainer.fit()

**Solution**

In [None]:
### SAMPLE IMPLEMENTATION ###

from ray import tune

def train_rl(num_workers, use_gpu):
    trainer = RLTrainer(
        run_config=RunConfig(stop={"training_iteration": 5}),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        algorithm="PPO",
        config={
            "env": "CartPole-v1",
            "framework": "torch",
        },
    )
    
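    # Wrap the trainer in a Tuner; `Tuner` lives in the `tune` module imported above.
    tuner = tune.Tuner(
        trainer,
        _tuner_kwargs={"checkpoint_at_end": True},
    )
    # Take the first (and only) trial's result from the returned ResultGrid.
    result = tuner.fit()[0]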

    return result



### Part 2: Summary

In this section, we trained a reinforcement learning agent using online training in a CartPole environment.

#### Key Concepts

Reinforcement Learning Concepts
- Agent
- Action Space
- State Space
- Observation Space
- Reward
- Policy

#### Key RLlib Objects

* `RunConfig`
* `ScalingConfig`
* `Checkpoint`
* `Result`
* `RLTrainer`
* `RLPredictor`

### Other Resources

If you would like to practice your new skills further with some in-depth examples beyond the embedded coding exercises, take a look at these suggested resources:

* Watch the Ray Summit Talk on [Introduction to Ray AIR](https://github.com/ray-project/hackathon5-algo)
* Check out the [Ray AIR Documentation](https://docs.ray.io/en/latest/ray-air/getting-started.html)
* Understand its [Components and APIs](https://docs.ray.io/en/latest/ray-air/package-ref.html)
* Ray AIR [User Guides](https://docs.ray.io/en/latest/ray-air/user-guides.html) and [Examples](https://docs.ray.io/en/latest/ray-air/examples/index.html)

# Connect with the Ray community

You can learn and get more involved with the Ray community of developers and researchers:

* [Ray documentation](https://docs.ray.io/en/latest)
* [Official Ray Website](https://www.ray.io/): Browse the ecosystem and use this site as a hub to get the information that you need to get going and building with Ray.
* [Join the Community on Slack](https://forms.gle/9TSdDYUgxYs8SA9e8): Find friends to discuss your new learnings in our Slack space.
* [Use the Discussion Board](https://discuss.ray.io/): Ask questions, follow topics, and view announcements on this community forum.
* [Join a Meetup Group](https://www.meetup.com/Bay-Area-Ray-Meetup/): Tune in to meetups to listen to compelling talks, get to know other users, and meet the team behind Ray.
* [Open an Issue](https://github.com/ray-project/ray/issues/new/choose): Ray is constantly evolving to improve developer experience. Submit feature requests and bug reports, and get help via GitHub issues.
* [Become a Ray contributor](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html): We welcome community contributions to improve our documentation and Ray framework.
