# Notebook 01. Introduction to OpenAI Gym and RLlib

© 2019-2022, Anyscale. All Rights Reserved<br>
📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb) <br>
➡️ [Next notebook](./ex_02_create_multiagent_rllib_env.ipynb) <br>

### Learning objectives
In this tutorial, you will learn:
 * [What is an Environment in Reinforcement Learning (RL)?](#intro_env)
 * [Overview of RL terminology](#intro_rl)
 * [Introduction to OpenAI Gym environments](#intro_gym)
 * [High-level OpenAI Gym API calls](#intro_gym_api)
 * [Overview of RLlib](#intro_rllib)
 * [Train a policy using an algorithm from RLlib](#intro_rllib_api)
 * [Evaluate a RLlib policy](#eval_rllib)
 * [Reload RLlib model from checkpoint and run inference](#reload_rllib)
  

## What is an environment in RL? <a class="anchor" id="intro_env"></a>

Solving a problem in RL begins with an **environment**. In the simplest definition of RL:

> An **agent** interacts with an **environment** and receives a reward.

An environment in RL is the agent's world: a simulation of the problem to be solved. 

<img src="./images/env_key_concept1.png" width="50%" />

The **environment** simulator might be of a:
<ul>
    <li>real, physical machine such as a gas turbine or autonomous vehicle</li>
    <li>real, abstract system such as user behavior on a website or the stock market</li>
    <li>virtual system on a computer such as a board game or a video game</li>
    </ul>
    
The **agent** represents what is triggering the actions.  For example it could be:
<ul>
    <li>a software system that is triggering actions for machines</li>
    <li>a type of user or investor</li>
    <li>a game player or game system that is competing against real players </li>
    </ul> 
    

<i>Comparison of RL to supervised learning</i> <br>
<ul>
    <li><i>Data</i>.  In supervised learning, you start with a labeled dataset.  In contrast, the <b>data in RL is not given up front; the environment acts as a data generator</b>.  One can also do RL on a pre-collected dataset (called offline RL); we will touch on offline RL later. </li>
    <li><i>Training</i>.  Traditional supervised learning treats training as a one-shot process, not as a loop of action -> feedback -> improved action -> repeat. </li>
    </ul>

<b>Why bother with an Agent, Environment, and RL?</b>  <br>
<div class="alert alert-block alert-success">    
<b>💡 Reinforcement learning (RL) is useful when you have sequential decisions that need to be optimized over time.</b> 
</div>

<br> 

## Overview of RL terminology <a class="anchor" id="intro_rl"></a>

An RL environment consists of: 

1. all possible actions (**action space**)
2. a complete description of the environment, nothing hidden (**state space**)
3. an observation by the agent of certain parts of the state (**observation space**)
4. **reward**, which is the only feedback the agent receives after each action.

The model that tries to maximize the expected sum over all future rewards is called a **policy**. The policy is a function mapping the environment's observations to an action to take, usually written **π** (s(t)) -> a(t).  <i>In deep reinforcement learning, this function is a neural network</i>.
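To make the idea of a policy concrete, here is a minimal sketch of a tabular policy. The 4-state problem, the action numbering, and the names `policy_table` and `policy` are assumptions made up for illustration; in deep RL the lookup table would be replaced by a neural network.

```python
# A tabular policy: a plain lookup from observation to action.
# The observation/action numbering is hypothetical, for illustration only.
policy_table = {0: 2, 1: 1, 2: 1, 3: 0}


def policy(observation: int) -> int:
    """Return the action this policy takes for a given observation."""
    return policy_table[observation]


# The policy maps every observation it may see to exactly one action.
actions = [policy(obs) for obs in range(4)]
print(actions)  # [2, 1, 1, 0]
```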

<b>Policy vs Model? </b>
In traditional supervised learning, model means a trained algorithm, or a learned function.

> <i>In RL, a model is roughly equivalent to a policy, but policy is more specific</i> because it is trained to act in a specific environment.  For deployment, we use the word "model" because more people understand the ML meaning of a trained model.

Below is a high-level image of how the Agent and Environment work together to train a Policy model in a RL simulation feedback loop in RLlib.

<img src="./images/env_key_concept2.png" width="98%" />

The **RL simulation feedback loop** repeatedly collects data, for one (single-agent case) or multiple (multi-agent case) policies, trains the policies on these collected data, and makes sure the policies' weights are kept in sync. 

During simulation loops, observations, the actions taken, rewards, and so-called **done** flags are collected from the environment; the done flags mark the boundaries of the different episodes the agents play through in the simulation.

Each simulation iteration is called a <b>time step</b>.  The simulation iterations of action -> reward -> next state -> train -> repeat, until the end state, is called an **episode**, or in RLlib, a **rollout**.  
> 👉 Each episode consists of one or many time steps.

<b>Per episode</b> (that is, until the **done** flag == True), the RL simulation feedback loop repeats up to some specified end state (termination state or timesteps). Examples of termination are:
<ul>
    <li>the end of a maze (termination state)</li>  
    <li>the player died in a game (termination state)</li>
    <li>after 60 videos watched in a recommender system (timesteps).</li>
    </ul>
    
<b>Why train for many episodes?</b>  When you are doing machine learning, you do not just do something once and report the result.  You do it many times, to make sure you did not just get "lucky" one time, and you report typically the average result over all the trials.  RL is similar.  

<div class="alert alert-block alert-success">
<b>💡 In RL, the policy is trained by repeating trials, or episodes (or rollouts), then reporting the calculated reward typically as an average of all achieved rewards.</b> 
</div>
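As a rough sketch of the point above, the snippet below plays many episodes of a made-up toy problem and reports the average episode reward. The function `run_episode`, the reward of 1.0 per time step, and the 20% chance of ending early are all assumptions chosen for illustration; they do not come from Gym or RLlib.

```python
import random


def run_episode(rng: random.Random, max_steps: int = 10) -> float:
    """Play one toy episode and return its total reward.

    Each time step gives reward 1.0. The episode ends early with
    probability 0.2 per step (a stand-in for "the player died"),
    or after max_steps time steps.
    """
    total = 0.0
    for _ in range(max_steps):
        total += 1.0
        if rng.random() < 0.2:  # done flag: a termination state was reached
            break
    return total


rng = random.Random(0)  # seeded so the run is repeatable
num_episodes = 1000
returns = [run_episode(rng) for _ in range(num_episodes)]

# Report the average over all episodes, not the result of a single trial.
mean_reward = sum(returns) / num_episodes
print(f"mean reward over {num_episodes} episodes: {mean_reward:.2f}")
```

A single episode here can return anything from 1.0 to 10.0, which is why the average over many episodes is the more trustworthy number.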

<br>

## Introduction to OpenAI Gym example: frozen lake <a class="anchor" id="intro_gym"></a>

[OpenAI Gym](https://gym.openai.com/) is a well-known reference library of RL environments. 

#### 1. import gym

Below is how you would import gym and view all available environments.

In [1]:
# import libraries
import gym
print(f"gym: {gym.__version__}")

# List all available gym environments
all_env  =  list(gym.envs.registry.all())
print(f'Num Gym Environments: {len(all_env)}')

# You could loop through and list all environments if you wanted
# [print(e) for e in all_env]
envs_starting_with_f = [e for e in all_env if str(e).startswith("EnvSpec(Frozen")]
envs_starting_with_f

gym: 0.21.0
Num Gym Environments: 103


[EnvSpec(FrozenLake-v1), EnvSpec(FrozenLake8x8-v1)]

#### 2. Instantiate your Gym object

The way you instantiate a Gym environment is with the **make()** function.

The .make() function takes arguments:
- **name of the Gym environment**, type: str, Required.
- **runtime parameter values**, Optional.

For the required string argument, you need to know the Gym name.  You can find the Gym name in the Gym documentation for environments, either:
<ol>
    <li>The doc page in <a href="https://www.gymlibrary.ml/environments/toy_text/frozen_lake/">Gym's website</a></li>
    <li>The environment's <a href="https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py">source code </a></li>
    <li>
        <a href="https://www.gymlibrary.ml/environments/classic_control/cart_pole/#description">Research paper (if one exists)</a> referenced in the environment page </li>
    </ol>
    
Below is an example of how to create a basic Gym environment, [frozen lake](https://www.gymlibrary.ml/environments/toy_text/frozen_lake/).  We can see below that the termination condition of an episode will be time steps.

In [2]:
env_name = "FrozenLake-v1"

# Instantiate gym env object with a runtime parameter value (is_slippery).
# is_slippery specifies if the environment is deterministic or stochastic
env = gym.make(
    env_name,
    is_slippery=False,  # whether the grid-world behaves deterministically or not
)

# inspect the gym spec for the environment
print(f"env: {env}")
env_spec = env.spec
print(f"env_spec: {env_spec}")

# Note: "TimeLimit" means termination condition for an episode will be time steps

env: <TimeLimit<FrozenLakeEnv<FrozenLake-v1>>>
env_spec: EnvSpec(FrozenLake-v1)


#### 3. Inspect the environment action and observations spaces

Gym Environments can be deterministic or stochastic.

<ul>
    <li>
        <b>Deterministic</b> if the current state + selected action determines the next state of the environment.  <i>Chess is an example of a deterministic environment</i>, since all possible states/action combinations can be described as a discrete set of rules with states bounded by the pieces and size of the board.</li>
    <li>
        <b>Stochastic</b> if the current state + selected action determine only a probability distribution over next states, so the same action in the same state can lead to different outcomes. <i>Random visitors to a website is an example of a stochastic environment</i>. Note that a policy can be stochastic too: its output is then a probability distribution over the set of possible actions at time step t, and the agent computes its action in two steps: i) sample an action from the policy according to the probability distribution, ii) compute the log likelihood of the sampled action. </li>
    </ul>
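The difference between the two kinds of environment can be sketched with a one-dimensional walk. The functions `deterministic_step` and `stochastic_step` and the slip probability of 0.3 are hypothetical; this is an illustration, not Gym's frozen lake dynamics.

```python
import random


def deterministic_step(state: int, action: int) -> int:
    """The next state is fully determined by (state, action)."""
    return state + action


def stochastic_step(state: int, action: int, rng: random.Random,
                    slip_prob: float = 0.3) -> int:
    """With probability slip_prob the move goes the opposite way."""
    if rng.random() < slip_prob:
        action = -action
    return state + action


rng = random.Random(42)

# The same (state, action) pair, repeated five times.
print([deterministic_step(0, +1) for _ in range(5)])    # always [1, 1, 1, 1, 1]
print([stochastic_step(0, +1, rng) for _ in range(5)])  # each entry is 1 or -1
```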

<b>Gym actions.</b> The action_space describes the numerical structure of the legitimate actions that can be applied to the environment. 

For example, if we have 4 possible discrete actions, we could encode them as:
<ul>
    <li>0: LEFT</li>
    <li>1: DOWN</li>
    <li>2: RIGHT</li>
    <li>3: UP</li>
</ul>

<b>Gym observations.</b>  The observation_space defines the structure as well as the legitimate values for the observation of a state of the environment.  

For example, if we have a 4x4 grid, we could encode them as {0,1,2,3, 4, … ,15} for grid positions ((0,0), (0,1), (0,2), (0,3), …. (3,3)).
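The two encodings above can be tied together in a few lines. This is a minimal sketch, assuming row-major numbering, moves that are clipped at the grid edges, and no holes or slipperiness; the helper names `to_row_col`, `to_obs`, and `move` are made up for illustration.

```python
N_ROWS, N_COLS = 4, 4
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def to_row_col(obs: int) -> tuple:
    """Convert an observation index (0..15) into a (row, col) position."""
    return divmod(obs, N_COLS)


def to_obs(row: int, col: int) -> int:
    """Convert a (row, col) position back into an observation index."""
    return row * N_COLS + col


def move(obs: int, action: int) -> int:
    """Apply an action on the grid; moves off the grid leave the position unchanged."""
    row, col = to_row_col(obs)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, N_ROWS - 1)
    elif action == RIGHT:
        col = min(col + 1, N_COLS - 1)
    elif action == UP:
        row = max(row - 1, 0)
    else:
        raise ValueError(f"invalid action: {action}")
    return to_obs(row, col)


print(to_row_col(5))   # (1, 1)
print(move(0, RIGHT))  # 1
print(move(1, DOWN))   # 5
```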


From the Gym [documentation](https://www.gymlibrary.ml/environments/toy_text/frozen_lake/) about the frozen lake environment, we see: <br>

|Frozen Lake      | Gym space   |
|---------------- | ----------- |
|Action Space     | Discrete(4) |
|Observation Space| Discrete(16)|


 
<b><a href="https://github.com/openai/gym/tree/master/gym/spaces">Gym spaces</a></b> are gym data types.  The main types are `Discrete` for discrete numbers and `Box` for continuous numbers.  

Elements of Gym Space `Discrete` are integers (Python type `int`), and elements of Gym Space `Box` are NumPy arrays whose dtype defaults to `float32`.

Below is an example how to inspect the environment action and observations spaces.

In [3]:
# check if it is a gym instance
if isinstance(env, gym.Env):
    print("This is a gym environment.")
    print()

    # print gym Spaces
    if isinstance(env.action_space, gym.spaces.Space):
        print(f"gym action space: {env.action_space}")
    if isinstance(env.observation_space, gym.spaces.Space):
        print(f"gym observation space: {env.observation_space}") 
        
# Note: the action space is discrete with 4 possible actions.
# Note: the observation space is 4x4 and thus runs from 0 to 15.

This is a gym environment.

gym action space: Discrete(4)
gym observation space: Discrete(16)


#### 4. Inspect gym environment default & runtime parameters

Gym environments contain 2 sets of parameters that are set after the environment object is instantiated.
<ul>
    <li><b>Default parameters</b> are fixed in the Gym environment code itself.</li>
    <li><b>Runtime parameters</b> are passed into the make() function as **kwargs.</li>
    </ul>

Below is an example of how to inspect the environment parameters.  Notice we can tell from the parameters that our frozen lake environment is: <br>
1) <i>Deterministic</i>, and <br>
2) Episode terminates with time step condition <i>max_episode_steps</i> = 100.

In [4]:
# inspect env.spec parameters
 
# View default env spec params that are hard-coded in Gym code itself
# Default parameters are fixed
print("Default spec params...")
print(f"id: {env_spec.id}")
# rewards above this value considered "success"
print(f"reward_threshold: {env_spec.reward_threshold}")
# env is deterministic or stochastic
print(f"nondeterministic: {env_spec.nondeterministic}")
# number of time steps per episode
print(f"max_episode_steps: {env_spec.max_episode_steps}")
# must reset before step or render
print(f"order_enforce: {env_spec.order_enforce}") 

# View runtime **kwargs .spec params.  These params set after env instantiated.
# print(f"type(env_spec._kwargs): {type(env_spec._kwargs)}") #dict
print()
print("Runtime spec params...")
for k, v in env_spec._kwargs.items():
    print(f"{k}: {v}")
print()

# Note:  We can tell that our frozen lake environment is: 
# 1) Success criteria is rewards >= 0.7
# 2) Deterministic
# 3) Episode terminates when number time_steps = 100


Default spec params...
id: FrozenLake-v1
reward_threshold: 0.7
nondeterministic: False
max_episode_steps: 100
order_enforce: True

Runtime spec params...
map_name: 4x4
is_slippery: False



## High-level OpenAI Gym API calls <a class="anchor" id="intro_gym_api"></a>

The most basic Gym API methods are:
<ul>
    <li><b>env.reset()</b> <br>Reset the environment to an initial state.  You should call this method every time at the start of a new episode.</li>
    <li><b>env.render()</b>  <br>Visually inspect the environment. This is for human/debugging purposes; it is not seen by the agent/algorithm.  Note you cannot inspect an environment before it has been initialized with env.reset().</li>
    <li><b>env.step(action)</b> <br>Take an action from the possible action space values.  It takes an action as input, computes the state of the environment after applying that action and returns the 4-tuple (observation, reward, done, info).</li>
    <li><b>env.close()</b> <br>Close an environment.</li>
    </ul>
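To see how these four methods fit together, here is a minimal sketch of a toy environment that mimics the same interface in plain Python. `CorridorEnv` is a hypothetical class written for this illustration; it does not inherit from `gym.Env`, and it follows the 4-tuple `step()` return described above.

```python
class CorridorEnv:
    """Toy environment with a Gym-style interface (not a real gym.Env).

    The agent starts in cell 0 and must reach the last cell.
    Actions: 0 = LEFT, 1 = RIGHT.
    """

    def __init__(self, length: int = 5, max_steps: int = 20):
        self.length = length
        self.max_steps = max_steps
        self.pos = None  # set by reset()
        self.steps = 0

    def reset(self) -> int:
        """Start a new episode and return the first observation."""
        self.pos = 0
        self.steps = 0
        return self.pos

    def step(self, action: int):
        """Apply an action and return (observation, reward, done, info)."""
        if self.pos is None:
            raise RuntimeError("call reset() before step()")
        if action not in (0, 1):
            raise ValueError(f"invalid action: {action}")
        delta = 1 if action == 1 else -1
        self.pos = min(max(self.pos + delta, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.pos == self.length - 1
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= self.max_steps
        return self.pos, reward, done, {}

    def render(self) -> str:
        """Print the corridor for a human; the agent never sees this."""
        line = "".join("A" if i == self.pos else "." for i in range(self.length))
        print(line)
        return line

    def close(self) -> None:
        """Nothing to release in this toy example."""


toy_env = CorridorEnv()
obs = toy_env.reset()
done = False
while not done:
    obs, reward, done, info = toy_env.step(1)  # always move RIGHT
toy_env.render()          # prints ....A
print(obs, reward, done)  # 4 1.0 True
toy_env.close()
```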

In [5]:
# Print the starting observation.  
# Recall possible observations are the positions 0 to 15 on the 4x4 grid.
print(env.reset())
env.render()

0

SFFF
FHFH
FFFH
HFFG


In [6]:
# Take an action
# Recall the possible actions are: 0: LEFT, 1: DOWN, 2: RIGHT, 3: UP

new_obs, reward, done, _ = env.step(2) #Right
print(f"obs: {new_obs}, reward: {reward}, done: {done}")
env.render()
new_obs, reward, done, _ = env.step(1) #Down
print(f"obs: {new_obs}, reward: {reward}, done: {done}")
env.render()

obs: 1, reward: 0.0, done: False
  (Right)
SFFF
FHFH
FFFH
HFFG
obs: 5, reward: 0.0, done: True
  (Down)
SFFF
FHFH
FFFH
HFFG


We can also try to run an action in the frozen lake environment which is outside the defined number range.

In [7]:
# Comment this cell if you want whole notebook to run without errors

# Try to take an invalid action

#env.step(4) # invalid

# should see KeyError below

To test out your environment, typically you will loop through a few episodes to make sure it works.  

In [8]:
from ipywidgets import Output
from IPython import display
import time

# The following three lines are for rendering purposes only.
# They allow us to render the env frame-by-frame in-place
# (w/o creating a huge output which we would then have to scroll through).
out = Output()
display.display(out)
with out:

    # Putting the simple API methods together.
    # Here is a pattern for running a bunch of episodes.
    num_episodes = 5 # Number of episodes you want to run the agent
    total_reward = 0.0  # Initialize reward to 0

    # Loop through episodes
    for ep in range(num_episodes):

        # Reset the environment at the start of each episode
        obs = env.reset()
        done = False

        # Loop through time steps per episode
        while True:
            # take random action, but you can also do something more intelligent 
            action = env.action_space.sample()

            # apply the action
            new_obs, reward, done, info = env.step(action)
            total_reward += reward

            # If the episode is up, then start another one
            if done:
                break

            # Render the env (in place).
            time.sleep(0.3)
            out.clear_output(wait=True)
            print(f"episode: {ep}")
            print(f"obs: {new_obs}, reward: {total_reward}, done: {done}")
            env.render()

# Close the env
env.close()


Output()

## Overview of RLlib <a class="anchor" id="intro_rllib"></a>

<img width="7%" src="./images/rllib-logo.png"> is the most comprehensive open-source Reinforcement Learning framework. **[RLlib](https://github.com/ray-project/ray/tree/master/rllib)** is <b>distributed by default</b> since it is built on top of **[Ray](https://docs.ray.io/en/latest/)**, an easy-to-use, open-source, distributed computing framework for Python that can handle complex, heterogeneous applications. Ray and RLlib run on compute clusters on any cloud without vendor lock-in.

RLlib includes <b>25+ available [algorithms](https://docs.ray.io/en/master/rllib/rllib-algorithms.html)</b>, converted to both <img width="3%" src="./images/tensorflow-logo.png">_TensorFlow_ and <img width="3%" src="./images/pytorch-logo.png">_PyTorch_, covering different sub-categories of RL: _model-free_, _offline RL_, _model-based_, and _gradient-free_. Almost any RLlib algorithm can learn in a <b>multi-agent</b> setting. Many algorithms support <b>RNNs</b> and <b>LSTMs</b>.

On a very high level, RLlib is organized by **environments**, **algorithms**, **examples**, **tuned_examples**, and **models**.  

    ray
    |- rllib
    |  |- env 
    |  |- algorithms
    |  |  |- alpha_zero 
    |  |  |- appo 
    |  |  |- ppo 
    |  |  |- ... 
    |  |- examples 
    |  |- tuned_examples
    |  |- models

Within **_env_** you will find different [base classes](https://docs.ray.io/en/latest/rllib/package_ref/env.html) that you can inherit from to make it easy to implement your environment. RLlib supports environments created using the **OpenAI Gym API** (which supports most use cases). The base classes in the env directory allow users to implement environments that are not covered by OpenAI Gym, such as multi-agent environments or environments that have strict performance or hosting requirements. In the next notebook, we will use the **RLlib MultiAgentEnv** base class to train a **multi-agent** RL model.

Within **_examples_** you will find some examples of common custom RLlib use cases.  

Within **_tuned\_examples_**, you will find, sorted by algorithm, suggested hyperparameter value choices within .yaml files. The Ray RLlib team ran simulations/benchmarks to find these suggested values.  The files are used for daily testing, and weekly hard-task testing, to make sure they all run at speed for both TF and Torch. They give you a leg-up with initial parameter choices!

Within **_models_**, you will find advanced building blocks for customizing the specifics that are used by an algorithm to produce a policy that outputs actions.  In case the model architecture is a neural network, building blocks are given in either <img width="3%" src="./images/tensorflow-logo.png">_TensorFlow_, <img width="3%" src="./images/pytorch-logo.png">_PyTorch_, or both.  For example, the building blocks for DNN, CNN, RNN, LSTM are here. 

In this tutorial, we will mainly focus on the **_algorithms_** package, where we will find RLlib algos to train policy models on environments.


## Train a policy using an algorithm from RLlib <a class="anchor" id="intro_rllib_api"></a>

Once you have an environment, next you need to decide which RL algorithm to use.

#### Step 1.  Import ray

In [9]:
# import libraries
import ray
from ray import tune
from ray.tune.logger import pretty_print
print(f"ray: {ray.__version__}")

ray: 3.0.0.dev0


#### Step 2. Check environment for errors   

Before you start training, it is a good idea to check the environment for errors.  RLlib provides a convenient [Environment pre-check function](https://github.com/ray-project/ray/blob/master/rllib/utils/pre_checks/env.py#L22) for this.  It checks that the environment is compatible with OpenAI Gym and RLlib (and outputs a warning if necessary).

We will start with a new environment, Cart-Pole.  Take a look at the Gym documentation:
<ol>
    <li>The doc page in <a href="https://www.gymlibrary.ml/environments/classic_control/cart_pole/">Gym's website</a></li>
    <li>The environment's <a href="https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py">source code </a></li>
    <li>
        <a href="https://www.gymlibrary.ml/environments/classic_control/cart_pole/#description">Research paper (if one exists)</a> referenced in the environment page </li>
    </ol>

Below, we start a new Cart Pole environment, then in the next cell, check it for errors.

In [10]:
# Instantiate gym env object (no runtime parameter values are passed here)
env = gym.make("CartPole-v1")

# inspect the gym spec for the environment
print(f"env: {env}")
env_spec = env.spec
print(f"env_spec: {env_spec}")

# inspect gym env.spec parameters
print()
print("Environment parameters...")
print(pretty_print(vars(env_spec)))

# Note: "TimeLimit" means termination condition for an episode will be time steps
# Note:  We can tell that our CartPole environment is: 
# 1) Success criteria is rewards >= 475
# 2) Deterministic
# 3) Episode terminates when number time_steps = 500

env: <TimeLimit<CartPoleEnv<CartPole-v1>>>
env_spec: EnvSpec(CartPole-v1)

Environment parameters...
_env_name: CartPole
_kwargs: {}
entry_point: gym.envs.classic_control:CartPoleEnv
id: CartPole-v1
max_episode_steps: 500
nondeterministic: false
order_enforce: true
reward_threshold: 475.0



In [11]:
from ray.rllib.utils.pre_checks.env import check_env

# How to check you do not have any environment errors
print("checking environment ...")
try:
    check_env(env)
    print("All checks passed. No errors found.")
except Exception as e:
    print(f"failed: {e}")

checking environment ...
All checks passed. No errors found.


<b>Get an environment baseline</b>

Let's run through the environment, without rendering, and record the mean reward.  The purpose of this is to obtain a baseline before training a RLlib algorithm.

<div class="alert alert-block alert-success">
💡 If you are doing benchmarks, this random policy is often called a <b>"baseline".</b>
</div>

In [12]:
# Putting the simple API methods together.
# Here is a pattern for running a bunch of episodes.
num_episodes = 60000 # Number of episodes you want to run the agent
total_reward = 0.0  # Initialize reward to 0

# Loop through episodes
for ep in range(num_episodes):

    # Reset the environment at the start of each episode
    obs = env.reset()
    done = False
    
    # Loop through time steps per episode
    while True:
        # take random action, but you can also do something more intelligent 
        action = env.action_space.sample()

        # apply the action
        new_obs, reward, done, info = env.step(action)
        total_reward += reward

        # If the episode is up, then start another one
        if done:
            break

# calculate mean_reward
print()
print("**************")
mean_reward = total_reward / num_episodes
print(f"Baseline mean_reward: {mean_reward:.2f} out of success: {env_spec.reward_threshold} after {num_episodes} episodes")
print("**************")
        
# Close the env
env.close()


**************
Baseline mean_reward: 22.32 out of success: 475.0 after 60000 episodes
**************


#### Step 3.  Select an algorithm and find that algorithm's config class  

There are many factors to consider when selecting which algorithm to use on your environment.  Following are some high-level best practices.
<ol>
    <li>
        <b>The first distinction comes from your action space</b>, i.e., do you have discrete (e.g. LEFT, RIGHT, …) or continuous actions (ex: go to a certain speed)? To check high-level if an algorithm will work, look at the <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html">RLlib algorithms doc page</a>.  <i>Algorithms are listed according to whether or not they support Discrete action spaces vs Continuous action spaces or both.</i> 
    </li>
    <li>
        <b>Choose a stable algorithm.</b>  When you look at the cumulative rewards per time step, they should rise steadily.  You do not want an algorithm where reward jumps up and down a lot.
    </li>
    <li><b>Choose the most sample-efficient algorithm that works for your environment</b>. <i>Off-policy algorithms such as SAC reuse past experience from a replay buffer and are generally more sample-efficient than on-policy algorithms such as PPO.</i>
    </li>
</ol>


Once you have selected the algorithm, <b>look up that algorithm's config class</b>.
<ol>
    <li>Open RLlib docs <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html">and navigate to the Algorithms page.</a></li>
    <li>Scroll down and click url of algo you want to use, e.g. <i><b>PPO</b></i></li>
    <li>On the <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html#ppo">algo docs page </a>, click on the link <i><b>Implementation</b></i>.  This will open the <a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo.py">algo code file on github</a>.</li>
    <li>Search the github code file for the word <i><b>config</b></i></li>
    <li>Typically the docstring example will show: </li>
    <ol>
        <li>Example code implementing RLlib API, then </li>
        <li>Example code implementing Ray Tune API.</li>
    </ol>
    <li>Scroll down to the config <b>__init__()</b> method</li>
    <ol>
            <li>Algorithm default hyperparameter values are here.</li>
    </ol>
    </ol>

In [13]:
# config is an object instead of a dictionary since Ray version >= 1.13
from ray.rllib.algorithms.ppo import PPOConfig

#### Step 4. Choose your config settings and instantiate a config object with those settings

As of Ray 1.13, RLlib configs have been converted from primitive dictionaries into objects. This makes them harder to print, but easier to set/pass.

**Note about RLlib config precedence**
<ol>
    <li>Highest precedence are <b>trainer instantiation settings</b>; these override any other config settings</li>
    <li>RLlib <b>specific algorithm config</b> (see config class description above)</li>
    <li>RLlib <b><a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/algorithm_config.py#L58">general config</a></b> settings have the lowest precedence</li>
    </ol>
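The precedence order above behaves like layered dictionary updates, where later layers override earlier ones. The sketch below only illustrates that idea; the keys and values are made up, and this is not how RLlib merges configs internally.

```python
# Lowest precedence first; each later layer overrides the ones before it.
# All keys and values below are hypothetical examples.
general_config = {"lr": 0.001, "gamma": 0.99, "num_workers": 2}
algorithm_config = {"lr": 5e-5, "clip_param": 0.3}
instantiation_settings = {"num_workers": 4}

effective = {**general_config, **algorithm_config, **instantiation_settings}
print(effective)
# {'lr': 5e-05, 'gamma': 0.99, 'num_workers': 4, 'clip_param': 0.3}
```

Here `lr` comes from the algorithm-specific layer, `num_workers` from the instantiation layer, and `gamma` falls through from the general layer because nothing overrides it.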

In [14]:
# Common RLlib General config (for all algorithms)

from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
config = AlgorithmConfig()

# # uncomment below to see the long list of RLlib general config values
# print(f"RLlib's general default training config values:")
# print(pretty_print(config.to_dict()))

**Note about num_workers**

The number of Ray workers is the number of parallel workers or actors for rollouts.  The actual number of workers will be what you specify + 1 for the head node.

<div class="alert alert-block alert-success">
💡 <b>For num_workers, use ONE LESS than the number of cores you want to use</b> (or omit this argument and let Ray automatically use all cores)!
</div>
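One way to apply this rule of thumb is to derive the value from the machine's core count. This is a small sketch using only the Python standard library; the variable name `suggested_num_workers` is an assumption for illustration, not an RLlib setting.

```python
import os

# os.cpu_count() can return None, so fall back to 1 core in that case.
num_cpus = os.cpu_count() or 1

# Reserve one core for the head node; never go below 0 workers.
suggested_num_workers = max(num_cpus - 1, 0)
print(f"cores: {num_cpus}, suggested num_workers: {suggested_num_workers}")
```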


For example, num_workers = 7 means the actual number of workers is 8 including the head node, where 8 is the number of CPUs on my laptop.  The cell below uses num_workers = 4.

In [15]:
# uncomment below to see the long list of specifically PPO default config values
# print(pretty_print(PPOConfig().to_dict()))

# Define algorithm config values
env_name = "CartPole-v1"
evaluation_interval = 2   #100, num training iterations to run between eval steps
evaluation_duration = 20  #100, num eval episodes to run for the eval step
num_workers = 4          # +1 for head node, num parallel workers or actors for rollouts
num_gpus = 0             # num gpus to use in the cluster
num_envs_per_worker = 1  #1, no vectorization of environments to run at same time

# Define trainer runtime config values
checkpoint_freq = evaluation_interval # freq save checkpoints >= evaluation_interval
checkpoint_at_end = True                # always save last checkpoint
relative_checkpoint_dir = "my_PPO_logs" # redirect logs instead of ~/ray_results/
random_seed = 415
# Set the log level to DEBUG, INFO, WARN, or ERROR 
log_level = "ERROR"

# Create a new training config
# override certain default algorithm config values
config_PPO = (
    PPOConfig()
    .framework(framework='torch')
    .environment(env=env_name, disable_env_checking=False)
    .rollouts(num_rollout_workers=num_workers, num_envs_per_worker=num_envs_per_worker)
    .resources(num_gpus=num_gpus, )
#     .training(gamma=0.9, lr=0.01, kl_coeff=0.3)  # do not override defaults
    .evaluation(evaluation_interval=evaluation_interval, 
                evaluation_duration=evaluation_duration)
    .debugging(seed=random_seed, log_level=log_level)
)

print(type(config_PPO))


<class 'ray.rllib.algorithms.ppo.ppo.PPOConfig'>


In [20]:
# Alternative: build the config step by step, calling one setter
# method at a time on a PPOConfig object instead of chaining them:
config_PPO = PPOConfig()

# Point to the CartPole-v1 env and keep RLlib's env pre-check enabled:
config_PPO.environment(env="CartPole-v1",
                       disable_env_checking=False)

# Reduce the number of workers from 2 (default) to 1 
# to save some resources on the expensive hyperparameter sweep.
# IMPORTANT: More information on resource requirements for tune hyperparameter 
#            and different RLlib algorithm setups below
config_PPO.rollouts(num_rollout_workers=1)

# Set up evaluation:
config_PPO.evaluation(
    # Run evaluation every 2 `train()` calls.
    evaluation_interval=2,
    # Use separate resources (RLlib rollout workers).
    evaluation_num_workers=2,
    # Run 20 units per evaluation, split across the 2 eval workers.
    # The unit is set to "timesteps" below (the alternative is "episodes").
    evaluation_duration=20,
    evaluation_duration_unit="timesteps",
    # # Run evaluation alternatingly with training (not in parallel).
    # evaluation_parallel_to_training=False,
)

config_PPO.evaluation(evaluation_interval=evaluation_interval, 
                evaluation_duration=evaluation_duration)

# Set the log level to DEBUG, INFO, WARN, or ERROR 
config_PPO.debugging(seed=415, log_level="ERROR")

<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x284727b80>

#### Step 5. Instantiate a Trainer from the environment and algorithm config objects

**Two ways to train RLlib policies***
<ol>
    <li><a href="https://docs.ray.io/en/master/rllib/package_ref/index.html">RLlib API.</a> The main methods are:</li>
    <ul>
        <li>train()</li>
        <li>evaluate()</li>
        <li>save()</li>
        <li><b>restore()</b></li>
        <li><b>compute_single_action()</b></li>
    </ul>
    <li><a href="https://docs.ray.io/en/master/tune/api_docs/overview.html">Ray Tune API.</a>  The main methods are:</li>
        <ul>
            <li><b>run()</b></li>
    </ul>
    </ol>
    
*Actually 3 ways.  RLlib CLI from command line using .yml file is a 3rd way, but the .yml file is undocumented: <i>rllib train -f [myfile_name].yml</i><br>

👉 RLlib API <b>.train()</b> will train for 1 training iteration only.  Good for debugging since every single output will be shown for that 1 iteration of training.  

👉 However for usual purposes, Ray Tune API <b>.run()</b> is more convenient since with 1 function call you get experiment management: save, checkpoint, evaluate, and train up to a stopping criteria.

✔️ Both methods will run the RLlib [environment pre-check function](https://github.com/ray-project/ray/blob/master/rllib/utils/pre_checks/env.py#L22) you saw earlier in this notebook (Step 2. Check environment for errors).

👍 You have to use RLlib API method <b>.restore()</b> to reload a checkpointed RLlib model for Serving and Offline learning.  Tune API methods will not work.

👍 After a model is trained, it can be used for inference.  The RLlib API method <b>compute_single_action()</b> uses the trained <i>`policy`</i> (RL word for trained model) to calculate the action for one observation; calling it at every time step plays through 1 <i>`rollout`</i> (RLlib word for episode during inference).  You will see this method used at the end of this notebook. 

<div class="alert alert-block alert-success">
<b>In summary: <br>
    💡 If you are training a RLlib algorithm, train it using Ray Tune API method .run()!!  <br>
    👉  If you are developing or debugging a RLlib algorithm, train it using RLlib API method .train()!! <br>
    👉  If you need to restore a RLlib model, use RLlib API method .restore()!!</b>
</div>
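The inference pattern mentioned above can be sketched as a plain loop that asks for one action per time step. The helper `run_rollout` and the stand-in classes `CountdownEnv` and `ConstantPolicy` are hypothetical and exist only so the sketch runs without Ray or Gym; with a real setup you would pass a Gym environment and a trained RLlib algorithm instead. The loop assumes the 4-tuple `step()` return used throughout this notebook.

```python
def run_rollout(env, algo, max_steps: int = 1000) -> float:
    """Play one episode, asking `algo` for one action per time step."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = algo.compute_single_action(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


class CountdownEnv:
    """Stand-in environment: every episode lasts exactly 3 time steps."""

    def reset(self):
        self.remaining = 3
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        return self.remaining, 1.0, self.remaining == 0, {}


class ConstantPolicy:
    """Stand-in for a trained algorithm: always returns action 0."""

    def compute_single_action(self, obs):
        return 0


print(run_rollout(CountdownEnv(), ConstantPolicy()))  # 3.0
```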

💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!

In [16]:
# ###############
# # EXAMPLE USING RLLIB API .train() FOR 1 ITERATION
# # For completeness, here is how to use RLlib API's .train() method
# # To train for N iterations, you put _.train()_ into a loop, 
# # similar to the way we ran the Gym _env.step()_ in a loop.
# ###############

# # To start fresh, restart Ray in case it is already running
# if ray.is_initialized():
#     ray.shutdown()

# # Use the config object's `build()` method for generating
# # an RLlib Algorithm instance that we can then train.
# ppo_algo = config_PPO.build()
# print(f"Algorithm type: {type(ppo_algo)}")

# # train the PPO Algorithm instance for 8 iterations
# for i in range(8):
#     # Call its `train()` method
#     result = ppo_algo.train()
#     print(f"Iteration={i}, Mean Reward={result['episode_reward_mean']}")

# # To stop the Algorithm and release its blocked resources, use:
# ppo_algo.stop()
# print()

# # Below, you will see evaluation output once every evaluation_interval iterations.
# # Iteration=7, Mean Reward=237.3


From the above cell, you can see how to train a RLlib algorithm 1 iteration at a time.  But it is more practical to train RLlib algorithms using Ray Tune, since many more options are available.

**Instantiate a trainer using Ray Tune API**

Ray Tune offers experiment management in a single call, <b>.run()</b>.  In the code below, we <b>specify a stopping criterion</b> to train until a certain reward is achieved.  In case the desired training reward level is never reached, backup stopping criteria can be given.  Tune will stop training as soon as any one of the criteria is met.  However, best practice when starting out is to use only 1 criterion, so you can be sure what is going to happen.
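
To make the "whichever comes first" behavior concrete, here is a minimal plain-Python sketch of how a dictionary of stopping criteria can be checked against one training result.  It is not Tune's implementation: the helper name `should_stop` and the sample result values are made up for illustration, and it assumes a trial stops as soon as any listed metric reaches its threshold.

```python
def should_stop(result, stop):
    """Return True as soon as ANY metric in `stop` reaches its threshold.

    Hypothetical helper that mimics a dict-style stopping rule;
    it is not part of Ray Tune.
    """
    return any(
        metric in result and result[metric] >= threshold
        for metric, threshold in stop.items()
    )


# Same shape as the `stop` argument used in this notebook, plus one backup criterion.
stop = {"episode_reward_mean": 400, "training_iteration": 200}

# Made-up results for illustration only (409.14 is the reward quoted in this notebook).
early = {"episode_reward_mean": 120.5, "training_iteration": 3}
reward_reached = {"episode_reward_mean": 409.14, "training_iteration": 12}
budget_exhausted = {"episode_reward_mean": 250.0, "training_iteration": 200}

for name, result in [("early", early),
                     ("reward_reached", reward_reached),
                     ("budget_exhausted", budget_exhausted)]:
    print(f"{name}: stop={should_stop(result, stop)}")
```

With 2 criteria, either one can end training, which is why starting with a single criterion makes the outcome easier to reason about.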

💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!


In [22]:
experiment_results = tune.run("PPO", 
                    
    # training config params
    config = config_PPO.to_dict(),
                              
    # Stopping criteria, whichever occurs first: average reward over training episodes, or ...
    stop={"episode_reward_mean": 400, # stop if achieve 400 out of max 500
          # "training_iteration": 200,  # stop after 200 training iterations
          # "timesteps_total": 100000,  # stop after 100,000 timesteps
          },  
                    
    #redirect logs instead of default ~/ray_results/
    local_dir = relative_checkpoint_dir, #relative path
         
    # set checkpoint frequency >= evaluation_interval
    checkpoint_freq = 2,  
    checkpoint_at_end=True,
         
    # Reduce logging messages
    ###############
    # Note about Ray Tune verbosity.
    # Screen verbosity in Ray Tune is defined as verbose = 0, 1, 2, or 3, where:
    # 0 = silent
    # 1 = only status updates, no logging messages
    # 2 = status and brief trial results, includes logging messages
    # 3 = status and detailed trial results, includes logging messages
    # Defaults to 3.
    ###############                          
    verbose = 2,
                              
    # Define what we are comparing for, when we search for the
    # "best" checkpoint at the end.
    metric="episode_reward_mean",
    mode="max",
                              
    )

print("Training completed.")



<br>

Scroll through the output in the cell above.  Look for a single table row that looks like this: <br>

<img src="./images/ppo_cartpole_tune_output.png"></img>

What this is telling you is that Tune ran your experiment.  It terminated when the reward reached 409.14, which satisfied the stopping criterion of reward >= 400, out of a maximum possible environment reward of 500.

## Evaluate a RLlib Policy <a class="anchor" id="eval_rllib"></a>

RLlib policies can be evaluated by:
<ul>
    <li>Calling RLlib Algorithm API method .evaluate()*</li>
    <li>Examining <b>Ray Tune</b> experiment results </li>
    <li>Visualizing training progress in <b>TensorBoard</b></li>
    </ul>

*RLlib algorithm objects can be examined manually using, for example, <i>ppo_algo.evaluate()</i>, but it gives the same info you already saw in the single-iteration .train() output.

**Ray Tune experiment results** <br>
First, let's start by looking at the experiment results object. How long did training take?


In [None]:
# Get RLlib default stats
stats = experiment_results.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

# Typically takes about 67 seconds

**Ray Tune experiment trial results**

Read all the experiment trials into a single pandas dataframe.  The dataframe will have 1 row per trial.

Below, we see a dataframe with only 1 row because Tune only ran 1 trial.  (Because we did not specify a hyperparameter space to search for tuning.)

In [None]:
# Read Tune experiment results in a pandas dataframe
df = experiment_results.results_df

print(f"df.shape: {df.shape}")  #Only 1 trial
print(f"df.columns: {df.columns}")
df.iloc[:,0:8].head()

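How many episodes ran during the last training iteration?

In [None]:
# Get number of episodes in the last reported iteration of the 1st trial
# (episodes_this_iter counts one iteration only, not the whole training run)
df.iloc[0,:].episodes_this_iter  

# Answer is 8 episodes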

What were the best parameter values?  


In [None]:
# Watch out, the following output is long because there are many parameters!

# experiment_results.get_best_config(metric="episode_reward_mean", mode="max")
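
With a single trial, the best config is simply that trial's config.  With several trials, the selection works roughly like the plain-Python sketch below.  This is an illustration only, not Tune's implementation: the helper `best_config`, the trial dictionaries, and the `lr` values are all made up, and the sketch assumes that `mode` must be either "max" or "min".

```python
def best_config(trials, metric, mode):
    """Pick the config of the trial with the best value of `metric`.

    Hypothetical helper for illustration; not part of Ray Tune.
    """
    if mode not in ("max", "min"):
        raise ValueError(f"mode must be 'max' or 'min', got {mode!r}")
    pick = max if mode == "max" else min
    return pick(trials, key=lambda trial: trial[metric])["config"]


# Made-up trials for illustration only.
trials = [
    {"config": {"lr": 1e-3}, "episode_reward_mean": 180.0},
    {"config": {"lr": 5e-5}, "episode_reward_mean": 409.14},
]

print(best_config(trials, "episode_reward_mean", "max"))
```

For a reward metric you want mode="max"; mode="min" would suit a metric such as a loss.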

#### Visualize the training progress in TensorBoard

RLlib automatically creates logs for your trained RLlib models that can be visualized in TensorBoard.  To visualize the performance of your RL model:

<ol>
    <li>Open a terminal</li>
    <li><i><b>cd</b></i> into the logdir path from the above cell's output.</li>
    <li><i><b>ls</b></i></li>
    <li>You should see entries that look like: checkpoint_NNNNNN</li>
    <li>To be able to compare all your experiments, go one directory level up: <i><b>cd ..</b></i></li>
    <li><i><b>tensorboard --logdir . </b></i></li>
    <li>Look at the url in the message, and open it in a browser</li>
</ol>
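
The <i>ls</i> step above can also be done from Python.  The sketch below lists the checkpoint_NNNNNN entries of a directory in numeric order.  The helper name `list_checkpoints` is made up, and the demo runs on a temporary directory that only imitates a trial folder (the entries are created as directories here, which is an assumption about the layout); point it at your own logdir instead.

```python
import re
import tempfile
from pathlib import Path


def list_checkpoints(logdir):
    """Return the checkpoint_NNNNNN entries under `logdir`, ordered by number."""
    pattern = re.compile(r"checkpoint_(\d+)$")
    found = []
    for entry in Path(logdir).iterdir():
        match = pattern.match(entry.name)
        if match:
            found.append((int(match.group(1)), entry))
    # Sort on the numeric suffix only.
    return [entry for _, entry in sorted(found, key=lambda pair: pair[0])]


# Demo on a throwaway directory that imitates a trial folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["checkpoint_000004", "checkpoint_000002"]:
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "other_file.txt").touch()  # ignored by the helper

    names = [p.name for p in list_checkpoints(tmp)]
    print(names)
```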

#### Screenshot of TensorBoard

TensorBoard will give you many pages of charts.  Shown below are just the Train/Eval mean and min rewards.

The charts below show "sample efficiency": the number of training steps it took to achieve a certain level of performance.
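
As a rough illustration of reading sample efficiency off a learning curve, the sketch below finds the first point at which the mean reward reaches a target.  The helper name `steps_to_reach` and the numbers in `history` are made up for illustration; they are not output from this notebook's training run.

```python
def steps_to_reach(history, target_reward):
    """Return the first timestep count at which reward >= target, else None.

    `history` is a list of (timesteps_total, episode_reward_mean) pairs,
    ordered by training progress.
    """
    for timesteps, reward in history:
        if reward >= target_reward:
            return timesteps
    return None


# Made-up learning curve for illustration only (not real training output).
history = [(4000, 22.0), (8000, 41.5), (12000, 97.3), (16000, 210.8), (20000, 405.2)]

print(steps_to_reach(history, 200))  # steps needed to reach a mean reward of 200
print(steps_to_reach(history, 400))  # steps needed to reach a mean reward of 400
print(steps_to_reach(history, 500))  # None: never reached in this toy curve
```

An algorithm that reaches the same target in fewer steps is the more sample-efficient one.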

<b>Train Performance:</b> <br>

---
<img src="./images/ppo_cartpole_training_rewards.png" width="80%" />

<b>Eval Performance:</b> <br>
<img src="./images/ppo_cartpole_eval_rewards.png" width="80%" />

## Reload RLlib model from checkpoint and run inference <a class="anchor" id="reload_rllib"></a>

We want to reload the desired RLlib model from a checkpoint file and then run the model in inference mode on the environment it was trained on.  

You will need:
<ul>
    <li>Your <b>algorithm's config class</b> and exact same <a href="#intro_rllib_api">config settings you used to train your model.</a></li>
    <li>Name of the <b>environment</b> you used to train the model.</li>
    <li>Path to the desired <b>checkpoint</b> file you want to use to restore the model.</li>
    </ul>

#### Step 1. Find the best model checkpoint file

In [None]:
# Get the log directory of the best trial (same metric that was passed to tune.run)
logdir = experiment_results.get_best_logdir(metric="episode_reward_mean", mode="max")
print(logdir)

# Get last checkpoint path
checkpoint = experiment_results.get_last_checkpoint()
print(f"\n{checkpoint}")

#### Step 2. Re-initialize an already-trained algorithm object from the checkpoint file


In [None]:
# Create new Algorithm and restore its state from the last checkpoint.

# create an empty Algorithm
algo = config_PPO.build(env=env_name)

# restore the agent from the checkpoint
algo.restore(checkpoint)

#### Step 3. Play and render the game as a video

Now we want to record a video of the trained model doing inference in the environment it was trained on.

Gym includes video recording and saving capability in its wrapper <i>`gym.wrappers.RecordVideo`</i>, so we will import that wrapper and use it to wrap the environment returned by <i>`gym.make()`</i>.

<div class="alert alert-block alert-success">
👍 During inference, call the RLlib API method <b>compute_single_action()</b>, which uses the trained <i>`policy`</i> (RL word for trained model) to calculate the action for 1 observation.  Call it once per time step to play through 1 <i>`rollout`</i> (RLlib word for episode during inference). 
</div>
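
The shape of that inference loop can be tried out without Ray or Gym.  The sketch below uses two made-up stand-ins, `ToyEnv` and `ToyPolicy`, that only copy the method names and return shapes used in this notebook (`reset()`, `step()` returning 4 values, and `compute_single_action()`).  It shows that the policy is queried once per time step.

```python
import random


class ToyEnv:
    """Stand-in environment with the same reset()/step() shape as the Gym
    API used in this notebook: step() returns (obs, reward, done, info)."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        obs = float(self.t)
        reward = 1.0
        done = self.t >= self.max_steps
        return obs, reward, done, {}


class ToyPolicy:
    """Stand-in for a restored algorithm; counts how often it is queried."""

    def __init__(self):
        self.calls = 0

    def compute_single_action(self, observation):
        self.calls += 1
        return random.choice([0, 1])  # random action, no learning involved


env = ToyEnv(max_steps=5)
policy = ToyPolicy()

obs = env.reset()
done = False
episode_reward = 0.0
while not done:
    # One call yields one action for the current observation.
    action = policy.compute_single_action(observation=obs)
    obs, reward, done, _ = env.step(action)
    episode_reward += reward

print(f"steps={env.t}, policy calls={policy.calls}, reward={episode_reward}")
```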

Execute the cell below, and you will see a video of the rollout, so you can verify visually that the restored agent performs at a near-perfect level.

Note that the CartPole environment's perfect score is 500.  Since we trained our model to around 400, the restored agent should appear very stable. 
<br><br>

In [None]:
from gym.wrappers import RecordVideo

#############
## Create the env to do inference on
#############
# Try this first to make sure it is working
# env = gym.make(env_name)
# Once you have confirmed it works, wrap the environment
# RecordVideo() takes as input a path name where to store the video
# RecordVideo() includes its own render step, records, and saves the video.
env = RecordVideo(gym.make(env_name), "videos")
obs = env.reset()

#############
## Use the restored model and run it in inference mode
## You will see a pop-up video rendering for about 10 seconds
#############
num_episodes_during_inference = 1
num_episodes = 0
episode_reward = 0.0
done = False

while num_episodes < num_episodes_during_inference:
    # Compute an action (`a`).
    a = algo.compute_single_action(observation=obs)
    # Send the computed action `a` to the env.
    obs, reward, done, _ = env.step(a)
    episode_reward += reward
    
    # Is the episode `done`? -> Reset.
    if done:
        print(f"Episode done: reward = {episode_reward}")
        obs = env.reset()
        num_episodes += 1
        episode_reward = 0.0
        break
        
env.close()

# The restored agent manages to achieve a perfect score during the 1 episode rollout.


<br>
<b>Play and share your video .mp4 file with others if you want.</b>
<br><br>

Show a video player in the notebook, and play the video you just recorded!

In [None]:
from IPython.display import Video

cart_pole_video='videos/rl-video-episode-0.mp4'
Video(cart_pole_video, width=500)

In [None]:
# Shut down Ray if you are done
if ray.is_initialized():
    ray.shutdown()

### Summary

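In this notebook, you learned what an environment is in RL and how to interact with one through the OpenAI Gym API.  You then trained a PPO policy on CartPole with RLlib using Ray Tune's <b>.run()</b>, examined the experiment results and TensorBoard logs, restored the model from a checkpoint with <b>.restore()</b>, and ran inference with <b>compute_single_action()</b>.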

### Exercises

1. How would you choose another algorithm to train Cart Pole?  Hint:  Look at the [RLlib algorithm doc page](https://docs.ray.io/en/master/rllib/rllib-algorithms.html).  How would you change the choice of RLlib algorithm from <b>PPO to DQN</b>?

### References

1. [OpenAI Gym Environments](https://www.gymlibrary.ml/)
2. [Ray doc page](https://docs.ray.io/en/latest/)
3. [RLlib GitHub](https://github.com/ray-project/ray/tree/master/rllib)
4. [RLlib Algorithms doc page](https://docs.ray.io/en/master/rllib/rllib-algorithms.html)

📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb)<br>
➡ [Next notebook](./ex_02_create_multiagent_rllib_env.ipynb) <br>