# Exercise 01. Introduction to the OpenAI Gym Environment and RLlib Algorithm top-level APIs

© 2019-2022, Anyscale. All Rights Reserved

### Learning objectives
In this tutorial, we will learn about:
 * [What is an Environment in RL?](#intro_env)
 * [Overview of RL terminology](#intro_rl)
 * [Introduction to OpenAI Gym environments](#intro_gym)
 * [High-level OpenAI Gym API calls](#intro_gym_api)
 * [Overview of RLlib](#intro_rllib)
 * [Train a RL model using an algorithm from RLlib](#intro_rllib_api)
 * [Evaluate a RLlib model](#eval_rllib)
 * [Reload RLlib model from checkpoint and run inference](#reload_rllib)
 

## What is an environment in RL? <a class="anchor" id="intro_env"></a>

Solving a problem in RL begins with an **environment**. In the simplest definition of RL:

> An **agent** interacts with an **environment** and receives a reward.

An environment in RL is the agent's world: a simulation of the problem to be solved.

<img src="images/env_key_concept1.png" width="50%" />

The environment simulator might be a:
<ul>
    <li>real, physical situation such as a gas turbine</li>
    <li>virtual system on a computer such as a board game or video game</li>
    </ul>
Why bother with an Agent and Environment?  

> RL is useful when you have sequential decisions that need to be optimized over time. 

Traditional supervised learning treats the problem as one-shot training, not as a loop of action -> feedback -> improved action -> repeat.
<br> 

## Overview of RL terminology <a class="anchor" id="intro_rl"></a>

An RL environment consists of: 

1. all possible actions (**action space**)
2. a complete description of the environment, nothing hidden (**state space**)
3. an observation by the agent of certain parts of the state (**observation space**)
4. **reward**, which is the only feedback the agent receives per action.

The model that tries to maximize the expected sum over all future rewards is called a **policy**. The policy is a function mapping the environment's observations to an action to take, usually written **π** (s(t)) -> a(t).
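
To make the policy notation concrete, here is a minimal sketch in plain Python. The tabular policy, the observation names, and the reward list are hypothetical and are not part of Gym or RLlib. The discount factor `gamma` is also an assumption; it defaults to 1.0 so that the return is the plain sum of rewards described above.

```python
from typing import Dict, List

# Hypothetical tabular policy: maps an observation to an action.
# In this toy example, 0 = LEFT and 1 = RIGHT.
policy: Dict[str, int] = {"start": 1, "middle": 1, "near_goal": 1}


def pi(observation: str) -> int:
    """Return the action the policy selects for an observation."""
    return policy[observation]


def episode_return(rewards: List[float], gamma: float = 1.0) -> float:
    """Sum the rewards of one episode, optionally discounted by gamma."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


action = pi("start")
ret = episode_return([0.0, 0.0, 1.0])
print(f"action: {action}, return: {ret}")
```

A trained policy replaces the hand-written dictionary with a learned model, but the interface stays the same: observation in, action out.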

Below is a high-level image of how the Agent and Environment work together in a RL simulation feedback loop in RLlib.

<img src="images/env_key_concept2.png" width="98%" />

The **RL simulation feedback loop** repeatedly collects data for one (single-agent case) or multiple (multi-agent case) policies, trains the policies on the collected data, and keeps the policies' weights in sync. The collected environment data contains observations, actions taken, rewards received, and so-called **done** flags, which mark the boundaries between the episodes the agents play through in the simulation.

The sequence of simulation iterations (action -> reward -> next state -> train -> repeat) up to the end state is called an **episode** or, in RLlib, a **rollout**.
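
As an illustration of how **done** flags mark episode boundaries, the sketch below splits a flat list of collected transitions into episodes. The tuple layout `(observation, action, reward, done)` and the sample data are assumptions made for this example; RLlib stores its collected data in its own batch format.

```python
from typing import List, Tuple

# Hypothetical transition layout: (observation, action, reward, done)
Transition = Tuple[int, int, float, bool]


def split_into_episodes(batch: List[Transition]) -> List[List[Transition]]:
    """Group consecutive transitions into episodes, cutting after each done=True."""
    episodes: List[List[Transition]] = []
    current: List[Transition] = []
    for transition in batch:
        current.append(transition)
        if transition[3]:  # the done flag closes the current episode
            episodes.append(current)
            current = []
    if current:  # trailing steps of an episode that has not finished yet
        episodes.append(current)
    return episodes


batch = [
    (0, 2, 0.0, False),
    (1, 1, 0.0, True),   # end of the first episode
    (0, 1, 0.0, False),
    (4, 1, 0.0, False),
    (8, 2, 1.0, True),   # end of the second episode
]
episodes = split_into_episodes(batch)
print([len(ep) for ep in episodes])
```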

<b>Per episode</b> (that is, until the **done** flag is True), the RL simulation feedback loop repeats up to some specified end condition (a termination state or a number of timesteps). Examples of termination could be:
<ul>
    <li>the end of a maze (termination state)</li>  
    <li>the player died in a game (termination state)</li>
    <li>after 60 videos watched in a recommender system (timesteps).</li>
    </ul>
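
The two kinds of end condition listed above can be sketched as a small check. The function name, the terminal cells, and the step limit are hypothetical values chosen for illustration only.

```python
from typing import FrozenSet, Optional


def termination_reason(
    state: int,
    step_count: int,
    terminal_states: FrozenSet[int],
    max_steps: int,
) -> Optional[str]:
    """Return why the episode ends, or None if it should continue."""
    if state in terminal_states:
        return "termination state"
    if step_count >= max_steps:
        return "timesteps"
    return None


# Hypothetical setup: cells 5 and 7 end the episode badly, cell 15 is the goal.
terminal = frozenset({5, 7, 15})
print(termination_reason(15, 6, terminal, max_steps=100))
print(termination_reason(1, 100, terminal, max_steps=100))
print(termination_reason(1, 3, terminal, max_steps=100))
```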

## Introduction to OpenAI Gym example: frozen lake <a class="anchor" id="intro_gym"></a>

[OpenAI Gym](https://gym.openai.com/) is a well-known reference library of RL environments. 

#### 1. import gym

Below is how you would import gym and view all available environments.

In [1]:
# import libraries
import gym
print(f"gym: {gym.__version__}")

# List all available gym environments
all_env  =  list(gym.envs.registry.all())
print(f'Num Gym Environments: {len(all_env)}')

# # You could loop through and list all environments if you wanted
# [print(e) for e in all_env]

gym: 0.21.0
Num Gym Environments: 103


#### 2. Instantiate your Gym object

The way you instantiate a Gym environment is with the **make()** function.

The .make() function takes arguments:
- **name of the Gym environment**, type: str, Required.
- **runtime parameter values**, Optional.

For the required string argument, you need to know the Gym name.  You can find the Gym name in the Gym documentation for environments, either:
<ol>
    <li>The doc page in <a href="https://www.gymlibrary.ml/environments/toy_text/frozen_lake/">Gym's website</a></li>
    <li>The environment's <a href="https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py">source code </a></li>
    <li>
        <a href="https://www.gymlibrary.ml/environments/classic_control/cart_pole/#description">Research paper (if one exists)</a> referenced in the environment page </li>
    </ol>
    
Below is an example of how to create a basic Gym environment, [frozen lake](https://www.gymlibrary.ml/environments/toy_text/frozen_lake/).  We can see below that, in addition to any termination states, an episode is capped by a limit on the number of time steps.

In [2]:
env_name = "FrozenLake-v1"
env_runtime_param_value = False

# Instantiate gym env object with a runtime parameter value
env = gym.make(
        env_name, 
        is_slippery=env_runtime_param_value)

# inspect the gym spec for the environment
print(f"env: {env}")
env_spec = env.spec
print(f"env_spec: {env_spec}")

# Note: "TimeLimit" means termination condition for an episode will be time steps

env: <TimeLimit<FrozenLakeEnv<FrozenLake-v1>>>
env_spec: EnvSpec(FrozenLake-v1)


#### 3. Inspect the environment action and observations spaces

Gym Environments can be deterministic or stochastic.

<ul>
    <li>
        <b>Deterministic</b> if the current state + selected action determines the next state of the environment.  Chess is a deterministic environment, since all possible states/action combinations can be described as a discrete set of rules with states bounded by the pieces and size of the board.</li>
    <li>
        <b>Stochastic</b> if the current state + selected action does not fully determine the next state, so the same action in the same state can lead to different outcomes. Stochastic environments are random in nature.  Random visitors to a website is an example of a stochastic environment. A policy can be stochastic too: its output is then a probability distribution over the set of possible actions at time step t, and the agent computes its action in two steps: i) sample an action from the policy according to the probability distribution, ii) compute the log likelihood of the sampled action.</li>
    </ul>
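
The difference between a deterministic and a random transition can be sketched with two toy step functions. This is an illustration under assumptions: the one-dimensional positions and the `slip_prob` parameter are hypothetical, and the slip model is not the actual dynamics of any Gym environment.

```python
import random


def deterministic_step(position: int, move: int) -> int:
    """The next position is fully determined by the position and the move."""
    return position + move


def stochastic_step(
    position: int, move: int, rng: random.Random, slip_prob: float = 1 / 3
) -> int:
    """With probability slip_prob the agent slips and stays in place."""
    if rng.random() < slip_prob:
        return position
    return position + move


rng = random.Random(0)  # seeded so the sketch is reproducible
deterministic_outcomes = {deterministic_step(0, 1) for _ in range(100)}
stochastic_outcomes = {stochastic_step(0, 1, rng) for _ in range(100)}
print(f"deterministic outcomes: {deterministic_outcomes}")
print(f"stochastic outcomes: {stochastic_outcomes}")
```

Repeating the same move from the same position always gives one outcome in the deterministic case, while the random version can produce more than one.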

<b>Gym actions.</b> The action_space describes the numerical structure of the legitimate actions that can be applied to the environment. 

For example, if we have 4 possible discrete actions, we could encode them as:
<ul>
    <li>0: LEFT</li>
    <li>1: DOWN</li>
    <li>2: RIGHT</li>
    <li>3: UP</li>
</ul>
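
One way to keep such an encoding readable in your own code is an `IntEnum`. The `Action` class and the `is_valid_action` helper below are hypothetical and not part of Gym; they only mirror the four codes listed above.

```python
from enum import IntEnum


class Action(IntEnum):
    """Named constants for the four discrete action codes."""
    LEFT = 0
    DOWN = 1
    RIGHT = 2
    UP = 3


def is_valid_action(value: int) -> bool:
    """Check whether an integer is one of the defined action codes."""
    return value in {action.value for action in Action}


print(Action(2).name)
print(is_valid_action(3), is_valid_action(4))
```

Because `IntEnum` members are integers, a value such as `Action.RIGHT` can be used anywhere a plain action code is expected.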

<b>Gym observations.</b>  The observation_space defines the structure as well as the legitimate values for the observation of the state of the environment.  

For example, if we have a 4x4 grid, we could encode the 16 positions as {0, 1, 2, 3, 4, …, 15} for grid positions ((0,0), (0,1), (0,2), (0,3), …, (3,3)).
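
The mapping between a grid position and its integer code can be sketched as follows, assuming row-major ordering (position (0,0) first, then along each row). The helper names are hypothetical.

```python
from typing import Tuple


def to_index(row: int, col: int, ncols: int = 4) -> int:
    """Encode a (row, col) grid position as a single integer."""
    return row * ncols + col


def to_position(index: int, ncols: int = 4) -> Tuple[int, int]:
    """Decode an integer observation back into (row, col)."""
    return divmod(index, ncols)


print(to_index(0, 0), to_index(3, 3))
print(to_position(5))
```

For a 4x4 grid the codes run from 0 for position (0,0) to 15 for position (3,3).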


From the Gym [documentation](https://www.gymlibrary.ml/environments/toy_text/frozen_lake/) about the frozen lake environment, we see: <br>

|Frozen Lake      | Gym space   |
|---------------- | ----------- |
|Action Space     | Discrete(4) |
|Observation Space| Discrete(16)|


 
<b><a href="https://github.com/openai/gym/tree/master/gym/spaces">Gym spaces</a></b> are gym data types.  The main types are `Discrete` for discrete numbers and `Box` for continuous numbers.  

Elements of a `Discrete` space are integers (Python type `int`), and elements of a `Box` space are NumPy arrays of floating-point numbers (dtype `float32` by default).

Below is an example of how to inspect the environment action and observation spaces.


In [3]:
# check if it is a gym instance
if isinstance(env, gym.Env):
    print("This is a gym environment.")
    print()

    # print gym Spaces
    if isinstance(env.action_space, gym.spaces.Space):
        print(f"gym action space: {env.action_space}")
    if isinstance(env.observation_space, gym.spaces.Space):
        print(f"gym observation space: {env.observation_space}") 

This is a gym environment.

gym action space: Discrete(4)
gym observation space: Discrete(16)


#### 4. Inspect gym environment parameters

Gym environments contain 2 sets of configuration parameters that are set after the environment object is instantiated.
<ul>
    <li><b>Runtime parameters</b> are passed into the make() function as **kwargs.</li>
    <li><b>Default parameters</b> are fixed in the Gym environment code.</li>
    </ul>
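
The idea of runtime keyword arguments overriding built-in defaults can be sketched in plain Python. Everything here is hypothetical: `DEFAULT_PARAMS`, its values, and `resolve_params` only illustrate the override pattern and are not Gym's registration code.

```python
from typing import Any, Dict

# Hypothetical defaults, standing in for values fixed in an environment's code.
DEFAULT_PARAMS: Dict[str, Any] = {"map_name": "4x4", "is_slippery": True}


def resolve_params(**runtime_kwargs: Any) -> Dict[str, Any]:
    """Merge runtime keyword arguments over the defaults."""
    unknown = set(runtime_kwargs) - set(DEFAULT_PARAMS)
    if unknown:
        raise TypeError(f"unexpected parameters: {sorted(unknown)}")
    return {**DEFAULT_PARAMS, **runtime_kwargs}


print(resolve_params(is_slippery=False))
print(resolve_params())
```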

Below is an example of how to inspect the environment parameters.  

Notice we can tell from the parameters that our frozen lake environment is: 
1) Deterministic, and 
2) Episode terminates with time step condition max_episode_steps = 100.

In [5]:
# inspect env.spec parameters
 
# View runtime **kwargs .spec params.  These params are set after the env is instantiated.
# print(f"type(env_spec._kwargs): {type(env_spec._kwargs)}") #dict
print("Runtime spec params...")
for k, v in env_spec._kwargs.items():
    print(f"{k}: {v}")
print()
 
# View default env spec params
# Default parameters are fixed
print("Default spec params...")
print(f"id: {env_spec.id}")
print(f"entry_point: {env_spec.entry_point}")
print(f"reward_threshold: {env_spec.reward_threshold}")
print(f"nondeterministic: {env_spec.nondeterministic}")
print(f"max_episode_steps: {env_spec.max_episode_steps}")
print(f"order_enforce: {env_spec.order_enforce}")

# We can tell that our frozen lake environment is: 
# 1) Deterministic, and 
# 2) Episode terminates with condition max_episode_steps = 100


Runtime spec params...
map_name: 4x4
is_slippery: False

Default spec params...
id: FrozenLake-v1
entry_point: gym.envs.toy_text:FrozenLakeEnv
reward_threshold: 0.7
nondeterministic: False
max_episode_steps: 100
order_enforce: True


#### 5. Perform some basic Gym API calls <a class="anchor" id="intro_gym_api"></a>

The most basic Gym API methods are:
<ul>
    <li><b>env.reset()</b> <br>Reset the environment to an initial state. This is how you initialize an environment so you can run a simulation on it.  You should call this method every time you start a new episode.</li>
    <li><b>env.render()</b>  <br>Visually inspect the environment anytime. Note you cannot inspect an environment before it has been initialized with env.reset().</li>
    <li><b>env.step(action)</b> <br>Take an action from the possible action space values.  It accepts an action, computes the state of the environment after applying that action and returns the 4-tuple (observation, reward, done, info).</li>
    <li><b>env.close()</b> <br>Close an environment.</li>
    </ul>
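
To show how these four methods fit together, here is a minimal sketch of an environment written in plain Python that follows the same call pattern. `CorridorEnv` is a hypothetical toy class, not a Gym environment, and it does not inherit from `gym.Env`; it only mimics the reset/step/render/close interface and the 4-tuple return value described above.

```python
from typing import Dict, Optional, Tuple


class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, the goal is the last cell."""

    def __init__(self, length: int = 4) -> None:
        self.length = length
        self.position: Optional[int] = None  # unset until reset() is called

    def reset(self) -> int:
        """Start a new episode and return the initial observation."""
        self.position = 0
        return self.position

    def step(self, action: int) -> Tuple[int, float, bool, Dict]:
        """Apply an action (0 = left, 1 = right); return (obs, reward, done, info)."""
        if self.position is None:
            raise RuntimeError("Call reset() before step().")
        if action not in (0, 1):
            raise ValueError(f"invalid action: {action}")
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done, {}

    def render(self) -> str:
        """Print the corridor with 'A' marking the agent's cell."""
        line = "".join("A" if i == self.position else "." for i in range(self.length))
        print(line)
        return line

    def close(self) -> None:
        """Nothing to release in this toy environment."""


corridor = CorridorEnv(length=4)
obs = corridor.reset()
done = False
total = 0.0
while not done:
    obs, reward, done, info = corridor.step(1)  # always move right
    total += reward
corridor.render()
corridor.close()
```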

In [6]:
# Print the starting observation.  Recall possible observations are between 0-15.
print(env.reset())
env.render()

0

[41mS[0mFFF
FHFH
FFFH
HFFG


In [7]:
# Take an action
# Recall the possible actions are: 0: LEFT, 1: DOWN, 2: RIGHT, 3: UP

new_obs, reward, done, _ = env.step(2) #Right
print(f"obs: {new_obs}, reward: {reward}, done: {done}")
env.render()
new_obs, reward, done, _ = env.step(1) #Down
print(f"obs: {new_obs}, reward: {reward}, done: {done}")
env.render()

obs: 1, reward: 0.0, done: False
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
obs: 5, reward: 0.0, done: True
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG


We can also try to run an action in the frozen lake environment which is outside the defined number range.

In [8]:
# # Comment this cell if you want whole notebook to run without errors

# # Try to take an invalid action

# env.step(4) # invalid

# # should see KeyError below

In [9]:
# Putting the simple API methods together.
# Here is a pattern for running a bunch of episodes.
 
num_episodes = 1000 # Number of episodes you want to run the agent
render_freq = 200  # Render every X number of episodes 
total_reward = 0  # Cumulative reward, summed over all episodes

# Loop through episodes
for ep in range(num_episodes):

    # Reset the environment at the start of each episode
    obs = env.reset()
    done = False
    
    # Loop through time steps per episode
    while True:
        # take random action, but you can also do something more intelligent 
        action = env.action_space.sample()

        # apply the action
        new_obs, reward, done, info = env.step(action)
        total_reward += reward

        # If the episode is up, then start another one
        if done:
            break
            
    # Render the env only every render_freq episodes
    if ep % render_freq == 0:
        print(f"episode: {ep}")
        print(f"obs: {new_obs}, reward: {total_reward}, done: {done}")
        env.render()

# Close the env
env.close()

episode: 0
obs: 5, reward: 0.0, done: True
  (Right)
SFFF
F[41mH[0mFH
FFFH
HFFG
episode: 200
obs: 5, reward: 1.0, done: True
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
episode: 400
obs: 5, reward: 1.0, done: True
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
episode: 600
obs: 5, reward: 2.0, done: True
  (Right)
SFFF
F[41mH[0mFH
FFFH
HFFG
episode: 800
obs: 12, reward: 5.0, done: True
  (Down)
SFFF
FHFH
FFFH
[41mH[0mFFG


## Overview of RLlib <a class="anchor" id="intro_rllib"></a>

<img width="7%" src="images/rllib-logo.png"> is the most comprehensive open-source Reinforcement Learning framework. **[RLlib](https://github.com/ray-project/ray/tree/master/rllib)** is <b>distributed by default</b> since it is built on top of **[Ray](https://docs.ray.io/en/latest/)**, an easy-to-use, open-source, distributed computing framework for Python that can handle complex, heterogeneous applications. Ray and RLlib run on compute clusters on any cloud without vendor lock-in.

RLlib includes <b>25+ available [algorithms](https://docs.ray.io/en/master/rllib/rllib-algorithms.html)</b>, converted to both <img width="3%" src="images/tensorflow-logo.png">_TensorFlow_ and <img width="3%" src="images/pytorch-logo.png">_PyTorch_, covering different sub-categories of RL: _model-based_, _model-free_, and _Offline RL_. Almost any RLlib algorithm can learn in a <b>multi-agent</b> setting. Many algorithms support <b>RNNs and LSTMs</b>.

Roughly, RLlib is organized by **environments**, **algorithms**, **examples**, and **tuned_examples**.  

    ray
    |- rllib
    |  |- env 
    |  |- algorithms 
    |  |  |- alpha_zero 
    |  |  |- appo 
    |  |  |- ppo 
    |  |  |- ... 
    |  |- examples 
    |  |- tuned_examples

Within **_env_** you will find different [base classes](https://docs.ray.io/en/latest/rllib/package_ref/env.html) that you can inherit from to make it easy to implement your environment. RLlib supports environments created using the **OpenAI Gym API** (which supports most use cases). The base classes in the env directory allow users to implement environments that don't fall into common use cases, such as multi-agent environments and environments that have strict performance or hosting requirements. In the next notebook, you will see we're using the **RLlib MultiAgentEnv** base class to train a **multi-agent** RL model.

Within **_examples_** you will find some examples of common custom RLlib use cases.

Within **_tuned_examples_**, you will find, sorted by algorithm, suggested hyperparameter value choices within .yaml files. The Ray RLlib team ran simulations/benchmarks to find these suggested values.  These files are used for daily testing and weekly hard-task testing to make sure they all run at speed, for both TF and Torch. They give you a head start on parameter choices!


In this tutorial, we will mainly focus on **_algorithms_**, where we will find RLlib algorithms to train RLlib models on environments.


#### Step 1.  Import ray

In [10]:
# import libraries
import ray
from ray import tune
from ray.tune.logger import pretty_print
print(f"ray: {ray.__version__}")

Package pickle5 becomes unnecessary in Python 3.8 and above. Its presence may confuse libraries including Ray. Please uninstall the package.


ray: 3.0.0.dev0


#### Check environment for errors

Before you start training, it is a good idea to check the environment for errors.  RLlib provides a convenient [Environment pre-check function](https://github.com/ray-project/ray/blob/master/rllib/utils/pre_checks/env.py#L22) for this.  
    
Below, we start with a new environment, [Cart-Pole](https://www.gymlibrary.ml/environments/classic_control/cart_pole/), then in the next cell, check it for errors.


In [11]:
# Instantiate gym env object with a runtime parameter value
env = gym.make("CartPole-v1")

# inspect the gym spec for the environment
print(f"env: {env}")
env_spec = env.spec
print(f"env_spec: {env_spec}")
 
# View runtime **kwargs .spec params.  These params are set after the env is instantiated.
# print(f"type(env_spec._kwargs): {type(env_spec._kwargs)}") #dict
print("Runtime spec params...")
for k, v in env_spec._kwargs.items():
    print(f"{k}: {v}")
print()
 
# View default env spec params
# Default parameters are fixed
print("Default spec params...")
print(f"id: {env_spec.id}")
print(f"entry_point: {env_spec.entry_point}")
print(f"reward_threshold: {env_spec.reward_threshold}")
print(f"nondeterministic: {env_spec.nondeterministic}")
print(f"max_episode_steps: {env_spec.max_episode_steps}")
print(f"order_enforce: {env_spec.order_enforce}")

# Note: "TimeLimit" means termination condition for an episode will be time steps

env: <TimeLimit<CartPoleEnv<CartPole-v1>>>
env_spec: EnvSpec(CartPole-v1)
Runtime spec params...

Default spec params...
id: CartPole-v1
entry_point: gym.envs.classic_control:CartPoleEnv
reward_threshold: 475.0
nondeterministic: False
max_episode_steps: 500
order_enforce: True


In [12]:
from ray.rllib.utils.pre_checks.env import check_env

# How to check you do not have any environment errors
print("checking environment ...")
try:
    check_env(env)
    print("All checks passed. No errors found.")
except Exception as e:
    # Report why the check failed instead of hiding the error
    print(f"failed: {e}")

checking environment ...
All checks passed. No errors found.


## Train a RL model using an algorithm from RLlib <a class="anchor" id="intro_rllib_api"></a>

#### Step 2.  Select an algorithm and instantiate a config object using that algorithm's config class  

<ol>
    <li><a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html">Open RLlib docs</a></li>
    <li>Scroll down and click the link for the algorithm you're searching for, e.g. <i><b>PPO</b></i></li>
    <li>On the <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html#ppo">algo docs page </a>, click on the link <i><b>Implementation</b></i>.  This will open the <a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo.py">algo code file on github</a>.</li>
    <li>Search the github code file for the word <i><b>config</b></i></li>
    <li>Typically the docstring example will show: </li>
    <ol>
        <li>Example code implementing RLlib API, then </li>
        <li>Example code implementing Ray Tune API.</li>
    </ol>
    <li>Scroll down to the config <b>__init__()</b> method</li>
    <ol>
            <li>Algorithm default hyperparameter values are here.</li>
    </ol>
    </ol>

In [13]:
# Common RLlib General config (for all algorithms)

from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
config = AlgorithmConfig()

# # Uncomment for long list of parameters
# print(f"RLlib Trainer's general default config is:")
# config.to_dict()

#### Step 3. Choose your config settings   

As of Ray 1.13, RLlib configs have been converted from primitive dictionaries into objects. This makes them harder to print, but easier to set and pass.

**Note about RLlib config precedence**
<ol>
    <li>Highest precedence are <b>trainer instantiation settings</b>, these override any other config settings</li>
    <li>RLlib <b>specific algorithm config</b> (see config class description above)</li>
    <li>RLlib <b><a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/algorithm_config.py#L58">general config</a></b> settings have the lowest precedence</li>
    </ol>
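
The precedence order above can be pictured as layered dictionary updates, where later layers overwrite earlier ones. This is a conceptual sketch only: the three dictionaries, their keys and values, and `resolve_config` are hypothetical and are not RLlib's actual merging code.

```python
from typing import Any, Dict

# Hypothetical settings for each layer, lowest precedence first.
general_config: Dict[str, Any] = {"framework": "tf", "num_workers": 2, "lr": 0.001}
algorithm_config: Dict[str, Any] = {"lr": 5e-5, "clip_param": 0.3}
instantiation_settings: Dict[str, Any] = {"framework": "torch", "num_workers": 4}


def resolve_config(
    general: Dict[str, Any],
    algorithm: Dict[str, Any],
    instantiation: Dict[str, Any],
) -> Dict[str, Any]:
    """Apply the layers in order of increasing precedence."""
    merged = dict(general)        # lowest precedence
    merged.update(algorithm)      # overrides general settings
    merged.update(instantiation)  # highest precedence
    return merged


final_config = resolve_config(general_config, algorithm_config, instantiation_settings)
print(final_config)
```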


**Note about num_workers**

The number of Ray workers is the number of parallel workers or actors for rollouts.  The actual number of workers will be what you specify + 1 for the head node.

<b>Use ONE LESS than the number of cores you want to use</b> (or omit this argument and let Ray automatically use all cores)! <br>


For example, num_workers = 7 means the actual number of workers is 8 including the head node, where 8 is the number of CPUs on my laptop. The cell below uses num_workers = 4.
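
The "one less than the number of cores" rule of thumb can be computed rather than hard-coded. The helper name `suggested_num_workers` is hypothetical; `os.cpu_count()` is from the Python standard library and reports the machine's logical CPUs, which may differ from what a cluster makes available.

```python
import os


def suggested_num_workers(total_cpus: int) -> int:
    """Leave one CPU for the head node, never going below zero."""
    return max(total_cpus - 1, 0)


cpus = os.cpu_count() or 1  # cpu_count() can return None on some platforms
print(f"cpus: {cpus}, suggested num_workers: {suggested_num_workers(cpus)}")
```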

In [14]:
# config is an object instead of a dictionary since Ray version >= 1.13
from ray.rllib.algorithms.ppo import PPOConfig

# uncomment below to see the long list of specifically PPO default config values
# print(pretty_print(PPOConfig().to_dict()))

# Define algorithm config values
env_name = "CartPole-v1"
evaluation_interval = 2   #100, num training iterations to run between eval steps
evaluation_duration = 20  #100, num eval episodes to run for the eval step
num_workers = 4          # +1 for head node, num parallel workers or actors for rollouts
num_gpus = 0             # num gpus to use in the cluster
num_envs_per_worker = 1  #1, no vectorization of environments to run at same time

# Define trainer runtime config values
checkpoint_freq = evaluation_interval # freq save checkpoints >= evaluation_interval
checkpoint_at_end = True                # always save last checkpoint
relative_checkpoint_dir = "my_PPO_logs" # redirect logs instead of ~/ray_results/
random_seed = 415
# Set the log level to DEBUG, INFO, WARN, or ERROR 
log_level = "ERROR"

# Create a new training config
# override certain default algorithm config values
config_train = (
    PPOConfig()
    .framework(framework='torch')
    .environment(env=env_name, disable_env_checking=False)
    .rollouts(num_rollout_workers=num_workers, num_envs_per_worker=num_envs_per_worker)
    .resources(num_gpus=num_gpus)
#     .training(gamma=0.9, lr=0.01, kl_coeff=0.3)
    .evaluation(evaluation_interval=evaluation_interval, 
                evaluation_duration=evaluation_duration)
    .debugging(seed=random_seed, log_level=log_level)
)

print(type(config_train))


<class 'ray.rllib.algorithms.ppo.ppo.PPOConfig'>


#### Step 4. Instantiate a Trainer from the config object

**Two ways to train RLlib models***
<ol>
    <li><a href="https://docs.ray.io/en/master/rllib/package_ref/index.html">RLlib API.</a> The main methods are:</li>
    <ul>
        <li>train()</li>
        <li>evaluate()</li>
        <li>save()</li>
        <li><b>restore()</b></li>
        <li><b>compute_single_action()</b></li>
    </ul>
    <li><a href="https://docs.ray.io/en/master/tune/api_docs/overview.html">Ray Tune API.</a>  The main methods are:</li>
        <ul>
            <li><b>run()</b></li>
    </ul>
    </ol>
    
*RLlib CLI from command line using .yml file also exists, but the .yml file is undocumented: <i>rllib train -f [myfile_name].yml</i><br>

👉 RLlib API <b>.train()</b> will train for 1 iteration only.  Good for debugging since every single output will be shown for the 1 iteration of training.  

👉 However for usual purposes, Ray Tune API <b>.run()</b> is more convenient since with 1 function call you get experiment management: save, checkpoint, evaluate, and train subject to stopping criteria.

✔️ Both methods will run the RLlib [environment pre-check function](https://github.com/ray-project/ray/blob/master/rllib/utils/pre_checks/env.py#L22) you saw earlier in this notebook (cells just after Step 1. Import ray).

👍 You have to use RLlib API method <b>.restore()</b> to reload a checkpointed RLlib model for Serving and Offline learning.  Tune API methods will not work.

👍 After a model is trained, it can be used in inference mode.  The RLlib API method <b>compute_single_action()</b> uses the trained <i>`policy`</i> (RL word for trained model) to calculate the action for a single observation; calling it in a loop over the time steps produces 1 <i>`rollout`</i> (RLlib word for episode).  You will see this method used at the end of this notebook.  

<div class="alert alert-block alert-success">
<b>In summary, if you are going to train a RLlib model, train it using Ray Tune API method .run()!!  <br>
    If you need to restore a RLlib model, use RLlib API method .restore()!!</b>
</div>


💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!

In [15]:
# ###############
# # EXAMPLE USING RLLIB API .train() FOR 1 ITERATION
# # For completeness, here is how to train RLlib using RLlib API's .train() method
# # The code below instantiates a trainer and trains for 1 single iteration. 
# # To train for N iterations, you would put _.train()_ into a loop, 
# # similar to the way we ran the Gym _env.step()_ in a loop.
# ###############

# # To start fresh, restart Ray in case it is already running
# if ray.is_initialized():
#     ray.shutdown()
    
# # Use .build() similar to how gym environments are passed to the gym .make() method.
# rllib_trainer = config_train.build(env=env_name)
# print(type(rllib_trainer))

# # run the trainer for 1 iteration
# rllib_trainer.train()

# # Below, you will see the output evaluation_interval times.


From the above cell, you can see how to train a RLlib algorithm 1 iteration at a time.  But it is more practical to train RLlib algorithms using Ray Tune, since many more options are available.

**Instantiate a trainer using Ray Tune API**

Ray Tune offers experiment management in a single call <b>.run()</b>.  In the code below, we <b>specify a stopping criterion</b> to train until a certain reward is achieved.  In case the desired training reward level is never reached, backup stopping criteria can be given.  Tune will stop training as soon as any one of the stopping criteria is met.  However, best practice when starting out is to have only 1 criterion, so you can be sure what is going to happen.
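
The "stop when any criterion is met" behavior can be sketched in plain Python. The helpers `should_stop` and `first_stopping_iteration` and the sample results are hypothetical; they illustrate the logic of a stop dictionary and are not Tune's implementation.

```python
from typing import Dict, List, Optional


def should_stop(result: Dict[str, float], stop: Dict[str, float]) -> bool:
    """True as soon as any reported metric reaches its threshold."""
    return any(
        result.get(metric, float("-inf")) >= threshold
        for metric, threshold in stop.items()
    )


def first_stopping_iteration(
    results: List[Dict[str, float]], stop: Dict[str, float]
) -> Optional[int]:
    """Return the training iteration at which training would stop, if any."""
    for result in results:
        if should_stop(result, stop):
            return int(result["training_iteration"])
    return None


# Primary criterion plus a backup criterion.
stop = {"episode_reward_mean": 400, "training_iteration": 200}

# Hypothetical per-iteration results.
results = [
    {"training_iteration": 1, "episode_reward_mean": 21.3},
    {"training_iteration": 2, "episode_reward_mean": 180.0},
    {"training_iteration": 3, "episode_reward_mean": 409.14},
    {"training_iteration": 4, "episode_reward_mean": 450.0},
]
stopped_at = first_stopping_iteration(results, stop)
print(f"stopped at iteration: {stopped_at}")
```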

💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!


In [16]:
###############
# EXAMPLE USING RAY TUNE API .run() IN A LOOP UNTIL STOP CONDITION
# Note about Ray Tune verbosity.
# Screen verbosity in Ray Tune is defined as verbose = 0, 1, 2, or 3, where:
# 0 = silent
# 1 = only status updates, no logging messages
# 2 = status and brief trial results, includes logging messages
# 3 = status and detailed trial results, includes logging messages
# Defaults to 3.
###############

# To start fresh, restart Ray in case it is already running
if ray.is_initialized():
    ray.shutdown()

evaluation_interval = 100   # no effect here: config_train and checkpoint_freq were already set in the earlier cell
verbosity = 2 # Tune screen verbosity

trainer = tune.run("PPO", 
                    
    # Stopping criteria whichever occurs first: average reward over training episodes, or ...
    stop={"episode_reward_mean": 400, # stop if achieve 400 out of max 500
          # "training_iteration": 200,  # stop if achieved 200 training iterations
          # "timesteps_total": 100000,  # stop if achieved 100,000 timesteps
          },  
              
    # training config params
    config = config_train.to_dict(),
                    
    #redirect logs instead of default ~/ray_results/
    local_dir = relative_checkpoint_dir, #relative path
         
    # set frequency saving checkpoints >= evaluation_interval
    checkpoint_freq = checkpoint_freq,
    checkpoint_at_end = checkpoint_at_end,
         
    # Reduce logging messages
    verbose = verbosity,
    )

print("Training completed.")


2022-07-10 18:26:49,027	ERROR services.py:1494 -- Failed to start the dashboard: Failed to start the dashboard, return code 0
 The last 10 lines of /tmp/ray/session_2022-07-10_18-26-47_099252_67268/logs/dashboard.log:
  File "/Users/christy/Documents/ray/python/ray/dashboard/head.py", line 105, in _configure_http_server
    http_server = HttpServerDashboardHead(
  File "/Users/christy/Documents/ray/python/ray/dashboard/http_server_head.py", line 69, in __init__
    raise ex
  File "/Users/christy/Documents/ray/python/ray/dashboard/http_server_head.py", line 60, in __init__
    build_dir = setup_static_dir()
  File "/Users/christy/Documents/ray/python/ray/dashboard/http_server_head.py", line 31, in setup_static_dir
    raise dashboard_utils.FrontendNotFoundError(
ray.dashboard.utils.FrontendNotFoundError: [Errno 2] Dashboard build directory not found. If installing from source, please follow the additional steps required to build the dashboard(cd python/ray/dashboard/client && npm insta

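|Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---------- | ------ | --- | ---- | -------------- | -- | ------ | ------------------ | ------------------ | ---------------- |
|PPO_CartPole-v1_89997_00000 | TERMINATED | 127.0.0.1:67303 | 14 | 60.1151 | 56000 | 409.14 | 500 | 21 | 409.14 |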


[2m[36m(PPO pid=67303)[0m 2022-07-10 18:26:53,481	INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
[2m[36m(PPO pid=67303)[0m 2022-07-10 18:26:53,481	INFO algorithm.py:332 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Trial PPO_CartPole-v1_89997_00000 reported custom_metrics={},episode_media={},info={'learner': {'default_policy': {'learner_stats': {'allreduce_latency': 0.0, 'cur_kl_coeff': 0.20000000000000004, 'cur_lr': 5.0000000000000016e-05, 'total_loss': 8.712568479455927, 'policy_loss': -0.039127241758008795, 'vf_loss': 8.745841918965821, 'vf_explained_var': 0.00914683085615917, 'kl': 0.029268890223067345, 'entropy': 0.6648874541123708, 'entropy_coeff': 0.0}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': 128.0}}, 'num_env_steps_sampled': 4000, 'num_env_steps_trained': 4000, 'num_agent_steps_sampled': 4000, 'num_agent_steps_trained': 4000},sampler_results={'episode_reward_max': 72.0, 'episode_reward_min': 9.0, 'episode_reward_mean': 21.281767955801104, 'episode_len_mean': 21.281767955801104, 'episode_media': {}, 'episodes_this_iter': 181, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [29.0, 11.0, 

2022-07-10 18:27:57,556	INFO tune.py:737 -- Total run time: 67.54 seconds (66.91 seconds for the tuning loop).


Training completed.


<br>

Scroll through the output in the cell above.  Look for a single table row that looks like this: <br>

<img src="images/ppo_cartpole_tune_output.png"></img>

What this is telling you is that Tune ran your experiment.  It was terminated when the reward reached 409.14, which satisfied the stopping criterion of reward >= 400, out of a maximum possible environment reward of 500.

## Evaluate a RLlib model <a class="anchor" id="eval_rllib"></a>

RLlib trainers can be evaluated by*:
<ul>
    <li>Examining <b>Ray Tune</b> trainer object </li>
    <li>Examining <b>Ray Tune</b> experiment trial results </li>
    <li>Visualizing training progress in <b>TensorBoard</b></li>
    </ul>

*RLlib trainer objects can also be examined manually, but it gives the same info you already saw in the single-iteration .train() output: <i>rllib_trainer.evaluate()</i>

**Ray Tune trainer object** <br>
First, let's start by looking at the object returned by tune.run() (an ExperimentAnalysis, stored here in the variable trainer). How long did training take?

In [17]:
# Get RLlib default stats
stats = trainer.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

# Typically takes about 67 seconds

  66.01 seconds,    1.10 minutes


**Ray Tune experiment trial results**

Read all the experiment trials into a single pandas dataframe.  The dataframe will have 1 row per trial.

Below, we see a dataframe with only 1 row because Tune only ran 1 trial.  (Because we did not specify a hyperparameter space to search for tuning.)

In [18]:
# Read trainer results in a pandas dataframe
df = trainer.results_df

print(f"df.shape: {df.shape}")  #Only 1 trial
print(df.columns)
df.iloc[:,0:8].head()

df.shape: (1, 422)
Index(['episode_reward_max', 'episode_reward_min', 'episode_reward_mean',
       'episode_len_mean', 'episodes_this_iter', 'num_healthy_workers',
       'num_agent_steps_sampled', 'num_agent_steps_trained',
       'num_env_steps_sampled', 'num_env_steps_trained',
       ...
       'info/learner/default_policy/learner_stats/total_loss',
       'info/learner/default_policy/learner_stats/policy_loss',
       'info/learner/default_policy/learner_stats/vf_loss',
       'info/learner/default_policy/learner_stats/vf_explained_var',
       'info/learner/default_policy/learner_stats/kl',
       'info/learner/default_policy/learner_stats/entropy',
       'info/learner/default_policy/learner_stats/entropy_coeff',
       'config/evaluation_config/tf_session_args/gpu_options/allow_growth',
       'config/evaluation_config/tf_session_args/device_count/CPU',
       'config/evaluation_config/multiagent/policies/default_policy'],
      dtype='object', length=422)


| trial_id | episode_reward_max | episode_reward_min | episode_reward_mean | episode_len_mean | episodes_this_iter | num_healthy_workers | num_agent_steps_sampled | num_agent_steps_trained |
|---|---|---|---|---|---|---|---|---|
| 89997_00000 | 500.0 | 21.0 | 409.14 | 409.14 | 9 | 4 | 56000 | 56000 |
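
The dataframe has so many columns because each training result is a nested dictionary, and every leaf value ends up in its own column named by its slash-separated path (for example `info/learner/default_policy/learner_stats/kl`). The sketch below shows the idea with a hypothetical `flatten` helper and a heavily trimmed result dictionary holding illustrative values; it is not the code Tune itself uses.

```python
def flatten(nested, parent_key="", sep="/"):
    """Flatten a nested dict into one level with `sep`-joined keys."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict) and value:
            # Recurse into non-empty dicts, carrying the path so far.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Heavily trimmed, illustrative result dict (not a complete RLlib result).
result = {
    "episode_reward_mean": 409.14,
    "info": {
        "learner": {
            "default_policy": {
                "learner_stats": {"kl": 0.029, "entropy": 0.665},
            }
        }
    },
}

flat = flatten(result)
for column_name in flat:
    print(column_name)
```

Each printed name corresponds to one dataframe column, so a deeply nested result quickly grows into hundreds of columns.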


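How many episodes were completed in the final training iteration?

In [19]:
# Get the number of episodes completed during the last iteration of the 1st trial.
# Note: episodes_this_iter is a per-iteration count, not the total for the whole run.
df.iloc[0,:].episodes_this_iter  

# Answer is 9 episodes in the final iteration
# (the restore log further below reports 409 episodes in total)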

9

What were the best parameter values?  


In [20]:
# Watch out, the following cell output is long because there are many parameters!

# trainer.get_best_config(metric="episode_reward_mean", mode="max")

#### Visualize the training progress in TensorBoard

RLlib automatically creates logs for your trained RLlib models that can be visualized in TensorBoard.  To visualize the performance of your RL model:

<ol>
    <li>Open a terminal</li>
    <li><i><b>cd</b></i> into the logdir path from the above cell's output.</li>
    <li><i><b>ls</b></i></li>
    <li>You should see files that look like: checkpoint_NNNNNN</li>
    <li>To be able to compare all your experiments, go one directory level up:</li>
    <li><i><b>cd ..</b></i></li>
    <li><i><b>tensorboard --logdir . </b></i></li>
    <li>Look at the url in the message, and open it in a browser</li>
        </ol>
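
If you prefer to build the command from Python instead of navigating by hand, the steps above can be sketched as follows. This is only a convenience sketch: the relative `logdir` below reuses the trial folder name from this run as a placeholder, so substitute the path printed by your own experiment.

```python
from pathlib import Path

# Placeholder trial log directory; replace with the logdir from your own run.
logdir = Path("my_PPO_logs/PPO/PPO_CartPole-v1_89997_00000_0_2022-07-10_18-26-50")

# One level up holds every trial of the experiment, so TensorBoard can compare them.
experiment_dir = logdir.parent

# Assemble the command to paste into a terminal.
command = f"tensorboard --logdir {experiment_dir.as_posix()}"
print(command)
```

Running the printed command in a terminal starts TensorBoard and prints the URL to open in a browser.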

#### Screenshot of TensorBoard

TensorBoard will give you many pages of charts.  Below, we display just the Train/Eval mean and min rewards.

The charts below show "sample efficiency", the number of training steps it took to achieve a certain level of performance.

<b>Train Performance:</b> <br>
<img src="images/ppo_cartpole_training_rewards.png" width="80%" />

<b>Eval Performance:</b> <br>
<img src="images/ppo_cartpole_eval_rewards.png" width="80%" />
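
To make "sample efficiency" concrete, the sketch below reads it off a reward curve as the number of environment steps needed to first reach a target mean reward. The helper `steps_to_reach` is hypothetical, and both reward histories are made-up numbers for two imaginary runs, not results from this notebook.

```python
def steps_to_reach(history, target_reward):
    """Return the first step count whose mean reward meets the target, else None."""
    for steps, mean_reward in history:
        if mean_reward >= target_reward:
            return steps
    return None


# Made-up (env steps sampled, mean episode reward) pairs for two imaginary runs.
run_a = [(4000, 21.3), (20000, 110.0), (40000, 290.5), (56000, 409.1)]
run_b = [(4000, 20.1), (20000, 65.0), (40000, 150.2), (56000, 260.7), (80000, 405.0)]

print(steps_to_reach(run_a, 400))  # 56000
print(steps_to_reach(run_b, 400))  # 80000
```

In this toy comparison, run A is the more sample-efficient one because it reaches the same reward level with fewer environment steps.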

## Play and render the game as a video <a class="anchor" id="reload_rllib"></a>

To do this, we need to reload the desired RLlib model from a checkpoint and then run the model in inference mode on the environment it was trained on.  

Where is the best model checkpoint file?

In [21]:
# Get the log directory of the best trial
logdir = trainer.get_best_logdir(metric="evaluation_reward_mean", mode="max")
print(logdir)

# Get last checkpoint path
checkpoint = trainer.get_last_checkpoint()
print(checkpoint)

/Users/christy/Documents/github_ray_summit_2022/ray-summit-2022-training/ray-rllib/my_PPO_logs/PPO/PPO_CartPole-v1_89997_00000_0_2022-07-10_18-26-50
/Users/christy/Documents/github_ray_summit_2022/ray-summit-2022-training/ray-rllib/my_PPO_logs/PPO/PPO_CartPole-v1_89997_00000_0_2022-07-10_18-26-50/checkpoint_000014/checkpoint-14


<br>

<b>Restore the desired, already-trained RLlib model from checkpoint file.</b>  

You will need:
<ul>
    <li>Your <b>algorithm's config class</b> and exact same <a href="#intro_rllib_api">config settings you used to train your model.</a></li>
    <li>Name of the <b>environment</b> you used to train the model.</li>
    <li>Path to the desired <b>checkpoint</b> file you want to use to restore the model.</li>
    </ul>

See the example below.

In [22]:
# Create new Agent and restore its state from the last checkpoint.

# create an empty agent
agent = config_train.build(env=env_name)

# restore the agent from the checkpoint
agent.restore(checkpoint)

2022-07-10 18:27:57,896	INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
2022-07-10 18:27:57,897	INFO algorithm.py:332 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2022-07-10 18:28:01,293	INFO trainable.py:590 -- Restored on 127.0.0.1 from checkpoint: /Users/christy/Documents/github_ray_summit_2022/ray-summit-2022-training/ray-rllib/my_PPO_logs/PPO/PPO_CartPole-v1_89997_00000_0_2022-07-10_18-26-50/checkpoint_000014/checkpoint-14
2022-07-10 18:28:01,298	INFO trainable.py:599 -- Current state after restoring: {'_iteration': 14, '_timesteps_total': None, '_time_total': 60.11512041091919, '_episodes_total': 409}


<br>
<b>Record a video of the trained model doing inference in the environment it was trained on.</b>
<br><br>

Gym includes video recording and saving capability in its wrapper class <i>`gym.wrappers.RecordVideo`</i>, so we will import it and use it to wrap the environment returned by <i>`gym.make()`</i>.

<div class="alert alert-block alert-success">
👍 During inference, call the RLlib API method <b>compute_single_action()</b>, which uses the trained <i>`policy`</i> (RL word for trained model) to calculate one action from the current observation. Call it once per time step for the duration of 1 <i>`rollout`</i> (RLlib word for episode during inference). 
</div>

Execute the cell below, and you will see a video of the rollout, so you can verify visually that the agent achieves a near-perfect score.

Note that the CartPole environment's perfect score is 500.  Since we trained our model to around 400, the restored agent should appear very stable. 
<br><br>

In [23]:
import gym
from gym.wrappers import RecordVideo

#############
## Create the env to do inference on
#############
# Try this first to make sure it is working
# env = gym.make(env_name)
# Once you have confirmed it works, wrap the env.
# RecordVideo() takes the env and a folder name where to store the video.
# RecordVideo() includes its own render step, records, and saves the video.
env = RecordVideo(gym.make(env_name), "ppo_video")
obs = env.reset()

#############
## Use the restored model and run it in inference mode
## You will see a pop-up video rendering for about 10 seconds
#############
num_episodes_during_inference = 1
num_episodes = 0
episode_reward = 0.0

# The loop condition ends the run, so no explicit `break` is needed
# and raising num_episodes_during_inference works as expected.
while num_episodes < num_episodes_during_inference:
    # Compute an action (`a`) for the current observation.
    a = agent.compute_single_action(observation=obs)
    # Send the computed action `a` to the env.
    obs, reward, done, _ = env.step(a)
    episode_reward += reward

    # Is the episode `done`? -> Count it and reset.
    if done:
        print(f"Episode done: reward = {episode_reward}")
        num_episodes += 1
        episode_reward = 0.0
        obs = env.reset()

env.close()

# The restored agent manages to achieve a perfect score during the 1 episode rollout.




Episode done: reward = 500.0
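
The cell above uses the older Gym API, in which `env.step()` returns four values. Gym 0.26 and later (and Gymnasium) return five values, splitting `done` into `terminated` and `truncated`, and `env.reset()` returns an `(obs, info)` pair. If you are unsure which version is installed, a small adapter such as the hypothetical `unpack_step` below can normalize both shapes. It is a sketch and not part of Gym or RLlib, and the tuples passed to it are made-up stand-ins for real step results.

```python
def unpack_step(step_result):
    """Normalize an env.step() result to (obs, reward, done, info)."""
    if len(step_result) == 5:
        # Newer API: the episode is over if it terminated or was truncated.
        obs, reward, terminated, truncated, info = step_result
        return obs, reward, terminated or truncated, info
    # Older API: already in (obs, reward, done, info) form.
    obs, reward, done, info = step_result
    return obs, reward, done, info


# Made-up step results in the old and the new shape.
old_style = ([0.0, 0.1, 0.0, -0.1], 1.0, False, {})
new_style = ([0.0, 0.1, 0.0, -0.1], 1.0, False, True, {})

print(unpack_step(old_style)[2])  # False
print(unpack_step(new_style)[2])  # True
```

With such an adapter, the inference loop would call `unpack_step(env.step(a))` instead of unpacking four values directly.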


<br>
<b>Play and share your video .mp4 file with others if you want.</b>
<br><br>

Show a video player in the notebook, and play the video you just recorded!

In [24]:
from IPython.display import Video

cart_pole_video='ppo_video/rl-video-episode-0.mp4'
Video(cart_pole_video, width=500)

In [25]:
# Shut down Ray if you are done
if ray.is_initialized():
    ray.shutdown()

### Summary


### Exercises

1. How would you choose another algorithm to train Cart Pole?  Hint:  Look at the [RLlib algorithm doc page](https://docs.ray.io/en/master/rllib/rllib-algorithms.html).  How would you change the choice of RLlib algorithm from <b>PPO to TD3</b>?
