# Notebook 01. Introduction to the OpenAI Gym Environment and RLlib Algorithm top-level APIs

© 2019-2022, Anyscale. All Rights Reserved<br>
📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb) <br>

➡️ [Next notebook](./ex_02_create_multiagent_rllib_env.ipynb) <br>

### Learning objectives
In this tutorial, you will learn:
 * [What is an Environment in Reinforcement Learning (RL)?](#intro_env)
 * [Overview of RL terminology](#intro_rl)
 * [Introduction to OpenAI Gym environments](#intro_gym)
 * [High-level OpenAI Gym API calls](#intro_gym_api)
 * [Overview of RLlib](#intro_rllib)
 * [Train a policy using an algorithm from RLlib](#intro_rllib_api)
 * [Evaluate a RLlib policy](#eval_rllib)
 * [Reload RLlib policy from checkpoint and run inference](#reload_rllib)
  

## What is an environment in RL? <a class="anchor" id="intro_env"></a>

Solving a problem in RL begins with an **environment**. In the simplest definition of RL:

> An **agent** interacts with an **environment** and receives a reward.

An environment in RL is the agent's world; it is a simulation of the problem to be solved. 

<img src="../images/env_key_concept1.png" width="50%" />

The **environment** simulator might be of a:
<ul>
    <li>real, physical machine such as a gas turbine or autonomous vehicle</li>
    <li>real, abstract system such as user behavior on a website or the stock market</li>
    <li>virtual system on a computer such as a board game or a video game</li>
    </ul>
    
The **agent** represents what is triggering the actions.  For example it could be:
<ul>
    <li>a software system that is triggering actions for machines</li>
    <li>a type of user or investor</li>
    <li>a game player or game system that is competing against real players </li>
    </ul> 
<br>    
    
<b>Comparison of RL to supervised learning</b> <br>
<ul>
    <li><u><i>Data</i></u>.  In supervised learning, you start with a labeled dataset.  In contrast, the <b>data in RL is not given up front; the environment acts as a data generator</b>.  One can also do RL on a pre-collected dataset (called offline RL); we will touch on offline RL later. </li> <br>
    <li><u><i>Training</i></u>.  In supervised learning, a ML algorithm is trained on ALL the labeled training data AT ONCE.  <b>RL trains over a sequence of feedback loops.</b>  The RL algorithm optimizes the sum of individual rewards over repeated lifetimes (episodes) of sequential decisions: action -> feedback -> improved action -> repeat. </li><br>
    <li><u><i>Evaluation</i></u>.  In supervised learning, a ML algorithm is evaluated on ALL the hold-out validation data AT ONCE. <b>RL REPEATEDLY evaluates a policy at different time steps</b>, typically whenever you save a checkpoint file. Evaluation at particular points in time in RL is similar in concept to "backtesting" in time series forecasting. RL evaluations are specific to a time step. </li>
    </ul>

<b>Why bother with an Agent, Environment, and RL?</b>  <br>
Supervised learning can be too shortsighted or overlook important, changing user intents or business conditions.  <br>
<div class="alert alert-block alert-success">    
<b> 💡 Reinforcement learning (RL) is a powerful technique when there are sequential decision-making processes, and you want to optimize for long-term possibly delayed rewards.  <br>
    💡 RL can also work when there is no existing model to rely on or you want to improve over an existing decision-making strategy. </b> 
</div> 

<br> 

## Overview of RL terminology <a class="anchor" id="intro_rl"></a>

An RL environment consists of: 

1. all possible actions (**action space**)
2. a complete description of the environment, nothing hidden (**state space**)
3. an observation by the agent of certain parts of the state (**observation space**)
4. **reward**, which is the only feedback the agent receives after each action.

The model that tries to maximize the expected sum over all future rewards is called a **policy**. The policy is a function mapping the environment's observations to an action to take, usually written **π** (s(t)) -> a(t).  <i>In deep reinforcement learning, this function is a neural network</i>.
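
To make the idea of a policy as a function concrete, here is a minimal sketch of a tabular, deterministic policy in plain Python. The observation numbers, the action codes, and the table itself are made up for illustration; in deep RL the lookup table would be replaced by a neural network.

```python
# Hypothetical action codes, for illustration only.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

# A tabular policy: one chosen action per observation.
policy_table = {0: DOWN, 4: DOWN, 8: RIGHT, 9: DOWN, 13: RIGHT, 14: RIGHT}


def pi(observation, default_action=LEFT):
    """Map an observation s(t) to an action a(t)."""
    return policy_table.get(observation, default_action)


for example_obs in (0, 8, 3):
    print(f"pi({example_obs}) -> {pi(example_obs)}")
```

Observations missing from the table fall back to `default_action`, which is a simplification chosen for this sketch.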

<b>Policy vs Model? </b>
In traditional supervised learning, model means a trained algorithm, or a learned function.

> <i>In RL, a model is roughly equivalent to a policy, but policy is more specific</i> because it is trained in a specific environment.  For deployment, we use the word "model" because more people understand the ML meaning of a trained model.

Below is a high-level image of how the Agent and Environment work together to train a Policy in a RL simulation feedback loop in RLlib.

<img src="../images/env_key_concept2.png" width="98%" />

The **RL simulation feedback loop** repeatedly collects data, for one (single-agent case) or multiple (multi-agent case) policies, trains the policies on these collected data, and makes sure the policies' weights are kept in sync. 

During simulation loops, observations, the actions taken, rewards, and so-called **done** flags are collected from the environment.  The done flags indicate the boundaries of the different episodes the agents play through in the simulation.

Each simulation iteration is called a <b>time step</b>.  The sequence of simulation iterations (action -> reward -> next state -> train -> repeat), up to the end state, is called an **episode**, or in RLlib, a **rollout**.  At the end of the episode, when the <i>done</i> flag is True, the environment's .reset() method is called (RLlib does this for you during training), which starts a new episode with the <i>done</i> flag False again.
> 👉 Each episode consists of one or many time steps.

<b>Per episode</b> (that is, until the **done** flag == True), the RL simulation feedback loop repeats up to some specified end state (termination state or timesteps). Examples of termination are:
<ul>
    <li>the end of a maze (termination state)</li>  
    <li>the player died in a game (termination state)</li>
    <li>after 60 videos watched in a recommender system (timesteps).</li>
    </ul>
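
The two kinds of end condition above (reaching a termination state, or running out of time steps) can be sketched as a single check. The function name and the example values are hypothetical and only illustrate the logic.

```python
def episode_is_done(state, timestep, terminal_states, max_episode_steps):
    """Return True when the episode should end."""
    reached_terminal_state = state in terminal_states
    ran_out_of_time = timestep >= max_episode_steps
    return reached_terminal_state or ran_out_of_time


# Hypothetical example: states 5 and 15 end the episode, as does time step 100.
example_terminal_states = {5, 15}
print(episode_is_done(3, 10, example_terminal_states, 100))   # still running
print(episode_is_done(15, 10, example_terminal_states, 100))  # termination state
print(episode_is_done(3, 100, example_terminal_states, 100))  # time limit
```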
    
<b>Why train for many episodes?</b>  When you are doing machine learning, you do not just do something once and report the result.  You do it many times, to make sure you did not just get "lucky" one time.  RL is similar.  By training for many episodes, you collect more data covering more varied situations, which makes the measured reward more representative.  

<div class="alert alert-block alert-success">
<b>💡 In RL, the policy is trained by repeating trials, or episodes (or rollouts), then reporting the calculated reward, typically as the mean of the total rewards achieved per episode.  The cumulative (optionally discounted) sum of the rewards collected within one episode is called the Return.</b> 
</div>
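
As a rough sketch, an episode's return and the mean episode reward could be computed from recorded rewards as follows. The reward values and the discount factor `gamma` are made up for illustration.

```python
def episode_return(rewards, gamma=1.0):
    """Sum the rewards of one episode, discounting by gamma per time step."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Three hypothetical episodes; each inner list holds one reward per time step.
example_episodes = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], [0.0, 1.0]]

example_returns = [episode_return(rewards) for rewards in example_episodes]
mean_episode_reward = sum(example_returns) / len(example_returns)

print(f"returns per episode: {example_returns}")
print(f"mean episode reward: {mean_episode_reward:.2f}")
print(f"discounted return of episode 0: {episode_return(example_episodes[0], gamma=0.9):.2f}")
```

With `gamma=1.0` the return is the plain sum of rewards; a value below 1.0 weights later rewards less.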
    
<br>

## Introduction to OpenAI Gym example: frozen lake <a class="anchor" id="intro_gym"></a>

[OpenAI Gym](https://gym.openai.com/) is a well-known reference library of RL environments. 

#### 1. import gym

Below is how you would import gym and view all available environments.

In [1]:
# import libraries
import gym
print(f"gym: {gym.__version__}")

# List all available gym environments
all_env = list(gym.envs.registry.all())
print(f'Num Gym Environments: {len(all_env)}')

# You could loop through and list all environments if you wanted
# [print(e) for e in all_env]
envs_starting_with_f = [e for e in all_env if str(e).startswith("EnvSpec(Frozen")]
envs_starting_with_f

gym: 0.21.0
Num Gym Environments: 103


[EnvSpec(FrozenLake-v1), EnvSpec(FrozenLake8x8-v1)]

#### 2. Instantiate your Gym object

The way you instantiate a Gym environment is with the **make()** function.

The .make() function takes arguments:
- **name of the Gym environment**, type: str, Required.
- **runtime parameter values**, Optional.

For the required string argument, you need to know the Gym name.  You can find the Gym name in the Gym documentation for environments, either:
<ol>
    <li>The doc page in <a href="https://www.gymlibrary.ml/environments/toy_text/frozen_lake/">Gym's website</a></li>
    <li>The environment's <a href="https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py">source code </a></li>
    <li>
        <a href="https://www.gymlibrary.ml/environments/classic_control/cart_pole/#description">Research paper (if one exists)</a> referenced in the environment page </li>
    </ol>
    
Below is an example of how to create a basic Gym environment, [frozen lake](https://www.gymlibrary.ml/environments/toy_text/frozen_lake/).  We can see below that the termination condition of an episode will be <b>TimeLimit</b> (the environment automatically ends an episode and sets done=True after this many timesteps).


In [2]:
env_name = "FrozenLake-v1"

# Instantiate gym env object with a runtime parameter value (is_slippery).
# is_slippery=True specifies the environment is stochastic
# is_slippery=False is the same as "deterministic=True"
env = gym.make(
    env_name,
    is_slippery=False,  # False: deterministic moves; True: stochastic (slippery) moves
)

# inspect the gym spec for the environment
print(f"env: {env}")
env_spec = env.spec
print(f"env_spec: {env_spec}")

# Note: "TimeLimit" means termination condition for an episode will be time steps

env: <TimeLimit<FrozenLakeEnv<FrozenLake-v1>>>
env_spec: EnvSpec(FrozenLake-v1)


#### 3. Inspect the environment action and observations spaces

Gym Environments can be deterministic or stochastic.

<ul>
    <li>
        <b>Deterministic</b> if the current state + selected action determines the next state of the environment.  <i>Chess is an example of a deterministic environment</i>, since all possible states/action combinations can be described as a discrete set of rules with states bounded by the pieces and size of the board.</li>
    <li>
        <b>Stochastic</b> if the current state + selected action only determine a probability distribution over possible next states, so repeating the same action in the same state can lead to different outcomes.  <i>Random visitors to a website is an example of a stochastic environment</i>.  A policy can be stochastic too: its output at time step t is then a probability distribution over the set of possible actions, and the agent computes its action in two steps: i) sample an action from the policy according to the probability distribution, ii) compute the log likelihood of the sampled action. </li>
    </ul>
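
As a rough illustration of the difference, the sketch below contrasts a deterministic move with a "slippery" one. It assumes the behavior described in the Gym documentation for frozen lake with is_slippery=True (the agent moves in the intended direction or in one of the two perpendicular directions, each with probability 1/3) and the action codes 0: LEFT, 1: DOWN, 2: RIGHT, 3: UP. The function names are hypothetical and this is not Gym's own code.

```python
import random


def possible_moves(action, is_slippery):
    """Return the moves the environment may actually execute for an action."""
    if not is_slippery:
        return [action]
    # With codes 0..3 ordered LEFT, DOWN, RIGHT, UP, the neighbours
    # (action - 1) % 4 and (action + 1) % 4 are the perpendicular directions.
    return [(action - 1) % 4, action, (action + 1) % 4]


def executed_move(action, is_slippery, rng):
    """Sample the executed move uniformly from the possible moves."""
    return rng.choice(possible_moves(action, is_slippery))


rng = random.Random(0)  # seeded so that repeated runs sample the same moves
RIGHT = 2
print(possible_moves(RIGHT, is_slippery=False))  # only RIGHT
print(possible_moves(RIGHT, is_slippery=True))   # DOWN, RIGHT, or UP
print([executed_move(RIGHT, True, rng) for _ in range(5)])
```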

<b>Gym actions.</b> The action_space describes the numerical structure of the legitimate actions that can be applied to the environment. 

For example, if we have 4 possible discrete actions, we could encode them as:
<ul>
    <li>0: LEFT</li>
    <li>1: DOWN</li>
    <li>2: RIGHT</li>
    <li>3: UP</li>
</ul>

<b>Gym observations.</b>  The observation_space defines the structure as well as the legitimate values for the observation of a state of the environment.  

For example, if we have a 4x4 grid, we could encode them as {0,1,2,3, 4, … ,15} for grid positions ((0,0), (0,1), (0,2), (0,3), …. (3,3)).
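
The numbering above is row-major, so converting between a grid position and its observation number is simple arithmetic. A small sketch (the helper names are hypothetical):

```python
NUM_COLS = 4  # width of the 4x4 grid


def position_to_obs(row, col):
    """Convert a (row, col) grid position to its observation number."""
    return row * NUM_COLS + col


def obs_to_position(obs_number):
    """Convert an observation number back to its (row, col) grid position."""
    return divmod(obs_number, NUM_COLS)


print(position_to_obs(0, 3))
print(position_to_obs(3, 3))
print(obs_to_position(5))
```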

From the Gym [documentation](https://www.gymlibrary.ml/environments/toy_text/frozen_lake/) about the frozen lake environment, we see: <br>

|Frozen Lake      | Gym space   |
|---------------- | ----------- |
|Action Space     | Discrete(4) |
|Observation Space| Discrete(16)|
 
<b><a href="https://github.com/openai/gym/tree/master/gym/spaces">Gym spaces</a></b> are gym data types.  The main types are `Discrete` for discrete numbers and `Box` for continuous numbers.  

Elements of a Gym `Discrete` space are integers (Python type `int`), and elements of a Gym `Box` space are NumPy arrays, with dtype `float32` by default.

Below is an example of how to inspect the environment action and observation spaces.

In [3]:
# check if it is a gym instance
if isinstance(env, gym.Env):
    print("This is a gym environment.")
    print()

    # print gym Spaces
    if isinstance(env.action_space, gym.spaces.Space):
        print(f"gym action space: {env.action_space}")
    if isinstance(env.observation_space, gym.spaces.Space):
        print(f"gym observation space: {env.observation_space}") 
        
# Note: the action space is discrete with 4 possible actions.
# Note: the observation space is 4x4 and thus runs from 0 to 15.
# Note: if we chose 8x8, the observation space would change to Discrete(64).

This is a gym environment.

gym action space: Discrete(4)
gym observation space: Discrete(16)


#### 4. Inspect gym environment default & runtime parameters

Gym environments contain 2 sets of parameters that are set after the environment object is instantiated.
<ul>
    <li><b>Default parameters</b> are fixed in the Gym environment code itself.</li>
    <li><b>Runtime parameters</b> are passed into the make() function as **kwargs.</li>
    </ul>

Below is an example of how to inspect the environment parameters.  Notice we can tell from the parameters that our frozen lake environment is: <br>
1) <i>Deterministic</i>, and <br>
2) Episode terminates with time step condition <i>max_episode_steps</i> = 100.

In [4]:
# inspect env.spec parameters
 
# View default env spec params that are hard-coded in Gym code itself
# Default parameters are fixed
print("Default spec params...")
print(f"id: {env_spec.id}")
# rewards above this value considered "success"
print(f"reward_threshold: {env_spec.reward_threshold}")
# env is deterministic or stochastic
print(f"nondeterministic: {env_spec.nondeterministic}")
# number of time steps per episode
print(f"max_episode_steps: {env_spec.max_episode_steps}")
# must reset before step or render
print(f"order_enforce: {env_spec.order_enforce}") 

# View runtime **kwargs .spec params.  These params set after env instantiated.
# print(f"type(env_spec._kwargs): {type(env_spec._kwargs)}") #dict
print()
print("Runtime spec params...")
# Note: gym > v21 use just .kwargs instead of ._kwargs
[print(f"{k}: {v}") for k,v in env_spec._kwargs.items()]
print()

# Note:  We can tell that our frozen lake environment is: 
# 1) Success criteria is rewards >= 0.7
# 2) Deterministic
# 3) Episode terminates when number time_steps = 100


Default spec params...
id: FrozenLake-v1
reward_threshold: 0.7
nondeterministic: False
max_episode_steps: 100
order_enforce: True

Runtime spec params...
map_name: 4x4
is_slippery: False



## High-level OpenAI Gym API calls <a class="anchor" id="intro_gym_api"></a>

The most basic Gym API methods are:
<ul>
    <li><b>env.reset()</b> <br>Reset the environment to an initial state.  Returns the initial observation.  <b>You should call this method every time at the start of a new episode.</b></li>
    <li><b>env.step(action)</b> <br>Take an action from the possible action space values.  It <b><i>takes an action as input</i></b>, computes the state of the environment after applying that action and <b><i>returns the 4-tuple (next-observation, reward, done, info)</i></b>.</li>
    <li><b>env.render()</b>  <br>Visually inspect the environment. This is for human/debugging purposes; it is not seen by the agent/algorithm.  Note you cannot inspect an environment before it has been initialized with env.reset().</li>
    <li><b>env.close()</b> <br>Close an environment.</li>
    </ul>
    
<div class="alert alert-block alert-success">
💡 <b>To play an episode, call reset() first!  <br>
💡 After that, continue to call step() until the environment automatically returns done=True.</b> 
</div>
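
To make the reset/step contract concrete without depending on Gym itself, here is a minimal pure-Python stand-in that mimics the API described above on a deterministic 4x4 frozen lake. `MiniFrozenLake` is a hypothetical class written for this illustration, not part of Gym; the map layout (S start, F frozen, H hole, G goal) and the reward of 1.0 for reaching the goal are assumptions modeled on the frozen lake environment.

```python
class MiniFrozenLake:
    """Hypothetical stand-in following the reset()/step() contract."""

    GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
    # (row change, column change) for 0: LEFT, 1: DOWN, 2: RIGHT, 3: UP
    MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

    def __init__(self, max_episode_steps=100):
        self.max_episode_steps = max_episode_steps
        self.row = self.col = self.steps = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.row = self.col = self.steps = 0
        return 0

    def step(self, action):
        """Apply an action; return (next_observation, reward, done, info)."""
        d_row, d_col = self.MOVES[action]  # unknown actions raise KeyError
        # Moves off the edge of the grid leave the agent where it is.
        self.row = min(max(self.row + d_row, 0), 3)
        self.col = min(max(self.col + d_col, 0), 3)
        self.steps += 1
        tile = self.GRID[self.row][self.col]
        reward = 1.0 if tile == "G" else 0.0
        done = tile in "HG" or self.steps >= self.max_episode_steps
        return self.row * 4 + self.col, reward, done, {}


env_sketch = MiniFrozenLake()
sketch_obs = env_sketch.reset()       # always call reset() first
sketch_done = False
actions = iter([1, 1, 2, 1, 2, 2])    # DOWN, DOWN, RIGHT, DOWN, RIGHT, RIGHT
while not sketch_done:                # keep stepping until done=True
    sketch_obs, sketch_reward, sketch_done, _ = env_sketch.step(next(actions))
    print(f"obs: {sketch_obs}, reward: {sketch_reward}, done: {sketch_done}")
```

The fixed action list walks around the holes to the goal; replacing it with randomly chosen actions turns the loop into a random policy.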

<br>

In [5]:
# Print the starting observation.  
# Recall that observations index the cells of the 4x4 grid (0 to 15).
print(env.reset())
env.render()

0

SFFF
FHFH
FFFH
HFFG


In [6]:
# Take an action
# Recall the possible actions are: 0: LEFT, 1: DOWN, 2: RIGHT, 3: UP

new_obs, reward, done, _ = env.step(2) #Right
print(f"obs: {new_obs}, reward: {reward}, done: {done}")
env.render()
new_obs, reward, done, _ = env.step(1) #Down
print(f"obs: {new_obs}, reward: {reward}, done: {done}")
env.render()

obs: 1, reward: 0.0, done: False
  (Right)
SFFF
FHFH
FFFH
HFFG
obs: 5, reward: 0.0, done: True
  (Down)
SFFF
FHFH
FFFH
HFFG


We can also try to run an action in the frozen lake environment which is outside the defined number range.

In [7]:
# Keep env.step(4) commented out if you want the whole notebook to run without errors

# Try to take an invalid action

#env.step(4) # invalid

# should see KeyError below

To test out your environment, typically you will loop through a few episodes to make sure it works.  

In [8]:
from ipywidgets import Output
from IPython import display
import time

# The following three lines are for rendering purposes only.
# They allow us to render the env frame-by-frame in-place
# (w/o creating a huge output which we would then have to scroll through).
out = Output()
display.display(out)
with out:

    # Putting the Gym simple API methods together.
    # Here is a pattern for running a bunch of episodes.
    num_episodes = 5 # Number of episodes you want to run the agent
    total_reward = 0.0  # Initialize reward to 0

    # Loop through episodes
    for ep in range(num_episodes):

        # Reset the environment at the start of each episode
        obs = env.reset()
        done = False

        # Loop through time steps per episode
        while True:
            # take random action, but you can also do something more intelligent 
            action = env.action_space.sample()

            # apply the action
            new_obs, reward, done, info = env.step(action)
            total_reward += reward

            # If the episode is up, then start another one
            if done:
                break

            # Render the env (in place).
            time.sleep(0.3)
            out.clear_output(wait=True)
            print(f"episode: {ep}")
            print(f"obs: {new_obs}, reward: {total_reward}, done: {done}")
            env.render()

# Close the env
env.close()

Output()

## Overview of RLlib <a class="anchor" id="intro_rllib"></a>

<img width="7%" src="../images/rllib-logo.png"> is the most comprehensive open-source Reinforcement Learning framework. **[RLlib](https://github.com/ray-project/ray/tree/master/rllib)** is <b>distributed by default</b> since it is built on top of **[Ray](https://docs.ray.io/en/latest/)**, an easy-to-use, open-source, distributed computing framework for Python that can handle complex, heterogeneous applications. Ray and RLlib run on compute clusters on any cloud without vendor lock-in.  RLlib Resources:
<ol>
    <li>The doc page on <a href="https://docs.ray.io/en/master/rllib/index.html">ray.io website</a></li>
    <li><a href="https://github.com/ray-project/ray/tree/master/rllib">RLlib source code</a></li>
    </ol>

RLlib includes <b>25+</b> available [algorithms](https://docs.ray.io/en/master/rllib/rllib-algorithms.html), converted to both <img width="3%" src="../images/tensorflow-logo.png">_TensorFlow_ and <img width="3%" src="../images/pytorch-logo.png">_PyTorch_, covering different sub-categories of RL: _model-free_, _offline RL_, _model-based_, and _gradient-free_. Almost any RLlib algorithm can learn in a <b>multi-agent</b> setting. Many algorithms support <b>RNNs</b> and <b>LSTMs</b>.

On a very high level, RLlib is organized by **environments**, **algorithms**, **examples**, **tuned_examples**, and **models**.  

    ray
    |- rllib
    |  |- env 
    |  |- algorithms
    |  |  |- alpha_zero 
    |  |  |- appo 
    |  |  |- ppo 
    |  |  |- ... 
    |  |- examples 
    |  |- tuned_examples
    |  |- models

Within **_env_** you will find [classes](https://docs.ray.io/en/latest/rllib/package_ref/env.html) that allow RLlib to handle e.g. the multi-agent cases (which gym does NOT cover).  RLlib automatically supports any **OpenAI Gym environment** (which covers most use cases). RLlib also handles external environments that have strict performance or hosting requirements. <i>(In the next notebook, we will use the **RLlib MultiAgentEnv** base class to create a **multi agent** environment).</i>

Within **_examples_** you will find some examples of common custom rllib use cases.  

Within **_tuned\_examples_**, you will find, sorted by algorithm, suggested hyperparameter value choices within .yaml files. The Ray RLlib team ran simulations/benchmarks to find these suggested values.  The files are used for daily testing, and weekly hard-task testing, to make sure they all run at speed for both TF and Torch. They give you a leg up with initial parameter choices!

Within **_models_**, you will find building blocks for NNs, default models that RLlib will use (for either <img width="3%" src="../images/tensorflow-logo.png">_TensorFlow_ or <img width="3%" src="../images/pytorch-logo.png">_PyTorch_). For example, here are building blocks for DNN, CNN, RNN, and LSTM. 

In this tutorial, we will mainly focus on the **_algorithms_** package, where we will find RLlib algos to train policies on environments.


## Train a policy using an algorithm from RLlib <a class="anchor" id="intro_rllib_api"></a>

Once you have an environment, next you need to decide which RL algorithm to use.  There are many factors to consider when selecting which algorithm to use on your environment.  Following are some high-level best practices.
<ol>
    <li>
        <b>The first distinction comes from your action space</b>, i.e., do you have discrete (e.g. LEFT, RIGHT, …) or continuous actions (ex: go to a certain speed)? To check high-level if an algorithm will work, look at the <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html">RLlib algorithms doc page</a>.  <i>Algorithms are listed according to whether or not they support Discrete action spaces vs Continuous action spaces or both.</i> 
    </li>
    <li>
        <b>Choose a stable algorithm.</b>  Look at the cumulative rewards per time step; they should rise steadily.  You do not want an algorithm where the reward jumps up and down a lot.
    </li>
    <li><b>Choose the most sample-efficient algorithm that works for your environment</b>.  Look at the cumulative rewards per time step; they should rise quickly. <i>As a general rule, off-policy algorithms such as SAC reuse past experience and tend to need fewer environment samples than on-policy algorithms such as PPO.</i>
    </li>
</ol>

#### Step 1.  Import ray

In [9]:
# import libraries
import time
import numpy as np

import ray
from ray import tune
from ray.tune.logger import pretty_print
print(f"ray: {ray.__version__}")

ray: 3.0.0.dev0


#### Step 2. Check environment for errors   

Before you start training, it is a good idea to check the environment for errors.  RLlib provides a convenient [Environment pre-check function](https://github.com/ray-project/ray/blob/master/rllib/utils/pre_checks/env.py#L22) for this.  It checks that the environment is compatible with OpenAI Gym and RLlib (and outputs a warning if necessary).

Below, we check our Frozen Lake environment for errors.

In [10]:
from ray.rllib.utils.pre_checks.env import check_env

# How to check you do not have any environment errors
print("checking environment ...")
try:
    check_env(env)
    print("All checks passed. No errors found.")
except Exception as e:
    # report why the check failed instead of hiding the error
    print(f"failed: {e}")

checking environment ...
All checks passed. No errors found.


#### Step 3. Calculate an environment baseline

Let's run through the environment, acting randomly, without rendering, and record the mean reward.  The purpose of this is to obtain a baseline before training a RLlib algorithm.

<div class="alert alert-block alert-success">
💡 If you are doing benchmarks, this random policy is often called a <b>"baseline".</b>
</div>

In [11]:
# Putting the Gym simple API methods together.
# Here is a pattern for running a bunch of episodes.
num_episodes = 3000 # Number of episodes you want to run the agent
num_timesteps = 0
# Collect all episode rewards here
episode_rewards = []

# Loop through episodes
for ep in range(num_episodes):

    # Reset the environment at the start of each episode
    obs = env.reset()
    done = False
    episode_reward = 0.0
    
    # Loop through time steps per episode
    while True:
        # take random action, but you can also do something more intelligent 
        action = env.action_space.sample()

        # apply the action
        new_obs, reward, done, info = env.step(action)
        episode_reward += reward

        # Count this time step
        num_timesteps += 1

        # If the episode is up, record its reward and start another one
        if done:
            episode_rewards.append(episode_reward)
            break

# calculate mean_reward
env_mean_random_reward = np.mean(episode_rewards)
env_sd_reward = np.std(episode_rewards)
# calculate number of wins
total_reward = np.sum(episode_rewards)
    
print()
print("**************")
print(f"Baseline Mean Reward={env_mean_random_reward:.2f}+/-{env_sd_reward:.2f}", end="")
print(f" (out of success={env_spec.reward_threshold})")
print(f"Baseline won {total_reward} times over {num_episodes} episodes ({num_timesteps} timesteps)")
print(f"Approx {total_reward/num_episodes:.2f} wins per episode")
print("**************")
        
# Close the env
env.close()


**************
Baseline Mean Reward=0.02+/-0.13 (out of success=0.7)
Baseline won 50.0 times over 3000 episodes (23091 timesteps)
Approx 0.02 wins per episode
**************
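
As a quick sanity check on the numbers above: assuming an episode's total reward in frozen lake is either 0 (fell in a hole or ran out of time) or 1 (reached the goal), the mean reward is simply the win rate p, and the standard deviation should be close to sqrt(p * (1 - p)). A short sketch using only the printed figures:

```python
import math

wins, episodes = 50, 3000      # figures from the baseline output above
p = wins / episodes            # win rate, equal to the mean episode reward
sd = math.sqrt(p * (1 - p))    # standard deviation of a 0/1 outcome
print(f"mean={p:.2f}, sd={sd:.2f}")  # mean=0.02, sd=0.13
```

Both values agree with the reported baseline of 0.02+/-0.13.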


#### Step 4.  Select an algorithm and find that algorithm's config class  

Here is how to find an <b>RLlib algorithm's config class</b>.
<ol>
    <li>Open RLlib docs <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html">and navigate to the Algorithms page.</a></li>
    <li>Scroll down and click url of algo you want to use, e.g. <i><b>PPO</b></i></li>
    <li>On the <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html#ppo">algo docs page </a>, click on the link <i><b>Implementation</b></i>.  This will open the <a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo.py">algo code file on github</a>.</li>
    <li>Search the github code file for the start of the <b>config class definition</b>.</li>
    <li>Typically the docstring example will show: </li>
    <ol>
        <li>Example code implementing RLlib API, and </li>
        <li>Example code implementing Ray Tune API.</li>
    </ol>
    </ol>

In [12]:
# config is an object instead of a dictionary since Ray version >= 1.13
from ray.rllib.algorithms.ppo import PPOConfig

# Default PPO config values
# uncomment below to see the long list of specifically PPO default config values
# print(pretty_print(PPOConfig().to_dict()))

#### Step 5. Choose your config settings and instantiate a config object with those settings

As of Ray 1.13, RLlib configs have been converted from primitive Python dictionaries into objects. This makes them harder to print, but easier to set/pass.

**Note about RLlib config values precedence**
<ol>
    <li><i>Highest precedence</i>: <b>user's algorithm config settings at time of training</b>.  These override all other config settings.</li>
    <li><i>Lower precedence</i>: <b>specific RLlib algorithm (e.g. PPO) config</b>:  
        <ol>
            <li>Open the <a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo.py">algo code file on github</a>.  </li>
            <li>Search the github code file for the start of the <b>config class definition</b>.</li>
            <li>Scroll down to the config class <b>__init__()</b> method.</li>
            <ol>
            <li><i>Algorithm default hyperparameter values are here</i>.</li>
            </ol>
        </ol>
    <li><i>Lowest</i> precedence: Common RLlib <b><a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/algorithm_config.py#L58">generic algorithm config</a></b> settings.</li>
    </ol>
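
The precedence order above behaves like layered dictionaries, where later layers override earlier ones. The sketch below illustrates only that idea; the keys and values are invented for the example, and this is not how RLlib implements its config objects.

```python
# Hypothetical settings, lowest precedence first.
generic_defaults = {"framework": "tf", "lr": 0.001, "num_workers": 2}
algorithm_defaults = {"lr": 5e-05, "clip_param": 0.3}
user_settings = {"framework": "torch", "num_workers": 4}

# Later dictionaries win when the same key appears more than once.
effective_config = {**generic_defaults, **algorithm_defaults, **user_settings}

for key, value in sorted(effective_config.items()):
    print(f"{key}: {value}")
```

Here the user's `framework` and `num_workers` override the generic defaults, while `lr` comes from the algorithm-specific layer because the user did not set it.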

In [13]:
# Common RLlib generic (for all algorithms) config values

from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
config = AlgorithmConfig()

# # uncomment below to see the long list of RLlib general config values
# print(f"RLlib's general default training config values:")
# print(pretty_print(config.to_dict()))

**Note about num_workers**

The number of Ray workers is the number of parallel workers or actors for rollouts.  The actual number of processes will be what you specify + 1 for the head node.

<div class="alert alert-block alert-success">
💡 <b>For num_workers, use ONE LESS than the number of cores you want to use</b> (or omit this argument and let Ray automatically use all cores)!
</div>


Below, num_workers = 4,  <br>
which means the actual number of processors used = 5 (including the head node). <br>
This fits within the 8 CPUs available on my laptop.
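
If you prefer to derive the value rather than hard-code it, the rule of thumb above can be sketched as follows. The helper name is hypothetical; `os.cpu_count()` is from the Python standard library and may return None when the count cannot be determined.

```python
import os


def rollout_workers_for(cores_to_use):
    """Leave one core for the head process; never go below zero."""
    return max(cores_to_use - 1, 0)


available_cores = os.cpu_count() or 1  # fall back to 1 if the count is unknown
print(f"cores available: {available_cores}")
print(f"num_workers for 5 cores: {rollout_workers_for(5)}")
print(f"num_workers using all cores: {rollout_workers_for(available_cores)}")
```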

In [14]:
# Create a PPOConfig object
config = PPOConfig()

# Setup our config object to use our environment
config.environment(env="FrozenLake-v1")

# Decide if you want torch or tensorflow DL framework.  Default is "tf"
config.framework(framework="torch")

# num parallel workers or actors for rollouts (+1 for head node)
# Note: this value is overridden by the second config.rollouts() call at the end of this cell
config.rollouts(num_rollout_workers=1)

# Set the log level to DEBUG, INFO, WARN, or ERROR 
config.debugging(seed=415, log_level="ERROR")

# Setup evaluation
# Explicitly set "explore"=False to override default
config.evaluation(evaluation_interval=10, 
                evaluation_duration=20, 
                evaluation_config = {"explore" : False})

# Setup sampling rollout workers
config.rollouts(num_rollout_workers=4, 
                num_envs_per_worker=1)

<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x288a16a00>

#### Step 6. Instantiate an algorithm from the environment and algorithm config objects

**Two ways to train RLlib policies***
<ol>
    <li><a href="https://docs.ray.io/en/master/rllib/package_ref/index.html">RLlib API.</a> The main methods are:</li>
    <ul>
        <li>train()</li>
        <li>save()</li>
        <li>evaluate()</li>
        <li><b>restore()</b></li>
        <li><b>compute_single_action()</b></li>
    </ul>
    <li><a href="https://docs.ray.io/en/master/tune/api_docs/overview.html">Ray Tune API.</a>  The main methods are:</li>
        <ul>
            <li><b>run()</b></li>
    </ul>
    </ol>
    
*3rd way is RLlib CLI from command line using .yml file, but the .yml file is undocumented: <i>rllib train -f [myfile_name].yml</i><br>

<b>RLlib API train()</b> will train for 1 <i>iteration</i> only.  Good for debugging since every single output will be shown for the single iteration.  

<b>Ray Tune API run()</b> is usually more convenient since with 1 function call you get experiment management: hyperparameter tuning, save checkpoints, evaluate, and training up to a stopping criteria.

✔Both methods will run the RLlib [environment pre-check function](https://github.com/ray-project/ray/blob/master/rllib/utils/pre_checks/env.py#L22) you saw earlier in this notebook (Step 2. Check environment).

<b>RLlib API restore()</b> will reload a checkpointed RLlib model for Serving and Offline learning, even if the model was trained using Tune.  Tune API methods will not work for this.

<b>RLlib API compute_single_action()</b> will use the trained <i>`policy`</i> (RL word for trained model) for inference on an environment.   

<div class="alert alert-block alert-success">
In summary: <br>
    💡 <b>Train</b> a RLlib algorithm with Ray Tune method <b>`.run()`</b>  <br>
    👉  <b>Develop</b> or debug a RLlib algorithm with RLlib method <b>`.train()`</b> <br>
    👉  <b>Restore</b> a RLlib policy with RLlib  method <b>`.restore()`</b> <br>
    👉  <b>Run inference</b> on an environment using a trained policy with RLlib method <b>`.compute_single_action()`</b>
</div>

💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!

In [None]:
# SINGLE .TRAIN() OUTPUT

# instantiate an algo instance
ppo_algo = config.build()
print(f"Algorithm type: {type(ppo_algo)}")

# Perform a single `.train()` iteration call
# Result is a Python dict object
result = ppo_algo.train()

# Erase config dict from result (for better overview).
del result["config"]
# Print out training iteration results.
print(pretty_print(result))

In [16]:
###############
# EXAMPLE USING RLLIB API .train() IN A LOOP
# To train for N number of episodes, you put .train() into a loop, 
# similar to the way we ran the Gym env.step() in a loop.
###############
# start fresh in case ray already running
# if ray.is_initialized():
#     ray.shutdown()

start_time = time.time()

# Use the config object's `build()` method for generating
# an RLlib Algorithm instance that we can then train.
ppo_algo = config.build()
print(f"Algorithm type: {type(ppo_algo)}")

# train the Algorithm instance for 30 iterations
num_iterations = 30
rewards = []
checkpoint_dir = "results/PPO/"

for i in range(num_iterations):
    # Call its `train()` method
    result = ppo_algo.train()
    
    # Extract reward from results.
    rewards.append(result["episode_reward_mean"])
    
    # print something every 10 iterations
    if ((i % 10 == 0) or (i == num_iterations-1)):
        print(f"Iteration={i}, Mean Reward={result['episode_reward_mean']:.2f}",end="")
        try:
            print(f"+/-{np.std(rewards):.2f}")
        except:
            print()
        # save checkpoint file
        checkpoint_file = ppo_algo.save(checkpoint_dir)
        print(f"Checkpoints saved at {checkpoint_file}")
        # evaluate the policy
        eval_result = ppo_algo.evaluate()

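# NOTE: result["hist_stats"] only lists recent episodes, not the whole run,
# so the totals below are rough extrapolations.
# estimate num_episodes: episodes in the last result times num_iterations
num_episodes = len(result["hist_stats"]["episode_lengths"]) * num_iterations
# estimate num_timesteps the same way
num_timesteps = sum(result["hist_stats"]["episode_lengths"]) * num_iterations
# count wins among the episodes in the last result only (not extrapolated)
num_wins = np.sum(result["hist_stats"]["episode_reward"])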

# train time
print(f"PPO won {num_wins} times over {num_episodes} episodes ({num_timesteps} timesteps)") 
print(f"Approx {num_wins/num_episodes:.2f} wins per episode")
print(f"Training took {time.time() - start_time:.2f} seconds")


Algorithm type: <class 'ray.rllib.algorithms.ppo.ppo.PPO'>
Iteration=0, Mean Reward=0.01+/-0.00
Checkpoints saved at results/PPO/checkpoint_000001
Iteration=10, Mean Reward=0.15+/-0.04
Checkpoints saved at results/PPO/checkpoint_000011
Iteration=20, Mean Reward=0.21+/-0.09
Checkpoints saved at results/PPO/checkpoint_000021
Iteration=29, Mean Reward=0.62+/-0.20
Checkpoints saved at results/PPO/checkpoint_000030
PPO won 66.0 times over 3180 episodes (122640 timesteps)
Approx 0.02 wins per episode
Training took 83.55 seconds


<b>Understanding the output of RLlib .train()</b>

⬆️ Notice above that the `train()` method returns a dictionary with information about that training iteration. One iteration consists of many episodes; the exact number depends on config values.
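
As a rough illustration of reading such a result dictionary, the sketch below uses a small hand-written dict. The values are made up; only the key names `episode_reward_mean` and `hist_stats` come from this notebook, and `summarize_window()` is a hypothetical helper. It assumes that `hist_stats` lists per-episode values for recent episodes only, so the counts describe that window and not the whole training run.

```python
# Hypothetical, hand-written stand-in for the dict returned by `.train()`.
# Only the key names mirror the notebook; the numbers are illustrative.
fake_result = {
    "episode_reward_mean": 0.5,
    "hist_stats": {
        "episode_reward": [0.0, 1.0, 1.0, 0.0],
        "episode_lengths": [12, 40, 35, 9],
    },
}


def summarize_window(result):
    """Summarize the recent-episode window stored under `hist_stats`."""
    rewards = result["hist_stats"]["episode_reward"]
    lengths = result["hist_stats"]["episode_lengths"]
    num_episodes = len(rewards)
    return {
        "episodes": num_episodes,
        "timesteps": sum(lengths),
        # In Frozen Lake a win yields reward 1, so the sum counts wins.
        "wins": sum(rewards),
        "win_rate": sum(rewards) / num_episodes if num_episodes else 0.0,
    }


summary = summarize_window(fake_result)
print(summary)
```

Dividing wins by the number of episodes in the same window keeps the win rate comparable to `episode_reward_mean`.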

<b>Compare the PPO Training results to Random Baseline <br></b>
- PPO Mean Reward=~0.62+/-0.20.  This is much higher than baseline!
> Baseline Mean Reward=~0.02+/-0.13 (out of success=0.7) <br>

<div class="alert alert-block alert-success">
    ✔ <b>PPO mean reward is approx 30x higher than the random baseline!</b> <br>
</div>

<br>
What were the learner statistics from the last training iteration?  <br>
<br>

In [17]:
# experiment_results.get_best_config(metric="episode_reward_mean", mode="mean")
result['info']


{'learner': {'default_policy': {'learner_stats': {'allreduce_latency': 0.0,
    'grad_gnorm': 1.3092542168433947,
    'cur_kl_coeff': 9.765625000000002e-05,
    'cur_lr': 5.0000000000000016e-05,
    'total_loss': 0.15082930225825128,
    'policy_loss': 0.0049029359894414105,
    'vf_loss': 0.14592618676283026,
    'vf_explained_var': -0.020948633711825135,
    'kl': 0.0018398162273860898,
    'entropy': 0.1457263630083812,
    'entropy_coeff': 0.0},
   'model': {},
   'custom_metrics': {},
   'num_agent_steps_trained': 128.0}},
 'num_env_steps_sampled': 120000,
 'num_env_steps_trained': 120000,
 'num_agent_steps_sampled': 120000,
 'num_agent_steps_trained': 120000}
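
One way to sanity-check these learner statistics is to see whether the reported losses add up. The sketch below copies several values from the output above and assumes the usual PPO loss composition (policy loss plus value-function loss plus the KL penalty, minus the entropy bonus) with a value-loss coefficient of 1.0. That coefficient is an assumption, not something shown in the output.

```python
# Values copied from the `result['info']` output above.
stats = {
    "total_loss": 0.15082930225825128,
    "policy_loss": 0.0049029359894414105,
    "vf_loss": 0.14592618676283026,
    "kl": 0.0018398162273860898,
    "cur_kl_coeff": 9.765625000000002e-05,
    "entropy": 0.1457263630083812,
    "entropy_coeff": 0.0,
}

VF_LOSS_COEFF = 1.0  # assumption: not reported in the output above

# Rebuild the total loss from its parts under the assumed composition.
reconstructed = (
    stats["policy_loss"]
    + VF_LOSS_COEFF * stats["vf_loss"]
    + stats["cur_kl_coeff"] * stats["kl"]
    - stats["entropy_coeff"] * stats["entropy"]
)
residual = abs(reconstructed - stats["total_loss"])
print(f"reconstructed={reconstructed:.8f}, residual={residual:.2e}")
```

A small residual suggests the components are consistent with each other. The reported values are averages, so an exact match is not expected.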

In [18]:
# To stop the Algorithm and release its blocked resources, use:
ppo_algo.stop()
print()




⬇️ For completeness, below is an example of how to do the same thing using `Ray Tune`.  We won't go into this right now, because it will be covered very soon in another notebook!

In [19]:
# ##############
# # EXAMPLE USING RAY TUNE API .run() UNTIL STOPPING CONDITION
# # For completeness, here is how to use Ray Tune's .run() method
# ##############

# # To start fresh, restart Ray in case it is already running
# if ray.is_initialized():
#     ray.shutdown()

# experiment_results = tune.run("PPO", 
                    
#     # Stopping criteria whichever occurs first: average reward over training episodes, or ...
#     stop={
#           # "episode_reward_mean": 0.2, # stop if achieve 0.2 out of max 0.7
#           "training_iteration": 22,  # stop if achieved 22 iterations
#           # "timesteps_total": 3000,  # stop if achieved 3000 timesteps
#           },  
              
#     # training config params
#     config = config.to_dict(),
                    
#     #redirect logs to relative path instead of default ~/ray_results/
#     local_dir = "my_Tune_PPO_logs",
         
#     # set frequency saving checkpoints >= evaluation_interval
#     checkpoint_freq = 7,
#     checkpoint_at_end=False,
         
#     # Reduce logging messages
#     ###############
#     # Note about Ray Tune verbosity.
#     # Screen verbosity in Ray Tune is defined as verbose = 0, 1, 2, or 3, where:
#     # 0 = silent
#     # 1 = only status updates, no logging messages
#     # 2 = status and brief trial results, includes logging messages
#     # 3 = status and detailed trial results, includes logging messages
#     # Defaults to 3.
#     ###############                          
#     verbose = 2,
#     )

# print("Training completed.")
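
The stopping dictionary above is meant to end a trial as soon as any one of its criteria is met. The sketch below mimics that "whichever occurs first" logic in plain Python. `should_stop()` is a hypothetical helper written for illustration and is not part of the Ray Tune API, and the result dicts are made up.

```python
def should_stop(result, stop):
    """Return True once any metric in `stop` reaches its threshold."""
    return any(
        key in result and result[key] >= threshold
        for key, threshold in stop.items()
    )


stop = {"episode_reward_mean": 0.2, "training_iteration": 22}

# Hypothetical per-iteration results, for illustration only.
early = {"episode_reward_mean": 0.05, "training_iteration": 3}
good_reward = {"episode_reward_mean": 0.25, "training_iteration": 9}
out_of_budget = {"episode_reward_mean": 0.10, "training_iteration": 22}

print(should_stop(early, stop))          # neither threshold reached
print(should_stop(good_reward, stop))    # reward threshold reached first
print(should_stop(out_of_budget, stop))  # iteration budget reached first
```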


## Evaluate a RLlib Policy <a class="anchor" id="eval_rllib"></a>

Traditional supervised ML splits the data into train/valid/test sets and evaluates on the validation set AFTER the model has been trained.  RL, on the other hand, typically runs evaluation every time a checkpoint is saved.  

RLlib policies can be evaluated by:
<ul>
    <li>Calling RLlib Algorithm API <b>.evaluate()</b> typically every time <b>.save()</b> is called.</li>
    <li>Visualizing training progress in <b>TensorBoard</b></li>
    <li>Examining Ray Tune experiment results - will be covered very soon in another notebook!  </li>
    </ul>



#### Visualize the training progress in TensorBoard

<b>Ray Tune</b> automatically creates logs for your trained RLlib models that can be visualized in TensorBoard.  Ray Tune logs are stored in the specified redirect `local_dir`; or if none specified then the logs are stored in `~/ray_results/`.

<b>RLlib Algorithm .train() requires an explicit .save() step</b> in order to create logs.  The default format for .save() is Ray Tune .json logs, which are compatible with TensorBoard.  Unlike with Ray Tune, logs created through .save() can only be stored in `~/ray_results/`; you cannot change the location of the TensorBoard logs.

To visualize the performance of your RL policy:

<ol>
    <li>Open a terminal</li>
    <li><i><b>cd</b></i> into the correct log directory.</li>
    <li><i><b>ls</b></i></li>
    <li>You should see files such as: <i>result.json, params.json, ... </i></li>
    <li>To be able to compare all your experiments, cd one dir level up.</li>
    <li><i><b>cd ..</b></i></li>
    <li><i><b>tensorboard --logdir . </b></i></li>
    <li>Look at the url in the message, and open it in a browser</li>
        </ol>
        
Note on Step 7 above: if you are running RLlib on a cluster, use <a href="https://blog.tensorflow.org/2019/12/introducing-tensorboarddev-new-way-to.html">tensorboard.dev</a> instead.  Navigate to the directory on the head node where the `ray_results/` directory is located.  From there, run 
`tensorboard dev upload --logdir .`
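
If TensorBoard is not available, the same logs can be read directly. The sketch below assumes `result.json` holds one JSON object per line (one per training iteration) and that each object carries the `training_iteration` and `episode_reward_mean` keys. It writes a tiny made-up file to a temporary directory so it can run on its own; `load_reward_curve()` is a hypothetical helper name.

```python
import json
import tempfile
from pathlib import Path


def load_reward_curve(path):
    """Return (iteration, mean reward) pairs from a line-delimited JSON log."""
    curve = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = json.loads(line)
            curve.append((row["training_iteration"], row["episode_reward_mean"]))
    return curve


# Made-up log contents, for illustration only.
rows = [
    {"training_iteration": 1, "episode_reward_mean": 0.01},
    {"training_iteration": 2, "episode_reward_mean": 0.04},
]
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "result.json"
    log_path.write_text("\n".join(json.dumps(r) for r in rows) + "\n")
    curve = load_reward_curve(log_path)

print(curve)
```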

#### Screenshot of TensorBoard

TensorBoard will give you many pages of charts.  Shown below are just the Train/Eval max and mean rewards.

The charts below show "sample efficiency": the number of training steps it took to achieve a certain level of performance.

<b>Train Performance:</b> <br>

---
<img src="../images/frozen_lake_training_rewards.png" width="80%" />

<b>Eval Performance:</b> <br>
<img src="../images/frozen_lake_eval_rewards.png" width="80%" />

## Reload RLlib policy from checkpoint and run inference <a class="anchor" id="reload_rllib"></a>

We want to reload the desired RLlib model from a checkpoint file and then run the policy in inference mode on the environment it was trained on.  

You will need:
<ul>
    <li>Your <b>algorithm's config class</b></li>
    <li>Name of the <b>environment</b> you used to train the policy.</li>
    <li>Path to the desired <b>checkpoint</b> file you want to use to restore the policy.</li>
    </ul>

#### Step 1. Find the best model checkpoint file

In [20]:
# EXAMPLE GETTING CHECKPOINT FROM RLLIB TRAIN

# Enter the last checkpoint manually
checkpoint = "results/PPO/checkpoint_000030/checkpoint-30"
print(f"\n{checkpoint}")


results/PPO/checkpoint_000030/checkpoint-30
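
Rather than typing the path by hand, the newest checkpoint directory can be looked up by name. The sketch below assumes the `checkpoint_000030` style of directory naming seen in the training output, and builds empty folders in a temporary directory so it runs stand-alone. `latest_checkpoint_dir()` is a hypothetical helper, not an RLlib function.

```python
import tempfile
from pathlib import Path


def latest_checkpoint_dir(root):
    """Return the `checkpoint_<number>` directory with the highest number."""
    candidates = [
        p for p in Path(root).glob("checkpoint_*")
        if p.is_dir() and p.name.split("_")[-1].isdigit()
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: int(p.name.split("_")[-1]))


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the directory names printed during training (empty folders).
    for n in (1, 11, 21, 30):
        (Path(tmp) / f"checkpoint_{n:06d}").mkdir()
    newest = latest_checkpoint_dir(tmp)
    newest_name = newest.name

print(newest_name)
```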


In [21]:
# # EXAMPLE GETTING CHECKPOINT FROM RAY TUNE

# # Get best checkpoint path
# checkpoint_path = experiment_results.get_best_logdir(metric="evaluation_reward_mean", mode="max")
# # checkpoint_path = "my_Tune_PPO_logs/PPO/PPO_CartPole-v1_a3973_00000_0_2022-08-10_19-09-32/checkpoint_000020/checkpoint-18"
# print(checkpoint_path)

# # Get last checkpoint
# checkpoint = experiment_results.get_last_checkpoint()
# print(f"\n{checkpoint}")


#### Step 2. Re-initialize an already-trained algorithm object from the checkpoint file


In [22]:
# Create new Algorithm and restore its state from the last checkpoint.

# create an empty Algorithm
algo = config.build()

# restore the agent from the checkpoint
algo.restore(checkpoint)

2022-08-15 15:32:17,189	INFO trainable.py:668 -- Restored on 127.0.0.1 from checkpoint: results/PPO/checkpoint_000030
2022-08-15 15:32:17,191	INFO trainable.py:677 -- Current state after restoring: {'_iteration': 30, '_timesteps_total': None, '_time_total': 79.15324401855469, '_episodes_total': 8264}


#### Step 3. Play and render the game

Now we want to run the trained policy in inference mode in the environment it was trained on.

<div class="alert alert-block alert-success">
✔ During inference, call the RLlib API method <b>compute_single_action()</b>: <br>

👍 Uses the trained <i>policy</i> (RL word for trained model) to compute one action for the current observation. Call it once per time step to play a full <i>rollout</i> (RLlib word for episode during inference). 
</div>
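
The calling pattern is one action per observation, repeated until the episode ends. The sketch below shows that loop with a toy stand-in environment and a hand-written policy function, so it runs without Gym or RLlib. `CorridorEnv` and `always_right()` are hypothetical and only mirror the `reset()`/`step()` shape used in this notebook.

```python
class CorridorEnv:
    """Toy env: start at cell 0, reach cell 3 to win (reward 1)."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 1 moves right; anything else moves left (never below 0).
        self.pos = self.pos + 1 if action == 1 else max(0, self.pos - 1)
        done = self.pos == 3
        reward = 1.0 if done else 0.0
        return self.pos, reward, done, {}


def always_right(observation):
    """Stand-in for `algo.compute_single_action(observation=obs)`."""
    return 1


env = CorridorEnv()
obs = env.reset()
total_reward, num_steps, done = 0.0, 0, False
while not done:
    action = always_right(obs)  # one action per observation
    obs, reward, done, _ = env.step(action)
    total_reward += reward
    num_steps += 1

print(total_reward, num_steps)
```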

⬇️ Below we play the game 100 times using the already-trained PPO policy.
<br>

In [23]:
#############
## Create the env to do inference on
#############
env = gym.make(env_name)
obs = env.reset()

# Use the restored algorithm from checkpoint and run it in inference mode
episode_reward = 0.0
done = False
num_episodes = 0
num_steps = 0

while num_episodes < 100:
    # Compute an action (`a`).
    a = algo.compute_single_action(observation=obs)
    # Send the computed action `a` to the env.
    obs, reward, done, _ = env.step(a)
    episode_reward += reward
    num_steps += 1
    
    # Is the episode `done`? -> Reset.
    if done:
        obs = env.reset()
        num_episodes += 1

# calculate mean_reward
print()
print("**************")
mean_reward = episode_reward / num_episodes
print(f"PPO mean_reward: {mean_reward:.2f} out of success: {env_spec.reward_threshold} after {num_episodes} episodes or {num_steps} time steps")
print(f"PPO won {episode_reward} times over {num_episodes} plays")
print("**************")
        
# Close the env
env.close()


**************
PPO mean_reward: 0.54 out of success: 0.7 after 100 episodes or 3284 time steps
PPO won 54.0 times over 100 plays
**************


<b>How does our trained policy compare to the Random baseline at inference time? <br></b>
- PPO wins ~54 times over 100 plays.  This is much higher than baseline!
> Baseline won ~53.0 times over 3000 plays (episodes), or about 2 wins per 100 plays <br>
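
To put the two counts on the same scale, convert each to a win rate. The numbers below are the ones quoted above (54 wins in 100 plays for PPO, 53 wins in 3000 plays for the random baseline).

```python
ppo_wins, ppo_plays = 54, 100
baseline_wins, baseline_plays = 53, 3000

# Wins per play for each policy.
ppo_rate = ppo_wins / ppo_plays
baseline_rate = baseline_wins / baseline_plays

# How many times more often PPO wins than the random baseline.
ratio = ppo_rate / baseline_rate

print(f"PPO win rate:      {ppo_rate:.3f}")
print(f"Baseline win rate: {baseline_rate:.3f}")
print(f"PPO wins about {ratio:.0f}x as often as the random baseline")
```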


⬇️ Below we render the game using the PPO policy, so we can visually inspect the environment.

In [24]:
# The following three lines are for rendering purposes only.
# They allow us to render the env frame-by-frame in-place
# (w/o creating a huge output which we would then have to scroll through).
out = Output()
display.display(out)
with out:

    #############
    ## Create the env to do inference on
    #############
    env = gym.make(env_name)
    obs = env.reset()

    #############
    ## Use the restored policy and run it in inference mode
    ## Run compute_single_action() in inference episodes loop
    ## You will see an ASCII rendering in-place for about 10 seconds
    #############
    episode_reward = 0.0
    done = False
    num_episodes = 0

    while num_episodes < 5:
        # Compute an action (`a`).
        a = algo.compute_single_action(observation=obs)
        # Send the computed action `a` to the env.
        obs, reward, done, _ = env.step(a)
        episode_reward += reward

        # Is the episode `done`? -> Reset.
        if done:
            obs = env.reset()
            num_episodes += 1

        # Render the env (in place).
        time.sleep(0.3)
        out.clear_output(wait=True)
        print(f"episode: {num_episodes}")
        print(f"obs: {obs}, reward: {episode_reward}, done: {done}")
        env.render()
            
env.close()

Output()

### Summary

In this notebook, we have learned:
* What a gym Environment is, and how the gym.Env API is used to define sequential decision-making problems in Python code
* What RLlib looks like on the surface (where to find its algorithms and top-level APIs)
* How to train an RLlib algorithm using `.train()` and a built-in gym.Env ("frozen lake")
* Where to find checkpoint files, logs, TensorBoard files, etc.
* How to play and render some episodes from a gym.Env using a trained RLlib algorithm.

### Exercise 1

#### How would you choose another algorithm to train Frozen Lake?

Hint:  Look at the [RLlib algorithm doc page](https://docs.ray.io/en/master/rllib/rllib-algorithms.html).
How would you change the choice of RLlib algorithm from <b>PPO to DQN</b>?

In [25]:
# config is an object instead of a dictionary since Ray version >= 1.13
from ray.rllib.algorithms.dqn import DQNConfig

# Default DQN config values
# uncomment below to see the long list of specifically DQN default config values
# print(pretty_print(DQNConfig().to_dict()))

In [26]:
# Create a DQNConfig object
dqn_config = DQNConfig()

# Setup our config object to use our environment
dqn_config.environment(env="FrozenLake-v1")

# Decide if you want torch or tensorflow DL framework.  Default is "tf"
dqn_config.framework(framework="torch")

# +1 for head node, num parallel workers or actors for rollouts
dqn_config.rollouts(num_rollout_workers=1)

# Set the log level to DEBUG, INFO, WARN, or ERROR 
dqn_config.debugging(seed=415, log_level="ERROR")

# Setup evaluation
# Explicitly set "explore"=False to override default
dqn_config.evaluation(evaluation_interval=10, 
                evaluation_duration=20, 
                evaluation_config = {"explore" : False})

# Setup sampling rollout workers
dqn_config.rollouts(num_rollout_workers=1, 
                num_envs_per_worker=1)


<ray.rllib.algorithms.dqn.dqn.DQNConfig at 0x2911826a0>

In [27]:
###############
# EXAMPLE USING RLLIB API .train() IN A LOOP
# To train for N number of episodes, you put .train() into a loop, 
# similar to the way we ran the Gym env.step() in a loop.
###############

# if ray.is_initialized():
#     ray.shutdown()

start_time = time.time()

# Use the config object's `build()` method for generating
# an RLlib Algorithm instance that we can then train.
dqn_algo = dqn_config.build()
print(f"Algorithm type: {type(dqn_algo)}")

# train the Algorithm instance for 30 iterations
num_iterations = 30
rewards = []
checkpoint_dir = "results/DQN/"

for i in range(num_iterations):
    # Call its `train()` method
    result = dqn_algo.train()
    
    # Extract reward from results.
    rewards.append(result["episode_reward_mean"])
    
    # print something every 10 iterations
    if ((i % 10 == 0) or (i == num_iterations-1)):
        print(f"Iteration={i}, Mean Reward={result['episode_reward_mean']:.2f}",end="")
        try:
            print(f"+/-{np.std(rewards):.2f}")
        except:
            print()
        # save checkpoint file
        checkpoint_file = dqn_algo.save(checkpoint_dir)
        print(f"Checkpoints saved at {checkpoint_file}")
        # evaluate the policy
        eval_result = dqn_algo.evaluate()

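# NOTE: result["hist_stats"] only lists recent episodes, not the whole run,
# so the totals below are rough extrapolations.
# estimate num_episodes: episodes in the last result times num_iterations
num_episodes = len(result["hist_stats"]["episode_lengths"]) * num_iterations
# estimate num_timesteps the same way
num_timesteps = sum(result["hist_stats"]["episode_lengths"]) * num_iterations
# count wins among the episodes in the last result only (not extrapolated)
num_wins = np.sum(result["hist_stats"]["episode_reward"])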

# train time
print(f"DQN won {num_wins} times over {num_episodes} episodes ({num_timesteps} timesteps)")
print(f"Approx {num_wins/num_episodes:.2f} wins per episode")
print(f"Training took {time.time() - start_time:.2f} seconds")

# # To stop the Algorithm and release its blocked resources, use:
# dqn_algo.stop()
# print()

Algorithm type: <class 'ray.rllib.algorithms.dqn.dqn.DQN'>
Iteration=0, Mean Reward=0.00+/-0.00
Checkpoints saved at results/DQN/checkpoint_000001
Iteration=10, Mean Reward=0.16+/-0.05
Checkpoints saved at results/DQN/checkpoint_000011
Iteration=20, Mean Reward=0.42+/-0.16
Checkpoints saved at results/DQN/checkpoint_000021
Iteration=29, Mean Reward=0.48+/-0.19
Checkpoints saved at results/DQN/checkpoint_000030
DQN won 48.0 times over 3000 episodes (115800 timesteps)
Approx 0.02 wins per episode
Training took 99.39 seconds


Compare the DQN Training results to Random Baseline.   
- DQN Mean Reward=~0.48+/-0.19.  This is much higher than baseline!
> Baseline Mean Reward=~0.02+/-0.13 (out of success=0.7) <br>

<div class="alert alert-block alert-success">
    ✔ <b>DQN mean reward is approx 24x higher than the random baseline!</b> <br>
</div>

<br>
⬇️ Below we play the game 100 times using the trained DQN policy, similar to what we did with the trained PPO policy.
<br>

In [28]:
# Enter the last checkpoint manually
checkpoint = "results/DQN/checkpoint_000030/checkpoint-30"
print(f"\n{checkpoint}")

# create an empty Algorithm
algo = dqn_config.build()

# restore the agent from the checkpoint
algo.restore(checkpoint)

#############
## Create the env to do inference on
#############
env = gym.make(env_name)
obs = env.reset()

# Use the restored model and run it in inference mode
episode_reward = 0.0
done = False
num_episodes = 0
num_steps = 0

while num_episodes < 100:
    # Compute an action (`a`).
    a = algo.compute_single_action(observation=obs)
    # Send the computed action `a` to the env.
    obs, reward, done, _ = env.step(a)
    episode_reward += reward
    num_steps += 1
    
    # Is the episode `done`? -> Reset.
    if done:
        obs = env.reset()
        num_episodes += 1

# calculate mean_reward
print()
print("**************")
mean_reward = episode_reward / num_episodes
print(f"DQN mean_reward: {mean_reward:.2f} out of success: {env_spec.reward_threshold} after {num_episodes} episodes or {num_steps} time steps")
print(f"DQN won {episode_reward} times over {num_episodes} plays")
print("**************")
        
# Close the env
env.close()


results/DQN/checkpoint_000030/checkpoint-30


2022-08-15 15:34:28,036	INFO trainable.py:668 -- Restored on 127.0.0.1 from checkpoint: results/DQN/checkpoint_000030
2022-08-15 15:34:28,039	INFO trainable.py:677 -- Current state after restoring: {'_iteration': 30, '_timesteps_total': None, '_time_total': 95.39320707321167, '_episodes_total': 1889}



**************
DQN mean_reward: 0.48 out of success: 0.7 after 100 episodes or 3849 time steps
DQN won 48.0 times over 100 plays
**************


<b>How does our trained policy compare to the Random baseline at inference time? <br></b>
- DQN wins ~48 times over 100 plays.  This is much higher than baseline!
> Baseline won ~53.0 times over 3000 plays (episodes), or about 2 wins per 100 plays <br>

<br>
⬇️ Below we render the game using the DQN policy, so we can visually inspect the environment.
<br>
<br>

In [29]:
# The following three lines are for rendering purposes only.
# They allow us to render the env frame-by-frame in-place
# (w/o creating a huge output which we would then have to scroll through).
out = Output()
display.display(out)
with out:

    #############
    ## Create the env to do inference on
    #############
    env = gym.make(env_name)
    obs = env.reset()

    #############
    ## Use the restored model and run it in inference mode
    ## Run compute_single_action() in inference episodes loop
    ## You will see an ASCII rendering in-place for about 10 seconds
    #############
    episode_reward = 0.0
    done = False
    num_episodes = 0

    while num_episodes < 5:
        # Compute an action (`a`).
        a = algo.compute_single_action(observation=obs)
        # Send the computed action `a` to the env.
        obs, reward, done, _ = env.step(a)
        episode_reward += reward

        # Is the episode `done`? -> Reset.
        if done:
            obs = env.reset()
            num_episodes += 1

        # Render the env (in place).
        time.sleep(0.3)
        out.clear_output(wait=True)
        print(f"episode: {num_episodes}")
        print(f"obs: {obs}, reward: {episode_reward}, done: {done}")
        env.render()
            
env.close()

Output()

### References

1. [OpenAI Gym Environments](https://www.gymlibrary.ml/)
2. [Ray doc page](https://docs.ray.io/en/latest/)
3. [RLlib GitHub](https://github.com/ray-project/ray/tree/master/rllib)
4. [RLlib Algorithms doc page](https://docs.ray.io/en/master/rllib/rllib-algorithms.html)

📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb)<br>

➡ [Next notebook](./ex_02_create_multiagent_rllib_env.ipynb) <br>

In [30]:
# Shut down Ray if you are done
import ray
if ray.is_initialized():
    ray.shutdown()