© 2019-2022, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademyLogo.png)

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/christy/AnyscaleDemos/blob/main/rllib_demos/recsys_conference/optional_01_intro_gym_and_rllib.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/christy/AnyscaleDemos/blob/main/rllib_demos/recsys_conference/optional_01_intro_gym_and_rllib.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

#### Google Colab 

1. Look at top of notebook and click "Copy to Drive"
2. Run cell below for required pip installs
3. Adjust all the RLlib config() statements so total number of workers < 2 (Ray Tune) or <= 2 (RLlib algo.train() )


In [1]:
# Run this cell only for Google Colab
# !pip install ray tensorflow_probability tensorboardX gym==0.21 lz4

# Optional - Introduction to the OpenAI Gym Environment and RLlib Algorithm top-level APIs

### Learning objectives
In this tutorial, you will learn:
 * [What is Reinforcement Learning (RL)?](#intro_rl)
 * [Overview of RL terminology](#intro_rl)
 * [Introduction to OpenAI Gym environments](#intro_gym)
 * [High-level OpenAI Gym API calls](#intro_gym_api)
 * [Overview of RLlib](#intro_rllib)
 * [Train a policy using an algorithm from RLlib](#intro_rllib_api)
 * [Evaluate a RLlib policy](#eval_rllib)
 * [Reload RLlib policy from checkpoint and run inference](#reload_rllib)
 
 [Link to slides](https://github.com/anyscale/academy/blob/main/ray-rllib/acm_recsys_tutorial_2022/slides/rllib_acm_recsys_2022_slides.pdf)
 

## What is Reinforcement Learning (RL)? <a class="anchor" id="intro_rl"></a> 

In the simplest definition, RL is a general framework where

> **Agents** learn how to perform actions in an **environment** so as to maximize the cumulative sum of **rewards**.
<br>
<br>

<img src="./images/env_key_concept1.png" width="50%" />

The agent and environment continuously interact in a feedback loop. 
- At each time step, from **State s(t)**, the agent takes an **action a(t)**.
- The environment gives back a Reward **r(t+1)** and the next State **s(t+1)**.
- Behind both the agent and environment is an algorithm.   
<ul>
    <ul>
        <li>The algorithm uses data from past States, Actions, Rewards to train a policy, written <b>π</b>(s(t)).</li>
        <li>The policy gives back the <b>Action a</b>(t).</li>
    </ul>
    </ul>

<br>
<b>The math concept behind RL is a Markov Decision Process</b><br>

The sequence of States, Actions, Rewards <i>[S(0), A(0), R(1), S(1), A(1), ...]</i>, is a sequence of random variables.  
> A sequence of random variables is a <b>stochastic process</b>.  

The environment is fully known if we can predict the probability of a particular State at time t, given all the previous states and actions.  <i>Pr(S<sub>t+1</sub>=s<sub>t+1</sub> | A<sub>t</sub>=a<sub>t</sub>, S<sub>t</sub>=s<sub>t</sub>, A<sub>t-1</sub>=a<sub>t-1</sub>, ...S<sub>0</sub>=s<sub>0</sub>)</i>
> <i>A stochastic process is a <b>Markov Decision Process</b> if the next state depends only on the current state and action, not on the earlier history</i>.  That is, the probability above simplifies to: <br>
> > <i>Pr(S<sub>t+1</sub>=s<sub>t+1</sub> | A<sub>t</sub>=a<sub>t</sub>, S<sub>t</sub>=s<sub>t</sub>)</i>

Markov Decision Processes (MDPs) can be solved computationally!  See the <a href="https://www.anyscale.com/blog/reinforcement-learning-with-deep-q-networks">Anyscale tutorial blog</a> for an explanation of the <i>Bellman Equation</i>, which is a discretization of a Markov Decision Process, shown below.

First, write the expected discounted reward of the MDP in terms of a Q function, where $\gamma$ is a discount factor for future rewards.

> $Q_π(s_t,a_t) = E_π \left[\sum_{j=0}^\tau \gamma^j r_{t+j+1} \mid S_t=s_t, A_t=a_t \right]$
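
As a minimal sketch of the sum inside the expectation above, the snippet below computes the discounted return for one recorded episode. The function name `discounted_return` and the reward values are illustrative assumptions; they are not part of Gym or RLlib.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_j gamma**j * rewards[j] for one list of rewards."""
    total = 0.0
    # Iterate backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: no reward until the goal is reached on the 6th step.
example_rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(discounted_return(example_rewards, gamma=0.9))
```

With a discount factor below 1, a reward that arrives later contributes less to the return, which is why shorter paths to the goal score higher.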

<img src="./images/bellman_equation.png" width="50%" />

> The ML problem behind RL is to minimize the mean squared error (MSE) difference of the 2 sides of the Bellman equation, just like in regression problems! 

Computationally, minimization problems can be solved practically using stochastic gradient descent (SGD) or Weighted Alternating Least Squares (WALS).
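
To make the regression analogy concrete, here is a minimal sketch of the squared Bellman error for a single transition. It assumes the Q-learning form of the Bellman equation, where the target is r + γ · max over a' of Q(s', a'). The Q-table `q`, the transition values, and the helper name are hypothetical.

```python
def squared_bellman_error(q, state, action, reward, next_state, done, gamma=0.99):
    """Squared difference between the two sides of the Q-learning Bellman equation."""
    # Right-hand side: immediate reward plus the discounted best next-state value.
    # A terminal transition has no future value.
    future = 0.0 if done else max(q[next_state])
    target = reward + gamma * future
    # Left-hand side: the current estimate Q(s, a).
    return (target - q[state][action]) ** 2


# Hypothetical Q-table with 2 states (rows) and 2 actions (columns).
q = [[0.0, 0.5],
     [1.0, 0.0]]
error = squared_bellman_error(
    q, state=0, action=1, reward=0.0, next_state=1, done=False, gamma=0.9
)
print(f"squared Bellman error: {error:.4f}")
```

A gradient-based learner averages this error over a batch of transitions and adjusts its parameters to reduce it.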
<br>
<br>

<b>Back to environments and agents...</b> <br>

The **environment** is the agent's world: a simulation of the problem to be solved. The simulation might be of a:
<ul>
    <li>real, physical machine such as a gas turbine or autonomous vehicle</li>
    <li>real, abstract system such as user behavior on a website or the stock market</li>
    <li>virtual system on a computer such as a board game or a video game</li>
    </ul>
    
The **agent** represents what is triggering the actions.  For example it could be:
<ul>
    <li>a software system that is triggering actions for machines</li>
    <li>a type of user or investor</li>
    <li>a game player or game system that is competing against real players </li>
    </ul> 
    
<br>
<b>Comparison of RL to supervised learning</b> <br>
<ul>
    <li><u><i>Data</i></u>.  In supervised learning, you start with a labeled dataset.  In contrast, the <b>data in RL is not given up front; the environment acts as a data generator</b>.  One can also do RL on a pre-collected dataset (called offline RL); we will touch on offline RL later. </li> <br>
    <li><u><i>Training</i></u>.  In supervised learning, a ML algorithm is trained on ALL the labeled training data AT ONCE.  <b>RL trains over a sequence of feedback loops.</b>  The RL algorithm optimizes the sum of individual rewards over repeated lifetimes (episodes) of sequential decisions: action -> feedback -> improved action -> repeat. </li><br>
    <li><u><i>Evaluation</i></u>.  In supervised learning, a ML algorithm is evaluated on ALL the hold-out validation data AT ONCE. <b>RL REPEATEDLY evaluates a policy at different time steps</b>, typically whenever you save a checkpoint file. Evaluation at particular points in time in RL is similar in concept to "backtesting" in time series forecasting. RL evaluations are specific to a time step. </li>
    </ul>

<br>
<b>In conclusion: why bother with an Agent, Environment, and RL?</b>  <br>

Supervised learning can be too shortsighted or overlook important, changing user intents or business conditions.  

<div class="alert alert-block alert-success">    
<b>💡 RL has become the de-facto ML approach for sequential decision-making processes, especially when there are multiple goals and long-term possibly delayed rewards. <br>
    💡 RL can also work when there is no existing model to rely on or you want to improve over an existing decision-making strategy. </b> 
</div> 

<br> 

## Overview of RL terminology <a class="anchor" id="intro_rl"></a>

An RL environment consists of: 

1. all possible actions (**action space**)
2. a complete description of the environment, nothing hidden (**state space**)
3. an observation by the agent of certain parts of the state (**observation space**)
4. **reward**, which is the only feedback the agent receives after each action.

The model that tries to maximize the expected sum over all future rewards is called a **policy**. The policy is a function mapping the environment's observations to an action to take, usually written **π** (s(t)) -> a(t).  <i>In deep reinforcement learning, this function is a neural network</i>.

<b>Policy vs Model? </b>
In traditional supervised learning, model means a trained algorithm, or a learned function.

> <i>In RL, a model is roughly equivalent to a policy, but policy is more specific</i> because it is trained in a specific environment.  For deployment, we use the word "model" because more people understand the ML meaning of a trained model.

Below is a high-level image of how the Agent and Environment work together to train a Policy in an RL simulation feedback loop in RLlib.

<img src="./images/env_key_concept2.png" width="98%" />

The **RL simulation feedback loop** repeatedly collects data, for one (single-agent case) or multiple (multi-agent case) policies, trains the policies on these collected data, and makes sure the policies' weights are kept in sync. 

During simulation loops, observations, taken actions, rewards, and so-called **done** flags are collected from the environment. The done flags mark the boundaries between the episodes the agents play through in the simulation.

Each simulation iteration is called a <b>time step</b>.  The sequence of iterations (action -> reward -> next state -> train -> repeat) up to the end state is called an **episode**, or in RLlib, a **rollout**.  At the end of the episode, when the <i>done</i> flag is True, we call the environment's .reset() method, which starts a new episode with the <i>done</i> flag False again.
> 👉 Each episode consists of one or many time steps.

<b>Per episode</b> (that is, between two **done** == True flags), the RL simulation feedback loop repeats up to some specified end state (termination state or timesteps). Examples of termination are:
<ul>
    <li>the end of a maze (termination state)</li>  
    <li>the player died in a game (termination state)</li>
    <li>after 60 videos watched in a recommender system (timesteps).</li>
    </ul>
    
<b>Why train for many episodes?</b>  When you are doing machine learning, you do not just do something once and report the result.  You do it many times, to make sure you did not just get "lucky" one time.  RL is similar.  By training for many episodes, you collect more data, which covers more of the variation in the environment and gives a more reliable picture of how the policy performs.  
> 👉 Each training iteration consists of one or many episodes.

<div class="alert alert-block alert-success">
<b>💡 In RL, the policy is trained by repeating trials, or episodes (or rollouts), then reporting the calculated reward typically as an average of all achieved rewards per episode.  <br>
   💡 The cumulative sum of the rewards collected over one episode is called the Return.</b> 
</div>
    
<br>

## Introduction to OpenAI Gym example: frozen lake <a class="anchor" id="intro_gym"></a>

[OpenAI Gym](https://gym.openai.com/) is a well-known reference library of RL environments. 

#### 1. import gym

Below is how you would import gym and view all available environments.

In [2]:
# import libraries
import gym
print(f"gym: {gym.__version__}")

# List all available gym environments
all_env  =  list(gym.envs.registry.all())
print(f'Num Gym Environments: {len(all_env)}')

# You could loop through and list all environments if you wanted
# [print(e) for e in all_env]
envs_starting_with_f = [e for e in all_env if str(e).startswith("EnvSpec(Frozen")]
envs_starting_with_f

gym: 0.21.0
Num Gym Environments: 1055


[EnvSpec(FrozenLake-v1), EnvSpec(FrozenLake8x8-v1)]

#### 2. Instantiate your Gym object

The way you instantiate a Gym environment is with the **make()** function.

The .make() function takes arguments:
- **name of the Gym environment**, type: str, Required.
- **runtime parameter values**, Optional.

For the required string argument, you need to know the Gym name.  You can find the Gym name in the Gym documentation for environments, either:
<ol>
    <li>The doc page in <a href="https://www.gymlibrary.dev/environments/toy_text/frozen_lake/">Gym's website</a></li>
    <li>The environment's <a href="https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py">source code </a></li>
    <li>
        <a href="https://www.gymlibrary.ml/environments/classic_control/cart_pole/#description">Research paper (if one exists)</a> referenced in the environment page </li>
    </ol>
    
Below is an example of how to create a basic Gym environment, [frozen lake](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/).  We can see below that the termination condition of an episode will be <b>TimeLimit</b> (the environment automatically ends an episode and sets done=True after this many timesteps).


In [3]:
env_name = "FrozenLake-v1"

# Instantiate gym env object with a runtime parameter value (is_slippery).
# is_slippery=True specifies the environment is stochastic
# is_slippery=False is the same as "deterministic=True"
env = gym.make(
    env_name,
    is_slippery=False,  # whether the environment behaves deterministically or not
)

# inspect the gym spec for the environment
print(f"env: {env}")
env_spec = env.spec
print(f"env_spec: {env_spec}")

# Note: "TimeLimit" means termination condition for an episode will be time steps

env: <TimeLimit<FrozenLakeEnv<FrozenLake-v1>>>
env_spec: EnvSpec(FrozenLake-v1)


#### 3. Inspect the environment action and observation spaces

Gym Environments can be deterministic or stochastic.

<ul>
    <li>
        <b>Deterministic</b> if the current state + selected action determines the next state of the environment.  <i>Chess is an example of a deterministic environment</i>, since all possible states/action combinations can be described as a discrete set of rules with states bounded by the pieces and size of the board.</li>
    <li>
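        <b>Stochastic</b> if the current state + selected action only determines a probability distribution over next states, so the same action can lead to different outcomes (frozen lake with is_slippery=True behaves this way). <i>Random visitors to a website is an example of a stochastic environment</i>.  A policy can be stochastic too: its output is a probability distribution over the set of possible actions at time step t. In this case, the agent needs to compute its action from the policy in two steps: i) sample actions from the policy according to the probability distribution, ii) compute log likelihoods of the actions. </li>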
    </ul>

<b>Gym actions.</b> The action_space describes the numerical structure of the legitimate actions that can be applied to the environment. 

For example, if we have 4 possible discrete actions, we could encode them as:
<ul>
    <li>0: LEFT</li>
    <li>1: DOWN</li>
    <li>2: RIGHT</li>
    <li>3: UP</li>
</ul>

<b>Gym observations.</b>  The observation_space defines the structure as well as the legitimate values for the observation of a state of the environment.  

For example, if we have a 4x4 grid, we could encode the grid positions ((0,0), (0,1), (0,2), (0,3), …, (3,3)) as the integers {0, 1, 2, 3, 4, …, 15}.
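
The encoding above is a row-major numbering of the grid cells. A minimal sketch of the conversion in both directions follows; the helper names `to_index` and `to_row_col` are illustrative and not part of Gym.

```python
N_COLS = 4  # width of the 4x4 grid


def to_index(row, col, n_cols=N_COLS):
    """Map a (row, col) grid position to a single integer observation."""
    return row * n_cols + col


def to_row_col(index, n_cols=N_COLS):
    """Map an integer observation back to its (row, col) grid position."""
    return divmod(index, n_cols)


print(to_index(3, 3))   # 15
print(to_row_col(5))    # (1, 1)
```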

From the Gym [documentation](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) about the frozen lake environment, we see: <br>

|Frozen Lake      | Gym space   |
|---------------- | ----------- |
|Action Space     | Discrete(4) |
|Observation Space| Discrete(16)|
 
<b><a href="https://github.com/openai/gym/tree/master/gym/spaces">Gym spaces</a></b> are gym data types.  The main types are `Discrete` for discrete numbers and `Box` for continuous numbers.  

Elements of a `Discrete` space are integers, and elements of a `Box` space are NumPy arrays (dtype `float32` by default).

Below is an example of how to inspect the environment's action and observation spaces.

In [4]:
# check if it is a gym instance
if isinstance(env, gym.Env):
    print("This is a gym environment.")
    print()

    # print gym Spaces
    if isinstance(env.action_space, gym.spaces.Space):
        print(f"gym action space: {env.action_space}")
    if isinstance(env.observation_space, gym.spaces.Space):
        print(f"gym observation space: {env.observation_space}") 
        
# Note: the action space is discrete with 4 possible actions.
# Note: the observation space is 4x4 and thus runs from 0 to 15.
# Note: if we chose 8x8, the observation space would change to Discrete(64).

This is a gym environment.

gym action space: Discrete(4)
gym observation space: Discrete(16)


#### 4. Inspect gym environment default & runtime parameters

Gym environments have 2 sets of parameters, which you can inspect once the environment object is instantiated.
<ul>
    <li><b>Default parameters</b> are fixed in the Gym environment code itself.</li>
    <li><b>Runtime parameters</b> are passed into the make() function as **kwargs.</li>
    </ul>

Below is an example of how to inspect the environment parameters.  Notice we can tell from the parameters that our frozen lake environment is: <br>
1) <i>Deterministic</i>, and <br>
2) Episode terminates with time step condition <i>max_episode_steps</i> = 100.

In [5]:
# inspect env.spec parameters
 
# View default env spec params that are hard-coded in Gym code itself
# Default parameters are fixed
print("Default spec params...")
print(f"id: {env_spec.id}")
# rewards above this value considered "success"
print(f"reward_threshold: {env_spec.reward_threshold}")
# env is deterministic or stochastic
print(f"nondeterministic: {env_spec.nondeterministic}")
# number of time steps per episode
print(f"max_episode_steps: {env_spec.max_episode_steps}")
# must reset before step or render
print(f"order_enforce: {env_spec.order_enforce}") 

# View runtime **kwargs .spec params.  These params are passed into make() when the env is instantiated.
# print(f"type(env_spec._kwargs): {type(env_spec._kwargs)}") #dict
print()
print("Runtime spec params...")
# Note: gym > v21 use just .kwargs instead of ._kwargs
for k, v in env_spec._kwargs.items():
    print(f"{k}: {v}")
print()

# Note:  We can tell that our frozen lake environment is: 
# 1) Success criteria is rewards >= 0.7
# 2) Deterministic
# 3) Episode terminates when number time_steps = 100


Default spec params...
id: FrozenLake-v1
reward_threshold: 0.7
nondeterministic: False
max_episode_steps: 100
order_enforce: True

Runtime spec params...
map_name: 4x4
is_slippery: False



## High-level OpenAI Gym API calls <a class="anchor" id="intro_gym_api"></a>

The most basic Gym API methods are: <br>

- <b>env.reset()</b>
>Reset the environment to an initial state.  Returns the initial observation.  <b>You should call this method every time at the start of a new episode.</b>

- <b>env.step(action)</b> <br>
> Using an action as input, applies that action to the environment and <b><i>returns the 4-tuple (next-observation, reward, done, info)</i></b>.

- <b>action_space.sample()</b> <br>
> Get a random action from the environment's action space.  Typically used to loop through the environment and calculate a "Random Policy baseline".

- <b>env.render()</b>  <br>
> Visually inspect the environment. This is for human/debugging purposes; it is not seen by the agent/algorithm.  Note you cannot inspect an environment before it has been initialized with env.reset().
    
<div class="alert alert-block alert-success">
💡 <b>To play an episode, call reset() first!  <br>
💡 After that, continue to call step() until the environment automatically returns done=True.</b> 
</div>
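
To illustrate the reset()/step() contract without depending on Gym, here is a toy, deterministic re-creation of the 4x4 frozen lake map. The class `TinyFrozenLake` is a hypothetical sketch written for this notebook; it only mimics the 4-tuple return value described above and is not Gym's implementation.

```python
class TinyFrozenLake:
    """Toy, deterministic 4x4 grid that mimics the Gym reset()/step() contract."""

    MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
    # action -> (row delta, col delta): 0 LEFT, 1 DOWN, 2 RIGHT, 3 UP
    MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

    def __init__(self, max_episode_steps=100):
        self.max_episode_steps = max_episode_steps
        self.pos = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.pos, self.steps = 0, 0
        return self.pos

    def step(self, action):
        """Apply an action and return (next_observation, reward, done, info)."""
        d_row, d_col = self.MOVES[action]  # raises KeyError for an invalid action
        row, col = divmod(self.pos, 4)
        # Moves that would leave the grid keep the agent on the edge.
        row = min(max(row + d_row, 0), 3)
        col = min(max(col + d_col, 0), 3)
        self.pos = row * 4 + col
        self.steps += 1
        tile = self.MAP[row][col]
        reward = 1.0 if tile == "G" else 0.0
        # The episode ends in a hole (H), at the goal (G), or at the time limit.
        done = tile in "HG" or self.steps >= self.max_episode_steps
        return self.pos, reward, done, {}


env_sketch = TinyFrozenLake()
obs = env_sketch.reset()
for action in [1, 1, 2, 1, 2, 2]:  # DOWN, DOWN, RIGHT, DOWN, RIGHT, RIGHT
    obs, reward, done, info = env_sketch.step(action)
print(obs, reward, done)  # 15 1.0 True
```

The action sequence walks around the holes to the goal, so the last step returns a reward of 1.0 together with done=True.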

<br>

In [6]:
# Print the starting observation.  
# Recall the possible observations are the positions 0..15 on the 4x4 grid.
print(env.reset())
env.render()

0

[41mS[0mFFF
FHFH
FFFH
HFFG


In [7]:
# Take an action
# Recall the possible actions are: 0: LEFT, 1: DOWN, 2: RIGHT, 3: UP

new_obs, reward, done, _ = env.step(2) #Right
print(f"obs: {new_obs}, reward: {reward}, done: {done}")
env.render()
new_obs, reward, done, _ = env.step(1) #Down
print(f"obs: {new_obs}, reward: {reward}, done: {done}")
env.render()

obs: 1, reward: 0.0, done: False
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
obs: 5, reward: 0.0, done: True
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG


We can also try to run an action in the frozen lake environment which is outside the defined number range.

In [8]:
# Comment this cell if you want whole notebook to run without errors

# Try to take an invalid action

#env.step(4) # invalid

# should see KeyError below

To test out your environment, typically you will loop through a few episodes to make sure it works.  

In [9]:
from ipywidgets import Output
from IPython import display
import time

# The following three lines are for rendering purposes only.
# They allow us to render the env frame-by-frame in-place
# (w/o creating a huge output which we would then have to scroll through).
out = Output()
display.display(out)
with out:

    # Putting the Gym simple API methods together.
    # Here is a pattern for running a bunch of episodes.
    num_episodes = 5 # Number of episodes you want to run the agent
    total_reward = 0.0  # Initialize reward to 0

    # Loop through episodes
    for ep in range(num_episodes):

        # Reset the environment at the start of each episode
        obs = env.reset()
        done = False

        # Loop through time steps per episode
        while True:
            # take random action, but you can also do something more intelligent 
            action = env.action_space.sample()

            # apply the action
            new_obs, reward, done, info = env.step(action)
            total_reward += reward

            # If the episode is up, then start another one
            if done:
                break

            # Render the env (in place).
            time.sleep(0.3)
            out.clear_output(wait=True)
            print(f"episode: {ep}")
            print(f"obs: {new_obs}, reward: {total_reward}, done: {done}")
            env.render()

episode: 4
obs: 4, reward: 0.0, done: False
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG


## Overview of RLlib <a class="anchor" id="intro_rllib"></a>

<img width="7%" src="images/rllib-logo.png"> is currently the most comprehensive open-source Python Reinforcement Learning framework. **RLlib** is <b>distributed by default</b> since it is built on top of **[Ray](https://docs.ray.io/en/latest/)**, an easy-to-use, open-source, distributed computing framework for Python that can handle complex, heterogeneous applications. Ray and RLlib run on compute clusters on any cloud without vendor lock-in.  RLlib Resources:
<ol>
    <li>The doc page on <a href="https://docs.ray.io/en/master/rllib/index.html">ray.io website</a></li>
    <li><a href="https://github.com/ray-project/ray/tree/master/rllib">RLlib source code</a></li>
    </ol>

RLlib includes <b>25+</b> available [algorithms](https://docs.ray.io/en/master/rllib/rllib-algorithms.html), converted to both <img width="3%" src="./images/tensorflow-logo.png">_TensorFlow_ and <img width="3%" src="./images/pytorch-logo.png">_PyTorch_, covering different sub-categories of RL: _model-free_, _offline RL_, _model-based_, and _gradient-free_. Almost any RLlib algorithm can learn in a <b>multi-agent</b> setting. Many algorithms support <b>RNNs</b> and <b>LSTMs</b>.

On a very high level, RLlib is organized by **environments**, **algorithms**, **examples**, **tuned_examples**, and **models**.  

    ray
    |- rllib
    |  |- env 
    |  |- algorithms
    |  |  |- alpha_zero 
    |  |  |- appo 
    |  |  |- ppo 
    |  |  |- ... 
    |  |- examples 
    |  |- tuned_examples
    |  |- models

Within **_env_** you will find [classes](https://docs.ray.io/en/latest/rllib/package_ref/env.html) that allow RLlib to handle e.g. the multi-agent cases (which gym does NOT cover).  RLlib automatically supports any **OpenAI Gym environment** (which covers most use cases). RLlib also handles external environments that have strict performance or hosting requirements. <i>(In the next notebook, we will use the **RLlib MultiAgentEnv** base class to create a **multi agent** environment).</i>

Within **_examples_** you will find some examples of common custom RLlib use cases.  

Within **_tuned\_examples_**, you will find, sorted by algorithm, suggested hyperparameter value choices within .yaml files. The Ray RLlib team ran simulations/benchmarks to find these suggested values.  The files are used for daily testing, and weekly hard-task testing, to make sure they all run at speed, for both TF and Torch. They give you a leg up with initial parameter choices!

Within **_models_**, you will find building blocks for NNs, default models that RLlib will use (for either <img width="3%" src="./images/tensorflow-logo.png">_TensorFlow_ or <img width="3%" src="./images/pytorch-logo.png">_PyTorch_). For example, here are building blocks for DNN, CNN, RNN, and LSTM. 

In this tutorial, we will mainly focus on the **_algorithms_** package, where we will find RLlib algos to train policies on environments.


## Train a policy using an algorithm from RLlib <a class="anchor" id="intro_rllib_api"></a>

Once you have an environment, next you need to decide which RL algorithm to use.  There are many factors to consider when selecting which algorithm to use on your environment.  Following are some high-level best practices.

1. <b>Choose an algorithm compatible with the action space.</b>  Do you have discrete actions (example: LEFT, RIGHT, …) or continuous actions (example: drive at a certain speed)? 
> To check high-level if an algorithm will work, look at the <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html">RLlib algorithms doc page</a>.  Algorithms are listed according to whether or not they support Discrete actions, or Continuous actions, or both.

2. <b>Choose a stable algorithm.</b>  Look at the cumulative rewards per time step; they should rise steadily.  You do not want an algorithm whose reward jumps up and down a lot.

3. <b>Choose the most sample-efficient algorithm that works for your environment</b>.  Look at the cumulative rewards per time step; they should rise quickly. <i>As a rule of thumb, off-policy algorithms that reuse past data from a replay buffer (e.g. DQN, SAC) tend to need fewer environment samples than on-policy algorithms such as PPO.</i>


#### Step 1.  Import ray

In [10]:
# import commonly-used libraries
import os
import time
import numpy as np
print(f'Number of CPUs in this system: {os.cpu_count()}')
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
print(f"numpy: {np.__version__}")
print(f"pandas: {pd.__version__}")

# import ray
import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.logger import pretty_print
print(f"ray: {ray.__version__}")

Number of CPUs in this system: 16
numpy: 1.23.3
pandas: 1.4.4
ray: 3.0.0.dev0


#### Step 2. Check environment for errors   

Before you start training, it is a good idea to check the environment for errors.  RLlib provides a convenient [Environment pre-check function](https://github.com/ray-project/ray/blob/master/rllib/utils/pre_checks/env.py) for this.  It checks that the environment is compatible with OpenAI Gym and RLlib (and outputs a warning if necessary).

Below, we check our Frozen Lake environment for errors.

In [11]:
from ray.rllib.utils.pre_checks.env import check_env

# How to check you do not have any environment errors
print("checking environment ...")
try:
    check_env(env)
    print("All checks passed. No errors found.")
except Exception as e:
    # Show why the check failed instead of hiding the error
    print(f"failed: {e}")



checking environment ...
All checks passed. No errors found.


#### Step 3. Calculate an environment baseline

Let's run through the environment, acting randomly, without rendering, and record the mean reward.  The purpose of this is to obtain a baseline before training an RLlib algorithm.

<div class="alert alert-block alert-success">
💡 If you are doing benchmarks, this random policy is often called a <b>"baseline".</b>
</div>

In [12]:
# Putting the Gym simple API methods together.
# Here is a pattern for running a bunch of episodes.
num_episodes = 3000 # Number of episodes you want to run the agent
num_timesteps = 0
# Collect all episode rewards here
episode_rewards = []

# Loop through episodes
for ep in range(num_episodes):

    # Reset the environment at the start of each episode
    obs = env.reset()
    done = False
    episode_reward = 0.0
    
    # Loop through time steps per episode
    while True:
        # take random action, but you can also do something more intelligent 
        action = env.action_space.sample()

        # apply the action
        new_obs, reward, done, info = env.step(action)
        episode_reward += reward

        # Count the time step; if the episode is up, record its reward and start another one
        num_timesteps += 1
        if done:
            episode_rewards.append(episode_reward)
            break

# calculate mean_reward
env_mean_random_reward = np.mean(episode_rewards)
env_sd_reward = np.std(episode_rewards)
# calculate number of wins
total_reward = np.sum(episode_rewards)
    
print()
print("**************")
print(f"Baseline Mean Reward={env_mean_random_reward:.2f}+/-{env_sd_reward:.2f}", end="")
print(f" (out of success={env_spec.reward_threshold})")
print(f"Baseline won {total_reward} times over {num_episodes} episodes ({num_timesteps} timesteps)")
print(f"Approx {total_reward/num_episodes:.2f} wins per episode")
print("**************")


**************
Baseline Mean Reward=0.01+/-0.11 (out of success=0.7)
Baseline won 36.0 times over 3000 episodes (22985 timesteps)
Approx 0.01 wins per episode
**************


#### Step 4.  Select an algorithm and find that algorithm's config class  

Here is how to find an <b>RLlib algorithm's config class</b>.
<ol>
    <li>Open RLlib docs <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html">and navigate to the Algorithms page.</a></li>
    <li>Scroll down and click url of algo you want to use, e.g. <i><b>DQN</b></i></li>
    <li>On the <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html#dqn">algo docs page </a>, click on the <i><b>Implementation</b></i> link.  This will open the <a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/dqn/dqn.py">algo code file on github</a>.</li>
    <li>Scroll down to the <i>config class definition</i>.</li>
    <li>Typically the docstring example will show: </li>
    <ol>
        <li>Example code using RLlib API, and </li>
        <li>Example code using Ray Tune API.</li>
    </ol>
    </ol>

In [13]:
# config is an object instead of a dictionary since Ray version >= 1.13
from ray.rllib.algorithms.dqn import DQNConfig

# Default DQN config values
# uncomment below to see the long list of DQN-specific default config values
# print(pretty_print(DQNConfig().to_dict()))

#### Step 5. Choose your training config settings and instantiate a config object with those settings

As of Ray 1.13, RLlib configs have been converted from primitive Python dictionaries into objects. This makes them harder to print, but easier to set/pass.

**Note about RLlib training parameter values precedence**
<ol>
    <li><i><b>Highest</b> precedence</i>: <b>user's config settings at time of training</b>.  These override all other config settings.</li>
    <li><i><b>Lower</b> precedence</i>: <b>specific RLlib algorithm (e.g. DQN) config</b>:  
        <ol>
            <li>Open the <a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/dqn/dqn.py">algo code file on github</a>.  </li>
            <li>Scroll down to the config class <b>__init__()</b> method.</li>
            <ol>
            <li><i>Algorithm default hyperparameter values are here</i>.</li>
            </ol>
        </ol>
    <li><i><b>Lowest</b></i> precedence: RLlib <b><a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/algorithm_config.py">generic algorithm config</a></b> settings.</li>
    </ol>
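
The precedence order can be pictured as layered dictionaries in which later layers override earlier ones. The sketch below uses plain Python dictionaries with illustrative values; it shows the idea only and is not how RLlib's config objects are implemented.

```python
# Illustrative values only; the real defaults live in the RLlib source files.
generic_defaults = {"lr": 0.001, "gamma": 0.99, "train_batch_size": 32}
algorithm_defaults = {"lr": 5e-4}   # hypothetical algorithm-level override
user_settings = {"gamma": 0.95}     # hypothetical setting chosen at training time

# Later dictionaries win: generic < algorithm-specific < user.
effective = {**generic_defaults, **algorithm_defaults, **user_settings}
print(effective)
```

Here `lr` comes from the algorithm layer, `gamma` from the user layer, and `train_batch_size` falls through from the generic layer.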

In [14]:
# RLlib generic (for all algorithms) config values

from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
config = AlgorithmConfig()

# # uncomment below to see the long list of default RLlib AlgorithmConfig values
# print(f"RLlib's general default training config values:")
# print(pretty_print(config.to_dict()))

**Total number of Ray workers (or actors) =**
> number of rollout_workers (for streaming the data between learning steps) <br>
> \+ number of evaluation workers (for evaluation between learning steps)<br>
> \+ num_workers <br>
> \+ 1 for head node.<br>

<div class="alert alert-block alert-success">
    💡 <b>Ray Tune</b>: Total number Ray workers must be <b>(max) ONE LESS THAN</b> total number of available cores or processors. <br>
    💡 <b>RLlib algo.train()</b>: Total number Ray workers must be <b>(max) EQUAL TO</b> Total number of available cores or processors.
</div>
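
A quick way to sanity-check a planned configuration is to add up the workers and compare the total with the number of cores. The sketch below follows the formula and the two rules stated above; the function names are hypothetical helpers, not RLlib APIs.

```python
import os


def total_ray_workers(num_rollout_workers, evaluation_num_workers, num_workers=0):
    """Sum the worker counts listed above, plus 1 for the head node."""
    return num_rollout_workers + evaluation_num_workers + num_workers + 1


def fits_on_machine(total_workers, num_cpus, using_tune):
    """Ray Tune needs one spare core; algo.train() may use all of them."""
    limit = num_cpus - 1 if using_tune else num_cpus
    return total_workers <= limit


total = total_ray_workers(num_rollout_workers=1, evaluation_num_workers=1)
num_cpus = os.cpu_count() or 1
print(f"total workers: {total}, CPUs: {num_cpus}")
print(f"OK for Ray Tune: {fits_on_machine(total, num_cpus, using_tune=True)}")
print(f"OK for algo.train(): {fits_on_machine(total, num_cpus, using_tune=False)}")
```

If either check prints False, reduce the worker counts in the config before training.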

**Note about eval configuration, scaling, and fault tolerance**
    
    
All of the Ray worker counts mentioned above are set in the <b>Evaluation config</b>, except rollout_workers, which are set in a separate Rollout config.    

    
<ul>
    <li><b><i>evaluation_interval</i></b> = number training <b>iterations</b> between evaluations</li>
    <li><b><i>evaluation_duration</i></b> = how long each evaluation runs, in units of <i>evaluation_duration_unit</i> (episodes by default)</li>
    <li><b><i>evaluation_num_workers</i></b> = number extra parallel Ray workers just for evaluation</li>
    <ul>
        <li>These are in addition to rollout_workers.</li>
        <li>Total number iterations used for eval = <i><b>evaluation_duration * evaluation_num_workers</b></i></li>
        <li>These show up in the Ray Dashboard as extra "RolloutWorker"s.</li>
    </ul>
    <li><i><b>evaluation_config/num_workers</i></b> = number of extra parallel Ray workers for <b>training</b>.</li>
        <ul>
        <li>Only 1 possible for DQN (see documentation).  Any number you put here will be translated to 1 during runtime.</li>
    </ul>
    <li><i><b>evaluation_parallel_to_training</i></b> = whether or not to use the extra parallel Ray workers for evaluation.</li>
    <ul>
        <li>False by default.</li>
    </ul>
    <li><i><b>rollouts.num_rollout_workers</i></b> = number of extra parallel Ray workers used just for <b>streaming data</b> per gradient update.</li>
    <ul>
        <li>This parameter is set in <i>.rollouts</i> not in <i>.evaluation</i>.  </li>
    </ul>
</ul>


View the other training parameters with <b><i>DQNConfig().to_dict()</i></b>. Unless you override them, the defaults are:
    <ul>
    <li><b><i>train_batch_size</i></b> = number of samples drawn from the replay buffer for each gradient update.
    <ul>
        <li>Default batch_size for DQN = 32</li>
    </ul>
    </li>
    <li><b><i>training iteration</i></b> = one call to <i>train()</i>, which samples at least a minimum number of time steps.
    <ul>
        <li>Default timesteps per iteration for DQN = 1000</li>
    </ul>
    </li>
    <li>Default learning rate <i><b>lr</b></i> = 5.0e-04</li>
    <li><i><b>target_network_update_freq</b></i>: 500 means the frozen-in-time network gets updated only once every 500 time steps</li>
    </ul>

<b>View network architecture</b> <br>
According to the paper
> The input to the neural
network consists is an 84 × 84 × 4 image produced by φ. The first hidden layer convolves 16 8 × 8
filters with stride 4 with the input image and applies a rectifier nonlinearity [10, 18]. The second
hidden layer convolves 32 4 × 4 filters with stride 2, again followed by a rectifier nonlinearity. The
final hidden layer is fully-connected and consists of 256 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action.
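
To make the layer sizes in the quoted paper architecture concrete, here is a short worked calculation of the feature-map shapes. It is a sketch that assumes convolutions without padding ("valid" convolutions), which the quote does not state explicitly; `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial output size of a convolution with no padding."""
    return (input_size - kernel_size) // stride + 1


size = 84  # the input image is 84 x 84 x 4
conv_layers = [(16, 8, 4), (32, 4, 2)]  # (num_filters, kernel_size, stride)

shapes = []
for num_filters, kernel_size, stride in conv_layers:
    size = conv_output_size(size, kernel_size, stride)
    shapes.append((size, size, num_filters))

# Number of values fed into the 256-unit fully-connected layer.
height, width, channels = shapes[-1]
flattened = height * width * channels

print(shapes)     # [(20, 20, 16), (9, 9, 32)]
print(flattened)  # 2592
```

Under that assumption the image shrinks from 84 x 84 to 20 x 20 and then to 9 x 9 before it is flattened into the fully-connected layer.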

Use dqn_algo.get_policy().model to inspect the model that RLlib built.
> Below we can see the TensorFlow model architecture is 2 hidden layers with 256 Dense cells each

<img src="./images/tf_dqn_algo_dnn_architecture.png" width="80%" />

In [16]:
# Create a DQNConfig object
dqn_config = DQNConfig()

# Only for Colab. Specify 1 gpu
# dqn_config.num_gpus=1

# Setup our config object to use our environment
dqn_config.environment(env="FrozenLake-v1")

# Decide if you want torch or tensorflow DL framework.  Default is "tf"
dqn_config.framework(framework="torch")

# Set the log level to DEBUG, INFO, WARN, or ERROR 
dqn_config.debugging(seed=415, log_level="ERROR")

# Setup evaluation
dqn_config.evaluation(
    
    # Minimum number of training iterations between evaluations.
    # Evaluations are blocking operations (if evaluation_parallel_to_training=False) 
    # set `evaluation_interval` larger for faster runtime.
    evaluation_interval=15, 

    # Minimum number of evaluation iterations.
    # If using multiple evaluation workers, we will run at least 
    # this many episodes * num_evalworkers total.
    evaluation_duration=5,      

    # Number of parallel evaluation workers. 
    # Zero by default, which means evaluation will run on the training resources. 
    # If you increase this, it will increase total Ray resource usage
    # since evaluation workers are created separately from rollout workers 
    # Note: these show up on Ray Dashboard as extra "RolloutWorker"s
    evaluation_num_workers=1,  #0 for Colab

    # Use the parallel evaluation workers in parallel with training workers
    evaluation_parallel_to_training=True,  #False for Colab
    
    evaluation_config = dict(
        # Explicitly set "explore"=False to override default True
        # Best practice value is False unless environment is stochastic
        explore=False,
        
        # Number of parallel Training workers
        # Override the num_workers from the training config 
        # Note: DQN only allows 1 Trainer worker, see documentation
        num_workers=1,  #any number here will be reset = 1 for DQN
    ),
)

# # Override default training parameters
# dqn_config.training(target_network_update_freq=5000, 
#                     model={"fcnet_hiddens": [32, 32]}
#                    )

# Setup sampling rollout workers for streaming the data 
dqn_config.rollouts(
    num_rollout_workers=3,  #1 for Colab
    
    # for small environments this can be >1 based on size of your processor
    num_envs_per_worker=4,)

print(f"Config type: {type(dqn_config)}")

# Use the config object's `build()` method for instantiating
# an RLlib Algorithm instance that we can then train.
# Note if using Tune, don't need algo object, but this is still a good debugging step.
dqn_algo = dqn_config.build()
print(f"Algorithm type: {type(dqn_algo)}")

print()
print("DQN MODEL ARCHITECTURE:")
# print(result['config']['model'])
# # tf print keras model summary
# print(dqn_algo.get_policy().model.base_model.summary())
# # torch
# from torchinfo import summary
# summary(dqn_algo.get_policy().model)


Config type: <class 'ray.rllib.algorithms.dqn.dqn.DQNConfig'>




Algorithm type: <class 'ray.rllib.algorithms.dqn.dqn.DQN'>

DQN MODEL ARCHITECTURE:


#### Step 6. Train an algorithm using the environment and algorithm config objects

**Two ways to train RLlib policies**\*
<ol>
    <li><a href="https://docs.ray.io/en/master/rllib/package_ref/index.html">RLlib API.</a> The main methods are:</li>
    <ul><b>
        <li>train()</li>
        <li>save()</li>
        <li>evaluate()</li>
        <li>restore()</li>
        <li>compute_single_action()</li></b>
    </ul>
    <li><a href="https://docs.ray.io/en/master/tune/api_docs/overview.html">Ray Tune API.</a>  The main methods are:</li>
        <ul>
            <li><b>run()</b></li>
    </ul>
    </ol>
    
\*A third way is the RLlib CLI, run from the command line with a .yml file, but the .yml file is undocumented: <i>rllib train -f [myfile_name].yml</i><br>

<b>RLlib API train()</b> will train for 1 <i>iteration</i> only.  Good for debugging since every single output will be shown for the single iteration.  

<b>Ray Tune API run()</b> is usually more convenient, since one function call gives you experiment management: hyperparameter tuning, checkpoint saving, evaluation, and training until a stopping criterion is met.

✔Both methods will run the RLlib [environment pre-check function](https://github.com/ray-project/ray/blob/master/rllib/utils/pre_checks/env.py) you saw earlier in this notebook (Step 2. Check environment).

<b>RLlib API restore()</b> will reload a checkpointed RLlib model for Serving and Offline learning, even if the model was trained using Tune.  Tune API methods will not work for this.

<b>RLlib API compute_single_action()</b> uses the trained <i>`policy`</i> (RL word for trained model) to run inference on an environment.   

<div class="alert alert-block alert-success">
In summary: <br>
    💡 <b>Train</b> a RLlib algorithm with Ray Tune method <b>`.run()`</b>  <br>
    👉  <b>Develop</b> or debug a RLlib algorithm with RLlib method <b>`.train()`</b> <br>
    👉  <b>Restore</b> a RLlib policy with RLlib  method <b>`.restore()`</b> <br>
    👉  <b>Run inference</b> on an environment using a trained policy with RLlib method <b>`.compute_single_action()`</b>
</div>

💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!

In [17]:
# # SINGLE .TRAIN() OUTPUT
# # Check configs before submitting a long-running Tune job.

# # Perform single `.train() iteration` call
# # Result is a Python dict object
# result = dqn_algo.train()

# # Erase config dict from result (for better overview).
# del result["config"]
# # Print out training iteration results.
# print(pretty_print(result))

In [18]:
# Before setting up the Tune job hyperparam sweep,
# Check current parameter settings

print(f"Algo class for DQN: {dqn_config.algo_class}")
print(f"Learning rate: {dqn_config.lr}")
print(f"Train batch size: {dqn_config.train_batch_size}")
print(f"Gamma: {dqn_config.gamma}")
print(f"Target network update freq: {dqn_config.target_network_update_freq}")
print(f"Eval_num_workers: {dqn_config.evaluation_num_workers}")
print(f"Evaluation_parallel_to_training: {dqn_config.evaluation_parallel_to_training}")
print(f"Num_rollout_workers: {dqn_config.num_workers}")
print(f"Num_envs_per_worker: {dqn_config.num_envs_per_worker}")
print(f"Training num_workers: {dqn_config.to_dict()['evaluation_config']['num_workers']}")
print(f"Model grayscale: {dqn_config.to_dict()['model']['grayscale']}")
print(f"Model zero_mean: {dqn_config.to_dict()['model']['zero_mean']}")


Algo class for DQN: <class 'ray.rllib.algorithms.dqn.dqn.DQN'>
Learning rate: 0.0005
Train batch size: 32
Gamma: 0.99
Target network update freq: 500
Eval_num_workers: 1
Evaluation_parallel_to_training: True
Num_rollout_workers: 3
Num_envs_per_worker: 4
Training num_workers: 1
Model grayscale: False
Model zero_mean: True


In [19]:
# # Now let's change our existing config object and add a simple grid-search

# grid search over training params
dqn_config.training(
    lr=tune.grid_search([0.00005, 0.0002]),
    # train_batch_size=tune.grid_search([32, 100]),
)
print(f"Default lr is: {dqn_config.lr}")

# # grid search over eval params
# dqn_config.evaluation(
#     evaluation_num_workers=tune.grid_search([0,1]),
# )
# print(f"Default eval_num_workers for DQN is: {dqn_config.evaluation_num_workers}")

# # grid search over rollouts params
# dqn_config.rollouts(
#     num_rollout_workers=tune.grid_search([1,3,4]),
#     num_envs_per_worker=tune.grid_search([1,4]),
# )
# print(f"Default num_envs_per_worker is: {dqn_config.num_envs_per_worker}")

Default lr is: {'grid_search': [5e-05, 0.0002]}
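
A grid search runs one trial for every combination of the listed values, so each extra grid multiplies the number of trials. The sketch below shows that expansion with plain Python; it is not Ray Tune's code, and `expand_grid` is a hypothetical helper. The values are the ones from the cell above, including the commented-out `train_batch_size` grid.

```python
from itertools import product


def expand_grid(search_space):
    """Return one config dict per combination of the grid values."""
    keys = list(search_space)
    value_lists = [search_space[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


# Only the learning-rate grid, as in the active code above: 2 trials.
lr_only = expand_grid({"lr": [0.00005, 0.0002]})
print(f"{len(lr_only)} trials: {lr_only}")

# Adding a second grid multiplies the trial count: 2 x 2 = 4 trials.
lr_and_batch = expand_grid({"lr": [0.00005, 0.0002], "train_batch_size": [32, 100]})
print(f"{len(lr_and_batch)} trials")
for trial_config in lr_and_batch:
    print(trial_config)
```

Because the trial count grows multiplicatively, it is worth estimating it before launching a sweep with several grids.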


In [None]:
# ##############
# # EXAMPLE USING RAY TUNE API .run() UNTIL STOPPING CONDITION
# ##############
# To start fresh, restart Ray in case it is already running
if ray.is_initialized():
    ray.shutdown()

# Stopping criteria whichever occurs first: average (sum) reward, or ...
stop_criteria = dict(
        # stop after n seconds
        time_total_s=35,
        # stop if reached n sampling timesteps
        # timesteps_total=9000,  
        # stop after n training iterations (calls to `Algorithm.train()`)
        # training_iteration=30,
        # stop if average (sum of) rewards in an episode is n or more
        # episode_reward_mean=0.2,  # 0.2 out of max 0.7 
)
    
# # Use a custom "reporter" that adds the individual policies' rewards to the output.
# reporter = CLIReporter()    
# reporter.add_metric_column("sampler_results/policy_reward_mean/policy1", "agent1 return")
# reporter.add_metric_column("sampler_results/policy_reward_mean/policy2", "agent2 return")
    
# # not working yet?
# from ray.train.rl import RLTrainer
# experiment_results = RLTrainer(

experiment_results = \
tune.run(
    # Alternatively, just put the string "DQN" here.
    # All of RLlib's algos are pre-registered with Tune: e.g. "PPO", "DQN", "SAC", "IMPALA", etc..
    dqn_config.algo_class,

    # training config params (translated into a python dict!)
    config=dqn_config.to_dict(),
    
    # Stopping criteria whichever occurs first: average reward over training episodes, or ...
    stop=stop_criteria,
    
    # Customize the progress reporter
    # progress_reporter=reporter,

    #redirect logs to relative path instead of default ~/ray_results/
    local_dir = "my_Tune_logs",
         
    # Every how many train() calls do we create a checkpoint?
    # checkpoint_freq=9,  # (iterations // (#desired_checkpts+1)) - 1
    # Always save last checkpoint (no matter the frequency).
    checkpoint_at_end=True,

    ###############
    # Note about Ray Tune verbosity.
    # Screen verbosity in Ray Tune is defined as verbose = 0, 1, 2, or 3, where:
    # 0 = silent
    # 1 = only status updates, no logging messages
    # 2 = summary, status and brief trial results, includes logging messages
    # 3 = summary, status and detailed trial results, includes logging messages
    # Defaults to 3.
    ###############
    verbose=2,
                   
    # Define what we are comparing for, when we search for the
    # "best" checkpoint at the end.
    metric="episode_reward_mean",
    mode="max",
)

print("Training completed.")


<br>

⬆️ Above, you will see a lot of Tune output.  Look for a summary table like this:

<img src="./images/DQN_tune_lr_summary.png" width="100%" />   
<br>
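
The `stop` dictionary passed to `tune.run()` above ends a trial as soon as any one of its metrics reaches its threshold ("whichever occurs first"). The following is a simplified sketch of that check, not Ray Tune's actual code: `should_stop` is a hypothetical helper and the result dictionaries are made-up examples.

```python
def should_stop(result, stop_criteria):
    """Return True as soon as any reported metric reaches its threshold."""
    return any(
        metric in result and result[metric] >= threshold
        for metric, threshold in stop_criteria.items()
    )


stop_criteria = {"time_total_s": 35, "training_iteration": 30}

# Hypothetical results reported after different training iterations.
early = {"time_total_s": 12.0, "training_iteration": 10}
hit_iterations = {"time_total_s": 20.0, "training_iteration": 30}
hit_time = {"time_total_s": 35.3, "training_iteration": 25}

print(should_stop(early, stop_criteria))           # False
print(should_stop(hit_iterations, stop_criteria))  # True
print(should_stop(hit_time, stop_criteria))        # True
```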

## Evaluate a RLlib Policy <a class="anchor" id="eval_rllib"></a>

Traditional supervised ML splits the data into train/validation/test sets and evaluates on the entire validation set after the model has been trained.  RL, on the other hand, typically runs evaluation every time a checkpoint is saved.  

RLlib policies can be evaluated by:
<ul>
    <li>Examining <b>Ray Tune experiment results</b>  </li>
    <li>Calling the RLlib Algorithm API <b>.evaluate()</b>, typically every time <b>.save()</b> is called.</li>
    <li>Visualizing real-time training progress in <b>TensorBoard</b></li>
    </ul>

<b>If using Ray Tune .run()</b>

> The `tune.run()` method returns an `ExperimentAnalysis` object whose results can be read into a pandas dataframe.


<b>If using RLlib .train()</b>

> The `train()` method returns a dictionary. 

⬇️ Below we will examine the Ray Tune experiment results.
<br>

In [21]:
# Read off overall stats
stats = experiment_results.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

# Read trainer results in a pandas dataframe
df = experiment_results.results_df

# Optional: agent steps sampled from the env vs. agent steps used in training updates
# print(f"Number of sample_timesteps: {df.iloc[0,:].num_agent_steps_sampled}")
# print(f"Number of num_agent_steps_trained: {df.iloc[0,:].num_agent_steps_trained}")
print(f"df.shape: {df.shape}")  # One row per trial
# pick off col numbers this way
# temp = df.columns.tolist()
# temp
temp_columns = ["experiment_tag", "config/lr", "config/gamma", "episode_reward_mean",
                "episode_len_mean", "timesteps_total", "training_iteration", 
                "done", "time_total_s", 
                "timers/training_iteration_time_ms", "timers/load_time_ms",
                "timers/load_throughput", "timers/learn_time_ms", "timers/synch_weights_time_ms",
                "config/num_workers", "config/evaluation_num_workers",
                "config/evaluation_config/num_envs_per_worker", "config/evaluation_config/evaluation_config/num_workers"]
temp = df.loc[:, temp_columns].head()
temp.rename(columns={'config/evaluation_config/evaluation_config/num_workers':'num_train_workers'}, inplace=True)
temp.rename(columns={'config/evaluation_config/num_envs_per_worker':'num_envs_per_eval_worker'}, inplace=True)
temp.rename(columns={'config/evaluation_num_workers':'evaluation_num_workers'}, inplace=True)
temp.rename(columns={'config/num_workers':'num_rollout_workers'}, inplace=True)
from IPython.display import display
display(temp)

print()
print("TRAIN SETTINGS")
print(f"learning rate: {df.iloc[0,:]['config/lr']}")
print(f"batch_size: {df.iloc[0,:]['config/train_batch_size']}")
print(f"eval_interval: {df.iloc[0,:]['config/evaluation_interval']}")
print(f"Timesteps since last target update: {df.iloc[0,:]['info/last_target_update_ts']}")
print(f"Num target updates: {df.iloc[0,:]['info/num_target_updates']}")

print()
print("TIMINGS")
print(f"Total time (sec) 1st trial: {df.iloc[0,:]['time_total_s']}")
print(f"Total time (sec) 2nd trial: {df.iloc[1,:]['time_total_s']}")


  63.65 seconds,    1.06 minutes
df.shape: (2, 450)


| trial_id | experiment_tag | config/lr | config/gamma | episode_reward_mean | episode_len_mean | timesteps_total | training_iteration | done | time_total_s | timers/training_iteration_time_ms | timers/load_time_ms | timers/load_throughput | timers/learn_time_ms | timers/synch_weights_time_ms | num_rollout_workers | evaluation_num_workers | num_envs_per_eval_worker | num_train_workers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11c3c_00000 | 0_lr=0.0001 | 5e-05 | 0.99 | 0.54 | 37.86 | 36288 | 36 | True | 35.260832 | 47.632 | 0.198 | 161358.173 | 8.353 | 1.695 | 3 | 1 | 4 | 1 |
| 11c3c_00001 | 1_lr=0.0002 | 0.0002 | 0.99 | 0.25 | 25.84 | 36288 | 36 | True | 35.127245 | 42.595 | 0.188 | 170262.245 | 7.704 | 1.605 | 3 | 1 | 4 | 1 |



TRAIN SETTINGS
learning rate: 5e-05
batch_size: 32
eval_interval: 15
Timesteps since last target update: 35856
Num target updates: 67

TIMINGS
Total time (sec) 1st trial: 35.26083159446716
Total time (sec) 2nd trial: 35.12724542617798


<br>

**Iterate based on Tune results**

From the results above, after running each trial for less than 1 minute, we can see that the best configuration is
- learning rate = 0.00005
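
Reading the winner off the table by eye works for two trials, but the same selection can be done programmatically. Below is a minimal sketch using plain Python dictionaries that copy the two rows from the results table above; `best_trial` is a hypothetical helper, not part of Ray Tune (Tune's own `get_best_trial()` is used later in this notebook).

```python
# The two trials from the results table above.
trial_results = [
    {"trial_id": "11c3c_00000", "lr": 0.00005, "episode_reward_mean": 0.54},
    {"trial_id": "11c3c_00001", "lr": 0.0002, "episode_reward_mean": 0.25},
]


def best_trial(results, metric="episode_reward_mean", mode="max"):
    """Pick the row with the highest (mode='max') or lowest (mode='min') metric."""
    pick = max if mode == "max" else min
    return pick(results, key=lambda row: row[metric])


best = best_trial(trial_results)
print(f"Best trial: {best['trial_id']} with lr={best['lr']}")
```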

Next, let's train longer with these settings

In [22]:
# To start fresh, restart Ray in case it is already running
if ray.is_initialized():
    ray.shutdown()

In [23]:
# Change config settings
# Create a DQNConfig object
dqn_config = DQNConfig()\
    .environment(env="FrozenLake-v1")\
    .framework(framework="torch")\
    .debugging(seed=415, log_level="ERROR")\
    .evaluation(
        evaluation_interval=15, 
        evaluation_duration=5,      
        evaluation_num_workers=4,  #1 for Colab
        evaluation_parallel_to_training=True,
        evaluation_config = dict(
            explore=False,
            num_workers=1,  #any number here will be reset = 1 for DQN
        ),)\
    .rollouts(
        num_rollout_workers=1, 
        num_envs_per_worker=4,)\
    .training(
        lr=0.00005,)

print(f"Config type: {type(dqn_config)}")

# Use the config object's `build()` method for instantiating
# an RLlib Algorithm instance that we can then train.
dqn_algo = dqn_config.build()
print(f"Algorithm type: {type(dqn_algo)}")


2022-09-16 09:25:52,883	INFO worker.py:1221 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
2022-09-16 09:25:52,884	INFO worker.py:1331 -- Connecting to existing Ray cluster at address: 10.0.117.196:9031...
2022-09-16 09:25:52,888	INFO worker.py:1508 -- Connected to Ray cluster. View the dashboard at https://console.anyscale.com/api/v2/sessions/ses_S99ZQzT4pS1G37w2HU5fADdM/services?redirect_to=dashboard


Config type: <class 'ray.rllib.algorithms.dqn.dqn.DQNConfig'>




Algorithm type: <class 'ray.rllib.algorithms.dqn.dqn.DQN'>


In [24]:
###############
# EXAMPLE USING RLLIB API .train() IN A LOOP
# To train for N number of episodes, you put .train() into a loop, 
# similar to the way we ran the Gym env.step() in a loop.
###############

start_time = time.time()

# train the Algorithm instance for 20 iterations
num_iterations = 20
dqn_rewards  = []
checkpoint_dir = "./saved_runs/dqn/"

for i in range(num_iterations):
    # Call its `train()` method
    result = dqn_algo.train()
    
    # Extract reward from results.
    dqn_rewards.append(result["episode_reward_mean"])
    
    # checkpoint and evaluate at iteration 0, every 14th iteration after that, and at the last one
    if ((i % 14 == 0) or (i == num_iterations-1)):
        print(f"Iteration={i}, Mean Reward={result['episode_reward_mean']:.2f}",end="")
        try:
            print(f"+/-{np.std(dqn_rewards ):.2f}")
        except:
            print()
        # save checkpoint file
        checkpoint_file = dqn_algo.save(checkpoint_dir)
        print(f"Checkpoints saved at {checkpoint_file}")
        # evaluate the policy
        eval_result = dqn_algo.evaluate()

# `hist_stats` holds only the window of recent episodes that RLlib keeps for smoothing,
# not the whole run, so report statistics over that window.
recent_lengths = result["hist_stats"]["episode_lengths"]
recent_rewards = result["hist_stats"]["episode_reward"]
num_episodes = len(recent_lengths)
num_timesteps = sum(recent_lengths)
# FrozenLake pays reward 1 only for reaching the goal, so summed reward = number of wins
num_wins = np.sum(recent_rewards)

# train time
secs = time.time() - start_time
print(f"DQN won {num_wins} times over the last {num_episodes} episodes ({num_timesteps} timesteps)")
print(f"Approx {num_wins/num_episodes:.2f} wins per episode")
print(f"Training took {secs:.2f} seconds, {secs/60.0:.2f} minutes")

Iteration=0, Mean Reward=0.03+/-0.00
Checkpoints saved at results_notebook/online_rl/dqn/checkpoint_000001
Iteration=14, Mean Reward=0.40+/-0.13
Checkpoints saved at results_notebook/online_rl/dqn/checkpoint_000015
Iteration=19, Mean Reward=0.49+/-0.19
Checkpoints saved at results_notebook/online_rl/dqn/checkpoint_000020
DQN won 49.0 times over the last 100 episodes (3859 timesteps)
Approx 0.49 wins per episode
Training took 45.83 seconds, 0.76 minutes



<b>Compare the DQN training results to the random baseline</b><br>
- DQN Mean Reward=~0.49+/-0.19.  This is much higher than the baseline!
> Baseline Mean Reward=~0.02+/-0.13 (out of success=0.7) <br>

<div class="alert alert-block alert-success">
    ✔ <b>DQN mean reward is roughly 25x higher than the random baseline!</b> <br>
</div>

<br>

In [25]:
# To stop the Algorithm (and Env) and release its blocked resources, use:
dqn_algo.stop()

#### Visualize the training progress in TensorBoard

<b>Ray Tune</b> automatically creates logs for your trained RLlib models that can be visualized in TensorBoard.  Ray Tune logs are stored in the specified redirect `local_dir`; or if none specified then the logs are stored in `~/ray_results/`.

<b>RLlib Algorithm .train() requires an explicit .save() step</b> in order to create logs.  The default format for .save() is Ray Tune .json logs compatible with TensorBoard.  Unlike with Ray Tune, logs written this way can only be stored in `~/ray_results/`; you cannot change the location of the TensorBoard logs.

To visualize the performance of your RL policy:

<ol>
    <li>Open a terminal</li>
    <li><i><b>cd</b></i> into the correct log directory.</li>
    <li><i><b>ls</b></i></li>
    <li>You should see files such as: <i>result.json, params.json, ... </i></li>
    <li>To be able to compare all your experiments, cd one dir level up.</li>
    <li><i><b>cd ..</b></i></li>
    <li><i><b>tensorboard --logdir . </b></i></li>
    <li>Look at the url in the message, and open it in a browser</li>
        </ol>
        
Note on Step 7 above: if running RLlib on a cluster, use <a href="https://blog.tensorflow.org/2019/12/introducing-tensorboarddev-new-way-to.html">tensorboard.dev</a> instead.  Navigate to the directory on the head node where the `ray_results/` directory is located.  From there, run 
`tensorboard dev upload --logdir .`

#### TensorBoard

TensorBoard will give you many pages of charts.  Most of the charts will be showing Train/Eval <b>sample efficiency</b>, <i>the number of training steps it took to achieve a certain level of performance</i>.

<div class="alert alert-block alert-success">
    <b>A few charts you will want to inspect:</b>
<ol>
    <li>View training mean episode reward with
        <ul> 
            <li><b>x-axis step</b>.  This shows the whole learning curve.</li>
            <li><b>x-axis relative</b>. Look to the far right of the chart, where you can quickly see which model got the highest mean episode reward.</li>
            <li>Use the TensorBoard menu on the left-hand side to toggle between x-axis views.</li>
        </ul>
    </li>
    <li>If comparing different training runs, check that the <b>rank order of mean episode rewards per model matches between training and evaluation</b>.
        <ul>
            <li>Training charts are usually on page 1.</li>
            <li>Evaluation charts are usually on page 2.</li>
        </ul>
    </li>
    <li>View <b>training entropy</b>.  It should be generally decreasing.  
        <ul>
            <li>Entropy charts are usually on page 3.</li>
            <li>If you see entropy decreasing but then at some point flattening or even increasing, it means you should stop training earlier.</li>
        </ul>
    </li>
    <li>Toggle the ✔ checkbox <i>`Ignore outliers in chart scaling`</i>, in case you have 1 model way outperforming other policies.  The checkbox is located on the top-left menu.</li>
    <li>💡 When viewing final model outputs per chart, make sure you are hovering far enough to the right to see a filled-color-dot at the end of each line chart.  This means you are viewing the final, overall metrics for that chart. </li>
</ol>
</div>
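
The rank-order check from the list above can also be done in code once you have the final mean rewards per run. This is a minimal sketch with hypothetical helper names (`rank_order`, `ranks_match`); the training values come from the Tune results table earlier in this notebook, while the evaluation values are made up for illustration.

```python
def rank_order(mean_rewards):
    """Return run names sorted from highest to lowest mean episode reward."""
    return sorted(mean_rewards, key=mean_rewards.get, reverse=True)


def ranks_match(train_rewards, eval_rewards):
    """True if the runs rank in the same order during training and evaluation."""
    return rank_order(train_rewards) == rank_order(eval_rewards)


train_rewards = {"lr=5e-05": 0.54, "lr=0.0002": 0.25}  # from the Tune results table
eval_rewards = {"lr=5e-05": 0.60, "lr=0.0002": 0.20}   # hypothetical evaluation values

print(rank_order(train_rewards))
print(ranks_match(train_rewards, eval_rewards))
```

If the two orders disagree, the run that looks best during training may not be the one that performs best in evaluation.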
    
   
<b>TensorBoard Screenshots:</b> <br>  
<img src="./images/frozen_lake_v1_tensorboard.png" width="80%" />    



## Reload RLlib policy from checkpoint and run inference <a class="anchor" id="reload_rllib"></a>

We want to reload the desired RLlib model from checkpoint file and then run the policy in inference mode on the environment it was trained on.  

You will need:
<ul>
    <li>Your <b>algorithm's config class</b></li>
    <li>Name of the <b>environment</b> you used to train the policy.</li>
    <li>Path to the desired <b>checkpoint</b> file you want to use to restore the policy.</li>
    </ul>

#### Step 1. Find the best model checkpoint file

In [26]:
# EXAMPLE GETTING CHECKPOINT FROM RLLIB TRAIN

# Enter the last checkpoint manually
# checkpoint = "./results/DQN/checkpoint_000020/checkpoint-20"
checkpoint = "./saved_runs/dqn/checkpoint_000020/checkpoint-20"
print(f"\n{checkpoint}")


./results_notebook/online_rl/dqn/checkpoint_000020/checkpoint-20


In [27]:
# # EXAMPLE GETTING CHECKPOINT FROM RAY TUNE

# # Using the returned `experiment_results` object,
# # we can extract from it the best checkpoint according to some criterion, e.g. `episode_reward_mean`.

# # Return the trial that performed best here.
# best_trial = experiment_results.get_best_trial()
# print("Best trial: ", best_trial)

# # View all the training parameters
# # Could also view params.json inside ~/ray_results/ directory
# best_config = experiment_results.get_best_config(metric="episode_reward_mean"
#                                                  , mode="max")

# # View which checkpoint file
# # We would expect this to be either the very last checkpoint or one close to it:
# best_checkpoint = experiment_results.get_best_checkpoint(trial=best_trial, metric="episode_reward_mean", mode="max")
# print(f"Best checkpoint from training: {best_checkpoint}")


#### Step 2. Re-initialize an already-trained algorithm object from the checkpoint file


In [28]:
# Create new Algorithm and restore its state from the last checkpoint.

# create an empty Algorithm
algo = dqn_config.build()

# restore the agent from the checkpoint
algo.restore(checkpoint)

2022-09-16 09:26:57,353	INFO trainable.py:691 -- Restored on 10.0.117.196 from checkpoint: results_notebook/online_rl/dqn/checkpoint_000020
2022-09-16 09:26:57,354	INFO trainable.py:700 -- Current state after restoring: {'_iteration': 20, '_timesteps_total': None, '_time_total': 45.18734288215637, '_episodes_total': 1037}


#### Step 3. Play and render the game

Now we want to play the trained policy doing inference in the environment it was trained on.

<div class="alert alert-block alert-success">
✔ During inference, call the RLlib API method <b>compute_single_action()</b>: <br>

👍 Uses the trained <i>policy</i> (RL word for trained model) to compute the action for one observation. Call it at every time step of a <i>rollout</i> (RLlib word for episode during inference). 
</div>

⬇️ Below we play the game 100 times using the DQN already-trained policy.
<br>

In [29]:
#############
## Create the env to do inference on
#############
env = gym.make(env_name)
obs = env.reset()

# Use the restored algorithm from checkpoint and run it in inference mode
episode_reward = 0.0
done = False
num_episodes = 0
num_steps = 0

while num_episodes < 100:
    # Compute an action (`a`).
    a = algo.compute_single_action(observation=obs, explore=False)
    # Send the computed action `a` to the env.
    obs, reward, done, _ = env.step(a)
    episode_reward += reward
    num_steps += 1
    
    # Is the episode `done`? -> Reset.
    if done:
        obs = env.reset()
        num_episodes += 1

# calculate mean_reward
print()
print("**************")
mean_reward = episode_reward / num_episodes
print(f"DQN mean_reward: {mean_reward:.2f} out of success: {env_spec.reward_threshold} after {num_episodes} episodes or {num_steps} time steps")
print(f"DQN won {episode_reward} times over {num_episodes} plays (episodes)")
print("**************")



**************
DQN mean_reward: 0.62 out of success: 0.7 after 100 episodes or 4241 time steps
DQN won 62.0 times over 100 plays (episodes)
**************


<b>How does our inferenced policy compare to the Random baseline?</b><br>
- DQN wins ~62 times over 100 plays.  This is much higher than baseline!
> Baseline won ~53.0 times over 3000 plays (episodes) <br>


⬇️ Below we render the game using the DQN policy, so we can visually inspect the environment.

In [30]:
from ipywidgets import Output
from IPython import display
import time

# The following three lines are for rendering purposes only.
# They allow us to render the env frame-by-frame in-place
# (w/o creating a huge output which we would then have to scroll through).
out = Output()
display.display(out)
with out:

    #############
    ## Create the env to do inference on
    #############
    env = gym.make(env_name)
    obs = env.reset()

    #############
    ## Use the restored policy and run it in inference mode
    ## Run compute_single_action() in inference episodes loop
    ## You will see an ASCII rendering in-place for about 10 seconds
    #############
    episode_reward = 0.0
    done = False
    num_episodes = 0

    while num_episodes < 5:
        # Compute an action (`a`).
        a = algo.compute_single_action(observation=obs, explore=False)
        # Send the computed action `a` to the env.
        obs, reward, done, _ = env.step(a)
        episode_reward += reward

        # Is the episode `done`? -> Reset.
        if done:
            obs = env.reset()
            num_episodes += 1

        # Render the env (in place).
        time.sleep(0.3)
        out.clear_output(wait=True)
        print(f"episode: {num_episodes}")
        print(f"obs: {obs}, reward: {episode_reward}, done: {done}")
        env.render()


episode: 5
obs: 0, reward: 4.0, done: True

SFFF
FHFH
FFFH
HFFG


In [31]:
# To stop the Algorithm (and Env) and release its blocked resources, use:            
algo.stop()

### Summary

In this notebook, we have learned:
* What a gym Environment is, and how the gym.Env API is used to define sequential decision-making problems in Python code
* What RLlib looks like on the surface (where to find its algorithms and top-level APIs)
* How to train an RLlib algorithm using `.train()` and a built-in gym.Env ("frozen lake")
* Where to find checkpoint files, logs, TensorBoard files, etc.
* How to play and render some episodes from a gym.Env using a trained RLlib algorithm.

### References

1. [Reinforcement Learning: an introduction, by Sutton and Barto, book free download](http://incompleteideas.net/book/the-book-2nd.html)
2. [Anyscale tutorial blog explanation of Deep Q-Learning (DQN)](https://www.anyscale.com/blog/reinforcement-learning-with-deep-q-networks)
3. [OpenAI Gym Environments](https://www.gymlibrary.dev/)
4. [Ray doc page](https://docs.ray.io/en/master/)
5. [RLlib GitHub](https://github.com/ray-project/ray/tree/master/rllib)
6. [RLlib Algorithms doc page](https://docs.ray.io/en/master/rllib/rllib-algorithms.html)

In [32]:
# Shut down Ray if you are done
import ray
if ray.is_initialized():
    ray.shutdown()

## Thank you!

<a href="https://docs.google.com/forms/d/1pxsMIPMxTTd2HH6710UOApx_smPDPPO0fpVWYKzvOgI/edit">Survey</a> - Let us know how useful you have found this tutorial.

**We would love to connect with you!**

**Twitter** - @anyscalecompute | @raydistributed <br>
<b><a href="https://github.com/ray-project/ray">Github</a></b> - 😜 give us a star!<br>
<b><a href="https://www.ray.io/community">Slack</a></b> - [+invitation link](https://docs.google.com/forms/d/e/1FAIpQLSfAcoiLCHOguOm8e7Jnn-JJdZaCxPGjgVCvFijHB5PLaQLeig/viewform)<br>
<b><a href="https://discuss.ray.io/">Discuss</a></b> - searchable questions <br>