# Exercise 02. Introduction to the OpenAI Gym Environment and RLlib Algorithm top-level APIs

© 2019-2022, Anyscale. All Rights Reserved

### Learning objectives
In this tutorial, we will learn about:
 * [What is an Environment in RL](#intro_env)
 * [Overview of RL terminology](#intro_rl)
 * [What is OpenAI Gym](#intro_gym)
 * [High-level OpenAI Gym API calls](#intro_gym_api)
 * [Introduction to RLlib](#intro_rllib)
 * [How to train a RLlib model on a Gym environment](#intro_rllib_api)
 

## What is an environment in RL? <a class="anchor" id="intro_env"></a>

Solving a problem in RL begins with an **environment**. In the simplest definition of RL:

> An **agent** interacts with an **environment** and receives a reward.

An environment in RL is the agent's world; it is a simulation of the problem to be solved. 

<img src="images/env_key_concept1.png" width="50%">

The environment simulator might be a:
<ul>
    <li>real, physical situation such as a gas turbine</li>
    <li>virtual system on a computer such as a board game or video game</li>
    </ul>
Why bother with an Agent and Environment?  

> RL is useful when you have sequential decisions that need to be optimized over time. 

Traditional supervised learning views the world as more of a one-shot training, not as action -> feedback -> improved action -> repeat.
<br> 

## Overview of RL terminology <a class="anchor" id="intro_rl"></a>

An RL environment consists of: 

1. all possible actions (**action space**)
2. a complete omniscient description of the environment, nothing hidden (**state space**)
3. an observation by the agent of certain parts of the state (**observation space**)
4. **reward**, which is the only feedback the agent receives per action.

The model that tries to maximize the expected sum over all future rewards is called a **policy**. The policy is a function mapping the environment's observations to an action to take, usually written **π** (s(t)) -> a(t).
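
As a minimal sketch of these terms (plain Python, not RLlib code), a policy can be viewed as a function from an observation to an action, and the quantity it tries to maximize as a sum of rewards, optionally discounted. The action encoding, the lookup table, and the reward lists below are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical action encoding for a small grid world.
ACTIONS = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}


def random_policy(observation: int) -> int:
    """Ignore the observation and pick any action uniformly at random."""
    return random.choice(list(ACTIONS))


def table_policy(observation: int) -> int:
    """Look up a fixed action per observation; default to RIGHT (2)."""
    lookup = {0: 1, 4: 1, 8: 2}  # hypothetical hand-written rules
    return lookup.get(observation, 2)


def episode_return(rewards, gamma=1.0):
    """Sum of rewards, with the reward at step t weighted by gamma ** t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


print(ACTIONS[table_policy(0)])              # action chosen for observation 0
print(episode_return([0.0, 0.0, 1.0]))       # undiscounted return
print(episode_return([0.0, 0.0, 1.0], 0.9))  # discounted return
```

A learned policy replaces the hand-written table with a trained model, but the interface (observation in, action out) stays the same.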

Below is a high-level image of how the Agent and Environment work together in a RL simulation feedback loop in RLlib.

<img src="images/env_key_concept2.png" width="98%">

The **RL simulation feedback loop** repeatedly collects data for one (single-agent case) or multiple (multi-agent case) policies, trains the policies on the collected data, and keeps the policies' weights in sync. The collected environment data contains observations, actions taken, rewards received, and so-called **done** flags, which mark the boundaries between the episodes the agents play through in the simulation.

The sequence of simulation iterations action -> reward -> next state -> train -> repeat, up to the end state, is called an **episode**, or in RLlib, a **rollout**.
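
To illustrate how **done** flags mark episode boundaries, here is a small sketch that splits flat, per-step data into episodes. The helper name and the sample data are hypothetical; RLlib has its own batch classes for this, which are not shown here.

```python
def split_into_episodes(observations, actions, rewards, dones):
    """Group flat per-step lists into episodes, cutting after each done=True."""
    episodes, current = [], []
    for step in zip(observations, actions, rewards, dones):
        current.append(step)
        if step[3]:  # the done flag marks the last step of an episode
            episodes.append(current)
            current = []
    if current:  # trailing steps of an episode that has not finished yet
        episodes.append(current)
    return episodes


# Hypothetical collected data: two finished episodes and one still running.
obs = [0, 1, 5, 0, 4, 0]
acts = [2, 1, 0, 1, 2, 2]
rews = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
dones = [False, False, True, False, True, False]

for i, ep in enumerate(split_into_episodes(obs, acts, rews, dones)):
    total = sum(step[2] for step in ep)
    print(f"episode {i}: {len(ep)} steps, total reward {total}")
```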

Per episode, the RL simulation feedback loop repeats up to some specified end state (termination state or timesteps). Examples of termination could be:
<ul>
    <li>the end of a maze (termination state)</li>  
    <li>the player died in a game (termination state)</li>
    <li>after 60 videos watched in a recommender system (timesteps).</li>
    </ul>
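
The two kinds of stopping condition can be sketched with a toy environment. `CorridorEnv` and `run_episode` below are hypothetical names written for this illustration; only the `reset()` / `step()` shape and the 4-tuple return mirror the Gym API described later in this tutorial.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right; any other action moves left (bounded at 0).
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position == self.length - 1  # termination state reached
        reward = 1.0 if done else 0.0
        return self.position, reward, done, {}


def run_episode(env, policy, max_steps=20):
    """Run one episode; stop at the goal or after max_steps time steps."""
    obs = env.reset()
    total_reward = 0.0
    for t in range(1, max_steps + 1):
        obs, reward, done, _ = env.step(policy(obs))
        total_reward += reward
        if done:
            return total_reward, t, "termination state"
    return total_reward, max_steps, "time limit"


env = CorridorEnv(length=5)
print(run_episode(env, lambda obs: 1))  # always move right: reaches the goal
print(run_episode(env, lambda obs: 0))  # always move left: hits the time limit
```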

## OpenAI Gym example: frozen lake <a class="anchor" id="intro_gym"></a>

[OpenAI Gym](https://gym.openai.com/) is a well-known reference library of RL environments. 

#### 1. import gym

Below is how you would import gym and view all available environments.

In [1]:
# import libraries
import gym
print(f"gym: {gym.__version__}")

# List all available gym environments
all_env = list(gym.envs.registry.all())
print(f'Num Gym Environments: {len(all_env)}')

# # You could loop through and list all environments if you wanted
# [print(e) for e in all_env]

gym: 0.21.0
Num Gym Environments: 103


#### 2. Instantiate your Gym object

The way you instantiate a Gym environment is with the **make()** function.

The .make() function takes arguments:
- **name of the Gym environment**, type: str, Required.
- **runtime parameter values**, Optional.

For the required string argument, you need to know the Gym name.  You can find the Gym name in the Gym documentation for environments, either:
<ol>
    <li>The doc page in <a href="https://www.gymlibrary.ml/environments/toy_text/frozen_lake/">Gym's website</a></li>
    <li>The environment's <a href="https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py">source code </a></li>
    <li>
        <a href="https://www.gymlibrary.ml/environments/classic_control/cart_pole/#description">Research paper (if one exists)</a> referenced in the environment page </li>
    </ol>
    
Below is an example of how to create a basic Gym environment, [frozen lake](https://www.gymlibrary.ml/environments/toy_text/frozen_lake/).  We can see below that the termination condition of an episode will be time steps.

In [2]:
env_name = "FrozenLake-v1"
env_runtime_param_value = False

# Instantiate gym env object with a runtime parameter value
env = gym.make(
        env_name, 
        is_slippery=env_runtime_param_value)

# inspect the gym spec for the environment
print(f"env: {env}")
env_spec = env.spec
print(f"env_spec: {env_spec}")

# Note: "TimeLimit" means termination condition for an episode will be time steps

env: <TimeLimit<FrozenLakeEnv<FrozenLake-v1>>>
env_spec: EnvSpec(FrozenLake-v1)


#### 3. Inspect the environment action and observations spaces

Gym Environments can be deterministic or stochastic.

<ul>
    <li>
        <b>Deterministic</b> if the current state + selected action determines the next state of the environment.  Chess is a deterministic environment, since all possible states/action combinations can be described as a discrete set of rules with states bounded by the pieces and size of the board.</li>
    <li>
        <b>Stochastic</b> if the current state + selected action do not fully determine the next state; instead, the next state is sampled from a probability distribution over possible next states. Stochastic environments are random in nature.  Random visitors to a website is an example of a stochastic environment. (Separately, a <i>policy</i> is stochastic if its output is a probability distribution over the set of possible actions at time step t. In this case the agent computes its action in two steps: i) sample an action from the policy according to the probability distribution, ii) compute the log likelihood of that action.)</li>
    </ul>
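
To make the distinction concrete, the sketch below contrasts a deterministic move on a 4x4 grid with a "slippery" move. The slip rule (the intended direction or either perpendicular direction, each with probability 1/3) follows the description in the Gym frozen lake documentation; the helper names are hypothetical and not part of the Gym API.

```python
import random

# Action encoding: 0: LEFT, 1: DOWN, 2: RIGHT, 3: UP
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}
SIZE = 4


def deterministic_step(row, col, action):
    """The same state and action always give the same next state."""
    d_row, d_col = MOVES[action]
    new_row = min(max(row + d_row, 0), SIZE - 1)  # stay inside the grid
    new_col = min(max(col + d_col, 0), SIZE - 1)
    return new_row, new_col


def slippery_step(row, col, action, rng=random):
    """Move in the intended or a perpendicular direction, each with p = 1/3."""
    actual = rng.choice([(action - 1) % 4, action, (action + 1) % 4])
    return deterministic_step(row, col, actual)


print(deterministic_step(0, 0, 2))                   # always the same result
print({slippery_step(1, 1, 2) for _ in range(200)})  # several possible results
```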

<b>Gym actions.</b> The action_space describes the numerical structure of the legitimate actions that can be applied to the environment. 

For example, if we have 4 possible discrete actions, we could encode them as:
<ul>
    <li>0: LEFT</li>
    <li>1: DOWN</li>
    <li>2: RIGHT</li>
    <li>3: UP</li>
</ul>

<b>Gym observations.</b>  The observation_space defines the structure as well as the legitimate values for the observation of the state of the environment.  

For example, if we have a 4x4 grid, we could encode the 16 positions as {0, 1, 2, 3, 4, …, 15} for grid positions ((0,0), (0,1), (0,2), (0,3), …, (3,3)).
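
One way to realize this encoding is row-major flattening, sketched below. The function names are hypothetical; the formula `row * n_cols + col` is the usual way to number grid cells row by row.

```python
N_COLS = 4  # width of the 4x4 grid used in this example


def to_index(row: int, col: int, n_cols: int = N_COLS) -> int:
    """Flatten a (row, col) grid position into a single integer, row by row."""
    return row * n_cols + col


def to_position(index: int, n_cols: int = N_COLS):
    """Recover (row, col) from the flattened integer."""
    return divmod(index, n_cols)


print(to_index(0, 0), to_index(0, 3), to_index(3, 3))  # 0 3 15
print(to_position(5))                                   # (1, 1)
```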


From the Gym [documentation](https://www.gymlibrary.ml/environments/toy_text/frozen_lake/) about the frozen lake environment, we see: <br>

|Frozen Lake      | Gym space   |
|---------------- | ----------- |
|Action Space     | Discrete(4) |
|Observation Space| Discrete(16)|


 
<b><a href="https://github.com/openai/gym/tree/master/gym/spaces">Gym spaces</a></b> are gym data types.  The main types are `Discrete` for discrete numbers and `Box` for continuous numbers.  

Elements of a Gym `Discrete` space are integers (Python type `int`), and elements of a Gym `Box` space are NumPy arrays whose default dtype is `float32`.

Below is an example how to inspect the environment action and observations spaces.


In [3]:
# check if it is a gym instance
if isinstance(env, gym.Env):
    print("This is a gym environment.")
    print()

    # print gym Spaces
    if isinstance(env.action_space, gym.spaces.Space):
        print(f"gym action space: {env.action_space}")
    if isinstance(env.observation_space, gym.spaces.Space):
        print(f"gym observation space: {env.observation_space}") 

This is a gym environment.

gym action space: Discrete(4)
gym observation space: Discrete(16)


#### 4. Inspect gym environment parameters

Gym environments contain 2 sets of configuration parameters that are set after the environment object is instantiated.
<ul>
    <li><b>Runtime parameters</b> are passed into the make() function as **kwargs.</li>
    <li><b>Default parameters</b> are fixed in the Gym environment code.</li>
    </ul>

Below is an example of how to inspect the environment parameters.  

Notice we can tell from the parameters that our frozen lake environment is: 
1) Deterministic, and 
2) Episode terminates with time step condition max_episode_steps = 100.

In [4]:
# inspect env.spec parameters
 
# View runtime **kwargs .spec params.  These params set after env instantiated.
# print(f"type(env_spec._kwargs): {type(env_spec._kwargs)}") #dict
print("Runtime spec params...")
for k, v in env_spec._kwargs.items():
    print(f"{k}: {v}")
print()
 
# View default env spec params
# Default parameters are fixed
print("Default spec params...")
print(f"id: {env_spec.id}")
print(f"entry_point: {env_spec.entry_point}")
print(f"reward_threshold: {env_spec.reward_threshold}")
print(f"nondeterministic: {env_spec.nondeterministic}")
print(f"max_episode_steps: {env_spec.max_episode_steps}")
print(f"order_enforce: {env_spec.order_enforce}")

# We can tell that our frozen lake environment is: 
# 1) Deterministic, and 
# 2) Episode terminates with condition max_episode_steps = 100


Runtime spec params...
map_name: 4x4
is_slippery: False

Default spec params...
id: FrozenLake-v1
entry_point: gym.envs.toy_text:FrozenLakeEnv
reward_threshold: 0.7
nondeterministic: False
max_episode_steps: 100
order_enforce: True


#### 5. Perform some basic Gym API calls <a class="anchor" id="intro_gym_api"></a>

The most basic Gym API methods are:
<ul>
    <li><b>env.reset()</b> <br>Reset the environment to an initial state. This is how you initialize an environment so you can run a simulation on it.  You should call this method every time you start a new episode.</li>
    <li><b>env.render()</b>  <br>Visually inspect the environment anytime. Note you cannot inspect an environment before it has been initialized with env.reset().</li>
    <li><b>env.step(action)</b> <br>Take an action from the possible action space values.  It accepts an action, computes the state of the environment after applying that action and returns the 4-tuple (observation, reward, done, info).</li>
    <li><b>env.close()</b> <br>Close an environment.</li>
    </ul>

In [5]:
# Print the starting observation.  Recall possible observations are between 0-15.
print(env.reset())
env.render()

0

SFFF
FHFH
FFFH
HFFG


In [6]:
# Take an action
# Recall the possible actions are: 0: LEFT, 1: DOWN, 2: RIGHT, 3: UP

new_obs, reward, done, _ = env.step(2) #Right
print(f"obs: {new_obs}, reward: {reward}, done: {done}")
env.render()
new_obs, reward, done, _ = env.step(1) #Down
print(f"obs: {new_obs}, reward: {reward}, done: {done}")
env.render()

obs: 1, reward: 0.0, done: False
  (Right)
SFFF
FHFH
FFFH
HFFG
obs: 5, reward: 0.0, done: True
  (Down)
SFFF
FHFH
FFFH
HFFG


We can also try to run an action in the frozen lake environment which is outside the defined number range.

In [7]:
# # Comment this cell if you want whole notebook to run without errors

# # Try to take an invalid action

# env.step(4) # invalid

# # should see KeyError below

In [8]:
# Putting the simple API methods together.
# Here is a pattern for running a bunch of episodes.
 
num_episodes = 1000 # Number of episodes you want to run the agent
render_freq = 200  # Render every X number of episodes 
total_reward = 0  # Cumulative reward, summed over all episodes

# Loop through episodes
for ep in range(num_episodes):

    # Reset the environment at the start of each episode
    obs = env.reset()
    
    # Loop through time steps per episode
    while True:
        # take random action, but you can also do something more intelligent 
        action = env.action_space.sample()

        # apply the action
        new_obs, reward, done, info = env.step(action)
        total_reward += reward

        # If the episode is over, then start another one
        if done:
            break
            
    # Render the env only every render_freq episodes
    if ep % render_freq == 0:
        print(f"episode: {ep}")
        print(f"obs: {new_obs}, reward: {total_reward}, done: {done}")
        env.render()

# Close the env
env.close()

episode: 0
obs: 5, reward: 0.0, done: True
  (Right)
SFFF
FHFH
FFFH
HFFG
episode: 200
obs: 5, reward: 1.0, done: True
  (Down)
SFFF
FHFH
FFFH
HFFG
episode: 400
obs: 5, reward: 3.0, done: True
  (Down)
SFFF
FHFH
FFFH
HFFG
episode: 600
obs: 5, reward: 9.0, done: True
  (Down)
SFFF
FHFH
FFFH
HFFG
episode: 800
obs: 5, reward: 11.0, done: True
  (Down)
SFFF
FHFH
FFFH
HFFG


## Overview of RLlib <a class="anchor" id="intro_rllib"></a>

<img width="7%" src="images/rllib-logo.png"> is the most comprehensive open-source Reinforcement Learning framework. **[RLlib](https://github.com/ray-project/ray/tree/master/rllib)** is built on top of **[Ray](https://docs.ray.io/en/latest/)**, an easy-to-use, open-source, distributed computing framework for Python that can handle complex, heterogeneous applications. Ray and RLlib run on compute clusters on any cloud without vendor lock.

RLlib includes 25+ available [algorithms](https://docs.ray.io/en/master/rllib/rllib-algorithms.html), converted to both <img width="3%" src="images/tensorflow-logo.png">_TensorFlow_ and <img width="3%" src="images/pytorch-logo.png">_PyTorch_, covering different sub-categories of RL: _model-based_, _model-free_, and _Offline RL_. Almost any RLlib algorithm can learn in a multi-agent setting. Many algorithms support RNNs and LSTMs.

**Environments in RLlib**

To take advantage of Ray distributed parallel processing and vectorization, we can implement any environment as inheriting from [rllib.env.BaseEnv](https://github.com/ray-project/ray/blob/master/rllib/env/base_env.py).  See the [API documentation](https://docs.ray.io/en/latest/rllib/package_ref/env.html).  

By default, RLlib automatically converts all OpenAI Gym environments to RLlib BaseEnv. RLlib environments use a Gym wrapper, which means **Gym APIs can be used in RLlib**.



#### 1.  Import ray

In [9]:
# import libraries
import ray
from ray import tune
from ray.tune.logger import pretty_print
print(f"ray: {ray.__version__}")

ray: 3.0.0.dev0


#### 2. Check environment for errors

Before you start training, it is a good idea to check the environment for errors.  We can use a convenient [RLlib function](https://github.com/ray-project/ray/blob/master/rllib/utils/pre_checks/env.py#L22) for this.  

In [10]:
from ray.rllib.utils.pre_checks.env import check_env

# How to check you do not have any environment errors
print("checking environment ...")
try:
    check_env(env)
    print("All checks passed. No errors found.")
except Exception as e:
    # check_env raises an exception that describes the problem it found
    print(f"failed: {e}")

checking environment ...
All checks passed. No errors found.


## Train a Gym environment using an algorithm from RLlib <a class="anchor" id="intro_rllib_api"></a>

Roughly, RLlib is organized by **environments**, **algorithms**, **examples**, and **tuned_examples**.  

    ray
    |- rllib
    |  |- env 
    |  |- algorithms 
    |  |  |- alpha_zero 
    |  |  |- appo 
    |  |  |- ppo 
    |  |  |- ... 
    |  |- examples 
    |  |- tuned_examples

Within **_examples_** you will find code patterns for frequently asked questions.  

Within **_tuned_examples_**, you will find, sorted by algorithm, suggested hyperparameter values in .yaml files. The Ray RLlib team ran simulations/benchmarks to find these suggested values.  These files are used for daily testing and weekly hard-task testing to make sure they all run at speed, for both TF and Torch.  They give you a leg up on parameter choices!

In this tutorial, we will mainly focus on **_algorithms_**, where we will find RLlib algorithms to train RLlib models on environments.

#### 3.  Select an algorithm and instantiate that algorithm's config object  

To find an algorithm's implementation and example code:
<ol>
    <li><a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html">Open RLlib docs</a></li>
    <li>Scroll down and click url of algo you're searching for, e.g. <i><b>PPO</b></i></li>
    <li>On the <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html#ppo">algo docs page </a>, click on the link <i><b>Implementation</b></i></li>
    <li>Search github file for the word <i><b>trainer</b></i></li>
    <li>Typically the docstring example will show: </li>
    <ol>
        <li>Example code implementing RLlib API, then </li>
        <li>Example code implementing Ray Tune API.</li>
    </ol>
    </ol>

In [32]:
# NEW WAY TO TRAIN SINCE Ray version >= 1.13
# config is an object instead of a dictionary

from ray.rllib.algorithms.ppo import PPOConfig

# Define algorithm config values
env_name = "CartPole-v1"
evaluation_interval = 2   #100, num training iterations to run between eval steps
evaluation_duration = 20  #100, num eval episodes to run for the eval step
num_workers = 4          # +1 for head node, num parallel workers or actors for rollouts
num_gpus = 0             # num gpus to use in the cluster
num_envs_per_worker = 4  #4, num vectorization of environments to run at same time

# Define trainer runtime config values
checkpoint_freq = evaluation_interval # freq save checkpoints >= evaluation_interval
relative_checkpoint_dir = "my_PPO_logs" # redirect logs instead of ~/ray_results/

# uncomment below to see the long list of default algorithm config values
# print(pretty_print(PPOConfig().to_dict()))

# Create a new training config
# override certain default algorithm config values
config_train = (
    PPOConfig()
    .framework(framework='torch')
    .environment(env=env_name, disable_env_checking=False)
    .rollouts(num_rollout_workers=num_workers, num_envs_per_worker=num_envs_per_worker)
    .resources(num_gpus=num_gpus)
#     .training(gamma=0.9, lr=0.01, kl_coeff=0.3)
    .evaluation(evaluation_interval=evaluation_interval, 
                evaluation_duration=evaluation_duration)
)

print(type(config_train))

<class 'ray.rllib.algorithms.ppo.ppo.PPOConfig'>


#### 4. Instantiate a Trainer from the config object

**Three ways to train RLlib models**
<ol>
    <li><a href="https://docs.ray.io/en/master/rllib/package_ref/index.html">RLlib API.</a> The main methods we will use in this tutorial are:</li>
    <ul>
        <li>evaluate()</li>
        <li>save()</li>
        <li>restore()</li>
    </ul>
    <li><a href="https://docs.ray.io/en/master/tune/api_docs/overview.html">Ray Tune API.</a>  The main methods we will use in this tutorial are:</li>
        <ul>
        <li>run()</li>
    </ul>
    <li>RLlib CLI from command line: <i>rllib train -f [myfile_name].yml</i></li>
    </ol>
    
We will cover Options 1-2 below.

<b>Example Option 1: train RLlib using RLlib API .train() method</b>

The code below shows how to instantiate the trainer and train the algorithm on the environment for a single training iteration.  

To train for N iterations, you would put _.train()_ into a loop, similar to the way we ran env.step() in a loop.
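
The loop pattern can be sketched without Ray installed by using a stand-in object. `StubAlgorithm` and `train_loop` are hypothetical names written for this illustration, and the rewards the stub reports are invented. The sketch assumes only that `.train()` returns a result dictionary with the keys `training_iteration` and `episode_reward_mean`.

```python
class StubAlgorithm:
    """Hypothetical stand-in for a built RLlib algorithm object."""

    def __init__(self):
        self.iteration = 0

    def train(self):
        # Pretend the mean episode reward improves by 50 per iteration.
        self.iteration += 1
        return {
            "training_iteration": self.iteration,
            "episode_reward_mean": 50.0 * self.iteration,
        }


def train_loop(algo, num_iterations, target_reward=None):
    """Call algo.train() repeatedly, optionally stopping early at a target."""
    results = []
    for _ in range(num_iterations):
        result = algo.train()
        results.append(result)
        print(f"iter {result['training_iteration']}: "
              f"episode_reward_mean={result['episode_reward_mean']}")
        reached = (target_reward is not None
                   and result["episode_reward_mean"] >= target_reward)
        if reached:
            break
    return results


history = train_loop(StubAlgorithm(), num_iterations=10, target_reward=400)
```

With a real algorithm object you would pass the object returned by `.build()` in place of the stub.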

In [33]:
# To start fresh, restart Ray in case it is already running
if ray.is_initialized():
    ray.shutdown()

In [38]:
# Here is how you build a RLlib trainer and train it using RLlib API

# Use .build() similar to how gym environments are passed to the gym .make() method.
trainer = config_train.build(env=env_name)

print(type(trainer))

# run the trainer for 1 training iteration
# trainer.train()

# Below, you will see a lot of output.  In fact you will see the output 100 times.



<class 'ray.rllib.algorithms.ppo.ppo.PPO'>


<b>Example Option 2: train RLlib using Ray Tune API .run() method</b>

From the above cell, you can see how to train a RLlib algorithm one iteration at a time.  But it is more practical to train RLlib algorithms using Ray Tune.

Many more options are available for training using Ray Tune.  For example, in the code below, we specify a stopping criterion instead of having to specify an exact number of training iterations.

In [16]:
# To start fresh, restart Ray in case it is already running
if ray.is_initialized():
    ray.shutdown()

In [34]:
# Here is how you run Ray Tune to train a RLlib model.
# Note: tune.run() returns an ExperimentAnalysis object, not a trainer.

trainer = tune.run(
    "PPO",

    # Stopping criterion: average reward over training episodes
    stop={"episode_reward_mean": 400},  # a better target is 400 out of max 500

    # training config params
    config=config_train.to_dict(),

    # # OLD WAY TO PASS training config params
    # config = config_train,

    # redirect logs instead of default ~/ray_results/
    local_dir=relative_checkpoint_dir,  # relative path

    # set frequency of saving checkpoints >= evaluation_interval
    checkpoint_freq=checkpoint_freq,

    # Reduce logging messages
    verbose=1,
)


(PPO pid=54685) 2022-06-25 16:45:38,554	INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPO pid=54685) 2022-06-25 16:45:38,554	INFO algorithm.py:332 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2022-06-25 16:46:39,441	INFO tune.py:737 -- Total run time: 63.88 seconds (63.34 seconds for the tuning loop).


In [None]:
# Shut down Ray if you are done
if ray.is_initialized():
    ray.shutdown()

#### 5. Understand the results of training

How long did it take?

In [35]:
stats = trainer.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

  60.38 seconds,    1.01 minutes


In [36]:
# Get Tune training results in a pandas dataframe
df = trainer.results_df
print(df.shape)  #Only 1 trial
df.columns

(1, 407)


Index(['episode_reward_max', 'episode_reward_min', 'episode_reward_mean',
       'episode_len_mean', 'episodes_this_iter', 'num_healthy_workers',
       'num_agent_steps_sampled', 'num_agent_steps_trained',
       'num_env_steps_sampled', 'num_env_steps_trained',
       ...
       'info/learner/default_policy/learner_stats/total_loss',
       'info/learner/default_policy/learner_stats/policy_loss',
       'info/learner/default_policy/learner_stats/vf_loss',
       'info/learner/default_policy/learner_stats/vf_explained_var',
       'info/learner/default_policy/learner_stats/kl',
       'info/learner/default_policy/learner_stats/entropy',
       'info/learner/default_policy/learner_stats/entropy_coeff',
       'config/evaluation_config/tf_session_args/gpu_options/allow_growth',
       'config/evaluation_config/tf_session_args/device_count/CPU',
       'config/evaluation_config/multiagent/policies/default_policy'],
      dtype='object', length=407)

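How many episodes ran in the final training iteration?

In [37]:
# Get the number of episodes sampled in the last training iteration of the 1st trial.
# Note: episodes_this_iter counts one iteration only; episodes_total holds the cumulative count.
df.iloc[0,:].episodes_this_iter  

#Answer is 14 episodes in the last iteration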

14

What were the best parameter values?  


In [43]:
# Watch out, the following cell output is long because there are many parameters!

# trainer.get_best_config(metric="episode_reward_mean", mode="max")

Where is the best model checkpoint file?

In [44]:
# Get best checkpoint path
logdir = trainer.get_best_logdir(metric="evaluation_reward_mean", mode="max")
logdir

'/Users/christy/Documents/github_ray_summit_2022/ray-summit-2022-training/ray-rllib/my_PPO_logs/PPO/PPO_CartPole-v1_a4705_00000_0_2022-06-25_15-46-25'

#### 6. Visualize the training progress in TensorBoard

RLlib automatically creates logs for your trained RLlib models that can be visualized in TensorBoard.  To visualize the performance of your RL model:

<ol>
    <li>Open a terminal</li>
    <li><i><b>cd</b></i> into the logdir path from the above cell's output.</li>
    <li><i><b>ls</b></i></li>
    <li>You should see files that look like: checkpoint_NNNNNN</li>
    <li>To be able to compare all your experiments, cd one dir level up.</li>
    <li><i><b>cd ..</b></i></li>
    <li><i><b>tensorboard --logdir . </b></i></li>
    <li>Look at the url in the message, and open it in a browser</li>
        </ol>

#### Screenshot of Tensorboard

TensorBoard will give you many pages of charts.  Below we display just the Train/Eval mean and min rewards.

<b>Train Performance:</b> <br>

<img src="images/ppo_cartpole_training_rewards.png" width="80%">

<b>Eval Performance:</b> <br>
<img src="images/ppo_cartpole_evaluation_rewards.png" width="80%">

### Summary


### Exercises

1. Look at the Gym [documentation](https://www.gymlibrary.ml/environments/toy_text/frozen_lake/) for the frozen lake environment.  Can you figure out how to change the environment to be **stochastic?**  Hint: change the kwarg `is_slippery`.
2. Look at the runtime parameters for the Gym frozen lake environment.  How would you change the number of time steps per episode to be 200 instead of 100?
3. How would you change the choice of RLlib algorithm from PPO to DQN?
