<a href="https://colab.research.google.com/github/albertometelli/rl-phd-2022/blob/main/01_getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started

This notebook is inspired by the Stable Baselines3 tutorial available at [https://github.com/araffin/rl-tutorial-jnrr19](https://github.com/araffin/rl-tutorial-jnrr19).


## Introduction

In this notebook, we will learn how to use **OpenAI Gym** environments and the basics of **Stable Baselines3**: how to instantiate an RL algorithm, train it, and evaluate it.

### Links

OpenAI Gym GitHub: [https://github.com/openai/gym](https://github.com/openai/gym)

OpenAI Gym Documentation: [https://www.gymlibrary.ml](https://www.gymlibrary.ml)

Stable Baselines3 GitHub: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)

Stable Baselines3 Documentation: [https://stable-baselines3.readthedocs.io/en/master/](https://stable-baselines3.readthedocs.io/en/master/)

## Install Dependencies and Stable Baselines3 Using Pip

In [None]:
!apt-get install ffmpeg freeglut3-dev xvfb  # For visualization
!pip install stable-baselines3[extra]

In [None]:
import stable_baselines3
stable_baselines3.__version__

## Video Recording

In Google Colab it is not possible to render the Gym environments directly, so we need to record a video and then play it back. Here are the helper functions.

In [None]:
# Set up fake display; otherwise rendering will fail
import os
import base64
from pathlib import Path
from IPython import display as ipythondisplay
import gym  # needed by record_video below
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv

os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

def show_videos(video_path='', prefix=''):
    """
    Taken from https://github.com/eleurent/highway-env

    :param video_path: (str) Path to the folder containing videos
    :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
    """
    html = []
    for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                    </video>'''.format(mp4, video_b64.decode('ascii')))
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))


def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
    """
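    :param env_id: (str) ID of the Gym environment to record
    :param model: (RL model) object exposing a `predict(obs)` method
    :param video_length: (int) number of steps to record
    :param prefix: (str) prefix of the video file name
    :param video_folder: (str) folder where the video is saved
    """
    eval_env = DummyVecEnv([lambda: gym.make(env_id)])
    # Start the video at step=0 and record video_length steps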
    eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

    obs = eval_env.reset()
    for _ in range(video_length):
        action, _ = model.predict(obs[0])
        obs, _, _, _ = eval_env.step([action])

    # Close the video recorder
    eval_env.close()

    
def render(env_id, policy, video_length=500, prefix='', video_folder='videos/'):
    record_video(env_id, policy, video_length, prefix, video_folder)
    show_videos(video_folder, prefix)

## Plotting

A helper function to plot the learning curves.

In [None]:
import matplotlib.pyplot as plt
import numpy as np


def plot_results(results):
    plt.figure()
    
    for k in results.keys():
        data = np.load(results[k] + '/evaluations.npz')
        ts = data['timesteps']
        res = data['results']
        _mean, _std = res.mean(axis=1), res.std(axis=1)

        plt.plot(ts, _mean, label=k)
        plt.fill_between(ts, _mean-_std, _mean+_std, alpha=.2)
        
    plt.xlabel('Timesteps')
    plt.ylabel('Average return')
    plt.legend(loc='lower right')
    
    plt.show()
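The `axis=1` reductions in `plot_results` assume that each row of `results` holds the returns of the episodes run in one evaluation round. The sketch below reproduces the same aggregation in plain Python on made-up numbers; the names `toy_timesteps` and `toy_results` and all values are hypothetical, not actual training output.

```python
from statistics import mean, pstdev

# Hypothetical evaluation log: one row per evaluation round,
# one column per evaluation episode (values are made up).
toy_timesteps = [2048, 4096, 6144]
toy_results = [
    [20.0, 22.0, 24.0],
    [90.0, 100.0, 110.0],
    [500.0, 500.0, 500.0],
]

# Equivalent of res.mean(axis=1) and res.std(axis=1):
# np.std defaults to the population standard deviation, hence pstdev.
row_means = [mean(row) for row in toy_results]
row_stds = [pstdev(row) for row in toy_results]

# The shaded band in the plot spans mean - std to mean + std.
for t, m, s in zip(toy_timesteps, row_means, row_stds):
    print(f"t={t}: mean={m:.1f}, band=[{m - s:.2f}, {m + s:.2f}]")
```

Each plotted point is therefore the average return over one evaluation round, and the shaded band shows the spread across the episodes of that round.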

## Initializing Environments

Environments in Gym are initialized as follows. A list of the available environments can be found [here](https://gym.openai.com/envs/#classic_control).

In [None]:
import gym
env = gym.make('CartPole-v1')

"A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. "

Cartpole Environment Description: [https://gym.openai.com/envs/CartPole-v1/](https://gym.openai.com/envs/CartPole-v1/)

Cartpole Source Code: [https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)

![Cartpole](https://cdn-images-1.medium.com/max/1143/1*h4WTQNVIsvMXJTCpXm_TAw.gif)

## Interacting with the Environment

We run an instance of the `CartPole-v1` environment for 30 timesteps, showing the information returned by the environment.

In [None]:
state = env.reset() # resets the environment to its initial state
print("Initial state: ", state)

for _ in range(30): 
    action = env.action_space.sample() # sample a random action
    
    state, reward, done, _ = env.step(action)  # execute the action in the environment
    print("State:", state,
          "Action:", action,
          "Reward:", reward,
          "Done:", done)
    
env.close()

A Gym environment provides the user with four main methods:

* `reset()`: resets the environment to its initial state $S_0 \sim d_0$ and returns the observation corresponding to the initial state.


* `step(action)`: takes an action $A_t$ as input and executes it in the current state $S_t$ of the environment. This method returns a tuple of four values:

    * `observation` (object): an environment-specific object representation of your observation of the environment after the action is executed. It corresponds to the observation of the next state $S_{t+1} \sim p(\cdot|S_t,A_t)$
    
    * `reward` (float): immediate reward $R_{t+1} = r(S_t,A_t)$ obtained by executing action $A_t$ in state $S_t$
    
    * `done` (boolean): whether the reached next state $S_{t+1}$ is a terminal state.
    
    * `info` (dict): additional, environment-specific information useful for debugging.
    
    
*  `render(mode='human')`: allows visualizing the agent in action. Note that the graphical interface does not work on Google Colab, so we cannot use it directly (we will need a workaround).


*  `seed()`: sets the seed for this environment’s random number generator.
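
To make the `reset()`/`step()` contract concrete, here is a minimal sketch of a toy environment that mimics the interface described above. `ToyCorridorEnv` is a hypothetical class written for illustration only; it is not part of Gym and does not inherit from `gym.Env`.

```python
import random


class ToyCorridorEnv:
    """Hypothetical 1-D corridor, NOT a Gym class: it only mimics reset()/step()."""

    def __init__(self, length=5, seed=None):
        self.length = length
        self.rng = random.Random(seed)
        self.position = 0

    def reset(self):
        # Sample the initial state S_0 and return its observation.
        self.position = self.rng.randrange(self.length - 1)
        return self.position

    def step(self, action):
        # Action 0 moves left, action 1 moves right (clipped at the left wall).
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position == self.length - 1  # rightmost cell is terminal
        reward = 1.0 if done else 0.0
        info = {}  # nothing extra to report in this toy example
        return self.position, reward, done, info


toy_env = ToyCorridorEnv(length=5, seed=0)
toy_obs = toy_env.reset()
toy_done = False
toy_steps = 0
while not toy_done:
    toy_obs, toy_reward, toy_done, toy_info = toy_env.step(1)  # always move right
    toy_steps += 1
print("Reached cell", toy_obs, "after", toy_steps, "steps; last reward:", toy_reward)
```

The interaction loop has the same shape as the one used with `CartPole-v1`: call `reset()` once, then call `step(action)` until `done` becomes `True`.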

## Observation and Action Spaces

*  `observation_space`: this attribute provides the format of valid observations $\mathcal{S}$. It is of datatype `Space` provided by Gym. For example, if the observation space is of type `Box` and the shape of the object is `(4,)`, a valid observation is an array of 4 numbers.

*  `action_space`: this attribute provides the format of valid actions $\mathcal{A}$. It is of datatype `Space` provided by Gym. For example, if the action space is of type `Discrete` and gives the value `Discrete(2)`, this means there are two valid discrete actions: 0 and 1.

In [None]:
print(env.observation_space)

print(env.action_space)

print(env.observation_space.high)

print(env.observation_space.low)

`Spaces` types available in Gym:

*  `Box`: an $n$-dimensional box, i.e., a product of intervals in $\mathbb{R}^n$ (compact when all bounds are finite). The bounds of the space are contained in the `high` and `low` attributes.


*  `Discrete`: a discrete space made of $n$ elements, where $\{0,1,\dots,n-1\}$ are the possible values.


Other `Spaces` types can be used: `Dict`, `Tuple`, `MultiBinary`, `MultiDiscrete`.
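
As a rough illustration of what a space provides (sampling a valid element and checking membership), the sketch below defines two hypothetical stand-ins, `ToyDiscrete` and `ToyBox`. They are not the Gym classes: the real `Discrete` and `Box` work with NumPy arrays and also handle dtypes and unbounded dimensions.

```python
import random


class ToyDiscrete:
    """Hypothetical stand-in for a discrete space {0, 1, ..., n-1}."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        return random.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class ToyBox:
    """Hypothetical stand-in for a box with finite per-dimension bounds."""

    def __init__(self, low, high):
        assert len(low) == len(high)
        self.low, self.high = list(low), list(high)

    def sample(self):
        return [random.uniform(lo, hi) for lo, hi in zip(self.low, self.high)]

    def contains(self, x):
        return len(x) == len(self.low) and all(
            lo <= xi <= hi for lo, xi, hi in zip(self.low, x, self.high)
        )


toy_actions = ToyDiscrete(2)
toy_box = ToyBox(low=[-1.0, -1.0, -1.0], high=[2.0, 2.0, 2.0])
print(toy_actions.sample(), toy_box.sample())
print(toy_actions.contains(2), toy_box.contains([0.0, 0.0, 3.0]))
```

The second `print` shows both membership checks failing: `2` is outside $\{0, 1\}$ and the third coordinate `3.0` exceeds the upper bound `2.0`.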

In [None]:
import numpy as np
from gym.spaces import Box, Discrete

observation_space = Box(low=-1.0, high=2.0, shape=(3,), dtype=np.float32)
print(observation_space.sample())

observation_space = Discrete(4)
print(observation_space.sample())

## Details on the Cartpole Environment 

From [https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.

### Action Space
The action space is `action` in $\{0,1\}$, where `action` is used to push the cart with a fixed amount of force:

| Num | Action                 |
|-----|------------------------|
| 0   | Push cart to the left  |
| 1   | Push cart to the right |
    
Note: The amount the velocity is reduced or increased is not fixed as it depends on the angle the pole is pointing. This is because the center of gravity of the pole increases the amount of energy needed to move the cart underneath it.
    
### Observation Space
The observation is a `ndarray` with shape `(4,)` where the elements correspond to the following:

| Num | Observation           | Min                   | Max                 |
|-----|-----------------------|-----------------------|---------------------|
| 0   | Cart Position         | -4.8*                 | 4.8*                |
| 1   | Cart Velocity         | -Inf                  | Inf                 |
| 2   | Pole Angle            | ~ -0.418 rad (-24°)** | ~ 0.418 rad (24°)** |
| 3   | Pole Angular Velocity | -Inf                  | Inf                 |


**Note:** the table above gives the range of possible observations for each element, but in two cases this range exceeds the range of values that can occur in an un-terminated episode:
- `*`: the cart x-position can be observed between `(-4.8, 4.8)`, but an episode terminates if the cart leaves the `(-2.4, 2.4)` range.
- `**`: Similarly, the pole angle can be observed between  `(-.418, .418)` radians or precisely **±24°**, but an episode is  terminated if the pole angle is outside the `(-.2095, .2095)` range or precisely **±12°**
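
The radian values quoted above can be checked with a short conversion. This is only a sanity check of the arithmetic; the variable names are illustrative.

```python
import math

# Termination threshold and observation bound for the pole angle, in degrees.
termination_deg = 12
observation_deg = 24

termination_rad = math.radians(termination_deg)
observation_rad = math.radians(observation_deg)

print(f"{termination_deg} deg = {termination_rad:.4f} rad")
print(f"{observation_deg} deg = {observation_rad:.4f} rad")
```

Both results agree with the quoted `.2095` and `.418` up to rounding, and the observation bound is exactly twice the termination threshold.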
    
### Rewards
Reward is 1 for every step taken, including the termination step.

### Starting State
All observations are assigned a uniform random value in (-0.05, 0.05).

### Episode Termination
The episode terminates if one of the following occurs:
1. Pole Angle is more than ±12°
2. Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)
3. Episode length is greater than 500
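
The three conditions can be sketched as a single predicate. The function `episode_is_over` and the constant names are hypothetical and are not taken from the Gym source. The time limit is modeled here as "stop once 500 steps have been taken", which is an assumption about how the maximum episode length is enforced.

```python
import math

X_THRESHOLD = 2.4                    # cart position limit
THETA_THRESHOLD = math.radians(12)   # pole angle limit, about 0.2095 rad
MAX_EPISODE_STEPS = 500              # assumed time limit for CartPole-v1


def episode_is_over(cart_position, pole_angle, steps_taken):
    """Return True if any of the three termination conditions holds."""
    pole_fell = abs(pole_angle) > THETA_THRESHOLD
    cart_out = abs(cart_position) > X_THRESHOLD
    time_up = steps_taken >= MAX_EPISODE_STEPS
    return pole_fell or cart_out or time_up


print(episode_is_over(0.0, 0.0, 10))    # balanced pole, centered cart
print(episode_is_over(2.5, 0.0, 10))    # cart beyond the position limit
print(episode_is_over(0.0, 0.25, 10))   # pole beyond the angle limit
```

Only the first call describes a state in which the episode continues.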


## Evaluation of some Simple Policies

We now evaluate some policies on the cartpole.

* **Uniform Policy**: uniformly random policy

$$
\pi(a|s) = \mathrm{Uni}(\{0,1\})
$$

* **Reactive Policy**: simple deterministic policy that selects the action based on the pole angle

$$
\pi(s) = \begin{cases}
                0 & \text{if Pole Angle } \le 0 \\
                1 & \text{otherwise}
            \end{cases}
$$

In [None]:
class UniformPolicy:
    
    def predict(self, obs):
        return np.random.randint(0, 2), obs  # return the observation to comply with stable-baselines3


class ReactivePolicy:
    
    def predict(self, obs):
        if obs[2] <= 0:
            return 0, obs
        else:
            return 1, obs

Let us create a function to evaluate the agent's performance.

In [None]:
def evaluate(env, policy, gamma=1., num_episodes=100):
    """
    Evaluate a RL agent
    :param env: (Env object) the Gym environment
    :param policy: (BasePolicy object) the policy in stable_baselines3
    :param gamma: (float) the discount factor
    :param num_episodes: (int) number of episodes to evaluate it
    :return: (float, float) mean discounted return over num_episodes and its standard error
    """
    all_episode_rewards = []
    for i in range(num_episodes): # iterate over the episodes
        episode_rewards = []
        discounter = 1.  # restart the discount at the beginning of each episode
        done = False
        obs = env.reset()
        while not done: # iterate over the steps until termination
            action, _ = policy.predict(obs)
            obs, reward, done, info = env.step(action)
            episode_rewards.append(reward * discounter) # compute discounted reward
            discounter *= gamma

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    std_episode_reward = np.std(all_episode_rewards) / np.sqrt(num_episodes - 1)
    print("Mean reward:", mean_episode_reward, 
          "Std reward:", std_episode_reward,
          "Num episodes:", num_episodes)

    return mean_episode_reward, std_episode_reward
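
The two quantities reported by `evaluate` can be illustrated in isolation. The sketch below is a plain-Python illustration with hypothetical helper names (`discounted_return`, `standard_error`): the per-episode discounted return $\sum_t \gamma^t R_{t+1}$, and the standard error of the mean computed as the population standard deviation divided by $\sqrt{n-1}$.

```python
import math


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode, with the discount starting at 1."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total


def standard_error(values):
    """Standard error of the mean, matching np.std(values) / sqrt(n - 1)."""
    n = len(values)
    mean = sum(values) / n
    population_var = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(population_var) / math.sqrt(n - 1)


# Three steps with reward 1 each and gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75

# Standard error of three hypothetical episode returns
print(standard_error([1.0, 2.0, 3.0]))
```

With `gamma=1.`, the default used in this notebook, the discounted return reduces to the plain sum of rewards, which for CartPole equals the episode length.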

Let us test the uniform policy.

In [None]:
uniform_policy = UniformPolicy()

uniform_policy_mean, uniform_policy_std = evaluate(env, uniform_policy)

render('CartPole-v1', uniform_policy, prefix='cartpole-uniform_policy')

Let us test the reactive policy.

In [None]:
reactive_policy = ReactivePolicy()

reactive_policy_mean, reactive_policy_std = evaluate(env, reactive_policy)

render('CartPole-v1', reactive_policy, prefix='cartpole-reactive_policy')

## PPO Training

We now use stable-baselines3 to train some simple algorithms. We start by using [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html).

We select the MlpPolicy, which is an alias of [ActorCriticPolicy](https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/policies.py), because the state of the CartPole environment is a feature vector (rather than an image, for instance). The type of action to use (discrete/continuous) is automatically deduced from the environment action space.

We consider two network architectures:

* Linear policy
* Policy with two hidden layers of 32 neurons each
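
To get a feel for the difference in capacity, the sketch below counts the parameters on the path from the observation to the action logits for both architectures. It assumes plain fully connected layers with biases; `count_dense_params` is a hypothetical helper, and the library may organize its modules differently.

```python
def count_dense_params(layer_sizes):
    """Number of weights and biases in a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


OBS_DIM, N_ACTIONS = 4, 2  # CartPole: 4 observation entries, 2 actions

linear_actor = count_dense_params([OBS_DIM, N_ACTIONS])
mlp_actor = count_dense_params([OBS_DIM, 32, 32, N_ACTIONS])

print("Linear actor parameters:", linear_actor)  # 10
print("32x32 actor parameters:", mlp_actor)      # 1282
```

Under these assumptions the linear policy has only 10 parameters, which makes its learned weights easy to inspect directly.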

In [None]:
from stable_baselines3 import PPO


# Instantiate the algorithm with 32x32 NN approximator for both actor and critic
ppo_mlp = PPO("MlpPolicy", env, verbose=1, 
                learning_rate=0.01,
                policy_kwargs=dict(net_arch = [dict(pi=[32, 32], vf=[32, 32])]))

print(ppo_mlp.policy)

# Instantiate the algorithm with linear approximator for both actor and critic
ppo_linear = PPO("MlpPolicy", env, verbose=1, 
                   learning_rate=0.01,
                   policy_kwargs=dict(net_arch = [dict(pi=[], vf=[])]))

print(ppo_linear.policy)

Let us now train the algorithms. To keep track of the performance during learning, we use an [EvalCallback](https://stable-baselines3.readthedocs.io/en/master/guide/callbacks.html).

In [None]:
# Separate evaluation env
eval_env = gym.make('CartPole-v1')

# Train each agent for 50000 steps
ppo_mlp.learn(total_timesteps=50000, eval_freq=2048, eval_env=eval_env,
              eval_log_path='./logs/cartpole/ppo_mlp', log_interval=4)

ppo_linear.learn(total_timesteps=50000, eval_freq=2048, eval_env=eval_env, 
              eval_log_path='./logs/cartpole/ppo_linear', log_interval=4)

Let us plot the learning curves.

In [None]:
results = {'PPO-MLP': './logs/cartpole/ppo_mlp',
           'PPO-LINEAR': './logs/cartpole/ppo_linear',}
        
plot_results(results)

In [None]:
# Evaluate the trained models
ppo_mlp_mean, ppo_mlp_std = evaluate(env, ppo_mlp)
render('CartPole-v1', ppo_mlp, prefix='ppo_mlp')

ppo_linear_mean, ppo_linear_std = evaluate(env, ppo_linear)
render('CartPole-v1', ppo_linear, prefix='ppo_linear')

Let us have a look at the weights learned by PPO with the linear policy. Since actions are discrete, the policy model is **softmax**:

$$
\pi_{\boldsymbol{\theta}}(a|\mathbf{s}) \propto \exp \left( \mathbf{s}^T \boldsymbol{\theta}(a) + b(a) \right)
$$

In [None]:
print(ppo_linear.policy.action_net.weight)
print(ppo_linear.policy.action_net.bias)
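
To see how such weights translate into action probabilities, the sketch below evaluates the softmax formula above in plain Python. The weights and biases are hypothetical values chosen for illustration, not the ones learned by PPO; the layout with one row of weights per action is an assumption that follows the usual `(out_features, in_features)` convention of a linear layer.

```python
import math


def softmax_policy(state, weights, biases):
    """Action probabilities proportional to exp(s^T theta(a) + b(a))."""
    logits = [
        sum(s_i * w_i for s_i, w_i in zip(state, theta_a)) + b_a
        for theta_a, b_a in zip(weights, biases)
    ]
    # Subtract the max logit for numerical stability; probabilities are unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights (one row per action) and biases, NOT learned values.
toy_weights = [[0.0, 0.0, -2.0, -1.0],   # theta(0): prefers "left" for negative angles
               [0.0, 0.0, 2.0, 1.0]]     # theta(1): prefers "right" for positive angles
toy_biases = [0.0, 0.0]

toy_state = [0.0, 0.0, 0.1, 0.0]  # pole tilted slightly to the right
toy_probs = softmax_policy(toy_state, toy_weights, toy_biases)
print(toy_probs)
```

With two actions, the softmax reduces to a logistic function of the difference between the two logits, so only the difference $\boldsymbol{\theta}(1) - \boldsymbol{\theta}(0)$ matters for the resulting behavior.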

## DQN Training

Let us now try [DQN](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) with an MlpPolicy as well.

In [None]:
from stable_baselines3 import DQN
from torch import nn


# Instantiate the algorithm with a 32x32 NN approximator for the Q-function
dqn_mlp = DQN("MlpPolicy", env, verbose=1, 
                learning_starts=3000,
                policy_kwargs=dict(net_arch = [32, 32], activation_fn=nn.Tanh))

print(dqn_mlp.policy)

In [None]:
# Train the agent for 50000 steps
dqn_mlp.learn(total_timesteps=50000, eval_freq=2048, eval_env=eval_env, 
              eval_log_path='./logs/cartpole/dqn_mlp', log_interval=100)

In [None]:
# Evaluate the trained model
dqn_mlp_mean, dqn_mlp_std = evaluate(env, dqn_mlp)
render('CartPole-v1', dqn_mlp, prefix='dqn_mlp')

Let us now plot the final results.

In [None]:
# Plot the training curves

results['DQN'] = './logs/cartpole/dqn_mlp'
plot_results(results)

In [None]:
# Plot the final results
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

algs = ['Random', 'Reactive', 'PPO MLP', 'PPO Linear', 'DQN']
means = [uniform_policy_mean, reactive_policy_mean, ppo_mlp_mean, ppo_linear_mean, dqn_mlp_mean]
errors = [uniform_policy_std, reactive_policy_std, ppo_mlp_std, ppo_linear_std, dqn_mlp_std]

ax.bar(algs, means, yerr=errors, align='center', alpha=0.5, ecolor='black', capsize=10)
plt.show()