<a href="https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/master/stable_baselines_getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Baselines, a Fork of OpenAI Baselines - Getting Started

Github Repo: [https://github.com/hill-a/stable-baselines](https://github.com/hill-a/stable-baselines)

Medium article: [https://medium.com/@araffin/stable-baselines-a-fork-of-openai-baselines-df87c4b2fc82](https://medium.com/@araffin/stable-baselines-a-fork-of-openai-baselines-df87c4b2fc82)

[RL Baselines Zoo](https://github.com/araffin/rl-baselines-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.

Documentation is available online: [https://stable-baselines.readthedocs.io/](https://stable-baselines.readthedocs.io/)


## Install Dependencies and Stable Baselines Using Pip

List of full dependencies can be found in the [README](https://github.com/hill-a/stable-baselines).

```

sudo apt-get update && sudo apt-get install cmake libopenmpi-dev zlib1g-dev
```


```

pip install stable-baselines[mpi]
```

In [0]:
!pip install stable-baselines[mpi]==2.8.0

## Import policy, RL agent, ...

In [0]:
import gym
import numpy as np

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2

## Create the Gym env and instantiate the agent

For this example, we will use CartPole environment, a classic control problem.

"A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. "

Cartpole environment: [https://gym.openai.com/envs/CartPole-v1/](https://gym.openai.com/envs/CartPole-v1/)

![Cartpole](https://cdn-images-1.medium.com/max/1143/1*h4WTQNVIsvMXJTCpXm_TAw.gif)

Note: vectorized environments allow to easily multiprocess training. In this example, we are using only one process, hence the DummyVecEnv.

We chose the MlpPolicy because input of CartPole is a feature vector, not images.

The type of action to use (discrete/continuous) will be automatically deduced from the environment action space


Here we are using the [Proximal Policy Optimization](https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html) algorithm (PPO2 is the version optimized for GPU), which is an actor-crtic methods: it uses a value function to improve the policy gradient descent (by reducing the variance).

It combines ideas from [A2C](https://stable-baselines.readthedocs.io/en/master/modules/a2c.html) (having multiple workers and using an entropy bonus for exploration) and [TRPO](https://stable-baselines.readthedocs.io/en/master/modules/trpo.html) (it uses a trust region to improve stability and avoid catastrophic drop in performance).

PPO is an on-policy algorithm, which means that the trajectories used to update the networks must be collected using the latest policy.
It is usually less sample efficient than off-policy alorithms like [DQN](https://stable-baselines.readthedocs.io/en/master/modules/dqn.html), [SAC](https://stable-baselines.readthedocs.io/en/master/modules/sac.html) or [TD3](https://stable-baselines.readthedocs.io/en/master/modules/td3.html), but is much faster regarding wall-clock time.


In [0]:
env = gym.make('CartPole-v1')
# vectorized environments allow to easily multiprocess training
# we demonstrate its usefulness in the next examples
env = DummyVecEnv([lambda: env])  # The algorithms require a vectorized environment to run

model = PPO2(MlpPolicy, env, verbose=0)

We create a helper function to evaluate the agent:

In [0]:
def evaluate(model, num_steps=1000):
  """
  Evaluate a RL agent
  :param model: (BaseRLModel object) the RL Agent
  :param num_steps: (int) number of timesteps to evaluate it
  :return: (float) Mean reward for the last 100 episodes
  """
  episode_rewards = [0.0]
  obs = env.reset()
  for i in range(num_steps):
      # _states are only useful when using LSTM policies
      action, _states = model.predict(obs)
      # here, action, rewards and dones are arrays
      # because we are using vectorized env
      obs, rewards, dones, info = env.step(action)
      
      # Stats
      episode_rewards[-1] += rewards[0]
      if dones[0]:
          obs = env.reset()
          episode_rewards.append(0.0)
  # Compute mean reward for the last 100 episodes
  mean_100ep_reward = round(np.mean(episode_rewards[-100:]), 1)
  print("Mean reward:", mean_100ep_reward, "Num episodes:", len(episode_rewards))
  
  return mean_100ep_reward

Let's evaluate the un-trained agent, this should be a random agent.

In [5]:
# Random Agent, before training
mean_reward_before_train = evaluate(model, num_steps=10000)

Mean reward: 21.2 Num episodes: 436


## Train the agent and evaluate it

In [6]:
# Train the agent for 10000 steps
model.learn(total_timesteps=10000)

<stable_baselines.ppo2.ppo2.PPO2 at 0x7f5a7b7e47f0>

In [7]:
# Evaluate the trained agent
mean_reward = evaluate(model, num_steps=10000)

Mean reward: 250.0 Num episodes: 40


Apparently the training went well, the mean reward increased a lot ! 

## Bonus: Train a RL Model in One Line

In [0]:
model = PPO2('MlpPolicy', "CartPole-v1", verbose=1).learn(1000)

## Train a DQN agent

In the previous example, we have used PPO, which one of the many algorithms provided by stable-baselines.

In the next example, we are going train a [Deep Q-Network agent (DQN)](https://stable-baselines.readthedocs.io/en/master/modules/dqn.html), and try to see possible improvements provided by its extensions (Double-DQN, Dueling-DQN, Prioritized Experience Replay).

The essential point of this section is to show you how simple it is to tweak hyperparameters.

The main advantage of stable-baselines is that it provides a common interface to use the algorithms, so the code will be quite similar.


DQN paper: https://arxiv.org/abs/1312.5602

Dueling DQN: https://arxiv.org/abs/1511.06581

Double-Q Learning: https://arxiv.org/abs/1509.06461

Prioritized Experience Replay: https://arxiv.org/abs/1511.05952

### Vanilla DQN: DQN without extensions

In [0]:
# Same as before we instantiate the agent along with the environment
from stable_baselines import DQN

# Deactivate all the DQN extensions to have the original version
# In practice, it is recommend to have them activated
kwargs = {'double_q': False, 'prioritized_replay': False, 'policy_kwargs': dict(dueling=False)}

# Note that the MlpPolicy of DQN is different from the one of PPO
# but stable-baselines handles that automatically if you pass a string
dqn_model = DQN('MlpPolicy', 'CartPole-v1', verbose=1, **kwargs)

In [10]:
# Random Agent, before training
mean_reward_before_train = evaluate(dqn_model, num_steps=10000)

Mean reward: 12.0 Num episodes: 807


In [0]:
# Train the agent for 20000 steps
dqn_model.learn(total_timesteps=20000, log_interval=10)

In [20]:
# Evaluate the trained agent
mean_reward = evaluate(dqn_model, num_steps=10000)

Mean reward: 166.7 Num episodes: 60


### DQN + Prioritized Replay

In [13]:
# Activate only the prioritized replay
kwargs = {'double_q': False, 'prioritized_replay': True, 'policy_kwargs': dict(dueling=False)}

dqn_per_model = DQN('MlpPolicy', 'CartPole-v1', verbose=1, **kwargs)

Creating environment from the given name, wrapped in a DummyVecEnv.


In [0]:
dqn_per_model.learn(total_timesteps=20000, log_interval=10)

In [22]:
# Evaluate the trained agent
mean_reward = evaluate(dqn_per_model, num_steps=10000)

Mean reward: 129.9 Num episodes: 77


### DQN + Prioritized Experience Replay + Double Q-Learning + Dueling

In [16]:
# Activate all extensions
kwargs = {'double_q': True, 'prioritized_replay': True, 'policy_kwargs': dict(dueling=True)}

dqn_full_model = DQN('MlpPolicy', 'CartPole-v1', verbose=1, **kwargs)

Creating environment from the given name, wrapped in a DummyVecEnv.


In [0]:
dqn_full_model.learn(total_timesteps=20000, log_interval=10)

In [24]:
mean_reward = evaluate(dqn_per_model, num_steps=10000)

Mean reward: 131.6 Num episodes: 76


In this particular example, the extensions does not seem to give any improvement compared to the simple DQN version.
They are several reasons for that:

1. `CartPole-v1` is a pretty simple environment
2. We trained DQN for very few timesteps, not enough to see any difference
3. The default hyperparameters for DQN are tuned for atari games, where the number of training timesteps is much bigger (10^6) and input observations are images