<a href="https://colab.research.google.com/github/robertmoni/modelbasedrl/blob/master/Copy_of_stable_baselines_getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Baselines, a Fork of OpenAI Baselines - Getting Started

Github Repo: [https://github.com/hill-a/stable-baselines](https://github.com/hill-a/stable-baselines)

Medium article: [https://medium.com/@araffin/stable-baselines-a-fork-of-openai-baselines-df87c4b2fc82](https://medium.com/@araffin/stable-baselines-a-fork-of-openai-baselines-df87c4b2fc82)

## Install Dependencies and Stable Baselines Using Pip

The full list of dependencies can be found in the [README](https://github.com/hill-a/stable-baselines).

```
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev zlib1g-dev
```


```
pip install stable-baselines
```

In [1]:
!pip install stable-baselines==2.6.0

Collecting stable-baselines==2.6.0
  Downloading https://files.pythonhosted.org/packages/ce/1e/f99bb18f6a24a88fdd528561c204ed82d644f83dbf7da4950900d6920dc5/stable_baselines-2.6.0-py3-none-any.whl (250kB)
     |████████████████████████████████| 256kB 4.8MB/s
Installing collected packages: stable-baselines
  Found existing installation: stable-baselines 2.2.1
    Uninstalling stable-baselines-2.2.1:
      Successfully uninstalled stable-baselines-2.2.1
Successfully installed stable-baselines-2.6.0


## Import policy, RL agent, ...

In [0]:
import gym
import numpy as np

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.ppo2 import PPO2

## Create the Gym env and instantiate the agent

For this example, we will use the CartPole environment, a classic control problem.

"A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright."

Cartpole environment: [https://gym.openai.com/envs/CartPole-v1/](https://gym.openai.com/envs/CartPole-v1/)

![Cartpole](https://cdn-images-1.medium.com/max/1143/1*h4WTQNVIsvMXJTCpXm_TAw.gif)

Note: vectorized environments make it easy to run training across multiple processes. In this example we use only one process, hence the DummyVecEnv.
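To make the idea of a single-process vectorized environment concrete, here is a minimal sketch in plain Python. `ToyEnv` and `ToyVecEnv` are hypothetical names invented for this illustration; they are not part of Stable Baselines and only mimic the general shape of the interface (a list of environment factories, batched `reset` and `step`). The automatic reset of a finished sub-environment is, to the best of our knowledge, also what DummyVecEnv does, but treat that as an assumption.

```python
class ToyEnv:
    """Hypothetical stand-in for a Gym env: the episode ends after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]  # observation: a one-element feature vector

    def step(self, action):
        self.t += 1
        obs = [float(self.t)]
        reward = 1.0  # +1 per surviving step, like CartPole
        done = self.t >= self.horizon
        return obs, reward, done, {}


class ToyVecEnv:
    """Hypothetical wrapper that steps several envs one after another in one process."""

    def __init__(self, env_fns):
        # Each entry is a zero-argument factory; it is called once here.
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs_batch, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done, info = env.step(action)
            if done:
                # Reset right away so the caller always gets a usable observation.
                obs = env.reset()
            obs_batch.append(obs)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return obs_batch, rewards, dones, infos


vec_env = ToyVecEnv([lambda: ToyEnv(horizon=2), lambda: ToyEnv(horizon=3)])
first_obs = vec_env.reset()
step1 = vec_env.step([0, 0])
step2 = vec_env.step([0, 0])
```

Every return value is a batch with one entry per sub-environment, which is why code that talks to a vectorized environment indexes into the rewards and done flags even when there is only one environment.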

We chose the MlpPolicy because the CartPole observation is a feature vector, not an image.

The type of action to use (discrete or continuous) is deduced automatically from the environment's action space.



In [3]:
env = gym.make('CartPole-v1')
# Vectorized environments make it easy to train in multiple processes;
# the next examples demonstrate their usefulness.
env = DummyVecEnv([lambda: env])  # the algorithms require a vectorized environment to run

model = PPO2(MlpPolicy, env, verbose=0)

W0722 21:55:38.644183 140512076314496 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/stable_baselines/common/tf_util.py:98: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0722 21:55:38.645871 140512076314496 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/stable_baselines/common/tf_util.py:107: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

W0722 21:55:38.674685 140512076314496 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/stable_baselines/common/policies.py:114: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W0722 21:55:38.676453 140512076314496 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/stable_baselines/common/input.py:25: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0722 21:55:38.725634 140512076314496 deprecation.py:323] From /u

We create a helper function to evaluate the agent:

In [0]:
def evaluate(model, num_steps=1000):
    """
    Evaluate an RL agent on the global `env` created above.

    :param model: (BaseRLModel object) the RL agent
    :param num_steps: (int) number of timesteps to evaluate it for
    :return: (float) mean reward over the last 100 episodes
    """
    episode_rewards = [0.0]
    obs = env.reset()
    for i in range(num_steps):
        # _states are only useful when using LSTM policies
        action, _states = model.predict(obs)
        # here, action, rewards and dones are arrays
        # because we are using a vectorized env
        obs, rewards, dones, info = env.step(action)

        # Stats
        episode_rewards[-1] += rewards[0]
        if dones[0]:
            obs = env.reset()
            episode_rewards.append(0.0)
    # Compute the mean reward over the last 100 episodes
    mean_100ep_reward = round(np.mean(episode_rewards[-100:]), 1)
    print("Mean reward:", mean_100ep_reward, "Num episodes:", len(episode_rewards))

    return mean_100ep_reward
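
The episode bookkeeping in `evaluate` can be exercised without an environment. The sketch below uses a hypothetical helper, `episode_returns`, and a made-up seven-step reward stream; neither comes from the notebook. It separates finished episodes from the one still running when the step budget runs out. Note that `evaluate` keeps that last, unfinished entry in both its average and its episode count, which pulls the reported mean down slightly.

```python
def episode_returns(rewards, dones):
    """Split a flat stream of per-step rewards into per-episode returns.

    Returns (completed, partial): the returns of finished episodes and
    the reward accumulated in the episode still running at the end.
    """
    completed = []
    running = 0.0
    for reward, done in zip(rewards, dones):
        running += reward
        if done:
            completed.append(running)
            running = 0.0
    return completed, running


# Hypothetical stream: episodes of length 3 and 2, then 2 unfinished steps.
rewards = [1.0] * 7
dones = [False, False, True, False, True, False, False]
completed, partial = episode_returns(rewards, dones)

# Mean over finished episodes only.
mean_completed = sum(completed) / len(completed)
# Mimics evaluate(): the unfinished episode is averaged in as well.
mean_with_partial = (sum(completed) + partial) / (len(completed) + 1)
```

With long evaluation runs the difference is small, but it explains why the printed episode count is one higher than the number of episodes that actually finished.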

Let's evaluate the untrained agent first; it should behave like a random agent.

In [0]:
# Random Agent, before training
mean_reward_before_train = evaluate(model, num_steps=10000)

Mean reward: 22.8 Num episodes: 453


## Train the agent and evaluate it

In [5]:
# Train the agent for 10000 steps
model.learn(total_timesteps=10000)

<stable_baselines.ppo2.ppo2.PPO2 at 0x7fcb353730b8>
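
For a sense of scale, the verbose log at the end of this notebook reports 128 timesteps per update with a single environment. Assuming training proceeds in whole rollouts of that size (an assumption about PPO2's internals that this notebook does not state), the arithmetic below estimates how many updates a 10000-step budget covers. The variable names are chosen for this illustration only.

```python
total_timesteps = 10000  # budget passed to model.learn above
steps_per_update = 128   # assumption: read off the `serial_timesteps` row of the verbose log

# Only whole rollouts are counted; any remainder of the budget is left unused.
n_updates = total_timesteps // steps_per_update
steps_used = n_updates * steps_per_update

print("updates:", n_updates, "timesteps actually used:", steps_used)
```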

In [6]:
# Evaluate the trained agent
mean_reward = evaluate(model, num_steps=10000)

Mean reward: 166.7 Num episodes: 60


The training apparently went well: the mean reward increased from 22.8 to 166.7.
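
As a rough sanity check on these numbers: CartPole gives a reward of +1 per timestep, so an episode's return equals its length, and the step budget divided by the mean reward should land near the reported episode count. The sketch below only redoes that arithmetic with the values printed above. An exact match is not expected before training, because the mean covers only the last 100 of the 453 episodes.

```python
num_steps = 10000  # evaluation budget used in both runs above

# (mean reward over the last 100 episodes, episode count) as printed above
reported = {"before training": (22.8, 453), "after training": (166.7, 60)}

for label, (mean_reward, num_episodes) in reported.items():
    # Episodes we would expect if every episode had the reported mean length.
    estimated_episodes = num_steps / mean_reward
    # Mean episode length over the whole run, not just the last 100 episodes.
    overall_mean_length = num_steps / num_episodes
    print(f"{label}: ~{estimated_episodes:.0f} episodes estimated, "
          f"{num_episodes} reported, overall mean length {overall_mean_length:.1f}")
```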

## Bonus: Train an RL Model in One Line

In [7]:
model = PPO2('MlpPolicy', "CartPole-v1", verbose=1).learn(1000)

Creating environment from the given name, wrapped in a DummyVecEnv.
--------------------------------------
| approxkl           | 4.4630156e-07 |
| clipfrac           | 0.0           |
| explained_variance | 0.0192        |
| fps                | 400           |
| n_updates          | 1             |
| policy_entropy     | 0.69314665    |
| policy_loss        | 2.0517851e-05 |
| serial_timesteps   | 128           |
| time_elapsed       | 3.1e-06       |
| total_timesteps    | 128           |
| value_loss         | 37.039703     |
--------------------------------------
---------------------------------------
| approxkl           | 3.612032e-05   |
| clipfrac           | 0.0            |
| explained_variance | -0.011         |
| fps                | 1094           |
| n_updates          | 2              |
| policy_entropy     | 0.6931189      |
| policy_loss        | -0.00081435114 |
| serial_timesteps   | 256            |
| time_elapsed       | 0.321          |
| total_timesteps    | 25