The framework we are going to use in this notebook is Stable-Baselines3, a set of improved implementations of Reinforcement Learning (RL) algorithms that descends from OpenAI Baselines.

# Import Dependencies

In [1]:
import os

import gym  # OpenAI Gym: collection of ready-made RL environments
from stable_baselines3 import PPO  # RL algorithm
from stable_baselines3.common.vec_env import DummyVecEnv  # Wraps one or more environments behind a vectorised interface
from stable_baselines3.common.evaluation import evaluate_policy  # Measures the performance of a trained model

# Load Environment

OpenAI Gym provides many simulated environments to train a model on. In this notebook we are going to use the one called CartPole.

What does the environment look like?

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces in the left and right direction on the cart.

In [2]:
environment_name = "CartPole-v0"

In [3]:
env = gym.make(environment_name)

In [4]:
episodes = 5
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score+=reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()


Episode:1 Score:15.0
Episode:2 Score:26.0
Episode:3 Score:15.0
Episode:4 Score:33.0
Episode:5 Score:24.0


* Think of an episode as one complete game played by the agent.

* Here, we run a loop of 5 episodes and reset the environment to its initial observation at the start of each one.

* The `env.render()` function displays the environment graphically.

* We then sample a random action from the action space, i.e. the set of all possible actions.

* We then apply that randomly generated action to the environment, which returns four values: the next observation, the reward, a boolean that says whether the episode is done, and an info dictionary.

* Once the episode is done, the while loop condition becomes false and we move on to the next episode.
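
The reset/step contract described in the bullets above does not depend on CartPole itself. As a minimal sketch, the same loop can be run against a toy environment written in plain Python. `ToyBalanceEnv`, `sample_action` and `run_random_episode` are hypothetical names made up for this illustration; they are not part of Gym.

```python
import random


class ToyBalanceEnv:
    """Hypothetical stand-in for a Gym environment (not part of gym itself).

    The episode ends when a 1-D position drifts outside [-3, 3] or after
    max_steps steps, loosely mimicking how CartPole terminates.
    """

    def __init__(self, max_steps=200, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.position = 0
        self.steps = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.position = 0
        self.steps = 0
        return self.position

    def sample_action(self):
        # Plays the role of env.action_space.sample(): 0 = left, 1 = right.
        return self.rng.choice([0, 1])

    def step(self, action):
        # Apply the action, then report (observation, reward, done, info).
        self.position += 1 if action == 1 else -1
        self.steps += 1
        done = abs(self.position) > 3 or self.steps >= self.max_steps
        reward = 1.0  # one point per step taken, as in CartPole
        return self.position, reward, done, {}


def run_random_episode(env):
    # Same structure as the notebook loop: reset, then step until done.
    env.reset()
    done = False
    score = 0.0
    while not done:
        action = env.sample_action()
        _obs, reward, done, _info = env.step(action)
        score += reward
    return score


toy_env = ToyBalanceEnv()
scores = [run_random_episode(toy_env) for _ in range(5)]
print(scores)
```

As with the random CartPole agent, the score is simply the number of steps the episode lasted before `done` became true.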

# Understanding Environment

## Action

The environment has two possible actions:

* 0 - push cart to left
* 1 - push cart to right

In [5]:
env.action_space

Discrete(2)

In [6]:
env.action_space.sample()

0

## Observation

The observation space is of type `Box`, and each observation is an `ndarray` with shape `(4,)` whose values correspond to the following information:

cart position, cart velocity, pole angle, pole angular velocity

In [7]:
env.observation_space

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)

In [8]:
env.observation_space.sample()

array([-3.0420842e+00,  1.2434593e+38, -3.4050566e-01,  2.0792163e+38],
      dtype=float32)
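
The bounds printed by `env.observation_space` are easier to interpret after a quick conversion. The sketch below uses only the numbers shown in the output above; the reading of the velocity bound as "largest finite float32" is our interpretation of that number.

```python
import math

# Bounds copied from the env.observation_space output above.
position_bound = 4.8000002     # cart position
angle_bound = 0.41887903       # pole angle, in radians
velocity_bound = 3.4028235e38  # cart velocity and pole angular velocity

# The angle bound is easier to read in degrees.
angle_bound_deg = math.degrees(angle_bound)

# Largest finite float32 value: (2 - 2**-23) * 2**127.
float32_max = (2 - 2**-23) * 2.0**127
velocity_is_float32_max = math.isclose(velocity_bound, float32_max, rel_tol=1e-7)

print(f"angle bound: {angle_bound_deg:.1f} degrees")
print(f"velocity bound equals float32 max: {velocity_is_float32_max}")
```

So the pole angle is bounded at about 24 degrees either side of vertical, while the two velocities are effectively unbounded. This also suggests why `observation_space.sample()` returns velocities around 1e38: sampling draws from the declared bounds, not from states the cart would realistically reach.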

# Train the Model


In [9]:
log_path = os.path.join('Training')

In [10]:
env = DummyVecEnv([lambda: gym.make(environment_name)])  # Wrap a fresh CartPole environment in a vectorised interface
model = PPO('MlpPolicy', env, verbose = 1, tensorboard_log=log_path)

Using cpu device


In [11]:
model.learn(total_timesteps=20000)

Logging to Training/PPO_1
-----------------------------
| time/              |      |
|    fps             | 3897 |
|    iterations      | 1    |
|    time_elapsed    | 0    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 2499        |
|    iterations           | 2           |
|    time_elapsed         | 1           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008801184 |
|    clip_fraction        | 0.129       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.685      |
|    explained_variance   | -0.00753    |
|    learning_rate        | 0.0003      |
|    loss                 | 8.43        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0219     |
|    value_loss           | 58.6        |
-----------------------------------------
--------

<stable_baselines3.ppo.ppo.PPO at 0x7f3a3c71f6a0>
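
The `total_timesteps` column in the log grows in steps of 2048, which hints at how PPO splits training into iterations. The sketch below reproduces that arithmetic. The values of `n_steps` and `n_epochs` are the Stable-Baselines3 PPO defaults as far as we know; they are assumptions here, since the notebook does not set them explicitly.

```python
import math

# Assumed Stable-Baselines3 PPO defaults (not set explicitly in this notebook).
n_steps = 2048   # environment steps collected per rollout, per environment
n_epochs = 10    # optimisation passes over each rollout
n_envs = 1       # DummyVecEnv wraps a single environment here

total_timesteps = 20000  # value passed to model.learn

# One iteration = one rollout of n_steps from every environment.
steps_per_iteration = n_steps * n_envs

# Rollouts are collected until at least total_timesteps steps were taken.
iterations = math.ceil(total_timesteps / steps_per_iteration)
actual_timesteps = iterations * steps_per_iteration

print(steps_per_iteration, iterations, actual_timesteps)
```

Under these assumptions training runs for 10 iterations and slightly overshoots the requested budget, ending at 20480 timesteps. The `n_updates` entry in the log rises by `n_epochs` after each rollout, which matches the value of 10 shown at iteration 2.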

# Saving the Model


In [12]:
PPO_path = os.path.join('Training', 'Saved Models', 'PPO_model')

In [13]:
model.save(PPO_path)



# Evaluate the Model

In [14]:
evaluate_policy(model, env, n_eval_episodes=10, render=True)



(200.0, 0.0)

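* `evaluate_policy` returns the mean episode reward and its standard deviation over the evaluated episodes.

* CartPole gives a reward of 1 for every step the pole remains upright, so the score of an episode equals the number of steps it lasted. CartPole-v0 caps an episode at 200 steps, so 200 is the maximum score.

* The pole is far more stable than it was under random actions. `evaluate_policy` runs 10 separate episodes (`n_eval_episodes=10`) and reports the reward statistics over them.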
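
To make the two returned numbers concrete, here is a minimal sketch of the same summary computed by hand. The episode scores are illustrative values chosen for this example, not output from the notebook, and we assume the standard deviation is the population one (dividing by the number of episodes).

```python
import math
import statistics

# Illustrative per-episode scores (hypothetical values for this example).
episode_rewards = [200.0, 200.0, 200.0, 200.0, 196.0]

# Mean and population standard deviation over the episodes.
mean_reward = statistics.fmean(episode_rewards)
std_reward = statistics.pstdev(episode_rewards)

print(f"mean={mean_reward:.1f} std={std_reward:.1f}")
```

A result of `(200.0, 0.0)` therefore means that every one of the evaluated episodes reached the maximum score.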

In [15]:
env.close()

# Testing the Model

Previously we generated random actions. This time we let the model predict the actions and visualise how it performs.

In [16]:
obs = env.reset()
model.predict(obs)

(array([1]), None)

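The model predicts an action when we pass it an observation. `predict` returns the action together with a hidden state, which is `None` here because it is only used by recurrent policies.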

Let's visualise it with multiple episodes and print the reward.

In [17]:
episodes = 5
for episode in range(1, episodes+1):
    obs = env.reset() 
    done = False
    score = 0 
    
    while not done:
        env.render()
        action, _states = model.predict(obs) #Using model here
        obs, reward, done, info = env.step(action)
        score+=reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()

Episode:1 Score:[200.]
Episode:2 Score:[200.]
Episode:3 Score:[200.]
Episode:4 Score:[200.]
Episode:5 Score:[196.]


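The model now chooses the actions itself and balances the pole far better than the random agent, scoring 200 (the maximum) in four of the five episodes and 196 in the other. The scores print as one-element arrays such as `[200.]` because the vectorised environment returns one reward per wrapped environment.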

# Viewing Logs in TensorBoard

The training logs saved under `log_path` can be served with the command below and explored graphically, with all the metrics, on localhost.

In [18]:
training_log_path = os.path.join(log_path, 'PPO_1')

In [19]:
!tensorboard --logdir={training_log_path}

TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.10.0 at http://localhost:6006/ (Press CTRL+C to quit)
^C


# Adding a Callback to the Training Stage


We can attach a callback that acts at certain stages of training, for example once the agent reaches the reward we want. Here we stop training when the reward hits a given threshold and save the best model.


In [20]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

Paths for saving the best model and the logs.

In [21]:
save_path = os.path.join('Training', 'Saved Models')
log_path = os.path.join('Training', 'Logs')

Create the vectorised environment.

In [22]:
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])

`EvalCallback` evaluates the model every 10000 steps. Whenever an evaluation yields a new best mean reward, the best model is saved to `save_path` and `StopTrainingOnRewardThreshold` checks whether that reward has reached 200; if it has, training stops.

In [23]:
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=200, verbose=1)
eval_callback = EvalCallback(env, 
                             callback_on_new_best=stop_callback, 
                             eval_freq=10000, 
                             best_model_save_path=save_path, 
                             verbose=1)
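
The interplay of the two callbacks can be sketched without the library. The code below is a simplified model of the logic, not the Stable-Baselines3 implementation: `should_continue_training` and `simulate_training` are hypothetical helpers, and the evaluation results fed to them are made-up numbers.

```python
def should_continue_training(best_mean_reward, reward_threshold):
    # Training goes on only while the best mean reward is below the threshold.
    return best_mean_reward < reward_threshold


def simulate_training(eval_results, reward_threshold=200, eval_freq=10000):
    """Replay a sequence of evaluation results and report when training stops.

    eval_results holds hypothetical mean rewards, one per evaluation.
    Returns (timestep_at_stop_or_None, best_mean_reward).
    """
    best_mean_reward = float("-inf")
    for i, mean_reward in enumerate(eval_results, start=1):
        timestep = i * eval_freq
        if mean_reward > best_mean_reward:
            # New best: this is where the best model would be saved,
            # and where the threshold check is triggered.
            best_mean_reward = mean_reward
            if not should_continue_training(best_mean_reward, reward_threshold):
                return timestep, best_mean_reward
    return None, best_mean_reward


print(simulate_training([142.5, 200.0]))  # (20000, 200.0)
print(simulate_training([120.0, 180.0]))  # (None, 180.0)
```

Note that with `eval_freq=10000` and a budget of about 20000 timesteps, only two evaluations fit into a run, so the threshold check gets at most two chances to stop training early.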

In [24]:
model = PPO('MlpPolicy', env, verbose = 1, tensorboard_log=log_path)

Using cpu device


Pass the callback to `learn`.

In [25]:
model.learn(total_timesteps=20000, callback=eval_callback)

Logging to Training/Logs/PPO_1
-----------------------------
| time/              |      |
|    fps             | 4058 |
|    iterations      | 1    |
|    time_elapsed    | 0    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 2590        |
|    iterations           | 2           |
|    time_elapsed         | 1           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008667963 |
|    clip_fraction        | 0.0817      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.687      |
|    explained_variance   | 0.00854     |
|    learning_rate        | 0.0003      |
|    loss                 | 5.78        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0116     |
|    value_loss           | 44.7        |
-----------------------------------------
---

<stable_baselines3.ppo.ppo.PPO at 0x7f3a3be33250>

# Changing Policies

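This is an example of using a custom architecture for the neural networks used in PPO: `pi` lists the hidden layer sizes of the policy network and `vf` those of the value-function network. Recent Stable-Baselines3 releases expect `net_arch` as a plain dict, `dict(pi=[...], vf=[...])`, without the surrounding list.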

In [26]:
net_arch=[dict(pi=[128, 128, 128, 128], vf=[128, 128, 128, 128])]
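
To get a feel for how much larger this architecture is, we can count the trainable parameters of each network by hand. This is a rough sketch that assumes plain fully connected layers with biases; `mlp_parameter_count` is a hypothetical helper, and the `[64, 64]` baseline is what we believe the default `MlpPolicy` uses.

```python
def mlp_parameter_count(input_dim, hidden_sizes, output_dim):
    # Each dense layer has (inputs * outputs) weights plus one bias per output.
    sizes = [input_dim, *hidden_sizes, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


hidden = [128, 128, 128, 128]

# CartPole: 4 observation values in, 2 action logits out for the policy,
# and a single state-value estimate out for the value function.
policy_params = mlp_parameter_count(4, hidden, 2)
value_params = mlp_parameter_count(4, hidden, 1)

# Assumed default policy architecture, for comparison.
default_policy_params = mlp_parameter_count(4, [64, 64], 2)

print(policy_params, value_params, default_policy_params)
```

Under these assumptions each custom network has roughly 50,000 parameters, about ten times the assumed default, which is consistent with the lower `fps` in the training log of the custom model.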

Attach the new network architecture to the model through `policy_kwargs`.

In [27]:
model = PPO('MlpPolicy', env, verbose = 1, policy_kwargs={'net_arch': net_arch})

Using cpu device


In [28]:
model.learn(total_timesteps=20000, callback=eval_callback)

-----------------------------
| time/              |      |
|    fps             | 3104 |
|    iterations      | 1    |
|    time_elapsed    | 0    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1962        |
|    iterations           | 2           |
|    time_elapsed         | 2           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.014669506 |
|    clip_fraction        | 0.197       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.682      |
|    explained_variance   | 0.00113     |
|    learning_rate        | 0.0003      |
|    loss                 | 2.34        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0218     |
|    value_loss           | 20.6        |
-----------------------------------------
----------------------------------

<stable_baselines3.ppo.ppo.PPO at 0x7f39fef2c400>

# Using a Different Algorithm

In [29]:
from stable_baselines3 import DQN

In [30]:
model = DQN('MlpPolicy', env, verbose = 1, tensorboard_log=log_path)

Using cpu device


In [31]:
model.learn(total_timesteps=20000, callback=eval_callback)

Logging to Training/Logs/DQN_1
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.966    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 13517    |
|    time_elapsed     | 0        |
|    total_timesteps  | 71       |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.923    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 14642    |
|    time_elapsed     | 0        |
|    total_timesteps  | 162      |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.892    |
| time/               |          |
|    episodes         | 12       |
|    fps              | 15439    |
|    time_elapsed     | 0        |
|    total_timesteps  | 227      |
----------------------------------
------------------------

<stable_baselines3.dqn.dqn.DQN at 0x7f3a3bf87be0>
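
The `exploration_rate` column in the DQN log is the epsilon of the epsilon-greedy policy, i.e. the probability of taking a random action. The sketch below assumes a linear decay from 1.0 to 0.05 over the first 10% of training, which we believe are the Stable-Baselines3 DQN defaults; the function itself is a hypothetical re-implementation for illustration.

```python
def exploration_rate(timestep, total_timesteps=20000,
                     exploration_fraction=0.1,
                     initial_eps=1.0, final_eps=0.05):
    # Linear decay from initial_eps to final_eps over the first
    # exploration_fraction of training, then constant at final_eps.
    decay_steps = exploration_fraction * total_timesteps
    progress = min(timestep / decay_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


for step in (71, 162, 227, 2000, 20000):
    print(step, round(exploration_rate(step), 3))
```

For timesteps 71, 162 and 227 this gives 0.966, 0.923 and 0.892, the same values as in the log above. From timestep 2000 onwards the rate stays at 0.05, so the agent acts almost entirely on its learned Q-values for the remaining 90% of training.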

In [32]:
dqn_path = os.path.join('Training', 'Saved Models', 'DQN_model')

In [33]:
model.save(dqn_path)