<a href="https://colab.research.google.com/github/georgie-talukdar/masters/blob/main/Deep_Reinforcement_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practical 10 Deep Reinforcement Learning

In this practical you'll get a chance to experiment with OpenAI Gym, which provides test environments for training a (deep) reinforcement learning system. We're going to work with the MountainCar environment. In this environment a car needs to get to the top of a hill, but its engine isn't strong enough to drive straight up. Instead, it needs to reverse up the opposite hill to build up enough momentum to reach the top.

## Installs & Imports

First some installs. The first line installs the system packages we need to visualise what the agent is doing with recorded videos. The second line installs Stable Baselines3, a popular Python RL library which describes itself as an improvement on OpenAI's library Baselines. (Note: the original Stable Baselines was written in TensorFlow; Stable Baselines3 is the PyTorch version of the library.) The third line pins pyglet to a version below 2.0.0.

In [1]:
!apt-get install ffmpeg freeglut3-dev xvfb  # For visualization
!pip install stable-baselines3[extra]
!pip install pyglet\<2.0.0

Reading package lists... Done
Building dependency tree       
Reading state information... Done
ffmpeg is already the newest version (7:3.4.11-0ubuntu0.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  freeglut3 freeglut3-dev xvfb
0 upgraded, 3 newly installed, 0 to remove and 7 not upgraded.
Need to get 982 kB of archives.
After this operation, 3,350 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 freeglut3 amd64 2.8.1-3 [73.6 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 freeglut3-dev amd64 2.8.1-3 [124 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 xvfb amd64 2:1.19.6-1ubuntu4.12 [785 kB]
Fetched 982 kB in 2s (464 kB/s)
Selecting previously unselected package freeglut3:amd64.
(Reading database ... 124015 files and directories currently installed.)
Pr

We are also going to use Gym, a toolkit provided by the AI research lab OpenAI for developing and comparing reinforcement learning algorithms.

In the supervised learning practicals we needed to read in a dataset, and Keras provided some toy datasets to get started with. In reinforcement learning we instead need an environment for the agent to explore, and Gym provides some off-the-shelf toy environments to experiment with!

Our Stable Baselines install has installed Gym as a requirement, so we only need to import. We will also go ahead and import NumPy.

In [2]:
import gym
import numpy as np

The next thing we need to import is our model of choice, in this case DQN (Deep Q Network), the deep RL algorithm covered in the lecture. Rather than write our own implementation, we can import the implementation that Stable Baselines provides.

In [3]:
from stable_baselines3 import DQN
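As a brief reminder of what DQN is trained towards, here is a minimal sketch of the one-step temporal-difference (TD) target. It is an illustration in plain Python, not the Stable Baselines3 implementation; the function name `td_target` and the numbers are hypothetical, and `gamma=0.98` is simply the discount factor used later in this practical.

```python
def td_target(reward, next_q_values, done, gamma):
    """One-step TD target for a single transition.

    reward:        reward received for the transition
    next_q_values: target-network Q-value estimates, one per action, for the next state
    done:          True if the episode terminated on this transition
    gamma:         discount factor
    """
    if done:
        # No future reward is available after a terminal state
        return reward
    # Bootstrap from the best action according to the target network
    return reward + gamma * max(next_q_values)


# Hypothetical values, for illustration only
print(td_target(-1.0, [-10.0, -8.0, -9.0], done=False, gamma=0.98))
print(td_target(-1.0, [-10.0, -8.0, -9.0], done=True, gamma=0.98))
```

The Q-network is trained so that its prediction for the action actually taken moves towards this target.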

We also need to import the policy i.e. the type of network to use for our policy (our network taking in states and spitting out actions) and our value function(s) (our network taking in states and spitting out values). 

If we were using images as the input to our network we would select 'CnnPolicy', for example.

Here we are going to use 'MlpPolicy', which for DQN builds the Q-network as an MLP (multilayer perceptron). The default architecture is 2 layers of 64 neurons, but we can modify the architecture with 'policy_kwargs' when we build our model shortly.

In [4]:
from stable_baselines3.dqn.policies import MlpPolicy

## Create our Environment

For this practical we will use Gym's Mountain Car, an environment from their classic control collection.

This is a link to the environment's webpage where you can see what the environment looks like: https://gymnasium.farama.org/environments/classic_control/mountain_car/

Their description of the environment is as follows:
"A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum."

The environment is stored in Gym's registry, so to create an instance of the environment we simply need to call the function `gym.make` and pass the environment name as a string.

In [5]:
env = gym.make("MountainCar-v0")

Every gym environment has an observation space - something that describes the type and form of observations that the agent expects - and an action space - something that describes the range of actions available to the agent.

Let's inspect the observation and action space for Mountain Car.

In [6]:
print(env.observation_space)
print(env.observation_space.low)
print(env.observation_space.high)
print(env.action_space)

Box([-1.2  -0.07], [0.6  0.07], (2,), float32)
[-1.2  -0.07]
[0.6  0.07]
Discrete(3)


The 'Spaces' section of the Gym documentation https://gymnasium.farama.org/api/spaces/ will give you more information on what this output means.

For our Mountain Car environment, this means the car's observation is a vector of two values: its own position and velocity. The car has three actions to select from: push left (0), no push (1) and push right (2).
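To make the observation and action spaces concrete, here is a rough re-creation of the Mountain Car dynamics in plain Python. The constants are taken from the environment's documentation as I understand it, and the fixed start position of -0.5 is an assumption (Gym samples the start position at random). The two policies, `always_right` and `follow_momentum`, are hypothetical hand-written rules rather than anything learned.

```python
import math

FORCE = 0.001                # acceleration contributed by the engine
GRAVITY = 0.0025             # strength of the slope's pull
MIN_POS, MAX_POS = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POS = 0.5


def step(position, velocity, action):
    """Apply one action (0 = push left, 1 = no push, 2 = push right)."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POS, min(MAX_POS, position))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the car stops dead when it hits the left wall
    return position, velocity


def run_episode(policy, start_position=-0.5, max_steps=200):
    """Return (reached_goal, steps_taken) for a policy(position, velocity)."""
    position, velocity = start_position, 0.0
    for t in range(1, max_steps + 1):
        action = policy(position, velocity)
        position, velocity = step(position, velocity, action)
        if position >= GOAL_POS:
            return True, t
    return False, max_steps


def always_right(position, velocity):
    return 2  # full throttle towards the goal


def follow_momentum(position, velocity):
    return 2 if velocity >= 0 else 0  # push in the direction of travel


print("always right:   ", run_episode(always_right))
print("follow momentum:", run_episode(follow_momentum))
```

Pushing right all the time never reaches the flag within 200 steps, whereas pushing in the direction the car is already moving does. This back-and-forth behaviour is what we want the agent to discover for itself.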

## RL Model

Although we don't need to code up our own implementation of DQN, we do need to tell Stable Baselines which hyperparameters to pass to the model's constructor.

If we were to construct the model without passing any arguments, the constructor would use the default values set for the model.

Good results in RL are generally dependent on finding appropriate hyperparameters. Recent algorithms (PPO, SAC, TD3) typically require less hyperparameter tuning than older algorithms like DQN. However, even with these, don’t expect the default values to work on every environment.

The hyperparameters we are going to pass to our model constructor were chosen by hyperparameter tuning with the library Optuna. The repository RL Baselines3 Zoo, provided by the Stable Baselines team, lists the tuned hyperparameters for some of their algorithm implementations: https://github.com/DLR-RM/rl-baselines3-zoo/tree/master/hyperparams

In [7]:
model = DQN(MlpPolicy, env, learning_rate=0.004,
            batch_size=128, buffer_size=10000, 
            learning_starts=1000, gamma=0.98, 
            target_update_interval=600, train_freq=16, 
            gradient_steps=8, exploration_fraction=0.2,
            exploration_final_eps=0.07, 
            policy_kwargs=dict(net_arch=[256, 256]), verbose=1)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
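Two of the arguments above, `buffer_size` and `batch_size`, relate to DQN's replay buffer: transitions are stored as the agent acts, the oldest ones are discarded once the buffer is full, and each gradient step trains on a random mini-batch. The class below is a minimal sketch of that idea using only the standard library. It is not how Stable Baselines3 implements its buffer, and the transitions stored here are placeholder values.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions that discards the oldest when full."""

    def __init__(self, buffer_size):
        self.storage = deque(maxlen=buffer_size)

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform random mini-batch, sampled without replacement
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(buffer_size=10_000)
for t in range(12_000):
    # Placeholder transition: the step index stands in for the observation
    buffer.add(obs=t, action=0, reward=-1.0, next_obs=t + 1, done=False)

batch = buffer.sample(batch_size=128)
print(len(buffer))  # 10000: the first 2000 transitions have been discarded
print(len(batch))   # 128
```

The `learning_starts=1000` argument plays a related role: no gradient steps are taken until that many transitions have been collected, so the first mini-batches are not drawn from a nearly empty buffer.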


Before we train our agent, let's evaluate how good it is at completing the task untrained!

Below is a function that runs the environment for n episodes, selecting an action using the model's `predict` function, implementing that action decision in the environment with the environment's `step` function, and logging the reward issued by the environment. A record is kept of the cumulative reward for each episode and the function outputs the average of these values.

For context, the lowest value an agent can achieve in any given episode of Mountain Car is -200 and the environment is considered 'solved' if the agent achieves -110 or above.

In [8]:
def evaluate(model, num_episodes=100):
    """
    Evaluate an RL agent
    :param model: (BaseAlgorithm object) the RL agent
    :param num_episodes: (int) number of episodes to evaluate it for
    :return: (float) Mean reward over num_episodes
    """
    # This function will only work for a single Environment
    env = model.get_env()
    all_episode_rewards = []
    for i in range(num_episodes):
        episode_rewards = []
        done = False
        obs = env.reset()
        while not done:
            # _states are only useful when using LSTM policies
            action, _states = model.predict(obs)
            # here, action, rewards and dones are arrays
            # because we are using vectorized env
            obs, reward, done, info = env.step(action)
            episode_rewards.append(reward)

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    print("Mean reward:", mean_episode_reward, "Num episodes:", num_episodes)

    return mean_episode_reward

In [9]:
# Random Agent, before training
mean_reward_before_train = evaluate(model, num_episodes=100)

Mean reward: -200.0 Num episodes: 100


As you can see, our agent is not very good yet!

We don't actually need to write our own evaluation function: Stable Baselines provides one called 'evaluate_policy'.

In [10]:
from stable_baselines3.common.evaluation import evaluate_policy

In [11]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print('Mean reward = ', mean_reward)
print('Std dev reward = ', std_reward)



Mean reward =  -200.0
Std dev reward =  0.0
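The two numbers above are summary statistics over the per-episode returns. As a rough sketch of how to read them against the figures quoted earlier (-200 as the worst case and -110 as the 'solved' threshold), the helper below computes the mean and population standard deviation of a list of returns and checks the threshold. The function name and the second list of returns are hypothetical, chosen only for illustration.

```python
import statistics


def summarise_returns(episode_returns, solved_threshold=-110.0):
    """Return (mean, population std, solved?) for a list of episode returns."""
    mean_return = statistics.fmean(episode_returns)
    std_return = statistics.pstdev(episode_returns)
    return mean_return, std_return, mean_return >= solved_threshold


# An agent that never reaches the flag scores -200 in every episode
untrained = [-200.0] * 100
print(summarise_returns(untrained))

# Hypothetical returns from a partly trained agent, for illustration only
partly_trained = [-105.0, -120.0, -200.0, -98.0]
print(summarise_returns(partly_trained))
```

A standard deviation of exactly zero, as in the untrained case, is a sign that every episode ended the same way: by hitting the 200-step time limit.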


Without knowing what the lower bound for episodic cumulative reward is for Mountain Car, or without knowing the threshold for solving the environment, the reward values can be pretty unhelpful. You can go to the environment's documentation to find this information out.

What can be even more helpful is to watch the agent in action!

The first thing is to set Google Colab up with a fake display, otherwise rendering will fail.

In [12]:
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

Next, a function for recording the agent and saving the video to file, and a function for fetching the saved video and showing it in our Colab notebook.

In [13]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_videos(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only those starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

In [14]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv

def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  :param env_id: (str)
  :param model: (RL model)
  :param video_length: (int)
  :param prefix: (str)
  :param video_folder: (str)
  """
  eval_env = DummyVecEnv([lambda: gym.make(env_id)])
  # Start the video at step=0 and record video_length steps
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()

Let's record and watch a video of our untrained agent ...

In [15]:
record_video('MountainCar-v0', model, video_length=500, prefix='dqn-mountaincar-untrained')

Saving video to /content/videos/dqn-mountaincar-untrained-step-0-to-step-500.mp4


In [16]:
show_videos('videos', prefix='dqn-mountaincar-untrained')

The agent (mountain car) will probably just roll around a bit at the bottom of the valley. 

## Train Agent

Let's now train our agent! The number of steps to train an agent for is, again, a hyperparameter. Stable Baselines 3 suggests 120,000 timesteps for Mountain Car with DQN; the cell below uses a slightly larger budget of 150,000.

The argument 'verbose' has been set to 1 so that we see some training stats as the model trains. You should see 'ep_rew_mean' start to improve after around 25,000 timesteps. It may take a little while to finish: it took over 5 minutes to run whilst writing this practical.

In [21]:
model.learn(total_timesteps=150000)

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 200      |
|    ep_rew_mean      | -200     |
|    exploration_rate | 0.975    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 9562     |
|    time_elapsed     | 0        |
|    total_timesteps  | 800      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 200      |
|    ep_rew_mean      | -200     |
|    exploration_rate | 0.95     |
| time/               |          |
|    episodes         | 8        |
|    fps              | 1639     |
|    time_elapsed     | 0        |
|    total_timesteps  | 1600     |
| train/              |          |
|    learning_rate    | 0.004    |
|    loss             | 0.115    |
|    n_updates        | 59800    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean    

<stable_baselines3.dqn.dqn.DQN at 0x7feb51356cd0>
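The `exploration_rate` values in the log can be reproduced by hand. The sketch below assumes epsilon decays linearly from 1.0 to `exploration_final_eps` over the first `exploration_fraction` of training and then stays constant, which is consistent with the values logged above. The function `linear_epsilon` is a hypothetical helper, not a Stable Baselines3 call.

```python
def linear_epsilon(step, total_timesteps, exploration_fraction,
                   final_eps, initial_eps=1.0):
    """Epsilon after `step` timesteps under a linear decay schedule."""
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        # Decay has finished: keep exploring at the final rate
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / exploration_fraction


# Settings used in this practical: 150,000 timesteps,
# exploration_fraction=0.2, exploration_final_eps=0.07
for step in (800, 1600, 30_000, 150_000):
    print(step, round(linear_epsilon(step, 150_000, 0.2, 0.07), 3))
```

With these settings the agent acts almost entirely at random at the start and settles at 7% random actions after the first 30,000 timesteps.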

Now our agent is trained, let's re-run the evaluation.

In [23]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print('Mean reward = ', mean_reward)
print('Std dev reward = ', std_reward)

Mean reward =  -131.49
Std dev reward =  34.44023664262486


Since the data that the model trains on is collected by the agent as it interacts with the environment, every RL training run will be slightly different, and so the mean episode reward you see when you evaluate the agent may differ slightly too. After 120,000 or more timesteps you should get a mean reward somewhere in the region of -120 to -130.

The threshold for solving is -110, so you could train the agent for longer if you want. However, if you watch a video of the agent, you should see that the car does indeed reach the flag at the top of the mountain.

In [24]:
record_video('MountainCar-v0', model, video_length=500, prefix='dqn-mountaincar-trained')

Saving video to /content/videos/dqn-mountaincar-trained-step-0-to-step-500.mp4


In [25]:
show_videos('videos', prefix='dqn-mountaincar-trained')

# Exercises

1. Try changing the number of timesteps. What happens with a partly trained agent?
2. For the DQN model, try changing the hyperparameters. The model is quite sensitive to these parameters, so don't be surprised if the model fails to train well enough to solve the problem.
3. PPO is a policy-based deep reinforcement learning approach. Try changing the code to use the PPO model. Note that this requires you to change not only the model but the MlpPolicy too.
4. There are a number of other Gym environments that you can play around with, such as 'CartPole-v1' and 'Acrobot-v1'. More details can be found at https://gymnasium.farama.org/environments/classic_control/. Try to modify the notebook to use one of the other environments.