# COMP47590 Advanced Machine Learning Final

#### Student Name: Winnie Imafidon
#### Student Number: 2220023

## Assignment 2: Going the Distance
Uses the PPO actor-critic method to train a neural network to control a simple robot in the RacingCar environment from OpenAI gym (https://gym.openai.com/envs/RacingCar-v0/). 

![Racing](racing_car.gif)

The **action** space can be continuous or discreet. If **continuous** there are 3 actions :

- 0: steering, -1 is full left, +1 is full right
- 1: gas
- 2: breaking

If **discrete** there are 5 actions:
- 0: do nothing
- 1: steer left
- 2: steer right
- 3: gas
- 4: brake

For this assignment we should use the continuous action space. 

**Reward** of -0.1 is awarded every frame and +1000/N for every track tile visited, where N is the total number of tiles in track. For example, if you have finished in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points.

And the default **observation** is a single image frame (96 * 96).

### Initialisation

If using Google colab you need to install packages - comment out lines below.

In [None]:
#!apt install swig cmake ffmpeg
#!apt-get install -y xvfb x11-utils
#!pip install stable-baselines3[extra] pyglet box2d box2d-kengz
#!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate

For Google colab comment out this cell to make a virtual rendering canvas so render calls work (we still won't see display!)

In [None]:
#import pyvirtualdisplay
#
#_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
#                                    size=(1400, 900))
#_ = _display.start()

Import required packages. 

In [1]:
import torch 
import gymnasium as gym
import stable_baselines3 as sb3

import pandas as pd # For data frames and data frame manipulation
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)
import numpy as np # For general  numeric operations

import matplotlib.pyplot as plt
%matplotlib inline 

### Create and Explore the Environment

Create the **CarRacing-v2** environment. Add wrappers to resize the images and convert to greyscale.

In [2]:
env = gym.make('CarRacing-v2', 
               render_mode = 'human')
#  useful for reducing the input dimensionality for a machine learning model, which can speed up training.
env = gym.wrappers.resize_observation.ResizeObservation(env, 64)
# This wrapper converts RGB images to grayscale, which simplifies the 
# learning task by reducing the color information that the model needs to process. 
env = gym.wrappers.gray_scale_observation.GrayScaleObservation(env, keep_dim = True)
# This wrapper limits the number of steps in each episode to 1500, after which the episode will automatically end.
env = gym.wrappers.TimeLimit(env, 
                                max_episode_steps = 1500)

Explore the environment - view the action space and observation space.

In [3]:
env.action_space

Box([-1.  0.  0.], 1.0, (3,), float32)

In [4]:
env.observation_space

Box(0, 255, (64, 64, 1), uint8)

In [5]:
# highlighting that this is a continuous action space 
action = env.action_space.sample()
print(action)

[0.9162945  0.2870367  0.03831504]


Play an episode of the environment using random actions

In [6]:

episodes = 1
for episode in range(1, episodes+1):
    obs , _ = env.reset()
    score = 0
    done = False
    truncate = False
    while not done:    
        action = env.action_space.sample()
        obs, reward, done, truncate, info = env.step(action)
        score += reward         
        env.render()
        
    print("Episode:{} Score:{}".format(episode,score))
env.close()

Episode:1 Score:-497.64814814817026


In [7]:
print(reward)

-100


In [8]:
print(action)

[-0.512909    0.24843697  0.548837  ]


### Single Image Agent
Create an agent that controls the car using a single image frame as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False},

We also recommend enabling **tensorboard** monitoring of the training process.

In [9]:
network_args = {'log_std_init': -2, 'ortho_init':False} 

In [10]:
env_train = gym.make('CarRacing-v2')
env_train = gym.wrappers.resize_observation.ResizeObservation(env_train, 64)
env_train = gym.wrappers.gray_scale_observation.GrayScaleObservation(env_train, keep_dim = True)

In [11]:

tb_log = './log_tb_carracing_ppo_final1/'
ppo_agent_model = sb3.PPO('CnnPolicy', 
                env_train, 
                learning_rate = 0.00003,
                n_steps = 512,
                batch_size = 128,
                ent_coef = 0.001,
                gae_lambda = 0.9,
                n_epochs = 20,
                use_sde = True,
                sde_sample_freq = 4,
                clip_range = 0.4,
                verbose=1,
                policy_kwargs = network_args,
                tensorboard_log=tb_log
              )


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.


Examine the actor and critic network architectures.

In [12]:
print(ppo_agent_model.policy)

ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(1, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (pi_features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(1, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (vf_features_extractor): NatureCNN(
    (cnn): 

Create an evaluation callback that is called every at regular intervals and renders the episode.

In [13]:
# ensuring the environments are the same for training
eval_env = ppo_agent_model.get_env()


In [14]:
# another way to male envionmnt 
# eval_env = gym.make('CarRacing-v2', render_mode = 'human') # We use a separate evaluation env in case any wrappers have been used
# eval_env = gym.wrappers.resize_observation.ResizeObservation(eval_env, 64)
# # This wrapper converts RGB images to grayscale, which simplifies the 
# # learning task by reducing the color information that the model needs to process. 
# eval_env = gym.wrappers.gray_scale_observation.GrayScaleObservation(eval_env, keep_dim = True)

In [15]:

eval_callback = sb3.common.callbacks.EvalCallback(eval_env, 
                                                  log_path= './logs_carracing_final1/', 
                                                  eval_freq=5000,
                                                  render=False)

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [16]:
eval_env.observation_space

Box(0, 255, (1, 64, 64), uint8)

In [17]:
eval_env.action_space

Box([-1.  0.  0.], 1.0, (3,), float32)

In [18]:
# # Add code here
ppo_agent_model.learn(total_timesteps=500000,
            callback=eval_callback,
            tb_log_name="ppo_carracing_final"
           )

Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir ./logs_carracing_ppo_final1/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Save the trained agent.

In [19]:
%load_ext tensorboard


In [20]:
# please see ppo_carracing_final_1
%tensorboard --logdir ./log_tb_carracing_ppo_final1/


In [21]:
ppo_agent_model.save("./carracingPPO_agent")

For memory management delete old agent and environment (assumes variable names - change if required).

In [22]:
del ppo_agent_model
del env
del eval_env

### Create Image Stack Agent

Create the CarRacing-v0 environment using wrappers to resize the images to 64 x 64 and change to greyscale. Also add a wrapper to create a stack of 4 frames. 

In [23]:
# Add code here
# Add code here
env = gym.make('CarRacing-v2', render_mode = 'human')

#  useful for reducing the input dimensionality for a machine learning model, which can speed up training.
env = gym.wrappers.resize_observation.ResizeObservation(env, 64)
# This wrapper converts RGB images to grayscale, which simplifies the 
# learning task by reducing the color information that the model needs to process. 
env = gym.wrappers.gray_scale_observation.GrayScaleObservation(env, keep_dim = True)
env  = sb3.common.monitor.Monitor(env)
env= sb3.common.vec_env.DummyVecEnv([lambda: env])
env = sb3.common.vec_env.VecFrameStack(env, 
                                       n_stack=4)
env = gym.wrappers.TimeLimit(env, 
                                max_episode_steps = 1500)

In [24]:
env.action_space

Box([-1.  0.  0.], 1.0, (3,), float32)

In [25]:
env.observation_space

Box(0, 255, (64, 64, 4), uint8)

Create an agent that controls the car using a stack of input image frames as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False},

We also recommend enabling **tensorboard** monitoring of the training process.

In [26]:
env_train_stack = gym.make('CarRacing-v2')
env_train_stack = gym.wrappers.resize_observation.ResizeObservation(env_train_stack, 64)
env_train_stack = gym.wrappers.gray_scale_observation.GrayScaleObservation(env_train_stack, keep_dim = True)
env_train_stack = sb3.common.monitor.Monitor(env_train_stack)
env_train_stack = sb3.common.vec_env.DummyVecEnv([lambda: env_train_stack])
env_train_stack = sb3.common.vec_env.VecFrameStack(env_train_stack, 
                                       n_stack=4)

In [27]:
# Add code here
tb_log = './log_tb_carracing_task2_final1/'
ppo_agent_model = sb3.PPO('CnnPolicy', 
                env_train_stack, 
                learning_rate = 0.00003,
                n_steps = 512,
                ent_coef = 0.001,
                batch_size = 128,
                gae_lambda = 0.9,
                n_epochs = 20,
                use_sde = True,
                sde_sample_freq = 4,
                clip_range = 0.4,
                verbose=1,
                policy_kwargs = network_args,
                tensorboard_log=tb_log
              )

Using cpu device
Wrapping the env in a VecTransposeImage.


Examine the actor and critic network architectures.

In [28]:
# Add code here
print(ppo_agent_model.policy)

ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (pi_features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (vf_features_extractor): NatureCNN(
    (cnn): 

In [29]:
eval_env = ppo_agent_model.get_env()


In [30]:
# # Add code here
#Another way tomkae eval env 
# eval_env = gym.make('CarRacing-v2', render_mode = 'human') # We use a separate evaluation env in case any wrappers have been used
# eval_env = gym.wrappers.resize_observation.ResizeObservation(eval_env, 64)
# eval_env = gym.wrappers.gray_scale_observation.GrayScaleObservation(eval_env, keep_dim = True)
# eval_env = sb3.common.monitor.Monitor(eval_env)
# eval_env = sb3.common.vec_env.DummyVecEnv([lambda: eval_env])
# eval_env = sb3.common.vec_env.VecFrameStack(eval_env, 
#                                        n_stack=4)


Create an evaluation callback that is called every at regular intervals and renders the episode.

In [31]:
# Add code here
eval_callback = sb3.common.callbacks.EvalCallback(eval_env, 
                                                  log_path= './logs_carracing_task2/_final1', 
                                                  eval_freq= 5000,
                                                  render=False)

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [32]:
# Add code here
ppo_agent_model.learn(total_timesteps=500000,
            callback=eval_callback,
            tb_log_name="ppo_carracing_task2_final1"
           )

Logging to ./log_tb_carracing_task2_final1/ppo_carracing_task2_final1_4
----------------------------
| time/              |     |
|    fps             | 150 |
|    iterations      | 1   |
|    time_elapsed    | 3   |
|    total_timesteps | 512 |
----------------------------
--------------------------------------
| rollout/                |          |
|    ep_len_mean          | 1e+03    |
|    ep_rew_mean          | -59.9    |
| time/                   |          |
|    fps                  | 77       |
|    iterations           | 2        |
|    time_elapsed         | 13       |
|    total_timesteps      | 1024     |
| train/                  |          |
|    approx_kl            | 2.923398 |
|    clip_fraction        | 0.724    |
|    clip_range           | 0.4      |
|    entropy_loss         | 1.55     |
|    explained_variance   | 0.000187 |
|    learning_rate        | 3e-05    |
|    loss                 | 0.364    |
|    n_updates            | 20       |
|    policy_gradient_lo

<stable_baselines3.ppo.ppo.PPO at 0x314afced0>

Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir ./log_tb_carracing_task2_final1/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

In [33]:
%load_ext tensorboard


The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [34]:
%tensorboard --logdir ./log_tb_carracing_task2_final1/


Save the trained agent.

In [35]:
ppo_agent_model.save("./carracingPPO_agent_stack")

For memory management delete old agent and environment (assumes variable names - change if required).

In [36]:
del ppo_agent_model
del env
del eval_env

### Evaluation

#### Load the single image saved agent

In [100]:
ppo_agent_model = sb3.ppo.PPO.load("./carracingPPO_agent")

Setup the single image environment for evaluation.

In [98]:
eval_env = gym.make('CarRacing-v2', render_mode= "human") # We use a separate evaluation env in case any wrappers have been used
eval_env = gym.wrappers.resize_observation.ResizeObservation(eval_env, 64)
# This wrapper converts RGB images to grayscale, which simplifies the 
# learning task by reducing the color information that the model needs to process. 
eval_env = gym.wrappers.gray_scale_observation.GrayScaleObservation(eval_env, keep_dim = True)


Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [88]:
ppo_agent_model.set_env(eval_env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.


In [57]:
# Add code here
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(ppo_agent_model, 
                                                                eval_env, 
                                                                n_eval_episodes=30,
                                                               render = True)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))

Mean Reward: -82.97379410862922 +/- 1.3005900032994788


For memory management delete the single image agent (assumes variable names - change if required).

In [101]:
del ppo_agent_model
del eval_env

#### Load the image stack agent

In [77]:
# Add code here 
ppo_agent_model = sb3.ppo.PPO.load("./carracingPPO_agent_stack")

Set up the image stack environment

In [82]:
# Add code here
eval_env = gym.make('CarRacing-v2' ,render_mode= "human")
eval_env = gym.wrappers.resize_observation.ResizeObservation(eval_env, 64)
eval_env = gym.wrappers.gray_scale_observation.GrayScaleObservation(eval_env, keep_dim = True)
eval_env = sb3.common.monitor.Monitor(eval_env)
eval_env = sb3.common.vec_env.DummyVecEnv([lambda: eval_env])
eval_env = sb3.common.vec_env.VecFrameStack(eval_env, 
                                       n_stack=4)


In [83]:
ppo_agent_model.set_env(eval_env)

Wrapping the env in a VecTransposeImage.


Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [84]:
# Add code here
# testing model 
episodes = 10
scores_array = []
timestep_arr = []
for episode in range(1, episodes+1):
    obs = eval_env.reset()  #state = env.reset()
    done = False
    score = 0
    timestep = 0
    
    while not done:
        action , _ = ppo_agent_model.predict(obs) 
        obs, reward, done, info = eval_env.step(action) 
        score += reward
        timestep += 1
        eval_env.render()
    scores_array.append(score)
    timestep_arr.append(timestep)
    print("Episode:{} Score:{}".format(episode,score))
eval_env.close()

Episode:1 Score:[-70.32597]
Episode:2 Score:[-68.84695]
Episode:3 Score:[-67.845276]
Episode:4 Score:[-65.03464]
Episode:5 Score:[-39.799084]
Episode:6 Score:[-32.584015]
Episode:7 Score:[-43.52133]
Episode:8 Score:[-44.61514]
Episode:9 Score:[-39.79908]
Episode:10 Score:[-42.307438]


In [81]:
# Add code here
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(ppo_agent_model, 
                                                                eval_env, 
                                                                n_eval_episodes=30,
                                                               render = True)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))

Mean Reward: -30.90001066666667 +/- 21.29935653787396


In [85]:
del ppo_agent_model
del eval_env

### Training Better Model for Longer 

In [62]:
ppo_agent_model = sb3.ppo.PPO.load("./carracingPPO_agent_stack")

In [68]:
# Add code here
eval_env = gym.make('CarRacing-v2' ,render_mode= "human")
eval_env = gym.wrappers.resize_observation.ResizeObservation(eval_env, 64)
eval_env = gym.wrappers.gray_scale_observation.GrayScaleObservation(eval_env, keep_dim = True)
eval_env = sb3.common.monitor.Monitor(eval_env)
eval_env = sb3.common.vec_env.DummyVecEnv([lambda: eval_env])
eval_env = sb3.common.vec_env.VecFrameStack(eval_env, 
                                       n_stack=4)


In [69]:
ppo_agent_model.set_env(eval_env)

Wrapping the env in a VecTransposeImage.


In [74]:
ppo_agent_model.learn(total_timesteps = 500000, 
            reset_num_timesteps = False)

Logging to ./log_tb_carracing_task2_final1/PPO_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | -6.45    |
| time/              |          |
|    fps             | 45       |
|    iterations      | 1        |
|    time_elapsed    | 11       |
|    total_timesteps | 2001408  |
---------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | -6.45      |
| time/                   |            |
|    fps                  | 36         |
|    iterations           | 2          |
|    time_elapsed         | 28         |
|    total_timesteps      | 2001920    |
| train/                  |            |
|    approx_kl            | 0.03934592 |
|    clip_fraction        | 0.0619     |
|    clip_range           | 0.4        |
|    entropy_loss         | -20.6      |
|    explained_variance   | 0.803    

<stable_baselines3.ppo.ppo.PPO at 0x1745e9e90>

In [75]:
ppo_agent_model.save("./carracingPPO_agent_stack")

### Reflection

Reflect on which  agent performs better at the task, and the training process involved (max 200 words).

Initially, my model, trained over 500,000 timesteps, showed poor performance with both single and stacked image agents yielding low rewards. 

##### Results on 500000 timesteps 
1. My evaluation for single image agent [Mean Reward: -92.95956816666668 +/- 0.7243045979982494].
2. My evaluation for stack image agent [Mean Reward: -83.51077016666666 +/- 1.5817448422319385].

##### Which agent performed better (1000000 timesteps)?
Upon increasing the timesteps to 1,000,000, the stacked image agent slightly outperformed the single image agent, demonstrating better mean rewards. My stacked image agent performed better.

1. My evaluation for single image agent [Mean Reward: -82.97379410862922 +/- 1.3005900032994788]
2. My evaluation for stack image agent[ Mean Reward: -80.95164756666667 +/- 1.6859319366412475]

Subsequent training of the stacked image agent for 2,500,000 timesteps significantly improved its performance, achieving a mean reward of -30.90 with a higher standard deviation, indicating varying performance across episodes. The extensive training required considerable computational resources and time, often needing overnight runs, which slowed down the ability to make and test adjustments rapidly.  [Mean Reward: -30.90001066666667 +/- 21.29935653787396]

##### Considerations for the future  
1. balancing exploration and exploitation remains a crucial challenge in RL.
2. Techniques such as pixel normalization and reward standardization could be implemented to manage the broad range of reward values and reduce the dimensionality of the observation space, potentially enhancing model performance.
3. Longer training times. More training results in better performance.
4. Better computation power. CPU constraints.

##### References
1. https://agents.inf.ed.ac.uk/blog/reinforcement-learning-implementation-tricks/#:~:text=Frame%20stacking,obtained%20from%20a%20single%20image.
2. https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292
3. https://www.fool.com/terms/r/reinforcement-learning/#:~:text=Of%20course%2C%20there%20are%20downsides,of%20the%20positive%20value%20description.