[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/WUR-AI/Advanced-Machine-Learning-Course/blob/master/Advanced_machine_learning_RL_exercise.ipynb)


# Advanced machine learning - Reinforcement Learning Exercise

### Training a DRL agent

Welcome to the reinforcement learning exercises! In this exercise we will train two popular deep reinforcement learning (DRL) agents that you have learned about in the course. This is the time to put that knowledge into practice!

In the notebook, you will see a couple of ToDos. Try your best to work through them, and don't hesitate to ask for help!

#### Import and install required libraries

In [None]:
!pip install -q swig
!pip install -q gymnasium[box2d]
!pip install -q pyglet

# install Stable-Baselines3, which provides the DQN and PPO implementations
!pip install -q "stable_baselines3[extra]"

# clone the course repository, which contains a trained DQN agent
!git clone https://github.com/WUR-AI/Advanced-Machine-Learning-Course.git

The cell below imports the libraries that will be used to train our RL agent, along with some additional packages for visualizing the agent in action. The extra packages are needed because Google Colab doesn't natively support displaying the agent with render_mode="human", so we record a video of the episode and play it back instead.

In [110]:
import base64
import io
from IPython.display import HTML
import glob
import numpy as np
import time
import matplotlib.pyplot as plt
import random
import typing
import uuid
from stable_baselines3 import DQN, PPO
import gymnasium as gym
from gymnasium.wrappers import RecordVideo

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

#Set the seed for reproducibility
np.random.seed(0)
random.seed(0)


def visualize(model, env):
    video_folder = f"videos/{uuid.uuid4()}"
    env = RecordVideo(env, video_folder=video_folder, episode_trigger=lambda e: True)
    obs, _ = env.reset(seed=10)
    terminated, truncated = False, False

    while not (terminated or truncated):
        if isinstance(model, DQN) or isinstance(model, PPO):
            action, _state = model.predict(obs, deterministic=True)
            action = int(action)  # In case it's a NumPy array
        elif model == 'random':
            action = env.action_space.sample()
        else:
            raise ValueError(f"Model {model} is not supported")

        obs, reward, terminated, truncated, _ = env.step(action)

    env.close()

    mp4list = glob.glob(f'{video_folder}/*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        return HTML(data=f'''
            <video width="640" height="480" controls>
                <source src="data:video/mp4;base64,{encoded.decode('ascii')}" type="video/mp4" />
            </video>''')
    else:
        return "No video found."




## Gymnasium and the rocket landing problem

In the following exercise we will train an agent to land a rocket on the moon, using the Lunar Lander environment from Gymnasium. The task is to control the rocket's thrusters so that it lands gently on the pad while being pulled down by the moon's gravity. There are three engines available: the left, the right and the main (middle) engine. The agent is rewarded on every timestep based on several factors: how far it is from the landing pad, how fast it is moving, and the tilt angle of the rocket. It also receives a negative reward each time an engine is fired, which discourages it from using the thrusters too much.

More information about the rocket landing environment is available [here](https://gymnasium.farama.org/environments/box2d/lunar_lander/).

### Create the environment

Creating an environment with the gymnasium package is relatively easy:

In [18]:
env_name = "LunarLander-v3"  #pre-made moon landing environment from gymnasium

env = gym.make(env_name, render_mode="rgb_array")

#Set the seed
env.action_space.seed(42)

42

### Check the properties of the environment

It's always important to be familiar with the environment of an RL problem. Here, we look into the action space and the observation space. Check [here](https://gymnasium.farama.org/introduction/basic_usage/#action-and-observation-spaces) for a description of the different spaces in Gymnasium.

In [6]:
print(f'The action space is {env.action_space}')
print(f'The observation space is of {type(env.observation_space)} with shape {env.observation_space.shape} and contains {env.observation_space.dtype}')

The action space is Discrete(4)
The observation space is of <class 'gymnasium.spaces.box.Box'> with shape (8,) and contains float32


As we can see above, the action space is discrete, with a continuous observation space.
The action space consists of 4 discrete actions:


*   0: do nothing
*   1: fire left engine
*   2: fire main engine
*   3: fire right engine

The observation space is an 8-dimensional vector made up of 6 continuous values and 2 booleans.

To make the observation space a bit clearer, we can look at the range of each entry by extracting its minimum and maximum values.


In [7]:
low, high = env.observation_space.low, env.observation_space.high
print(f'The lower values of the observation space is:\n{low}\n\n and the upper values are \n{high}')

The lower values of the observation space is:
[ -2.5        -2.5       -10.        -10.         -6.2831855 -10.
  -0.         -0.       ]

 and the upper values are 
[ 2.5        2.5       10.        10.         6.2831855 10.
  1.         1.       ]


Here, we can see the value ranges of the observation space. The x and y coordinate ranges are $[-2.5, 2.5]$ (the landing pad is at $(0,0)$), the linear velocity ranges (in x and y) are $[-10, 10]$, the angle is in $[-2\pi, 2\pi]$ and the angular velocity is in $[-10, 10]$. The last two are booleans that represent the contact of the legs with the ground when landing.
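To keep track of which entry is which, the description above can be turned into a small lookup. This is only a sketch: the field names are hypothetical labels we made up (the environment returns a plain array), the leg flags are numbered rather than called left/right because the order is not stated here, and the example values are illustrative rather than taken from a real episode.

```python
import numpy as np

# Hypothetical field names, following the order described in the text.
OBS_FIELDS = (
    "x", "y", "vx", "vy", "angle", "angular_velocity",
    "leg_1_contact", "leg_2_contact",
)


def describe_observation(obs):
    """Pair each of the 8 observation entries with a readable name."""
    obs = np.asarray(obs, dtype=np.float32)
    if obs.shape != (len(OBS_FIELDS),):
        raise ValueError(f"expected shape (8,), got {obs.shape}")
    # The first six entries are continuous values.
    named = {name: float(value) for name, value in zip(OBS_FIELDS[:6], obs[:6])}
    # The last two entries are 0.0/1.0 flags, so report them as booleans.
    named.update(
        {name: bool(value) for name, value in zip(OBS_FIELDS[6:], obs[6:])}
    )
    return named


# Illustrative values only: hovering above the pad, one leg touching.
example = describe_observation([0.0, 1.4, 0.0, -0.44, 0.0, 0.0, 0.0, 1.0])
for name, value in example.items():
    print(f"{name:>17}: {value}")
```

Passing the array returned by env.reset() or env.step() to describe_observation should give the same kind of labelled view.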

We can reset the environment to the start of an episode with this line of code:

In [8]:
env.reset()

(array([-3.4141540e-05,  1.4010243e+00, -3.4739270e-03, -4.3981442e-01,
         4.6357098e-05,  7.8687706e-04,  0.0000000e+00,  0.0000000e+00],
       dtype=float32),
 {})

The output shows the state of the environment at the start of an episode. Next, we can sample a random action from the action space with the following line of code. Try running it a few times to see which actions come up.

In [13]:
env.action_space.sample()

np.int64(1)

#### TODO 1:


*   When sampling the environment actions, what does it mean when it shows the number 3?


In [None]:
# write your answer to the TODO here

## Seeing a random agent in action

To get even more familiar with the environment, we will watch our agent on screen. With the function visualize() (defined in the 2nd cell) we can see our agent in action. We will call it with 'random', meaning the agent will take random actions. Click the play button in the video to see your agent in action.

In [None]:
visualize('random', env)

You will most likely have seen the agent fail miserably to land the rocket, or just fly off screen - never to be seen again. Hence, we need an agent with some intelligence to land the rocket. Next, we will train the agent with two fundamental RL algorithms: DQN and PPO.

## To conclude:

The important functions of the environment are as follows:
- **env.reset():**
    Resets the environment and returns the initial observation (plus an info dictionary)
- **env.step(action):**
    Applies an action to the environment. It outputs the next state, reward, terminated, truncated, and info
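Put together, these two functions form the basic interaction loop. The sketch below shows that loop with a tiny stand-in environment, CountdownEnv, which is hypothetical and exists only so the example runs without Box2D. It copies the reset/step signatures listed above. The helper name run_episode is also our own.

```python
class CountdownEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step signatures."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.remaining = horizon

    def reset(self, seed=None):
        self.remaining = self.horizon
        return self.remaining, {}  # (observation, info)

    def step(self, action):
        self.remaining -= 1
        reward = 1.0 if action == 1 else 0.0
        terminated = self.remaining == 0  # the task itself has ended
        truncated = False                 # no external time limit in this toy
        return self.remaining, reward, terminated, truncated, {}


def run_episode(env, policy, seed=None):
    """Play one episode and return (total reward, number of steps)."""
    obs, _info = env.reset(seed=seed)
    total_reward, steps = 0.0, 0
    terminated = truncated = False
    while not (terminated or truncated):
        action = policy(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        total_reward += float(reward)
        steps += 1
    return total_reward, steps


total, length = run_episode(CountdownEnv(horizon=5), policy=lambda obs: 1)
print(total, length)  # 5.0 5
```

With the Lunar Lander environment created earlier, run_episode(env, lambda obs: env.action_space.sample()) should play one episode with random actions.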

## Training a Q-learning agent

Let's go into the meat of the problem: training an agent with a deep Q learning method.
In this exercise, we will use the stable_baselines3 implementation of the DQN algorithm.

You have learned the theory behind the DQN algorithm in class. In essence, the idea behind Q-learning is that if we had a function
$Q^*: State \times Action \rightarrow \mathbb{R}$, that could tell
us what our return would be, if we were to take an action in a given
state, then we could easily construct a policy that maximizes our
rewards:

\begin{align}\pi^*(s) = \arg\!\max_a \ Q^*(s, a)\end{align}

For our training update rule, we'll use the fact that every $Q$
function for some policy obeys the Bellman equation:

\begin{align}Q^{\pi}(s, a) = r + \gamma Q^{\pi}(s', \pi(s'))\end{align}
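To make the Bellman equation concrete: for the greedy policy $\pi^*$, the term $Q(s', \pi(s'))$ becomes $\max_{a'} Q(s', a')$, and the right-hand side can be used as a regression target for $Q(s, a)$. The snippet below is a minimal NumPy sketch of that target for a small batch. The function name td_targets, the handling of terminal states through a done flag, and all numbers are our own illustrative assumptions, not the internals of stable_baselines3.

```python
import numpy as np


def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Compute r + gamma * max_a' Q(s', a') for a batch of transitions.

    next_q_values has shape (batch, n_actions). For terminal transitions
    (done=True) there is no next state, so the target is just the reward.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q_values = np.asarray(next_q_values, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    best_next = next_q_values.max(axis=1)  # value of the greedy next action
    return rewards + gamma * (1.0 - dones) * best_next


# Two made-up transitions with 4 actions each, as in Lunar Lander.
rewards = [1.0, -0.5]
next_q = [[0.0, 2.0, 1.0, -1.0],
          [3.0, 0.5, 0.0, 1.0]]
dones = [False, True]

targets = td_targets(rewards, next_q, dones, gamma=0.9)
# First target:  1.0 + 0.9 * 2.0 = 2.8
# Second target: -0.5 (terminal, so the next-state value is ignored)
print(targets)
```

Training then amounts to nudging the network's prediction $Q(s, a)$ towards these targets.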

OK, fun equations, right? To move forward, we will train a DQN agent for 10,000 steps.

In [None]:
from stable_baselines3 import DQN, PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Train a basic DQN agent without changing its hyperparameters
model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)

After training, we can try evaluating (testing) the policy with the line of code below. Testing in RL means we will plop a learned agent in its environment and let it run while we record the rewards it obtains. We will test with 10 episodes:

In [None]:
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"The mean reward is {mean_reward} and the standard deviation of the reward is {std_reward}")

How is the average reward? Either way, now that the agent has trained for around 100 episodes, we can watch it in action:

In [None]:
visualize(model, env)

How did the agent do? Most likely not so well. We can try to tweak the training hyperparameters: the number of steps, the exploration rate and the policy networks of the RL agent.


Note: the number of steps here counts every call to env.step(action), which is different from the number of episodes.

#### TODO 2:
Our trained DQN consists of two neural networks: a Q network and a target Q network. Given that we have an observation space of size 8, and an action space of size 4, what would be the number of inputs in the first layer of the networks? You can check your answer by printing the networks as shown below

In [None]:
print(model.policy)

Once you've understood the inputs and outputs, let's move on to the next task!

* Fill in the three hyperparameters below: (1) the number of training steps, (2) the fraction of the whole training during which the agent will be in "explore" mode - i.e. doing random actions, and (3) the final random action probability. Choose values that you think will let the agent find a good policy.
Justify your choices!
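To get a feeling for how hyperparameters (2) and (3) interact, here is a small sketch of an exploration schedule. It assumes the random-action probability decays linearly from an initial value to the final value over the exploration fraction and then stays constant, which is our understanding of the DQN implementation used here. The function epsilon_at and the numbers passed to it are illustrative placeholders, not suggested answers.

```python
def epsilon_at(step, total_steps, exploration_fraction,
               initial_eps=1.0, final_eps=0.05):
    """Random-action probability at a given training step.

    Assumes a linear decay from initial_eps to final_eps during the first
    exploration_fraction * total_steps steps, and final_eps afterwards.
    """
    decay_steps = exploration_fraction * total_steps
    if decay_steps <= 0:
        return final_eps
    progress = min(step / decay_steps, 1.0)  # 0 at the start, 1 once decayed
    return initial_eps + progress * (final_eps - initial_eps)


# Placeholder settings: 10,000 steps, 10% of them spent decaying epsilon.
for step in (0, 500, 1_000, 5_000):
    eps = epsilon_at(step, total_steps=10_000, exploration_fraction=0.1)
    print(f"step {step:>5}: epsilon = {eps:.3f}")
```

With these placeholder numbers the agent stops decaying epsilon after only 1,000 steps, which is worth keeping in mind when you choose the total number of steps.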

In [76]:
#hyperparameters to tweak

n_steps =
hyperparams_dqn = {'exploration_fraction': ,
'exploration_final_eps':
}

In [77]:
#Code for checking the agent's performance
import os
from stable_baselines3.common import results_plotter
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.results_plotter import load_results, ts2xy, plot_results
from stable_baselines3.common.noise import NormalActionNoise
from stable_baselines3.common.callbacks import BaseCallback


# Callback taken from https://stable-baselines3.readthedocs.io/en/master/guide/examples.html
class SaveOnBestTrainingRewardCallback(BaseCallback):
    """
    Callback for saving a model (the check is done every ``check_freq`` steps)
    based on the training reward (in practice, we recommend using ``EvalCallback``).

    :param check_freq: Number of calls to ``_on_step`` between two checks.
    :param log_dir: Path to the folder where the model will be saved.
      It must contain the file created by the ``Monitor`` wrapper.
    :param verbose: Verbosity level: 0 for no output, 1 for info messages, 2 for debug messages
    """

    def __init__(self, check_freq: int, log_dir: str, verbose: int = 1):
        super(SaveOnBestTrainingRewardCallback, self).__init__(verbose)
        self.check_freq = check_freq
        self.log_dir = log_dir
        self.save_path = os.path.join(log_dir, "best_model")
        self.best_mean_reward = -np.inf

    def _init_callback(self) -> None:
        # Create folder if needed
        if self.save_path is not None:
            os.makedirs(self.save_path, exist_ok=True)

    def _on_step(self) -> bool:
        if self.n_calls % self.check_freq == 0:

            # Retrieve training reward
            x, y = ts2xy(load_results(self.log_dir), "timesteps")
            if len(x) > 0:
                # Mean training reward over the last 100 episodes
                mean_reward = np.mean(y[-100:])
                if self.verbose >= 1:
                    print(f"Num timesteps: {self.num_timesteps}")
                    print(
                        f"Best mean reward: {self.best_mean_reward:.2f} - Last mean reward per episode: {mean_reward:.2f}")

                # New best model, you could save the agent here
                if mean_reward > self.best_mean_reward:
                    self.best_mean_reward = mean_reward
                    # Example for saving best model
                    if self.verbose >= 1:
                        print(f"Saving new best model to {self.save_path}")
                    self.model.save(self.save_path)

        return True


# Create log dir
log_dir = "tmp/"
os.makedirs(log_dir, exist_ok=True)

# Create the callback: check every 1000 steps
callback = SaveOnBestTrainingRewardCallback(check_freq=1000, log_dir=log_dir)


In [None]:
# Train the agent with your tweaked hyperparameters
# Add logging to check the agent's performance during training

# Make the environment
env_name = "LunarLander-v3"
env = gym.make(env_name, render_mode="rgb_array")
env = Monitor(env, log_dir)

# Seed for reproducibility
seed = 5

# Train the DQN agent
model = DQN("MlpPolicy", env, seed=seed, verbose=1, **hyperparams_dqn)
model.learn(total_timesteps=n_steps, callback=callback)

In [None]:
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"The mean reward is {mean_reward} and the standard deviation of the reward is {std_reward}")

In [None]:
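# With return_episode_rewards=True we get per-episode rewards and lengths instead of mean and std
episode_rewards, episode_lengths = evaluate_policy(model, model.get_env(), n_eval_episodes=10, return_episode_rewards=True)
print(f"The reward per episode is {episode_rewards} and the length of each episode is {episode_lengths}")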

In [None]:
plot_results([log_dir], n_steps, results_plotter.X_TIMESTEPS, "DQN LunarLander")
plt.show()
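Raw episode rewards are noisy, which is why the callback above looks at the mean over the last 100 episodes. The same idea can be applied to a whole reward curve as a rolling mean. This is only a sketch: rolling_mean is a hypothetical helper of our own and the reward values are made up.

```python
import numpy as np


def rolling_mean(values, window=100):
    """Average each run of `window` consecutive values (no padding)."""
    values = np.asarray(values, dtype=np.float64)
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(values) < window:
        return np.array([])  # not enough episodes yet
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


episode_rewards = [-200.0, -150.0, -100.0, -50.0, 0.0]  # made-up rewards
print(rolling_mean(episode_rewards, window=2))
```

In the notebook, the per-episode rewards can be read with `x, y = ts2xy(load_results(log_dir), "timesteps")`, as the callback does, and `y` can then be passed to rolling_mean.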

#### TODO 3:

* How did your DQN agent do?
* Do you think the hyperparameters you chose were good enough? Do you think there are [additional parameters](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html#stable_baselines3.dqn.DQN) that are worthwhile to tweak? Describe what you think.
* What do you think is the most important parameter for the environment of the moon lander?

You can continue to train and tweak the agent on your own.
In the following section, you can load a trained agent with optimized parameters.

In [None]:
# Make the environment
env_name = "LunarLander-v3"
env = gym.make(env_name, render_mode="rgb_array")
model = DQN.load("/content/Advanced-Machine-Learning-Course/dqn-LunarLander.zip")

In [None]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"The mean reward is {mean_reward} and the standard deviation of the reward is {std_reward}")

Visualize the loaded model:

In [None]:
visualize(model, env)

We can see the learned DQN agent is able to land the rocket on the moon properly.

## Training a Policy Gradient agent

### About PPO

You might have had a difficult time training the DQN agent due to its sensitivity to hyperparameters. This is partly because DQN estimates action values (Q-values), and in some environments, it tends to overestimate the expected rewards for certain actions. This overestimation can lead to unstable training and divergence from the optimal policy.

**Proximal Policy Optimization (PPO)** addresses this issue by **clipping** the policy update. Instead of allowing large, potentially destabilizing updates to the policy, PPO restricts how much the new policy is allowed to deviate from the old one during each update step.

This is done using the following clipped objective function:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]
$$

Where:

- $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio between the new and old policies.  
- $\hat{A}_t$ is the estimated advantage at time $t$.  
- $\epsilon$ is a small hyperparameter (commonly 0.1 or 0.2) that determines how much the new policy can deviate from the old one.

This clipping mechanism helps ensure **more stable and conservative updates**, avoiding the kind of over-optimistic updates that can occur in DQN.
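The clipped objective can be evaluated by hand for a few samples. The NumPy sketch below does that. The function name clipped_surrogate and the ratio and advantage values are made up for illustration, and this is not the stable-baselines3 implementation.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Return the mean clipped objective and its per-sample terms."""
    ratio = np.asarray(ratio, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    per_sample = np.minimum(unclipped, clipped)  # the min(...) in L^CLIP
    return per_sample.mean(), per_sample


# Made-up probability ratios and advantage estimates for three samples.
ratios = [1.5, 0.5, 1.0]
advantages = [1.0, -1.0, 2.0]

objective, per_sample = clipped_surrogate(ratios, advantages, epsilon=0.2)
# Sample 1: min(1.5 * 1.0, 1.2 * 1.0)   = 1.2  (gain capped by the clip)
# Sample 2: min(0.5 * -1.0, 0.8 * -1.0) = -0.8
# Sample 3: min(1.0 * 2.0, 1.0 * 2.0)   = 2.0  (ratio inside the clip range)
print(per_sample, objective)
```

In the first sample, raising the probability of a good action beyond a ratio of $1 + \epsilon$ no longer increases the objective, so there is no incentive for a larger policy change.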

---

Now that we've covered the theory, let's train the agent using the `stable-baselines3` implementation of PPO. We'll start with 10,000 training steps.


In [None]:
from stable_baselines3 import PPO

# Make the environment
env_name = "LunarLander-v3"
env = gym.make(env_name, render_mode="rgb_array")

seed = 0

# Train the PPO agent
model = PPO("MlpPolicy", env, gamma=0.9, seed=seed, verbose=1)
model.learn(total_timesteps=10_000)

In [None]:
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"The mean reward is {mean_reward} and the standard deviation of the reward is {std_reward}")

Training for 10,000 steps might not be enough for the PPO agent.

#### TODO 4:
* Tweak the [hyperparameters](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#stable_baselines3.ppo.PPO) of the PPO agent. Which parameters do you think are the most suitable for training the moon lander?
* Train a PPO agent, subsequently evaluate the agent and explain the changes you made.

In [None]:
#fill in values of the hyperparameters below

n_steps =
hyperparams_ppo = {'batch_size':,
'n_steps': ,
'gamma':,
'clip_range':
}

In [None]:
# Create log dir
log_dir = "tmp/"
os.makedirs(log_dir, exist_ok=True)

# Create the callback: check every 1000 steps
callback = SaveOnBestTrainingRewardCallback(check_freq=1000, log_dir=log_dir)

# Make the environment
env_name = "LunarLander-v3"
env = gym.make(env_name, render_mode="rgb_array")
env = Monitor(env, log_dir)

seed = 5

# Train the PPO agent
model = PPO("MlpPolicy", env, seed=seed, verbose=1, **hyperparams_ppo)
model.learn(total_timesteps=n_steps, callback=callback)

In [None]:
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"The mean reward is {mean_reward} and the standard deviation of the reward is {std_reward}")

In [None]:
plot_results([log_dir], n_steps, results_plotter.X_TIMESTEPS, "PPO LunarLander")
plt.show()

In [None]:
visualize(model, env)

Your best model will be saved in the files section of Google Colab. You can download this file and run it again, or even continue training from it.

Here you have learned to train an RL agent to land a rocket on the moon with two popular RL algorithms. There are some differences between the two algorithms.

#### TODO 5:
* What are the main differences between Q-learning and policy gradient algorithms?
* What does it mean that DQN learns off-policy and PPO learns on-policy?
* What are your thoughts about when to use either DQN or PPO?