<a href="https://colab.research.google.com/github/eduardofae/RL/blob/main/AT-09/09%20DQN%20lunar%20lander.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DQN on Lunar Lander

**With content from [Neuromatch Academy](https://colab.research.google.com/github/NeuromatchAcademy/course-content-dl/blob/main/projects/ReinforcementLearning/lunar_lander.ipynb)**


## Assignment: Experimenting with DQN Tuning on Lunar Lander

You will see in the video and plots below that the initial performance of the DQN agent with the default hyperparameters is probably poor.

Your task is to find a better combination of hyperparameters for the DQN model to improve both the learning speed and stability.

**Instructions:**

1.  **Hyperparameter Exploration:** Experiment with different combinations of hyperparameters (you can wrap the `DQN` model definition from the code cell below in a function to facilitate experimentation). All hyperparameters can be varied, but pay close attention to:
    *   **Network Architecture (`net_arch`):** Try varying the number of hidden layers and the number of neurons in each layer.
    *   **Learning Rate (`learning_rate`)**
    *   **Batch Size (`batch_size`) and Buffer Size (`buffer_size`):** Experiment with different sizes for the training batch and the replay buffer.
    *   **Exploration Parameters (`exploration_initial_eps`, `exploration_fraction`, `exploration_final_eps`):** Adjust how the agent explores the environment initially and how quickly it transitions to exploiting learned knowledge.
    *   Other parameters like `gamma` and `train_freq` can also be explored.

2.  **Evaluation of Trials:** For each hyperparameter combination you try, train the model for **10,000 timesteps** (by setting `total_timesteps=10000` in the `model.learn()` call). Observe the "ep_rew_mean" in the training logs to get a sense of the learning speed and stability. You can also run the video and plotting cells after each trial to visualize the performance and the reward curve.

3.  **Select Best Configuration:** Based on your trials, identify the hyperparameter combination that shows the best balance of learning speed (reward increasing quickly) and stability (minimal fluctuations in reward).

4.  **Train with Best Configuration:** Once you have found your best configuration, run 5 repetitions of training (for **10,000 timesteps** each) and testing of a model with these hyperparameters. Plot 'shaded' graphs of loss and reward over the course of training (see the example below and Item 2 of the guide [here](https://rll.berkeley.edu/deeprlcoursesp17/docs/plotting_handout.pdf)). Notice the fluctuations in performance, or their absence.<img src='https://learn2learn.net/assets/img/examples/cheetah_fwdbwd_rewards.png' height="300"/>
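
One possible way to prepare the data for such a shaded plot is sketched here. Because episodes end at different timesteps in each repetition, the curves are first interpolated onto a common grid and then averaged. The function name `aggregate_runs` and the synthetic data are illustrative assumptions, not part of the notebook; in practice each `(x, y)` pair could come from `ts2xy(load_results(run_dir), 'timesteps')` with a separate log directory per repetition.

```python
import numpy as np

def aggregate_runs(runs, num_points=100):
    """Interpolate each (timesteps, values) curve onto a shared grid and
    return the grid plus the per-point mean and standard deviation."""
    # Use only the timestep range covered by every run, so no curve is extrapolated.
    start = max(x[0] for x, _ in runs)
    stop = min(x[-1] for x, _ in runs)
    grid = np.linspace(start, stop, num_points)
    curves = np.stack([np.interp(grid, x, y) for x, y in runs])
    return grid, curves.mean(axis=0), curves.std(axis=0)

# Synthetic stand-in for 5 repetitions (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
runs = []
for _ in range(5):
    x = np.sort(rng.choice(np.arange(1, 10_001), size=40, replace=False))
    y = -300 + 0.02 * x + rng.normal(0, 30, size=x.size)
    runs.append((x, y))

grid, mean, std = aggregate_runs(runs)
```

The shaded region can then be drawn with `plt.plot(grid, mean)` followed by `plt.fill_between(grid, mean - std, mean + std, alpha=0.3)`.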

**Submission:**

Answer the questions in the Moodle quiz, and send the required plots and the downloaded .ipynb (or Python project, if you worked offline).

## Understanding TensorBoard



This notebook uses TensorBoard for visualization. To generate the plots for submission, you might need to export the data behind the TensorBoard plots, or write code to parse the logs and generate your own plots. If you're familiar with TensorBoard, skip this section. Otherwise, read on.

TensorBoard is a visualization tool provided with TensorFlow. It allows you to visualize your model's graph, plot quantitative metrics about the execution of your graph, and show additional data like images that pass through the graph.

In the context of training reinforcement learning agents with Stable-Baselines3, TensorBoard is primarily used to visualize various training metrics that are automatically logged, such as:

*   **Episode Rewards:** Shows how the agent's performance (measured by the total reward collected per episode) changes over time.
*   **Episode Length:** Indicates the duration of each episode in terms of timesteps.
*   **Loss:** Shows the value of the loss function during training. For DQN, it measures how far the predicted Q-values are from their bootstrapped targets.
*   **Learning Rate:** Tracks the learning rate schedule if one is used.

**How to use the controls:**

When you launch TensorBoard, you will see a web interface. Key controls and features include:

*   **Scalars:** This is where you'll find the plots for metrics like episode reward, episode length, and loss.
*   **Runs:** If you train your model multiple times with different hyperparameters or seeds, each training run will appear here. You can select or deselect runs to compare their performance on the same plot.
*   **Smoothing:** On each scalar plot, there is usually a "Smoothing" slider. This slider controls how much the raw data is smoothed.
    *   **Raw Values (Smoothing at 0):** Shows the exact value of the metric at each logged step. This can appear very noisy, especially in the early stages of training.
    *   **Smoothed Values (Smoothing > 0):** Applies a moving average or other smoothing technique to the data. This helps to see the overall trend of the metric, making it easier to assess learning speed and stability. You can adjust the slider to see varying degrees of smoothing.
*   **Zoom and Pan:** You can usually zoom in on specific areas of the plots and pan around to examine details.
*   **Download Data:** You can often download the raw data for a plot in CSV or JSON format for further analysis.

By examining the plots in TensorBoard, particularly the episode rewards and loss with varying levels of smoothing, you can gain insights into your agent's learning process, identify whether it is improving, and assess the stability of the training.
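
To get a feel for what the smoothing slider does, here is a minimal sketch of exponential smoothing with a correction for the first few steps. It assumes the slider value acts as the weight of an exponential moving average; TensorBoard's exact implementation may differ, and the function name `smooth` and the sample values are illustrative only.

```python
def smooth(values, weight):
    """Exponentially smooth a sequence; weight=0 returns the raw values,
    weights closer to 1 smooth more strongly."""
    if not 0 <= weight < 1:
        raise ValueError("weight must be in [0, 1)")
    smoothed = []
    running = 0.0
    for step, value in enumerate(values, start=1):
        running = weight * running + (1 - weight) * value
        # Divide by (1 - weight**step) to correct the bias toward the
        # zero initial value during the first few steps.
        smoothed.append(running / (1 - weight ** step))
    return smoothed

# Hypothetical episode rewards, for illustration only.
raw = [-200.0, -50.0, -180.0, -20.0, -120.0, 10.0]
print(smooth(raw, 0.0))  # identical to the raw values
print(smooth(raw, 0.6))  # same overall trend, smaller swings
```

The same function can be applied to exported reward or loss values when building your own plots.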

---
# Setup

Installs and imports packages, then defines helper functions to record and play videos.

In [None]:
# @title Install required packages
!pip install swig --quiet # SWIG is a development tool that connects programs written in C and C++ with a variety of high-level programming languages. It's used by Box2D.
!pip install gymnasium[box2d] --quiet # Gymnasium is a fork of OpenAI Gym, providing environments for reinforcement learning. Box2D is a 2D physics engine used in the Lunar Lander environment.
!pip install 'stable-baselines3[extra]' --quiet # Stable-Baselines3 is a set of reliable implementations of reinforcement learning algorithms. The '[extra]' includes additional dependencies like rendering.
!pip install pyvirtualdisplay --quiet # pyvirtualdisplay is used to create a virtual display, which is necessary for rendering the environment in a Colab environment.
!pip install tensorboard --quiet    # allows monitoring the training

In [None]:
# @title Imports
import io
import os
import sys
import torch
import base64

import numpy as np
import matplotlib.pyplot as plt

import gymnasium as gym

import stable_baselines3
from stable_baselines3 import DQN
from stable_baselines3.common.results_plotter import ts2xy, load_results  # plotting utilities: load_results loads the training logs and ts2xy converts them into x and y coordinates
from stable_baselines3.common.callbacks import EvalCallback  # periodically evaluates the agent during training and logs the results

In [None]:
# @title Play Video function
from IPython.display import HTML
from base64 import b64encode
from pyvirtualdisplay import Display

# create the directory to store the video(s)
os.makedirs("./video", exist_ok=True)

display = Display(visible=False, size=(1400, 900))
_ = display.start()

# Utility function for displaying a recorded video of a gym environment.
def render_mp4(videopath: str) -> str:
  """
  Gets a string containing an HTML video tag with a base64-encoded
  version of the MP4 video at the specified path.
  """
  with open(videopath, 'rb') as video_file:
    mp4 = video_file.read()
  base64_encoded_mp4 = b64encode(mp4).decode()
  return f'<video width=400 controls><source src="data:video/mp4;' \
         f'base64,{base64_encoded_mp4}" type="video/mp4"></video>'

In [None]:
# @title Record Video function
def record_and_display_video(env_name, model, video_name, num_episodes=1):
    """
    Records a video of the agent performing in the environment and displays it.

    Args:
        env_name (str): The name of the environment.
        model (stable_baselines3.DQN): The trained model.
        video_name (str): The name to use for the video file.
        num_episodes (int): The number of episodes to record (default is 1).
    """
    # create the directory to store the video(s)
    os.makedirs("./video", exist_ok=True)

    # Use a virtual display for rendering
    display = Display(visible=False, size=(1400, 900))
    _ = display.start()

    env = gym.make(env_name, render_mode="rgb_array")
    env = gym.wrappers.RecordVideo(
        env,
        video_folder="video",
        name_prefix=f"{env_name}_{video_name}",
        episode_trigger=lambda episode_id: episode_id < num_episodes
    )

    observation, _ = env.reset()
    total_reward = 0
    done = False
    episode_count = 0

    while not done:
        action, states = model.predict(observation, deterministic=True)
        observation, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        total_reward += reward
        if done:
            episode_count += 1
            if episode_count < num_episodes:
                observation, _ = env.reset()
                done = False


    env.close()
    display.stop() # Stop the virtual display

    print(f"\nTotal reward: {total_reward}")

    # show video
    html = render_mp4(f"video/{env_name}_{video_name}-episode-0.mp4")
    return HTML(html)

* * *
# Introduction



In a standard RL setting, an agent learns optimal behavior from an environment through a feedback mechanism to maximize a given objective. Many algorithms have been proposed in the RL literature that an agent can apply to learn the optimal behavior. One popular algorithm is the Deep Q-Network (DQN), which uses deep neural networks to estimate action values, from which actions are chosen. In this project, your goal is to understand how hyperparameters, such as the number of neural network layers, affect the algorithm's performance. Performance can be evaluated through two metrics: speed and stability.

**Speed:** How fast the algorithm reaches the maximum possible reward.

**Stability:** In some applications (especially when online learning is involved), the stability of the algorithm, i.e., minimal fluctuations in performance, is as important as speed.

In this project, you do not have to write the DQN code from scratch. You only have to tune the hyperparameters (neural network size, learning rate, etc), observe the performance, and analyze.

The chosen RL task is Lunar Lander. The task consists of a lander and a landing pad marked by two flags. The episode starts with the lander moving downwards due to gravity. The objective is to use the lander's engines to land safely on the landing pad with zero speed, as quickly and fuel-efficiently as possible. The reward for moving from the top of the screen and landing on the pad with zero speed is between 100 and 140 points. Each leg's ground contact yields 10 points. Firing the main engine costs 0.3 points per frame, and firing a side engine costs 0.03 points per frame. An additional reward of -100 or +100 points is received if the lander crashes or comes to rest, respectively; either event also ends the episode.
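
As a worked example of the arithmetic, the sketch below adds up only the event-based reward terms listed in the paragraph above. The position- and speed-dependent part of the reward is deliberately left out, and the function name `event_reward` and the episode numbers are hypothetical.

```python
def event_reward(main_engine_frames, side_engine_frames, leg_contacts, outcome):
    """Sum the event-based reward terms for one episode.

    outcome: "crash", "rest", or None if the episode was cut off.
    The position- and speed-dependent part of the reward is not modelled.
    """
    terminal = {"crash": -100.0, "rest": 100.0, None: 0.0}[outcome]
    return (
        -0.3 * main_engine_frames
        - 0.03 * side_engine_frames
        + 10.0 * leg_contacts
        + terminal
    )

# A gentle landing that used the main engine for 100 frames and a side engine for 50.
print(event_reward(main_engine_frames=100, side_engine_frames=50, leg_contacts=2, outcome="rest"))
# A crash after 300 frames of main-engine thrust.
print(event_reward(main_engine_frames=300, side_engine_frames=0, leg_contacts=0, outcome="crash"))
```

The comparison shows why long engine burns are costly: 300 frames of main-engine thrust alone remove 90 points.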

The input state of the Lunar Lander consists of the following components:

  1. Horizontal Position
  2. Vertical Position
  3. Horizontal Velocity
  4. Vertical Velocity
  5. Angle
  6. Angular Velocity
  7. Left Leg Contact
  8. Right Leg Contact

The actions of the agent (listed in the order of Gymnasium's action indices 0 to 3) are:
  1. Do Nothing
  2. Fire Left Engine
  3. Fire Main Engine
  4. Fire Right Engine
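
When inspecting observations by hand, it can help to attach names to the eight state components. The sketch below does this with a `NamedTuple`, assuming the observation vector follows the order of the state list above; the class and function names are illustrative and not part of Gymnasium.

```python
from typing import NamedTuple, Sequence

class LanderState(NamedTuple):
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool

def decode_observation(obs: Sequence[float]) -> LanderState:
    """Attach names to the 8 observation entries, in the order listed above."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    *continuous, left, right = obs
    return LanderState(*(float(v) for v in continuous), bool(left), bool(right))

# Hypothetical observation: lander high up, drifting right, falling, no leg contact.
state = decode_observation([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 0.0])
print(state.vy, state.left_leg_contact)
```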


<img src="https://raw.githubusercontent.com/NeuromatchAcademy/course-content-dl/main/projects/static/lunar_lander.png">

# TensorBoard

TensorBoard will start empty because no data has been logged yet. As training progresses, click the refresh button in TensorBoard to load new data.

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
# Specify the log directory and launch TensorBoard
log_dir = "/tmp/gym/" # Make sure this matches the log_dir used in your training code
os.makedirs(log_dir, exist_ok=True)
%tensorboard --logdir {log_dir}

* * *
# DQN Implementation from Stable Baselines 3



We will now run the DQN algorithm from Stable-Baselines3, a set of consolidated implementations of reinforcement learning algorithms in PyTorch. It provides a straightforward way to train and evaluate various RL agents, including DQN.

The model accepts various hyperparameters, and you'll be playing with some.
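
To see how the three exploration parameters interact, the sketch below computes epsilon at a given training step. It assumes a linear decay from `exploration_initial_eps` to `exploration_final_eps` over the first `exploration_fraction` of training, which is how the Stable-Baselines3 documentation describes these parameters; the function name `epsilon_at` and the example values are illustrative.

```python
def epsilon_at(step, total_timesteps, initial_eps, final_eps, exploration_fraction):
    """Linearly interpolate epsilon from initial_eps to final_eps over the
    first exploration_fraction of training, then hold it at final_eps."""
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        return final_eps
    return initial_eps + (progress / exploration_fraction) * (final_eps - initial_eps)

# Example values (not the notebook's defaults): decay from 1.0 to 0.05
# during the first half of a 10,000-step run.
for step in (0, 2_500, 5_000, 10_000):
    print(step, epsilon_at(step, 10_000, initial_eps=1.0, final_eps=0.05, exploration_fraction=0.5))
```

With `exploration_initial_eps=1`, `exploration_fraction=1`, and `exploration_final_eps=0.5`, this schedule never drops below 0.5, so at least half of all training actions are random.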

In [None]:

# Create environment
env_name = 'LunarLander-v3'
env = gym.make(env_name)

# Wrap the environment with a Monitor to log training progress
# so we don't need to manually record statistics
env = stable_baselines3.common.monitor.Monitor(env, log_dir)

# neural network hyperparameters
# net_arch is a list of number of neurons per hidden layer, e.g. [16,20] means
# two hidden layers with 16 and 20 neurons, respectively
policy_kwargs = dict(activation_fn=torch.nn.ReLU,
                     net_arch=[8,])

# instantiates the model using the defined hyperparameters
model = DQN(
    "MlpPolicy",
    env,
    policy_kwargs=policy_kwargs,
    learning_rate=0.1,
    batch_size=1,  # number of samples taken in each gradient descent update
    buffer_size=1,  # size (number of experience tuples) of the replay buffer
    learning_starts=1,  # how many steps to interact with the environment before model updates begin
    gamma=0.99,  # discount factor
    target_update_interval=1,  # steps between updates of the target network (1 = update every step)
    train_freq=(1, "step"),  # frequency of model updates; (1, "step") means train the network at every step
    exploration_initial_eps=1,  # initial value of random action probability
    exploration_fraction=1,  # fraction of entire training period over which epsilon is decreased
    exploration_final_eps=0.5,  # final value of random action probability
    seed=1,  # seed for the pseudo-random generators
    verbose=0,  # set verbose to 1 to observe training logs
    tensorboard_log=log_dir,  # where to store training info for TensorBoard
)

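# You can also experiment with other RL algorithms like A2C, PPO, DDPG etc.
# Refer to https://stable-baselines3.readthedocs.io/en/master/guide/examples.html
# for documentation. For example, to run PPO or A2C, replace "DQN" above with "PPO" or "A2C"
# and remove the arguments that the chosen algorithm does not accept (for example
# buffer_size, target_update_interval, and the exploration_* parameters). DDPG only supports
# continuous action spaces, so it needs gym.make(env_name, continuous=True).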

In [None]:
# @title Showing the shape of observation and #actions
print('State shape: ', env.observation_space.shape)
print('Number of actions: ', env.action_space.n)

In [None]:
# @title Video of one episode of the untrained model
record_and_display_video(env_name, model, "untrained")

### Training DQN


In [None]:
# Periodically evaluate the agent and log the results. A separate environment
# is used so that evaluation episodes do not interrupt training episodes or
# get written to the training Monitor log.
eval_env = stable_baselines3.common.monitor.Monitor(gym.make(env_name))
callback = EvalCallback(eval_env, log_path=log_dir, deterministic=True)

# Training statistics are logged every 'log_interval' episodes.
model.learn(total_timesteps=10_000, log_interval=10, callback=callback, progress_bar=True)

The training takes time. We encourage you to analyze the output logs (set verbose to 1 to print the output logs). The main component of the logs that you should track is "ep_rew_mean" (mean of episode rewards). As the training proceeds, the value of "ep_rew_mean" should increase. The improvement need not be monotonic, but the trend should be upwards!

Along with training, we are also periodically evaluating the performance of the current model during the training.

Now, let us look at the visual performance of the trained lander.

**Note:** The performance varies across different seeds and runs. This code is not optimized to be stable across all runs and seeds. The idea is to find a robust hyperparameter configuration, but performance is expected to vary anyway.

In [None]:
record_and_display_video(env_name, model, "after-training")

### Performance over time

Let us analyze the model's performance (speed and stability). For this purpose, we plot the number of time steps on the x-axis and the episodic reward given by the trained model on the y-axis.

An episode is considered successful when it is finished with reward >= 200.
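
The success criterion and the notion of stability can also be turned into numbers. The sketch below reports the fraction of episodes with reward >= 200, together with the mean and standard deviation of the most recent episodes as a rough stability indicator. The function name `summarize_rewards`, the window size, and the sample rewards are assumptions made for illustration.

```python
import statistics

def summarize_rewards(episode_rewards, success_threshold=200.0, window=10):
    """Return the success rate over all episodes and the mean and standard
    deviation of the last `window` episodes."""
    if len(episode_rewards) == 0:
        raise ValueError("episode_rewards must not be empty")
    successes = sum(r >= success_threshold for r in episode_rewards)
    recent = episode_rewards[-window:]
    return {
        "success_rate": successes / len(episode_rewards),
        "recent_mean": statistics.fmean(recent),
        "recent_std": statistics.pstdev(recent),
    }

# Hypothetical episode rewards, for illustration only.
rewards = [-250.0, -120.0, 30.0, 210.0, 180.0, 240.0, -40.0, 220.0]
print(summarize_rewards(rewards, window=4))
```

A high recent mean with a low recent standard deviation suggests a configuration that is both fast and stable; a high mean with a large standard deviation points to unstable learning.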

Notice that points are not evenly spaced on the graph, because each point marks the end of an episode and episodes have different lengths. Moreover, this output differs from TensorBoard because TensorBoard logs less frequently, whereas this plot shows every episode.

**Warning**: just re-running the experiment and plotting again will accumulate results (i.e., re-executing the train and plot cells will result in 20000 timesteps on the x-axis). You must store each execution in a different place to see the results properly.
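
One simple way to keep executions apart is to give every run its own sub-directory. The helper below is a sketch under that assumption; `make_run_dir` and the run name are hypothetical, and a temporary directory stands in for the real log root.

```python
import os
import tempfile
import time

def make_run_dir(base_dir, run_name):
    """Create and return a fresh sub-directory of base_dir for one training run."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    run_dir = os.path.join(base_dir, f"{run_name}-{stamp}")
    candidate, suffix = run_dir, 0
    # If a run with the same name started within the same second, add a counter.
    while os.path.exists(candidate):
        suffix += 1
        candidate = f"{run_dir}-{suffix}"
    os.makedirs(candidate)
    return candidate

base = tempfile.mkdtemp()  # stand-in for a log root such as "/tmp/gym/"
first = make_run_dir(base, "lr1e-3")
second = make_run_dir(base, "lr1e-3")
print(first != second)
```

The returned path would then replace `log_dir` in both the `Monitor` wrapper and the `load_results` call, so that each plot contains exactly one run.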

In [None]:
# Load training results from the log directory and convert them into x (timesteps) and y (episode rewards) coordinates for plotting.
x, y = ts2xy(load_results(log_dir), 'timesteps')
plt.plot(x, y)
plt.xlabel('Timesteps')
plt.ylabel('Episode Rewards')
plt.show()

Most likely, both the video and the plot showed poor performance. When reading the plot, keep two things in mind: an episodic reward above 200 is good, so check how quickly the agent approaches that level (speed), and check how much the reward fluctuates between episodes (stability).

Your objective now is to modify the model hyperparameters and investigate the stability and speed of the chosen configuration.   
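
To organize the search, it may help to enumerate the candidate configurations before training any of them. The sketch below builds one dictionary per combination of values; the search space shown is a hypothetical example and not a recommendation, and `iter_configs` is an illustrative helper name.

```python
from itertools import product

# Hypothetical search space; the values are examples, not recommendations.
search_space = {
    "learning_rate": [1e-3, 5e-4],
    "batch_size": [64, 128],
    "net_arch": [[64, 64], [128, 128]],
}

def iter_configs(space):
    """Yield one dict per combination of the listed hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

configs = list(iter_configs(search_space))
print(len(configs))
print(configs[0])
```

Each dictionary could then be split into `policy_kwargs=dict(net_arch=cfg["net_arch"])` and the remaining keyword arguments of `DQN(...)`, with one 10,000-timestep training run per configuration.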


---
# Additional Project Ideas

## Extension to Atari Games

In the Lunar Lander task, the input to the algorithm is a vector of state information. Deep RL algorithms can also be applied when the input is image frames, as is the case in Atari games. For example, consider one of the "DQN-friendly" Atari games: Pong. In this environment, the observation is an RGB image of the screen, which is an array of shape (210, 160, 3). To train an agent on Pong, you can start with the following sample code:

In [None]:
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Create Atari environment.
# If you are using Google Colab, you need to install the 'ale-py' package
# with 'pip install ale-py==0.7.5'
# and also install the pygame package: 'pip install pygame'
env = make_atari_env("PongNoFrameskip-v4", n_envs=1, seed=0)

# Frame stacking (optional): stack 4 frames to provide the agent with information about the direction of movement.
env = VecFrameStack(env, n_stack=4)

# Initialize the DQN model with a CNN policy.
model = DQN("CnnPolicy", env, verbose=1)

# Train the agent for 10000 timesteps
model.learn(total_timesteps=10000)

# Save the trained model
model.save("dqn_pong")

# Load the trained model
# model = DQN.load("dqn_pong")

# Enjoy trained agent
# obs = env.reset()
# while True:
#     action, _states = model.predict(obs, deterministic=True)
#     obs, rewards, dones, info = env.step(action)
#     env.render()

---
# References

1. [Stable Baselines Framework](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html)
2. [Lunar Lander Environment](https://gymnasium.farama.org/environments/box2d/lunar_lander/)

