# Graded lab: Implement DQN for LunarLander

This notebook originates from the Deep RL Course on HuggingFace and has been modified.
You're not expected to understand the topic of PPO yet, so you can safely ignore that part.

![Cover](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/thumbnail.jpg)

In this notebook, you'll train your **DQN agent** - a Lunar Lander agent that will learn to **land correctly on the Moon 🌕**. Using [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) a Deep Reinforcement Learning library, share them with the community, and experiment with different configurations

⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️




In [3]:
%%html
<video controls autoplay><source src="https://huggingface.co/sb3/ppo-LunarLander-v2/resolve/main/replay.mp4" type="video/mp4"></video>

### The environment 🎮

- [LunarLander-v2](https://gymnasium.farama.org/environments/box2d/lunar_lander/)

### The library used 📚

- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/)

We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues).

## Install dependencies and create a virtual screen 🔽

The first step is to install the dependencies, we’ll install multiple ones.

- `gymnasium[box2d]`: Contains the LunarLander-v2 environment 🌛
- `stable-baselines3[extra]`: The deep reinforcement learning library.
- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face 🤗 Hub.

To make things easier, we created a script to install all these dependencies.

In [4]:
!apt install swig cmake

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cmake is already the newest version (3.22.1-1ubuntu1.22.04.1).
Suggested packages:
  swig-doc swig-examples swig4.0-examples swig4.0-doc
The following NEW packages will be installed:
  swig swig4.0
0 upgraded, 2 newly installed, 0 to remove and 15 not upgraded.
Need to get 1,116 kB of archives.
After this operation, 5,542 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig4.0 amd64 4.0.2-1ubuntu1 [1,110 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig all 4.0.2-1ubuntu1 [5,632 B]
Fetched 1,116 kB in 1s (1,388 kB/s)
Selecting previously unselected package swig4.0.
(Reading database ... 120899 files and directories currently installed.)
Preparing to unpack .../swig4.0_4.0.2-1ubuntu1_amd64.deb ...
Unpacking swig4.0 (4.0.2-1ubuntu1) ...
Selecting previously unselected package swig.
Preparing to unpack .../swig_4.0.2-1ubu

In [5]:
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt

Collecting stable-baselines3==2.0.0a5 (from -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt (line 1))
  Downloading stable_baselines3-2.0.0a5-py3-none-any.whl (177 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m177.5/177.5 kB[0m [31m4.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting swig (from -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt (line 2))
  Downloading swig-4.1.1.post1-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m18.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting gymnasium[box2d] (from -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt (line 3))
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m953.9/953.9 k

During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).

Hence the following cell will install virtual screen libraries and create and run a virtual screen 🖥

In [6]:
!sudo apt-get update
!sudo apt-get install -y python3-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

0% [Working]            Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
0% [Connecting to archive.ubuntu.com (91.189.91.81)] [1 InRelease 14.2 kB/110 k0% [Connecting to archive.ubuntu.com (91.189.91.81)] [Connected to cloud.r-proj                                                                               Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
0% [Connecting to archive.ubuntu.com (91.189.91.81)] [Waiting for headers] [Wai                                                                               Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
0% [Waiting for headers] [Waiting for headers] [3 InRelease 1,581 B/1,581 B 100                                                                               0% [Waiting for headers] [Waiting for headers]                                              Hit:4 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u

To make sure the new installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**

In [None]:
import os
os.kill(os.getpid(), 9)

In [1]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x780340159720>

## Import the packages 📦

One additional library we import is huggingface_hub **to be able to upload and download trained models from the hub**.


The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.

You can see here all the Deep reinforcement Learning models available here👉 https://huggingface.co/models?pipeline_tag=reinforcement-learning&sort=downloads



In [2]:
import gymnasium as gym

from huggingface_sb3 import load_from_hub, package_to_hub
from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor


import math
import random
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple, deque
from itertools import count

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np

%matplotlib inline

# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display

plt.ion()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Create the LunarLander environment 🌛 and understand how it works

### [The environment 🎮](https://gymnasium.farama.org/environments/box2d/lunar_lander/)

In this first tutorial, we’re going to train our agent, a [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/), **to land correctly on the moon**. To do that, the agent needs to learn **to adapt its speed and position (horizontal, vertical, and angular) to land correctly.**

---


💡 A good habit when you start to use an environment is to check its documentation

👉 https://gymnasium.farama.org/environments/box2d/lunar_lander/

---


In [3]:
# We create our environment with gym.make("<name_of_the_environment>")
env = gym.make("LunarLander-v2")
state = env.reset()
print(state)

print("_____OBSERVATION SPACE_____ \n")
print("Observation Space Shape", env.observation_space.shape)
print("Sample observation", env.observation_space.sample()) # Get a random observation

(array([-0.00258131,  1.4116296 , -0.26146847,  0.03152581,  0.00299783,
        0.05922648,  0.        ,  0.        ], dtype=float32), {})
_____OBSERVATION SPACE_____ 

Observation Space Shape (8,)
Sample observation [-47.227703    79.67672      2.363438     2.4818523    0.51080126
   4.576614     0.2436623    0.67654735]


We see with `Observation Space Shape (8,)` that the observation is a vector of size 8, where each value contains different information about the lander:
- Horizontal pad coordinate (x)
- Vertical pad coordinate (y)
- Horizontal speed (x)
- Vertical speed (y)
- Angle
- Angular speed
- If the left leg contact point has touched the land (boolean)
- If the right leg contact point has touched the land (boolean)


In [4]:
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action


 _____ACTION SPACE_____ 

Action Space Shape 4
Action Space Sample 3


The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:

- Action 0: Do nothing,
- Action 1: Fire left orientation engine,
- Action 2: Fire the main engine,
- Action 3: Fire right orientation engine.

Reward function (the function that will gives a reward at each timestep) 💰:

After every step a reward is granted. The total reward of an episode is the **sum of the rewards for all the steps within that episode**.

For each step, the reward:

- Is increased/decreased the closer/further the lander is to the landing pad.
-  Is increased/decreased the slower/faster the lander is moving.
- Is decreased the more the lander is tilted (angle not horizontal).
- Is increased by 10 points for each leg that is in contact with the ground.
- Is decreased by 0.03 points each frame a side engine is firing.
- Is decreased by 0.3 points each frame the main engine is firing.

The episode receive an **additional reward of -100 or +100 points for crashing or landing safely respectively.**

An episode is **considered a solution if it scores at least 200 points.**

#### Vectorized Environment

- We create a vectorized environment (a method for stacking multiple independent environments into a single environment) of 16 environments, this way, **we'll have more diverse experiences during the training.**

In [5]:
# Create the environment
envs = make_vec_env('LunarLander-v2', n_envs=16)
print(envs.action_space.sample()) #test random action on envs

3


## Create the Model 🤖

We have studied our environment and we understood the problem: **being able to land the Lunar Lander to the Landing Pad correctly by controlling left, right and main orientation engine**. Now let's build the algorithm we're going to use to solve this Problem 🚀.

To solve this problem, you're going to implement DQN from scratch.

In [52]:
Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))

class ReplayMemory(object):

    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

In [96]:
class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.

        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): random seed
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
      """Randomly sample a batch of experiences from memory."""
      experiences = random.sample(self.memory, k=self.batch_size)

      states = [torch.from_numpy(np.array(e.state)).float().to(device) for e in experiences if e is not None]
      actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
      rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
      next_states = [torch.from_numpy(np.array(e.next_state)).float().to(device) for e in experiences if e is not None]
      dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

      return states, actions, rewards, next_states, dones


    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)

In [97]:
import torch.nn.init as init

In [111]:
# TODO: Define your DQN agent from scratch!

# Here He-Initialization is used for better gradient flow within the network
class DQN_Lunar_Lander(nn.Module):

    def __init__(self, n_observations, n_actions):
        super(DQN_Lunar_Lander, self).__init__()

        self.layerList = nn.ModuleList()
        # n_observations = 8 here with LunarLander-V2 env
        self.input_size = n_observations
        self.hidden_layer_size = 4
        self.num_hidden_layers = 1
        # n_actions = 4 here with LunarLander-V2 env
        self.output_size = n_actions

        # Input Layer
        self.layerList.append(nn.Linear(self.input_size, self.hidden_layer_size))
        self.layerList.append(nn.ReLU())

        # Hidden Layers
        for _ in range(self.num_hidden_layers):
          linear_layer = nn.Linear(self.hidden_layer_size, self.hidden_layer_size)
          init.kaiming_uniform_(linear_layer.weight, mode='fan_in', nonlinearity='relu')
          self.layerList.append(linear_layer)
          self.layerList.append(nn.ReLU())

        # Output Layer
        linear_output_layer = nn.Linear(self.hidden_layer_size, self.output_size)
        init.kaiming_uniform_(linear_output_layer.weight, mode='fan_in', nonlinearity='linear')
        self.layerList.append(linear_output_layer)
        # END TODO

    def forward(self, state):
      for i, layer in enumerate(self.layerList):
        state = layer(state)
        # print(f"Output after layer {i + 1}: {x.shape}")
        # print(f"x in forward: {x}")
      return state
# END TODO

## Train the DQN agent 🏃
- Let's train our agent for 1,000,000 timesteps, don't forget to use GPU on Colab. It will take approximately ~20min, but you can use fewer timesteps if you just want to try it out.
- During the training, take a ☕ break you deserved it 🤗

In [112]:
# Hyperparameter definition
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64 # minibatch size for more generalized training
GAMMA = 0.99
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 1000
TAU = 1e-3 # for soft update of target parameters
LR = 5e-4
UPADATE_EVERY = 4

In [134]:
class Agent_Lunar_Lander():
  def __init__(self, state_size, action_size, seed) -> None:
    self.state_size = state_size
    self.action_size = action_size
    self.seed = random.seed(seed)

    # Q-Networks with the Adam optimizer, using L2 regularization (weight_decay) for further improvement to find generalized optimum
    # action-value network also called: Online Network
    self.action_value_net = DQN_Lunar_Lander(state_size, action_size).to(device)
    self.target_net = DQN_Lunar_Lander(state_size, action_size).to(device)
    self.optimizer = optim.AdamW(self.action_value_net.parameters(), lr=LR, weight_decay=1e-5, amsgrad=True)

    # Replay Buffer
    # self.replay_buffer = ReplayMemory(BUFFER_SIZE)
    self.replay_buffer = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, seed)
    self.t_step = 0

  def get_epsilon(self, steps):
    """Gets value for epsilon. It declines as we take more steps."""
    # Ensures that there's almost at least a min_epsilon chance of randomly exploring
    return EPS_START * max(EPS_END, min(1., 1. - math.log10((steps + 1) / EPS_DECAY)))

  def step(self, state, action, reward, next_state, done):
    self.replay_buffer.add(state, action, reward, next_state, done)

    if self.t_step == 0:
      if len(self.replay_buffer) > BATCH_SIZE:
        experiences = self.replay_buffer.sample()
        self.learn(experiences, GAMMA)

  def act(self, states):
    states = torch.cat(states, dim=0)  # Concatenate the states into a single tensor
    sample = random.random()
    actions = []
    for state in states:
      eps_threshold = self.get_epsilon(self.t_step)
      if sample > eps_threshold:
          with torch.no_grad():
              actions.append(self.action_value_net.forward(state).max(1)[1].item())
      else:
          actions.append(envs.action_space.sample())
      self.t_step += 1
    return actions


  def learn(self, experiences, gamma):
    states, actions, rewards, next_states, dones = experiences

    # implementation of the slide p. 17 / lecture 11
    q_targets_next = self.target_net(next_states).detach().max(1)[0].unsqueeze(1)
    target_q_values = rewards + gamma * q_targets_next * (1 - dones)
    q_expected = self.action_value_net(states).gather(1, actions)

    # because of DQN => MSE as loss-fucntion
    loss = F.mse_loss(q_expected, target_q_values)
    self.optimizer.zero_grads()
    loss.backward()
    self.optimizer.step()

    # update of the
    self.soft_update(self.action_value_net, self.target_net, TAU)


  def soft_update(self, local_model, target_model, tau):
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0-tau) * target_param.data)

In [135]:
# get the state_size for the DQNs
state_sample = envs.reset()
state_size = len(state_sample[0])
# get the action_size for the DQNs
action_size = envs.action_space.n

agent = Agent_Lunar_Lander(state_size, action_size, seed=42)

In [136]:
def dqn(envs, agent, n_episodes=2000, max_t=1000):
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores

    for i_episode in range(1, n_episodes+1):
        states = envs.reset()  # Reset all environments
        states = [torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0) for state in states]
        rewards_accumulator = np.zeros(envs.num_envs)

        for t in range(max_t):
            # print(states)
            actions = agent.act(states)
            next_state, reward, done, _ = envs.step(actions)

            for i in range(envs.num_envs):
                agent.step(states[i], actions[i], reward[i], next_state[i], done[i])
                rewards_accumulator[i] += reward[i]

            # states = [torch.tensor(next_state[i], dtype=torch.float32, device=device).unsqueeze(0) for i in range(envs.num_envs)]  # Update states
            # states = [torch.tensor(state, dtype=torch.float32).to(device).unsqueeze(0) for state in states]
            states = [torch.tensor(state, dtype=torch.float32).clone().detach().to(device).unsqueeze(0) for state in states]
            if np.any(done):
                break

        avg_score = np.mean(rewards_accumulator)
        scores_window.append(avg_score)
        scores.append(avg_score)

        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end="")
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))

        if np.mean(scores_window) >= 200.0:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
            torch.save(agent.qnetwork_local.state_dict(), 'checkpoint.pth')
            break

    return scores

In [137]:
scores = dqn(envs, agent)

  states = [torch.tensor(state, dtype=torch.float32).clone().detach().to(device).unsqueeze(0) for state in states]


RuntimeError: ignored

In [None]:
def dqn_training(n_episodes=1, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
  scores = []                        # list containing scores from each episode
  scores_window = deque(maxlen=100)  # last 100 scores
  eps = eps_start                    # initialize epsilon
  episode_durations = [[] for _ in range(envs.num_envs)]
  for i_episode in range(1, n_episodes+1):
      state = envs.reset()
      states = [torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0) for state in states]
      rewards_accumulator = np.zeros(envs.num_envs)
      # score = 0
      for t in range(max_t):
          actions = agent.act(states, eps)
          print("actions=", actions)
          next_states, rewards, dones, _ = env.step(actions)
          observations = [torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0) for observation in next_states]
          rewards = [torch.tensor([reward], device=device) for reward in rewards]
          actions = [torch.tensor([[action]], device=device) for action in actions]
          for env_index in range(envs.num_envs):
            rewards_accumulator[env_index] += rewards[env_index].item()
            if dones[env_index]:
                next_state = None
                episode_durations[env_index].append(rewards_accumulator[env_index])
                # plot_durations()
                rewards_accumulator[env_index] = 0
            else:
                next_state = observations[env_index]

          agent.step(state, action, reward, next_state, done)
          state = next_state
          # score += reward
          rewards_accumulator[env_index] += rewards[env_index].item()
          if done:
              break
      scores_window.append(score)       # save most recent score
      scores.append(score)              # save most recent score
      eps = max(eps_end, eps_decay*eps) # decrease epsilon
      print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end="")
      if i_episode % 100 == 0:
          print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
      if np.mean(scores_window)>=200.0:
          print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
          torch.save(agent.qnetwork_local.state_dict(), 'checkpoint.pth')
            break
    return scores

In [None]:


def select_actions(states):
    global steps_done
    sample = random.random()
    actions = []
    for state in states:
      eps_threshold = get_epsilon(steps_done)
      if sample > eps_threshold:
          with torch.no_grad():
              actions.append(policy_net.forward(state).max(1)[1].item())
      else:
          actions.append(envs.action_space.sample())
      steps_done += 1
    return actions


episode_durations = [[] for _ in range(envs.num_envs)]

def get_epsilon(steps):
  """Gets value for epsilon. It declines as we take more steps."""
  # Ensures that there's almost at least a min_epsilon chance of randomly exploring
  return EPS_START * max(EPS_END, min(1., 1. - math.log10((steps + 1) / EPS_DECAY)))


def plot_durations(show_result=False):

    # Plot durations for each environment
    for env_index in range(envs.num_envs):
        plt.figure(env_index + 1)  # Start a new figure for each environment
        durations_t = torch.tensor(episode_durations[env_index], dtype=torch.float)
        if show_result:
            plt.title(f'Environment {env_index + 1} - Result')
        else:
            plt.clf()
            plt.title(f'Environment {env_index + 1} - Training...')
        plt.xlabel('Episode')
        plt.ylabel('Duration')
        plt.plot(durations_t.numpy())
        # Take 100 episode averages and plot them too
        if len(durations_t) >= 100:
            means = durations_t.unfold(0, 100, 1).mean(1).view(-1)
            means = torch.cat((torch.zeros(99), means))
            plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated
    if is_ipython:
        if not show_result:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        else:
            display.display(plt.gcf())

In [None]:
def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
    # detailed explanation). This converts batch-array of Transitions
    # to Transition of batch-arrays.
    batch = Transition(*zip(*transitions))

    # Compute a mask of non-final states and concatenate the batch elements
    # (a final state would've been the one after which simulation ended)
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                          batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                                if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1).values
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net(non_final_next_states).max(1).values
    # Compute the expected state-action values (Q values)
    # TODO: write the function to compute the expected Q values
    #   Q(s,a) = Q(s,a)*GAMMA +  r
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    # In-place gradient clipping
    torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)
    optimizer.step()

#### Solution

In [None]:
# Initialize the environment and get it's state. environments resets automatically when reaching a terminal state
states = envs.reset() # 2d array containing the 16 observations of the 16 environments
# print(states.shape)
# print(states)
states = [torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0) for state in states]
# print(states.shape)
# print(states)

rewards_accumulator = np.zeros(envs.num_envs)

# TODO: Train it for 1,000,000 timesteps
while steps_done < 1000000:
    actions = select_actions(states)
    # print("action", actions)
    observations, rewards, dones, _ = envs.step(actions)
    observations = [torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0) for observation in observations]
    # print("rewards", rewards)
    rewards = [torch.tensor([reward], device=device) for reward in rewards]
    actions = [torch.tensor([[action]], device=device) for action in actions]
    # print("element of actions", actions[0])
    for env_index in range(envs.num_envs):
        rewards_accumulator[env_index] += rewards[env_index].item()
        if dones[env_index]:
            next_state = None
            episode_durations[env_index].append(rewards_accumulator[env_index])
            # plot_durations()
            rewards_accumulator[env_index] = 0
        else:
            next_state = observations[env_index]

        # Store the transition in memory
        memory.push(states[env_index], actions[env_index], next_state, rewards[env_index])

    # Move to the next state
    states = observations

    # Perform one step of the optimization (on the policy network)
    optimize_model()

    # Soft update of the target network's weights
    # θ′ ← τ θ + (1 −τ )θ′
    target_net_state_dict = target_net.state_dict()
    policy_net_state_dict = policy_net.state_dict()
    for key in policy_net_state_dict:
        target_net_state_dict[key] = policy_net_state_dict[key]*TAU + target_net_state_dict[key]*(1-TAU)
    target_net.load_state_dict(target_net_state_dict)

print('Complete')
plot_durations(show_result=True)
plt.ioff()
plt.show()

## Evaluate the agent 📈
- Remember to wrap the environment in a [Monitor](https://stable-baselines3.readthedocs.io/en/master/common/monitor.html).
- Now that our Lunar Lander agent is trained 🚀, we need to **check its performance**.


💡 When you evaluate your agent, you should not use your training environment but create an evaluation environment.

- In my case, I got a mean reward is `200.20 +/- 20.80` after training for 1 million steps, which means that our lunar lander agent is ready to land on the moon 🌛🥳.

In [None]:
import gym
import os
import time
import torch
from gym.wrappers import RecordVideo

# For Google Colab, to download files
from google.colab import files

# Create a directory to store video
video_folder = '/content/videos'
os.makedirs(video_folder, exist_ok=True)

# Wrap your environment
env = gym.make("LunarLander-v2")
env = RecordVideo(env, video_folder)

def evaluate_and_record(policy_net, env, n_eval_episodes=1):
    for i in range(n_eval_episodes):
        state = env.reset()
        while True:
            state_tensor = torch.tensor(state, device=device, dtype=torch.float32).unsqueeze(0)
            with torch.no_grad():
                action = policy_net(state_tensor).max(1)[1].view(1, 1).item()
            state, _, done, _ = env.step(action)
            if done:
                break

    # Close the environment and video recorder
    env.close()

# Evaluate and record
evaluate_and_record(policy_net, env)

# Download the video files
for filename in os.listdir(video_folder):
    if filename.endswith(".mp4"):
        files.download(os.path.join(video_folder, filename))
