<a href="https://colab.research.google.com/github/erodola/DLAI-s2-2023/blob/main/labs/04/4_Logistic_Regression_and_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning

# Tutorial 6: Q-Learning

In this tutorial, we will cover:

- Q-Learning
- Deep Q-Learning

Authors:

- Antonio Ricciardi, PhD student

Course:

- Lectures and notebooks at https://github.com/erodola/ML-s2-2024/

# Imports and utilities

In [None]:
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt

In [None]:
!sudo apt-get update
!sudo apt-get install -y python3-opengl
!apt install ffmpeg xvfb
!pip3 install pyvirtualdisplay

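Restart the runtime so that the newly installed libraries are picked up and the virtual screen can run. The cell below kills the current process; once the runtime has restarted, continue from the next cell.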

In [None]:
import os
os.kill(os.getpid(), 9)

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

## Import the packages

- `random` generates random numbers (useful for the epsilon-greedy policy).
- `imageio` generates a replay video.

In [None]:
import numpy as np
import gymnasium as gym
import random
import imageio
import os
import pickle
from tqdm.notebook import tqdm
import base64
from IPython.display import HTML
# import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from collections import deque
import matplotlib.pyplot as plt

In [None]:
def record_video(env, Qtable, out_directory, max_steps, fps=1):
  """
  Generate a replay video of the agent
  :param env
  :param Qtable: Qtable of our agent
  :param out_directory
  :param max_steps: maximum number of steps
  :param fps: how many frames per second (with taxi-v3 and frozenlake-v1 we use 1)
  """
  images = []
  terminated = False
  truncated = False
  state, info = env.reset(seed=random.randint(0, 500))
  img = env.render()
  images.append(img)
  score = 0
  step = 0
  while not terminated and not truncated and step < max_steps:
    # Take the action (index) that has the maximum expected future reward given that state
    action = np.argmax(Qtable[state][:])
    state, reward, terminated, truncated, info = env.step(action)  # We directly assign next_state to state, which is all the recording logic needs
    img = env.render()
    images.append(img)
    score += reward
    step += 1
  imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)
  print("Final Score:", score)


We're now ready to code our Q-Learning algorithm 🔥

# Part 1: Frozen Lake ⛄ (non slippery version)

## Create and understand [FrozenLake environment ⛄](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)
---

A good habit when you start to use an environment is to check its documentation

https://gymnasium.farama.org/environments/toy_text/frozen_lake/

---

We're going to train our Q-Learning agent **to navigate from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoid holes (H)**.

We can have two sizes of environment:

- `map_name="4x4"`: a 4x4 grid version
- `map_name="8x8"`: an 8x8 grid version


The environment has two modes:

- `is_slippery=False`: The agent always moves **in the intended direction** due to the non-slippery nature of the frozen lake (deterministic).
- `is_slippery=True`: The agent **may not always move in the intended direction** due to the slippery nature of the frozen lake (stochastic).

For now let's keep it simple with the 4x4 map and non-slippery.
We add a parameter called `render_mode` that specifies how the environment should be visualised. In our case because we **want to record a video of the environment at the end, we need to set render_mode to rgb_array**.

As [explained in the documentation](https://gymnasium.farama.org/api/env/#gymnasium.Env.render), `"rgb_array"` returns a single frame representing the current state of the environment. A frame is an `np.ndarray` with shape `(x, y, 3)` holding the RGB values of an x-by-y pixel image.

In [None]:
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="rgb_array")

In [None]:
# We created our environment with gym.make("<name_of_the_environment>"); now let's inspect its observation space
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample()) # Get a random observation

The output `Observation Space Discrete(16)` tells us that the observation is an integer representing the **agent’s current position as current_row * ncols + current_col (where both the row and col start at 0)**.

For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**
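The index arithmetic above can be sketched in a few lines. This is only an illustration for the 4x4 map; the helper names `to_state` and `to_row_col` are hypothetical and not part of Gymnasium.

```python
# Hypothetical helpers illustrating the FrozenLake observation encoding.
NCOLS = 4  # number of columns in the 4x4 map

def to_state(row, col, ncols=NCOLS):
    """Encode a (row, col) grid position as a single integer observation."""
    return row * ncols + col

def to_row_col(state, ncols=NCOLS):
    """Decode an integer observation back into (row, col)."""
    return divmod(state, ncols)

print(to_state(3, 3))   # goal tile of the 4x4 map -> 15
print(to_row_col(0))    # start tile -> (0, 0)
```

For the 8x8 map the same helpers would apply with `ncols=8`.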


For instance, this is what state = 0 looks like:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/frozenlake.png" alt="FrozenLake">

In [None]:
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action

The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:
- 0: GO LEFT
- 1: GO DOWN
- 2: GO RIGHT
- 3: GO UP

Reward function 💰:
- Reach goal: +1
- Reach hole: 0
- Reach frozen: 0

# Let's start coding 🚀

In [None]:
state_space =
print("There are ", state_space, " possible states")

action_space =
print("There are ", action_space, " possible actions")

In [None]:
# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros. np.zeros needs a tuple (a,b)
def initialize_q_table(state_space, action_space):
  Qtable =
  return Qtable

In [None]:
Qtable_frozenlake = initialize_q_table(state_space, action_space)

#### Solution

In [None]:
state_space = env.observation_space.n
print("There are ", state_space, " possible states")

action_space = env.action_space.n
print("There are ", action_space, " possible actions")

In [None]:
# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros
def initialize_q_table(state_space, action_space):
  Qtable = np.zeros((state_space, action_space))
  return Qtable

In [None]:
Qtable_frozenlake = initialize_q_table(state_space, action_space)

## Define the Greedy Policy

Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.

- Epsilon-greedy policy (acting policy)
- Greedy-policy (updating policy)

The greedy policy will also be the final policy we'll have when the Q-learning agent completes training. The greedy policy is used to select an action using the Q-table.

In [None]:
# define the greedy policy
def greedy_policy(Qtable, state):
  # Exploitation: take the action with the highest state, action value
  action =

  return action

#### Solution

In [None]:
def greedy_policy(Qtable, state):
  # Exploitation: take the action with the highest state, action value
  action = np.argmax(Qtable[state][:])

  return action

## Define the epsilon-greedy policy 🤖

Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.

The idea with epsilon-greedy:

- With *probability 1 - ɛ* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).

- With *probability ɛ*: we do **exploration** (trying a random action).

As the training continues, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**

In [None]:
def epsilon_greedy_policy(Qtable, state, epsilon):
  # Randomly generate a number between 0 and 1
  random_num =
  # if random_num is greater than epsilon --> exploitation
  if random_num > epsilon:
    # Take the action with the highest value given a state
    # np.argmax can be useful here
    action =
  # else --> exploration
  else:
    action = # Take a random action

  return action

#### Solution

In [None]:
def epsilon_greedy_policy(Qtable, state, epsilon):
  # Randomly generate a number between 0 and 1
  random_num = random.uniform(0,1)
  # if random_num is greater than epsilon --> exploitation
  if random_num > epsilon:
    # Take the action with the highest value given a state
    # np.argmax can be useful here
    action = greedy_policy(Qtable, state)
  # else --> exploration
  else:
    action = env.action_space.sample()

  return action

## Define the hyperparameters ⚙️

The exploration-related hyperparameters are some of the most important ones.

- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we decay epsilon progressively.
- If you decrease epsilon too fast (`decay_rate` too high), **you take the risk that your agent will get stuck**, since it has not explored enough of the state space and hence cannot solve the problem.
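To get a feel for how `decay_rate` shapes exploration, the sketch below evaluates an exponential schedule of the form `min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)`. The function name `epsilon_at` is hypothetical and the numeric values are illustrative assumptions, not recommendations.

```python
import math

def epsilon_at(episode, min_epsilon=0.05, max_epsilon=1.0, decay_rate=0.0005):
    """Exploration probability at a given episode under exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

# Compare a slow and a ten-times-faster decay at a few points in training.
for episode in (0, 1000, 5000, 10000):
    slow = epsilon_at(episode)
    fast = epsilon_at(episode, decay_rate=0.005)
    print(f"episode {episode:>5}: slow decay eps={slow:.3f}, fast decay eps={fast:.3f}")
```

With the faster decay, epsilon is already close to its minimum after about a thousand episodes, which leaves far fewer random moves for discovering the goal.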

In [None]:
# Training parameters
n_training_episodes = 10000  # Total training episodes
learning_rate = 0.7          # Learning rate

# Evaluation parameters
n_eval_episodes = 100        # Total number of test episodes

# Environment parameters
env_id = "FrozenLake-v1"     # Name of the environment
max_steps = 99               # Max steps per episode
gamma = 0.95                 # Discounting rate
eval_seed = []               # The evaluation seed of the environment

# Exploration parameters
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.05            # Minimum exploration probability
decay_rate = 0.0005            # Exponential decay rate for exploration prob

### Test and show frozenlake episode with untrained agent

In [None]:
# Define the output directory and filename for the video
output_directory = "episode_video.mp4"

# Record the video for an episode
record_video(env, Qtable_frozenlake, output_directory, max_steps=99, fps=1)

# Display the video in Jupyter Notebook
video = open(output_directory, "rb").read()
video_tag = f'<video controls alt="Episode Video" src="data:video/mp4;base64,{base64.b64encode(video).decode()}" type="video/mp4">'
display(HTML(video_tag))

# Training Loop

In [None]:
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
  for episode in tqdm(range(n_training_episodes)):
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
    # Reset the environment
    state, info = env.reset()
    step = 0
    terminated = False
    truncated = False

    # repeat
    for step in range(max_steps):
      # Choose the action At using epsilon greedy policy
      action =

      # Take action At and observe Rt+1 and St+1
      # Take the action (a) and observe the outcome state(s') and reward (r)
      new_state, reward, terminated, truncated, info =

      # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
      Qtable[state][action] =

      # If terminated or truncated finish the episode
      if terminated or truncated:
        break

      # Our next state is the new state
      state = new_state
  return Qtable

#### Solution

In [None]:
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
  for episode in tqdm(range(n_training_episodes)):
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
    # Reset the environment
    state, info = env.reset()
    step = 0
    terminated = False
    truncated = False

    # repeat
    for step in range(max_steps):
      # Choose the action At using epsilon greedy policy
      action = epsilon_greedy_policy(Qtable, state, epsilon)

      # Take action At and observe Rt+1 and St+1
      # Take the action (a) and observe the outcome state(s') and reward (r)
      new_state, reward, terminated, truncated, info = env.step(action)

      # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
      Qtable[state][action] = Qtable[state][action] + learning_rate * (reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action])

      # If terminated or truncated finish the episode
      if terminated or truncated:
        break

      # Our next state is the new state
      state = new_state
  return Qtable
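
To see how the update rule propagates the goal reward backwards, here is a small worked example in plain Python. It is a sketch with assumed numbers (learning rate 0.7, discount 0.95, all Q-values starting at 0); the function name `q_update` is hypothetical.

```python
def q_update(q_sa, reward, max_q_next, learning_rate=0.7, gamma=0.95):
    """One tabular Q-learning step: Q(s,a) + lr * (TD target - Q(s,a))."""
    td_target = reward + gamma * max_q_next
    return q_sa + learning_rate * (td_target - q_sa)

# Transition into the goal: reward 1, and the goal's Q-values are all 0.
q_last = q_update(q_sa=0.0, reward=1.0, max_q_next=0.0)

# One tile earlier: reward 0, but the next state now has a non-zero value.
q_prev = q_update(q_sa=0.0, reward=0.0, max_q_next=q_last)

print(q_last, q_prev)
```

The first update yields 0.7 and the second about 0.4655: the reward signal moves one step further from the goal each time the corresponding transition is revisited.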

## Train the Q-Learning agent 🤖

In [None]:
Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)

## Let's check our Q-table!

In [None]:
Qtable_frozenlake
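
As a rough sanity check for the table, we can compute the values an optimal agent would have in the deterministic environment. The sketch assumes the default 4x4 layout, where the shortest path from start to goal takes 6 moves, and a discount of 0.95; `optimal_q` is a hypothetical helper.

```python
def optimal_q(moves_left, discount=0.95):
    """Optimal Q-value of the best action when `moves_left` moves remain to the goal.

    The only reward (+1) arrives on the last move, so it is discounted
    once for every earlier move.
    """
    return discount ** (moves_left - 1)

shortest_path = 6  # assumed number of moves from S to G on the default 4x4 map
for moves_left in range(shortest_path, 0, -1):
    print(f"{moves_left} moves left: Q* = {optimal_q(moves_left):.4f}")
```

If training has converged, the largest entry in row 0 of the Q-table should be close to 0.95^5 ≈ 0.77, and the action that steps onto the goal should have a value close to 1.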

## Evaluation 📝

- We define the evaluation function that we will use to test our Q-Learning agent.

In [None]:
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
  """
  Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and standard deviation of the episode reward.
  :param env: The evaluation environment
  :param max_steps: Maximum number of steps per episode
  :param n_eval_episodes: Number of episodes used to evaluate the agent
  :param Q: The Q-table
  :param seed: The evaluation seed array (for taxi-v3); pass an empty list to reset without a seed
  """
  episode_rewards = []
  for episode in tqdm(range(n_eval_episodes)):
    if seed:
      state, info = env.reset(seed=seed[episode])
    else:
      state, info = env.reset()
    step = 0
    truncated = False
    terminated = False
    total_rewards_ep = 0

    for step in range(max_steps):
      # Take the action (index) that has the maximum expected future reward given that state
      action = greedy_policy(Q, state)
      new_state, reward, terminated, truncated, info = env.step(action)
      total_rewards_ep += reward

      if terminated or truncated:
        break
      state = new_state
    episode_rewards.append(total_rewards_ep)
  mean_reward = np.mean(episode_rewards)
  std_reward = np.std(episode_rewards)

  return mean_reward, std_reward

## Evaluate our Q-Learning agent 📈

- With the hyperparameters above, you should usually get a mean reward of 1.0, since the non-slippery environment is deterministic.
- The **environment is relatively easy** since the state space is very small (16 states). You can try [replacing it with the slippery version](https://gymnasium.farama.org/environments/toy_text/frozen_lake/), which introduces stochasticity and makes the environment more complex.

In [None]:
# Evaluate our Agent
mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

### Show trained agent

In [None]:
# Define the output directory and filename for the video
output_directory = "episode_video.mp4"

# Record the video for an episode
record_video(env, Qtable_frozenlake, output_directory, max_steps=99, fps=1)

# Display the video in Jupyter Notebook
video = open(output_directory, "rb").read()
video_tag = f'<video controls alt="Episode Video" src="data:video/mp4;base64,{base64.b64encode(video).decode()}" type="video/mp4">'
display(HTML(video_tag))

# Part 2: Taxi-v3 🚖

## Create and understand [Taxi-v3 🚕](https://gymnasium.farama.org/environments/toy_text/taxi/)
---

💡 A good habit when you start to use an environment is to check its documentation

👉 https://gymnasium.farama.org/environments/toy_text/taxi/

---

In `Taxi-v3` 🚕, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue).

When the episode starts, **the taxi starts off at a random square** and the passenger is at a random location. The taxi drives to the passenger’s location, **picks up the passenger**, drives to the passenger’s destination (another one of the four specified locations), and then **drops off the passenger**. Once the passenger is dropped off, the episode ends.


<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi.png" alt="Taxi">


In [None]:
env = gym.make("Taxi-v3", render_mode="rgb_array")

There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations.
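The count above is simply 25 × 5 × 4 = 500. The sketch below packs the four components into a single integer in the same order as the encoding described in the Gymnasium Taxi documentation; the function names are hypothetical stand-ins written in plain Python, not calls into the library.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into one integer in [0, 500)."""
    state = taxi_row                          # 5 possible rows
    state = state * 5 + taxi_col              # 5 possible columns
    state = state * 5 + passenger_location    # 4 landmarks + "in the taxi"
    state = state * 4 + destination           # 4 landmarks
    return state

def decode(state):
    """Invert `encode`, returning (taxi_row, taxi_col, passenger_location, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination

n_states = 5 * 5 * 5 * 4
print(n_states)              # 500
print(encode(4, 4, 4, 3))    # largest index: 499
```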

In [None]:
state_space = env.observation_space.n
print("There are ", state_space, " possible states")

In [None]:
action_space = env.action_space.n
print("There are ", action_space, " possible actions")

The action space (the set of possible actions the agent can take) is discrete with **6 actions available 🎮**:

- 0: move south
- 1: move north
- 2: move east
- 3: move west
- 4: pickup passenger
- 5: drop off passenger

Reward function 💰:

- -1 per step unless other reward is triggered.
- +20 delivering passenger.
- -10 executing “pickup” and “drop-off” actions illegally.

In [None]:
# Create our Q table with state_size rows and action_size columns (500x6)
Qtable_taxi = initialize_q_table(state_space, action_space)
print(Qtable_taxi)
print("Q-table shape: ", Qtable_taxi.shape)

## Hyperparameters ⚙️

⚠ DO NOT MODIFY EVAL_SEED: the eval_seed array **allows us to evaluate your agent with the same taxi starting positions for every classmate**

In [None]:
# Training parameters
n_training_episodes = 25000   # Total training episodes
learning_rate = 0.7           # Learning rate

# Evaluation parameters
n_eval_episodes = 100        # Total number of test episodes

# DO NOT MODIFY EVAL_SEED
eval_seed = [16,54,165,177,191,191,120,80,149,178,48,38,6,125,174,73,50,172,100,148,146,6,25,40,68,148,49,167,9,97,164,176,61,7,54,55,
 161,131,184,51,170,12,120,113,95,126,51,98,36,135,54,82,45,95,89,59,95,124,9,113,58,85,51,134,121,169,105,21,30,11,50,65,12,43,82,145,152,97,106,55,31,85,38,
 112,102,168,123,97,21,83,158,26,80,63,5,81,32,11,28,148] # Evaluation seeds: they ensure that every classmate's agent is evaluated on the same taxi starting positions
                                                          # Each seed corresponds to a specific starting state

# Environment parameters
env_id = "Taxi-v3"           # Name of the environment
max_steps = 99               # Max steps per episode
gamma = 0.95                 # Discounting rate

# Exploration parameters
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.05           # Minimum exploration probability
decay_rate = 0.005            # Exponential decay rate for exploration prob


## Train our Q-Learning agent 🤖

In [None]:
Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)
Qtable_taxi

### Evaluate Q-Learning agent on the Taxi-v3 environment 🚕

In [None]:
# Define the output directory and filename for the video
output_directory = "episode_video.mp4"

# Record the video for an episode
record_video(env, Qtable_taxi, output_directory, max_steps=200, fps=1)

# Display the video in Jupyter Notebook
video = open(output_directory, "rb").read()
video_tag = f'<video controls alt="Episode Video" src="data:video/mp4;base64,{base64.b64encode(video).decode()}" type="video/mp4">'
display(HTML(video_tag))


# Deep Q-Learning

In [None]:
# @title Play Video function
from IPython.display import HTML
from base64 import b64encode
from pyvirtualdisplay import Display

# create the directory to store the video(s)
os.makedirs("./video", exist_ok=True)

# Start a virtual display (named so that it does not shadow IPython's display function)
virtual_display = Display(visible=False, size=(1400, 900))
_ = virtual_display.start()

# Utility functions to record a video of a gym environment and display it.
def render_mp4(videopath: str) -> str:
    """
    Gets a string containing a base64-encoded version of the MP4 video
    at the specified path.
    """
    mp4 = open(videopath, 'rb').read()
    base64_encoded_mp4 = b64encode(mp4).decode()
    return f'<video width=400 controls><source src="data:video/mp4;' \
        f'base64,{base64_encoded_mp4}" type="video/mp4"></video>'

# import gymnasium as gym
# from gym import spaces
#from gym.envs.box2d.lunar_lander import *
from gymnasium.wrappers.monitoring.video_recorder import VideoRecorder

def record_video_neuralnet(env, model, out_directory, max_steps=1000, fps=30):
    """
    Generate a replay video of the agent
    :param env
    :param model: Neural network model that maps a state to one Q-value per action
    :param out_directory: unused; the video is always written to video/vid.mp4
    :param max_steps: maximum number of steps
    :param fps: unused; kept for symmetry with record_video
    """
    vid = VideoRecorder(env, path="video/vid.mp4")
    state = env.reset()[0]
    step = 0
    total_reward = 0
    terminated = False
    truncated = False
    while not terminated and not truncated and step < max_steps:
        # capture_frame renders the environment and stores the frame
        vid.capture_frame()
        state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
        # The network outputs Q-values; act greedily without tracking gradients
        with torch.no_grad():
            q_values = model(state_tensor)
        action = torch.argmax(q_values).item()
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        step += 1

    vid.close()
    env.close()
    print(f"\nTotal reward: {total_reward}")

    # Return the video so that a notebook cell can display it
    html = render_mp4("video/vid.mp4")
    return HTML(html)

# Part 3: CartPole 

## Create and understand [CartPole-v1 🛒](https://gymnasium.farama.org/environments/classic_control/cartpole/)
---

💡 A good habit when you start to use an environment is to check its documentation

👉 [CartPole-v1 Documentation](https://gymnasium.farama.org/environments/classic_control/cartpole/)

---

CartPole-v1 is a classic control environment in which the goal is to balance a pole on a moving cart. An episode is truncated after 500 time steps (the 200-step limit belongs to the older CartPole-v0), so the best possible return per episode is 500.

The state space consists of four continuous variables:
- Cart position
- Cart velocity
- Pole angle
- Pole angular velocity

The action space consists of two discrete actions:
- Push the cart to the left
- Push the cart to the right

The reward function is as follows:
- +1 for each time step the pole remains upright

To solve this environment, we will use the Deep Q-Learning algorithm.


In [None]:
# Create the CartPole environment
env = gym.make("CartPole-v1", render_mode="rgb_array")

## Neural Network Architecture

We will use a simple neural network with the following architecture:
- Input layer: corresponding to the state space
- Hidden layer: 128 units with ReLU activation
- Output layer: corresponding to the action space
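To gauge how small this network is, the sketch below counts its trainable parameters by hand. It assumes the CartPole sizes described earlier (4 state variables, 2 actions); `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features):
    """Number of weights plus biases in one fully connected layer."""
    return in_features * out_features + out_features

state_dim, hidden_dim, n_actions = 4, 128, 2  # CartPole: 4 state variables, 2 actions

hidden_layer = linear_params(state_dim, hidden_dim)   # 4 * 128 + 128 = 640
output_layer = linear_params(hidden_dim, n_actions)   # 128 * 2 + 2 = 258
total = hidden_layer + output_layer

print(total)  # 898
```

The ReLU activation has no parameters, so it does not contribute to the count.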

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, output_size)
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

In [None]:
# Create an instance of the neural network
model = NeuralNetwork(input_size=env.observation_space.shape[0], output_size=env.action_space.n)

# Print the model architecture
print(model)

### Test with random actions

In [None]:
# Reset the environment
state, _ = env.reset()

# Initialize the score
score = 0

# Perform actions in the environment until the episode is done
done = False
while not done:
    # Choose a random action
    action = env.action_space.sample()
    
    # Take the action in the environment
    next_state, reward, terminated, truncated, info = env.step(action)
    
    # Update the score
    score += reward
    
    # Update the current state
    state = next_state

    done = np.logical_or(terminated, truncated)

# Print the final score
print("Final Score:", score)


## Render Video

In [None]:
# Define the output directory and filename for the video
output_directory = "episode_video.mp4"

# Record the video for an episode
record_video_neuralnet(env, model, output_directory, max_steps=200, fps=30)
# show video
html = render_mp4(f"video/vid.mp4")
HTML(html)

## Training Loop

## Hyperparameters

- `num_episodes`: The total number of episodes to train the agent.
- `max_steps`: The maximum number of steps per episode.
- `batch_size`: The batch size for training the neural network.
- `gamma`: The discount factor for future rewards.
- `epsilon_start`: The initial exploration rate for epsilon-greedy policy.
- `epsilon_end`: The final exploration rate for epsilon-greedy policy.
- `epsilon_decay`: The decay rate for the exploration rate.


In [None]:
# Set hyperparameters
num_episodes = 350 # 1000
max_steps = 500
batch_size = 32
gamma = 0.99
epsilon_start = 1.0
epsilon_end = 0.01
epsilon_decay = 0.995
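
It is worth checking how far epsilon actually decays with these settings. The sketch below assumes epsilon is multiplied by the decay factor once per episode and clipped at the final value; the variable and function names are hypothetical so that they do not overwrite the hyperparameters above.

```python
import math

eps_start, eps_end, eps_decay = 1.0, 0.01, 0.995  # same values as the cell above
n_episodes = 350

def epsilon_after(n):
    """Exploration rate after n multiplicative decay steps, clipped at eps_end."""
    return max(eps_end, eps_start * eps_decay ** n)

# Smallest number of decay steps for which epsilon reaches the floor.
episodes_to_floor = math.ceil(math.log(eps_end / eps_start) / math.log(eps_decay))

print(f"epsilon after {n_episodes} episodes: {epsilon_after(n_episodes):.3f}")
print(f"episodes needed to reach the floor: {episodes_to_floor}")
```

Under these assumptions epsilon is still roughly 0.17 after 350 episodes and would need more than 900 episodes to reach 0.01, so the agent keeps exploring noticeably until the end of training.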

## Training the Deep Q-Learning agent 🤖

In [None]:
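# Initialize the target and policy networks
target_net = NeuralNetwork(input_size=env.observation_space.shape[0], output_size=env.action_space.n)
policy_net = NeuralNetwork(input_size=env.observation_space.shape[0], output_size=env.action_space.n)
# Start both networks from the same weights so that the first targets are consistent with the policy network
target_net.load_state_dict(policy_net.state_dict())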

## Let's define the training loop 🔄

In [None]:
def train_deep_q():
    episode_rewards = []
    mean_rewards = []
    std_rewards = []
    epsilons = []

    # Create a replay buffer
    replay_buffer = deque(maxlen=10000)

    # Define the optimizer
    optimizer = optim.Adam(policy_net.parameters(), lr=0.001)

    # Define the loss function
    loss_fn = F.mse_loss

    # Initialize the epsilon value
    epsilon = epsilon_start
    
    # Training loop
    for episode in range(num_episodes):
        # Reset the environment and initialize the state
        state, _ = env.reset()
        total_reward = 0
        
        for step in range(max_steps):
            # Select an action using an epsilon-greedy policy
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    q_values = policy_net(torch.tensor(state, dtype=torch.float32))
                    action = torch.argmax(q_values).item()
            
            # Take the selected action and observe the next state, reward, and done flag
            next_state, reward, terminated, truncated, info = env.step(action)
            total_reward += reward

            done = np.logical_or(terminated, truncated)
            
            # Store the experience in the replay buffer
            replay_buffer.append((state, action, reward, next_state, done))
            
            # Update the current state
            state = next_state
            
            # Sample a batch of experiences from the replay buffer
            if len(replay_buffer) >= batch_size:
                batch = random.sample(replay_buffer, batch_size)
                states, actions, rewards, next_states, dones = zip(*batch)
                
                # Convert the batch to tensors (stacking into NumPy arrays first avoids a slow list-of-arrays conversion)
                states = torch.tensor(np.array(states), dtype=torch.float32)
                actions = torch.tensor(actions, dtype=torch.long)
                rewards = torch.tensor(rewards, dtype=torch.float32)
                next_states = torch.tensor(np.array(next_states), dtype=torch.float32)
                dones = torch.tensor(dones, dtype=torch.float32)
                
                """ ---------- FILL HERE ---------- """
                # Calculate the target Q-values using the target network
                with torch.no_grad():
                    target_q_values = 
                
                # Calculate the predicted Q-values using the policy network
                q_values = 
                predicted_q_values = 
                
                # Calculate the loss between the target and predicted Q-values
                loss = 
                """ ---------- UNTIL HERE ---------- """
                
                # Update the weights of the policy network using gradient descent
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                
                # Soft update the target network weights with the policy network weights
                for target_param, policy_param in zip(target_net.parameters(), policy_net.parameters()):
                    target_param.data.copy_(0.01 * policy_param.data + 0.99 * target_param.data)
            
            if done:
                episode_rewards.append(total_reward)
                mean_reward = np.mean(episode_rewards[-10:])
                mean_rewards.append(mean_reward)

                std_reward = np.std(episode_rewards[-10:])
                std_rewards.append(std_reward)

                break
        
        # Decay the exploration rate
        epsilon = max(epsilon_end, epsilon_decay * epsilon)
        
        # Print the training statistics
        print(f"Episode: {episode+1}/{num_episodes}, Total Reward: {total_reward}")
        
    return episode_rewards, mean_rewards, std_rewards


#### Solution

In [None]:
def train_deep_q():
    episode_rewards = []
    mean_rewards = []
    std_rewards = []

    # Create a replay buffer
    replay_buffer = deque(maxlen=10000)

    # Define the optimizer
    optimizer = optim.Adam(policy_net.parameters(), lr=0.001)

    # Define the loss function
    loss_fn = F.mse_loss

    # Initialize the epsilon value
    epsilon = epsilon_start
    
    # Training loop
    for episode in range(num_episodes):
        # Reset the environment and initialize the state
        state, _ = env.reset()
        total_reward = 0
        
        for step in range(max_steps):
            # Select an action using an epsilon-greedy policy
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    q_values = policy_net(torch.tensor(state, dtype=torch.float32))
                    action = torch.argmax(q_values).item()
            
            # Take the selected action and observe the next state, reward, and done flag
            next_state, reward, terminated, truncated, info = env.step(action)
            total_reward += reward

            done = np.logical_or(terminated, truncated)
            
            # Store the experience in the replay buffer
            replay_buffer.append((state, action, reward, next_state, done))
            
            # Update the current state
            state = next_state
            
            # Sample a batch of experiences from the replay buffer
            if len(replay_buffer) >= batch_size:
                batch = random.sample(replay_buffer, batch_size)
                states, actions, rewards, next_states, dones = zip(*batch)
                
                # Convert the batch to tensors (stacking into NumPy arrays first avoids a slow list-of-arrays conversion)
                states = torch.tensor(np.array(states), dtype=torch.float32)
                actions = torch.tensor(actions, dtype=torch.long)
                rewards = torch.tensor(rewards, dtype=torch.float32)
                next_states = torch.tensor(np.array(next_states), dtype=torch.float32)
                dones = torch.tensor(dones, dtype=torch.float32)
                
                # Calculate the target Q-values using the target network
                with torch.no_grad():
                    target_q_values = rewards + gamma * (1 - dones) * torch.max(target_net(next_states), dim=1).values
                
                # Calculate the predicted Q-values using the policy network
                q_values = policy_net(states)
                predicted_q_values = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
                
                # Calculate the loss between the target and predicted Q-values
                loss = loss_fn(predicted_q_values, target_q_values)
                
                # Update the weights of the policy network using gradient descent
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                
                # Soft update the target network weights with the policy network weights
                for target_param, policy_param in zip(target_net.parameters(), policy_net.parameters()):
                    target_param.data.copy_(0.01 * policy_param.data + 0.99 * target_param.data)
            
            if done:
                episode_rewards.append(total_reward)
                mean_reward = np.mean(episode_rewards[-10:])
                mean_rewards.append(mean_reward)

                std_reward = np.std(episode_rewards[-10:])
                std_rewards.append(std_reward)

                break
        
        # Decay the exploration rate
        epsilon = max(epsilon_end, epsilon_decay * epsilon)
        
        # Print the training statistics
        print(f"Episode: {episode+1}/{num_episodes}, Total Reward: {total_reward}")
    
    return episode_rewards, mean_rewards, std_rewards
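
The target computation and the soft update in the training loop above can be checked by hand on a tiny batch. The following is a minimal sketch in NumPy rather than PyTorch; the batch values, the "network outputs" and the helper `soft_update` are made up for illustration and are not part of the notebook's code.

```python
import numpy as np

gamma = 0.99  # discount factor, same value as in the notebook's hyperparameters

# Hypothetical batch of three transitions; the last one ends its episode.
rewards = np.array([1.0, 1.0, -100.0])
dones = np.array([0.0, 0.0, 1.0])
actions = np.array([1, 0, 1])

# Made-up network outputs: one row per transition, one column per action.
target_next_q = np.array([[0.5, 2.0], [3.0, 1.0], [4.0, 5.0]])
policy_q = np.array([[0.2, 2.5], [3.5, 0.1], [1.0, -90.0]])

# TD target: reward plus discounted best next value. The (1 - done) factor
# removes the bootstrap term for transitions that ended the episode.
targets = rewards + gamma * (1.0 - dones) * target_next_q.max(axis=1)

# Q-value of the action actually taken (NumPy analogue of torch.gather).
predicted = np.take_along_axis(policy_q, actions[:, None], axis=1).squeeze(1)

print(targets)    # 2.98, 3.97 and -100.0
print(predicted)  # 2.5, 3.5 and -90.0


def soft_update(target_w, policy_w, tau=0.01):
    """Move the target weights a fraction tau towards the policy weights."""
    return tau * policy_w + (1.0 - tau) * target_w


# With the policy weights held fixed, the gap shrinks by (1 - tau) per update.
target_w, policy_w = np.zeros(3), np.ones(3)
for _ in range(100):
    target_w = soft_update(target_w, policy_w)
print(target_w)  # every entry equals 1 - 0.99**100, roughly 0.634
```

With `tau = 0.01`, the target network therefore lags the policy network by on the order of a hundred gradient steps, which is what keeps the regression targets stable.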

## Train

In [None]:
episode_rewards, mean_rewards, std_rewards = train_deep_q()

## Plot the rewards

In [None]:
# Plot training statistics
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
# Plot against the number of recorded episodes so x and y always have the same length
plt.plot(episode_rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Training Progress - Total Reward')

plt.tight_layout()
plt.show()

## Testing the Deep Q-Learning agent 📈

In [None]:
# Filename for the output video
output_directory = "vid.mp4"

# Record the video for an episode using the trained policy network
record_video_neuralnet(env, policy_net, output_directory, max_steps=200, fps=30)
# Show the video
html = render_mp4("video/vid.mp4")
HTML(html)

# Lunar Lander

## Create and understand [LunarLander-v2 🚀](https://gymnasium.farama.org/environments/box2d/lunar_lander/)

Lunar Lander is a classic control environment from Gymnasium, the maintained successor of OpenAI Gym. It simulates landing a spacecraft on the moon's surface: the agent fires the main and side engines to bring the lander to rest on the landing pad without crashing. Fuel is unlimited, but every frame in which an engine fires incurs a small penalty, so wasteful thrust lowers the return.

Each call to `env.step(action)` returns the following values:
- `state`: A 1D array of 8 values describing the lander: its x and y coordinates, its horizontal and vertical velocities, its angle, its angular velocity, and two flags indicating whether each leg is in contact with the ground.
- `reward`: A scalar reward for the current step. The agent is rewarded for approaching the pad, slowing down, and touching down with its legs, and it is penalized for tilting, crashing, and firing the engines.
- `terminated`: A boolean that is `True` when the episode ends because the lander crashes, leaves the viewport, or comes to rest.
- `truncated`: A boolean that is `True` when the episode is cut off by the time limit. The training loop merges `terminated` and `truncated` into a single `done` flag.
- `info`: Additional diagnostic information about the environment.
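
To make the observation layout concrete, here is a minimal sketch that gives names to the eight entries and to the four discrete actions. The names `LanderState`, `ACTIONS` and `describe` are hypothetical helpers introduced only for this illustration, the observation values are invented, and the ordering (including left leg before right leg) is assumed to follow the Gymnasium documentation.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Named view of the 8-value observation (assumed layout, see lead-in)."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


# Meaning of the four discrete actions, following the Gymnasium documentation.
ACTIONS = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}


def describe(observation: Sequence[float]) -> LanderState:
    """Convert a raw 8-value observation into a LanderState."""
    x, y, vx, vy, angle, angular_velocity, left, right = observation
    return LanderState(
        float(x), float(y), float(vx), float(vy),
        float(angle), float(angular_velocity),
        bool(left), bool(right),
    )


# Hypothetical observation: lander above the pad, descending, legs in the air.
obs = [0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 0.0]
lander = describe(obs)
print(lander.y, lander.vy, lander.left_leg_contact)
print(ACTIONS[2])
```

In the notebook itself the network consumes the raw array directly; a wrapper like this is only useful for inspecting or logging individual transitions.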



In [None]:
!pip install swig
# Quote the extra so the shell does not interpret the square brackets
!pip install "gymnasium[box2d]"

In [None]:
env = gym.make("LunarLander-v2", render_mode="rgb_array")

# Set hyperparameters
num_episodes = 600
max_steps = 1000
batch_size = 32
gamma = 0.99
epsilon_start = 1.0
epsilon_end = 0.01
epsilon_decay = 0.995
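
It is worth checking how far the exploration rate actually decays with these settings. The sketch below replays the update rule `epsilon = max(epsilon_end, epsilon_decay * epsilon)` from the training loop, assuming it is applied once per episode starting from `epsilon_start`; the helper `epsilon_after` is hypothetical and only serves this illustration.

```python
import math

epsilon_start, epsilon_end, epsilon_decay = 1.0, 0.01, 0.995
num_episodes = 600


def epsilon_after(episodes: int) -> float:
    """Exploration rate after `episodes` applications of the decay rule."""
    epsilon = epsilon_start
    for _ in range(episodes):
        epsilon = max(epsilon_end, epsilon_decay * epsilon)
    return epsilon


# Smallest number of episodes after which epsilon sits at its floor,
# obtained by solving epsilon_start * epsilon_decay**n <= epsilon_end for n.
episodes_to_floor = math.ceil(
    math.log(epsilon_end / epsilon_start) / math.log(epsilon_decay)
)

print(f"epsilon after {num_episodes} episodes: {epsilon_after(num_episodes):.3f}")
print(f"episodes needed to reach epsilon_end: {episodes_to_floor}")
```

Under these assumptions the agent still takes a random action in roughly 5% of its steps at the end of the 600 training episodes, and the floor of 0.01 would only be reached after a bit more than 900 episodes.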

## Initialize neural networks

In [None]:
# Initialize the policy and target networks
policy_net = NeuralNetwork(input_size=env.observation_space.shape[0], output_size=env.action_space.n)
target_net = NeuralNetwork(input_size=env.observation_space.shape[0], output_size=env.action_space.n)
# Start the target network from the same weights as the policy network
target_net.load_state_dict(policy_net.state_dict())

## Render untrained agent

In [None]:
# Filename for the output video
output_directory = "vid.mp4"

# Record the video for an episode with the untrained policy network
record_video_neuralnet(env, policy_net, output_directory, max_steps=200, fps=30)
# Show the video
html = render_mp4("video/vid.mp4")
HTML(html)

## Train the agent

In [None]:
episode_rewards, mean_rewards, std_rewards = train_deep_q()

In [None]:
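# Plot training statistics
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
# Plot against the number of recorded episodes so x and y always have the same length
plt.plot(episode_rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Training Progress - Total Reward')

plt.tight_layout()
plt.show()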

## Render the episode

In [None]:
# Filename for the output video
output_directory = "vid.mp4"

# Record the video for an episode with the trained policy network
record_video_neuralnet(env, policy_net, output_directory, max_steps=200, fps=30)
# Show the video
html = render_mp4("video/vid.mp4")
HTML(html)