# Reinforcement learning summer school at the VU - 2022 - Workbook

## Workshop tutorial, day 2: 

## Deep Reinforcement Learning Agent (Part 2)


### Author: Buelent Uendes 

In this notebook, we will use a function approximator to solve the mountain car game. As seen in the previous notebook, a simple agent that uses Q-learning can learn to drive the car up the hill. For this to work, however, one had to discretize the state space. For large problems this approach is not feasible, because the table of state-action values grows too large. To overcome this, we will use a neural network that approximates the state-action values. For this, we will use PyTorch. If you have not used PyTorch yet, do not worry, as most of the code will be provided for you. Also, you can always ask any TA for further help. If you want a more in-depth tutorial on PyTorch, you can use the following YouTube tutorial:

- https://www.youtube.com/watch?v=c36lUUr864M

Deep reinforcement learning became popular following the 2013 paper [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf). The paper and its follow-up work introduced several techniques that aim to stabilize the learning process. In this notebook, we will look at two of these techniques, experience replay and target networks, and will try to solve the mountain car environment using a Deep Reinforcement Learning algorithm. 

**The code used in this notebook is based upon the implementation of a Deep Q agent as shown in this [tutorial.](https://www.youtube.com/watch?v=NP8pXZdU-5U)**

**Instructions:**

In the notebook, you will see a couple of ToDos with some instructions. Try your best to work through them and to complete the notebook. In case you run into problems, do not hesitate to ask any of the TAs for help! :) 

## Preliminaries 

### Import main libraries 

In [None]:
!pip3 install gym==0.26.2

In [None]:
import gym
import numpy as np
import time
import matplotlib.pyplot as plt
import torch 
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import random
from collections import deque
from mpl_toolkits import mplot3d
from matplotlib import cm
import pandas as pd
import seaborn as sns

### Setting the seed for reproducibility 

In [None]:
# Set the seed for reproducibility
seed = 7
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)

## General notes 

We will introduce the concept of Deep Reinforcement Learning in **three** steps:

1) First introduce how to implement a simple deep neural network that represents an essential building block of Deep Q learning

2) Introduce the topic of experience replay/replay buffer

3) Introduce the concept of a target network!

## Part 1: Deep neural network and its general characteristics in the context of reinforcement learning

### General characteristics of Deep Q learning

In the simplest approach, a Deep RL algorithm is:

- Episodic (the agent acts in the environment only for a specific number of timesteps)
- Online (we train the algorithm while the agent interacts with the environment)
- Model-free. We do not attempt to model the environment.

In the following, we will implement a deep neural network using the PyTorch library. 

In [None]:
class DQN(nn.Module):
    
    def __init__(self, env, learning_rate):
        
        '''
        Params:
        env = environment that the agent needs to play
        learning_rate = learning rate used in the update
        
        '''
        
        super(DQN,self).__init__()
        input_features = env.observation_space.shape[0]
        action_space = env.action_space.n
        
        '''
        ToDo: 
        Write the layers of your neural network! 
        Make sure that the input features and the output features are in line with the environment that 
        the class takes as an input feature
        '''
        #Solution:
        
        
        #Here we use ADAM, but you could also think of other algorithms such as RMSprop
        self.optimizer = optim.Adam(self.parameters(), lr = learning_rate)
        
    def forward(self, x):
        
        '''
        Params:
        x = observation
        '''
        
        '''
        ToDo: 
        Write the forward pass! You can use any activation function that you want (ReLU, tanh)...
        Important: The output layer must be linear (no activation function), as we need the q-values associated with each action
    
        '''
        
        #Solution:
        
        return x
    

That's it! This is the implementation of a deep neural network in PyTorch!

## Part 2: Experience replay

In a normal implementation of a deep neural network, one would train the algorithm using some sort of gradient method. Yet, one of the key assumptions of these methods is that the data is i.i.d., i.e. independent and identically distributed, which does not hold in our reinforcement learning setting. The next state and its reward depend on the action our agent took in the preceding state, which makes subsequent states, and hence the data, highly correlated. This can cause the DQN to be unstable. To circumvent this, people use in practice a so-called experience replay technique. The main idea is to break the correlation between subsequent transitions by saving experiences in memory and sampling randomly from the stored transitions when performing a Q-value update. This 'trick' is essential to make the method work!

In the following, we will create an experience replay class that stores the transitions of the deep Q agent. It is important to keep in mind that the replay buffer has a fixed capacity. If the data that we want to store exceeds this capacity, we keep only the most recent transitions in the buffer. 
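
The fixed-capacity behaviour can be illustrated with a small sketch. It assumes, as in the class below, that the buffer is a `collections.deque` with a `maxlen`; the capacity of 3 and the integers standing in for transitions are purely illustrative.

```python
import random
from collections import deque

# Illustrative capacity; a real buffer holds tens of thousands of transitions
buffer = deque(maxlen=3)

# Store five dummy "transitions"; once the buffer is full, the oldest entry is dropped
for transition in range(5):
    buffer.append(transition)

print(list(buffer))  # [2, 3, 4]: only the three most recent transitions remain

# Uniform sampling without replacement ignores the order in which transitions were stored
random.seed(0)
batch = random.sample(list(buffer), 2)
print(batch)
```

Because old entries are discarded automatically, no extra bookkeeping is needed when adding new transitions.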

In [None]:
class ExperienceReplay:
    
    def __init__(self, env, buffer_size, min_replay_size = 1000, seed = 123):
        
        '''
        Params:
        env = environment that the agent needs to play
        buffer_size = max number of transitions that the experience replay buffer can store
        min_replay_size = min number of (random) transitions that the replay buffer needs to have when initialized
        seed = seed for random number generator for reproducibility
        '''
        self.env = env
        self.min_replay_size = min_replay_size
        self.replay_buffer = deque(maxlen=buffer_size)
        self.reward_buffer = deque([-200.0], maxlen = 100)
        
        print('Please wait, the experience replay buffer will be filled with random transitions')
                
        obs, _ = self.env.reset(seed=seed)
        for _ in range(self.min_replay_size):
            action = env.action_space.sample()
            new_obs, rew, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            transition = (obs, action, rew, done, new_obs)
            self.replay_buffer.append(transition)
            obs = new_obs
    
            if done:
                #Do not pass the seed again, otherwise every episode starts from the same state
                obs, _ = env.reset()
        
        print('Initialization with random transitions is done!')
      
          
    def add_data(self, data): 
        '''
        Params:
        data = transition tuple in the order (obs, action, reward, done, new_obs)
        '''
        self.replay_buffer.append(data)
            
    def sample(self, batch_size):
        
        '''
        Params:
        batch_size = number of transitions that will be sampled
        
        Returns:
        tensor of observations, actions, rewards, done (boolean) and next observation 
        '''
        
        transitions = random.sample(self.replay_buffer, batch_size)
        observations = np.asarray([t[0] for t in transitions])
        '''
        ToDo:
        Do the same for the remaining variables and store these as np arrays!
        '''

        #PyTorch needs these arrays as tensors!
        observations_t = torch.as_tensor(observations, dtype = torch.float32)
        actions_t = torch.as_tensor(actions, dtype = torch.int64).unsqueeze(-1)
        rewards_t = torch.as_tensor(rewards, dtype = torch.float32).unsqueeze(-1)
        dones_t = torch.as_tensor(dones, dtype = torch.float32).unsqueeze(-1)
        new_observations_t = torch.as_tensor(new_observations, dtype = torch.float32)
        
        return observations_t, actions_t, rewards_t, dones_t, new_observations_t
    
    def add_reward(self, reward):
        
        '''
        Params:
        reward = reward that the agent earned during an episode of a game
        '''
        
        self.reward_buffer.append(reward)
        

## Write the code for the vanilla DQN agent 

In [None]:
class vanilla_DQNAgent:
    
    def __init__(self, env_name, device, epsilon_decay, 
                 epsilon_start, epsilon_end, discount_rate, lr, buffer_size, seed = 123):
        '''
        Params:
        env_name = name of the environment that the agent needs to play
        device = set up to run CUDA operations
        epsilon_decay = Decay period until epsilon start -> epsilon end
        epsilon_start = starting value for the epsilon value
        epsilon_end = ending value for the epsilon value
        discount_rate = discount rate for future rewards
        lr = learning rate
        buffer_size = max number of transitions that the experience replay buffer can store
        seed = seed for random number generator for reproducibility
        '''
        self.env_name = env_name
        self.env = gym.make(self.env_name, render_mode = None)
        self.device = device
        self.epsilon_decay = epsilon_decay
        self.epsilon_start = epsilon_start
        self.epsilon_end = epsilon_end
        self.discount_rate = discount_rate
        self.learning_rate = lr
        self.buffer_size = buffer_size
        
        self.replay_memory = ExperienceReplay(self.env, self.buffer_size, seed = seed)
        self.online_network = DQN(self.env, self.learning_rate).to(self.device)
        
    def choose_action(self, step, observation, greedy = False):
        
        '''
        Params:
        step = the specific step number 
        observation = observation input
        greedy = boolean that, if True, forces the greedy action (no random exploration)
        
        Returns:
        action: action chosen (either random or greedy)
        epsilon: the epsilon value that was used 
        '''
        
        epsilon = np.interp(step, [0, self.epsilon_decay], [self.epsilon_start, self.epsilon_end])
    
        random_sample = random.random()
    
        if (random_sample <= epsilon) and not greedy:
            #Random action
            action = self.env.action_space.sample()
        
        else:
            #Greedy action
            obs_t = torch.as_tensor(observation, dtype = torch.float32, device = self.device)
            q_values = self.online_network(obs_t.unsqueeze(0))
        
            max_q_index = torch.argmax(q_values, dim = 1)[0]
            action = max_q_index.detach().item()
        
        return action, epsilon
    
    def learn(self, batch_size):
        
        '''
        Params:
        batch_size = number of transitions that will be sampled
        '''
        
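        #Sample random transitions with size = batch size and move them to the device of the network
        observations_t, actions_t, rewards_t, dones_t, new_observations_t = (
            t.to(self.device) for t in self.replay_memory.sample(batch_size))

        #Compute targets; note that the same (online) network is used for the targets and the predictions
        #The targets are treated as constants, so no gradients are tracked here
        with torch.no_grad():
            target_q_values = self.online_network(new_observations_t)
            max_target_q_values = target_q_values.max(dim=1, keepdim=True)[0]

        targets = rewards_t + self.discount_rate * (1-dones_t) * max_target_q_values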

        #Compute loss
        q_values = self.online_network(observations_t)
        action_q_values = torch.gather(input=q_values, dim=1, index=actions_t)

        #Loss
        '''
        ToDo: 
        Implement here the loss function! You can choose the standard MSE loss or Huber loss. Call this variable loss!
        '''        

        
        '''
        ToDo: Write the gradient descent step, where you optimize the online network based on the loss!
            Tip: You need 3 lines.
            1. Call the zero_grad method on self.online_network.optimizer!
            2. Call the backward method on the loss
            3. Do an optimization step
        '''
        
        #Solution:
        #Gradient descent

        

## Write the training loop and perform the first run!

As a last step, we can write a training loop that puts everything together. We will run the training loop for a number of iterations and see how our first algorithm performs!

### Hyperparameters 

In [None]:
#Set the hyperparameters

#Discount rate
discount_rate = 0.99
#Number of transitions that we sample from the buffer for each update
batch_size = 32
#Maximum number of transitions that we store in the buffer
buffer_size = 50000
#Minimum number of random transitions stored in the replay buffer
min_replay_size = 1000
#Starting value of epsilon
epsilon_start = 1.0
#End value (lowest value) of epsilon
epsilon_end = 0.05
#Decay period until epsilon start -> epsilon end
epsilon_decay = 10000

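#Total number of environment steps used for training (despite its name, this counts steps, not episodes)
max_episodes = 250000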

#Learning_rate
lr = 5e-4

### Initialize all instances 

In [None]:
env_name = 'MountainCar-v0'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
vanilla_agent = vanilla_DQNAgent(env_name, device, epsilon_decay, epsilon_start, epsilon_end, discount_rate, lr, buffer_size)

### Write a training loop function

We will first write a training loop function and let it then run for the vanilla DQN agent!

In [None]:
def training_loop(env_name, agent, max_episodes, target_ = False, seed=42):
    
    '''
    Params:
    env_name = name of the environment that the agent needs to play
    agent = which agent is used to train
    max_episodes = maximum number of environment steps (not games) used for training
    target_ = boolean variable indicating if a target network is used (this will be clear later)
    seed = seed for random number generator for reproducibility
    
    Returns:
    average_reward_list = a list of the average reward over the last 100 episodes, recorded every 100 steps
    '''
    env = gym.make(env_name, render_mode = None)
    env.action_space.seed(seed)
    obs, _ = env.reset(seed=seed)
    average_reward_list = [-200]
    episode_reward = 0.0
    
    for step in range(max_episodes):
        
        action, epsilon = agent.choose_action(step, obs)
       
        new_obs, rew, terminated, truncated, _ = env.step(action)
        done = terminated or truncated        
        transition = (obs, action, rew, done, new_obs)
        agent.replay_memory.add_data(transition)
        obs = new_obs
    
        episode_reward += rew
    
        if done:
        
            #Do not pass the seed again, otherwise every episode starts from the same state
            obs, _ = env.reset()
            agent.replay_memory.add_reward(episode_reward)
            #Reinitilize the reward to 0.0 after the game is over
            episode_reward = 0.0

        #Learn

        agent.learn(batch_size)

        #Every 100 steps, add the average reward over the last (up to) 100 episodes to the list
                
        if (step+1) % 100 == 0:
            average_reward_list.append(np.mean(agent.replay_memory.reward_buffer))
        
        #Update target network, do not bother about it now!
        if target_:
            
            #Set the target_update_frequency
            target_update_frequency = 250
            if step % target_update_frequency == 0:
                agent.update_target_network()
    
        #Print some output
        if (step+1) % 10000 == 0:
            print(20*'--')
            print('Step', step)
            print('Epsilon', epsilon)
            print('Avg Rew', np.mean(agent.replay_memory.reward_buffer))
            print()

    return average_reward_list

In [None]:
average_rewards_vanilla_dqn = training_loop(env_name, vanilla_agent, max_episodes)

**Comment**: As you can see, the vanilla deep Q network performs very poorly and does not learn to master the challenge. Play around with the number of iterations and epsilon decay to check if you can improve the algorithm!
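
To get a feeling for what `epsilon_decay` does, the exploration schedule used in `choose_action` can be evaluated on its own. The sketch below reuses the hyperparameter values from above; the helper name `epsilon_at` is made up for this illustration.

```python
import numpy as np

# Values taken from the hyperparameter cell above
epsilon_start, epsilon_end, epsilon_decay = 1.0, 0.05, 10000

def epsilon_at(step):
    # Linear interpolation between the start and end value;
    # np.interp keeps returning epsilon_end once step exceeds epsilon_decay
    return float(np.interp(step, [0, epsilon_decay], [epsilon_start, epsilon_end]))

for step in [0, 5000, 10000, 250000]:
    print(step, epsilon_at(step))
```

With these values, the agent explores with the minimum epsilon for all but the first 10000 of the 250000 training steps, so a longer decay period means more exploration.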

## Part 3: Target network

A problem of the standard Q-learning update introduced above is that we use the same Q-values both to choose an action and to evaluate it, which leads to overestimated values. To overcome this problem, double Q-learning was proposed in the following paper: [Double Q-learning](https://papers.nips.cc/paper/2010/file/091d584fced301b442654dd8c23b3fc9-Paper.pdf).
A second problem is that the target of the update moves with every gradient step, because it is computed with the very network that is being trained. In the case of DQN, both problems can be addressed with a second neural network, a so-called target network. Just as the name suggests, the target network is used to compute the target of the update equation. It is only updated after a pre-defined number of steps, so that the target does not move while the online network learns (as is the case in the simple DQN framework above). Van Hasselt et al. (2016) build on the target network in [Deep reinforcement learning with double Q-learning](https://arxiv.org/pdf/1509.06461.pdf), where the online network chooses the action and the target network evaluates it. 


Implementing a target network is rather straightforward. All we need to do is to initialize, besides the online network, a so-called target network. After a pre-defined number of steps, the parameter values of the target network are overwritten with those of the online network. For this, we will add a few lines to the vanilla DQN class and call it DDQN (for double deep Q learning). Note that the `learn` method of this class uses the target network both to choose and to evaluate the next action.
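
For comparison, the difference between a target computed with the target network alone and the Double DQN target of van Hasselt et al. (2016) can be sketched as follows. This is a minimal illustration on random tensors and not the code used in the class below; the tensor names and shapes are assumptions chosen to mirror the `learn` method.

```python
import torch

torch.manual_seed(0)
batch_size, n_actions, gamma = 4, 3, 0.99

# Random stand-ins for the outputs of the two networks on the next observations
online_next_q = torch.randn(batch_size, n_actions)
target_next_q = torch.randn(batch_size, n_actions)
rewards_t = -torch.ones(batch_size, 1)
dones_t = torch.zeros(batch_size, 1)

# Target network only: the target network both selects and evaluates the action
max_target_q = target_next_q.max(dim=1, keepdim=True)[0]
targets_target_net = rewards_t + gamma * (1 - dones_t) * max_target_q

# Double DQN: the online network selects the action, the target network evaluates it
best_actions = online_next_q.argmax(dim=1, keepdim=True)
double_q = torch.gather(target_next_q, dim=1, index=best_actions)
targets_double = rewards_t + gamma * (1 - dones_t) * double_q

print(targets_target_net.squeeze(1))
print(targets_double.squeeze(1))
```

Since the evaluated action is not necessarily the one with the largest target-network value, the Double DQN target can never exceed the target computed with the maximum, which is how it reduces overestimation.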

In [None]:
class DDQNAgent:
    
    def __init__(self, env_name, device, epsilon_decay, 
                 epsilon_start, epsilon_end, discount_rate, lr, buffer_size, seed = 123):
        '''
        Params:
        env_name = name of the environment that the agent needs to play
        device = set up to run CUDA operations
        epsilon_decay = Decay period until epsilon start -> epsilon end
        epsilon_start = starting value for the epsilon value
        epsilon_end = ending value for the epsilon value
        discount_rate = discount rate for future rewards
        lr = learning rate
        buffer_size = max number of transitions that the experience replay buffer can store
        seed = seed for random number generator for reproducibility
        '''
        self.env_name = env_name
        self.env = gym.make(self.env_name, render_mode = None)
        self.device = device
        self.epsilon_decay = epsilon_decay
        self.epsilon_start = epsilon_start
        self.epsilon_end = epsilon_end
        self.discount_rate = discount_rate
        self.learning_rate = lr
        self.buffer_size = buffer_size
        
        self.replay_memory = ExperienceReplay(self.env, self.buffer_size, seed = seed)
        self.online_network = DQN(self.env, self.learning_rate).to(self.device)
        
        '''
        ToDo: 
        Add here a target network and set the parameter values to the ones of the online network! 
        You can do this in 2 steps:
            1) Initialize the target network and call it self.target_network
            2) Set the parameters to the ones of the online network
               Hint: Use the method 'load_state_dict'!
        '''
        #WRITE YOUR CODE HERE!
        
        #Initialize the target network 
        
        #Set the parameters to the ones of the online_network!
        
    def choose_action(self, step, observation, greedy = False):
        
        '''
        Params:
        step = the specific step number 
        observation = observation input
        greedy = boolean that, if True, forces the greedy action (no random exploration)
        
        Returns:
        action: action chosen (either random or greedy)
        epsilon: the epsilon value that was used 
        '''
        
        epsilon = np.interp(step, [0, self.epsilon_decay], [self.epsilon_start, self.epsilon_end])
    
        random_sample = random.random()
    
        if (random_sample <= epsilon) and not greedy:
            #Random action
            action = self.env.action_space.sample()
        
        else:
            #Greedy action
            obs_t = torch.as_tensor(observation, dtype = torch.float32, device = self.device)
            q_values = self.online_network(obs_t.unsqueeze(0))
        
            max_q_index = torch.argmax(q_values, dim = 1)[0]
            action = max_q_index.detach().item()
        
        return action, epsilon
    
    
    def return_q_value(self, observation):
        '''
        Params:
        observation = input value of the state the agent is in
        
        Returns:
        maximum q value 
        '''
        #We will need this function later for plotting the 3D graph
        
        obs_t = torch.as_tensor(observation, dtype = torch.float32, device = self.device)
        q_values = self.online_network(obs_t.unsqueeze(0))
        
        return torch.max(q_values).item()
        
    def learn(self, batch_size):
        
        '''
        Params:
        batch_size = number of transitions that will be sampled
        '''
        
        #Sample random transitions and move them to the device of the networks
        observations_t, actions_t, rewards_t, dones_t, new_observations_t = (
            t.to(self.device) for t in self.replay_memory.sample(batch_size))

        #Compute targets with the target network; the targets are constants, so no gradients are tracked
        with torch.no_grad():
            target_q_values = self.target_network(new_observations_t)
            max_target_q_values = target_q_values.max(dim=1, keepdim=True)[0]

        targets = rewards_t + self.discount_rate * (1-dones_t) * max_target_q_values

        #Compute loss

        q_values = self.online_network(observations_t)

        action_q_values = torch.gather(input=q_values, dim=1, index=actions_t)

        #Loss, here we take the huber loss!

        loss = F.smooth_l1_loss(action_q_values, targets)
        
        #Uncomment the following code to use the MSE loss instead!
        #loss = F.mse_loss(action_q_values, targets)
        
        #Gradient descent to update the weights of the neural network
        self.online_network.optimizer.zero_grad()
        loss.backward()
        self.online_network.optimizer.step()
        
    def update_target_network(self):
        
        '''
        ToDo: 
        Complete the method which updates the target network with the parameters of the online network
        Hint: use the load_state_dict method!
        '''
    
        #WRITE YOUR CODE HERE!
    

    def play_game(self, step=1, seed=123):
        '''
        The following method will let the DDQNAgent play the game after it has worked 
        through the training steps

        Params:
        step = the number of the step within the epsilon decay that is used for the epsilon value of epsilon-greedy
        seed = seed for random number generator for reproducibility
        '''
        #Get the optimized strategy:
        done = False
        #Reinitialize the game 
        self.env = gym.make(self.env_name, render_mode='human')
        #Start the game
        state, _ = self.env.reset(seed=seed)
        while not done:
            #Pick the best action 
            action = self.choose_action(step, state, True)[0]
            next_state, rew, terminated, truncated, _ = self.env.step(action)
            done = terminated or truncated 
            state = next_state
            #Pause to make it easier to watch
            time.sleep(0.05)
        #Close the pop-up window
        self.env.close()
    

After we have created our DDQNAgent class, we can re-run the experiment from above and see if we can increase the performance! 

## Hyperparameters and initialization 

Since the hyperparameters are the same as before, we only need to initialize the new agent. The only new hyperparameter, target_update_frequency, is set inside the training loop function.

In [None]:
dagent = DDQNAgent(env_name, device, epsilon_decay, epsilon_start, epsilon_end, discount_rate, lr, buffer_size)

### Main loop DDQN - double deep Q network

In [None]:
average_rewards_ddqn = training_loop(env_name, dagent, max_episodes, target_ = True) 

**Comments**:

As you can see, implementing a target network improved the performance of the deep reinforcement learning algorithm greatly! 

We can also plot the results of both algorithms to see the difference even more clearly.

In [None]:
#The averages are recorded every 100 steps, starting with the initial value at step 0
plt.plot(100*np.arange(len(average_rewards_ddqn)),average_rewards_ddqn)
plt.plot(100*np.arange(len(average_rewards_vanilla_dqn)),average_rewards_vanilla_dqn)
# specifying horizontal line type
plt.axhline(y = -110, color = 'r', linestyle = '-')
plt.title('Average reward over the past 100 episodes')
plt.xlabel('Number of environment steps')
plt.legend(['Double DQN', 'Vanilla DQN', 'Benchmark solving the game'])
plt.ylabel('Average reward')

As we can see, the Double DQN performs significantly better than the vanilla DQN. The horizontal red line is the benchmark: the mountain car environment is considered solved when the average reward over 100 consecutive trials is at least -110 ([check this link for further info](https://github.com/openai/gym/wiki/Leaderboard#mountaincar-v0)).
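
The benchmark can also be checked programmatically. The following is a minimal sketch, assuming a list with one total reward per episode; the helper `first_solved_episode` and the synthetic rewards are hypothetical and not part of the training loop above.

```python
import numpy as np

def first_solved_episode(episode_rewards, window=100, threshold=-110.0):
    """Return the index of the episode at which the rolling mean first reaches the threshold."""
    rewards = np.asarray(episode_rewards, dtype=float)
    if len(rewards) < window:
        return None
    # Rolling mean over `window` consecutive episodes
    rolling_mean = np.convolve(rewards, np.ones(window) / window, mode="valid")
    solved = np.nonzero(rolling_mean >= threshold)[0]
    # rolling_mean[k] covers episodes k .. k + window - 1, so report the last episode of that window
    return int(solved[0]) + window - 1 if len(solved) else None

# Synthetic example: 100 failed episodes followed by 100 episodes with a reward of -105
synthetic_rewards = [-200.0] * 100 + [-105.0] * 100
print(first_solved_episode(synthetic_rewards))
```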

You can play around with the hyperparameters and see how the results change if, for example, you lower the discount rate or learning rate! Also, you can see if changing the neural network architecture, e.g. making it deeper, will lead to an increase in performance.

## Reap the rewards of the hard work - see the DDQN play the game! 

Now that we have worked through two different deep reinforcement learning architectures, we can watch the DDQN agent solve the game. The code below will let the DDQNAgent play the mountain car game. 

In [None]:
dagent.play_game()

## Visualize the result in a 3D plot 

We can visualize the result in a 3D plot, plotting the x-position and the velocity against the corresponding value function. Recall that, under a greedy policy, the value of a state is the maximum state-action value in that state.

The following code plots the value function that results from the DDQN algorithm in 3D.

In [None]:
#The environment is stored inside the agent
low = dagent.env.observation_space.low
high = dagent.env.observation_space.high

bin_size = 20
bin_x = np.linspace(low[0], high[0], bin_size)
bin_velocity = np.linspace(low[1], high[1], bin_size)

X, Y = np.meshgrid(bin_x, bin_velocity)
Z = np.zeros(X.shape)

#For plot_surface, Z[i][j] must hold the value of the state (X[i][j], Y[i][j])
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Z[i][j] = dagent.return_q_value([X[i][j], Y[i][j]])
fig = plt.figure(figsize =(10,10))
ax = plt.axes(projection='3d')

surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)

#ax.contour3D(X, Y, Z, 50, cmap='magma')
ax.set_xlabel('x-position', fontsize = 18)
ax.set_ylabel('velocity', fontsize = 18)
ax.set_zlabel('value function', fontsize = 18)
fig.colorbar(surf, shrink=0.5, aspect=5)
ax.set_title('Visualization of the value function', fontsize = 18)
plt.show()

**Your turn!**

You can play around with the deep reinforcement learning architecture and see what impact, for example, the discount rate has. Also, you can modify the architecture of the neural network (by making it deeper and/or changing the activation function). Just have fun!

## Extensions/Interesting notes: 

Following the success of the paper by Mnih et al. (2013), research in deep reinforcement learning has progressed, and several extensions to the basic framework as well as tricks have been proposed, such as dueling deep Q learning (also called D3QN) and prioritized experience replay. Before I give you some pointers on these, we will discuss the deadly triad in a bit more detail.


## 1) The deadly triad

Some of you might have heard of the term 'deadly triad', which refers to the instability that a reinforcement learning algorithm can face when it makes use of:

- function approximation
- bootstrapping
- off-policy evaluation

Our deep reinforcement learning algorithm makes use of all three concepts. Yet, the deadly triad does **not** state that instability/divergence always occurs when all three techniques are used, only that it **can** occur. An interesting paper that addresses this issue empirically is van Hasselt et al. (2018), [Deep Reinforcement Learning and the Deadly Triad](https://arxiv.org/pdf/1812.02648.pdf).

To measure divergence, they note that if one bounds the rewards to the interval $[-1,1]$, the corresponding Q-values are bounded as well:

$  \sum_{t'=t}^{T} \gamma^{t'-t} |r_{t'}| \le \sum_{t'=t}^{\infty} \gamma^{t'-t} |r_{t'}| \le \sum_{t'=t}^{\infty} \gamma^{t'-t} = \frac{1}{1-\gamma}  $

According to this, the magnitude of any Q-value is theoretically bounded by $\frac{1}{1-\gamma}$. In our case, this bound is $100$ (given a discount rate of $0.99$). Hence, if the magnitude of a Q-value exceeds this bound, we say that soft divergence occurs.
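
The bound is easy to verify numerically. The sketch below assumes the worst case of a reward with magnitude 1 at every step and an episode length of 200 steps, the limit of MountainCar-v0; the variable names are chosen for this illustration only.

```python
gamma = 0.99

# Bound for an infinite horizon: geometric series 1 / (1 - gamma)
bound = 1.0 / (1.0 - gamma)

# Discounted sum of |r| = 1 over a finite episode of 200 steps
episode_length = 200
partial_sum = sum(gamma ** t for t in range(episode_length))

print(bound, partial_sum)
```

The finite-horizon sum stays below the bound, so a learned Q-value whose magnitude is above 100 cannot correspond to any achievable return.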

By running several experiments, they arrive at some interesting insights:

- If one does not correct for the overestimation bias, or does not use a separate target network, divergence can occur more frequently.

- Increasing the multistep return decreases the chance of divergence.

- The effect of the neural network size is not straightforward, as the best performing architectures in their experiment are large, but also tend to show some instabilities.

Hence, they suggest that one can prevent instabilities by reducing the overestimation bias, by bootstrapping on a separate network and by using multi-step returns. 


## 2) The importance of the random seed - instability of the DeepRL 

As stressed [in this good post](https://spinningup.openai.com/en/latest/spinningup/spinningup.html#closing-thoughts), the performance of DeepRL algorithms is very sensitive to stochasticity and to the particular choice of hyperparameters. For this reason, one should run any DeepRL algorithm on a number of different random seeds and carefully tune the hyperparameters. This aspect is also discussed in this [paper](https://arxiv.org/pdf/1708.04133.pdf).

In the following, we will show how setting different seeds affects the performance of the double deep Q network. The following code is inspired by the code [here](https://gymnasium.farama.org/tutorials/reinforce_invpend_gym_v26/)

In [None]:
rewards_for_different_seeds = []
for seed in range(5): # Here we go for 5 seeds only, as otherwise the code runs for too long!
    # Reset the pytorch and numpy seeds
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    
    # Initialize the agent:
    dagent = DDQNAgent(env_name, device, epsilon_decay, epsilon_start, 
                       epsilon_end, discount_rate, lr, buffer_size, seed)
    
    #Pass the seed on, so that the environment is seeded differently in each run as well
    rewards_for_different_seeds.append(training_loop(env_name, dagent, max_episodes, target_ = True, seed = seed)) 


In [None]:
#Create a dataframe to plot using seaborn library
rewards = pd.DataFrame(rewards_for_different_seeds).melt()
rewards.rename(columns={"variable": "episodes", "value": "reward"}, inplace=True)
sns.set(style="darkgrid", context="talk", palette="rainbow")
sns.lineplot(x="episodes", y="reward", data=rewards).set(
    title="Performance of Double DQN for different seeds (MountainCar-v0)"
)
plt.show()

**Comment**:
As expected, the performance of the DDQN algorithm is heavily affected by the different seeds! This illustrates the importance of running the experiment with different seeds!

## 3) Dynamic discount rate 

One of the hyperparameters that one needs to choose carefully is the discount rate $\gamma$. In their work, [Francois-Lavet et al. (2016)](https://arxiv.org/pdf/1512.02011.pdf) show that one can increase the performance and significantly decrease the number of learning steps required by having not only a dynamic $\epsilon$ rate, but also a discount rate that increases over time. In particular, they suggest the following update:

$\gamma_{k+1} = 1 - 0.98(1- \gamma_{k}) $

The start value of gamma is set to $0.9$ and it increases up to a final value of $0.99$. You could try to implement this and check its effect on the result!
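
The resulting schedule can be sketched in a few lines. This is an illustration under the assumption that $k$ indexes updates of the discount rate at fixed intervals and that the value is simply capped at the final value; the helper name `next_gamma` is hypothetical.

```python
def next_gamma(gamma, final_gamma=0.99, shrink=0.98):
    # gamma_{k+1} = 1 - 0.98 * (1 - gamma_k), capped at the final value
    return min(final_gamma, 1.0 - shrink * (1.0 - gamma))

gamma = 0.9
schedule = [gamma]
while gamma < 0.99:
    gamma = next_gamma(gamma)
    schedule.append(gamma)

# Number of updates needed to reach the final value, and the first and last values
print(len(schedule) - 1, schedule[:3], schedule[-1])
```

Each update shrinks the gap $1-\gamma$ by 2%, so the discount rate rises quickly at first and then levels off at the final value.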


## 4) Prioritized experience replay 

As we saw above, an important part of deep reinforcement learning is the experience replay buffer. This concept lets the agent remember and reuse old experiences from the past. So far, those experiences were sampled uniformly. However, one could imagine that some experiences are more fruitful for the agent to replay than others. Intuitively, one would like to replay the experiences from which the agent can learn the most. To do this, one needs to change the sampling procedure such that experiences from which the agent can learn more have a higher sampling probability. This is the key idea of the prioritized experience replay buffer, as introduced by [Schaul et al. (2015)](https://arxiv.org/pdf/1511.05952.pdf). 
In their work, they propose to measure the expected learning significance by the magnitude of the temporal difference (TD) error.

In particular, they propose the following. The priority of transition $p_{i}$ is based on the absolute magnitude of the TD error $\delta_{i}$ plus a small, positive constant $\epsilon$ which ensures that all samples have a non-zero probability to be resampled:

$p_{i} = |\delta_{i}| + \epsilon$

Then, the probability of sampling transition $i$ is defined as:

$P(i) = \frac {p_{i}^{\alpha}}{\sum_{k} p_{k}^{\alpha}}$ where $\alpha$ determines how much prioritization is used (with $\alpha = 0$ representing the standard uniform case).

Given that we change the sampling procedure from a uniform one to one in which transitions with a higher TD error have a higher chance of being resampled, we need to correct for the bias that this introduces, as stochastic updates rely on the assumption that the samples are drawn from the same distribution as the expectation. This can be done via importance sampling, i.e.:

$w_{i} = (\frac {1}{N} \frac{1}{P(i)})^{\beta}$

where $N$ refers to the size of the replay buffer and $\beta$ to a hyperparameter (with $\beta = 1$ compensating fully for the non-uniform probabilities). For stability reasons, they suggest to normalize the weights by dividing all weights by the largest weight.

Given this, the only change one needs to make to the training algorithm is to multiply the gradient in the update equation by $w_{i}$.
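
The formulas for the priorities, the sampling probabilities and the importance-sampling weights can be put together in a short NumPy sketch. The TD errors and the values of $\alpha$, $\beta$ and $\epsilon$ are illustrative assumptions, and the sketch covers only the sampling step, not a full replay buffer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative TD errors of four stored transitions
td_errors = np.array([0.1, -2.0, 0.5, 0.0])
alpha, beta, eps = 0.6, 0.4, 1e-2

# p_i = |delta_i| + epsilon
priorities = np.abs(td_errors) + eps

# P(i) = p_i^alpha / sum_k p_k^alpha
probs = priorities ** alpha / np.sum(priorities ** alpha)

# w_i = (1/N * 1/P(i))^beta, normalized by the largest weight
N = len(td_errors)
weights = (1.0 / (N * probs)) ** beta
weights /= weights.max()

# Draw a batch of transition indices according to P(i)
batch_idx = rng.choice(N, size=2, p=probs)

print(probs)
print(weights)
print(batch_idx)
```

The transition with the largest TD error is sampled most often but receives the smallest weight, which is how the correction counteracts the non-uniform sampling.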

## 5) Dueling Deep Reinforcement learning - D3QN

Another extension to the standard DDQN framework is the idea of dueling deep reinforcement learning. This was introduced by [Wang et al. (2016)](https://arxiv.org/pdf/1511.06581.pdf).

The key idea is to note that the Q function can be decomposed into the value function $V(s)$ which denotes how valuable it is for an agent to be in a particular state $s$ and an advantage function $A(s,a)$ which determines how valuable it is to take a particular action $a$ in a state $s$:

$Q(s,a) = V(s) + A(s,a)$

Yet, given a particular Q function, one can decompose it into the components $V(s)$ and $A(s,a)$ in many ways, which raises identifiability issues. To circumvent this issue, the authors suggest to force the highest Q value to be equal to the value function, i.e.

$Q(s,a) = V(s) + \left( A(s,a) - \max_{a' \in \mathcal{A}} A(s,a') \right)$

where $\mathcal{A}$ denotes the set of actions. Alternatively, one can subtract the average over all actions:


$Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')$

To implement this in PyTorch, one only needs to separate the layers into two different streams, where one stream calculates the value function and the other one the advantage function. Lastly, one needs to aggregate them again (and subtract the average!).
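
A minimal sketch of such a network is given below. The class name `DuelingDQN`, the single shared hidden layer and its size of 64 are assumptions made for this illustration; the input and output sizes match the mountain car environment (2 observations, 3 actions).

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):

    def __init__(self, input_features, n_actions, hidden=64):
        super().__init__()
        # Shared layer, followed by two separate streams
        self.shared = nn.Sequential(nn.Linear(input_features, hidden), nn.ReLU())
        self.value_stream = nn.Linear(hidden, 1)
        self.advantage_stream = nn.Linear(hidden, n_actions)

    def forward(self, x):
        h = self.shared(x)
        value = self.value_stream(h)
        advantage = self.advantage_stream(h)
        # Aggregate the streams: Q = V + A - mean(A)
        return value + advantage - advantage.mean(dim=1, keepdim=True)

torch.manual_seed(0)
net = DuelingDQN(input_features=2, n_actions=3)
x = torch.randn(5, 2)
q_values = net(x)
print(q_values.shape)
```

Because the mean advantage is subtracted, the average of the Q-values over all actions equals the output of the value stream.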