# Deep Q-Learning

In this notebook we discuss how to use neural networks to solve reinforcement learning problems. This approach has expanded the range of domains we can tackle, especially those with large or continuous state spaces. In particular, we cover:
 * How neural networks can be used to represent value functions.
 * How to adapt Monte Carlo and Temporal Difference methods (model-free approaches) to work with this new representation.
 * How to implement the Deep Q-Learning algorithm.
 * How to harness the power of deep learning, including convolutional and recurrent neural networks.

## Neural Networks as Value Functions

We can think of Neural Networks as universal function approximators.<br>
How can we use them to represent our value functions?<br>
A state-value function maps any state $s$ to a real number that indicates how valuable that state is under the current policy $\pi$. If we use a neural network to approximate this function, the input needs to be fed in as a vector, which can be done using a feature transformation $x$. The input then passes through a network that has been designed to output a single real number, and this estimate is our value.
![LearningNeuralNetsParameters](./images/LearningrNeuralNetParameteres.png)

The neural network has to learn the parameters $W$.<br>
As in supervised deep learning, if we have a target that we are trying to reach (for instance $V_\pi$, provided somehow by an oracle), we can use the squared difference between the network's estimate and the target value as our error or loss. We can then backpropagate this loss through the network, adjusting the weights along the way to minimize it. Perhaps the most popular method for adjusting the weights is gradient descent, which iteratively changes the weights by a small step in the direction that reduces the error. To apply gradient descent we need the derivative of the value function represented by the network with respect to its weights. This computation can become very complex, especially for deep architectures, but libraries such as TensorFlow, Theano and MXNet implement it efficiently. What remains is to find a way to define the loss, and that is where our knowledge of reinforcement learning comes into play.
![learningneuralnetworkparameters](./images/LearningrNeuralNetParameteres_2.png)

As before, we also consider the action-value function $q$. Its update rule looks very similar to that of the state-value function $V$:<br>

<center>
    $\text{state value function:     } \Delta W = \alpha\big(v_\pi(s) - \hat{v}(s, W)\big)\nabla_w\hat{v}(s, W) $<br>
    $\text{action value function:     } \Delta W = \alpha\big(q_\pi(s, a) - \hat{q}(s, a, W)\big)\nabla_w\hat{q}(s, a, W) $
</center>
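
To make the state-value update rule concrete, here is a minimal sketch that assumes a linear approximator $\hat{v}(s, W) = x(s)^\top W$, for which the gradient $\nabla_w\hat{v}(s, W)$ is simply the feature vector $x(s)$. The function names, the feature vector and the target value are illustrative assumptions, not part of the original material.

```python
import numpy as np

def v_hat(x, w):
    """Linear value estimate: dot product of features x(s) and weights W."""
    return float(np.dot(x, w))

def gradient_step(w, x, target, alpha):
    """Delta W = alpha * (target - v_hat(s, W)) * grad, where grad = x(s) for a linear model."""
    error = target - v_hat(x, w)
    return w + alpha * error * x

w = np.zeros(3)
x = np.array([1.0, 0.5, -1.0])      # hypothetical feature vector x(s)
w = gradient_step(w, x, target=2.0, alpha=0.1)
print(w, v_hat(x, w))               # the estimate has moved from 0 towards the target 2.0
```

With a neural network the only change is that the gradient is obtained by backpropagation instead of being equal to $x(s)$.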

The main problem here is that reinforcement learning is not supervised learning. There is no way to know beforehand what the target we are chasing is. For this reason, we need to use a more realistic target, one that is based on our interactions with the environment.<br>
To find suitable targets to use in place of the true value functions in the equations above, we will discuss (and apply) some strategies.

## Monte Carlo Learning

Monte Carlo Learning is the first strategy to be discussed.<br>
Recall the incremental step used in classical MC Learning to update value functions. There, $G_t$ is the return, that is, the cumulative discounted reward received after time $t$, and it is a suitable target to aim for. If we take our neural network update rule and substitute the unknown true value function with this return, we get a concrete update rule for state-value functions represented by neural networks or other function approximators.

The update rule of our neural net looks currently like this:

$$\Delta W = \alpha\big(V_\pi(S_t) - \hat{V}(S_t, W)\big) \nabla_w\hat{V}(S_t, W)$$

To get our concrete update rule, we replace the term $V_\pi(S_t)$ with the sampled return $G_t$:

$$\Delta W = \alpha\big(G_t - \hat{V}(S_t, W)\big) \nabla_w\hat{V}(S_t, W)$$

![MonteCarloLearning](./images/MonteCarloLearning.png)
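
The following sketch shows how the return $G_t$ can be computed for every step of a finished episode and plugged into the update rule above. It assumes a linear approximator $\hat{V}(S_t, W) = x(S_t)^\top W$; the function names and the convention that `rewards[t]` holds $R_{t+1}$ are illustrative assumptions.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t for every time step t, where rewards[t] holds R_{t+1}."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g          # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

def mc_update(w, features, rewards, alpha, gamma):
    """Apply Delta W = alpha * (G_t - V_hat(S_t, W)) * x(S_t) for each step of one episode."""
    for x, g in zip(features, discounted_returns(rewards, gamma)):
        w = w + alpha * (g - x @ w) * x
    return w

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))
```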

As expected, we can do the same for action value functions:

![MonteCarloLearningActionValue](./images/MonteCarloLearning_actionVal.png)

We now focus on the control problem and adapt the classic MC algorithm to work with a function approximator. This normally includes an evaluation step, where we try to estimate the value of each state-action pair under the current policy. We do this by interacting with the environment to generate an episode using the policy $\pi$. Then, for each time step $t$ in the episode, we update the parameter vector $W$ using the state-action pair $(S_t, A_t)$ and the return $G_t$ computed from the remainder of the episode. This is followed by an improvement step, where we extract an $\epsilon$-greedy policy based on these Q-values. At the beginning we need to initialize our parameters $W$, say randomly, and start with a policy $\pi$ defined in the same $\epsilon$-greedy manner. We then repeat these two steps over and over until the weights converge, yielding the optimal value function and hence the corresponding policy.<br>
**Note**: this is the "every visit" version of Monte Carlo. For the first visit version, you only perform the weight update when you see the state action pair for the first time in an episode.

![MCalgorithmWithFunctionApprox](./images/MonteCarloWithFunctApprox.png)
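
The two steps of this algorithm can be sketched as follows. The sketch assumes a linear action-value approximator with one weight row per action, $\hat{q}(s, a, W) = W[a] \cdot x(s)$; the function names and the episode format are assumptions made for illustration. For convenience the updates are applied while walking backwards through the episode, which is a small departure from the forward order in the figure.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Improvement step: random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def mc_evaluation(W, episode, alpha, gamma):
    """Evaluation step (every-visit); episode is a list of (x, action, reward) tuples."""
    g = 0.0
    for x, a, r in reversed(episode):
        g = r + gamma * g                       # return from this step onwards
        W[a] += alpha * (g - W[a] @ x) * x      # gradient of W[a] . x w.r.t. W[a] is x
    return W

rng = np.random.default_rng(0)
W = np.zeros((2, 2))
episode = [(np.array([1.0, 0.0]), 0, 0.0), (np.array([0.0, 1.0]), 1, 1.0)]
W = mc_evaluation(W, episode, alpha=0.5, gamma=0.9)
action = epsilon_greedy(W @ np.array([1.0, 0.0]), epsilon=0.1, rng=rng)
```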

MC is guaranteed to converge on a local optimum in general. In case of a linear function approximation, it will converge on the global optimum.

## Temporal Difference Learning

The second strategy to be discussed is Temporal Difference (TD) Learning, here combined with function approximation.<br>
Let's start by comparing the incremental steps of Monte Carlo and Temporal Difference. In Monte Carlo we use the actual return obtained through the episode and in Temporal Difference we use an estimated return:
<center>
    $\begin{align}
    \color{black}{\text{Monte Carlo:     }} & \color{blue}{V(S_t)\longleftarrow V(S_t) + \alpha\big(}\mathbin{\color{red}{G_t}}\color{blue}{ - V(S_t)\big)} \\
    &\color{red}{G_t} = \text{ actual return} = R_{t+1} + \gamma R_{t+2} + \gamma^2R_{t+3} + \cdots \\
     \color{black}{\text{Temporal Difference:     }} & \color{blue}{V(S_t)\longleftarrow V(S_t) + \alpha\big(}\mathbin{\color{red}{R_{t+1} + \gamma V(S_{t+1})}}\color{blue}{ - V(S_t)\big)} \\
     &\color{red}{R_{t+1} + \gamma V(S_{t+1})} = \text{ estimated return (TD target)} \\
     & \text{Here we present the simplest case of TD - TD(0). In this case we use the next reward} \\
     & \text{plus the discounted value of the next state}
    \end{align}$
</center>

Similar to what we discussed in the [last section (Monte Carlo Learning)](#Monte-Carlo-Learning), we can use this TD target in place of the unknown true value function, going from this equation:<br>
$$\Delta W = \alpha\big(V_\pi(S_t) - \hat{V}(S_t, W)\big) \nabla_w\hat{V}(S_t, W)$$
to this equation:
$$\Delta W = \alpha\big(\color{red}{R_{t+1} + \gamma\hat{V}(S_{t+1}, W)}- \hat{V}(S_t, W)\big) \nabla_w\hat{V}(S_t, W)$$

Note that the value function had to be adapted so that it uses the function approximator $\hat{V}$.

<center>
    $\begin{align}
    \Delta W = \alpha\big(&\underline{R_{t+1} + \gamma\hat{V}(S_{t+1}, W)- \hat{V}(S_t, W)}\big) \nabla_w\hat{V}(S_t, W) \\
    & \text{this entire difference (the underlined part) is called }\underline{\text{TD Error}}\text{ and it's denoted by }\delta_t.
    \end{align}$
</center>
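
As a small illustration of the TD error $\delta_t$ and the resulting weight change, the sketch below again assumes a linear approximator $\hat{V}(S, W) = x(S)^\top W$. Treating the value of a terminal state as zero is a standard convention that is assumed here; the function name is hypothetical.

```python
import numpy as np

def td0_update(w, x, reward, x_next, done, alpha, gamma):
    """One TD(0) step; returns the new weights and the TD error delta_t."""
    v_next = 0.0 if done else x_next @ w          # terminal states are worth 0 (assumption)
    delta = reward + gamma * v_next - x @ w       # TD target minus current estimate
    return w + alpha * delta * x, delta

w = np.array([0.5, 1.0])
w, delta = td0_update(w, np.array([1.0, 0.0]), reward=1.0,
                      x_next=np.array([0.0, 1.0]), done=False, alpha=0.1, gamma=0.9)
print(w, delta)
```

Note that only $\hat{V}(S_t, W)$ is differentiated; the target is treated as a constant even though it also depends on $W$.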

As we have done before, we can extend this same idea to action value functions:

$$\Delta W = \alpha\big(R_{t+1} + \gamma\hat{q}(S_{t+1}, A_{t+1}, W)- \hat{q}(S_t, A_t, W)\big) \nabla_w\hat{q}(S_t, A_t, W)$$

With this new update rule we can now build the algorithm. We use the TD(0) target and focus on the control problem. Remember that this is essentially the SARSA algorithm.
 1. The parameters $W$ are initialized arbitrarily, and the policy $\pi$ is implicitly defined as an $\epsilon\text{-greedy}$ choice over our approximate action-value function $\hat{q}$.
 2. The agent begins its interaction with the environment, obtaining an initial state $S$.
 3. The agent performs the following steps until it reaches a terminal state:
  * It performs the current action $A$, chosen with the $\epsilon\text{-greedy}$ policy $\pi$, and immediately observes the reward $R$ and the next state $S'$.
  * It chooses another action $A'$ (this time from state $S'$) according to the $\epsilon\text{-greedy}$ policy $\pi$. These steps give the agent what it needs for the SARSA update.
  * It plugs this information into the gradient descent update rule and adjusts the weights $W$ accordingly.
  * It then rolls over $S'$ to be the new $S$ and $A'$ to be the new $A$, and repeats step 3.
![TD(0)withFunctApproxEpisodic](./images/TDControlFunctApproxEpisodic.png)
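
The steps above can be sketched on a toy problem. Everything about the environment here is a hypothetical stand-in (a five-cell corridor with a reward of 1 at the right end), the features are one-hot, and $\hat{q}(s, a, W) = W[a] \cdot x(s)$ is linear. A step cap is added so that an episode always ends.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2            # hypothetical corridor; actions: 0 = left, 1 = right

def step(state, action):
    """Toy dynamics: reaching the right end gives reward 1 and ends the episode."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def features(state):
    x = np.zeros(N_STATES)
    x[state] = 1.0
    return x

def policy(W, state, epsilon, rng):
    """Epsilon-greedy over q_hat(s, ., W); ties are broken at random."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    q = W @ features(state)
    return int(rng.choice(np.flatnonzero(q == q.max())))

def sarsa(episodes=200, alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((N_ACTIONS, N_STATES))            # step 1: initialise W
    for _ in range(episodes):
        s = 0                                       # step 2: initial state
        a = policy(W, s, epsilon, rng)
        for _ in range(max_steps):                  # step 3: loop until terminal
            s2, r, done = step(s, a)
            a2 = policy(W, s2, epsilon, rng)        # next action from the same policy
            x = features(s)
            q_next = 0.0 if done else W[a2] @ features(s2)
            W[a] += alpha * (r + gamma * q_next - W[a] @ x) * x
            s, a = s2, a2                           # roll over S' and A'
            if done:
                break
    return W

W = sarsa()
print(np.round(W, 2))
```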

As formulated, this algorithm is useful for episodic tasks, where each episode is guaranteed to terminate.<br>
If desired, it can be adapted for continuing tasks. To do this, we remove the distinct boundary between episodes and treat the sequence of interactions as one long unending episode.
![TDControlFunctApproxContTasks](./images/TDControlFunctApproxContTasks.png)

SARSA is an on-policy algorithm. This means that it updates the same policy that it uses to select actions. It sounds strange, but it usually works well and converges fairly quickly because it is using the most up-to-date policy to pick its actions. There are, however, some drawbacks: the policy being learned and the one being followed are intimately tied to each other. Why is this a drawback? The answer to this question comes later in this notebook.
If we want an agent that follows one policy (for instance, one that is more exploratory) while learning a better or more optimal policy, then such an agent needs to use an off-policy algorithm.

## Q-Learning

Q-Learning is an off-policy variant of TD Learning. Let's adapt Q-Learning to make it work with function approximation.

 1. Initialize the parameters $W$ arbitrarily and define an $\epsilon\text{-greedy}$ policy $\pi$ based on the Q-values.
 2. Over multiple episodes, the agent uses the $\epsilon\text{-greedy}$ policy to repeatedly take an action, and after each action it observes the reward $R$ and the new state $S'$. Up to this point everything is very similar to SARSA. The main difference comes in the update step. In Q-Learning, the update step does not pick the next action from the same $\epsilon\text{-greedy}$ policy; instead, the agent chooses an action greedily. That is, it picks the action that would maximize the expected value going forward. It is important to **pay attention here**: the agent does not actually perform this action. It is used only for the update. In fact, the agent does not even need to identify the action; it can simply use the maximum Q-value for the next state.
This is why Q-Learning is an off-policy method.

**<u>It uses one policy to take actions (that'd be the $\underline{\epsilon\text{-greedy}}$ policy $\underline{\pi}$) and another policy to perform the updates (that'd be a greedy policy).</u>**
![Q-LearningFunctApprox](./images/Q-LearningFunctApprox.png)
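
The update step is the only place where Q-Learning differs from SARSA, so the sketch below shows just that step. It assumes a linear approximator with one weight row per action, $\hat{q}(s, a, W) = W[a] \cdot x(s)$; the function name and the example numbers are illustrative.

```python
import numpy as np

def q_learning_update(W, x, a, r, x_next, done, alpha, gamma):
    """Update W in place using the greedy (max) value of the next state; returns the TD error."""
    max_q_next = 0.0 if done else np.max(W @ x_next)   # value of the greedy action, never executed
    td_error = r + gamma * max_q_next - W[a] @ x
    W[a] += alpha * td_error * x
    return td_error

W = np.array([[0.0, 0.2],
              [0.0, 0.8]])
delta = q_learning_update(W, x=np.array([1.0, 0.0]), a=0, r=1.0,
                          x_next=np.array([0.0, 1.0]), done=False, alpha=0.1, gamma=0.5)
print(delta, W)
```

The action `a` that was actually taken can come from any behaviour policy, such as the $\epsilon$-greedy one, while the target always uses the maximum over actions.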

Again, we can adapt this algorithm for continuing tasks by treating the sequence of interactions as one long unending episode:


![Q-LearningFunctApproxContTasks](./images/Q-LearningFunctApproxContTasks.png)

In both cases, additional criteria are needed to decide when the agent has fully learned the task or to detect when it is failing.
The following table compares SARSA and Q-Learning:

|SARSA |Q-Learning |
|------|------|
| On-policy | Off-policy |
| Good online performance | Bad online performance |
| Q-values affected by exploration | Q-values unaffected by exploration|

Both approaches have their pros and cons. Depending on the characteristics of the environment and your preferences regarding online performance versus more accurate learning, you would choose one algorithm or the other.

Off-policy methods like Q-Learning have received a lot of attention, mainly because they decouple the actions taken by the agent in the environment from its learning process. This allows us to build different variations of our learning algorithm. For instance, the agent can follow a more exploratory policy while acting and still learn the optimal value function. As the comparison table above shows, online performance will be worse, but at some point the agent can stop exploring and follow the optimal policy for better results. Another possibility is that a human demonstrates the actions to take, and the agent learns from observing the effects of those actions. Yet another is that the agent learns offline or in batches, since the update need not be performed at every step. This is critical for reliably training neural networks for reinforcement learning.

**Note:** One drawback of both SARSA and Q-Learning, since they are TD approaches, is that they may not converge on the global optimum when using non-linear function approximation.

| Off-Policy Advantages |
|-------------------------|
| More exploration when learning |
| Learning from demonstration |
| Supports offline or batch learning |

## Deep Q Network

At the core of a Deep Q Network agent is a deep neural network that acts as a function approximator. For example, images from an Atari game can be passed to the agent one screen at a time, and it produces a vector of action values, with the maximum value indicating the action to take. As a reinforcement signal, the change in game score at each time step is fed back to the agent.

![DeepQNetwork](./images/DeepQNetwork.png)

Atari games are displayed at a resolution of 210 by 160 pixels with 128 possible colors for each pixel. Even though this is technically a discrete state space, it is very large to process as is. To reduce this complexity, the researchers at DeepMind performed some minimal preprocessing: they converted the frames to grayscale and scaled them down to a square 84 by 84 pixel block (square images allowed them to use more optimized neural network operations on GPUs). Then, to give the agent access to a sequence of frames, they stacked four such frames together, resulting in a final state space size of 84 by 84 by 4.

![InputStateSpaceAtari](./images/InputStateSpaceAtari.png)
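
A rough sketch of this preprocessing is shown below. It is only meant to illustrate the shapes involved: the grayscale conversion by channel mean and the nearest-neighbour resize are simplifying assumptions, not necessarily what the DQN authors used, and the random arrays stand in for real game screens.

```python
import numpy as np
from collections import deque

def preprocess(frame, size=84):
    """Convert an RGB frame to grayscale and resize it to size x size (nearest neighbour)."""
    gray = frame.mean(axis=2)                                   # average the colour channels
    rows = np.linspace(0, gray.shape[0] - 1, size).astype(int)  # row indices to keep
    cols = np.linspace(0, gray.shape[1] - 1, size).astype(int)  # column indices to keep
    return gray[np.ix_(rows, cols)].astype(np.float32) / 255.0

frames = deque(maxlen=4)                       # keeps only the four most recent frames
rng = np.random.default_rng(0)
for _ in range(4):
    raw = rng.integers(0, 256, size=(210, 160, 3)).astype(np.uint8)   # stand-in game screen
    frames.append(preprocess(raw))

state = np.stack(frames, axis=-1)
print(state.shape)                             # (84, 84, 4)
```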

Traditional reinforcement learning algorithms produce only one Q-value at a time. In contrast, the output side of the Deep Q Network is designed to produce <u>a Q value for every possible action</u> in a single forward pass. Without this, it would be necessary to run the network individually for every action. Instead, we can simply use the output vector to take an action, either stochastically or by choosing the one with the maximum value.
![actionVectorOutput](./images/actionVectorOutput.png)
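
The idea of one forward pass producing all action values can be illustrated with a tiny fully connected network. The layer sizes and the random weights below are arbitrary assumptions; the point is only that the output has one entry per action.

```python
import numpy as np

def q_values(state, params):
    """One forward pass returns a Q-value for every action."""
    W1, b1, W2, b2 = params
    hidden = np.maximum(0.0, state @ W1 + b1)    # hidden layer with ReLU activation
    return hidden @ W2 + b2                       # linear output, one entry per action

rng = np.random.default_rng(0)
n_features, n_hidden, n_actions = 8, 16, 4        # hypothetical sizes
params = (rng.normal(0.0, 0.1, (n_features, n_hidden)), np.zeros(n_hidden),
          rng.normal(0.0, 0.1, (n_hidden, n_actions)), np.zeros(n_actions))

q = q_values(rng.normal(size=n_features), params)
greedy_action = int(np.argmax(q))                 # pick the action with the maximum value
print(q.shape, greedy_action)
```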

These input and output transformations support a powerful yet simple neural network architecture under the hood. The screen images are first processed by convolutional layers, which allows the system to exploit spatial relationships in the images. Since four frames are stacked and provided as input, these convolutional layers also extract some temporal properties across those frames.<br>
The original DQN agent used three such convolutional layers with ReLU activation (rectified linear units). They were followed by one fully connected hidden layer with ReLU activation, and one fully connected linear output layer that produced the vector of action values.

![OriginalDQNNeuralNetworkArchitecture](./images/OriginalDQNNeuralNetworkArchitecture.png)
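
To see how the 84 by 84 by 4 input shrinks through the three convolutional layers, the short calculation below applies the standard output-size formula for a convolution without padding. The filter counts, kernel sizes and strides are taken from the Mnih et al. (2015) paper rather than from the text above, so treat them as an assumption and check them against the paper.

```python
def conv_output_size(size, kernel, stride):
    """Width (or height) of the output of a convolution without padding."""
    return (size - kernel) // stride + 1

# (filters, kernel, stride) for the three convolutional layers (assumed from the paper)
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size = 84
for filters, kernel, stride in layers:
    size = conv_output_size(size, kernel, stride)
    print(f"{size} x {size} x {filters}")

flat = size * size * layers[-1][0]
print(flat)    # number of inputs to the fully connected hidden layer
```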

Training this network requires a lot of data, and even then it is not guaranteed to converge on the optimal value function. In fact, there are situations where the network weights can oscillate or diverge due to the high correlation between actions and states. This can result in a very unstable and ineffective policy. To overcome these challenges, the researchers came up with several techniques that slightly modify the base Q-Learning algorithm. Two of these techniques, perhaps the most important contributions of their work, are:
 * Experience Replay
 * Fixed Q Targets
 
It might be interesting to take a look at:
  * Mnih et al., 2015. [Human-level control through deep reinforcement learning](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf).

## Experience Replay

Experience Replay was originally proposed to make more efficient use of observed experiences. Consider the basic online Q-Learning algorithm, where we interact with the environment and at each time step obtain a state-action-reward-next-state tuple (i.e. $\langle S_t, A_t, R_{t+1}, S_{t+1}\rangle$), learn from it and then discard it, moving on to the next tuple in the following time step. On closer inspection this is a wasteful mechanism. Wouldn't it be better if we could store the experienced tuples and possibly learn more from them at a later point? Besides, <u>some states are pretty rare to come by and some actions can be pretty costly</u>, so it would be good to recall such experiences. This is what a replay buffer allows us to do!

![ReplayBuffer](./images/ReplayBuffer.png)

The idea is that, while interacting with the environment, the agent stores each experienced tuple in the buffer and then samples a small batch of tuples from it in order to learn. As a result, the agent is able to learn from individual tuples multiple times, recall rare occurrences and in general make better use of its experience. Experience replay also helps with another critical problem. If you think about the experiences being obtained, you realize that every action $A_t$ affects the next state $S_{t+1}$ in some way. That means that a sequence of experienced tuples can be highly correlated. A naive Q-Learning approach that learns from each of these experiences in sequential order runs the risk of getting swayed by the effects of this correlation.
A replay buffer can break this correlation, because the agent can sample from it randomly (it does not have to follow the order in which the tuples were stored). This breaks the correlation and ultimately prevents action values from oscillating or diverging catastrophically.

So, by following this approach, we are basically building a database of samples and then the agent is learning a mapping from these samples. <u>Experience replay helps to reduce the reinforcement learning problem (the value learning portion of it) to a supervised learning scenario</u>.

We can then apply to reinforcement learning other learning techniques and best practices developed in the supervised learning literature. An improvement on this idea is to prioritize the experience tuples that are rare or more important.
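
The replay buffer described in this section can be sketched with the Python standard library alone. The class and field names below are illustrative choices; a `deque` with `maxlen` gives the circular behaviour (the oldest tuple is dropped when the buffer is full) and uniform random sampling breaks the ordering of consecutive tuples.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size memory of experience tuples with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)     # oldest entries are discarded when full
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return batch_size distinct tuples chosen uniformly at random."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add(state=i, action=0, reward=float(i), next_state=i + 1, done=False)
print(len(buffer), buffer.sample(2))
```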

## Fixed Q Targets

There is another kind of correlation that Q-Learning is susceptible to (apart from the correlation between consecutive experience tuples, which is addressed by Experience Replay).

Q-Learning is a form of Temporal Difference. Remember that our TD Target is $\Big[R + \gamma\big(\text{max}_a\hat{q}(S', a, W)\big)\Big]$ and that the goal was -or is- to reduce the difference between this target and the currently predicted Q-value. That difference is the "TD Error".

$$\Delta W = \alpha\big(\underline{\color{red}{R + \gamma\text{max}_a\hat{q}(S', a, W)} - \color{blue}{\hat{q}(S, A, W)}}\big)\nabla_w\hat{q}(S, A, W)$$

Where the term in $\color{red}{\text{red}}$ is -again- the *TD Target*, and the term in $\color{blue}{\text{blue}}$ is the *current value*.<br>
And where the difference of these terms is the **TD Error**:
$$\Big[R + \gamma\big(\text{max}_a\hat{q}(S', a, W)\big) - \hat{q}(S, A, W)\Big]$$

In Q-Learning, the TD Target is supposed to be a replacement for the true value function $q_\pi(S, A)$, which is unknown to us.<br>
Originally, $q_\pi$ was used to define a squared error loss. We differentiated that with respect to $W$ to get our gradient descent update rule:

![OriginalDevGradDescUpdateRule](./images/OriginalDevGradDescUpdateRule.png)

We can see that $q_\pi$ does not depend on our function approximator or on its parameters, which results in a simple derivative and hence our update rule.<br>
However, the TD Target does depend on these parameters. This means that it is mathematically incorrect to simply replace the true value function $q_\pi$ with a target like this.
In practice it still works fairly well, because each update results in a small change to the parameters, and such small changes are generally in the right direction. Notice that if we set $\alpha = 1$ and leapt towards the target, we would likely overshoot and land in the wrong place. This is also less of a concern when we use a lookup table or a dictionary, since Q-values are stored separately for each state-action pair.
Nevertheless, this incorrectness can affect learning significantly when function approximation is used, because the Q-values are intrinsically tied together through the function parameters.<br>
By the way, this is basically the answer to the question left open in the [Temporal Difference Learning](#Temporal-Difference-Learning) section.

In other words, we have a moving target. This is the correlation mentioned at the beginning of this section: it is between the target and the parameters the agent changes. The agent is chasing a moving target.

To solve this problem we need to decouple the target from the parameters to produce a more stable learning environment. We do that by fixing the function parameters used to generate our targets. These fixed parameters are denoted by $\color{red}{w^-}$, and they are just a copy of $W$ that does not change during the learning step. In practice, we copy $W$ into $\color{red}{w^-}$, use this copy to generate targets while changing $W$ for a certain number of learning steps, then update $\color{red}{w^-}$ with the latest $W$ and repeat.

By doing this, we decouple the target from the parameters, making the learning algorithm more stable and less likely to diverge or fall into oscillations.
![DecoupledTargets](./images/DecoupledTargets.png)
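
A minimal sketch of learning with fixed Q targets is given below. It assumes a linear approximator with one weight row per action; `W_minus` plays the role of $w^-$, and the synchronisation interval `SYNC_EVERY` is a hypothetical hyperparameter. The tiny hand-made batch is only there to make the example runnable.

```python
import numpy as np

def learn_with_fixed_targets(W, W_minus, batch, alpha, gamma):
    """Targets are computed with the frozen copy W_minus; only W is changed."""
    for x, a, r, x_next, done in batch:
        target = r if done else r + gamma * np.max(W_minus @ x_next)
        W[a] += alpha * (target - W[a] @ x) * x
    return W

W = np.zeros((2, 2))
W_minus = W.copy()                               # w^- starts as a copy of W
batch = [(np.array([1.0, 0.0]), 1, 0.0, np.array([0.0, 1.0]), False),
         (np.array([0.0, 1.0]), 1, 1.0, np.array([0.0, 1.0]), True)]

SYNC_EVERY = 5                                   # hypothetical number of learning steps
for step in range(1, 21):
    W = learn_with_fixed_targets(W, W_minus, batch, alpha=0.5, gamma=0.9)
    if step % SYNC_EVERY == 0:
        W_minus = W.copy()                       # refresh w^- with the latest W
print(np.round(W, 3))
```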

## Deep Q-Learning Algorithm

The Deep Q-Learning algorithm has two interleaved processes:
 1. Sampling the environment by performing actions and storing the observed experience tuples (building a replay buffer).
 2. Selecting a random batch from this buffer and learning from it, using the gradient descent update step.
 
Since these two processes are not tied to each other step by step, we could perform several sampling steps followed by one learning step, or even several learning steps with different random batches. The rest of the algorithm is designed to support these steps.

 * At the beginning, an empty replay memory is initialized. Since memory is finite, a good approach is to use a circular buffer.
 * The parameters or weights $W$ must also be initialized. There are some best practices for this, for instance sampling the weights randomly from a normal distribution with variance equal to $2 / \text{number of inputs to each neuron}$ (He et al., 2015). These initialization methods are typically available in modern deep learning libraries.
 * A second set of parameters $w^-$ is needed, initialized to $W$.
 * If needed, the states are preprocessed before being fed to the network. This preprocessing is denoted by $\phi$.
 * In practice, a sufficient number of tuples must be in memory before the agent can run the learning steps. It is important to **note** that the memory is not cleared after each episode, because we want the agent to be able to recall and build batches of experiences from across episodes.
  
It is recommended to read the DQN paper before trying to implement the algorithm, since you may need to choose which techniques you want to apply and adapt them for different types of environments.

Mnih et al., 2015. [Human-level control through deep reinforcement learning](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf). (DQN paper)<br>
He et al., 2015. [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/pdf/1502.01852.pdf). (weight initialization)

![DeepQLearningAlgo](./images/DeepQLearningAlgo.png)
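
The overall structure of the algorithm can be sketched on a toy problem. This is a deliberately simplified illustration, not the algorithm from the paper: the corridor environment is hypothetical, the "network" is a linear model over one-hot features, weights are drawn with variance $2 / \text{number of inputs}$ following He et al. (2015), and all hyperparameter values are arbitrary assumptions.

```python
import random
from collections import deque
import numpy as np

N_STATES, N_ACTIONS = 5, 2                       # hypothetical corridor; 0 = left, 1 = right

def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

def phi(s):
    """Preprocessing: one-hot encoding of the state index."""
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def train(episodes=100, batch_size=8, sync_every=20, alpha=0.1, gamma=0.9,
          epsilon=0.2, max_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    sampler = random.Random(seed)
    memory = deque(maxlen=500)                                   # circular replay memory
    W = rng.normal(0.0, np.sqrt(2.0 / N_STATES), (N_ACTIONS, N_STATES))  # He-style init
    W_minus = W.copy()                                           # frozen target weights w^-
    steps = 0
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            # SAMPLE: act epsilon-greedily and store the experience tuple
            if rng.random() < epsilon:
                a = int(rng.integers(N_ACTIONS))
            else:
                a = int(np.argmax(W @ phi(s)))
            s2, r, done = env_step(s, a)
            memory.append((phi(s), a, r, phi(s2), done))
            s = s2
            # LEARN: once enough tuples are stored, train on a random batch
            if len(memory) >= batch_size:
                for x, a_b, r_b, x2, d in sampler.sample(list(memory), batch_size):
                    target = r_b if d else r_b + gamma * np.max(W_minus @ x2)
                    W[a_b] += alpha * (target - W[a_b] @ x) * x
            steps += 1
            if steps % sync_every == 0:
                W_minus = W.copy()                               # refresh the fixed targets
            if done:
                break
    return W

print(np.round(train(), 2))
```

Note that the memory is created once, outside the episode loop, so batches mix experiences from different episodes.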

## DQN Improvements

Three of the most prominent improvements to DQN are:
 * Double DQNs
 * Prioritized Replay
 * Dueling Networks

### 1st enhancement

Q-Learning **is prone to overestimation of action values**.<br>
Let's look again at the update rule of the Q-Learning algorithm with function approximation and let's also focus on the TD Target:

<center>
$\Delta W = \alpha\big(\underline{R + \gamma\max_a\hat{q}(S', a, W)} - \hat{q}(S, A, W)\big)\nabla_w\hat{q}(S, A, W) \\
    \text{Here the underlined part of the update rule is the TD Target.}$
</center>

In this update rule, the max operation is necessary to find the best possible value we could get from the next state. Let's rewrite the target and expand the max operator to try to understand this better:
$$R + \gamma\hat{q}\Big[S', \text{argmax}_a\hat{q}(S', a, W), W\Big]$$

This form states more explicitly that we want the Q-value for the state $S'$ and the action that results in the maximum Q-value among all actions that can be taken in that state.

Writing the equation in this way allows us to see that the argmax operation may make a mistake, especially during the early learning steps. This is because the Q-values are still evolving, and the agent has very likely not yet gathered enough information to really "know" what the best action is. Think about it: <u>the accuracy of our Q-values depends a lot on what actions have been tried</u> and what neighboring states have been explored.
This results in an overestimation of Q-values, because we always pick the maximum among a set of "noisy" numbers (they are not yet close to the true Q-values, so it may be best not to trust them blindly).
To make this estimation more robust we can use Double Q-Learning, which works well in practice. In the DQN setting this is known as Double DQN.
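
The overestimation effect can be demonstrated with a small simulation. In this sketch every action has a true value of zero and the estimates are corrupted by zero-mean noise; the number of actions, the noise level and the sample count are arbitrary assumptions. Each individual estimate is unbiased, yet the maximum over actions is clearly positive on average.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)                                        # every action is truly worth 0
noisy_q = true_q + rng.normal(0.0, 1.0, size=(10_000, 10))   # noisy Q-value estimates

mean_of_max = noisy_q.max(axis=1).mean()     # what the max operator sees on average
mean_estimate = noisy_q.mean()               # individual estimates are unbiased
print(mean_of_max, mean_estimate)            # the first is well above 0, the second is near 0
```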

Double Q-Learning selects the best action using one set of parameters $W$ and evaluates it using a different set of parameters $W'$:

$$R+ \gamma\hat{q}\big(S', \text{argmax}_a\hat{q}(S', a, W), W'\big)$$

This is like having two separate function approximators that must agree on the best action:
 * If $W$ picks an action that is not the best according to $W'$, then the Q-value returned is not that high.

In the long run this prevents the algorithm from propagating incidentally high rewards that may have been obtained by chance and do not reflect long-term returns. But where do we get this second set of parameters from?<br>
The original formulation of Double Q-Learning maintains two value functions and randomly chooses one of them to update at each step, using the other only for evaluating actions.<br>
But <u>when using DQNs with fixed Q targets</u> we already have an alternate set of parameters (remember that we denote the fixed target parameters by $w^-$). Since $w^-$ is kept frozen for a while, it is different enough from $W$ to be reused for this purpose.

<center>
    $\begin{align}
& R+ \gamma\hat{q}\big(S', \text{argmax}_a\hat{q}(S', a, \color{red}{W}), \color{blue}{W'}\big) \\
& \color{red}{W\text{ selects best action}}\\
& \color{blue}{W' = w^-\text{ evaluates that action}}
\end{align}$
</center>

This modification keeps Q-values in check, preventing them from exploding in the early stages of learning or fluctuating later on. The resulting policies have also been shown to perform significantly better than those of vanilla DQNs.
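
The difference between the two targets can be seen in a few lines. The sketch below takes the Q-values of the next state under the online parameters $W$ and under the frozen parameters $w^-$ as plain arrays; the function names and the example numbers are made up for illustration.

```python
import numpy as np

def dqn_target(r, q_next_target, gamma, done):
    """Target with fixed Q targets only: w^- both selects and evaluates the action."""
    return r if done else r + gamma * np.max(q_next_target)

def double_dqn_target(r, q_next_online, q_next_target, gamma, done):
    """Double DQN target: W selects the action, w^- evaluates it."""
    if done:
        return r
    best = int(np.argmax(q_next_online))         # selection with W
    return r + gamma * q_next_target[best]       # evaluation with w^-

q_online = np.array([1.0, 3.0, 2.0])             # q_hat(S', ., W)
q_target = np.array([1.5, 0.5, 2.5])             # q_hat(S', ., w^-)
print(dqn_target(1.0, q_target, 0.9, False))
print(double_dqn_target(1.0, q_online, q_target, 0.9, False))
```

Here $W$ prefers the second action, but $w^-$ does not rate it highly, so the Double DQN target is lower than the plain maximum.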

### 2nd enhancement

The second issue is related to experience replay. Recall the idea behind it: the agent interacts with the environment to collect experience tuples, saves them in a circular buffer (because memory is finite) and later samples some experiences at random to form a batch to learn from. As already discussed, this helps break the correlation that might exist between consecutive experiences and stabilizes the learning algorithm.

But some of these saved experiences might be more important than others for the learning process. Such important experiences <u>might occur infrequently</u>, so when sampling uniformly from the buffer to form a batch, <u>these experiences have a very low probability of getting selected</u>, and since the buffer is circular, <u>older important experiences may get lost</u> (overwritten).

It therefore seems a good idea to prioritize the experiences.

**Prioritized experience replay** assigns priorities to the tuples. By what criterion? One approach is to use the TD error $\delta$. The bigger the error, the more we expect to learn from that tuple. So we take the magnitude of this error as a measure of priority and store it along with each corresponding tuple in the replay buffer. When creating batches, we use this value to compute a sampling probability: we select any tuple $i$ with a probability equal to its <font color=red>priority value $P_i$</font>, normalized by the <font color=blue>sum of all priority values in the replay buffer</font>.

![PrioritizeExperienceReplay](./images/PrioritizeExperienceReplay.png)

When a tuple is picked, we can update its priority with a newly computed TD error using the latest Q-values. This seems to work fairly well and has been shown to reduce the number of batch updates needed to learn a value function.<br>
Still, there are a couple of things that can be improved:
 1. If the TD error is zero, the priority value of the tuple, and consequently its probability of being picked, will also be zero. A TD error that is zero or very small does not necessarily mean that the agent has nothing to learn from the tuple (the error may be close to zero simply because the agent is at a very early stage of learning and its estimate happened to be close, given the limited samples it has seen so far). To prevent such tuples from never being selected, we can add a small constant $e$ to every priority value.
 2. If the agent uses these priority values greedily, a small subset of experiences may be replayed over and over, resulting in a sort of overfitting to that subset. To avoid this, we reintroduce some element of uniform random sampling. This adds another hyperparameter $a$, which redefines the sampling probability as the priority $P_i$ raised to the power $a$ (i.e. $P_i^a$), normalized by the sum of $P_k^a$ over all tuples $k$ in the buffer. By varying $a$ we control how much to use priority versus randomness: $a = 0$ corresponds to uniform random sampling, and with $a = 1$ only the priorities are used.

When we use prioritized experience replay, we also have to adjust our update rule. The original Q-Learning update is derived from an expectation over all experiences. When using a stochastic update rule, the way we sample these experiences must match the underlying distribution they came from. This is preserved when we sample experience tuples uniformly from the replay buffer, but it is violated when we use non-uniform sampling, for example with priorities. The Q-values we learn would then be biased according to these priority values, which we only wanted to use for sampling. To correct for this bias, we introduce an importance-sampling weight equal to $\frac{1}{N}\cdot\frac{1}{P(i)}$, where $N$ is the size of the replay buffer and $P(i)$ is the sampling probability of tuple $i$. One more hyperparameter, $b$, is added at this point: we raise each importance-sampling weight to the power $b$ to control how much these weights affect learning. These weights matter most towards the end of learning, when the Q-values begin to converge, so $b$ can be increased from a low value to one over time.
 
 ![PrioritizeExperienceReplay2](./images/PrioritizeExperienceReplay2.png)
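
The quantities used by prioritized experience replay can be computed as in the sketch below. The function names are illustrative, and the values chosen for $e$, $a$ and $b$ in the usage lines are arbitrary assumptions rather than recommended settings.

```python
import numpy as np

def sampling_probabilities(td_errors, e, a):
    """P(i) = p_i^a / sum_k p_k^a, with priority p_i = |delta_i| + e."""
    priorities = (np.abs(td_errors) + e) ** a
    return priorities / priorities.sum()

def importance_weights(probs, b):
    """Importance-sampling weights (1/N * 1/P(i))^b for a buffer of size N."""
    n = len(probs)
    return (1.0 / (n * probs)) ** b

def sample_indices(probs, batch_size, rng):
    """Draw tuple indices according to the sampling probabilities."""
    return rng.choice(len(probs), size=batch_size, p=probs)

td_errors = np.array([0.0, 0.5, 2.0, 0.1])
probs = sampling_probabilities(td_errors, e=0.01, a=0.6)
weights = importance_weights(probs, b=0.4)
batch = sample_indices(probs, batch_size=2, rng=np.random.default_rng(0))
print(probs, weights, batch)
```

In the update rule, the TD error of each sampled tuple would be multiplied by its weight before the gradient step.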

### 3rd enhancement

The third enhancement is Dueling Networks.
The idea of **Dueling Networks** is to use two streams. One stream estimates the state-value function and the other estimates the advantage of each action. The two streams may share some layers at the beginning, such as convolutional layers, and then branch off with their own fully connected layers. The final Q-values are obtained by combining the state and advantage values.
The intuition behind this is that the values of most states do not vary much across actions, so it makes sense to try to estimate them directly. We still need to capture the difference that actions make in each state, and this is where the advantage function comes into play. Q-Learning must be adapted to work with this architecture (Wang et al., 2015. [Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581).)

![DuelingNetworks](./images/DuelingNetworks.png)
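
The combination of the two streams can be written in one line. The sketch below uses the mean-subtracted form $Q(s, a) = V(s) + \big(A(s, a) - \text{mean}_a A(s, a)\big)$ described in the Wang et al. paper; this particular form is not spelled out in the text above, so treat it as an assumption taken from the paper. The function name and the numbers are illustrative.

```python
import numpy as np

def dueling_q(state_value, advantages):
    """Combine the state-value stream and the advantage stream into Q-values."""
    advantages = np.asarray(advantages, dtype=float)
    return state_value + advantages - advantages.mean()

print(dueling_q(2.0, [1.0, 0.0, -1.0]))
print(dueling_q(2.0, [2.0, 1.0, 0.0]))     # same Q-values: only relative advantages matter
```

Subtracting the mean removes the ambiguity between the two streams, since adding a constant to all advantages would otherwise be indistinguishable from changing the state value.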