# Value-based RL DNN approaches

### DQN - Deep Q-Network

DQN was one of the first algorithms to combine the RL framework with deep neural networks (DNNs) successfully. The properties of the algorithm:

* relatively easy to implement
* simple but powerful
* difficult convergence
* sensitive to the hyper-parameters


DQN is based on Q-learning but the $Q$-function is approximated by a neural network: $Q_\theta(s, a)$. Then the update rule for the $\theta$ parameters is given as:

$$\theta_{t+1} = \theta_t + \alpha \cdot \left( r_t + \gamma \max_{a'}Q_\theta(s', a') - Q_\theta(s, a) \right)\cdot \nabla_\theta Q_\theta(s, a)|_{\theta = \theta_t}$$

Here $\alpha$ is the learning rate and $\gamma$ is the discount factor.
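As a rough illustration of this update rule, the sketch below applies one step to a linear $Q$-function instead of a neural network, because the gradient is then simply the feature vector. The function names, the feature vectors, the `done` flag and the discount factor `gamma` are assumptions of this sketch (set `gamma=1.0` to drop discounting); it is not the DQN implementation itself.

```python
import numpy as np

def q_values(theta, features):
    """Linear Q-function: one weight column per action, Q(s, .) = features @ theta."""
    return features @ theta

def q_learning_step(theta, s_feat, action, reward, s_next_feat,
                    alpha=0.1, gamma=0.99, done=False):
    """One semi-gradient Q-learning update; returns the new parameters."""
    # Bootstrapped target: r + gamma * max_a' Q(s', a'), or just r at episode end.
    target = reward if done else reward + gamma * np.max(q_values(theta, s_next_feat))
    td_error = target - q_values(theta, s_feat)[action]
    new_theta = theta.copy()
    # For a linear Q, the gradient w.r.t. the chosen action's column is the feature vector.
    new_theta[:, action] += alpha * td_error * s_feat
    return new_theta

theta = np.zeros((3, 2))              # 3 features, 2 actions (hypothetical sizes)
s = np.array([1.0, 0.0, 0.0])         # features of the current state
s_next = np.array([0.0, 1.0, 0.0])    # features of the next state
theta = q_learning_step(theta, s, action=1, reward=1.0, s_next_feat=s_next)
```

Only the weights of the action that was actually taken change; the other column stays untouched.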

**How can we represent the state?** A single static frame cannot capture motion: if something is moving in the image (e.g. a ball), one frame shows its position but not its direction or speed. Convolutional networks are memoryless, so they cannot carry information across consecutive frames. Remember that RL assumes an MDP (Markov decision process), which requires the state to contain all the relevant information about the environment.

A good approximation is to use several consecutive frames as the state: a moving object then appears at different positions in the consecutive frames.

**Preprocessing steps:**

The frames arriving from the simulator need to be preprocessed before they are fed into the network.

<img src="http://drive.google.com/uc?export=view&id=1v6xXmKxSbElHF8RDgwmjP4eou2DxMHDj" width=75%>

The preprocessing of the raw input frames consists of the following steps, as the image above illustrates:

* converting the image to grayscale
* cropping (only the interesting part of the image remains)
* downsampling (or resizing) the image to $84\times 84$
* stacking four frames together to form the state
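The steps above can be sketched with NumPy alone. The crop rows (34 to 194 of a $210\times 160$ frame), the nearest-neighbour downsampling, the scaling to $[0, 1]$ and the names `preprocess_frame` and `FrameStack` are assumptions made for illustration; real pipelines often use an image library for resizing.

```python
import numpy as np
from collections import deque

def preprocess_frame(frame, crop_top=34, crop_bottom=194, size=84):
    """Grayscale, crop and downsample one RGB frame of shape (H, W, 3)."""
    # Luminance-weighted grayscale conversion.
    gray = frame.astype(np.float32) @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    # Keep only the (assumed) playing area.
    cropped = gray[crop_top:crop_bottom, :]
    # Nearest-neighbour downsampling to size x size.
    rows = np.linspace(0, cropped.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, cropped.shape[1] - 1, size).astype(int)
    small = cropped[np.ix_(rows, cols)]
    return (small / 255.0).astype(np.float32)

class FrameStack:
    """Keeps the last `num_frames` preprocessed frames as one state."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, frame):
        # At episode start, fill the stack with copies of the first frame.
        processed = preprocess_frame(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)
        return self.state()

    def push(self, frame):
        # The oldest frame is dropped automatically by the bounded deque.
        self.frames.append(preprocess_frame(frame))
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=-1)   # shape (84, 84, 4)

rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)  # fake simulator frame
stack = FrameStack()
state = stack.reset(raw)
state = stack.push(raw)
```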

**Network architecture:**

* Conv2D(kernel\_num=32, kernel\_size=(8, 8), padding='valid', input_shape=(84, 84, 4), strides=(4, 4))
* Activation('relu')
* Conv2D(kernel\_num=64, kernel\_size=(4, 4), padding='valid', strides=(2, 2))
* Activation('relu')
* Conv2D(kernel\_num=64, kernel\_size=(3, 3), padding='valid', strides=(1, 1))
* Activation('relu')
* Flatten()
* Dense(units=512, activation='relu')
* Dense(num\_actions)
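To check how the layers fit together, the output size of each convolution with `'valid'` padding can be computed as $\lfloor (n - k) / s \rfloor + 1$ for input size $n$, kernel size $k$ and stride $s$. The helper below is a small sketch of this arithmetic; the function name is chosen for illustration.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial output size of a convolution with 'valid' padding."""
    return (input_size - kernel_size) // stride + 1

# (kernel size, stride, number of kernels) for the three Conv2D layers listed above.
layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

size = 84
shapes = []
for kernel, stride, filters in layers:
    size = conv_output_size(size, kernel, stride)
    shapes.append((size, size, filters))

flat_units = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

print(shapes)      # [(20, 20, 32), (9, 9, 64), (7, 7, 64)]
print(flat_units)  # 3136
```

The `Flatten()` layer therefore passes 3136 values to the dense layer with 512 units.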

DQN has two major tricks to avoid the instability caused by the neural network approximator:

1. experience replay
2. iterative update

**Experience replay:** Instead of learning only from the most recent transition, the agent stores past experiences and reuses them for training. The same state/action pair may appear several times with different rewards and different resulting states.

<img src="http://drive.google.com/uc?export=view&id=1rqQQyPxhDTSFScMwmXDLGAE6eUpAWoSu" width=75%>

There are two main reasons why experience replay can help to converge faster:

1. If the samples are fed into the network in the order they were collected, consecutive samples are strongly correlated. This correlation slows down learning and harms generalization. The replay buffer collects the experiences, and the training batches are sampled from it according to a uniform distribution, which breaks the correlation.
2. Some states (experiences) are valuable and should be used several times because they affect the policy strongly. However, such a state may be visited rarely because it is hard to reach. Because the replay buffer stores a long history of experiences, rare experiences can be reused several times.

The size of the replay buffer gives us direct control over the memory requirements of the algorithm, so we can __trade off stability against memory consumption__.
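A replay buffer with a fixed capacity and uniform sampling can be sketched in a few lines. The class and method names are illustrative, not taken from a particular library; `capacity` is the knob that bounds memory consumption.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal correlation.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # Dummy transitions: the state is just the step index here.
    buffer.store(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(batch_size=2)
```

After five stored transitions, only the three most recent ones remain in the buffer.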

**Iterative update:**

One of the reasons behind the instability of Q-learning combined with a deep neural network is the fast change (high variance) of the one-step return. The one-step return depends on the network itself, and the network weights are updated frequently, so the target keeps moving and the network has no time to adapt to the changes.

Iterative update (or delayed update) uses two networks to represent the $Q$-function. The architecture is the same but the weights are different. The weights are synchronized after a given number of steps.

The first network, $\hat{Q}_{\theta^-}$, is used to calculate the return and is not updated until synchronization. The second network, $Q_\theta$, is responsible for selecting the next step and is always updated according to the update rule.

The update rule changes to the following one:

$$\theta_{t+1} = \theta_t + \alpha \cdot \left( r_t + \gamma \max_{a'}\hat{Q}_{\theta^-}(s', a') - Q_\theta(s, a) \right)\cdot \nabla_\theta Q_\theta(s, a)|_{\theta = \theta_t}$$
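The interplay of the two weight sets can be sketched as follows, again with a linear $Q$-function for brevity. The names `theta_target` (standing for $\theta^-$), `sync_every`, the `dones` mask for terminal transitions and the discount factor `gamma` are assumptions of this sketch.

```python
import numpy as np

def td_targets(theta_target, rewards, next_features, dones, gamma=0.99):
    """Targets r + gamma * max_a' Q_hat(s', a'), computed with the frozen weights."""
    next_q = next_features @ theta_target            # shape (batch, num_actions)
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

def train_step(theta, theta_target, batch, step, alpha=0.01, gamma=0.99, sync_every=100):
    """One update of theta; theta_target is only refreshed every `sync_every` steps."""
    features, actions, rewards, next_features, dones = batch
    targets = td_targets(theta_target, rewards, next_features, dones, gamma)
    q_taken = (features @ theta)[np.arange(len(actions)), actions]
    td_errors = targets - q_taken
    theta = theta.copy()
    for feat, act, err in zip(features, actions, td_errors):
        # Average the per-sample semi-gradient steps over the batch.
        theta[:, act] += alpha * err * feat / len(actions)
    if (step + 1) % sync_every == 0:
        theta_target = theta.copy()                  # synchronization
    return theta, theta_target

theta = np.zeros((2, 2))
theta_target = np.array([[1.0, 2.0], [3.0, 4.0]])
batch = (
    np.array([[1.0, 0.0]]),   # features of s
    np.array([0]),            # actions taken
    np.array([1.0]),          # rewards
    np.array([[0.0, 1.0]]),   # features of s'
    np.array([0.0]),          # 1.0 where the episode ended
)
new_theta, new_target = train_step(theta, theta_target, batch, step=0)
```

In this example the target is $1 + 0.99 \cdot 4 = 4.96$, computed from `theta_target` even though `theta` is still all zeros.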

The figure below shows the pseudocode of the DQN algorithm. The experience replay is the buffer $D$ in the code: the algorithm stores experiences in the buffer and samples from it. The iterative update is implemented with $\hat{Q}$ and $Q$.

<img src="http://drive.google.com/uc?export=view&id=1EaDj9o9-ACuMsf9PtmMw4p3CMtknTVAg" width=55%>

[Video playing Atari](https://www.youtube.com/watch?v=V1eYniJ0Rnk)

### Disadvantages of value-based RL

As we have already seen, value-based RL derives the policy from a value-function:

With a Q-function:

$$\pi(s) = \arg \max_a{ Q(s, a) }$$

When a V-function is given (together with the transition model $T$):

$$\pi(s) = \arg\max_a \sum_{s'} T(s, a, s') \cdot \left[ r(s, a) + \gamma V(s') \right]$$
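For a small tabular problem both ways of deriving the policy can be written directly with NumPy. The arrays below are made-up toy values; `T[s, a, s']` holds transition probabilities, `R[s, a]` rewards, and the expectation over next states $s'$ is taken explicitly, which is an assumption of this sketch.

```python
import numpy as np

def greedy_from_q(Q):
    """pi(s) = argmax_a Q(s, a) for a table Q of shape (states, actions)."""
    return np.argmax(Q, axis=1)

def greedy_from_v(T, R, V, gamma=0.9):
    """pi(s) = argmax_a sum_s' T(s, a, s') * [r(s, a) + gamma * V(s')]."""
    # Since the probabilities over s' sum to 1, r(s, a) can be pulled out of the sum.
    q = R + gamma * np.einsum("sap,p->sa", T, V)
    return np.argmax(q, axis=1)

# Toy problem: 2 states, 2 actions; action a always leads to state a.
T = np.zeros((2, 2, 2))
T[:, 0, 0] = 1.0
T[:, 1, 1] = 1.0
R = np.zeros((2, 2))
V = np.array([0.0, 1.0])          # state 1 is the more valuable one

policy_v = greedy_from_v(T, R, V)
policy_q = greedy_from_q(np.array([[0.2, 0.1], [0.0, 0.5]]))
```

Note that the V-function variant needs the model `T`, while the Q-function variant does not.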

Due to the $\arg \max$ function the resulting policy is deterministic. However, in the first session we saw an example in which a stochastic policy has an advantage over a deterministic one. Let's recall the example:

<img src="http://drive.google.com/uc?export=view&id=1b-EDUk5cFVpqtOvZ0o4begzKCYwO0dMS" width=65%>

In this example the cells marked with question marks look identical to the agent, because the state is defined only by the walls the agent can see around itself. A deterministic policy must choose the same action in both cells, so it cannot solve the problem as efficiently as a stochastic policy can.

One main disadvantage of value-based RL is that it cannot calculate a stochastic policy directly. Furthermore, changing the value function can cause unpredictable changes in the derived policy.

In summary, value-based RL:
* naturally gives a **deterministic policy**
* **small changes** in the value function can cause severe changes in the policy
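The second point can be seen in a tiny numeric example: when two actions have almost the same value, a small change in the estimate flips the greedy action. The numbers are made up for illustration.

```python
import numpy as np

q_before = np.array([1.00, 0.99])               # Q(s, a) for two actions in one state
q_after = q_before + np.array([0.00, 0.02])     # small change in one estimate

action_before = int(np.argmax(q_before))
action_after = int(np.argmax(q_after))

print(action_before, action_after)   # 0 1
```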

# Policy gradients

- Basic idea: adjust the policy in the direction of the gradient (the direction that makes it better)

To overcome the limitations of value-based RL, policy-based methods parametrize the policy directly: $\pi(s, a, \theta)$.
The question is how to optimize the parameters of the policy.
As elsewhere in RL and machine learning, the optimization steps follow the gradient of a loss function (or objective function). The question is how to define a good objective function.

## One example of Policy Gradient methods: REINFORCE

<img src="http://drive.google.com/uc?export=view&id=1Il8lf_096OKqwVpZ6ohUIdYsjr-dtGHk" width=75%>

Here the return is estimated with the Monte Carlo method, which we saw in the first section.
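A Monte Carlo return can be computed from the rewards of one finished episode by summing backwards. The function name and the discount factor `gamma` are assumptions of this sketch; it covers only the return computation, not the full REINFORCE update.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

episode_returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(episode_returns)   # [1.5, 1.0, 2.0]
```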

# Taxonomy of RL

<img src="http://drive.google.com/uc?export=view&id=1Gz0WBOtTxYrZ91uAidFE_Tuw0RBSfl9O" width=55%>

<img src="http://drive.google.com/uc?export=view&id=11JF9TQok7WgDkqtOklY63RGJ6DhjvY_p" width=55%>


<img src="http://drive.google.com/uc?export=view&id=1OQXkGuNHx5dCThkWRhyVsiP_SjLhEN8v" width=55%>


"Figure 1: Growth of published reinforcement learning papers. Shown are the number of RL-related publications (y-axis) per year (x-axis) scraped from Google Scholar searches." [source](https://www.researchgate.net/publication/319953023_Deep_Reinforcement_Learning_That_Matters/figures?lo=1)