# Deep Q-Networks
> Notes on the famous Deep Q-Networks paper from DeepMind.

- toc: true
- branch: 2020-04-03-deep-q-networks
- badges: true
- image: images/q-network-architecture.jpg
- comments: true
- author: David R. Pugh
- categories: [pytorch, deep-reinforcement-learning, deep-q-networks]

In [4]:
!conda install --yes --channel pytorch pytorch=1.4


In [8]:
import torch
from torch import nn
from torch.nn import functional as F

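These notes follow the paper *Human-level control through deep reinforcement learning* (Mnih et al., 2015): https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf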

The authors use a deep convolutional neural network to approximate the optimal action-value function

$$ Q^*(s, a) = \max_{\pi} \mathbb{E}\Bigg[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Bigg|\, s_t=s, a_t=a, \pi \Bigg] $$

which is the maximum sum of rewards $r_t$ discounted by $\gamma$ at each time-step $t$ achievable by a behaviour policy $\pi = P(a|s)$, after making an observation of the state $s$ and taking an action $a$. 
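
The quantity inside the expectation is a discounted sum of rewards. As a minimal sketch, the function below computes that sum for one fixed, finite list of rewards; the name `discounted_return` and the example numbers are made up for illustration, and the sketch does not take the expectation or the maximum over policies that define $Q^*$.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite reward sequence."""
    total = 0.0
    # Work backwards so each step is one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards observed after taking action a in state s.
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(example_return)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

With $\gamma$ close to 1 later rewards count almost as much as immediate ones; with $\gamma$ close to 0 the agent is effectively myopic.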


Reinforcement learning is known to be unstable or even to diverge when a non-linear function approximator such as a neural network is used to represent the $Q$ function. Why?

1. Correlations in the sequence of observations of the state $s$. In reinforcement learning applications the sequence of state observations is a time series that will almost surely be autocorrelated. But surely this would also be true of any application of deep neural networks to time-series data.
2. Small updates to $Q$ may significantly change the policy $\pi$ and therefore the data distribution.
3. Correlations between the action-values, $Q$, and the target values $r + \gamma \max_{a'} Q(s', a')$.

The paper addresses these issues with two ideas.

* A biologically inspired mechanism they refer to as *experience replay*, which randomizes over the data, thereby removing correlations in the sequence of observations of the state $s$ and smoothing over changes in the data distribution (issues 1 and 2 above).
* Using an iterative update rule that adjusts the action-values, $Q$, towards target values that are only periodically updated, thereby reducing correlations with the target (issue 3 above).

## Approximating the action-value function, $Q(s,a)$

They parameterize an approximate value function $Q(s,a; \theta_i)$ using the deep convolutional neural network shown in the figure below. Note that $\theta_i$ are the parameters or weights of the $Q$-network at iteration $i$.

![](images/q-network-architecture.jpg)

In [13]:
def create_deep_q_network(state_size: int, action_size: int) -> nn.Module:
    """Q-network from the paper; assumes input of shape (batch, state_size, 84, 84)."""
    deep_q_network = nn.Sequential(
        nn.Conv2d(in_channels=state_size, out_channels=32, kernel_size=8, stride=4),
        nn.ReLU(),
        nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
        nn.ReLU(),
        nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),  # 3 x 3 filters, as in the paper
        nn.ReLU(),
        nn.Flatten(),  # (batch, 64, 7, 7) -> (batch, 3136) for 84 x 84 inputs
        nn.Linear(in_features=64 * 7 * 7, out_features=512),
        nn.ReLU(),
        nn.Linear(in_features=512, out_features=action_size)
    )
    return deep_q_network
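
The number of inputs to the first fully connected layer depends on how the convolutional layers shrink the image. The sketch below traces the spatial size through the stack, assuming the paper's preprocessed 84 x 84 frames, its filter sizes (8 x 8 with stride 4, 4 x 4 with stride 2, 3 x 3 with stride 1) and no padding. The helper `conv_output_size` is a name introduced here for illustration.

```python
def conv_output_size(size: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    """Spatial size after one convolution (standard formula, no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 84  # assumed height and width of a preprocessed frame
for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel_size, stride)

# 64 output channels in the last convolutional layer
flattened_features = 64 * size * size
print(size, flattened_features)  # 7 3136
```

Under these assumptions the feature maps go from 84 to 20 to 9 to 7 pixels per side, so the first linear layer would need 64 x 7 x 7 = 3136 input features. A different input resolution would change this number.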

### Experience Replay

To perform *experience replay* the authors store the agent's experiences $e_t$ as represented by the tuple

$$ e_t = (s_t, a_t, r_t, s_{t+1}) $$

consisting of the observed state in period $t$, the action taken in period $t$, the reward received in period $t$, and the resulting state in period $t+1$. The dataset of agent experiences at period $t$ consists of the set of past experiences.

$$ D_t = \{e_1, e_2, \ldots, e_t\} $$

Depending on the task it may not be feasible for the agent to store the entire history of past experiences, in which case, as in the paper, only the most recent $N$ experiences are kept.

During learning, $Q$-learning updates are computed on samples (or minibatches) of experience $(s,a,r,s')$ drawn uniformly at random from the pool of stored samples $D_t$.
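
A replay memory along these lines can be sketched with the standard library alone. The names `Experience` and `ReplayBuffer` are hypothetical, the fixed capacity is an assumption about how old experiences get discarded, and sampling here is uniform without replacement within a minibatch, which is one reasonable reading of "uniformly at random".

```python
import random
from collections import deque, namedtuple

# One stored transition e_t = (s_t, a_t, r_t, s_{t+1}).
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-capacity store of experiences with uniform random sampling."""

    def __init__(self, capacity: int, seed=None):
        # A deque with maxlen silently drops the oldest item when full.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def __len__(self) -> int:
        return len(self._buffer)

    def append(self, experience: Experience) -> None:
        self._buffer.append(experience)

    def sample(self, batch_size: int) -> list:
        # Uniform sampling without replacement breaks up the time ordering.
        return self._rng.sample(list(self._buffer), batch_size)


buffer = ReplayBuffer(capacity=3, seed=42)
for t in range(5):
    buffer.append(Experience(state=t, action=0, reward=1.0, next_state=t + 1))

minibatch = buffer.sample(batch_size=2)
```

After the loop the buffer holds only the three most recent experiences (states 2, 3 and 4), and each minibatch is drawn from those regardless of the order in which they were observed.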


## The Loss Function

The $Q$-learning update at iteration $i$ uses the following loss function

$$ \mathcal{L}_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \Bigg[\bigg(r + \gamma \max_{a'} Q\big(s', a'; \theta_i^{-}\big) - Q\big(s, a; \theta_i\big)\bigg)^2\Bigg] $$

where $\gamma$ is the discount factor determining the agent’s horizon, $\theta_i$ are the parameters of the $Q$-network at iteration $i$ and $\theta_i^{-}$ are the $Q$-network parameters used to compute the target at iteration $i$. The target network parameters $\theta_i^{-}$ are only updated with the $Q$-network parameters $\theta_i$ every $C$ steps and are held fixed between individual updates.
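
To make the roles of $\theta_i$ and $\theta_i^{-}$ concrete, here is a toy sketch that swaps the neural networks for small lookup tables so that it runs with the standard library alone. The tables `online_q` and `target_q`, the single fake transition, the learning rate and the value of `C` are all invented for illustration. The `done` flag, which drops the bootstrap term when an episode ends, is an addition that the text above does not discuss.

```python
import copy


def td_target(reward, next_q_values, gamma, done=False):
    """r + gamma * max_a' Q(s', a'; theta^-), or just r at the end of an episode."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Toy stand-ins for theta and theta^-: one row of action-values per state.
online_q = {0: [0.0, 0.0], 1: [0.0, 0.0]}
target_q = copy.deepcopy(online_q)

gamma, learning_rate, C = 0.9, 0.5, 3
sync_steps = []
for step in range(1, 7):
    state, action, reward, next_state = 0, 1, 1.0, 1  # the same fake transition every step
    y = td_target(reward, target_q[next_state], gamma)
    # Gradient step on (y - Q(s, a))^2 with respect to Q(s, a), up to a constant factor.
    online_q[state][action] += learning_rate * (y - online_q[state][action])
    if step % C == 0:
        target_q = copy.deepcopy(online_q)  # theta^- <- theta
        sync_steps.append(step)

print(sync_steps)  # [3, 6]
```

Between synchronisations the targets are computed from a frozen copy of the action-values, so the quantity being regressed towards does not move with every update of `online_q`.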

In [14]:
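expected_q_network = create_deep_q_network(state_size=4, action_size=4)
target_q_network = create_deep_q_network(state_size=4, action_size=4)
target_q_network.load_state_dict(expected_q_network.state_dict())  # start from identical weights

# Placeholder minibatch of random values (illustration only; assumes 84 x 84 frames, ignores terminal states)
states, next_states = torch.rand(32, 4, 84, 84), torch.rand(32, 4, 84, 84)
actions, rewards = torch.randint(0, 4, (32, 1)), torch.rand(32, 1)

# Q(s, a; theta) for the actions actually taken
expected_q_values = expected_q_network(states).gather(dim=1, index=actions)
# r + gamma * max_a' Q(s', a'; theta^-) with gamma = 0.99, computed without tracking gradients
with torch.no_grad():
    target_q_values = rewards + 0.99 * target_q_network(next_states).max(dim=1, keepdim=True)[0]
loss = F.mse_loss(expected_q_values, target_q_values)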