# Q-network

- Tabular Q-learning becomes intractable when the number of states is large
- Train a network that maps a state to the estimated Q-value of each action, $state \to Q(state, action)$
- However, a naively trained Q-network tends to diverge due to
  - Correlations between samples (consecutive transitions are too similar)
  - Non-stationary targets (each gradient descent step also changes the target $Y$)

# DQN

- Use deep neural networks
- Experience replay ([NIPS'13](https://arxiv.org/pdf/1312.5602))
  - Store transitions in a replay memory and train on random samples drawn from it
- Separate action/target networks ([Nature'15](http://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf))
  - Keep the target network fixed and copy the action network's weights into it every $C$ steps
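
To make the experience replay idea concrete, here is a minimal sketch using only the standard library. The capacity, the batch size, and the fake transitions are assumptions made for illustration; they are not values from the papers.

```python
import random
from collections import deque

# Hypothetical sizes, chosen only to keep the example small.
CAPACITY = 5
BATCH = 3

# A bounded deque drops the oldest transition once it is full.
memory = deque(maxlen=CAPACITY)

# Store fake transitions of the form (state, action, reward, next_state, done).
for t in range(8):
    memory.append((t, t % 2, 1.0, t + 1, False))

# Uniform random sampling mixes old and new transitions, which breaks
# the correlation between consecutive samples.
random.seed(0)
batch = random.sample(list(memory), BATCH)
print(len(memory), len(batch))
```

Only the 5 most recent transitions remain in memory, and the mini-batch is drawn from them without replacement.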

### Equations

\begin{align*}
Y_t &= r_t + \gamma\max_{a'}\hat{Q}(s_{t+1}, a';\bar\theta) \\
\hat{Y}_t &= Q(s_t, a_t;\theta) \\
&\min_\theta\sum_{t=0}^T[\hat{Y}_t-Y_t]^2
\end{align*}
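
As a worked example of these equations, the sketch below computes the targets $Y$, the predictions $\hat{Y}$, and the squared-error loss with NumPy. All Q-values, rewards, and actions are made-up numbers, and $\gamma = 0.99$ is an assumed value.

```python
import numpy as np

GAMMA = 0.99  # discount factor (assumed value)

# Hypothetical batch of 3 transitions in an environment with 2 actions.
rewards = np.array([1.0, 1.0, 1.0])
actions = np.array([0, 1, 1])            # a_t taken in each transition

# Q(s_t, . ; theta) from the action network (made-up numbers)
q_online = np.array([[0.5, 0.2],
                     [0.1, 0.4],
                     [0.3, 0.9]])

# Q_hat(s_{t+1}, . ; theta_bar) from the target network (made-up numbers)
q_target_next = np.array([[1.0, 2.0],
                          [0.5, 0.0],
                          [3.0, 1.0]])

# Y = r_t + gamma * max_a' Q_hat(s_{t+1}, a'; theta_bar)
Y = rewards + GAMMA * q_target_next.max(axis=1)

# Y_hat = Q(s_t, a_t; theta): the value of the action actually taken
Y_hat = q_online[np.arange(len(actions)), actions]

# Sum of squared errors over the batch
loss = np.sum((Y_hat - Y) ** 2)
print(Y, Y_hat, loss)
```

Only the entry of $Q(s_t, \cdot\,;\theta)$ that belongs to the action taken enters the loss, so the gradient touches one output per sample.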

### Algorithm

1. Initialize replay memory $D$, action network $Q(\theta)$, target network $\hat{Q}(\bar\theta)$
2. Do Forever:
  - Select $a_t = \arg\max_{a} Q(s_t,a;\theta)$ with probability $1-\epsilon$, otherwise select a random action
  - Execute action $a_t$ and observe reward $r_t$ and next state $s_{t+1}$
  - Store transition $(s_t, a_t, r_t, s_{t+1}) \to D$
  - Sample random mini-batch from $D$
  - Set $y_t = r_t$ if the episode terminates at $s_{t+1}$, else $y_t = r_t + \gamma\max_{a'}\hat{Q}(s_{t+1}, a';\bar\theta)$
  - Perform gradient descent step on $(y_t-Q(s_t, a_t;\theta))^2$
  - Copy $\hat{Q}\gets Q$ every $C$ steps
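
The three decision rules in the loop above (action selection, target computation, and the periodic copy) can be sketched as small pure-Python helpers. The function names `epsilon_greedy`, `td_target`, and `should_sync` are hypothetical, and Q-values are passed in as plain lists rather than produced by a network.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, done, gamma=0.99):
    """y_t = r_t on termination, else r_t + gamma * max_a' Q_hat(s_{t+1}, a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def should_sync(step, c):
    """True on every c-th step, when the target network would be refreshed."""
    return step % c == 0


# Greedy choice when epsilon is 0, bootstrapped target for a non-terminal step.
action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0)
target = td_target(1.0, [2.0, 3.0], done=False)
sync_steps = [s for s in range(1, 11) if should_sync(s, 5)]
print(action, target, sync_steps)
```

With a real network, `next_q_values` would come from the target network $\hat{Q}(\bar\theta)$, and the sync step would copy the action network's weights into it.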

In [1]:
from __future__ import division

from collections import deque
import copy

import gym
import numpy as np
import torch
from torch.autograd import Variable
import torch.nn as nn

In [2]:
env = gym.make('CartPole-v0')
env = gym.wrappers.Monitor(env, directory='gym-results/', force=True)

[2017-06-12 18:24:26,768] Making new env: CartPole-v0
[2017-06-12 18:24:26,793] Clearing 4 monitor files from previous run (because force=True was provided)


In [3]:
# Params
INPUT_SIZE = env.observation_space.shape[0]
OUTPUT_SIZE = env.action_space.n
HIDDEN_SIZE = 16
GAMMA = 0.99
REPLAYS = 50000
BATCH_SIZE = 64
UPDATE_BY = 5
EPISODES = 5000

In [4]:
# Initialize
D = deque(maxlen=REPLAYS)    # replay buffer
Q = nn.Sequential(
    nn.Linear(INPUT_SIZE, HIDDEN_SIZE),
    nn.ReLU(),
    nn.Linear(HIDDEN_SIZE, OUTPUT_SIZE),
)
Q_bar = copy.deepcopy(Q)    # target network

In [5]:
s = env.reset()

[2017-06-12 18:24:33,801] Starting new video recorder writing to /Users/naver/practice/rl-study/gym-results/openaigym.video.0.32175.video000000.mp4


In [18]:
Q(Variable(torch.FloatTensor(s)).view(1,-1))

Variable containing:
-0.3304  0.1774
[torch.FloatTensor of size 1x2]

In [19]:
env.step(1)

(array([-0.01672859,  0.14554791,  0.03870915, -0.29024425]), 1.0, False, {})

In [None]:
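for episode in range(EPISODES):
    # Decay exploration: epsilon starts at 1 and shrinks as episodes accumulate
    epsilon = 1 / ((episode / 10) + 1)
    done = False
    state = env.reset()
    while not done:
        if np.random.rand() < epsilon:
            # Explore: pick a random action
            action = env.action_space.sample()
        else:
            # Exploit: pick the action with the highest predicted Q-value
            q_values = Q(Variable(torch.FloatTensor(state)).view(1, -1))
            action = int(np.argmax(q_values.data.numpy()))
        next_state, reward, done, _ = env.step(action)
        # Store the transition in the replay buffer
        D.append((state, action, reward, next_state, done))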
        