# Q-network

- Tabular Q-learning becomes intractable when the number of states is large
- Train a network that maps a state to the Q-value of each action: $s \to Q(s, a;\theta)$
- However, a naive Q-network tends to diverge due to
  - Correlations between consecutive samples (too similar)
  - Non-stationary targets (each gradient descent step on $\theta$ also changes the target $Y$)

# DQN

- Use deep neural networks
- Experience replay ([NIPS'13](https://arxiv.org/pdf/1312.5602))
  - Store transitions in a replay memory and train on randomly sampled mini-batches
- Separate action/target networks ([Nature'15](http://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf))
  - Keep the target network fixed and copy the action network's weights into it every $C$ steps
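
Experience replay can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation from either paper: the class name `ReplayMemory` and its methods are hypothetical, and the transitions stored here are dummy values.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of transitions (hypothetical helper for illustration)."""

    def __init__(self, capacity):
        # a deque with maxlen silently drops the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement mixes old and new transitions,
        # which reduces the correlation between samples in a mini-batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=5)
for t in range(8):  # push 8 dummy transitions; only the last 5 are kept
    memory.push(t, 0, 1.0, t + 1, False)
batch = memory.sample(3)
```

The training cell later in this notebook uses a bare `deque` with `random.sample` for the same purpose.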

### Equations

\begin{align*}
Y &= r_t + \gamma\max_{a'}\hat{Q}(s_{t+1}, a';\bar\theta) \\
\hat{Y} &= Q(s_t, a_t;\theta) \\
&\min_\theta\sum_{t=0}^T[\hat{Y}-Y]^2
\end{align*}
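
As a small worked example of these equations, the sketch below computes the targets $Y$ and the summed squared error for two transitions in plain Python. The function names and all numbers are made up for illustration; $\gamma = 0.99$ is assumed to match the `GAMMA` used in the code cells.

```python
def td_targets(rewards, next_q_values, gamma=0.99):
    """Y = r_t + gamma * max_a' Q_hat(s_{t+1}, a') for each transition."""
    return [r + gamma * max(q_next) for r, q_next in zip(rewards, next_q_values)]


def squared_loss(predictions, targets):
    """Sum over transitions of (Y_hat - Y)^2."""
    return sum((y_hat - y) ** 2 for y_hat, y in zip(predictions, targets))


rewards = [1.0, 1.0]
next_q = [[0.5, 2.0], [1.0, 0.0]]  # target-network outputs for s_{t+1}
predictions = [2.5, 2.0]           # action-network outputs Q(s_t, a_t)

targets = td_targets(rewards, next_q)      # approximately [2.98, 1.99]
loss = squared_loss(predictions, targets)  # approximately 0.2305
```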

### Algorithm

1. Initialize replay memory $D$, action network $Q(\theta)$, target network $\hat{Q}(\bar\theta)$
2. Repeat forever:
  - With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \arg\max_a Q(s_t,a;\theta)$
  - Execute action $a_t$ and observe reward $r_t$ and next state $s_{t+1}$
  - Store transition $(s_t, a_t, r_t, s_{t+1})$ in $D$
  - Sample a random mini-batch of transitions from $D$
  - Set $y_t = r_t$ if the episode terminates at $s_{t+1}$, else $y_t = r_t + \gamma\max_{a'}\hat{Q}(s_{t+1}, a';\bar\theta)$
  - Perform a gradient descent step on $(y_t-Q(s_t, a_t;\theta))^2$ with respect to $\theta$
  - Copy $\hat{Q}\gets Q$ (that is, $\bar\theta\gets\theta$) every $C$ steps
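
Two steps of the loop above, action selection and target computation, can be sketched without any deep learning library. The helper names `select_action` and `target` are hypothetical, and the Q-values are passed in as plain lists rather than produced by a network.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def target(reward, next_q_values, done, gamma=0.99):
    """y_t = r_t at a terminal transition, else r_t + gamma * max_a' Q_hat(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


greedy = select_action([0.1, 0.7, 0.3], epsilon=0.0)  # epsilon 0 always picks index 1
terminal_y = target(-1.0, [5.0, 6.0], done=True)      # bootstrap term is dropped: -1.0
```

Dropping the bootstrap term at terminal transitions is what the `(1-terminates)` factor does in the training cell.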

In [2]:
from __future__ import division

from collections import deque
import copy
from operator import itemgetter
import random

import gym
import numpy as np
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.optim as optim

In [3]:
# Setup gym
env = gym.make('CartPole-v0')
# env = gym.wrappers.Monitor(env, directory='gym-results/', force=True)

[2017-06-13 18:41:57,433] Making new env: CartPole-v0


In [4]:
# Params
INPUT_SIZE = env.observation_space.shape[0]
OUTPUT_SIZE = env.action_space.n
HIDDEN_SIZE = 16
GAMMA = 0.99
REPLAYS = 10000
BATCH_SIZE = 32
UPDATE_BY = 5
EPISODES = 500
LR = 1e-3

In [5]:
# DQN
D = deque(maxlen=REPLAYS)    # replay buffer
Q = nn.Sequential(
    nn.Linear(INPUT_SIZE, HIDDEN_SIZE),
    nn.ReLU(),
    nn.Linear(HIDDEN_SIZE, OUTPUT_SIZE),
)
Q_bar = copy.deepcopy(Q)    # target network
optimizer = optim.Adam(Q.parameters(), lr=LR)
criterion = nn.MSELoss()

In [None]:
# Train
# Uses the classic gym API, as above: reset() returns the observation, step() returns 4 values
loss_value = float('nan')    # no gradient step runs until D holds more than BATCH_SIZE transitions
for episode in range(EPISODES):
    epsilon = 1 / ((episode / 10) + 1)    # decays from 1.0 toward 0 as episodes progress
    done = False
    state = env.reset()
    step = 1
    while not done:
        if np.random.rand() < epsilon:    # explore
            action = env.action_space.sample()
        else:    # exploit: greedy action under the action network
            with torch.no_grad():
                action = Q(torch.FloatTensor(state).unsqueeze(0)).argmax(dim=1).item()
        next_state, reward, done, _ = env.step(action)
        if done:
            reward = -1.0    # penalty
        D.append((state, action, reward, next_state, float(done)))
        if len(D) > BATCH_SIZE:    # SGD
            batch = random.sample(D, BATCH_SIZE)
            states = torch.FloatTensor(np.array(list(map(itemgetter(0), batch))))
            actions = torch.LongTensor(list(map(itemgetter(1), batch)))
            rewards = torch.FloatTensor(list(map(itemgetter(2), batch)))
            next_states = torch.FloatTensor(np.array(list(map(itemgetter(3), batch))))
            terminates = torch.FloatTensor(list(map(itemgetter(4), batch)))    # true=1
            optimizer.zero_grad()
            # target from the frozen network; detach so no gradient flows into it
            y = rewards + GAMMA * Q_bar(next_states).detach().max(dim=1)[0] * (1 - terminates)
            # squeeze to shape (BATCH_SIZE,) so it matches y instead of broadcasting
            y_h = Q(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            loss = criterion(y_h, y)
            loss.backward()
            optimizer.step()
            loss_value = loss.item()
        if step % UPDATE_BY == 0:    # sync target network with action network
            Q_bar = copy.deepcopy(Q)
        step += 1
        state = next_state
    print(episode, step, loss_value)

In [None]:
# Test
state = env.reset()
rewards = 0
while True:
    env.render()
    with torch.no_grad():    # greedy action, no exploration
        action = Q(torch.FloatTensor(state).unsqueeze(0)).argmax(dim=1).item()
    state, reward, done, _ = env.step(action)
    rewards += reward
    if done:
        print(rewards)
        break

In [None]:
env.close()    # closes the render window; render(close=True) was removed in later gym releases