# Q-network

- Tabular Q-learning becomes intractable when the number of states is large
- Train a network that maps a state to an estimated Q-value for each action: $state \to Q(state, action)$
- However, a naively trained Q-network tends to diverge due to
  - Correlations between samples (consecutive transitions are too similar)
  - Non-stationary targets (each gradient descent step also changes the target $Y$)

# DQN

- Use deep neural networks
- Experience replay ([NIPS'13](https://arxiv.org/pdf/1312.5602))
  - Store transitions in a replay memory and train on randomly sampled mini-batches
- Separate action/target networks ([Nature'15](http://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf))
  - Keep the target network fixed and copy the action network into it every N steps
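
A minimal sketch of the experience-replay idea, using only the standard library. The `ReplayBuffer` class and its method names are hypothetical helpers introduced here for illustration; the notebook itself uses a bare `deque`.

```python
import random
from collections import deque


class ReplayBuffer(object):
    """Fixed-size store of transitions (hypothetical helper, not from the notebook)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the temporal
        # correlation between consecutive transitions.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    # Made-up transitions: state t, action 0, reward 1.0, next state t + 1.
    buffer.push(t, 0, 1.0, t + 1, False)

print(len(buffer))    # 3: the two oldest transitions were evicted
print(sorted(tr[0] for tr in buffer.sample(2)))
```

Because sampling is uniform over the whole buffer, a mini-batch usually mixes transitions from many different episodes.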

### Equations

\begin{align*}
Y &= r_t + \gamma\underset{a'}\max{\hat{Q}(s_{t+1}, a';\bar\theta)} \\
\hat{Y} &= Q(s_t,a_t;\theta) \\
&\underset\theta\min\sum_{t=0}^T[\hat{Y}-Y]^2
\end{align*}
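
As a worked example of the target $Y$ and the squared error above, here is a small sketch in plain Python. All numbers are made up for illustration, and `td_target` and `squared_error` are hypothetical helper names.

```python
GAMMA = 0.99    # discount factor, same value as the notebook's GAMMA


def td_target(reward, next_q_values, gamma=GAMMA):
    """Y = r_t + gamma * max over a' of the target network's Q(s_{t+1}, a')."""
    return reward + gamma * max(next_q_values)


def squared_error(prediction, target):
    """Single-sample term of the loss: (Y_hat - Y)^2."""
    return (prediction - target) ** 2


# Made-up numbers: the target network scores two actions in s_{t+1}.
next_q = [1.0, 2.0]
y = td_target(1.0, next_q)       # 1.0 + 0.99 * 2.0, about 2.98
y_hat = 2.5                      # made-up action-network estimate Q(s_t, a_t)
loss = squared_error(y_hat, y)   # (2.5 - 2.98)^2, about 0.2304

print(y, loss)
```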

### Algorithm

1. Initialize replay memory $D$, action network $Q(\theta)$, target network $\hat{Q}(\bar\theta)$
2. Do Forever:
  - Select $a_t = \text{arg}\underset{a}\max Q(s_t,a;\theta)$ with probability $1-\epsilon$, otherwise a random action
  - Execute action $a_t$ and observe reward $r_t$ and next state $s_{t+1}$
  - Store transition $(s_t, a_t, r_t, s_{t+1}) \to D$
  - Sample random mini-batch from $D$
  - Set $y_t = r_t$ if the episode terminates at $s_{t+1}$, else $y_t = r_t + \gamma\underset{a'}\max{\hat{Q}(s_{t+1}, a';\bar\theta)}$
  - Perform gradient descent step on $(y_t-Q(s_t, a_t;\theta))^2$
  - Copy $\hat{Q}\gets Q$ every $C$ steps
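
The action-selection step of the algorithm above can be sketched as follows. This is an illustration only: `epsilon_greedy` is a hypothetical helper, the Q-values are made up, and the random branch mirrors what the notebook's training loop does with `env.action_space.sample()`.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))    # explore
    # Exploit: index of the largest Q-value (the first one wins ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.7]    # made-up Q(s_t, a) for two actions
print(epsilon_greedy(q_values, epsilon=0.0))    # always greedy -> 1
print(epsilon_greedy(q_values, epsilon=1.0))    # always random: 0 or 1
```

Passing a seeded `random.Random` instance as `rng` makes the exploration reproducible.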

In [1]:
from __future__ import division

from collections import deque
import copy
from operator import itemgetter
import random

import gym
import numpy as np
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.optim as optim

In [2]:
# Setup gym
env = gym.make('CartPole-v0')

[2017-06-14 15:45:15,624] Making new env: CartPole-v0


In [3]:
# Params
INPUT_SIZE = env.observation_space.shape[0]
OUTPUT_SIZE = env.action_space.n
HIDDEN_SIZE = 128
GAMMA = 0.99
REPLAYS = 10000
BATCH_SIZE = 64
UPDATE_BY = 5
EPISODES = 100
LR = 1e-3

In [4]:
# DQN
D = deque(maxlen=REPLAYS)    # replay buffer
Q = nn.Sequential(
    nn.Linear(INPUT_SIZE, HIDDEN_SIZE),
    nn.ReLU(),
    nn.Linear(HIDDEN_SIZE, OUTPUT_SIZE),
)
Q_bar = copy.deepcopy(Q)    # target network
optimizer = optim.Adam(Q.parameters(), lr=LR)
criterion = nn.MSELoss()

In [5]:
# Train
steps = []
for episode in range(EPISODES):
    epsilon = 1 / ((episode / 10) + 1)
    done = False
    state = env.reset()
    step = 0
    while not done:
        if np.random.rand() < epsilon:    # explore with probability epsilon, otherwise exploit
            action = env.action_space.sample()
        else:
            _state = Variable(torch.FloatTensor(state)).unsqueeze(0)
            action = np.argmax(Q(_state).data.numpy())
        next_state, reward, done, _ = env.step(action)
        if done:
            reward = -1.0    # penalty
        D.append((state, action, reward, next_state, done))
        if len(D) > BATCH_SIZE:    # SGD
            batch = random.sample(D, BATCH_SIZE)
            # list(...) so the tensors are built from sequences on Python 3 as well
            states = Variable(torch.FloatTensor(list(map(itemgetter(0), batch))))
            actions = Variable(torch.LongTensor(list(map(itemgetter(1), batch))))
            rewards = Variable(torch.FloatTensor(list(map(itemgetter(2), batch))))
            next_states = Variable(torch.FloatTensor(list(map(itemgetter(3), batch))))
            terminates = Variable(torch.FloatTensor(list(map(itemgetter(4), batch))))    # true=1
            optimizer.zero_grad()
            # TD target; detach so no gradient flows into the target network
            y = rewards + GAMMA * Q_bar(next_states).detach().max(dim=1)[0] * (1-terminates)
            # Q-value of the action actually taken, flattened to (BATCH_SIZE,) so that
            # it is not broadcast against y on PyTorch versions with broadcasting
            y_h = Q(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            loss = criterion(y_h, y)
            loss.backward()
            optimizer.step()
        if step % UPDATE_BY == 0:
            Q_bar = copy.deepcopy(Q)
        step += 1
        state = next_state
    steps.append(step)
print(steps)    # max=200

[12, 17, 120, 17, 17, 70, 139, 43, 45, 144, 30, 51, 117, 133, 173, 200, 139, 200, 186, 200, 111, 162, 200, 74, 200, 161, 200, 166, 191, 200, 200, 197, 200, 197, 200, 200, 200, 198, 11, 186, 196, 200, 200, 200, 200, 181, 182, 200, 200, 178, 200, 200, 200, 179, 186, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 191, 200, 200, 200, 200, 200, 189, 196, 200, 200, 195, 188, 200, 200, 200, 200, 200, 198, 200, 200, 200, 181, 191, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200]
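
To see how quickly exploration fades during training, the sketch below evaluates the same epsilon schedule as the training loop at a few episodes. `epsilon_for` is a hypothetical helper name; the float division is written out explicitly so the snippet does not depend on `from __future__ import division`.

```python
def epsilon_for(episode):
    # Same formula as the training loop: 1 / (episode / 10 + 1)
    return 1.0 / ((episode / 10.0) + 1.0)


for episode in (0, 10, 40, 90):
    print(episode, round(epsilon_for(episode), 3))
```

Under this schedule the agent acts fully at random in the first episode, half of the time at episode 10, and about one step in ten by episode 90.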


In [6]:
# Test
env = gym.wrappers.Monitor(env, directory='gym-results/', force=True)
rewards = []
for episode in range(100):
    state = env.reset()
    total_reward = 0    # reward accumulated over the episode
    while True:
        env.render()
        _state = Variable(torch.FloatTensor(state)).unsqueeze(0)
        action = np.argmax(Q(_state).data.numpy())    # greedy action, no exploration
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            rewards.append(total_reward)
            break
print(np.mean(rewards))
env.render(close=True)
env.close()

[2017-06-14 15:47:38,230] Clearing 4 monitor files from previous run (because force=True was provided)
[2017-06-14 15:47:38,233] Starting new video recorder writing to gym-results/openaigym.video.0.12296.video000000.mp4
[2017-06-14 15:47:45,661] Starting new video recorder writing to gym-results/openaigym.video.0.12296.video000001.mp4
[2017-06-14 15:48:12,391] Starting new video recorder writing to gym-results/openaigym.video.0.12296.video000008.mp4
[2017-06-14 15:49:19,305] Starting new video recorder writing to gym-results/openaigym.video.0.12296.video000027.mp4
[2017-06-14 15:51:26,287] Starting new video recorder writing to gym-results/openaigym.video.0.12296.video000064.mp4
[2017-06-14 15:53:30,049] Finished writing results. You can upload them to the scoreboard via gym.upload('gym-results')


200.0


In [7]:
gym.upload('gym-results/', api_key='open_ai_gym_key')

[2017-06-14 15:53:43,066] [CartPole-v0] Uploading 100 episodes of training data
[2017-06-14 15:53:45,227] [CartPole-v0] Uploading videos of 5 training episodes (53574 bytes)
[2017-06-14 15:53:46,013] [CartPole-v0] Creating evaluation object from gym-results/ with learning curve and training video
[2017-06-14 15:53:46,626] 
****************************************************
You successfully uploaded your evaluation on CartPole-v0 to
OpenAI Gym! You can find it at:

    https://gym.openai.com/evaluations/eval_ta7gMWo3R4aZdMBEmZjeqA

****************************************************
