# Sequential DQN

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import gym
import numpy as np

In [2]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.10.0
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

>**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [3]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m


We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [4]:
state = env.reset()
for _ in range(10):
    # env.render()
    action = env.action_space.sample()
    next_state, reward, done, info = env.step(action) # take a random action
    print('state, action, next_state, reward, done, info:', state, action, next_state, reward, done, info)
    state = next_state
    if done:
        state = env.reset()

state, action, next_state, reward, done, info: [ 0.00848692 -0.01232851 -0.03539592  0.04655268] 0 [ 0.00824035 -0.2069255  -0.03446486  0.3278611 ] 1.0 False {}
state, action, next_state, reward, done, info: [ 0.00824035 -0.2069255  -0.03446486  0.3278611 ] 1 [ 0.00410184 -0.01133027 -0.02790764  0.02451182] 1.0 False {}
state, action, next_state, reward, done, info: [ 0.00410184 -0.01133027 -0.02790764  0.02451182] 1 [ 0.00387524  0.18418056 -0.02741741 -0.276844  ] 1.0 False {}
state, action, next_state, reward, done, info: [ 0.00387524  0.18418056 -0.02741741 -0.276844  ] 0 [ 0.00755885 -0.01053973 -0.03295429  0.00706695] 1.0 False {}
state, action, next_state, reward, done, info: [ 0.00755885 -0.01053973 -0.03295429  0.00706695] 1 [ 0.00734806  0.18503896 -0.03281295 -0.2958286 ] 1.0 False {}
state, action, next_state, reward, done, info: [ 0.00734806  0.18503896 -0.03281295 -0.2958286 ] 1 [ 0.01104883  0.38061295 -0.03872952 -0.59867696] 1.0 False {}
state, action, next_state, r

To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [5]:
# print(rewards[-20:])
# print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
# print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
# print(np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
# print(np.max(np.array(rewards)), np.min(np.array(rewards)))
# print(np.max(np.array(states)), np.min(np.array(states)))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [6]:
def model_input(state_size, lstm_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    # RNN
    gru = tf.nn.rnn_cell.GRUCell(lstm_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    initial_state = cell.zero_state(batch_size, tf.float32)
    return states, actions, targetQs, cell, initial_state

In [7]:
# RNN generator or sequence generator
def generator(states, initial_state, cell, lstm_size, num_classes, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [8]:
def model_loss(action_size, hidden_size, states, cell, initial_state, actions, targetQs):
    actions_logits, final_state = generator(states=states, cell=cell, initial_state=initial_state, 
                                            lstm_size=hidden_size, num_classes=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, final_state, loss

In [9]:
def model_opt(loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    # #opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    grads = tf.gradients(loss, g_vars)
    opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))

    return opt

In [10]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cell, self.initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.final_state, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, cell=cell, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [11]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx], [self.states[ii] for ii in idx]

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [12]:
# print('state:', np.array(states).shape[1], 
#       'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

In [14]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
action_size = 2
state_size = 4
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity - 1000 DQN
batch_size = 128               # experience mini-batch size - 20 DQN
gamma = 0.99                   # future reward discount

In [15]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 2)


## Populate the memory (exprience memory)

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [16]:
state = env.reset()
for _ in range(batch_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done is True:
        state = env.reset()

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [None]:
saver = tf.train.Saver()
episode_rewards_list, rewards_list, loss_list = [], [], []

# TF session for training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        state = env.reset()
        initial_state = sess.run(model.initial_state)

        # Training steps/batches
        while True:
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.initial_state: initial_state})
            # Explore (Env) or Exploit (Model): NO
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            # if explore_p > np.random.rand():
            #     action = env.action_space.sample()
            # else:
            action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append([initial_state, final_state])
            total_reward += reward
            initial_state = final_state
            state = next_state

            # Training
            #batch, rnn_states = memory.sample(batch_size)
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            rnn_states = memory.states
            initial_states = np.array([each[0] for each in rnn_states])
            final_states = np.array([each[1] for each in rnn_states])
            next_actions_logits = sess.run(model.actions_logits, 
                                           feed_dict = {model.states: next_states, 
                                                        model.initial_state: final_states[0].reshape([1, -1])})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                     model.actions: actions,
                                                                     model.targetQs: targetQs,
                                                        model.initial_state: initial_states[0].reshape([1, -1])})
            loss_batch.append(loss)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'loss:{:.4f}'.format(np.mean(loss_batch)),
              'exploreP:{:.4f}'.format(explore_p))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        loss_list.append([ep, np.mean(loss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:12.0000 R:12.0000 loss:0.8702 exploreP:0.9988
Episode:1 meanR:32.0000 R:52.0000 loss:0.9405 exploreP:0.9937
Episode:2 meanR:45.6667 R:73.0000 loss:1.3626 exploreP:0.9865
Episode:3 meanR:42.7500 R:34.0000 loss:1.7622 exploreP:0.9832
Episode:4 meanR:40.2000 R:30.0000 loss:1.7501 exploreP:0.9803
Episode:5 meanR:40.0000 R:39.0000 loss:1.7121 exploreP:0.9765
Episode:6 meanR:38.1429 R:27.0000 loss:1.8985 exploreP:0.9739
Episode:7 meanR:37.3750 R:32.0000 loss:2.1310 exploreP:0.9708
Episode:8 meanR:36.5556 R:30.0000 loss:2.4070 exploreP:0.9680
Episode:9 meanR:35.2000 R:23.0000 loss:2.6545 exploreP:0.9658
Episode:10 meanR:34.3636 R:26.0000 loss:2.9866 exploreP:0.9633
Episode:11 meanR:33.0000 R:18.0000 loss:3.2750 exploreP:0.9616
Episode:12 meanR:31.8462 R:18.0000 loss:3.5348 exploreP:0.9599
Episode:13 meanR:30.9286 R:19.0000 loss:3.9222 exploreP:0.9580
Episode:14 meanR:30.1333 R:19.0000 loss:4.1273 exploreP:0.9562
Episode:15 meanR:29.3750 R:18.0000 loss:4.3936 exploreP:0.9545
Ep

Episode:129 meanR:19.6600 R:17.0000 loss:7.0923 exploreP:0.7686
Episode:130 meanR:19.4000 R:14.0000 loss:6.9766 exploreP:0.7676
Episode:131 meanR:19.2600 R:16.0000 loss:7.1571 exploreP:0.7664
Episode:132 meanR:19.1300 R:18.0000 loss:7.1388 exploreP:0.7650
Episode:133 meanR:18.9800 R:22.0000 loss:6.8797 exploreP:0.7633
Episode:134 meanR:18.9000 R:19.0000 loss:6.4964 exploreP:0.7619
Episode:135 meanR:18.7900 R:25.0000 loss:6.1994 exploreP:0.7600
Episode:136 meanR:18.6800 R:23.0000 loss:6.5970 exploreP:0.7583
Episode:137 meanR:18.6000 R:22.0000 loss:6.8233 exploreP:0.7567
Episode:138 meanR:18.5200 R:21.0000 loss:7.2645 exploreP:0.7551
Episode:139 meanR:18.5100 R:24.0000 loss:7.6009 exploreP:0.7533
Episode:140 meanR:18.4800 R:19.0000 loss:7.9517 exploreP:0.7519
Episode:141 meanR:18.4300 R:20.0000 loss:8.3710 exploreP:0.7504
Episode:142 meanR:18.3900 R:28.0000 loss:7.8867 exploreP:0.7483
Episode:143 meanR:18.3900 R:30.0000 loss:7.1859 exploreP:0.7461
Episode:144 meanR:18.3400 R:41.0000 loss

Episode:257 meanR:36.2400 R:20.0000 loss:18.1862 exploreP:0.5058
Episode:258 meanR:36.3700 R:43.0000 loss:21.7344 exploreP:0.5036
Episode:259 meanR:37.1200 R:107.0000 loss:39.5012 exploreP:0.4984
Episode:260 meanR:38.0200 R:121.0000 loss:7.4086 exploreP:0.4925
Episode:261 meanR:39.3800 R:164.0000 loss:5.2127 exploreP:0.4847
Episode:262 meanR:41.1000 R:206.0000 loss:4.4917 exploreP:0.4750
Episode:263 meanR:41.7500 R:104.0000 loss:6.6695 exploreP:0.4702
Episode:264 meanR:42.2900 R:91.0000 loss:5.8990 exploreP:0.4660
Episode:265 meanR:42.5000 R:52.0000 loss:4.2510 exploreP:0.4636
Episode:266 meanR:42.3300 R:28.0000 loss:9.8185 exploreP:0.4624
Episode:267 meanR:41.9400 R:18.0000 loss:16.9045 exploreP:0.4616
Episode:268 meanR:41.8500 R:12.0000 loss:23.1944 exploreP:0.4610
Episode:269 meanR:42.7200 R:109.0000 loss:14.8802 exploreP:0.4561
Episode:270 meanR:43.4400 R:87.0000 loss:4.4321 exploreP:0.4523
Episode:271 meanR:44.5500 R:130.0000 loss:3.4473 exploreP:0.4466
Episode:272 meanR:45.1900 R

Episode:382 meanR:231.9000 R:500.0000 loss:5.1639 exploreP:0.0490
Episode:383 meanR:234.6800 R:396.0000 loss:19.9895 exploreP:0.0475
Episode:384 meanR:239.1600 R:500.0000 loss:6.4433 exploreP:0.0457
Episode:385 meanR:244.0700 R:500.0000 loss:16.0802 exploreP:0.0440
Episode:386 meanR:246.6000 R:263.0000 loss:30.0700 exploreP:0.0431
Episode:387 meanR:245.0700 R:29.0000 loss:63.2757 exploreP:0.0430
Episode:388 meanR:244.1200 R:113.0000 loss:72.0559 exploreP:0.0426
Episode:389 meanR:247.8500 R:500.0000 loss:8.8695 exploreP:0.0410
Episode:390 meanR:252.5700 R:500.0000 loss:15.3101 exploreP:0.0395
Episode:391 meanR:255.9900 R:368.0000 loss:23.0226 exploreP:0.0384
Episode:392 meanR:260.7600 R:500.0000 loss:7.4707 exploreP:0.0371
Episode:393 meanR:262.8200 R:500.0000 loss:15.5268 exploreP:0.0357
Episode:394 meanR:262.5500 R:158.0000 loss:47.0453 exploreP:0.0353
Episode:395 meanR:265.6400 R:500.0000 loss:4.0197 exploreP:0.0341
Episode:396 meanR:266.4600 R:109.0000 loss:61.5701 exploreP:0.0338
E

Episode:505 meanR:455.4100 R:473.0000 loss:19.6434 exploreP:0.0102
Episode:506 meanR:455.4100 R:500.0000 loss:3.4267 exploreP:0.0102
Episode:507 meanR:455.4100 R:500.0000 loss:17.6960 exploreP:0.0102
Episode:508 meanR:455.4100 R:500.0000 loss:18.9431 exploreP:0.0102
Episode:509 meanR:455.4100 R:500.0000 loss:17.6513 exploreP:0.0102
Episode:510 meanR:454.3900 R:27.0000 loss:72.0261 exploreP:0.0102
Episode:511 meanR:449.5600 R:17.0000 loss:140.2437 exploreP:0.0102
Episode:512 meanR:444.6900 R:13.0000 loss:206.4319 exploreP:0.0102
Episode:513 meanR:441.6700 R:198.0000 loss:64.6669 exploreP:0.0102
Episode:514 meanR:441.6700 R:500.0000 loss:2.3294 exploreP:0.0101
Episode:515 meanR:443.8900 R:500.0000 loss:6.1226 exploreP:0.0101
Episode:516 meanR:443.8900 R:500.0000 loss:15.1049 exploreP:0.0101
Episode:517 meanR:447.4500 R:500.0000 loss:16.4297 exploreP:0.0101
Episode:518 meanR:447.4500 R:500.0000 loss:16.5606 exploreP:0.0101
Episode:519 meanR:447.4500 R:500.0000 loss:16.6599 exploreP:0.0101

Episode:629 meanR:344.4300 R:500.0000 loss:17.2327 exploreP:0.0100
Episode:630 meanR:346.4400 R:500.0000 loss:16.0470 exploreP:0.0100
Episode:631 meanR:348.7800 R:500.0000 loss:16.0464 exploreP:0.0100
Episode:632 meanR:351.1200 R:500.0000 loss:20.2337 exploreP:0.0100
Episode:633 meanR:351.4800 R:500.0000 loss:15.5130 exploreP:0.0100
Episode:634 meanR:354.0900 R:500.0000 loss:1.1226 exploreP:0.0100
Episode:635 meanR:354.0900 R:500.0000 loss:17.3331 exploreP:0.0100
Episode:636 meanR:356.5600 R:500.0000 loss:13.6065 exploreP:0.0100
Episode:637 meanR:358.3000 R:500.0000 loss:14.3423 exploreP:0.0100
Episode:638 meanR:360.1900 R:500.0000 loss:14.9160 exploreP:0.0100
Episode:639 meanR:358.5700 R:13.0000 loss:68.6695 exploreP:0.0100
Episode:640 meanR:358.3100 R:255.0000 loss:40.7770 exploreP:0.0100
Episode:641 meanR:354.9800 R:106.0000 loss:0.8269 exploreP:0.0100
Episode:642 meanR:354.9600 R:107.0000 loss:26.4539 exploreP:0.0100
Episode:643 meanR:354.9200 R:146.0000 loss:24.1882 exploreP:0.010

Episode:752 meanR:318.4200 R:12.0000 loss:77.9524 exploreP:0.0100
Episode:753 meanR:318.4200 R:500.0000 loss:10.9067 exploreP:0.0100
Episode:754 meanR:318.4200 R:500.0000 loss:7.2594 exploreP:0.0100
Episode:755 meanR:316.3600 R:294.0000 loss:29.2192 exploreP:0.0100
Episode:756 meanR:315.8300 R:209.0000 loss:1.7057 exploreP:0.0100
Episode:757 meanR:315.4700 R:119.0000 loss:1.0198 exploreP:0.0100
Episode:758 meanR:315.0400 R:97.0000 loss:1.7643 exploreP:0.0100
Episode:759 meanR:314.7000 R:72.0000 loss:2.7305 exploreP:0.0100
Episode:760 meanR:314.4200 R:74.0000 loss:6.4882 exploreP:0.0100
Episode:761 meanR:313.5500 R:18.0000 loss:8.8507 exploreP:0.0100
Episode:762 meanR:312.6600 R:14.0000 loss:22.0991 exploreP:0.0100
Episode:763 meanR:311.8300 R:15.0000 loss:23.2943 exploreP:0.0100
Episode:764 meanR:311.5200 R:69.0000 loss:9.3641 exploreP:0.0100
Episode:765 meanR:315.4700 R:500.0000 loss:4.1231 exploreP:0.0100
Episode:766 meanR:319.4800 R:500.0000 loss:14.1218 exploreP:0.0100
Episode:767 

Episode:876 meanR:348.8000 R:500.0000 loss:16.8002 exploreP:0.0100
Episode:877 meanR:348.8000 R:500.0000 loss:11.5839 exploreP:0.0100
Episode:878 meanR:348.8000 R:500.0000 loss:14.0821 exploreP:0.0100
Episode:879 meanR:348.8000 R:500.0000 loss:18.4423 exploreP:0.0100
Episode:880 meanR:348.8000 R:500.0000 loss:16.4679 exploreP:0.0100
Episode:881 meanR:348.8000 R:500.0000 loss:15.1849 exploreP:0.0100
Episode:882 meanR:348.8000 R:500.0000 loss:14.9894 exploreP:0.0100
Episode:883 meanR:348.8000 R:500.0000 loss:13.4257 exploreP:0.0100
Episode:884 meanR:348.8000 R:500.0000 loss:14.1580 exploreP:0.0100
Episode:885 meanR:348.8000 R:500.0000 loss:20.3774 exploreP:0.0100
Episode:886 meanR:352.4000 R:500.0000 loss:18.0469 exploreP:0.0100
Episode:887 meanR:355.8200 R:500.0000 loss:13.2041 exploreP:0.0100
Episode:888 meanR:355.8200 R:500.0000 loss:20.7753 exploreP:0.0100
Episode:889 meanR:356.6400 R:500.0000 loss:17.1621 exploreP:0.0100
Episode:890 meanR:358.6200 R:500.0000 loss:16.5067 exploreP:0.

Episode:1000 meanR:388.2000 R:192.0000 loss:3.4871 exploreP:0.0100
Episode:1001 meanR:384.9000 R:170.0000 loss:2.0874 exploreP:0.0100
Episode:1002 meanR:381.7900 R:189.0000 loss:0.9853 exploreP:0.0100
Episode:1003 meanR:379.2100 R:242.0000 loss:0.5880 exploreP:0.0100
Episode:1004 meanR:376.6900 R:248.0000 loss:4.2010 exploreP:0.0100
Episode:1005 meanR:374.3100 R:262.0000 loss:6.2375 exploreP:0.0100
Episode:1006 meanR:372.1200 R:281.0000 loss:5.3078 exploreP:0.0100
Episode:1007 meanR:370.0200 R:290.0000 loss:0.7103 exploreP:0.0100
Episode:1008 meanR:369.1700 R:415.0000 loss:0.8168 exploreP:0.0100
Episode:1009 meanR:365.2800 R:111.0000 loss:7.5425 exploreP:0.0100
Episode:1010 meanR:365.2800 R:500.0000 loss:8.0406 exploreP:0.0100
Episode:1011 meanR:365.2800 R:500.0000 loss:16.6541 exploreP:0.0100
Episode:1012 meanR:365.2800 R:500.0000 loss:11.9853 exploreP:0.0100
Episode:1013 meanR:361.9100 R:97.0000 loss:50.9032 exploreP:0.0100
Episode:1014 meanR:361.9100 R:500.0000 loss:5.4606 exploreP:

Episode:1122 meanR:375.4500 R:500.0000 loss:19.1108 exploreP:0.0100
Episode:1123 meanR:377.5600 R:500.0000 loss:21.6712 exploreP:0.0100
Episode:1124 meanR:378.8500 R:500.0000 loss:21.1641 exploreP:0.0100
Episode:1125 meanR:381.5200 R:500.0000 loss:14.2696 exploreP:0.0100
Episode:1126 meanR:381.5200 R:500.0000 loss:19.9728 exploreP:0.0100
Episode:1127 meanR:381.5200 R:500.0000 loss:13.1542 exploreP:0.0100
Episode:1128 meanR:381.5200 R:500.0000 loss:16.9659 exploreP:0.0100
Episode:1129 meanR:381.5200 R:500.0000 loss:21.1708 exploreP:0.0100
Episode:1130 meanR:383.6900 R:500.0000 loss:19.1455 exploreP:0.0100
Episode:1131 meanR:385.6100 R:500.0000 loss:17.0294 exploreP:0.0100
Episode:1132 meanR:388.5000 R:500.0000 loss:15.7526 exploreP:0.0100
Episode:1133 meanR:392.2000 R:500.0000 loss:14.2289 exploreP:0.0100
Episode:1134 meanR:395.9700 R:500.0000 loss:17.7862 exploreP:0.0100
Episode:1135 meanR:395.9700 R:500.0000 loss:16.2161 exploreP:0.0100
Episode:1136 meanR:395.9700 R:500.0000 loss:16.2

Episode:1243 meanR:420.2900 R:130.0000 loss:7.9710 exploreP:0.0100
Episode:1244 meanR:416.4900 R:120.0000 loss:11.9472 exploreP:0.0100
Episode:1245 meanR:412.5600 R:107.0000 loss:10.1851 exploreP:0.0100
Episode:1246 meanR:408.3700 R:81.0000 loss:9.9882 exploreP:0.0100
Episode:1247 meanR:408.3700 R:500.0000 loss:15.7607 exploreP:0.0100
Episode:1248 meanR:408.3700 R:500.0000 loss:25.6031 exploreP:0.0100
Episode:1249 meanR:408.3700 R:500.0000 loss:16.3267 exploreP:0.0100
Episode:1250 meanR:408.3700 R:500.0000 loss:14.5843 exploreP:0.0100
Episode:1251 meanR:408.3700 R:500.0000 loss:14.9271 exploreP:0.0100
Episode:1252 meanR:408.3700 R:500.0000 loss:15.5738 exploreP:0.0100
Episode:1253 meanR:408.3700 R:500.0000 loss:15.4466 exploreP:0.0100
Episode:1254 meanR:408.3700 R:500.0000 loss:15.0850 exploreP:0.0100
Episode:1255 meanR:408.3700 R:500.0000 loss:14.6999 exploreP:0.0100
Episode:1256 meanR:408.3700 R:500.0000 loss:14.7835 exploreP:0.0100
Episode:1257 meanR:408.3700 R:500.0000 loss:14.2647

Episode:1364 meanR:497.8200 R:500.0000 loss:15.9479 exploreP:0.0100
Episode:1365 meanR:497.8200 R:500.0000 loss:16.1805 exploreP:0.0100
Episode:1366 meanR:497.8200 R:500.0000 loss:17.3289 exploreP:0.0100
Episode:1367 meanR:497.8200 R:500.0000 loss:17.3300 exploreP:0.0100
Episode:1368 meanR:497.8200 R:500.0000 loss:18.2305 exploreP:0.0100
Episode:1369 meanR:497.8200 R:500.0000 loss:16.0975 exploreP:0.0100
Episode:1370 meanR:497.8200 R:500.0000 loss:19.5856 exploreP:0.0100
Episode:1371 meanR:497.8200 R:500.0000 loss:17.9754 exploreP:0.0100
Episode:1372 meanR:497.8200 R:500.0000 loss:16.2737 exploreP:0.0100
Episode:1373 meanR:497.8200 R:500.0000 loss:13.8482 exploreP:0.0100
Episode:1374 meanR:497.8200 R:500.0000 loss:18.1417 exploreP:0.0100
Episode:1375 meanR:497.8200 R:500.0000 loss:16.2270 exploreP:0.0100
Episode:1376 meanR:497.8200 R:500.0000 loss:13.8687 exploreP:0.0100
Episode:1377 meanR:492.9400 R:12.0000 loss:72.0214 exploreP:0.0100
Episode:1378 meanR:488.0600 R:12.0000 loss:139.08

Episode:1486 meanR:361.5600 R:500.0000 loss:15.9374 exploreP:0.0100
Episode:1487 meanR:361.5600 R:500.0000 loss:16.3641 exploreP:0.0100
Episode:1488 meanR:361.5600 R:500.0000 loss:15.0065 exploreP:0.0100
Episode:1489 meanR:361.5600 R:500.0000 loss:8.4297 exploreP:0.0100
Episode:1490 meanR:361.5600 R:500.0000 loss:11.6695 exploreP:0.0100
Episode:1491 meanR:361.5600 R:500.0000 loss:17.0128 exploreP:0.0100
Episode:1492 meanR:361.5600 R:500.0000 loss:16.6813 exploreP:0.0100
Episode:1493 meanR:361.5600 R:500.0000 loss:8.4034 exploreP:0.0100
Episode:1494 meanR:361.5600 R:500.0000 loss:14.9061 exploreP:0.0100
Episode:1495 meanR:361.5600 R:500.0000 loss:21.5848 exploreP:0.0100
Episode:1496 meanR:361.5600 R:500.0000 loss:17.8625 exploreP:0.0100
Episode:1497 meanR:361.5600 R:500.0000 loss:17.0074 exploreP:0.0100
Episode:1498 meanR:361.5600 R:500.0000 loss:18.2579 exploreP:0.0100
Episode:1499 meanR:361.5600 R:500.0000 loss:9.0868 exploreP:0.0100
Episode:1500 meanR:361.5600 R:500.0000 loss:20.1733

Episode:1608 meanR:409.3400 R:500.0000 loss:22.9602 exploreP:0.0100
Episode:1609 meanR:409.3400 R:500.0000 loss:20.9245 exploreP:0.0100
Episode:1610 meanR:411.4400 R:386.0000 loss:24.7458 exploreP:0.0100
Episode:1611 meanR:415.2700 R:500.0000 loss:8.6189 exploreP:0.0100
Episode:1612 meanR:420.1700 R:500.0000 loss:16.7138 exploreP:0.0100
Episode:1613 meanR:417.7600 R:259.0000 loss:28.3371 exploreP:0.0100
Episode:1614 meanR:417.7600 R:500.0000 loss:4.3560 exploreP:0.0100
Episode:1615 meanR:414.9800 R:222.0000 loss:31.0271 exploreP:0.0100
Episode:1616 meanR:411.7500 R:177.0000 loss:16.5595 exploreP:0.0100
Episode:1617 meanR:408.2200 R:147.0000 loss:7.8434 exploreP:0.0100
Episode:1618 meanR:404.9000 R:168.0000 loss:3.2042 exploreP:0.0100
Episode:1619 meanR:401.6700 R:177.0000 loss:1.8849 exploreP:0.0100
Episode:1620 meanR:397.8300 R:116.0000 loss:0.6435 exploreP:0.0100
Episode:1621 meanR:393.8700 R:104.0000 loss:0.5292 exploreP:0.0100
Episode:1622 meanR:389.8000 R:93.0000 loss:0.5225 explo

Episode:1730 meanR:382.5500 R:288.0000 loss:11.4155 exploreP:0.0100
Episode:1731 meanR:386.0500 R:393.0000 loss:1.4831 exploreP:0.0100
Episode:1732 meanR:389.0800 R:315.0000 loss:11.8819 exploreP:0.0100
Episode:1733 meanR:393.9600 R:500.0000 loss:0.8449 exploreP:0.0100
Episode:1734 meanR:396.3100 R:245.0000 loss:28.5856 exploreP:0.0100
Episode:1735 meanR:396.4900 R:126.0000 loss:24.9552 exploreP:0.0100
Episode:1736 meanR:394.6900 R:320.0000 loss:15.0690 exploreP:0.0100
Episode:1737 meanR:395.4800 R:500.0000 loss:0.9205 exploreP:0.0100
Episode:1738 meanR:398.4200 R:500.0000 loss:15.2213 exploreP:0.0100
Episode:1739 meanR:401.0000 R:500.0000 loss:14.5436 exploreP:0.0100
Episode:1740 meanR:400.0900 R:244.0000 loss:32.9338 exploreP:0.0100
Episode:1741 meanR:402.7000 R:500.0000 loss:0.8919 exploreP:0.0100
Episode:1742 meanR:405.8600 R:500.0000 loss:7.5784 exploreP:0.0100
Episode:1743 meanR:405.8600 R:500.0000 loss:12.7133 exploreP:0.0100
Episode:1744 meanR:405.9000 R:469.0000 loss:15.9536 e

Episode:1852 meanR:409.2600 R:500.0000 loss:14.5520 exploreP:0.0100
Episode:1853 meanR:412.7400 R:500.0000 loss:15.5814 exploreP:0.0100
Episode:1854 meanR:414.6900 R:500.0000 loss:14.6429 exploreP:0.0100
Episode:1855 meanR:415.5600 R:500.0000 loss:15.6408 exploreP:0.0100
Episode:1856 meanR:416.7100 R:500.0000 loss:15.6198 exploreP:0.0100
Episode:1857 meanR:415.0000 R:157.0000 loss:48.6374 exploreP:0.0100
Episode:1858 meanR:415.0000 R:500.0000 loss:6.1222 exploreP:0.0100
Episode:1859 meanR:419.3200 R:500.0000 loss:18.2836 exploreP:0.0100
Episode:1860 meanR:419.3200 R:500.0000 loss:16.9061 exploreP:0.0100
Episode:1861 meanR:419.3200 R:500.0000 loss:15.8860 exploreP:0.0100
Episode:1862 meanR:419.3200 R:500.0000 loss:15.3622 exploreP:0.0100
Episode:1863 meanR:421.6800 R:500.0000 loss:17.4563 exploreP:0.0100
Episode:1864 meanR:424.0700 R:500.0000 loss:17.6931 exploreP:0.0100
Episode:1865 meanR:423.4400 R:281.0000 loss:28.6911 exploreP:0.0100
Episode:1866 meanR:420.0900 R:165.0000 loss:15.34

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

## Testing

Let's checkout how our trained agent plays the game.

In [184]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model-seq.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]
    
    # Episode/epoch
    for _ in range(1):
        state = env.reset()
        total_reward = 0
        
        # Steps/batches
        while True:
            env.render()
            action_logits, initial_state = sess.run([model.actions_logits, model.final_state],
                                                    feed_dict = {model.states: state.reshape([1, -1]), 
                                                                 model.initial_state: initial_state})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        # At the end of each episode
        print('total_reward:{}'.format(total_reward))

# Close the env
env.close()

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward:120.0


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.