# Sequential Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import gym
import numpy as np

In [2]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

>**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [4]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [5]:
env.reset()
for _ in range(10):
    # env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

state, action, reward, done, info: [ 0.01317684 -0.23364403  0.01789839  0.31985257] 0 1.0 False {}
state, action, reward, done, info: [ 0.00850396 -0.03878149  0.02429544  0.03286743] 1 1.0 False {}
state, action, reward, done, info: [ 0.00772833  0.15598378  0.02495279 -0.25205211] 1 1.0 False {}
state, action, reward, done, info: [ 0.010848   -0.03948543  0.01991175  0.0483958 ] 0 1.0 False {}
state, action, reward, done, info: [ 0.0100583   0.15534543  0.02087966 -0.23793889] 1 1.0 False {}
state, action, reward, done, info: [ 0.0131652   0.35016296  0.01612088 -0.52396332] 1 1.0 False {}
state, action, reward, done, info: [ 0.02016846  0.54505437  0.00564162 -0.81152311] 1 1.0 False {}
state, action, reward, done, info: [ 0.03106955  0.74009858 -0.01058884 -1.10242615] 1 1.0 False {}
state, action, reward, done, info: [ 0.04587152  0.93535822 -0.03263737 -1.39841225] 1 1.0 False {}
state, action, reward, done, info: [ 0.06457869  1.13087042 -0.06060561 -1.70111805] 1 1.0 False {}


To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [6]:
# print(rewards[-20:])
# print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
# print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
# print(np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
# print(np.max(np.array(rewards)), np.min(np.array(rewards)))
# print(np.max(np.array(states)), np.min(np.array(states)))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [16]:
def model_input(state_size):
    actions = tf.placeholder(tf.int32, [None], name='actions')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    labelQs = tf.placeholder(tf.float32, [None], name='labelQs')
    return actions, states, targetQs, labelQs

In [17]:
# Generator: Generating/prediting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits

In [32]:
def model_loss(action_size, hidden_size, states, actions, targetQs, labelQs):
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    lossQtgt = tf.reduce_mean(tf.square(Qs - targetQs)) # next state, next action and nextQs
    lossQlbl = tf.reduce_mean(tf.square(Qs - labelQs)) # current state, action, and currentQs
    # lossQtgt_sigm = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs, 
    #                                                                        labels=tf.nn.sigmoid(targetQs)))
    # lossQlbl_sigm = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs,
    #                                                                        labels=tf.nn.sigmoid(labelQs)))
    loss = lossQtgt + lossQlbl #+ lossQtgt_sigm + lossQlbl_sigm
    return actions_logits, loss, lossQtgt, lossQlbl

In [33]:
def model_opt(loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    return opt

In [34]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.actions, self.states, self.targetQs, self.labelQs = model_input(state_size=state_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.loss, self.lossQtgt, self.lossQlbl = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, labelQs=self.labelQs)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [35]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [36]:
# print('state:', np.array(states).shape[1], 
#       'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

In [37]:
# Training parameters
# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
action_size = 2                # number of units for the output actions -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation
batch_size = 32                # number of samples in the memory/ experience as mini-batch size
learning_rate = 0.001          # learning rate for adam

In [38]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

## Populate the memory (exprience memory)

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [39]:
state = env.reset()
for _ in range(batch_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done is True:
        state = env.reset()

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [40]:
memory.buffer[0]

[array([-0.02798324, -0.02948733,  0.01148243, -0.01523122]),
 1,
 array([-0.02857298,  0.16546808,  0.01117781, -0.3042693 ]),
 1.0,
 0.0]

In [41]:
# states, rewards, actions

In [None]:
# Now train with experiences
saver = tf.train.Saver() # save the trained model
rewards_list, loss_list = [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    episode_reward = deque(maxlen=batch_size)
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        state = env.reset()

        # Training steps/batches
        while True:
            # Testing
            action_logits = sess.run(model.actions_logits, feed_dict = {model.states: state.reshape([1, -1])})
            action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            total_reward += reward
            state = next_state

            # Training
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            actions_logits = sess.run(model.actions_logits, feed_dict = {model.states: states})
            labelQs = np.max(actions_logits, axis=1) # explore
            next_actions_logits = sess.run(model.actions_logits, feed_dict = {model.states: next_states})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones) # exploit
            targetQs = rewards + (0.99 * nextQs)
            loss, _, lossQlbl, lossQtgt = sess.run([model.loss, model.opt, model.lossQlbl, model.lossQtgt], 
                                                   feed_dict = {model.states: states,
                                                                model.actions: actions,
                                                                model.targetQs: targetQs,
                                                                model.labelQs: labelQs})
            loss_batch.append(loss)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode: {}'.format(ep),
              'meanReward: {:.4f}'.format(np.mean(episode_reward)),
              'meanLoss: {:.4f}'.format(np.mean(loss_batch)))
        rewards_list.append([ep, np.mean(episode_reward)])
        loss_list.append([ep, np.mean(loss_batch)])
        if(np.mean(episode_reward) >= 500):
            break
    
    saver.save(sess, 'checkpoints/model5.ckpt')

Episode: 0 meanReward: 10.0000 meanLoss: 1.3356
Episode: 1 meanReward: 10.0000 meanLoss: 1.9251
Episode: 2 meanReward: 9.3333 meanLoss: 2.6823
Episode: 3 meanReward: 9.5000 meanLoss: 4.3664
Episode: 4 meanReward: 9.6000 meanLoss: 7.8865
Episode: 5 meanReward: 9.3333 meanLoss: 9.6882
Episode: 6 meanReward: 9.2857 meanLoss: 8.4829
Episode: 7 meanReward: 9.2500 meanLoss: 6.9393
Episode: 8 meanReward: 9.3333 meanLoss: 4.7404
Episode: 9 meanReward: 9.4000 meanLoss: 2.8284
Episode: 10 meanReward: 9.4545 meanLoss: 1.5039
Episode: 11 meanReward: 9.4167 meanLoss: 0.4788
Episode: 12 meanReward: 9.3846 meanLoss: 0.1064
Episode: 13 meanReward: 9.5000 meanLoss: 0.4145
Episode: 14 meanReward: 9.5333 meanLoss: 1.3311
Episode: 15 meanReward: 9.8125 meanLoss: 2.6349
Episode: 16 meanReward: 10.0000 meanLoss: 1.2052
Episode: 17 meanReward: 10.1111 meanLoss: 1.6110
Episode: 18 meanReward: 10.1053 meanLoss: 4.0876
Episode: 19 meanReward: 10.1500 meanLoss: 12.6177
Episode: 20 meanReward: 10.1905 meanLoss: 2

Episode: 173 meanReward: 9.7188 meanLoss: 0.0579
Episode: 174 meanReward: 9.7188 meanLoss: 0.0734
Episode: 175 meanReward: 9.7188 meanLoss: 0.0858
Episode: 176 meanReward: 9.6875 meanLoss: 0.0831
Episode: 177 meanReward: 9.6562 meanLoss: 0.1492
Episode: 178 meanReward: 9.6875 meanLoss: 0.3584
Episode: 179 meanReward: 9.6875 meanLoss: 0.5149
Episode: 180 meanReward: 9.6875 meanLoss: 0.1941
Episode: 181 meanReward: 9.6875 meanLoss: 0.1151
Episode: 182 meanReward: 9.6875 meanLoss: 0.1374
Episode: 183 meanReward: 9.6250 meanLoss: 0.1142
Episode: 184 meanReward: 9.5938 meanLoss: 0.1790
Episode: 185 meanReward: 9.5312 meanLoss: 0.3523
Episode: 186 meanReward: 9.5000 meanLoss: 0.5337
Episode: 187 meanReward: 9.5312 meanLoss: 0.3422
Episode: 188 meanReward: 9.5625 meanLoss: 0.1650
Episode: 189 meanReward: 9.5938 meanLoss: 0.1945
Episode: 190 meanReward: 9.5625 meanLoss: 0.2850
Episode: 191 meanReward: 9.5625 meanLoss: 0.4425
Episode: 192 meanReward: 9.5625 meanLoss: 0.4770
Episode: 193 meanRew

Episode: 342 meanReward: 19.4688 meanLoss: 0.5560
Episode: 343 meanReward: 19.5625 meanLoss: 0.3399
Episode: 344 meanReward: 20.2188 meanLoss: 0.4950
Episode: 345 meanReward: 21.1562 meanLoss: 0.5652
Episode: 346 meanReward: 21.7188 meanLoss: 1.0414
Episode: 347 meanReward: 22.5625 meanLoss: 1.3368
Episode: 348 meanReward: 23.6250 meanLoss: 1.1145
Episode: 349 meanReward: 24.8438 meanLoss: 0.9540
Episode: 350 meanReward: 24.7812 meanLoss: 1.4731
Episode: 351 meanReward: 25.5000 meanLoss: 0.6568
Episode: 352 meanReward: 29.2500 meanLoss: 3.2040
Episode: 353 meanReward: 29.6562 meanLoss: 8.2118
Episode: 354 meanReward: 32.5625 meanLoss: 9.3398
Episode: 355 meanReward: 36.9062 meanLoss: 53.2631
Episode: 356 meanReward: 36.7500 meanLoss: 2364.8015
Episode: 357 meanReward: 36.6875 meanLoss: 2893.9006
Episode: 358 meanReward: 36.7188 meanLoss: 1679.7695
Episode: 359 meanReward: 36.8125 meanLoss: 276.9999
Episode: 360 meanReward: 37.3125 meanLoss: 96.7212
Episode: 361 meanReward: 37.3125 mean

Episode: 495 meanReward: 211.5312 meanLoss: 32882.1406
Episode: 496 meanReward: 212.9062 meanLoss: 26972.2734
Episode: 497 meanReward: 211.0625 meanLoss: 34103.3242
Episode: 498 meanReward: 211.7500 meanLoss: 28977.3828
Episode: 499 meanReward: 211.4062 meanLoss: 33756.7383
Episode: 500 meanReward: 209.2812 meanLoss: 31179.1211
Episode: 501 meanReward: 209.4375 meanLoss: 33581.3320
Episode: 502 meanReward: 210.4688 meanLoss: 29425.5469
Episode: 503 meanReward: 210.9688 meanLoss: 28731.0312
Episode: 504 meanReward: 210.8750 meanLoss: 35471.9023
Episode: 505 meanReward: 210.5000 meanLoss: 33826.4336
Episode: 506 meanReward: 212.4375 meanLoss: 24869.0039
Episode: 507 meanReward: 210.3750 meanLoss: 33116.1250
Episode: 508 meanReward: 210.1562 meanLoss: 29249.7969
Episode: 509 meanReward: 209.6875 meanLoss: 35083.1094
Episode: 510 meanReward: 208.8438 meanLoss: 29219.6602
Episode: 511 meanReward: 210.0000 meanLoss: 28892.8418
Episode: 512 meanReward: 209.3125 meanLoss: 31208.8535
Episode: 5

Episode: 644 meanReward: 215.0000 meanLoss: 32813.3984
Episode: 645 meanReward: 215.4375 meanLoss: 28551.6973
Episode: 646 meanReward: 213.7188 meanLoss: 32488.8242
Episode: 647 meanReward: 213.0938 meanLoss: 32302.8359
Episode: 648 meanReward: 213.3438 meanLoss: 27946.0508
Episode: 649 meanReward: 213.1875 meanLoss: 33697.6797
Episode: 650 meanReward: 213.4375 meanLoss: 30912.5332
Episode: 651 meanReward: 211.0625 meanLoss: 33725.4375
Episode: 652 meanReward: 210.3125 meanLoss: 32967.1992
Episode: 653 meanReward: 209.0312 meanLoss: 34803.6055
Episode: 654 meanReward: 209.2188 meanLoss: 33543.2773
Episode: 655 meanReward: 209.4688 meanLoss: 28142.0508
Episode: 656 meanReward: 209.0000 meanLoss: 30733.3223
Episode: 657 meanReward: 209.0000 meanLoss: 33229.7344
Episode: 658 meanReward: 209.0000 meanLoss: 32485.0430
Episode: 659 meanReward: 208.3438 meanLoss: 33641.9883
Episode: 660 meanReward: 207.7188 meanLoss: 32648.2227
Episode: 661 meanReward: 207.2812 meanLoss: 32348.2734
Episode: 6

Episode: 793 meanReward: 212.2500 meanLoss: 32145.9570
Episode: 794 meanReward: 213.1250 meanLoss: 28936.9355
Episode: 795 meanReward: 213.8750 meanLoss: 31340.1445
Episode: 796 meanReward: 212.9688 meanLoss: 33353.3008
Episode: 797 meanReward: 213.0625 meanLoss: 32470.6328
Episode: 798 meanReward: 214.0625 meanLoss: 29666.7949
Episode: 799 meanReward: 214.6875 meanLoss: 28525.6309
Episode: 800 meanReward: 213.7188 meanLoss: 32044.3750
Episode: 801 meanReward: 214.1250 meanLoss: 26938.8906
Episode: 802 meanReward: 215.1250 meanLoss: 31254.3984
Episode: 803 meanReward: 213.6875 meanLoss: 33990.5273
Episode: 804 meanReward: 212.6875 meanLoss: 27250.7871
Episode: 805 meanReward: 213.5625 meanLoss: 27236.6211
Episode: 806 meanReward: 213.8125 meanLoss: 29630.2227
Episode: 807 meanReward: 213.1250 meanLoss: 30528.2383
Episode: 808 meanReward: 213.3750 meanLoss: 31752.4023
Episode: 809 meanReward: 213.0000 meanLoss: 33384.2500
Episode: 810 meanReward: 213.5000 meanLoss: 31275.4727
Episode: 8

Episode: 942 meanReward: 208.9062 meanLoss: 35977.8828
Episode: 943 meanReward: 209.2188 meanLoss: 31088.3613
Episode: 944 meanReward: 209.9062 meanLoss: 27982.0547
Episode: 945 meanReward: 210.2812 meanLoss: 29833.8086
Episode: 946 meanReward: 210.4062 meanLoss: 31809.2344
Episode: 947 meanReward: 211.6562 meanLoss: 28964.6953
Episode: 948 meanReward: 211.2500 meanLoss: 31347.3047
Episode: 949 meanReward: 209.8125 meanLoss: 34517.0547
Episode: 950 meanReward: 210.5000 meanLoss: 29852.1191
Episode: 951 meanReward: 211.3438 meanLoss: 29381.6777
Episode: 952 meanReward: 210.7188 meanLoss: 33677.6328
Episode: 953 meanReward: 211.0312 meanLoss: 29401.5508
Episode: 954 meanReward: 210.9062 meanLoss: 26841.9785
Episode: 955 meanReward: 210.7500 meanLoss: 33078.4648
Episode: 956 meanReward: 210.3125 meanLoss: 31985.2988
Episode: 957 meanReward: 209.3125 meanLoss: 31065.3906
Episode: 958 meanReward: 209.7500 meanLoss: 28796.6172
Episode: 959 meanReward: 210.7188 meanLoss: 30429.6875
Episode: 9

Episode: 1090 meanReward: 215.7188 meanLoss: 33642.2539
Episode: 1091 meanReward: 216.5938 meanLoss: 29204.3477
Episode: 1092 meanReward: 216.5938 meanLoss: 31836.4043
Episode: 1093 meanReward: 216.7188 meanLoss: 32513.0449
Episode: 1094 meanReward: 218.7500 meanLoss: 24670.0234
Episode: 1095 meanReward: 218.7188 meanLoss: 29000.3691
Episode: 1096 meanReward: 218.4062 meanLoss: 29643.8555
Episode: 1097 meanReward: 219.1562 meanLoss: 28744.4590
Episode: 1098 meanReward: 218.6875 meanLoss: 33732.5156
Episode: 1099 meanReward: 216.6562 meanLoss: 34072.9961
Episode: 1100 meanReward: 218.3750 meanLoss: 23787.7383
Episode: 1101 meanReward: 219.0625 meanLoss: 32179.6543
Episode: 1102 meanReward: 220.3750 meanLoss: 27807.4355
Episode: 1103 meanReward: 220.7812 meanLoss: 30837.0996
Episode: 1104 meanReward: 218.7812 meanLoss: 33240.5586
Episode: 1105 meanReward: 218.1875 meanLoss: 33085.6797
Episode: 1106 meanReward: 213.0938 meanLoss: 36050.0000
Episode: 1107 meanReward: 212.5625 meanLoss: 347

Episode: 1237 meanReward: 214.2500 meanLoss: 31265.0078
Episode: 1238 meanReward: 212.6875 meanLoss: 32968.4141
Episode: 1239 meanReward: 212.3438 meanLoss: 34883.1914
Episode: 1240 meanReward: 211.1875 meanLoss: 33550.1484
Episode: 1241 meanReward: 210.9375 meanLoss: 28403.3906
Episode: 1242 meanReward: 211.7500 meanLoss: 28065.9688
Episode: 1243 meanReward: 212.0938 meanLoss: 31516.8242
Episode: 1244 meanReward: 212.1875 meanLoss: 30004.6504
Episode: 1245 meanReward: 212.8125 meanLoss: 31215.5820
Episode: 1246 meanReward: 213.0000 meanLoss: 32754.9941
Episode: 1247 meanReward: 212.6562 meanLoss: 29042.7090
Episode: 1248 meanReward: 212.0625 meanLoss: 32506.8223
Episode: 1249 meanReward: 212.3438 meanLoss: 30546.5508
Episode: 1250 meanReward: 210.8750 meanLoss: 33007.3594
Episode: 1251 meanReward: 209.6562 meanLoss: 35219.3789
Episode: 1252 meanReward: 209.3438 meanLoss: 31546.7480
Episode: 1253 meanReward: 208.5625 meanLoss: 33292.1562
Episode: 1254 meanReward: 208.5312 meanLoss: 323

Episode: 1384 meanReward: 205.4375 meanLoss: 32464.5840
Episode: 1385 meanReward: 202.9375 meanLoss: 29131.1133
Episode: 1386 meanReward: 203.2812 meanLoss: 30234.0664
Episode: 1387 meanReward: 203.8750 meanLoss: 29359.2695
Episode: 1388 meanReward: 204.2500 meanLoss: 29990.9746
Episode: 1389 meanReward: 203.9688 meanLoss: 34897.4727
Episode: 1390 meanReward: 204.7500 meanLoss: 31505.3457
Episode: 1391 meanReward: 204.3125 meanLoss: 35415.4688
Episode: 1392 meanReward: 203.4688 meanLoss: 33399.9180
Episode: 1393 meanReward: 204.5938 meanLoss: 27234.4395
Episode: 1394 meanReward: 204.6250 meanLoss: 31722.4355
Episode: 1395 meanReward: 205.0000 meanLoss: 33720.0742
Episode: 1396 meanReward: 206.1875 meanLoss: 28745.3379
Episode: 1397 meanReward: 206.3438 meanLoss: 29089.1758
Episode: 1398 meanReward: 206.5312 meanLoss: 28135.0547
Episode: 1399 meanReward: 205.7188 meanLoss: 34148.5664
Episode: 1400 meanReward: 204.6562 meanLoss: 32667.7598
Episode: 1401 meanReward: 203.9688 meanLoss: 351

Episode: 1531 meanReward: 214.8125 meanLoss: 30149.1934
Episode: 1532 meanReward: 213.0938 meanLoss: 33225.0000
Episode: 1533 meanReward: 213.2188 meanLoss: 33147.7227
Episode: 1534 meanReward: 213.2812 meanLoss: 31035.0312
Episode: 1535 meanReward: 213.8125 meanLoss: 28451.2227
Episode: 1536 meanReward: 214.0625 meanLoss: 29911.1387
Episode: 1537 meanReward: 213.7500 meanLoss: 31406.9297
Episode: 1538 meanReward: 214.2812 meanLoss: 32953.5039
Episode: 1539 meanReward: 215.9375 meanLoss: 25538.9219
Episode: 1540 meanReward: 216.1875 meanLoss: 27247.2207
Episode: 1541 meanReward: 216.5938 meanLoss: 26326.3613
Episode: 1542 meanReward: 215.5625 meanLoss: 31780.3164
Episode: 1543 meanReward: 215.4375 meanLoss: 32708.2988
Episode: 1544 meanReward: 215.6562 meanLoss: 32837.4414
Episode: 1545 meanReward: 216.1250 meanLoss: 32318.7988
Episode: 1546 meanReward: 215.5000 meanLoss: 29394.5820
Episode: 1547 meanReward: 215.6250 meanLoss: 31188.6270
Episode: 1548 meanReward: 216.2812 meanLoss: 291

Episode: 1678 meanReward: 211.1562 meanLoss: 33844.3633
Episode: 1679 meanReward: 210.5625 meanLoss: 32767.2246
Episode: 1680 meanReward: 208.7812 meanLoss: 32780.3594
Episode: 1681 meanReward: 209.1875 meanLoss: 30572.4883
Episode: 1682 meanReward: 208.8438 meanLoss: 33382.7773
Episode: 1683 meanReward: 208.9375 meanLoss: 30269.5117
Episode: 1684 meanReward: 208.4688 meanLoss: 30487.3848
Episode: 1685 meanReward: 210.9062 meanLoss: 25091.9727
Episode: 1686 meanReward: 212.0938 meanLoss: 26030.4883
Episode: 1687 meanReward: 212.6250 meanLoss: 30213.6914
Episode: 1688 meanReward: 213.7812 meanLoss: 28271.7285
Episode: 1689 meanReward: 213.2188 meanLoss: 32912.3438
Episode: 1690 meanReward: 212.3438 meanLoss: 28066.6836
Episode: 1691 meanReward: 212.3125 meanLoss: 30575.1445
Episode: 1692 meanReward: 211.6562 meanLoss: 33201.2500
Episode: 1693 meanReward: 212.7188 meanLoss: 29643.6211
Episode: 1694 meanReward: 213.5000 meanLoss: 28780.9727
Episode: 1695 meanReward: 215.1875 meanLoss: 271

Episode: 1825 meanReward: 211.3125 meanLoss: 33535.0664
Episode: 1826 meanReward: 210.6250 meanLoss: 32164.1406
Episode: 1827 meanReward: 212.8750 meanLoss: 23968.8281
Episode: 1828 meanReward: 212.6250 meanLoss: 32524.8203
Episode: 1829 meanReward: 214.8125 meanLoss: 26043.9941
Episode: 1830 meanReward: 214.9688 meanLoss: 35194.0586
Episode: 1831 meanReward: 214.1562 meanLoss: 28350.8613
Episode: 1832 meanReward: 214.6250 meanLoss: 29594.8242
Episode: 1833 meanReward: 216.1875 meanLoss: 28315.9609
Episode: 1834 meanReward: 214.6250 meanLoss: 33998.1641
Episode: 1835 meanReward: 216.4375 meanLoss: 27616.2656
Episode: 1836 meanReward: 216.1875 meanLoss: 32143.3125
Episode: 1837 meanReward: 215.0312 meanLoss: 33195.7305
Episode: 1838 meanReward: 214.7188 meanLoss: 33598.9883
Episode: 1839 meanReward: 213.3125 meanLoss: 31311.2480
Episode: 1840 meanReward: 211.7188 meanLoss: 33143.3242
Episode: 1841 meanReward: 212.7812 meanLoss: 30066.4531
Episode: 1842 meanReward: 213.0938 meanLoss: 319

Episode: 1972 meanReward: 215.3750 meanLoss: 30106.0820
Episode: 1973 meanReward: 215.2500 meanLoss: 27362.2949
Episode: 1974 meanReward: 216.9688 meanLoss: 25812.2383
Episode: 1975 meanReward: 218.6875 meanLoss: 25129.0078
Episode: 1976 meanReward: 218.1562 meanLoss: 33068.0312
Episode: 1977 meanReward: 220.1562 meanLoss: 24062.1270
Episode: 1978 meanReward: 220.1562 meanLoss: 32080.6016
Episode: 1979 meanReward: 219.2812 meanLoss: 35213.6758
Episode: 1980 meanReward: 218.4688 meanLoss: 33753.6992
Episode: 1981 meanReward: 217.7188 meanLoss: 32765.4629
Episode: 1982 meanReward: 217.5625 meanLoss: 35189.2656
Episode: 1983 meanReward: 216.8125 meanLoss: 28842.7715
Episode: 1984 meanReward: 217.0938 meanLoss: 28778.3789
Episode: 1985 meanReward: 213.5625 meanLoss: 28703.6133
Episode: 1986 meanReward: 213.2812 meanLoss: 32436.5977
Episode: 1987 meanReward: 213.9688 meanLoss: 29440.2754
Episode: 1988 meanReward: 214.1562 meanLoss: 32098.1484
Episode: 1989 meanReward: 214.3750 meanLoss: 307

Episode: 2119 meanReward: 206.1562 meanLoss: 31877.5020
Episode: 2120 meanReward: 206.6875 meanLoss: 29535.5234
Episode: 2121 meanReward: 206.8438 meanLoss: 34109.0273
Episode: 2122 meanReward: 208.4062 meanLoss: 27990.3027
Episode: 2123 meanReward: 209.5312 meanLoss: 27057.0781
Episode: 2124 meanReward: 209.6562 meanLoss: 31339.2988
Episode: 2125 meanReward: 209.2500 meanLoss: 30583.7520
Episode: 2126 meanReward: 208.9688 meanLoss: 29163.2852
Episode: 2127 meanReward: 208.8750 meanLoss: 30355.1680
Episode: 2128 meanReward: 208.7188 meanLoss: 35073.6055
Episode: 2129 meanReward: 209.1875 meanLoss: 28162.0215
Episode: 2130 meanReward: 207.4688 meanLoss: 35705.6172
Episode: 2131 meanReward: 208.5312 meanLoss: 28282.8105
Episode: 2132 meanReward: 209.0625 meanLoss: 25787.7754
Episode: 2133 meanReward: 208.4375 meanLoss: 31973.3242
Episode: 2134 meanReward: 209.0000 meanLoss: 30346.9434
Episode: 2135 meanReward: 210.3750 meanLoss: 26854.5664
Episode: 2136 meanReward: 210.1875 meanLoss: 316

Episode: 2266 meanReward: 212.7500 meanLoss: 28708.5781
Episode: 2267 meanReward: 213.4062 meanLoss: 29418.5117
Episode: 2268 meanReward: 214.1562 meanLoss: 29552.8203
Episode: 2269 meanReward: 214.5312 meanLoss: 28199.5762
Episode: 2270 meanReward: 214.2812 meanLoss: 31072.0645
Episode: 2271 meanReward: 213.8750 meanLoss: 28092.8496
Episode: 2272 meanReward: 213.7188 meanLoss: 29669.8320
Episode: 2273 meanReward: 214.7188 meanLoss: 27660.2832
Episode: 2274 meanReward: 214.3750 meanLoss: 33678.1641
Episode: 2275 meanReward: 214.3750 meanLoss: 33449.9258
Episode: 2276 meanReward: 214.1875 meanLoss: 35193.4805
Episode: 2277 meanReward: 215.9375 meanLoss: 25623.9219
Episode: 2278 meanReward: 214.5625 meanLoss: 34570.8359
Episode: 2279 meanReward: 214.9062 meanLoss: 31415.5117
Episode: 2280 meanReward: 213.8438 meanLoss: 30945.0645
Episode: 2281 meanReward: 212.6875 meanLoss: 32427.6426
Episode: 2282 meanReward: 214.5938 meanLoss: 26392.5176
Episode: 2283 meanReward: 214.4688 meanLoss: 303

Episode: 2413 meanReward: 210.7812 meanLoss: 35715.6875
Episode: 2414 meanReward: 208.5000 meanLoss: 34420.5391
Episode: 2415 meanReward: 208.3438 meanLoss: 32610.6094
Episode: 2416 meanReward: 209.2500 meanLoss: 25392.9746
Episode: 2417 meanReward: 210.1250 meanLoss: 28857.1855
Episode: 2418 meanReward: 209.5000 meanLoss: 33565.7031
Episode: 2419 meanReward: 210.5000 meanLoss: 28646.6016
Episode: 2420 meanReward: 210.9062 meanLoss: 30314.7285
Episode: 2421 meanReward: 212.0000 meanLoss: 28724.9160
Episode: 2422 meanReward: 211.7188 meanLoss: 33405.6328
Episode: 2423 meanReward: 212.5625 meanLoss: 25253.5078
Episode: 2424 meanReward: 214.1562 meanLoss: 26996.4434
Episode: 2425 meanReward: 213.3750 meanLoss: 33548.1797
Episode: 2426 meanReward: 213.2188 meanLoss: 29222.3496
Episode: 2427 meanReward: 212.9062 meanLoss: 32517.3652
Episode: 2428 meanReward: 212.7500 meanLoss: 30219.7852
Episode: 2429 meanReward: 213.5312 meanLoss: 25760.4980
Episode: 2430 meanReward: 212.0000 meanLoss: 343

Episode: 2560 meanReward: 207.6250 meanLoss: 32240.2266
Episode: 2561 meanReward: 208.3438 meanLoss: 28740.1465
Episode: 2562 meanReward: 208.4062 meanLoss: 30427.4102
Episode: 2563 meanReward: 207.8438 meanLoss: 32663.8398
Episode: 2564 meanReward: 208.4688 meanLoss: 28505.9766
Episode: 2565 meanReward: 208.7500 meanLoss: 31447.3867
Episode: 2566 meanReward: 207.0938 meanLoss: 31748.7910
Episode: 2567 meanReward: 206.8750 meanLoss: 28727.6406
Episode: 2568 meanReward: 206.2500 meanLoss: 31873.8691
Episode: 2569 meanReward: 204.3125 meanLoss: 34817.4570
Episode: 2570 meanReward: 204.6875 meanLoss: 30397.2480
Episode: 2571 meanReward: 204.9375 meanLoss: 33062.5352
Episode: 2572 meanReward: 204.2188 meanLoss: 31822.6562
Episode: 2573 meanReward: 202.9375 meanLoss: 34315.4219
Episode: 2574 meanReward: 204.2812 meanLoss: 25761.5586
Episode: 2575 meanReward: 204.4375 meanLoss: 31993.7480
Episode: 2576 meanReward: 203.8750 meanLoss: 35228.6250
Episode: 2577 meanReward: 207.3125 meanLoss: 209

Episode: 2707 meanReward: 218.4375 meanLoss: 33907.2227
Episode: 2708 meanReward: 217.9688 meanLoss: 29698.2090
Episode: 2709 meanReward: 217.3750 meanLoss: 33638.2773
Episode: 2710 meanReward: 217.4375 meanLoss: 29368.9629
Episode: 2711 meanReward: 216.0312 meanLoss: 31902.1172
Episode: 2712 meanReward: 211.8750 meanLoss: 33274.6562
Episode: 2713 meanReward: 211.8125 meanLoss: 33512.6641
Episode: 2714 meanReward: 213.2188 meanLoss: 26918.8242
Episode: 2715 meanReward: 214.0312 meanLoss: 28515.6914
Episode: 2716 meanReward: 215.1875 meanLoss: 29149.4102
Episode: 2717 meanReward: 214.5312 meanLoss: 33909.5430
Episode: 2718 meanReward: 213.0625 meanLoss: 34359.4023
Episode: 2719 meanReward: 213.7500 meanLoss: 28836.6738
Episode: 2720 meanReward: 214.6250 meanLoss: 28352.6465
Episode: 2721 meanReward: 216.1562 meanLoss: 25399.8008
Episode: 2722 meanReward: 214.3750 meanLoss: 32771.1367
Episode: 2723 meanReward: 213.9062 meanLoss: 36822.5352
Episode: 2724 meanReward: 213.0625 meanLoss: 297

Episode: 2854 meanReward: 208.8438 meanLoss: 32287.8477
Episode: 2855 meanReward: 208.6250 meanLoss: 29929.2480
Episode: 2856 meanReward: 208.6250 meanLoss: 32174.7578
Episode: 2857 meanReward: 209.1250 meanLoss: 32486.3574
Episode: 2858 meanReward: 208.4375 meanLoss: 35101.7500
Episode: 2859 meanReward: 208.2500 meanLoss: 33779.3086
Episode: 2860 meanReward: 208.0938 meanLoss: 32043.6016
Episode: 2861 meanReward: 207.4375 meanLoss: 33987.8125
Episode: 2862 meanReward: 205.8750 meanLoss: 33700.6523
Episode: 2863 meanReward: 206.1875 meanLoss: 28526.3945
Episode: 2864 meanReward: 206.6250 meanLoss: 30732.0352
Episode: 2865 meanReward: 206.7812 meanLoss: 33194.1016
Episode: 2866 meanReward: 206.0000 meanLoss: 34618.9102
Episode: 2867 meanReward: 206.6250 meanLoss: 27830.1426
Episode: 2868 meanReward: 206.0938 meanLoss: 30092.6387
Episode: 2869 meanReward: 207.7500 meanLoss: 22281.3652
Episode: 2870 meanReward: 208.5625 meanLoss: 31418.9023
Episode: 2871 meanReward: 208.5312 meanLoss: 344

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

## Testing

Let's checkout how our trained agent plays the game.

In [184]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model-seq.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]
    state = env.reset()
    total_reward = 0
    while True:
        env.render()
        action_logits, initial_state = sess.run([model.actions_logits, model.final_state],
                                                feed_dict = {model.states: state.reshape([1, -1]), 
                                                             model.initial_state: initial_state})
        action = np.argmax(action_logits)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
print('total_reward:{}'.format(total_reward))
env.close()

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward:120.0


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.