# Sequential Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import gym
import numpy as np

In [2]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

>**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [3]:
import gym

# Create the Cart-Pole game environment
#env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [4]:
env.reset()
for _ in range(10):
    # env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

state, action, reward, done, info: [ 0.02782733 -0.19801476  0.04266009  0.30762757] 0 1.0 False {}
state, action, reward, done, info: [ 0.02386703 -0.00352583  0.04881264  0.02869784] 1 1.0 False {}
state, action, reward, done, info: [ 0.02379652  0.19086337  0.0493866  -0.24819343] 1 1.0 False {}
state, action, reward, done, info: [ 0.02761378 -0.00492783  0.04442273  0.05964904] 0 1.0 False {}
state, action, reward, done, info: [ 0.02751523  0.18952997  0.04561571 -0.21869391] 1 1.0 False {}
state, action, reward, done, info: [ 0.03130583  0.38397118  0.04124183 -0.49664596] 1 1.0 False {}
state, action, reward, done, info: [ 0.03898525  0.57848806  0.03130891 -0.77605152] 1 1.0 False {}
state, action, reward, done, info: [ 0.05055501  0.77316572  0.01578788 -1.05872159] 1 1.0 False {}
state, action, reward, done, info: [ 0.06601833  0.968075   -0.00538655 -1.34640762] 1 1.0 False {}
state, action, reward, done, info: [ 0.08537983  1.16326426 -0.0323147  -1.6407709 ] 1 1.0 False {}


To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [5]:
# print(rewards[-20:])
# print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
# print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
# print(np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
# print(np.max(np.array(rewards)), np.min(np.array(rewards)))
# print(np.max(np.array(states)), np.min(np.array(states)))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [6]:
def model_input(state_size, lstm_size, batch_size=1):
    actions = tf.placeholder(tf.int32, [None], name='actions')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    labelQs = tf.placeholder(tf.float32, [None], name='labelQs')
        
    gru = tf.nn.rnn_cell.GRUCell(lstm_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    initial_state = cell.zero_state(batch_size, tf.float32)
    
    return actions, states, targetQs, labelQs, cell, initial_state

In [7]:
# RNN generator or sequence generator
def generator(states, initial_state, cell, lstm_size, num_classes, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [8]:
def model_loss(action_size, hidden_size, states, cell, initial_state, actions, targetQs, labelQs):
    actions_logits, final_state = generator(states=states, cell=cell, initial_state=initial_state, 
                                            lstm_size=hidden_size, num_classes=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    lossQtgt = tf.reduce_mean(tf.square(Qs - targetQs)) # next state, next action and nextQs
    lossQlbl = tf.reduce_mean(tf.square(Qs - labelQs)) # current state, action, and currentQs
    lossQtgt_sigm = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs, 
                                                                           labels=tf.nn.sigmoid(targetQs)))
    lossQlbl_sigm = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs,
                                                                           labels=tf.nn.sigmoid(labelQs)))
    loss = lossQtgt + lossQlbl + lossQtgt_sigm + lossQlbl_sigm
    return actions_logits, final_state, loss, lossQtgt, lossQlbl, lossQtgt_sigm, lossQlbl_sigm

In [19]:
def model_opt(loss, learning_rate):
    """
    Get optimization operations in order
    :param loss: Generator loss Tensor for action prediction
    :param learning_rate: Learning Rate Placeholder
    :return: A tuple of (qfunction training, generator training, discriminator training)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    # #opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    grads = tf.gradients(loss, g_vars)
    opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))

    return opt

In [20]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.actions, self.states, self.targetQs, self.labelQs, cell, self.initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.final_state, self.loss, self.lossQtgt, self.lossQlbl, self.lossQtgt_sigm, self.lossQlbl_sigm = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, labelQs=self.labelQs, 
            cell=cell, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [None]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)

In [None]:
# episode_total_reward = deque(maxlen=10)

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [None]:
# print('state:', np.array(states).shape[1], 
#       'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

In [29]:
# Training parameters
# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
action_size = 2                # number of units for the output actions -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation
batch_size = 32                # number of samples in the memory/ experience as mini-batch size
learning_rate = 0.001          # learning rate for adam

In [30]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 2)


## Populate the memory (exprience memory)

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [31]:
state = env.reset()
for _ in range(batch_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done is True:
        state = env.reset()

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [32]:
memory.buffer[0]

[array([-0.0297722 , -0.04377287,  0.04515347, -0.03944067]),
 0,
 array([-0.03064765, -0.23951226,  0.04436466,  0.26713977]),
 1.0,
 0.0]

In [33]:
# states, rewards, actions

In [None]:
# Now train with experiences
saver = tf.train.Saver() # save the trained model
rewards_list, loss_list = [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]
    episode_loss = deque(maxlen=batch_size)
    episode_reward = deque(maxlen=batch_size)
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        lossQlbl_batch, lossQlbl_sigm_batch, lossQtgt_batch, lossQtgt_sigm_batch = [], [], [], []
        state = env.reset()

        # Training steps/batches
        while True:
            # Testing
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.initial_state: initial_state})
            action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append([initial_state, final_state])
            total_reward += reward
            initial_state = final_state
            state = next_state

            # Training
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            rnn_states = memory.states
            initial_states = np.array([each[0] for each in rnn_states])
            final_states = np.array([each[1] for each in rnn_states])
            actions_logits = sess.run(model.actions_logits, 
                                      feed_dict = {model.states: states, 
                                                   model.initial_state: initial_states[0].reshape([1, -1])})
            labelQs = np.max(actions_logits, axis=1) # explore
            next_actions_logits = sess.run(model.actions_logits, 
                                           feed_dict = {model.states: next_states, 
                                                        model.initial_state: final_states[0].reshape([1, -1])})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones) # exploit
            targetQs = rewards + (0.99 * nextQs)
            loss, _, lossQlbl, lossQlbl_sigm, lossQtgt, lossQtgt_sigm = sess.run([model.loss, model.opt, 
                                                                                  model.lossQlbl, 
                                                                                  model.lossQlbl_sigm, 
                                                                                  model.lossQtgt, 
                                                                                  model.lossQtgt_sigm], 
                                            feed_dict = {model.states: states, 
                                                         model.actions: actions,
                                                         model.targetQs: targetQs,
                                                         model.labelQs: labelQs,
                                                         model.initial_state: initial_states[0].reshape([1, -1])})
            loss_batch.append(loss)
            lossQlbl_batch.append(lossQlbl)
            lossQlbl_sigm_batch.append(lossQlbl_sigm)
            lossQtgt_batch.append(lossQtgt)
            lossQtgt_sigm_batch.append(lossQtgt_sigm)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode: {}'.format(ep),
              'meanReward: {:.4f}'.format(np.mean(episode_reward)),
              'meanLoss: {:.4f}'.format(np.mean(loss_batch)),
              'meanLossQlbl: {:.4f}'.format(np.mean(lossQlbl_batch)),
              'meanLossQlbl_sigm: {:.4f}'.format(np.mean(lossQlbl_sigm_batch)),
              'meanLossQtgt: {:.4f}'.format(np.mean(lossQtgt_batch)),
              'meanLossQtgt_sigm: {:.4f}'.format(np.mean(lossQtgt_sigm_batch)))
        rewards_list.append([ep, np.mean(episode_reward)])
        loss_list.append([ep, np.mean(loss_batch)])
        if(np.mean(episode_reward) >= 500):
            break
    
    saver.save(sess, 'checkpoints/model5.ckpt')

Episode: 0 meanReward: 10.0000 meanLoss: 2.3103 meanLossQlbl: 0.0153 meanLossQlbl_sigm: 0.6746 meanLossQtgt: 0.9949 meanLossQtgt_sigm: 0.6255
Episode: 1 meanReward: 9.5000 meanLoss: 2.5527 meanLossQlbl: 0.4651 meanLossQlbl_sigm: 0.5507 meanLossQtgt: 1.0530 meanLossQtgt_sigm: 0.4838
Episode: 2 meanReward: 14.0000 meanLoss: 5.0278 meanLossQlbl: 2.4131 meanLossQlbl_sigm: 0.4239 meanLossQtgt: 1.7648 meanLossQtgt_sigm: 0.4260
Episode: 3 meanReward: 12.7500 meanLoss: 4.0847 meanLossQlbl: 1.7088 meanLossQlbl_sigm: 0.2778 meanLossQtgt: 1.8149 meanLossQtgt_sigm: 0.2832
Episode: 4 meanReward: 14.0000 meanLoss: 4.6253 meanLossQlbl: 1.6886 meanLossQlbl_sigm: 0.3908 meanLossQtgt: 2.1612 meanLossQtgt_sigm: 0.3846
Episode: 5 meanReward: 13.3333 meanLoss: 5.4610 meanLossQlbl: 2.1427 meanLossQlbl_sigm: 0.4434 meanLossQtgt: 2.4018 meanLossQtgt_sigm: 0.4731
Episode: 6 meanReward: 17.0000 meanLoss: 4.9509 meanLossQlbl: 1.5487 meanLossQlbl_sigm: 0.2573 meanLossQtgt: 2.8808 meanLossQtgt_sigm: 0.2641
Episode

Episode: 56 meanReward: 58.0938 meanLoss: 632.5571 meanLossQlbl: 337.8204 meanLossQlbl_sigm: 0.6931 meanLossQtgt: 293.3504 meanLossQtgt_sigm: 0.6931
Episode: 57 meanReward: 55.9062 meanLoss: 469.8569 meanLossQlbl: 253.3288 meanLossQlbl_sigm: 0.6931 meanLossQtgt: 215.1418 meanLossQtgt_sigm: 0.6931
Episode: 58 meanReward: 55.8438 meanLoss: 484.6460 meanLossQlbl: 259.8382 meanLossQlbl_sigm: 0.6931 meanLossQtgt: 223.4214 meanLossQtgt_sigm: 0.6931
Episode: 59 meanReward: 56.1562 meanLoss: 384.9443 meanLossQlbl: 208.2625 meanLossQlbl_sigm: 0.6931 meanLossQtgt: 175.2955 meanLossQtgt_sigm: 0.6931
Episode: 60 meanReward: 55.6875 meanLoss: 278.4822 meanLossQlbl: 149.5078 meanLossQlbl_sigm: 0.6931 meanLossQtgt: 127.5881 meanLossQtgt_sigm: 0.6931
Episode: 61 meanReward: 56.1250 meanLoss: 402.5965 meanLossQlbl: 215.0907 meanLossQlbl_sigm: 0.6931 meanLossQtgt: 186.1194 meanLossQtgt_sigm: 0.6931
Episode: 62 meanReward: 56.2812 meanLoss: 504.1919 meanLossQlbl: 270.3630 meanLossQlbl_sigm: 0.6931 meanLo

Episode: 112 meanReward: 113.3750 meanLoss: 22.5578 meanLossQlbl: 0.1112 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 22.3382 meanLossQtgt_sigm: 0.1083
Episode: 113 meanReward: 117.2188 meanLoss: 15.3576 meanLossQlbl: 5.9367 meanLossQlbl_sigm: 0.0247 meanLossQtgt: 9.3544 meanLossQtgt_sigm: 0.0418
Episode: 114 meanReward: 126.3750 meanLoss: 12.0738 meanLossQlbl: 0.3848 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 11.6508 meanLossQtgt_sigm: 0.0382
Episode: 115 meanReward: 140.4062 meanLoss: 10.4779 meanLossQlbl: 0.0379 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 10.4017 meanLossQtgt_sigm: 0.0383
Episode: 116 meanReward: 140.1562 meanLoss: 262.0642 meanLossQlbl: 0.8934 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 260.3792 meanLossQtgt_sigm: 0.7917
Episode: 117 meanReward: 137.8438 meanLoss: 156.3128 meanLossQlbl: 3.8418 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 152.0909 meanLossQtgt_sigm: 0.3801
Episode: 118 meanReward: 138.2500 meanLoss: 38.1679 meanLossQlbl: 0.4697 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 37

Episode: 168 meanReward: 320.7812 meanLoss: 4.6868 meanLossQlbl: 0.1019 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 4.5612 meanLossQtgt_sigm: 0.0237
Episode: 169 meanReward: 320.7812 meanLoss: 10.5964 meanLossQlbl: 0.0274 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 10.5310 meanLossQtgt_sigm: 0.0380
Episode: 170 meanReward: 310.9062 meanLoss: 43.0273 meanLossQlbl: 0.4497 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 42.4510 meanLossQtgt_sigm: 0.1266
Episode: 171 meanReward: 309.3438 meanLoss: 148.3476 meanLossQlbl: 1.8674 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 145.9056 meanLossQtgt_sigm: 0.5746
Episode: 172 meanReward: 309.3438 meanLoss: 5.7978 meanLossQlbl: 0.1870 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 5.5868 meanLossQtgt_sigm: 0.0240
Episode: 173 meanReward: 299.7188 meanLoss: 25.0550 meanLossQlbl: 1.1195 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 23.8454 meanLossQtgt_sigm: 0.0901
Episode: 174 meanReward: 288.4688 meanLoss: 14.2404 meanLossQlbl: 0.8483 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 13.3237

Episode: 225 meanReward: 221.4375 meanLoss: 404.2200 meanLossQlbl: 0.9776 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 402.2487 meanLossQtgt_sigm: 0.9938
Episode: 226 meanReward: 216.0312 meanLoss: 567.8749 meanLossQlbl: 3.5865 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 562.6924 meanLossQtgt_sigm: 1.5960
Episode: 227 meanReward: 216.0312 meanLoss: 10.4405 meanLossQlbl: 0.3865 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 10.0186 meanLossQtgt_sigm: 0.0355
Episode: 228 meanReward: 215.0000 meanLoss: 35.7084 meanLossQlbl: 0.4942 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 35.1160 meanLossQtgt_sigm: 0.0982
Episode: 229 meanReward: 215.0000 meanLoss: 17.0261 meanLossQlbl: 0.1525 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 16.8247 meanLossQtgt_sigm: 0.0489
Episode: 230 meanReward: 225.5625 meanLoss: 18.7203 meanLossQlbl: 0.1992 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 18.4707 meanLossQtgt_sigm: 0.0504
Episode: 231 meanReward: 240.8750 meanLoss: 17.2588 meanLossQlbl: 0.1810 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 1

Episode: 281 meanReward: 254.3125 meanLoss: 316.7501 meanLossQlbl: 5.1973 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 310.7095 meanLossQtgt_sigm: 0.8434
Episode: 282 meanReward: 248.4062 meanLoss: 327.5249 meanLossQlbl: 4.7648 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 321.8145 meanLossQtgt_sigm: 0.9456
Episode: 283 meanReward: 233.1875 meanLoss: 360.9575 meanLossQlbl: 5.7629 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 354.0004 meanLossQtgt_sigm: 1.1943
Episode: 284 meanReward: 226.0625 meanLoss: 13.4693 meanLossQlbl: 2.0005 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 11.4204 meanLossQtgt_sigm: 0.0483
Episode: 285 meanReward: 228.2500 meanLoss: 39.3827 meanLossQlbl: 0.3657 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 38.8938 meanLossQtgt_sigm: 0.1231
Episode: 286 meanReward: 224.5000 meanLoss: 180.2837 meanLossQlbl: 0.9088 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 178.7184 meanLossQtgt_sigm: 0.6565
Episode: 287 meanReward: 224.2812 meanLoss: 317.4598 meanLossQlbl: 0.3521 meanLossQlbl_sigm: 0.0000 meanLossQt

Episode: 337 meanReward: 112.3750 meanLoss: 138.5301 meanLossQlbl: 0.0385 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 137.8032 meanLossQtgt_sigm: 0.6885
Episode: 338 meanReward: 102.6250 meanLoss: 12.7394 meanLossQlbl: 0.7641 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 11.9361 meanLossQtgt_sigm: 0.0392
Episode: 339 meanReward: 107.9062 meanLoss: 10.6703 meanLossQlbl: 0.3005 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 10.3161 meanLossQtgt_sigm: 0.0538
Episode: 340 meanReward: 112.0625 meanLoss: 19.7199 meanLossQlbl: 0.9436 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 18.7103 meanLossQtgt_sigm: 0.0660
Episode: 341 meanReward: 113.8438 meanLoss: 61.5299 meanLossQlbl: 0.6273 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 60.6541 meanLossQtgt_sigm: 0.2485
Episode: 342 meanReward: 126.0312 meanLoss: 12.1106 meanLossQlbl: 0.4989 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 11.5815 meanLossQtgt_sigm: 0.0302
Episode: 343 meanReward: 122.3125 meanLoss: 131.6662 meanLossQlbl: 2.2036 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 12

Episode: 393 meanReward: 201.6562 meanLoss: 18.5665 meanLossQlbl: 0.3003 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 18.2155 meanLossQtgt_sigm: 0.0506
Episode: 394 meanReward: 198.8438 meanLoss: 302.5993 meanLossQlbl: 0.3387 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 301.4079 meanLossQtgt_sigm: 0.8527
Episode: 395 meanReward: 194.1875 meanLoss: 489.9209 meanLossQlbl: 0.1253 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 488.2701 meanLossQtgt_sigm: 1.5256
Episode: 396 meanReward: 190.0000 meanLoss: 444.9995 meanLossQlbl: 1.9461 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 441.5470 meanLossQtgt_sigm: 1.5064
Episode: 397 meanReward: 186.7500 meanLoss: 86.4650 meanLossQlbl: 2.5459 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 83.6084 meanLossQtgt_sigm: 0.3107
Episode: 398 meanReward: 188.2500 meanLoss: 33.3212 meanLossQlbl: 0.5963 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 32.6188 meanLossQtgt_sigm: 0.1061
Episode: 399 meanReward: 199.1250 meanLoss: 10.4269 meanLossQlbl: 0.1072 meanLossQlbl_sigm: 0.0000 meanLossQtgt:

Episode: 449 meanReward: 325.5938 meanLoss: 17.3615 meanLossQlbl: 0.2721 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 17.0400 meanLossQtgt_sigm: 0.0494
Episode: 450 meanReward: 321.3438 meanLoss: 106.7128 meanLossQlbl: 1.8193 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 104.5992 meanLossQtgt_sigm: 0.2943
Episode: 451 meanReward: 311.4062 meanLoss: 28.4163 meanLossQlbl: 0.3455 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 27.9661 meanLossQtgt_sigm: 0.1047
Episode: 452 meanReward: 296.2188 meanLoss: 290.3668 meanLossQlbl: 1.9414 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 287.5859 meanLossQtgt_sigm: 0.8394
Episode: 453 meanReward: 293.1250 meanLoss: 543.4620 meanLossQlbl: 5.7164 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 536.1633 meanLossQtgt_sigm: 1.5823
Episode: 454 meanReward: 288.8438 meanLoss: 478.0087 meanLossQlbl: 1.5494 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 474.9526 meanLossQtgt_sigm: 1.5067
Episode: 455 meanReward: 289.1875 meanLoss: 83.3444 meanLossQlbl: 4.9762 meanLossQlbl_sigm: 0.0000 meanLossQtg

Episode: 505 meanReward: 211.1562 meanLoss: 36.0566 meanLossQlbl: 0.1683 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 35.7919 meanLossQtgt_sigm: 0.0964
Episode: 506 meanReward: 212.9688 meanLoss: 14.7717 meanLossQlbl: 0.1638 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 14.5630 meanLossQtgt_sigm: 0.0449
Episode: 507 meanReward: 214.4375 meanLoss: 448.3215 meanLossQlbl: 224.9740 meanLossQlbl_sigm: 0.3496 meanLossQtgt: 222.5566 meanLossQtgt_sigm: 0.4413
Episode: 508 meanReward: 217.1875 meanLoss: 72.1639 meanLossQlbl: 27.9076 meanLossQlbl_sigm: 0.1029 meanLossQtgt: 44.0022 meanLossQtgt_sigm: 0.1513
Episode: 509 meanReward: 213.6875 meanLoss: 116.7195 meanLossQlbl: 2.9293 meanLossQlbl_sigm: 0.0072 meanLossQtgt: 113.5731 meanLossQtgt_sigm: 0.2099
Episode: 510 meanReward: 208.0000 meanLoss: 108.0712 meanLossQlbl: 1.3680 meanLossQlbl_sigm: 0.0000 meanLossQtgt: 106.2301 meanLossQtgt_sigm: 0.4730
Episode: 511 meanReward: 192.7500 meanLoss: 156.8881 meanLossQlbl: 2.9252 meanLossQlbl_sigm: 0.0000 meanLossQ

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

## Testing

Let's checkout how our trained agent plays the game.

In [184]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model-seq.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]
    state = env.reset()
    total_reward = 0
    while True:
        env.render()
        action_logits, initial_state = sess.run([model.actions_logits, model.final_state],
                                                feed_dict = {model.states: state.reshape([1, -1]), 
                                                             model.initial_state: initial_state})
        action = np.argmax(action_logits)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
print('total_reward:{}'.format(total_reward))
env.close()

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward:120.0


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.