# Deep cortical reinforcement learning: Policy gradients + Q-learning + GAN


In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
# Report the TensorFlow version and check whether a GPU is visible
# (an empty device name means TensorFlow will run on the CPU)
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included `gym` as a submodule, so you can run `git submodule update --init --recursive` to pull the contents into the `gym` directory.

>**Note:** Once OpenAI Gym is cloned, install it with `pip install -e gym/[all]`.

In [2]:
import gym

# Create the Cart-Pole game environment.
# CartPole-v1 caps an episode at 500 steps, which matches the success threshold used in training.
env = gym.make('CartPole-v1')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing an action as an integer to `env.step` advances the simulation by one step. You can see how many actions are possible from `env.action_space`, and you can get a random action with `env.action_space.sample()`. This interface is common to all Gym environments. In the Cart-Pole game there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to step the simulation with random actions. Uncomment `env.render()` to watch it.

In [3]:
env.reset()
batch = []
for _ in range(1111):
    #env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    batch.append([action, state, reward, done, info])
    #print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

To shut the window showing the simulation, use `env.close()`.

In [4]:
# env.close()

If you ran the simulation above, we can inspect the collected transitions and their rewards:

In [5]:
batch[0], 
batch[0][1].shape, state.shape

((4,), (4,))

In [6]:
import numpy as np
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
infos = np.array([each[4] for each in batch])

In [7]:
# print(rewards[-20:])
print('shapes:', np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print('dtypes:', np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('states:', np.max(np.array(states)), np.min(np.array(states)))
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print('rewards:', np.max(np.array(rewards)), np.min(np.array(rewards)))

shapes: (1111,) (1111, 4) (1111,) (1111,)
dtypes: float64 float64 int64 bool
states: 2.7905941157111283 -2.711930050330779
actions: 1 0
rewards: 1.0 1.0


In [8]:
actions[:10]

array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1])

In [9]:
rewards[:10]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [10]:
# import numpy as np
def sigmoid(x, derivative=False):
    # With derivative=True, x is expected to already be a sigmoid output s, giving s*(1-s).
    return x*(1-x) if derivative else 1/(1+np.exp(-x))

In [11]:
sigmoid(np.max(np.array(rewards))), sigmoid(np.min(np.array(rewards)))

(0.7310585786300049, 0.7310585786300049)

In [12]:
print('rewards:', np.max(np.array(rewards))/100, np.min(np.array(rewards))/100)

rewards: 0.01 0.01


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.
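
To make the reward structure concrete, here is a minimal sketch in plain Python. It assumes a discount factor of 0.99 (the value used in the training loop later in this notebook), and the helper name `discounted_return` is hypothetical. Since every step pays 1.0, the undiscounted return is simply the episode length, while the discounted return grows more slowly and stays below 1/(1 - 0.99) = 100.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over one episode (gamma is an assumed value)."""
    total = 0.0
    # Accumulate backwards so each step is one multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

short_episode = [1.0] * 20    # pole fell after 20 steps
long_episode = [1.0] * 200    # pole stayed up for 200 steps

# Undiscounted returns equal the episode lengths.
print(sum(short_episode), sum(long_episode))
# Discounted returns still favor the longer episode.
print(discounted_return(short_episode), discounted_return(long_episode))
```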

## Q-Network

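We train our Q-learning agent using the Bellman equation:

$$
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
$$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount factor, and $s'$ is the next state reached from state $s$ by taking action $a$. The maximum is taken over the actions $a'$ available in $s'$.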
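
As a concrete instance of the equation above, here is a minimal sketch of one update on a tiny Q-table stored as a dict. The state labels, the learning rate `alpha`, and the function name `q_update` are assumptions made for illustration; they are not part of this notebook's code.

```python
GAMMA = 0.99   # discount factor (assumed; same value as in the training loop)

def q_update(q_table, s, a, r, s_next, done, alpha=0.5):
    """Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')."""
    # No future value is added once the episode has ended.
    future = 0.0 if done else max(q_table[s_next].values())
    target = r + GAMMA * future
    q_table[s][a] += alpha * (target - q_table[s][a])
    return target

# Two hypothetical states, two actions (0 = left, 1 = right).
q_table = {
    "s0": {0: 0.0, 1: 0.0},
    "s1": {0: 2.0, 1: 5.0},
}
target = q_update(q_table, "s0", 1, r=1.0, s_next="s1", done=False)
print(target, q_table["s0"][1])
```

Here the target is 1.0 + 0.99 * 5.0, and the table entry moves halfway toward it because `alpha` is 0.5.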

Previously, we used this equation to learn values for a Q-_table_. For this game, however, the number of states is huge. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so, ignoring floating-point precision, there are practically infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.
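
To see why a table does not scale, consider a rough sketch that discretizes each of the four state values into a fixed number of bins. The bin counts are arbitrary assumptions. The table needs one entry per (state, action) pair, so its size grows with the fourth power of the resolution.

```python
STATE_DIMS = 4    # cart position, cart velocity, pole angle, pole angular velocity
NUM_ACTIONS = 2   # push left, push right

def q_table_entries(bins_per_dim):
    """Number of Q-table cells if every state value is split into bins_per_dim bins."""
    return (bins_per_dim ** STATE_DIMS) * NUM_ACTIONS

for bins in (10, 100, 1000):
    print(bins, q_table_entries(bins))
```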

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q-value, $Q(s, a)$, is calculated by passing a state into the network. The state goes through fully connected hidden layers, and the output is one Q-value for each available action.
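
Once the network has produced one Q-value per action, acting greedily just means picking the index of the largest one. A minimal sketch, with made-up Q-values standing in for the network output and a hypothetical helper name:

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.37, 0.82]   # illustrative output for actions 0 (left) and 1 (right)
print(greedy_action(q_values))
```

The training loop in this notebook does the same thing with `np.argmax` on the generator's action logits.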

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.
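
The target and loss computation described above can be sketched without any framework. The numbers are invented, `gamma = 0.99` is the value used in the training loop later on, and the `dones` mask (which zeroes the bootstrap term on terminal steps) mirrors that loop. The function names are hypothetical.

```python
def td_targets(rewards, next_qs_max, dones, gamma=0.99):
    """r + gamma * max_a' Q(s', a'), with the bootstrap term dropped on terminal steps."""
    return [r + gamma * q * (1.0 - d)
            for r, q, d in zip(rewards, next_qs_max, dones)]

def mse(predictions, targets):
    """Mean of (target - prediction)^2 over the batch."""
    return sum((t - p) ** 2 for p, t in zip(predictions, targets)) / len(targets)

rewards     = [1.0, 1.0, 1.0]
next_qs_max = [4.0, 2.0, 3.0]   # max over a' of Q(s', a') from the network
dones       = [0.0, 0.0, 1.0]   # the last transition ended the episode
current_qs  = [5.0, 2.0, 1.5]   # Q(s, a) for the actions actually taken

targets = td_targets(rewards, next_qs_max, dones)
print(targets)
print(mse(current_qs, targets))
```

In the real network, this squared error is what the optimizer minimizes with respect to the weights; the targets themselves are treated as fixed numbers.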

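Below is my implementation. Instead of a plain fully connected Q-network, the cells that follow build two recurrent networks around a GRU cell: a generator (actor) that outputs logits for each action, and a discriminator (critic) that outputs a single Q-value for a state-action pair. The fully connected versions, with two hidden layers and leaky ReLU activations, are kept as commented-out cells. Feel free to try them out.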

In [13]:
# Data of the model
def model_input(state_size, lstm_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    # GRU: Gated Recurrent Units
    gru = tf.nn.rnn_cell.GRUCell(lstm_size) # hidden size
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    g_initial_state = cell.zero_state(batch_size, tf.float32) # feedback or lateral/recurrent connection from output
    d_initial_state = cell.zero_state(batch_size, tf.float32) # feedback or lateral/recurrent connection from output
    return states, actions, targetQs, cell, g_initial_state, d_initial_state

In [14]:
# How to use batch-norm
#   x_norm = tf.layers.batch_normalization(x, training=training)

#   # ...

#   update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
#   with tf.control_dependencies(update_ops):
#     train_op = optimizer.minimize(loss)

In [15]:
# training: Either a Python boolean, or a TensorFlow boolean scalar tensor (e.g. a placeholder). 
# Whether to return the output in: 
# training mode (normalized with statistics of the current batch) or 
# inference mode (normalized with moving statistics). 
# NOTE: make sure to set this parameter correctly, or else your training/inference will not work properly.

In [16]:
# MLP & Conv
# # Generator/Controller: Generating/predicting the actions
# def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
#     with tf.variable_scope('generator', reuse=reuse):
#         # First fully connected layer
#         h1 = tf.layers.dense(inputs=states, units=hidden_size)
#         bn1 = tf.layers.batch_normalization(h1, training=training)        
#         nl1 = tf.maximum(alpha * bn1, bn1)
        
#         # Second fully connected layer
#         h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
#         bn2 = tf.layers.batch_normalization(h2, training=training)        
#         nl2 = tf.maximum(alpha * bn2, bn2)
        
#         # Output layer
#         logits = tf.layers.dense(inputs=nl2, units=action_size)        
#         #predictions = tf.nn.softmax(logits)

#         # return actions logits
#         return logits

In [17]:
# RNN generator or sequence generator
def generator(states, initial_state, cell, lstm_size, num_classes, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [18]:
# MLP & Conv
# # Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
# def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
#     with tf.variable_scope('discriminator', reuse=reuse):
#         # Fusion/merge states and actions/ SA/ SM
#         x_fused = tf.concat(axis=1, values=[states, actions])
        
#         # First fully connected layer
#         h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
#         bn1 = tf.layers.batch_normalization(h1, training=training)        
#         nl1 = tf.maximum(alpha * bn1, bn1)
        
#         # Second fully connected layer
#         h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
#         bn2 = tf.layers.batch_normalization(h2, training=training)        
#         nl2 = tf.maximum(alpha * bn2, bn2)
        
#         # Output layer
#         logits = tf.layers.dense(inputs=nl2, units=1)        
#         #predictions = tf.nn.softmax(logits)

#         # return rewards logits
#         return logits

In [19]:
# RNN discriminator or sequence discriminator
def discriminator(states, actions, initial_state, cell, lstm_size, reuse=False): 
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        inputs = tf.layers.dense(inputs=x_fused, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=1)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the Q-value logits
        return logits, final_state

In [20]:
def model_loss(action_size, hidden_size, states, actions, targetQs,
               cell, g_initial_state, d_initial_state):
    # G/Actor
    #actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_logits, g_final_state = generator(states=states, num_classes=action_size, 
                                              cell=cell, initial_state=g_initial_state, lstm_size=hidden_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                      labels=actions_labels)
    g_loss = tf.reduce_mean(neg_log_prob_actions * targetQs)
    
    # D/Critic
    #Qs_logits = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    Qs_logits, d_final_state = discriminator(states=states, actions=actions_logits, 
                                             cell=cell, initial_state=d_initial_state, lstm_size=hidden_size)
    d_lossQ = tf.reduce_mean(tf.square(tf.reshape(Qs_logits, [-1]) - targetQs))
    d_lossQ_sigm = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=tf.reshape(Qs_logits, [-1]),
                                                                          labels=tf.nn.sigmoid(targetQs)))
    d_loss = d_lossQ_sigm + d_lossQ

    return actions_logits, Qs_logits, g_final_state, d_final_state, g_loss, d_loss, d_lossQ, d_lossQ_sigm
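
The generator loss in `model_loss` is a softmax cross-entropy between the action logits and the actions taken, weighted by the target Q-values. Here is a framework-free sketch of that quantity for a small batch. The logits and targets are invented, and the helper names are hypothetical.

```python
import math

def neg_log_prob(logits, action):
    """Softmax cross-entropy for one sample: -log softmax(logits)[action]."""
    m = max(logits)                                    # subtract the max for stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[action]

def weighted_policy_loss(batch_logits, actions, target_qs):
    """Mean over the batch of -log pi(a|s) * targetQ."""
    terms = [neg_log_prob(l, a) * q
             for l, a, q in zip(batch_logits, actions, target_qs)]
    return sum(terms) / len(terms)

batch_logits = [[0.0, 0.0], [2.0, 0.0]]   # one row of action logits per step
actions      = [1, 0]                     # actions that were taken
target_qs    = [1.0, 3.0]                 # weights from the critic targets
print(weighted_policy_loss(batch_logits, actions, target_qs))
```

Actions with a large target Q contribute more to the loss, so minimizing it raises the probability of those actions the most.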

In [21]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    """
    Get optimization operations in order
    :param g_loss: Generator loss Tensor for action prediction
    :param d_loss: Discriminator loss Tensor for reward prediction for generated/prob/logits action
    :param learning_rate: Learning rate passed to the Adam optimizers
    :return: A tuple of (generator training op, discriminator training op)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize RNN
    # g_grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(g_loss, g_vars), clip_norm=5) # usually around 1-5
    # d_grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(d_loss, d_vars), clip_norm=5) # usually around 1-5
    g_grads=tf.gradients(g_loss, g_vars)
    d_grads=tf.gradients(d_loss, d_vars)
    g_opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(g_grads, g_vars))
    d_opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(d_grads, d_vars))
    
    # # Optimize MLP & CNN
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    #     g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
    #     d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [22]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cell, self.g_initial_state, self.d_initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_final_state, self.d_final_state, self.g_loss, self.d_loss, self.d_lossQ, self.d_lossQ_sigm = model_loss(
            action_size=action_size, hidden_size=hidden_size,
            states=self.states, actions=self.actions, cell=cell, targetQs=self.targetQs,
            g_initial_state=self.g_initial_state, d_initial_state=self.d_initial_state)
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [23]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)
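
The `Memory` class relies on `collections.deque` with `maxlen`, which silently drops the oldest entry once the buffer is full. A small sketch of that behavior, using a capacity of 3 and fake transitions instead of real ones:

```python
from collections import deque

buffer = deque(maxlen=3)   # same mechanism as Memory.buffer, with a tiny capacity
for step in range(5):
    buffer.append(("state%d" % step, step))   # stand-in for a real transition

print(list(buffer))   # only the three most recent entries remain
```

Since the notebook creates `Memory(max_size=batch_size)`, each training step sees only the most recent `batch_size` transitions.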

## Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're also tuning the simulation.

In [24]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1111, 4) actions:(1111,)
action size:2


In [25]:
# Training parameters
# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
action_size = 2                # number of units for the output actions -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation
batch_size = 64                # number of samples in the memory/ experience as mini-batch size
learning_rate = 0.001          # learning rate for adam

In [26]:
# Reset/init the graph/session
tf.reset_default_graph()        # returns None, so fetch the fresh default graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 2)
(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 1)


In [27]:
state = env.reset()
for _ in range(batch_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done:
        state = env.reset()

In [28]:
memory.buffer[0]

[array([-0.04807594,  0.02345137,  0.02045159,  0.03768093]),
 0,
 array([-0.04760692, -0.1719578 ,  0.0212052 ,  0.33674573]),
 1.0,
 0.0]

## Training the model

Below we'll train our agent. If you want to watch it train, add an `env.render()` call inside the step loop. This slows training down, because rendering a frame takes longer than a network update. But it's cool to watch the agent get better at the game.

In [None]:
from collections import deque
episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
saver = tf.train.Saver()
rewards_list, g_loss_list, d_loss_list = [], [], []
rates_list, d_lossQ_list, d_lossQsigm_list = [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    for ep in range(11111):
        batch = [] # every data batch
        total_reward = 0
        state = env.reset() # env first state
        g_initial_state = sess.run(model.g_initial_state)
        d_initial_state = sess.run(model.d_initial_state)

        # Training steps/batches
        while True:
            # Testing/inference
            action_logits, g_final_state, d_final_state = sess.run(
                fetches=[model.actions_logits, model.g_final_state, model.d_final_state], 
                feed_dict={model.states: np.reshape(state, [1, -1]),
                           model.g_initial_state: g_initial_state,
                           model.d_initial_state: d_initial_state})
            action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append([g_initial_state, g_final_state,
                                  d_initial_state, d_final_state])
            total_reward += reward
            g_initial_state = g_final_state
            d_initial_state = d_final_state
            state = next_state
            
            # Training
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            rnn_states = memory.states
            g_initial_states = np.array([each[0] for each in rnn_states])
            g_final_states = np.array([each[1] for each in rnn_states])
            d_initial_states = np.array([each[2] for each in rnn_states])
            d_final_states = np.array([each[3] for each in rnn_states])
            nextQs_logits = sess.run(fetches = model.Qs_logits,
                                     feed_dict = {model.states: next_states, 
                                                  model.g_initial_state: g_final_states[0].reshape([1, -1]),
                                                  model.d_initial_state: d_final_states[0].reshape([1, -1])})
            nextQs = nextQs_logits.reshape([-1]) * (1-dones) # exploit
            targetQs = rewards + (0.99 * nextQs)
            g_loss, d_loss, d_lossQ, d_lossQsigm, _, _ = sess.run(
                fetches=[model.g_loss, model.d_loss, 
                         model.d_lossQ, model.d_lossQ_sigm,
                         model.g_opt, model.d_opt], 
                feed_dict = {model.states: states, model.actions: actions,
                             model.targetQs: targetQs,
                             model.g_initial_state: g_initial_states[0].reshape([1, -1]),
                             model.d_initial_state: d_initial_states[0].reshape([1, -1])})
            if done:
                break

        # Episode total reward and success rate/prob
        episode_reward.append(total_reward) # stopping criteria
        rate = total_reward/ 500 # success is 500 points: 0-1
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'rate:{:.4f}'.format(rate),
              'gloss:{:.4f}'.format(g_loss),
              'dloss:{:.4f}'.format(d_loss),
              'dlossQ:{:.4f}'.format(d_lossQ),
              'dlossQsigm:{:.4f}'.format(d_lossQsigm))
        # Plotting out
        rewards_list.append([ep, np.mean(episode_reward)])
        rates_list.append([ep, rate])
        g_loss_list.append([ep, g_loss])
        d_loss_list.append([ep, d_loss])
        d_lossQ_list.append([ep, d_lossQ])
        d_lossQsigm_list.append([ep, d_lossQsigm])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model-seq2.ckpt')

Episode:0 meanR:80.0000 rate:0.1600 gloss:-3.2833 dloss:2.2360 dlossQ:1.5143 dlossQsigm:0.7217
Episode:1 meanR:64.0000 rate:0.0960 gloss:1.9968 dloss:1.2606 dlossQ:0.9850 dlossQsigm:0.2756
Episode:2 meanR:53.3333 rate:0.0640 gloss:1.7915 dloss:1.1929 dlossQ:1.0990 dlossQsigm:0.0938
Episode:3 meanR:50.0000 rate:0.0800 gloss:1.0570 dloss:1.8423 dlossQ:1.7622 dlossQsigm:0.0802
Episode:4 meanR:47.4000 rate:0.0740 gloss:0.3914 dloss:2.0815 dlossQ:2.0090 dlossQsigm:0.0725
Episode:5 meanR:46.1667 rate:0.0800 gloss:0.4965 dloss:2.8467 dlossQ:2.7713 dlossQsigm:0.0754
Episode:6 meanR:43.8571 rate:0.0600 gloss:0.6110 dloss:3.7688 dlossQ:3.6806 dlossQsigm:0.0881
Episode:7 meanR:42.6250 rate:0.0680 gloss:0.8966 dloss:4.2057 dlossQ:4.0988 dlossQsigm:0.1069
Episode:8 meanR:43.0000 rate:0.0920 gloss:3.2130 dloss:4.9532 dlossQ:4.8503 dlossQsigm:0.1029
Episode:9 meanR:43.0000 rate:0.0860 gloss:0.4772 dloss:5.9415 dlossQ:5.8268 dlossQsigm:0.1147
Episode:10 meanR:41.8182 rate:0.0600 gloss:0.7667 dloss:5.6

Episode:87 meanR:48.3182 rate:0.1540 gloss:0.1018 dloss:3.7135 dlossQ:3.6678 dlossQsigm:0.0458
Episode:88 meanR:48.7079 rate:0.1660 gloss:0.0276 dloss:1.9742 dlossQ:1.9505 dlossQsigm:0.0237
Episode:89 meanR:49.0000 rate:0.1500 gloss:0.0106 dloss:1.7205 dlossQ:1.7010 dlossQsigm:0.0195
Episode:90 meanR:48.7912 rate:0.0600 gloss:0.0775 dloss:3.7846 dlossQ:3.7236 dlossQsigm:0.0610
Episode:91 meanR:48.6848 rate:0.0780 gloss:0.0097 dloss:3.2909 dlossQ:3.2385 dlossQsigm:0.0523
Episode:92 meanR:48.7527 rate:0.1100 gloss:0.0107 dloss:3.8214 dlossQ:3.7607 dlossQsigm:0.0606
Episode:93 meanR:48.6915 rate:0.0860 gloss:0.0043 dloss:2.3672 dlossQ:2.3159 dlossQsigm:0.0513
Episode:94 meanR:48.5579 rate:0.0720 gloss:0.0805 dloss:2.3420 dlossQ:2.2828 dlossQsigm:0.0591
Episode:95 meanR:48.8438 rate:0.1520 gloss:0.0213 dloss:2.8622 dlossQ:2.8253 dlossQsigm:0.0369
Episode:96 meanR:49.7216 rate:0.2680 gloss:0.0061 dloss:2.4017 dlossQ:2.3680 dlossQsigm:0.0336
Episode:97 meanR:49.7857 rate:0.1120 gloss:0.0189 

Episode:173 meanR:55.9000 rate:0.1880 gloss:0.0102 dloss:5.3963 dlossQ:5.3713 dlossQsigm:0.0250
Episode:174 meanR:56.1100 rate:0.1220 gloss:0.0023 dloss:4.6028 dlossQ:4.5428 dlossQsigm:0.0600
Episode:175 meanR:56.4500 rate:0.1360 gloss:0.0101 dloss:2.8544 dlossQ:2.8289 dlossQsigm:0.0255
Episode:176 meanR:56.4400 rate:0.0800 gloss:0.0692 dloss:3.1025 dlossQ:3.0540 dlossQsigm:0.0485
Episode:177 meanR:56.0000 rate:0.0780 gloss:0.0235 dloss:3.9003 dlossQ:3.8463 dlossQsigm:0.0540
Episode:178 meanR:55.9200 rate:0.0780 gloss:0.0471 dloss:3.1817 dlossQ:3.1335 dlossQsigm:0.0482
Episode:179 meanR:56.1700 rate:0.1140 gloss:0.0272 dloss:2.3139 dlossQ:2.2635 dlossQsigm:0.0504
Episode:180 meanR:56.5600 rate:0.1820 gloss:0.0115 dloss:3.0103 dlossQ:2.9726 dlossQsigm:0.0377
Episode:181 meanR:56.5600 rate:0.1060 gloss:0.0200 dloss:4.4657 dlossQ:4.3849 dlossQsigm:0.0808
Episode:182 meanR:56.4800 rate:0.0740 gloss:0.0188 dloss:3.2673 dlossQ:3.2206 dlossQsigm:0.0468
Episode:183 meanR:56.4000 rate:0.0880 gl

Episode:259 meanR:52.3200 rate:0.1060 gloss:0.0153 dloss:1.8199 dlossQ:1.7814 dlossQsigm:0.0385
Episode:260 meanR:52.3900 rate:0.0980 gloss:0.6484 dloss:1.3950 dlossQ:1.3513 dlossQsigm:0.0437
Episode:261 meanR:52.5100 rate:0.1080 gloss:0.0698 dloss:5.2804 dlossQ:5.2062 dlossQsigm:0.0743
Episode:262 meanR:52.7200 rate:0.1440 gloss:0.0061 dloss:2.5068 dlossQ:2.4778 dlossQsigm:0.0290
Episode:263 meanR:52.7700 rate:0.1280 gloss:0.0063 dloss:8.0492 dlossQ:8.0172 dlossQsigm:0.0321
Episode:264 meanR:52.9100 rate:0.1000 gloss:0.0061 dloss:3.4525 dlossQ:3.3719 dlossQsigm:0.0806
Episode:265 meanR:52.8500 rate:0.1040 gloss:0.0036 dloss:2.2825 dlossQ:2.2313 dlossQsigm:0.0511
Episode:266 meanR:52.3600 rate:0.0820 gloss:0.0161 dloss:2.2125 dlossQ:2.1583 dlossQsigm:0.0542
Episode:267 meanR:52.5400 rate:0.1160 gloss:0.0243 dloss:3.2469 dlossQ:3.1790 dlossQsigm:0.0678
Episode:268 meanR:52.1300 rate:0.0880 gloss:0.0073 dloss:3.8881 dlossQ:3.7983 dlossQsigm:0.0898
Episode:269 meanR:52.2600 rate:0.1140 gl

Episode:345 meanR:63.6000 rate:0.1060 gloss:0.0913 dloss:7.5862 dlossQ:7.4782 dlossQsigm:0.1080
Episode:346 meanR:63.8400 rate:0.1240 gloss:0.0471 dloss:4.8284 dlossQ:4.7749 dlossQsigm:0.0535
Episode:347 meanR:63.9900 rate:0.1440 gloss:0.0005 dloss:4.8157 dlossQ:4.7648 dlossQsigm:0.0509
Episode:348 meanR:63.9700 rate:0.0920 gloss:0.0132 dloss:8.8462 dlossQ:8.7287 dlossQsigm:0.1175
Episode:349 meanR:64.0300 rate:0.0880 gloss:0.0187 dloss:4.4033 dlossQ:4.2610 dlossQsigm:0.1423
Episode:350 meanR:63.8800 rate:0.0800 gloss:0.0046 dloss:7.6487 dlossQ:7.5628 dlossQsigm:0.0859
Episode:351 meanR:64.4000 rate:0.1700 gloss:0.0102 dloss:4.2764 dlossQ:4.2218 dlossQsigm:0.0545
Episode:352 meanR:64.5600 rate:0.1260 gloss:0.0125 dloss:6.6457 dlossQ:6.5559 dlossQsigm:0.0898
Episode:353 meanR:64.5200 rate:0.1280 gloss:0.0171 dloss:10.8579 dlossQ:10.8140 dlossQsigm:0.0439
Episode:354 meanR:64.4900 rate:0.1240 gloss:0.0105 dloss:9.8006 dlossQ:9.7120 dlossQsigm:0.0886
Episode:355 meanR:64.4900 rate:0.0780 

Episode:431 meanR:77.0300 rate:0.1780 gloss:0.0896 dloss:5.2641 dlossQ:5.2149 dlossQsigm:0.0492
Episode:432 meanR:78.5600 rate:0.4020 gloss:0.0094 dloss:14.6531 dlossQ:14.5360 dlossQsigm:0.1171
Episode:433 meanR:78.3000 rate:0.1500 gloss:0.0008 dloss:7.3760 dlossQ:7.3146 dlossQsigm:0.0614
Episode:434 meanR:76.1900 rate:0.1280 gloss:0.0336 dloss:16.9764 dlossQ:16.9179 dlossQsigm:0.0585
Episode:435 meanR:76.6900 rate:0.2140 gloss:0.0033 dloss:7.3125 dlossQ:7.2594 dlossQsigm:0.0531
Episode:436 meanR:76.9000 rate:0.1780 gloss:0.0052 dloss:7.1757 dlossQ:7.0913 dlossQsigm:0.0845
Episode:437 meanR:77.5400 rate:0.2520 gloss:0.0157 dloss:3.3632 dlossQ:3.3223 dlossQsigm:0.0409
Episode:438 meanR:78.2400 rate:0.2180 gloss:0.0002 dloss:5.9969 dlossQ:5.9337 dlossQsigm:0.0633
Episode:439 meanR:78.5900 rate:0.1940 gloss:0.0005 dloss:4.3817 dlossQ:4.3242 dlossQsigm:0.0576
Episode:440 meanR:78.5600 rate:0.1220 gloss:0.0028 dloss:6.1899 dlossQ:6.0790 dlossQsigm:0.1109
Episode:441 meanR:79.5200 rate:0.372

Episode:517 meanR:96.2300 rate:0.2040 gloss:0.0177 dloss:8.3284 dlossQ:8.2379 dlossQsigm:0.0906
Episode:518 meanR:96.2200 rate:0.1200 gloss:0.0005 dloss:15.1030 dlossQ:14.9930 dlossQsigm:0.1101
Episode:519 meanR:96.0600 rate:0.1280 gloss:0.0010 dloss:7.2652 dlossQ:7.2379 dlossQsigm:0.0272
Episode:520 meanR:96.5600 rate:0.2080 gloss:0.0094 dloss:6.3734 dlossQ:6.3181 dlossQsigm:0.0552
Episode:521 meanR:97.7100 rate:0.3400 gloss:0.0155 dloss:7.7956 dlossQ:7.7134 dlossQsigm:0.0821
Episode:522 meanR:97.8600 rate:0.1680 gloss:0.0016 dloss:6.7089 dlossQ:6.6504 dlossQsigm:0.0585
Episode:523 meanR:97.8000 rate:0.1280 gloss:0.0007 dloss:15.6870 dlossQ:15.6352 dlossQsigm:0.0517
Episode:524 meanR:98.2800 rate:0.2120 gloss:0.0003 dloss:5.7385 dlossQ:5.6673 dlossQsigm:0.0713
Episode:525 meanR:98.1000 rate:0.1500 gloss:0.0000 dloss:2.3382 dlossQ:2.3085 dlossQsigm:0.0297
Episode:526 meanR:99.4300 rate:0.4000 gloss:0.0008 dloss:11.5360 dlossQ:11.4547 dlossQsigm:0.0813
Episode:527 meanR:98.0300 rate:0.1

Episode:602 meanR:99.5500 rate:0.1560 gloss:0.0000 dloss:8.5958 dlossQ:8.5019 dlossQsigm:0.0940
Episode:603 meanR:99.2000 rate:0.1040 gloss:0.0019 dloss:10.8313 dlossQ:10.6979 dlossQsigm:0.1334
Episode:604 meanR:99.9000 rate:0.2560 gloss:0.0016 dloss:6.5268 dlossQ:6.4679 dlossQsigm:0.0589
Episode:605 meanR:98.7200 rate:0.1360 gloss:0.0001 dloss:5.6171 dlossQ:5.5698 dlossQsigm:0.0472
Episode:606 meanR:98.4800 rate:0.2020 gloss:0.0001 dloss:12.9708 dlossQ:12.8798 dlossQsigm:0.0909
Episode:607 meanR:97.7200 rate:0.1400 gloss:0.0006 dloss:8.4768 dlossQ:8.3967 dlossQsigm:0.0801
Episode:608 meanR:96.9200 rate:0.1600 gloss:0.0000 dloss:6.4930 dlossQ:6.4178 dlossQsigm:0.0752
Episode:609 meanR:96.6600 rate:0.1680 gloss:0.0063 dloss:11.9371 dlossQ:11.8650 dlossQsigm:0.0721
Episode:610 meanR:96.3000 rate:0.1380 gloss:0.0009 dloss:6.2889 dlossQ:6.2538 dlossQsigm:0.0351
Episode:611 meanR:96.3400 rate:0.1760 gloss:0.0004 dloss:6.8410 dlossQ:6.7724 dlossQsigm:0.0687
Episode:612 meanR:94.9900 rate:0.0

Episode:687 meanR:97.3100 rate:0.1940 gloss:0.0128 dloss:18.5430 dlossQ:18.4079 dlossQsigm:0.1351
Episode:688 meanR:96.8300 rate:0.2080 gloss:0.0003 dloss:9.6652 dlossQ:9.5730 dlossQsigm:0.0922
Episode:689 meanR:96.7100 rate:0.1380 gloss:0.0022 dloss:17.3375 dlossQ:17.2099 dlossQsigm:0.1276
Episode:690 meanR:96.8700 rate:0.1400 gloss:0.0007 dloss:18.8161 dlossQ:18.6793 dlossQsigm:0.1368
Episode:691 meanR:97.5500 rate:0.3540 gloss:0.0000 dloss:12.2507 dlossQ:12.1437 dlossQsigm:0.1070
Episode:692 meanR:100.7600 rate:0.7900 gloss:0.0001 dloss:11.5909 dlossQ:11.4807 dlossQsigm:0.1102
Episode:693 meanR:100.8900 rate:0.2000 gloss:0.0074 dloss:17.5782 dlossQ:17.4534 dlossQsigm:0.1248
Episode:694 meanR:101.5000 rate:0.2960 gloss:0.0000 dloss:19.9587 dlossQ:19.8100 dlossQsigm:0.1487
Episode:695 meanR:100.5900 rate:0.1380 gloss:0.0006 dloss:16.2150 dlossQ:16.1324 dlossQsigm:0.0826
Episode:696 meanR:101.0400 rate:0.1880 gloss:0.0005 dloss:14.5490 dlossQ:14.4627 dlossQsigm:0.0863
Episode:697 meanR

## Visualizing training

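Below I'll plot the reward curve over training. The value logged after each episode is the running mean of the total reward over the last 100 episodes (grey); the blue line is a 10-episode rolling average of that curve.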

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def running_mean(x, N):
    # Rolling average over a window of N values, computed from cumulative sums
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
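
As a quick sanity check of the cumulative-sum trick, here is a minimal sketch that runs `running_mean` on a toy series. The `toy_rewards` values are made up for illustration and are not taken from the training run. The result has `len(x) - N + 1` entries, which is why the plotting cells slice the episode axis with `eps[-len(smoothed_arr):]`.

```python
import numpy as np

def running_mean(x, N):
    # Same cumulative-sum trick as in the cell above
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Toy reward series, made up purely for illustration
toy_rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = running_mean(toy_rewards, 2)

print(smoothed)       # window averages: (1+2)/2, (2+3)/2, ...
print(len(smoothed))  # len(toy_rewards) - N + 1
```

Each output value is the mean of one window of `N` consecutive inputs, so the smoothed curve is shorter than the raw one by `N - 1` points.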

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(g_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(d_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

In [None]:
eps, arr = np.array(d_lossR_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses R')

In [None]:
eps, arr = np.array(d_lossQ_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses Q')

## Testing

Let's check out how our trained agent plays the game.

In [34]:
import gym
# CartPole-v1 caps episodes at 500 steps (CartPole-v0 caps them at 200)
env = gym.make('CartPole-v1')
# env = gym.make('CartPole-v0')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Restore the trained weights from the most recent checkpoint
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))

    # Play 100 test episodes
    for _ in range(100):
        state = env.reset()
        total_reward = 0

        # Step until the environment signals the end of the episode
        while True:
            env.render()
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: np.reshape(state, [1, -1])})
            # Act greedily: pick the action with the highest logit
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        print('total_reward: {}'.format(total_reward))
        # Optional early exit once the maximum score is reached
        # if total_reward == 500:
        #     break

# Closing the env
env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0


## Extending this

Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of the small state vector we're using here, though, you'd use convolutional layers to extract the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)
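
To get a feel for how convolutional layers shrink a screen image into a compact state, here is a rough sketch that traces the tensor shapes through a stack of unpadded convolutions. It assumes the input size (four stacked 84x84 grayscale frames) and the layer settings reported in the DQN Nature paper. The helper names `conv_output_size` and `trace_shapes` are hypothetical and only serve this illustration.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a 'valid' (no padding) convolution."""
    return (size - kernel) // stride + 1

# Assumed layer settings (filters, kernel, stride), as reported in the DQN Nature paper
LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

def trace_shapes(height=84, width=84, channels=4, layers=LAYERS):
    """Return the (height, width, channels) shape after each conv layer."""
    shapes = [(height, width, channels)]
    for filters, kernel, stride in layers:
        height = conv_output_size(height, kernel, stride)
        width = conv_output_size(width, kernel, stride)
        channels = filters
        shapes.append((height, width, channels))
    return shapes

shapes = trace_shapes()
final_h, final_w, final_c = shapes[-1]
flat_units = final_h * final_w * final_c  # size of the flattened feature vector

for shape in shapes:
    print(shape)
print('flattened units:', flat_units)
```

The flattened vector plays the role that the four-number Cart-Pole state plays in this notebook: it feeds the fully connected layers that output one Q-value per action.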

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.