# Deep cortical reinforcement learning: Policy gradients + Q-learning + GAN


In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
# Check which TensorFlow version is installed and whether it can see a GPU.
# If no GPU is found, TensorFlow falls back to the CPU.
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included `gym` as a submodule, so you can run `git submodule update --init --recursive` to pull the contents into the `gym` directory.

>**Note:** Once OpenAI Gym is cloned, install it by running `pip install -e gym/[all]`.

In [2]:
import gym

# Create the Cart-Pole game environment.
# CartPole-v1 caps an episode at 500 steps, which matches the 500-point success threshold used in training.
env = gym.make('CartPole-v1')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation. You can see how many actions are possible from `env.action_space`, and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to step through the simulation with random actions. Uncomment `env.render()` if you want to watch it.

In [3]:
env.reset()
batch = []
for _ in range(1111):
    #env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    batch.append([action, state, reward, done, info])
    #print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

To shut the window showing the simulation, use `env.close()`.

In [4]:
# env.close()

If you ran the simulation above, we can look at the collected data, starting with the shape of the states:

In [5]:
# Shape of the first stored state and of the most recent state
batch[0][1].shape, state.shape

((4,), (4,))

In [6]:
import numpy as np
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
infos = np.array([each[4] for each in batch])

In [7]:
# print(rewards[-20:])
print('shapes:', np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print('dtypes:', np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('states:', np.max(np.array(states)), np.min(np.array(states)))
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print('rewards:', np.max(np.array(rewards)), np.min(np.array(rewards)))

shapes: (1111,) (1111, 4) (1111,) (1111,)
dtypes: float64 float64 int64 bool
states: 2.5946314602374336 -2.7744319439056646
actions: 1 0
rewards: 1.0 1.0


In [8]:
actions[:10]

array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1])

In [9]:
rewards[:10]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [10]:
# import numpy as np
def sigmoid(x, derivative=False):
    # With derivative=True, x is expected to already be a sigmoid output s, giving s * (1 - s).
    return x*(1-x) if derivative else 1/(1+np.exp(-x))

In [11]:
sigmoid(np.max(np.array(rewards))), sigmoid(np.min(np.array(rewards)))

(0.7310585786300049, 0.7310585786300049)

In [12]:
print('rewards:', np.max(np.array(rewards))/100, np.min(np.array(rewards))/100)

rewards: 0.01 0.01


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.
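
Since every frame is worth 1.0, the total reward of an episode is simply its length. As a minimal sketch (the helper `episode_returns` and the toy arrays are assumptions for illustration, not part of the notebook), this is how per-episode totals could be recovered from `rewards` and `dones` arrays like the ones collected above:

```python
import numpy as np

def episode_returns(rewards, dones):
    """Sum rewards between episode boundaries marked by done=True."""
    returns, running = [], 0.0
    for r, d in zip(rewards, dones):
        running += r
        if d:
            returns.append(running)
            running = 0.0
    return np.array(returns)

# Toy data: two finished episodes of length 3 and 2, plus one unfinished step.
toy_rewards = np.ones(6)
toy_dones = np.array([False, False, True, False, True, False])
print(episode_returns(toy_rewards, toy_dones))  # [3. 2.]
```

The trailing unfinished step is dropped because its episode never reached `done`.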

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
$$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount factor, and $s'$ is the next state reached from state $s$ by taking action $a$.
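
To make the equation concrete, here is a rough sketch of a single tabular update on a made-up problem with 3 states and 2 actions. The step size `alpha`, the table contents, and the helper `q_update` are assumptions for illustration; only the discount of 0.99 matches the value used later in this notebook.

```python
import numpy as np

gamma = 0.99   # discount factor
alpha = 0.1    # step size, an assumed value for illustration

# Hypothetical toy problem: 3 states, 2 actions, Q-table starts at zero.
Q = np.zeros((3, 2))

def q_update(Q, s, a, r, s_next, done):
    """One Q-learning step: move Q[s, a] toward r + gamma * max_a' Q[s', a']."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return target

Q[1] = [0.5, 2.0]                      # pretend state 1 already has estimates
target = q_update(Q, s=0, a=1, r=1.0, s_next=1, done=False)
print(target, Q[0, 1])                 # target is about 2.98, Q[0, 1] about 0.298
```

The update only moves `Q[0, 1]` a fraction `alpha` of the way toward the target, which is why repeated visits are needed before the table converges.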

Previously we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so ignoring floating-point precision, you practically have infinite states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.
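
A quick back-of-the-envelope calculation shows why a table does not scale. The sketch below assumes a hypothetical discretization in which each of the four state values is split into `n_bins` buckets; the function name and the bin counts are illustrative only.

```python
def q_table_entries(n_bins, state_dims=4, n_actions=2):
    """Number of Q-table cells if every state value is split into n_bins buckets."""
    return (n_bins ** state_dims) * n_actions

for n_bins in (10, 100, 1000):
    print(n_bins, q_table_entries(n_bins))
```

Even a coarse grid of 10 buckets per value needs 20,000 entries, and the count grows with the fourth power of the resolution.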

<img src="assets/deep-q-learning.png" width=450px>

Now the Q-value $Q(s, a)$ is calculated by passing a state into the network. The network uses fully connected hidden layers, and its output is one Q-value for each available action.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.
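
The target and loss computation can be sketched in NumPy for a small batch. All numbers below are made up, and the `(1 - done)` mask is the usual way to drop the future value on terminal steps; this is an illustration of the arithmetic, not the network code itself.

```python
import numpy as np

gamma = 0.99  # discount factor

# Made-up batch of three transitions; the last one ends the episode.
r = np.array([1.0, 1.0, 1.0])
done = np.array([0.0, 0.0, 1.0])
next_q = np.array([[0.2, 0.7],      # Q(s', a') for each next state and action
                   [1.5, 0.3],
                   [0.9, 0.4]])
q_taken = np.array([1.0, 2.0, 1.5])  # Q(s, a) for the action actually taken

# Terminal transitions have no future value, so mask them with (1 - done).
targets = r + gamma * np.max(next_q, axis=1) * (1.0 - done)
loss = np.mean((targets - q_taken) ** 2)
print(targets, loss)
```

Here the third target is exactly the reward, 1.0, because the episode ended on that step.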

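Below is my implementation. Instead of a single Q-network, it uses two recurrent networks: a generator (actor) that outputs action logits and a discriminator (critic) that outputs a Q-value estimate. Each one is a fully connected layer, followed by a GRU, followed by a fully connected output layer. The earlier MLP versions with two hidden layers are kept as commented-out code for reference. Feel free to try them out.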

In [29]:
# Data of the model
def model_input(state_size, lstm_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    # GRU: Gated Recurrent Units
    gru = tf.nn.rnn_cell.GRUCell(lstm_size) # hidden size
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    g_initial_state = cell.zero_state(batch_size, tf.float32) # feedback or lateral/recurrent connection from output
    d_initial_state = cell.zero_state(batch_size, tf.float32) # feedback or lateral/recurrent connection from output
    return states, actions, targetQs, cell, g_initial_state, d_initial_state

In [30]:
# How to use batch-norm
#   x_norm = tf.layers.batch_normalization(x, training=training)

#   # ...

#   update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
#   with tf.control_dependencies(update_ops):
#     train_op = optimizer.minimize(loss)

In [31]:
# training: Either a Python boolean, or a TensorFlow boolean scalar tensor (e.g. a placeholder). 
# Whether to return the output in: 
# training mode (normalized with statistics of the current batch) or 
# inference mode (normalized with moving statistics). 
# NOTE: make sure to set this parameter correctly, or else your training/inference will not work properly.
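
The difference between the two modes can be sketched in plain NumPy. This is a simplified illustration, not the TensorFlow implementation: it leaves out the learned scale and offset, and the helper name `batch_norm` is an assumption. The `momentum` and `eps` defaults mirror the defaults of `tf.layers.batch_normalization`.

```python
import numpy as np

def batch_norm(x, moving_mean, moving_var, training, momentum=0.99, eps=1e-3):
    """Normalize x of shape (batch, features); return output and updated moving stats."""
    if training:
        # Training mode: use the statistics of the current batch and update the moving ones.
        mean, var = x.mean(axis=0), x.var(axis=0)
        moving_mean = momentum * moving_mean + (1 - momentum) * mean
        moving_var = momentum * moving_var + (1 - momentum) * var
    else:
        # Inference mode: use the moving statistics accumulated during training.
        mean, var = moving_mean, moving_var
    return (x - mean) / np.sqrt(var + eps), moving_mean, moving_var

x = np.array([[1.0, 10.0], [3.0, 30.0]])
mm, mv = np.zeros(2), np.ones(2)
y_train, mm, mv = batch_norm(x, mm, mv, training=True)
y_infer, _, _ = batch_norm(x, mm, mv, training=False)
print(y_train)
print(y_infer)
```

After a single update the moving statistics are still far from the batch statistics, so the two outputs differ a lot. That gap is why passing the wrong `training` value can silently break training or inference.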

In [32]:
# MLP & Conv
# # Generator/Controller: Generating/predicting the actions
# def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
#     with tf.variable_scope('generator', reuse=reuse):
#         # First fully connected layer
#         h1 = tf.layers.dense(inputs=states, units=hidden_size)
#         bn1 = tf.layers.batch_normalization(h1, training=training)        
#         nl1 = tf.maximum(alpha * bn1, bn1)
        
#         # Second fully connected layer
#         h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
#         bn2 = tf.layers.batch_normalization(h2, training=training)        
#         nl2 = tf.maximum(alpha * bn2, bn2)
        
#         # Output layer
#         logits = tf.layers.dense(inputs=nl2, units=action_size)        
#         #predictions = tf.nn.softmax(logits)

#         # return actions logits
#         return logits

In [33]:
# RNN generator or sequence generator
def generator(states, initial_state, cell, lstm_size, num_classes, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state
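
The two reshapes in `generator` treat the `N` rows that are fed in as `N` consecutive time steps of a single sequence, which is consistent with the initial state having a batch size of 1. A small NumPy sketch of the same shape bookkeeping, with assumed toy sizes:

```python
import numpy as np

N, H = 5, 3  # 5 time steps, 3 features (small assumed sizes)
inputs = np.arange(N * H, dtype=np.float32).reshape(N, H)

inputs_rnn = inputs.reshape(1, -1, H)    # NxH -> 1xNxH: one sequence of N steps
outputs = inputs_rnn.reshape(-1, H)      # 1xNxH -> NxH: back to one row per step

print(inputs_rnn.shape, outputs.shape)   # (1, 5, 3) (5, 3)
```

Because both reshapes keep the row order, row `i` of the output still corresponds to step `i` of the input.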

In [34]:
# MLP & Conv
# # Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
# def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
#     with tf.variable_scope('discriminator', reuse=reuse):
#         # Fusion/merge states and actions/ SA/ SM
#         x_fused = tf.concat(axis=1, values=[states, actions])
        
#         # First fully connected layer
#         h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
#         bn1 = tf.layers.batch_normalization(h1, training=training)        
#         nl1 = tf.maximum(alpha * bn1, bn1)
        
#         # Second fully connected layer
#         h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
#         bn2 = tf.layers.batch_normalization(h2, training=training)        
#         nl2 = tf.maximum(alpha * bn2, bn2)
        
#         # Output layer
#         logits = tf.layers.dense(inputs=nl2, units=1)        
#         #predictions = tf.nn.softmax(logits)

#         # return rewards logits
#         return logits

In [35]:
# RNN discriminator: scores a sequence of (state, action) pairs
def discriminator(states, actions, initial_state, cell, lstm_size, reuse=False): 
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        inputs = tf.layers.dense(inputs=x_fused, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=1)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the Q-value (reward) logits, one per step
        return logits, final_state

In [36]:
def model_loss(action_size, hidden_size, states, actions, targetQs,
               cell, g_initial_state, d_initial_state):
    # G/Actor
    #actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_logits, g_final_state = generator(states=states, num_classes=action_size, 
                                              cell=cell, initial_state=g_initial_state, lstm_size=hidden_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                      labels=actions_labels)
    g_loss = tf.reduce_mean(neg_log_prob_actions * targetQs)
    
    # D/Critic
    #Qs_logits = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    Qs_logits, d_final_state = discriminator(states=states, actions=actions_logits, 
                                             cell=cell, initial_state=d_initial_state, lstm_size=hidden_size)
    d_loss = tf.reduce_mean(tf.square(tf.reshape(Qs_logits, [-1]) - targetQs))
    # d_lossQ_sigm = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=tf.reshape(Qs_logits, [-1]),
    #                                                                       labels=tf.nn.sigmoid(targetQs)))
    #d_loss = d_lossQ_sigm + d_lossQ

    return actions_logits, Qs_logits, g_final_state, d_final_state, g_loss, d_loss
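
The generator loss is a softmax cross-entropy on the taken actions, weighted by `targetQs`. The NumPy sketch below reproduces that arithmetic on made-up numbers; the helper `softmax_cross_entropy` is a stand-in written for this example, not a TensorFlow call.

```python
import numpy as np

def softmax_cross_entropy(logits, labels_one_hot):
    """Row-wise -sum(labels * log_softmax(logits)), computed stably."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels_one_hot * log_probs).sum(axis=1)

logits = np.array([[2.0, 0.0], [0.0, 0.0]])   # made-up action logits
taken = np.array([0, 1])                      # actions that were taken
target_qs = np.array([1.5, -0.5])             # made-up per-step weights

labels = np.eye(2)[taken]                     # one-hot labels, like tf.one_hot
neg_log_prob = softmax_cross_entropy(logits, labels)
g_loss_value = np.mean(neg_log_prob * target_qs)
print(neg_log_prob, g_loss_value)
```

Note that the weighted loss is not bounded below by zero: a negative target makes its term negative, which is one way a negative `gloss` can appear in a training log.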

In [37]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    """
    Get optimization operations in order
    :param g_loss: Generator loss Tensor for action prediction
    :param d_loss: Discriminator loss Tensor for reward prediction for generated/prob/logits action
    :param learning_rate: Learning rate for the Adam optimizers (a float or a placeholder)
    :return: A tuple of (generator training op, discriminator training op)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize RNN
    # g_grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(g_loss, g_vars), clip_norm=5) # usually around 1-5
    # d_grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(d_loss, d_vars), clip_norm=5) # usually around 1-5
    g_grads=tf.gradients(g_loss, g_vars)
    d_grads=tf.gradients(d_loss, d_vars)
    g_opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(g_grads, g_vars))
    d_opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(d_grads, d_vars))
    
    # # Optimize MLP & CNN
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    #     g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
    #     d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt
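
The commented-out lines in `model_opt` mention clipping gradients by their global norm, a common safeguard when training RNNs. A minimal NumPy sketch of that rule follows; the function here is a hand-written illustration with toy gradients, not the TensorFlow op.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale every gradient by clip_norm / max(global_norm, clip_norm)."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]   # global norm = 13
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm, clipped_norm)
```

All gradients are scaled by the same factor, so their relative directions are preserved. Gradients whose global norm is already below `clip_norm` pass through unchanged.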

In [38]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cell, self.g_initial_state, self.d_initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_final_state, self.d_final_state, self.g_loss, self.d_loss = model_loss(
            action_size=action_size, hidden_size=hidden_size,
            states=self.states, actions=self.actions, cell=cell, targetQs=self.targetQs,
            g_initial_state=self.g_initial_state, d_initial_state=self.d_initial_state)
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [39]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)
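
The `Memory` class relies on `deque(maxlen=...)` to discard the oldest experience once the buffer is full. The sketch below shows that behavior together with a random `sample` method, which the class above does not have. `SamplingMemory` and its methods are hypothetical additions for illustration. Random sampling also breaks temporal order, so it may not suit the recurrent model used here, which is fed the buffer in sequence.

```python
from collections import deque
import numpy as np

class SamplingMemory:
    """Hypothetical variant of Memory that can return a random mini-batch."""
    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Pick distinct positions in the buffer uniformly at random.
        idx = np.random.choice(len(self.buffer), size=batch_size, replace=False)
        return [self.buffer[i] for i in idx]

mem = SamplingMemory(max_size=3)
for i in range(5):
    mem.add(i)
print(list(mem.buffer))  # [2, 3, 4]
mini_batch = mem.sample(2)
```

With `max_size=3`, the first two items are evicted as soon as the fourth and fifth arrive.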

## Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [40]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1111, 4) actions:(1111,)
action size:2


In [45]:
# Training parameters
# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
action_size = 2                # number of units for the output actions -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation
batch_size = 128               # number of samples in the memory/ experience as mini-batch size
learning_rate = 0.001          # learning rate for adam

In [46]:
# Reset/init the graph/session
tf.reset_default_graph()  # returns None, so fetch the fresh default graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 2)
(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 1)


In [47]:
state = env.reset()
for _ in range(batch_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done:
        state = env.reset()

In [48]:
memory.buffer[0]

[array([ 0.0029129 , -0.00483114, -0.01933036,  0.0234201 ]),
 0,
 array([ 0.00281628, -0.19967062, -0.01886196,  0.30994194]),
 1.0,
 0.0]

## Training the model

Below we'll train our agent. If you want to watch it train, add an `env.render()` call inside the step loop. This slows things down because the frames render more slowly than the network can train. But it's cool to watch the agent get better at the game.

In [None]:
from collections import deque
episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
saver = tf.train.Saver()
rewards_list, g_loss_list, d_loss_list = [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    for ep in range(11111):
        batch = [] # every data batch
        total_reward = 0
        state = env.reset() # env first state
        g_initial_state = sess.run(model.g_initial_state)
        d_initial_state = sess.run(model.d_initial_state)

        # Training steps/batches
        while True:
            # Testing/inference
            action_logits, g_final_state, d_final_state = sess.run(
                fetches=[model.actions_logits, model.g_final_state, model.d_final_state], 
                feed_dict={model.states: np.reshape(state, [1, -1]),
                           model.g_initial_state: g_initial_state,
                           model.d_initial_state: d_initial_state})
            action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append([g_initial_state, g_final_state,
                                  d_initial_state, d_final_state])
            total_reward += reward
            g_initial_state = g_final_state
            d_initial_state = d_final_state
            state = next_state
            
            # Training
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            rnn_states = memory.states
            g_initial_states = np.array([each[0] for each in rnn_states])
            g_final_states = np.array([each[1] for each in rnn_states])
            d_initial_states = np.array([each[2] for each in rnn_states])
            d_final_states = np.array([each[3] for each in rnn_states])
            nextQs_logits = sess.run(fetches = model.Qs_logits,
                                     feed_dict = {model.states: next_states, 
                                                  model.g_initial_state: g_final_states[0].reshape([1, -1]),
                                                  model.d_initial_state: d_final_states[0].reshape([1, -1])})
            nextQs = nextQs_logits.reshape([-1]) * (1-dones)
            targetQs = rewards + (0.99 * nextQs)
            g_loss, d_loss, _, _ = sess.run(
                fetches=[model.g_loss, model.d_loss, model.g_opt, model.d_opt], 
                feed_dict = {model.states: states, 
                             model.actions: actions,
                             model.targetQs: targetQs,
                             model.g_initial_state: g_initial_states[0].reshape([1, -1]),
                             model.d_initial_state: d_initial_states[0].reshape([1, -1])})

            if done:
                break

        # Episode total reward and success rate/prob
        episode_reward.append(total_reward) # stopping criteria
        rate = total_reward/ 500 # success is 500 points: 0-1
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'rate:{:.4f}'.format(rate),
              'gloss:{:.4f}'.format(g_loss),
              'dloss:{:.4f}'.format(d_loss))
        # Plotting out
        rewards_list.append([ep, np.mean(episode_reward)])
        g_loss_list.append([ep, g_loss])
        d_loss_list.append([ep, d_loss])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model-seq-Copy1.ckpt')

Episode:0 meanR:50.0000 rate:0.1000 gloss:1.1889 dloss:1.0007
Episode:1 meanR:35.5000 rate:0.0420 gloss:1.2996 dloss:1.0501
Episode:2 meanR:37.0000 rate:0.0800 gloss:1.9988 dloss:1.2875
Episode:3 meanR:34.5000 rate:0.0540 gloss:1.7074 dloss:1.7590
Episode:4 meanR:38.2000 rate:0.1060 gloss:1.0673 dloss:2.4366
Episode:5 meanR:38.5000 rate:0.0800 gloss:0.7713 dloss:4.0421
Episode:6 meanR:37.2857 rate:0.0600 gloss:2.2937 dloss:2.1546
Episode:7 meanR:38.5000 rate:0.0940 gloss:0.9863 dloss:3.5994
Episode:8 meanR:39.4444 rate:0.0940 gloss:0.1859 dloss:3.4468
Episode:9 meanR:40.2000 rate:0.0940 gloss:-14.5617 dloss:4.5379
Episode:10 meanR:39.7273 rate:0.0700 gloss:1.6908 dloss:6.8889
Episode:11 meanR:38.1667 rate:0.0420 gloss:13.8402 dloss:5.0318
Episode:12 meanR:37.0769 rate:0.0480 gloss:10.7555 dloss:3.4873
Episode:13 meanR:42.7143 rate:0.2320 gloss:1.7979 dloss:4.0999
Episode:14 meanR:42.1333 rate:0.0680 gloss:2.2200 dloss:7.4854
Episode:15 meanR:42.0000 rate:0.0800 gloss:2.7607 dloss:9.207

Episode:129 meanR:90.0000 rate:0.1020 gloss:0.0094 dloss:10.9600
Episode:130 meanR:89.5500 rate:0.1180 gloss:0.0067 dloss:8.5929
Episode:131 meanR:89.1300 rate:0.1600 gloss:0.0005 dloss:7.9039
Episode:132 meanR:88.6800 rate:0.1220 gloss:0.0221 dloss:11.0543
Episode:133 meanR:88.5700 rate:0.0900 gloss:0.0173 dloss:12.3298
Episode:134 meanR:88.0600 rate:0.0960 gloss:0.0134 dloss:11.6484
Episode:135 meanR:87.9800 rate:0.1120 gloss:0.0205 dloss:7.2780
Episode:136 meanR:87.5900 rate:0.0980 gloss:0.0039 dloss:3.8217
Episode:137 meanR:87.4400 rate:0.0860 gloss:0.0052 dloss:2.3314
Episode:138 meanR:86.9500 rate:0.1160 gloss:0.0110 dloss:5.8355
Episode:139 meanR:86.6000 rate:0.1240 gloss:0.0037 dloss:5.3905
Episode:140 meanR:85.6500 rate:0.0920 gloss:0.0017 dloss:4.9117
Episode:141 meanR:85.3500 rate:0.1180 gloss:0.0017 dloss:8.7412
Episode:142 meanR:84.9400 rate:0.1440 gloss:0.0005 dloss:7.6760
Episode:143 meanR:83.9700 rate:0.1280 gloss:0.0003 dloss:4.5228
Episode:144 meanR:84.1800 rate:0.214

Episode:256 meanR:47.4200 rate:0.0200 gloss:0.0000 dloss:5.0811
Episode:257 meanR:46.8400 rate:0.0200 gloss:0.0000 dloss:4.7868
Episode:258 meanR:46.3200 rate:0.0180 gloss:0.0000 dloss:4.6030
Episode:259 meanR:45.9200 rate:0.0180 gloss:0.0000 dloss:5.2286
Episode:260 meanR:44.8800 rate:0.0180 gloss:0.0000 dloss:4.9717
Episode:261 meanR:44.4200 rate:0.0180 gloss:0.0000 dloss:4.5539
Episode:262 meanR:44.0800 rate:0.0200 gloss:0.0000 dloss:4.3657
Episode:263 meanR:43.3700 rate:0.0180 gloss:0.0000 dloss:4.0544
Episode:264 meanR:42.9300 rate:0.0200 gloss:0.0000 dloss:3.8939
Episode:265 meanR:42.6100 rate:0.0160 gloss:0.0000 dloss:3.6479
Episode:266 meanR:42.0600 rate:0.0160 gloss:0.0000 dloss:3.5703
Episode:267 meanR:41.2900 rate:0.0180 gloss:0.0000 dloss:3.5509
Episode:268 meanR:40.8200 rate:0.0180 gloss:0.0000 dloss:3.3464
Episode:269 meanR:40.2700 rate:0.0180 gloss:0.0000 dloss:3.3325
Episode:270 meanR:39.9100 rate:0.0160 gloss:0.0000 dloss:3.3913
Episode:271 meanR:39.1600 rate:0.0200 gl

Episode:386 meanR:9.3900 rate:0.0200 gloss:0.0000 dloss:0.1453
Episode:387 meanR:9.4000 rate:0.0200 gloss:0.0000 dloss:0.1305
Episode:388 meanR:9.4000 rate:0.0200 gloss:0.0000 dloss:0.1334
Episode:389 meanR:9.3900 rate:0.0180 gloss:0.0000 dloss:0.0920
Episode:390 meanR:9.3900 rate:0.0180 gloss:0.0000 dloss:0.1502
Episode:391 meanR:9.3900 rate:0.0180 gloss:0.0000 dloss:0.1824
Episode:392 meanR:9.3700 rate:0.0160 gloss:0.0000 dloss:0.3044
Episode:393 meanR:9.3700 rate:0.0180 gloss:0.0000 dloss:0.0953
Episode:394 meanR:9.3600 rate:0.0160 gloss:0.0000 dloss:0.1574
Episode:395 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:0.1314
Episode:396 meanR:9.3700 rate:0.0160 gloss:0.0000 dloss:0.0753
Episode:397 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:0.0772
Episode:398 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:0.1752
Episode:399 meanR:9.3800 rate:0.0200 gloss:0.0000 dloss:0.2303
Episode:400 meanR:9.3800 rate:0.0200 gloss:0.0000 dloss:0.1478
Episode:401 meanR:9.3800 rate:0.0180 gloss:0.0000 dloss

Episode:517 meanR:9.2700 rate:0.0200 gloss:0.0000 dloss:1.1963
Episode:518 meanR:9.2700 rate:0.0180 gloss:0.0000 dloss:1.1663
Episode:519 meanR:9.2700 rate:0.0160 gloss:0.0000 dloss:1.2581
Episode:520 meanR:9.2800 rate:0.0180 gloss:0.0000 dloss:1.2947
Episode:521 meanR:9.2800 rate:0.0180 gloss:0.0000 dloss:1.3663
Episode:522 meanR:9.2700 rate:0.0180 gloss:0.0000 dloss:1.3952
Episode:523 meanR:9.2700 rate:0.0180 gloss:0.0000 dloss:1.4859
Episode:524 meanR:9.2600 rate:0.0180 gloss:0.0000 dloss:1.4436
Episode:525 meanR:9.2700 rate:0.0200 gloss:0.0000 dloss:1.3865
Episode:526 meanR:9.2800 rate:0.0200 gloss:0.0000 dloss:1.2913
Episode:527 meanR:9.2700 rate:0.0160 gloss:0.0000 dloss:1.3231
Episode:528 meanR:9.2900 rate:0.0220 gloss:0.0000 dloss:1.3959
Episode:529 meanR:9.2900 rate:0.0200 gloss:0.0000 dloss:1.3987
Episode:530 meanR:9.2900 rate:0.0200 gloss:0.0000 dloss:1.2577
Episode:531 meanR:9.2800 rate:0.0180 gloss:0.0000 dloss:1.2908
Episode:532 meanR:9.2700 rate:0.0180 gloss:0.0000 dloss

Episode:648 meanR:9.3200 rate:0.0180 gloss:0.0000 dloss:1.2954
Episode:649 meanR:9.3300 rate:0.0200 gloss:0.0000 dloss:1.4238
Episode:650 meanR:9.3100 rate:0.0160 gloss:0.0000 dloss:1.3507
Episode:651 meanR:9.3100 rate:0.0180 gloss:0.0000 dloss:1.4796
Episode:652 meanR:9.3100 rate:0.0180 gloss:0.0000 dloss:1.3911
Episode:653 meanR:9.3100 rate:0.0200 gloss:0.0000 dloss:1.4590
Episode:654 meanR:9.3100 rate:0.0200 gloss:0.0000 dloss:1.4616
Episode:655 meanR:9.3100 rate:0.0200 gloss:0.0000 dloss:1.4422
Episode:656 meanR:9.3200 rate:0.0200 gloss:0.0000 dloss:1.3082
Episode:657 meanR:9.3100 rate:0.0180 gloss:0.0000 dloss:1.5117
Episode:658 meanR:9.3200 rate:0.0200 gloss:0.0000 dloss:1.3379
Episode:659 meanR:9.3400 rate:0.0200 gloss:0.0000 dloss:1.4419
Episode:660 meanR:9.3400 rate:0.0180 gloss:0.0000 dloss:1.4453
Episode:661 meanR:9.3500 rate:0.0200 gloss:0.0000 dloss:1.3437
Episode:662 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:1.4087
Episode:663 meanR:9.3900 rate:0.0200 gloss:0.0000 dloss

Episode:779 meanR:9.4000 rate:0.0180 gloss:0.0000 dloss:1.4995
Episode:780 meanR:9.3900 rate:0.0180 gloss:0.0000 dloss:1.4626
Episode:781 meanR:9.3900 rate:0.0200 gloss:0.0000 dloss:1.5155
Episode:782 meanR:9.3900 rate:0.0180 gloss:0.0000 dloss:1.5611
Episode:783 meanR:9.3900 rate:0.0180 gloss:0.0000 dloss:1.5519
Episode:784 meanR:9.3900 rate:0.0200 gloss:0.0000 dloss:1.4782
Episode:785 meanR:9.4000 rate:0.0180 gloss:0.0000 dloss:1.4981
Episode:786 meanR:9.4100 rate:0.0180 gloss:0.0000 dloss:1.5238
Episode:787 meanR:9.4000 rate:0.0180 gloss:0.0000 dloss:1.5904
Episode:788 meanR:9.3900 rate:0.0180 gloss:0.0000 dloss:1.5422
Episode:789 meanR:9.3900 rate:0.0180 gloss:0.0000 dloss:1.5379
Episode:790 meanR:9.3800 rate:0.0160 gloss:0.0000 dloss:1.6298
Episode:791 meanR:9.3800 rate:0.0200 gloss:0.0000 dloss:1.4623
Episode:792 meanR:9.3700 rate:0.0180 gloss:0.0000 dloss:1.5118
Episode:793 meanR:9.3500 rate:0.0160 gloss:0.0000 dloss:1.5302
Episode:794 meanR:9.3500 rate:0.0200 gloss:0.0000 dloss

Episode:910 meanR:9.5000 rate:0.0180 gloss:0.0000 dloss:0.0609
Episode:911 meanR:9.5000 rate:0.0200 gloss:0.0000 dloss:0.0854
Episode:912 meanR:9.4900 rate:0.0160 gloss:0.0000 dloss:0.0921
Episode:913 meanR:9.4900 rate:0.0180 gloss:0.0000 dloss:0.0721
Episode:914 meanR:9.4800 rate:0.0180 gloss:0.0000 dloss:0.0782
Episode:915 meanR:9.4800 rate:0.0200 gloss:0.0000 dloss:0.0618
Episode:916 meanR:9.4700 rate:0.0180 gloss:0.0000 dloss:0.0758
Episode:917 meanR:9.4700 rate:0.0180 gloss:0.0000 dloss:0.0681
Episode:918 meanR:9.4600 rate:0.0180 gloss:0.0000 dloss:0.0646
Episode:919 meanR:9.4700 rate:0.0200 gloss:0.0000 dloss:0.0744
Episode:920 meanR:9.4600 rate:0.0180 gloss:0.0000 dloss:0.0683
Episode:921 meanR:9.4500 rate:0.0180 gloss:0.0000 dloss:0.0564
Episode:922 meanR:9.4600 rate:0.0180 gloss:0.0000 dloss:0.0704
Episode:923 meanR:9.4700 rate:0.0200 gloss:0.0000 dloss:0.0514
Episode:924 meanR:9.4800 rate:0.0200 gloss:0.0000 dloss:0.0849
Episode:925 meanR:9.5000 rate:0.0220 gloss:0.0000 dloss

Episode:1040 meanR:9.3700 rate:0.0180 gloss:0.0000 dloss:2.3662
Episode:1041 meanR:9.3500 rate:0.0160 gloss:0.0000 dloss:3.0878
Episode:1042 meanR:9.3400 rate:0.0180 gloss:0.0000 dloss:4.1870
Episode:1043 meanR:9.3300 rate:0.0160 gloss:0.0000 dloss:4.7223
Episode:1044 meanR:9.3400 rate:0.0220 gloss:0.0000 dloss:4.6748
Episode:1045 meanR:9.3400 rate:0.0180 gloss:0.0000 dloss:5.1460
Episode:1046 meanR:9.3500 rate:0.0200 gloss:0.0000 dloss:3.7156
Episode:1047 meanR:9.3600 rate:0.0200 gloss:0.0000 dloss:3.3249
Episode:1048 meanR:9.3500 rate:0.0180 gloss:0.0000 dloss:3.7410
Episode:1049 meanR:9.3600 rate:0.0200 gloss:0.0000 dloss:3.3688
Episode:1050 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:3.1476
Episode:1051 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:3.0425
Episode:1052 meanR:9.3700 rate:0.0180 gloss:0.0000 dloss:0.7532
Episode:1053 meanR:9.3900 rate:0.0220 gloss:0.0000 dloss:0.7724
Episode:1054 meanR:9.3900 rate:0.0200 gloss:0.0000 dloss:0.7251
Episode:1055 meanR:9.4100 rate:0.0200 gl

Episode:1169 meanR:9.2100 rate:0.0180 gloss:0.0000 dloss:5.4490
Episode:1170 meanR:9.1900 rate:0.0160 gloss:0.0000 dloss:5.8019
Episode:1171 meanR:9.1800 rate:0.0180 gloss:0.0000 dloss:5.4627
Episode:1172 meanR:9.2000 rate:0.0200 gloss:0.0000 dloss:5.4283
Episode:1173 meanR:9.2100 rate:0.0200 gloss:0.0000 dloss:5.1383
Episode:1174 meanR:9.2100 rate:0.0180 gloss:0.0000 dloss:5.3684
Episode:1175 meanR:9.2000 rate:0.0180 gloss:0.0000 dloss:5.3963
Episode:1176 meanR:9.2000 rate:0.0180 gloss:0.0000 dloss:5.3668
Episode:1177 meanR:9.2100 rate:0.0180 gloss:0.0000 dloss:5.4246
Episode:1178 meanR:9.2300 rate:0.0200 gloss:0.0000 dloss:5.3412
Episode:1179 meanR:9.2600 rate:0.0220 gloss:0.0000 dloss:5.4448
Episode:1180 meanR:9.2500 rate:0.0180 gloss:0.0000 dloss:5.0851
Episode:1181 meanR:9.2600 rate:0.0200 gloss:0.0000 dloss:5.3569
Episode:1182 meanR:9.2600 rate:0.0180 gloss:0.0000 dloss:5.3794
Episode:1183 meanR:9.2600 rate:0.0200 gloss:0.0000 dloss:5.4096
Episode:1184 meanR:9.2500 rate:0.0180 gl

Episode:1298 meanR:9.2800 rate:0.0200 gloss:0.0000 dloss:9.9294
Episode:1299 meanR:9.2900 rate:0.0200 gloss:0.0000 dloss:10.6975
Episode:1300 meanR:9.2800 rate:0.0180 gloss:0.0000 dloss:10.6659
Episode:1301 meanR:9.2800 rate:0.0200 gloss:0.0000 dloss:10.6660
Episode:1302 meanR:9.2900 rate:0.0200 gloss:0.0000 dloss:10.2760
Episode:1303 meanR:9.2900 rate:0.0200 gloss:0.0000 dloss:10.3730
Episode:1304 meanR:9.2800 rate:0.0180 gloss:0.0000 dloss:10.9089
Episode:1305 meanR:9.2900 rate:0.0200 gloss:0.0000 dloss:10.8583
Episode:1306 meanR:9.2900 rate:0.0180 gloss:0.0000 dloss:11.0687
Episode:1307 meanR:9.2900 rate:0.0180 gloss:0.0000 dloss:11.2750
Episode:1308 meanR:9.2900 rate:0.0200 gloss:0.0000 dloss:11.5911
Episode:1309 meanR:9.2800 rate:0.0180 gloss:0.0000 dloss:11.6650
Episode:1310 meanR:9.2800 rate:0.0180 gloss:0.0000 dloss:11.0324
Episode:1311 meanR:9.2800 rate:0.0180 gloss:0.0000 dloss:11.6184
Episode:1312 meanR:9.2800 rate:0.0180 gloss:0.0000 dloss:12.1056
Episode:1313 meanR:9.2800 

Episode:1425 meanR:9.3900 rate:0.0200 gloss:0.0000 dloss:7.0276
Episode:1426 meanR:9.3800 rate:0.0180 gloss:0.0000 dloss:7.3893
Episode:1427 meanR:9.3800 rate:0.0180 gloss:0.0000 dloss:7.1289
Episode:1428 meanR:9.3800 rate:0.0200 gloss:0.0000 dloss:7.2078
Episode:1429 meanR:9.3800 rate:0.0180 gloss:0.0000 dloss:7.2028
Episode:1430 meanR:9.3700 rate:0.0180 gloss:0.0000 dloss:7.2071
Episode:1431 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:7.2076
Episode:1432 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:7.2085
Episode:1433 meanR:9.3600 rate:0.0180 gloss:0.0000 dloss:7.2080
Episode:1434 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:7.2060
Episode:1435 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:7.2047
Episode:1436 meanR:9.3700 rate:0.0180 gloss:0.0000 dloss:7.2162
Episode:1437 meanR:9.3800 rate:0.0200 gloss:0.0000 dloss:7.2130
Episode:1438 meanR:9.3700 rate:0.0180 gloss:0.0000 dloss:7.2172
Episode:1439 meanR:9.3800 rate:0.0180 gloss:0.0000 dloss:7.2190
Episode:1440 meanR:9.3900 rate:0.0200 gl

Episode:1554 meanR:9.3500 rate:0.0180 gloss:0.0000 dloss:7.4485
Episode:1555 meanR:9.3400 rate:0.0180 gloss:0.0000 dloss:7.3785
Episode:1556 meanR:9.3300 rate:0.0180 gloss:0.0000 dloss:7.7988
Episode:1557 meanR:9.3400 rate:0.0180 gloss:0.0000 dloss:7.7128
Episode:1558 meanR:9.3600 rate:0.0200 gloss:0.0000 dloss:7.1900
Episode:1559 meanR:9.3600 rate:0.0200 gloss:0.0000 dloss:7.1713
Episode:1560 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:7.2043
Episode:1561 meanR:9.3800 rate:0.0200 gloss:0.0000 dloss:7.1167
Episode:1562 meanR:9.3900 rate:0.0200 gloss:0.0000 dloss:7.0804
Episode:1563 meanR:9.3900 rate:0.0180 gloss:0.0000 dloss:7.0665
Episode:1564 meanR:9.4000 rate:0.0200 gloss:0.0000 dloss:7.0837
Episode:1565 meanR:9.3900 rate:0.0180 gloss:0.0000 dloss:7.1472
Episode:1566 meanR:9.3900 rate:0.0180 gloss:0.0000 dloss:7.1186
Episode:1567 meanR:9.3900 rate:0.0200 gloss:0.0000 dloss:7.1462
Episode:1568 meanR:9.4000 rate:0.0180 gloss:0.0000 dloss:7.1813
Episode:1569 meanR:9.4000 rate:0.0180 gl

Episode:1681 meanR:9.3500 rate:0.0200 gloss:0.0000 dloss:15.6069
Episode:1682 meanR:9.3500 rate:0.0180 gloss:0.0000 dloss:14.1269
Episode:1683 meanR:9.3500 rate:0.0180 gloss:0.0000 dloss:13.8003
Episode:1684 meanR:9.3600 rate:0.0180 gloss:0.0000 dloss:13.2397
Episode:1685 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:12.5294
Episode:1686 meanR:9.3700 rate:0.0180 gloss:0.0000 dloss:11.9275
Episode:1687 meanR:9.3500 rate:0.0160 gloss:0.0000 dloss:11.4321
Episode:1688 meanR:9.3600 rate:0.0200 gloss:0.0000 dloss:10.8445
Episode:1689 meanR:9.3600 rate:0.0200 gloss:0.0000 dloss:10.3464
Episode:1690 meanR:9.3700 rate:0.0200 gloss:0.0000 dloss:9.9328
Episode:1691 meanR:9.3700 rate:0.0180 gloss:0.0000 dloss:9.5186
Episode:1692 meanR:9.3800 rate:0.0200 gloss:0.0000 dloss:9.1061
Episode:1693 meanR:9.3800 rate:0.0180 gloss:0.0000 dloss:8.8482
Episode:1694 meanR:9.3700 rate:0.0160 gloss:0.0000 dloss:9.1096
Episode:1695 meanR:9.3500 rate:0.0180 gloss:0.0000 dloss:8.4553
Episode:1696 meanR:9.3400 rate:

## Visualizing training

Below I'll plot the total reward for each episode in grey, along with a rolling average over 10 episodes in blue. The same kind of plot follows for the G and D losses.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def running_mean(x, N):
    # Rolling average over a window of N values, computed from a cumulative sum
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
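
To see what `running_mean` is doing, here is a small worked example on made-up numbers (the `rewards` values are an assumption for illustration, not results from training). It checks the cumulative-sum trick against a plain windowed mean and shows that the output is `N - 1` entries shorter than the input, which is why the plotting cells slice the episode axis with `eps[-len(smoothed_arr):]`.

```python
import numpy as np

def running_mean(x, N):
    # Same cumulative-sum trick as the cell above
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Made-up rewards, purely for illustration
rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
N = 3

smoothed = running_mean(rewards, N)

# Direct (slower) version: average each window of N consecutive values
direct = np.array([rewards[i:i + N].mean() for i in range(len(rewards) - N + 1)])

print(smoothed)                     # [2. 3. 4.]
print(len(rewards), len(smoothed))  # 5 3
```

The first smoothed value covers episodes 0 to `N - 1`, so it lines up with the last episode in that window when plotted.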

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(g_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(d_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

In [None]:
eps, arr = np.array(d_lossR_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses R')

In [None]:
eps, arr = np.array(d_lossQ_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses Q')

## Testing

Let's check out how our trained agent plays the game.

In [34]:
import gym
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Episodes/epochs
    for _ in range(100):
    #while True:
        state = env.reset()
        total_reward = 0

        # Steps/batches
        #for _ in range(111111111111111111):
        while True:
            env.render()
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: np.reshape(state, [1, -1])})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        # Print and break condition
        print('total_reward: {}'.format(total_reward))
        # if total_reward == 500:
        #     break
                
# Closing the env
env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
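
The test loop above always takes the action with the largest logit, with no random exploration. Here is a minimal sketch of that greedy rollout as a standalone function, so you can try it without TensorFlow or Gym. `run_episode`, `greedy_action`, `CountdownEnv` and `dummy_logits` are hypothetical names I made up for this sketch; the stand-in environment only mimics the `reset`/`step` interface of the Gym version used in this notebook, where `step` returns four values.

```python
import numpy as np

def greedy_action(action_logits):
    """Pick the highest-scoring action, like np.argmax in the test loop."""
    return int(np.argmax(action_logits))

def run_episode(env, logits_fn, max_steps=500):
    """Roll out one episode greedily and return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # Same reshape as the notebook: a batch containing one state
        logits = logits_fn(np.reshape(state, [1, -1]))
        state, reward, done, _ = env.step(greedy_action(logits))
        total_reward += reward
        if done:
            break
    return total_reward

class CountdownEnv:
    """Hypothetical stand-in env: reward 1 per step, done after `horizon` steps."""
    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4)

    def step(self, action):
        self.t += 1
        return np.zeros(4), 1.0, self.t >= self.horizon, {}

def dummy_logits(states):
    # Placeholder for the network: always prefers action 1
    return np.array([[0.1, 0.9]])

total = run_episode(CountdownEnv(horizon=5), dummy_logits)
print('total_reward: {}'.format(total))  # total_reward: 5.0
```

With the trained network, `logits_fn` would be something like `lambda s: sess.run(model.actions_logits, feed_dict={model.states: s})`, and `env` would be the real Cart-Pole environment.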


## Extending this

Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of the four-number state vector we're using here, though, you'd use convolutional layers to extract the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)
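
Before screen images go into convolutional layers, they are usually shrunk and stacked so the network can see motion. Below is a rough NumPy sketch of that idea, under my own assumptions: frames are 210x160 RGB arrays (the Atari screen size), grayscale is a plain channel average, downsampling just keeps every second pixel, and the last four frames are stacked. `preprocess_frame` and `FrameStack` are hypothetical helpers, and the random frame stands in for a real screen. The paper linked below uses its own preprocessing, so treat this as a starting point rather than a reproduction.

```python
import numpy as np
from collections import deque

def preprocess_frame(frame):
    """RGB uint8 frame (H, W, 3) -> grayscale float32 in [0, 1], half resolution."""
    gray = frame.astype(np.float32).mean(axis=2) / 255.0
    return gray[::2, ::2]  # crude downsampling: keep every second row and column

class FrameStack:
    """Keeps the k most recent preprocessed frames as one (H, W, k) state."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # At the start of an episode, fill the stack with copies of the first frame
        processed = preprocess_frame(frame)
        for _ in range(self.k):
            self.frames.append(processed)
        return self.state()

    def push(self, frame):
        # The deque drops the oldest frame automatically
        self.frames.append(preprocess_frame(frame))
        return self.state()

    def state(self):
        return np.stack(self.frames, axis=-1)

# Random pixels standing in for a real screen image (an assumption for this sketch)
rng = np.random.RandomState(0)
fake_frame = rng.randint(0, 256, size=(210, 160, 3)).astype(np.uint8)

stack = FrameStack(k=4)
state = stack.reset(fake_frame)
print(state.shape)  # (105, 80, 4)
```

The stacked array would take the place of the flat state vector as the network input, with the first convolutional layer reading the `k` frames as channels.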

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. The original paper is a good place to start: http://www.davidqiu.com:8888/research/nature14236.pdf.