# Deep cortical reinforcement learning: Policy gradients + Q-learning + GAN


In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
# Import TensorFlow and check whether a GPU is available (otherwise the CPU is used)
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included `gym` as a submodule, so you can run `git submodule update --init --recursive` to pull the contents into the `gym` directory.

>**Note:** Make sure you have OpenAI Gym cloned, then run `pip install -e gym/[all]`.

In [2]:
import gym

## Create the Cart-Pole game environment
env = gym.make('CartPole-v1')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.
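
The interaction pattern described above can be sketched without Gym installed. `ToyEnv` below is a hypothetical stand-in, not part of Gym, and its dynamics are made up. It only mimics the `reset()` / `step(action)` shape used in this notebook, where `step` returns `(state, reward, done, info)`.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for a Gym environment (made-up dynamics)."""
    n_actions = 2  # like Cart-Pole: 0 = left, 1 = right

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.position = 0

    def sample_action(self):
        # Plays the role of env.action_space.sample()
        return self.rng.randrange(self.n_actions)

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 0 moves left, action 1 moves right
        self.position += 1 if action == 1 else -1
        done = abs(self.position) >= 3   # episode ends after drifting too far
        reward = 1.0                     # one point per step survived
        return self.position, reward, done, {}

toy_env = ToyEnv()
toy_state = toy_env.reset()
transitions = []
for _ in range(50):
    toy_action = toy_env.sample_action()
    toy_state, toy_reward, toy_done, toy_info = toy_env.step(toy_action)
    transitions.append((toy_action, toy_state, toy_reward, toy_done))
    if toy_done:
        toy_state = toy_env.reset()      # start a new episode
```

The loop has the same structure as the Cart-Pole cell that follows: sample an action, step, record the transition, and reset when the episode ends.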

Run the code below to step through the simulation with random actions. To watch it, uncomment the `env.render()` line.

In [3]:
env.reset()
batch = []
for _ in range(1111):
    #env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    batch.append([action, state, reward, done, info])
    #print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

To shut the window showing the simulation, use `env.close()`.

In [4]:
# env.close()

If you ran the simulation above, we can inspect the collected transitions, including the rewards:

In [5]:
batch[0][1].shape, state.shape

((4,), (4,))

In [6]:
import numpy as np
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
infos = np.array([each[4] for each in batch])

In [7]:
# print(rewards[-20:])
print('shapes:', np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print('dtypes:', np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('states:', np.max(np.array(states)), np.min(np.array(states)))
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print('rewards:', np.max(np.array(rewards)), np.min(np.array(rewards)))

shapes: (1111,) (1111, 4) (1111,) (1111,)
dtypes: float64 float64 int64 bool
states: 2.4041488323398097 -2.748127398357479
actions: 1 0
rewards: 1.0 1.0


In [8]:
actions[:10]

array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1])

In [9]:
rewards[:10]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [10]:
# import numpy as np
def sigmoid(x, derivative=False):
  return x*(1-x) if derivative else 1/(1+np.exp(-x))

In [11]:
sigmoid(np.max(np.array(rewards))), sigmoid(np.min(np.array(rewards)))

(0.7310585786300049, 0.7310585786300049)
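
A note on the `derivative=True` branch of `sigmoid`: it evaluates `x*(1-x)`, which is the sigmoid's derivative only when `x` is already a sigmoid output rather than a raw input. The sketch below checks this with the standard library, using a finite-difference estimate as the reference. The names `z`, `s`, `h`, `analytic`, and `numeric` are illustrative.

```python
import math

def sigmoid(x, derivative=False):
    # With derivative=True, x is assumed to be a sigmoid *output*, not a raw input
    return x * (1 - x) if derivative else 1 / (1 + math.exp(-x))

z = 1.0
s = sigmoid(z)                           # forward pass
analytic = sigmoid(s, derivative=True)   # derivative computed from the output s

# Central finite difference of the forward function at z
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
```

Here `analytic` and `numeric` agree closely, whereas calling `sigmoid(z, derivative=True)` on the raw input would not give the derivative.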

In [12]:
print('rewards:', np.max(np.array(rewards))/100, np.min(np.array(rewards))/100)

rewards: 0.01 0.01


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
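Q(s, a) = r + \gamma \max_{a'} Q(s', a')
$$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount factor, and $s'$ is the next state reached from state $s$ by taking action $a$. The maximum is taken over the actions $a'$ available in $s'$.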
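
As a small worked instance of this equation, the sketch below computes its right-hand side for a single transition. The next-state Q-values, the reward, and the discount factor are made-up numbers chosen only for illustration.

```python
gamma = 0.99                  # discount factor (illustrative value)
reward = 1.0                  # Cart-Pole gives 1.0 per step
q_next = {0: 2.0, 1: 3.5}     # hypothetical Q(s', a') for actions 0 and 1

# Bellman target: r + gamma * (max over a' of Q(s', a'))
target = reward + gamma * max(q_next.values())
```

With these numbers the best next action is 1, so the target is approximately 1.0 + 0.99 × 3.5 = 4.465.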

Previously, we used this equation to learn values for a Q-_table_. However, this game has a huge number of possible states. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so, ignoring floating-point precision, there are practically infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now our Q-value, $Q(s, a)$, is calculated by passing a state into the network. The network has fully connected hidden layers, and its output is one Q-value for each available action.

<img src="assets/q-network.png" width=550px>
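
To make the state-in, Q-values-out idea concrete, here is a minimal pure-Python sketch of a forward pass through a small fully connected network. It is not the notebook's model: the layer width of 8, the random weights, and the example state are all assumptions for illustration.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, weights, biases):
    # weights[j][i] connects input i to output unit j
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
w1, b1 = init_layer(4, 8, rng)   # 4 state values -> 8 hidden units
w2, b2 = init_layer(8, 8, rng)   # second hidden layer
w3, b3 = init_layer(8, 2, rng)   # 2 outputs: one Q-value per action

def q_values(state):
    h1 = relu(dense(state, w1, b1))
    h2 = relu(dense(h1, w2, b2))
    return dense(h2, w3, b3)     # linear output layer, no activation

# Cart position, cart velocity, pole angle, pole angular velocity
example_state = [0.02, -0.02, 0.02, -0.02]
qs = q_values(example_state)
greedy_action = max(range(len(qs)), key=lambda a: qs[a])
```

A single forward pass yields the Q-values of both actions at once, so choosing the greedy action is just an argmax over the output.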


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.
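
The target and loss computation can be sketched for a small batch as follows. All numbers are hypothetical. The `(1 - done)` factor zeroes the bootstrap term on terminal transitions, which is the same masking the training loop later in this notebook applies to its next-state Q-values.

```python
gamma = 0.99

# Hypothetical mini-batch of three transitions
batch_rewards = [1.0, 1.0, 1.0]
batch_dones = [0.0, 0.0, 1.0]                    # the last transition ended its episode
next_q = [[0.5, 1.5], [2.0, 1.0], [4.0, 3.0]]    # Q(s', a') for each action
q_taken = [2.0, 3.0, 1.5]                        # Q(s, a) for the actions actually taken

# Terminal transitions have no future value, so their bootstrap term is masked out
targets = [r + gamma * max(q) * (1.0 - d)
           for r, q, d in zip(batch_rewards, next_q, batch_dones)]

# Mean squared error between the targets and the current estimates
loss = sum((t - q) ** 2 for t, q in zip(targets, q_taken)) / len(targets)
```

For the terminal transition the target collapses to the immediate reward, 1.0, regardless of the next-state Q-values.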

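Below is my implementation. Rather than a plain Q-network with two fully connected ReLU layers, the active code pairs a GRU-based generator (the actor, which outputs action logits) with a GRU-based discriminator (the critic, which outputs one Q-value estimate per state-action pair). The fully connected versions are kept as commented-out cells; feel free to try them out.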

In [45]:
# Data of the model
def model_input(state_size, lstm_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    reward = tf.placeholder(tf.float32, [], name='reward')
    # GRU: Gated Recurrent Units
    gru = tf.nn.rnn_cell.GRUCell(lstm_size) # hidden size
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    g_initial_state = cell.zero_state(batch_size, tf.float32) # feedback or lateral/recurrent connection from output
    d_initial_state = cell.zero_state(batch_size, tf.float32) # feedback or lateral/recurrent connection from output
    return states, actions, targetQs, reward, cell, g_initial_state, d_initial_state

In [46]:
# How to use batch-norm
#   x_norm = tf.layers.batch_normalization(x, training=training)

#   # ...

#   update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
#   with tf.control_dependencies(update_ops):
#     train_op = optimizer.minimize(loss)

In [47]:
# training: Either a Python boolean, or a TensorFlow boolean scalar tensor (e.g. a placeholder). 
# Whether to return the output in: 
# training mode (normalized with statistics of the current batch) or 
# inference mode (normalized with moving statistics). 
# NOTE: make sure to set this parameter correctly, or else your training/inference will not work properly.
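
The difference between the two modes described in the comments above can be sketched in pure Python for a batch of scalars. This is a simplified illustration, not TensorFlow's implementation: it omits the learned scale and offset, and the `momentum` and `eps` defaults are illustrative assumptions.

```python
def batch_norm(xs, moving_mean, moving_var, training, momentum=0.99, eps=1e-3):
    """Normalize a list of scalars.

    Returns (normalized values, updated moving mean, updated moving variance).
    """
    if training:
        # Training mode: use the statistics of the current batch
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        # Update the running statistics for later use at inference time
        moving_mean = momentum * moving_mean + (1 - momentum) * mean
        moving_var = momentum * moving_var + (1 - momentum) * var
    else:
        # Inference mode: use the stored moving statistics
        mean, var = moving_mean, moving_var
    normalized = [(x - mean) / (var + eps) ** 0.5 for x in xs]
    return normalized, moving_mean, moving_var

xs = [1.0, 2.0, 3.0, 4.0]
train_out, mm, mv = batch_norm(xs, 0.0, 1.0, training=True)
infer_out, _, _ = batch_norm(xs, mm, mv, training=False)
```

The same inputs produce different outputs in the two modes, which is why the `training` flag has to match how the network is being used.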

In [48]:
# MLP & Conv
# # Generator/Controller: Generating/predicting the actions
# def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
#     with tf.variable_scope('generator', reuse=reuse):
#         # First fully connected layer
#         h1 = tf.layers.dense(inputs=states, units=hidden_size)
#         bn1 = tf.layers.batch_normalization(h1, training=training)        
#         nl1 = tf.maximum(alpha * bn1, bn1)
        
#         # Second fully connected layer
#         h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
#         bn2 = tf.layers.batch_normalization(h2, training=training)        
#         nl2 = tf.maximum(alpha * bn2, bn2)
        
#         # Output layer
#         logits = tf.layers.dense(inputs=nl2, units=action_size)        
#         #predictions = tf.nn.softmax(logits)

#         # return actions logits
#         return logits

In [49]:
# RNN generator or sequence generator
def generator(states, initial_state, cell, lstm_size, num_classes, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state
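
The reshape from `NxH` to `1xNxH` in the generator makes `tf.nn.dynamic_rnn` treat the `N` rows as one sequence of length `N` with batch size 1, and the returned `final_state` lets a later call continue where this one stopped. The sketch below shows that idea with a scalar GRU in pure Python. The weights are hypothetical, and the gate convention (`u * h + (1 - u) * c`) is my understanding of the TF 1.x `GRUCell`, so treat it as an assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One step of a scalar GRU; p holds hypothetical weights and biases."""
    r = sigmoid(p["w_rx"] * x + p["w_rh"] * h + p["b_r"])            # reset gate
    u = sigmoid(p["w_ux"] * x + p["w_uh"] * h + p["b_u"])            # update gate
    c = math.tanh(p["w_cx"] * x + p["w_ch"] * (r * h) + p["b_c"])    # candidate state
    return u * h + (1.0 - u) * c                                     # new hidden state

def run_sequence(inputs, h0, p):
    """Feed the inputs one by one, carrying the hidden state forward."""
    h, outputs = h0, []
    for x in inputs:
        h = gru_step(x, h, p)
        outputs.append(h)
    return outputs, h    # per-step outputs and the final state

params = {"w_rx": 0.5, "w_rh": -0.3, "b_r": 0.0,
          "w_ux": 0.8, "w_uh": 0.2, "b_u": 0.0,
          "w_cx": 1.0, "w_ch": 0.7, "b_c": 0.0}

outputs, final_state = run_sequence([0.1, -0.2, 0.3], h0=0.0, p=params)
# Continue the same sequence later by starting from final_state
more_outputs, _ = run_sequence([0.4], h0=final_state, p=params)
```

Feeding the final state back in gives the same result as processing the whole sequence in one call, which is what passing `g_final_state` back as the next `g_initial_state` relies on.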

In [50]:
# MLP & Conv
# # Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
# def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
#     with tf.variable_scope('discriminator', reuse=reuse):
#         # Fusion/merge states and actions/ SA/ SM
#         x_fused = tf.concat(axis=1, values=[states, actions])
        
#         # First fully connected layer
#         h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
#         bn1 = tf.layers.batch_normalization(h1, training=training)        
#         nl1 = tf.maximum(alpha * bn1, bn1)
        
#         # Second fully connected layer
#         h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
#         bn2 = tf.layers.batch_normalization(h2, training=training)        
#         nl2 = tf.maximum(alpha * bn2, bn2)
        
#         # Output layer
#         logits = tf.layers.dense(inputs=nl2, units=1)        
#         #predictions = tf.nn.softmax(logits)

#         # return rewards logits
#         return logits

In [51]:
# RNN discriminator or sequence discriminator
def discriminator(states, actions, initial_state, cell, lstm_size, reuse=False): 
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        inputs = tf.layers.dense(inputs=x_fused, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=1)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the Q-value logits (one per state-action pair)
        return logits, final_state

In [52]:
def model_loss(action_size, hidden_size, states, actions, targetQs, reward,
               cell, g_initial_state, d_initial_state):
    # G/Actor
    #actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_logits, g_final_state = generator(states=states, num_classes=action_size, 
                                              cell=cell, initial_state=g_initial_state, lstm_size=hidden_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                      labels=actions_labels)
    g_loss = tf.reduce_mean(neg_log_prob_actions * targetQs)
    
    # D/Critic
    #Qs_logits = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    Qs_logits, d_final_state = discriminator(states=states, actions=actions_logits, 
                                             cell=cell, initial_state=d_initial_state, lstm_size=hidden_size)
    rewards = reward * tf.ones_like(targetQs)
    d_lossR = tf.reduce_mean(tf.square(tf.nn.sigmoid(tf.reshape(Qs_logits, [-1])) - rewards))
    d_lossR_sigm = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=tf.reshape(Qs_logits, [-1]),
                                                                          labels=rewards))
    d_lossQ = tf.reduce_mean(tf.square(tf.reshape(Qs_logits, [-1]) - targetQs))
    d_lossQ_sigm = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=tf.reshape(Qs_logits, [-1]),
                                                                          labels=tf.nn.sigmoid(targetQs)))
    d_loss = d_lossQ_sigm + d_lossQ #+ d_lossR_sigm + d_lossR

    return actions_logits, Qs_logits, g_final_state, d_final_state, g_loss, d_loss, d_lossR, d_lossQ, d_lossR_sigm, d_lossQ_sigm
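
The generator loss in `model_loss` is a policy-gradient-style objective: the negative log-probability of each taken action, weighted by its target Q-value and averaged over the batch. A minimal pure-Python sketch of that quantity, with made-up logits, actions, and weights:

```python
import math

def softmax(logits):
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_neg_log_prob(logits_batch, actions, weights):
    """Mean over the batch of -log pi(a|s) * weight."""
    terms = []
    for logits, a, w in zip(logits_batch, actions, weights):
        probs = softmax(logits)
        # Cross-entropy against a one-hot label reduces to -log(prob of the label)
        terms.append(-math.log(probs[a]) * w)
    return sum(terms) / len(terms)

logits_batch = [[0.0, 0.0], [2.0, 0.0]]    # hypothetical action logits
actions_taken = [1, 0]
target_qs = [1.0, 3.0]
g_loss_sketch = weighted_neg_log_prob(logits_batch, actions_taken, target_qs)
```

Minimizing this pushes up the probability of actions with large positive weights. Note that in the notebook the weights come from the critic's targets and can take either sign.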

In [53]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    """
    Get optimization operations in order
    :param g_loss: Generator loss Tensor for action prediction
    :param d_loss: Discriminator loss Tensor for reward prediction for generated/prob/logits action
    :param learning_rate: Learning rate (a Python float here)
    :return: A tuple of (generator training op, discriminator training op)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize RNN
    # g_grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(g_loss, g_vars), clip_norm=5) # usually around 1-5
    # d_grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(d_loss, d_vars), clip_norm=5) # usually around 1-5
    g_grads=tf.gradients(g_loss, g_vars)
    d_grads=tf.gradients(d_loss, d_vars)
    g_opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(g_grads, g_vars))
    d_opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(d_grads, d_vars))
    
    # # Optimize MLP & CNN
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    #     g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
    #     d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt
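
The commented-out `tf.clip_by_global_norm` lines in `model_opt` are a common safeguard for recurrent networks. As a sketch of what that operation computes (clipping is disabled in the active code), the pure-Python version below rescales all gradients by one shared factor so that their joint L2 norm does not exceed `clip_norm`. The gradient values are hypothetical.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)   # 1.0 when no clipping is needed
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm

grads = [[3.0, 4.0], [12.0]]     # hypothetical gradients for two variables
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its length shrinks.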

In [59]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, self.reward, cell, self.g_initial_state, self.d_initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_final_state, self.d_final_state, self.g_loss, self.d_loss, self.d_lossR, self.d_lossQ, self.d_lossR_sigm, self.d_lossQ_sigm = model_loss(
            action_size=action_size, hidden_size=hidden_size,
            states=self.states, actions=self.actions, cell=cell,
            targetQs=self.targetQs, reward=self.reward,  
            g_initial_state=self.g_initial_state, d_initial_state=self.d_initial_state)
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [60]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)
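
`Memory` relies on `deque(maxlen=...)`, which silently drops the oldest item once the buffer is full. The sketch below makes that eviction visible with a tiny, illustrative capacity of 3:

```python
from collections import deque

buffer = deque(maxlen=3)            # tiny capacity to make eviction visible
for step in range(5):
    buffer.append(("transition", step))

oldest_kept = buffer[0][1]          # steps 0 and 1 have been evicted
newest = buffer[-1][1]
```

In this notebook the memory is created with `max_size=batch_size`, so it always holds the most recent 32 transitions in order, and the training loop uses the whole buffer rather than sampling from it at random.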

## Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're also tuning the simulation.

In [61]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1111, 4) actions:(1111,)
action size:2


In [62]:
# Training parameters
# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
action_size = 2                # number of units for the output actions -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation
batch_size = 32                # number of samples in the memory/ experience as mini-batch size
learning_rate = 0.001          # learning rate for adam

In [63]:
# Reset/init the graph/session
tf.reset_default_graph()        # returns None, so fetch the fresh graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 2)
(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 1)


In [64]:
state = env.reset()
for _ in range(batch_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done:
        state = env.reset()

In [65]:
memory.buffer[0]

[array([ 0.01858505, -0.01960784,  0.02470401, -0.02137975]),
 0,
 array([ 0.01819289, -0.21507521,  0.02427642,  0.27899407]),
 1.0,
 0.0]

## Training the model

Below we'll train our agent. If you want to watch it train, add an `env.render()` call inside the step loop. Training is then slower, because the frames render more slowly than the network trains, but it's cool to watch the agent get better at the game.

In [None]:
from collections import deque
episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
saver = tf.train.Saver()
rewards_list, g_loss_list, d_loss_list = [], [], []
rates_list, d_lossR_list, d_lossQ_list = [], [], []
d_lossRsigm_list, d_lossQsigm_list = [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    for ep in range(11111):
        batch = [] # every data batch
        total_reward = 0
        state = env.reset() # env first state
        g_initial_state = sess.run(model.g_initial_state)
        d_initial_state = sess.run(model.d_initial_state)

        # Training steps/batches
        while True:
            # Testing/inference
            action_logits, g_final_state, d_final_state = sess.run(
                fetches=[model.actions_logits, model.g_final_state, model.d_final_state], 
                feed_dict={model.states: np.reshape(state, [1, -1]),
                           model.g_initial_state: g_initial_state,
                           model.d_initial_state: d_initial_state})
            action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append([g_initial_state, g_final_state,
                                  d_initial_state, d_final_state])
            total_reward += reward
            g_initial_state = g_final_state
            d_initial_state = d_final_state
            state = next_state
            
            # Training
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            rnn_states = memory.states
            g_initial_states = np.array([each[0] for each in rnn_states])
            g_final_states = np.array([each[1] for each in rnn_states])
            d_initial_states = np.array([each[2] for each in rnn_states])
            d_final_states = np.array([each[3] for each in rnn_states])
            nextQs_logits = sess.run(fetches = model.Qs_logits,
                                     feed_dict = {model.states: next_states, 
                                                  model.g_initial_state: g_final_states[0].reshape([1, -1]),
                                                  model.d_initial_state: d_final_states[0].reshape([1, -1])})
            nextQs = nextQs_logits.reshape([-1]) * (1-dones) # exploit
            #print(nextQs.shape, nextQs_logits.shape, dones.shape, rewards.shape)
            targetQs = rewards + (0.99 * nextQs)
            #print(targetQs.shape, rewards.shape, nextQs.shape)
            g_loss, d_loss, d_lossR, d_lossQ, d_lossRsigm, d_lossQsigm, _, _ = sess.run(
                fetches=[model.g_loss, model.d_loss, 
                         model.d_lossR, model.d_lossQ, 
                         model.d_lossR_sigm, model.d_lossQ_sigm,
                         model.g_opt, model.d_opt], 
                feed_dict = {model.states: states, model.actions: actions,
                             model.reward: 1.0, 
                             model.targetQs: targetQs,
                             model.g_initial_state: g_initial_states[0].reshape([1, -1]),
                             model.d_initial_state: d_initial_states[0].reshape([1, -1])})

            if done:
                break

        # Episode total reward and success rate/prob
        episode_reward.append(total_reward) # stopping criteria
        rate = total_reward/ 500 # success is 500 points: 0-1
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'rate:{:.4f}'.format(rate),
              'gloss:{:.4f}'.format(g_loss),
              'dloss:{:.4f}'.format(d_loss),
#               'dlossR:{:.4f}'.format(d_lossR),
              'dlossQ:{:.4f}'.format(d_lossQ),
#               'dlossRsigm:{:.4f}'.format(d_lossRsigm),
              'dlossQsigm:{:.4f}'.format(d_lossQsigm))
        # Record metrics for plotting
        rewards_list.append([ep, np.mean(episode_reward)])
        rates_list.append([ep, rate])
        g_loss_list.append([ep, g_loss])
        d_loss_list.append([ep, d_loss])
        d_lossR_list.append([ep, d_lossR])
        d_lossQ_list.append([ep, d_lossQ])
        d_lossRsigm_list.append([ep, d_lossRsigm])
        d_lossQsigm_list.append([ep, d_lossQsigm])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:25.0000 rate:0.0500 gloss:0.6189 dloss:1.6211 dlossR:0.1952 dlossQsigm:0.6357
Episode:1 meanR:32.5000 rate:0.0800 gloss:0.2741 dloss:1.3121 dlossR:0.0352 dlossQsigm:0.3193
Episode:2 meanR:36.3333 rate:0.0880 gloss:0.0562 dloss:2.3809 dlossR:0.0001 dlossQsigm:0.0698
Episode:3 meanR:33.0000 rate:0.0460 gloss:0.0637 dloss:4.8319 dlossR:0.0000 dlossQsigm:0.1474
Episode:4 meanR:34.0000 rate:0.0760 gloss:0.0816 dloss:4.9760 dlossR:0.0000 dlossQsigm:0.0999
Episode:5 meanR:35.5000 rate:0.0860 gloss:0.5439 dloss:6.5101 dlossR:0.0000 dlossQsigm:0.1182
Episode:6 meanR:35.0000 rate:0.0640 gloss:0.6628 dloss:10.0301 dlossR:0.0001 dlossQsigm:0.1458
Episode:7 meanR:35.3750 rate:0.0760 gloss:0.2767 dloss:11.8986 dlossR:0.0000 dlossQsigm:0.1602
Episode:8 meanR:35.6667 rate:0.0760 gloss:0.2681 dloss:13.1699 dlossR:0.0000 dlossQsigm:0.1671
Episode:9 meanR:35.9000 rate:0.0760 gloss:0.6855 dloss:16.0877 dlossR:0.0000 dlossQsigm:0.1885
Episode:10 meanR:38.5455 rate:0.1300 gloss:0.5599 dloss:

Episode:87 meanR:48.3409 rate:0.0920 gloss:0.0012 dloss:15.3593 dlossR:0.0000 dlossQsigm:0.0937
Episode:88 meanR:48.1461 rate:0.0620 gloss:0.0472 dloss:7.8674 dlossR:0.0000 dlossQsigm:0.1652
Episode:89 meanR:47.9556 rate:0.0620 gloss:0.0000 dloss:4.5460 dlossR:0.0000 dlossQsigm:0.1332
Episode:90 meanR:47.7802 rate:0.0640 gloss:0.0005 dloss:8.5777 dlossR:0.0000 dlossQsigm:0.0451
Episode:91 meanR:47.7609 rate:0.0920 gloss:0.0026 dloss:3.1801 dlossR:0.0000 dlossQsigm:0.0476
Episode:92 meanR:47.7634 rate:0.0960 gloss:0.0046 dloss:4.6769 dlossR:0.0000 dlossQsigm:0.0439
Episode:93 meanR:47.6809 rate:0.0800 gloss:0.0001 dloss:6.1572 dlossR:0.0828 dlossQsigm:0.2780
Episode:94 meanR:47.5158 rate:0.0640 gloss:0.0021 dloss:11.7020 dlossR:0.0001 dlossQsigm:0.0440
Episode:95 meanR:47.5833 rate:0.1080 gloss:0.0053 dloss:15.4754 dlossR:0.0000 dlossQsigm:0.0654
Episode:96 meanR:47.4948 rate:0.0780 gloss:0.0012 dloss:6.1742 dlossR:0.0000 dlossQsigm:0.0510
Episode:97 meanR:47.4082 rate:0.0780 gloss:0.00

Episode:173 meanR:43.4900 rate:0.0940 gloss:0.0029 dloss:2.0470 dlossR:0.0000 dlossQsigm:0.0425
Episode:174 meanR:44.0900 rate:0.2060 gloss:0.0000 dloss:4.1254 dlossR:0.0001 dlossQsigm:0.0329
Episode:175 meanR:44.7700 rate:0.1980 gloss:0.0010 dloss:6.7854 dlossR:0.0000 dlossQsigm:0.1130
Episode:176 meanR:44.6300 rate:0.0780 gloss:0.0004 dloss:3.9098 dlossR:0.0000 dlossQsigm:0.0646
Episode:177 meanR:44.7200 rate:0.0780 gloss:0.0024 dloss:6.3407 dlossR:0.0004 dlossQsigm:0.0484
Episode:178 meanR:44.7400 rate:0.0920 gloss:0.0052 dloss:0.8704 dlossR:0.0029 dlossQsigm:0.0433
Episode:179 meanR:45.0300 rate:0.1180 gloss:0.0051 dloss:32.5792 dlossR:0.0002 dlossQsigm:0.0371
Episode:180 meanR:44.8200 rate:0.0660 gloss:0.0067 dloss:25.5364 dlossR:0.0004 dlossQsigm:0.0331
Episode:181 meanR:44.7800 rate:0.0620 gloss:0.0000 dloss:2.0447 dlossR:0.0009 dlossQsigm:0.0612
Episode:182 meanR:44.6800 rate:0.0780 gloss:0.0017 dloss:1.4886 dlossR:0.0006 dlossQsigm:0.0411
Episode:183 meanR:44.5300 rate:0.0680 

Episode:259 meanR:53.7800 rate:0.0720 gloss:0.0014 dloss:4.6687 dlossR:0.0000 dlossQsigm:0.0882
Episode:260 meanR:53.7100 rate:0.0760 gloss:0.0017 dloss:1.8621 dlossR:0.0000 dlossQsigm:0.0479
Episode:261 meanR:53.6700 rate:0.0740 gloss:0.0008 dloss:3.5490 dlossR:0.0000 dlossQsigm:0.0521
Episode:262 meanR:53.7200 rate:0.0780 gloss:0.0000 dloss:2.9076 dlossR:0.0000 dlossQsigm:0.0660
Episode:263 meanR:53.5100 rate:0.0720 gloss:0.0005 dloss:2.4179 dlossR:0.0000 dlossQsigm:0.0435
Episode:264 meanR:53.7400 rate:0.1260 gloss:0.0000 dloss:2.0998 dlossR:0.0000 dlossQsigm:0.0510
Episode:265 meanR:53.9700 rate:0.1600 gloss:0.0024 dloss:2.4127 dlossR:0.0000 dlossQsigm:0.0437
Episode:266 meanR:53.9300 rate:0.0600 gloss:0.0000 dloss:2.4179 dlossR:0.0000 dlossQsigm:0.0868
Episode:267 meanR:53.8000 rate:0.0640 gloss:0.0754 dloss:9.3861 dlossR:0.0000 dlossQsigm:0.0588
Episode:268 meanR:53.9400 rate:0.1020 gloss:0.0039 dloss:1.9232 dlossR:0.0000 dlossQsigm:0.0478
Episode:269 meanR:54.1100 rate:0.1020 gl

Episode:345 meanR:53.3300 rate:0.1180 gloss:0.0159 dloss:4.0860 dlossR:0.0000 dlossQsigm:0.0746
Episode:346 meanR:53.8900 rate:0.1820 gloss:0.0000 dloss:5.9204 dlossR:0.0000 dlossQsigm:0.0719
Episode:347 meanR:54.4700 rate:0.2140 gloss:0.0000 dloss:23.1967 dlossR:0.0000 dlossQsigm:0.1976
Episode:348 meanR:54.5600 rate:0.0900 gloss:0.0001 dloss:12.7233 dlossR:0.0000 dlossQsigm:0.1160
Episode:349 meanR:54.5300 rate:0.0700 gloss:0.0020 dloss:4.7866 dlossR:0.0000 dlossQsigm:0.0628
Episode:350 meanR:54.2100 rate:0.0980 gloss:0.0000 dloss:2.5960 dlossR:0.0000 dlossQsigm:0.0468
Episode:351 meanR:54.6700 rate:0.1640 gloss:0.0005 dloss:7.3748 dlossR:0.0000 dlossQsigm:0.1098
Episode:352 meanR:55.4200 rate:0.2420 gloss:0.0000 dloss:5.9671 dlossR:0.0000 dlossQsigm:0.0886
Episode:353 meanR:54.4200 rate:0.0820 gloss:0.0040 dloss:4.0189 dlossR:0.0000 dlossQsigm:0.0779
Episode:354 meanR:54.3300 rate:0.1020 gloss:0.0025 dloss:3.1692 dlossR:0.0000 dlossQsigm:0.0643
Episode:355 meanR:54.0700 rate:0.0780 

Episode:431 meanR:62.5700 rate:0.0780 gloss:0.0018 dloss:2.8764 dlossR:0.0000 dlossQsigm:0.0644
Episode:432 meanR:61.7200 rate:0.0940 gloss:0.0000 dloss:2.4951 dlossR:0.0000 dlossQsigm:0.0582
Episode:433 meanR:61.7300 rate:0.0680 gloss:0.0000 dloss:2.0041 dlossR:0.0000 dlossQsigm:0.0551
Episode:434 meanR:61.8800 rate:0.0880 gloss:0.0000 dloss:3.4748 dlossR:0.0000 dlossQsigm:0.0734
Episode:435 meanR:61.7100 rate:0.0840 gloss:0.0000 dloss:2.4669 dlossR:0.0000 dlossQsigm:0.0606
Episode:436 meanR:61.6800 rate:0.0920 gloss:0.0000 dloss:2.7972 dlossR:0.0000 dlossQsigm:0.0709
Episode:437 meanR:61.8800 rate:0.1100 gloss:0.0000 dloss:1.7448 dlossR:0.0000 dlossQsigm:0.0462
Episode:438 meanR:61.7500 rate:0.0620 gloss:0.0000 dloss:2.3500 dlossR:0.0000 dlossQsigm:0.0793
Episode:439 meanR:61.7500 rate:0.0640 gloss:0.0000 dloss:2.6423 dlossR:0.0000 dlossQsigm:0.0481
Episode:440 meanR:61.9900 rate:0.1560 gloss:0.0002 dloss:4.4265 dlossR:0.0000 dlossQsigm:0.0682
Episode:441 meanR:61.7600 rate:0.0640 gl

Episode:517 meanR:48.0100 rate:0.0640 gloss:0.0000 dloss:2.7913 dlossR:0.0004 dlossQsigm:0.0441
Episode:518 meanR:47.7400 rate:0.0620 gloss:0.0000 dloss:1.6531 dlossR:0.0000 dlossQsigm:0.0737
Episode:519 meanR:48.0800 rate:0.1760 gloss:0.0000 dloss:3.1018 dlossR:0.0000 dlossQsigm:0.0761
Episode:520 meanR:48.4300 rate:0.1580 gloss:0.0000 dloss:1.5430 dlossR:0.0002 dlossQsigm:0.0337
Episode:521 meanR:48.3800 rate:0.0820 gloss:0.0001 dloss:2.1235 dlossR:0.0001 dlossQsigm:0.0348
Episode:522 meanR:48.3400 rate:0.0720 gloss:0.0000 dloss:1.6633 dlossR:0.0013 dlossQsigm:0.0513
Episode:523 meanR:48.7500 rate:0.1700 gloss:0.0020 dloss:3.9396 dlossR:0.0000 dlossQsigm:0.0329
Episode:524 meanR:48.3400 rate:0.0960 gloss:0.0003 dloss:1.8004 dlossR:0.0000 dlossQsigm:0.0375
Episode:525 meanR:48.3200 rate:0.1380 gloss:0.0000 dloss:5.9621 dlossR:0.0000 dlossQsigm:0.0859
Episode:526 meanR:48.4000 rate:0.0820 gloss:0.0000 dloss:4.6207 dlossR:0.0000 dlossQsigm:0.0460
Episode:527 meanR:47.9600 rate:0.0760 gl

Episode:603 meanR:41.0400 rate:0.0700 gloss:0.0000 dloss:1.3719 dlossR:0.0005 dlossQsigm:0.0372
Episode:604 meanR:41.3100 rate:0.1440 gloss:0.0000 dloss:4.6723 dlossR:0.0000 dlossQsigm:0.0777
Episode:605 meanR:41.4900 rate:0.1040 gloss:0.0000 dloss:2.2147 dlossR:0.0000 dlossQsigm:0.0340
Episode:606 meanR:41.9500 rate:0.1860 gloss:0.0035 dloss:3.5707 dlossR:0.0000 dlossQsigm:0.0629
Episode:607 meanR:42.2000 rate:0.1360 gloss:0.0000 dloss:1.8466 dlossR:0.0000 dlossQsigm:0.0428
Episode:608 meanR:42.2900 rate:0.0780 gloss:0.0000 dloss:1.2084 dlossR:0.0000 dlossQsigm:0.0366
Episode:609 meanR:42.6800 rate:0.1480 gloss:0.0000 dloss:1.7887 dlossR:0.0000 dlossQsigm:0.0439
Episode:610 meanR:42.6200 rate:0.0720 gloss:0.0000 dloss:1.5753 dlossR:0.0001 dlossQsigm:0.0312
Episode:611 meanR:42.5500 rate:0.0680 gloss:0.0000 dloss:1.8479 dlossR:0.0013 dlossQsigm:0.0452
Episode:612 meanR:42.5100 rate:0.0680 gloss:0.0000 dloss:1.3145 dlossR:0.0028 dlossQsigm:0.0525
Episode:613 meanR:42.4100 rate:0.0720 gl

Episode:689 meanR:43.0100 rate:0.0640 gloss:0.0000 dloss:10.7501 dlossR:0.0012 dlossQsigm:0.0617
Episode:690 meanR:43.1200 rate:0.0800 gloss:0.0000 dloss:1.9868 dlossR:0.0001 dlossQsigm:0.0310
Episode:691 meanR:43.2400 rate:0.1000 gloss:0.0119 dloss:9.1351 dlossR:0.0000 dlossQsigm:0.0575
Episode:692 meanR:43.3900 rate:0.0880 gloss:0.0004 dloss:10.9536 dlossR:0.0000 dlossQsigm:0.0656
Episode:693 meanR:43.4600 rate:0.1040 gloss:0.0000 dloss:4.8768 dlossR:0.0000 dlossQsigm:0.0527
Episode:694 meanR:43.4500 rate:0.1100 gloss:0.0006 dloss:7.9817 dlossR:0.0000 dlossQsigm:0.0710
Episode:695 meanR:43.4600 rate:0.0560 gloss:0.0001 dloss:6.1136 dlossR:0.0000 dlossQsigm:0.1423
Episode:696 meanR:43.4900 rate:0.0660 gloss:0.0000 dloss:3.8920 dlossR:0.0000 dlossQsigm:0.0658
Episode:697 meanR:43.7100 rate:0.1040 gloss:0.0000 dloss:3.8904 dlossR:0.0000 dlossQsigm:0.0724
Episode:698 meanR:43.5500 rate:0.0780 gloss:0.0000 dloss:5.1965 dlossR:0.0000 dlossQsigm:0.0671
Episode:699 meanR:43.7900 rate:0.1100 

Episode:775 meanR:42.8700 rate:0.0820 gloss:0.0004 dloss:1.2211 dlossR:0.0000 dlossQsigm:0.0521
Episode:776 meanR:42.9500 rate:0.0900 gloss:0.0000 dloss:3.6701 dlossR:0.0000 dlossQsigm:0.0829
Episode:777 meanR:43.2600 rate:0.1480 gloss:0.0013 dloss:14.2559 dlossR:0.0000 dlossQsigm:0.1233
Episode:778 meanR:43.1200 rate:0.0660 gloss:0.0002 dloss:2.5055 dlossR:0.0001 dlossQsigm:0.0369
Episode:779 meanR:43.0700 rate:0.0620 gloss:0.0002 dloss:3.7949 dlossR:0.0000 dlossQsigm:0.1122
Episode:780 meanR:42.9600 rate:0.1020 gloss:0.0002 dloss:3.1906 dlossR:0.0000 dlossQsigm:0.0684
Episode:781 meanR:42.8800 rate:0.0820 gloss:0.0000 dloss:1.0283 dlossR:0.0000 dlossQsigm:0.0385
Episode:782 meanR:42.8100 rate:0.0760 gloss:0.0000 dloss:1.7227 dlossR:0.0000 dlossQsigm:0.0392
Episode:783 meanR:42.9800 rate:0.0900 gloss:0.0000 dloss:1.7844 dlossR:0.0000 dlossQsigm:0.0563
Episode:784 meanR:43.1900 rate:0.1080 gloss:0.0000 dloss:6.8430 dlossR:0.0000 dlossQsigm:0.1151
Episode:785 meanR:43.3400 rate:0.1100 g

Episode:861 meanR:43.8300 rate:0.0880 gloss:0.0000 dloss:2.9242 dlossR:0.0000 dlossQsigm:0.0528
Episode:862 meanR:44.0500 rate:0.1300 gloss:0.0000 dloss:9.0083 dlossR:0.0000 dlossQsigm:0.1434
Episode:863 meanR:44.0200 rate:0.0960 gloss:0.0004 dloss:10.8139 dlossR:0.0000 dlossQsigm:0.1478
Episode:864 meanR:44.5900 rate:0.1760 gloss:0.0000 dloss:13.1632 dlossR:0.0000 dlossQsigm:0.1623
Episode:865 meanR:45.0400 rate:0.1520 gloss:0.0001 dloss:15.0355 dlossR:0.0000 dlossQsigm:0.1431
Episode:866 meanR:45.1600 rate:0.0880 gloss:0.0004 dloss:2.5360 dlossR:0.0000 dlossQsigm:0.0651
Episode:867 meanR:45.5000 rate:0.1820 gloss:0.0000 dloss:14.9422 dlossR:0.0000 dlossQsigm:0.1613
Episode:868 meanR:45.5600 rate:0.0820 gloss:0.0000 dloss:2.5487 dlossR:0.0000 dlossQsigm:0.0499
Episode:869 meanR:46.1200 rate:0.1720 gloss:0.0000 dloss:6.3854 dlossR:0.0000 dlossQsigm:0.1171
Episode:870 meanR:46.0100 rate:0.0760 gloss:0.0000 dloss:3.2661 dlossR:0.0003 dlossQsigm:0.0385
Episode:871 meanR:46.4400 rate:0.188

Episode:947 meanR:52.1700 rate:0.0660 gloss:0.0001 dloss:5.7201 dlossR:0.0000 dlossQsigm:0.0734
Episode:948 meanR:52.2000 rate:0.1060 gloss:0.0000 dloss:4.7129 dlossR:0.0000 dlossQsigm:0.0685
Episode:949 meanR:52.1000 rate:0.0820 gloss:0.0001 dloss:3.2351 dlossR:0.0000 dlossQsigm:0.0416
Episode:950 meanR:52.1800 rate:0.0940 gloss:0.0002 dloss:2.4595 dlossR:0.0000 dlossQsigm:0.0381
Episode:951 meanR:52.4700 rate:0.1460 gloss:0.0001 dloss:4.4150 dlossR:0.0000 dlossQsigm:0.0448
Episode:952 meanR:52.3700 rate:0.0800 gloss:0.0001 dloss:0.7178 dlossR:0.0000 dlossQsigm:0.0328
Episode:953 meanR:52.4100 rate:0.1260 gloss:0.0000 dloss:2.0357 dlossR:0.0000 dlossQsigm:0.0478
Episode:954 meanR:52.2300 rate:0.0620 gloss:0.0000 dloss:1.8212 dlossR:0.0000 dlossQsigm:0.0805
Episode:955 meanR:52.0800 rate:0.0800 gloss:0.0000 dloss:0.9885 dlossR:0.0011 dlossQsigm:0.0344
Episode:956 meanR:51.8200 rate:0.0820 gloss:0.0000 dloss:1.2671 dlossR:0.0000 dlossQsigm:0.0414
Episode:957 meanR:51.6700 rate:0.1000 gl

Episode:1032 meanR:47.2500 rate:0.0880 gloss:0.0000 dloss:4.5887 dlossR:0.0000 dlossQsigm:0.0780
Episode:1033 meanR:47.2800 rate:0.0740 gloss:0.0047 dloss:14.7022 dlossR:0.0000 dlossQsigm:0.0605
Episode:1034 meanR:47.4100 rate:0.0920 gloss:0.0000 dloss:3.7284 dlossR:0.0000 dlossQsigm:0.0611
Episode:1035 meanR:47.3400 rate:0.1000 gloss:0.0000 dloss:2.8938 dlossR:0.0000 dlossQsigm:0.0566
Episode:1036 meanR:47.0600 rate:0.0700 gloss:0.0000 dloss:2.4953 dlossR:0.0000 dlossQsigm:0.0696
Episode:1037 meanR:46.9800 rate:0.0720 gloss:0.0002 dloss:2.6915 dlossR:0.0000 dlossQsigm:0.0619
Episode:1038 meanR:47.2200 rate:0.1180 gloss:0.0000 dloss:3.6603 dlossR:0.0000 dlossQsigm:0.0707
Episode:1039 meanR:47.1400 rate:0.0680 gloss:0.0001 dloss:2.2953 dlossR:0.0000 dlossQsigm:0.0490
Episode:1040 meanR:47.2000 rate:0.0780 gloss:0.0000 dloss:2.1660 dlossR:0.0000 dlossQsigm:0.0488
Episode:1041 meanR:47.2000 rate:0.0920 gloss:0.0000 dloss:1.5135 dlossR:0.0000 dlossQsigm:0.0442
Episode:1042 meanR:47.6300 ra

Episode:1117 meanR:55.6900 rate:0.1460 gloss:0.0000 dloss:8.6104 dlossR:0.0011 dlossQsigm:0.0592
Episode:1118 meanR:55.5000 rate:0.0860 gloss:0.0002 dloss:4.9548 dlossR:0.0000 dlossQsigm:0.0655
Episode:1119 meanR:56.4100 rate:0.2540 gloss:0.0000 dloss:9.7114 dlossR:0.0000 dlossQsigm:0.0780
Episode:1120 meanR:56.4400 rate:0.1140 gloss:0.0000 dloss:4.9224 dlossR:0.0000 dlossQsigm:0.0764
Episode:1121 meanR:56.4500 rate:0.1100 gloss:0.0000 dloss:5.5307 dlossR:0.0000 dlossQsigm:0.0631
Episode:1122 meanR:56.4700 rate:0.1520 gloss:0.0000 dloss:7.6744 dlossR:0.0000 dlossQsigm:0.0993
Episode:1123 meanR:56.6800 rate:0.1780 gloss:0.0000 dloss:3.6458 dlossR:0.0000 dlossQsigm:0.0656
Episode:1124 meanR:56.8600 rate:0.1280 gloss:0.0003 dloss:4.9045 dlossR:0.0000 dlossQsigm:0.0631
Episode:1125 meanR:57.0900 rate:0.1340 gloss:0.0000 dloss:8.0955 dlossR:0.0000 dlossQsigm:0.0808
Episode:1126 meanR:57.5100 rate:0.1940 gloss:0.0000 dloss:4.4585 dlossR:0.0000 dlossQsigm:0.0953
Episode:1127 meanR:57.5100 rat

Episode:1202 meanR:55.6100 rate:0.0820 gloss:0.0000 dloss:5.3540 dlossR:0.0000 dlossQsigm:0.1005
Episode:1203 meanR:55.5200 rate:0.0760 gloss:0.0000 dloss:3.0062 dlossR:0.0000 dlossQsigm:0.0710
Episode:1204 meanR:55.9400 rate:0.1780 gloss:0.0000 dloss:4.0076 dlossR:0.0000 dlossQsigm:0.0843
Episode:1205 meanR:55.8900 rate:0.1400 gloss:0.0000 dloss:7.7698 dlossR:0.0000 dlossQsigm:0.0900
Episode:1206 meanR:55.6500 rate:0.1020 gloss:0.0000 dloss:13.4361 dlossR:0.0000 dlossQsigm:0.1299
Episode:1207 meanR:55.7300 rate:0.1180 gloss:0.0000 dloss:4.8187 dlossR:0.0000 dlossQsigm:0.0818
Episode:1208 meanR:55.8300 rate:0.1140 gloss:0.0000 dloss:5.4511 dlossR:0.0000 dlossQsigm:0.1017
Episode:1209 meanR:55.7800 rate:0.1200 gloss:0.0000 dloss:6.3556 dlossR:0.0000 dlossQsigm:0.1159
Episode:1210 meanR:55.6000 rate:0.1140 gloss:0.0000 dloss:10.2749 dlossR:0.0000 dlossQsigm:0.1403
Episode:1211 meanR:55.5400 rate:0.1100 gloss:0.0000 dloss:6.5198 dlossR:0.0000 dlossQsigm:0.1152
Episode:1212 meanR:55.6200 r

Episode:1287 meanR:57.2200 rate:0.1280 gloss:0.0000 dloss:25.6460 dlossR:0.0000 dlossQsigm:0.1816
Episode:1288 meanR:57.0600 rate:0.0800 gloss:0.0000 dloss:8.1636 dlossR:0.0000 dlossQsigm:0.0960
Episode:1289 meanR:57.0600 rate:0.0940 gloss:0.0000 dloss:6.0554 dlossR:0.0000 dlossQsigm:0.0791
Episode:1290 meanR:57.1500 rate:0.1180 gloss:0.0000 dloss:6.9173 dlossR:0.0000 dlossQsigm:0.0847
Episode:1291 meanR:57.2300 rate:0.0960 gloss:0.0000 dloss:9.2237 dlossR:0.0000 dlossQsigm:0.0786
Episode:1292 meanR:57.3400 rate:0.1020 gloss:0.0000 dloss:7.5559 dlossR:0.0000 dlossQsigm:0.0763
Episode:1293 meanR:57.3100 rate:0.1160 gloss:0.0000 dloss:12.8940 dlossR:0.0000 dlossQsigm:0.1196
Episode:1294 meanR:57.4700 rate:0.1080 gloss:0.0000 dloss:27.7566 dlossR:0.0000 dlossQsigm:0.1609
Episode:1295 meanR:57.6800 rate:0.1380 gloss:0.0000 dloss:8.7608 dlossR:0.0000 dlossQsigm:0.0853
Episode:1296 meanR:57.5400 rate:0.0960 gloss:0.0000 dloss:12.1965 dlossR:0.0000 dlossQsigm:0.1174
Episode:1297 meanR:58.0400

## Visualizing training

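Below I'll plot the total reward for each episode in grey, along with a 10-episode rolling average in blue. The same two-line plot is then repeated for the generator (G) and discriminator (D) losses.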

In [None]:
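import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def running_mean(x, N):
    # Moving average over a window of N values, computed from cumulative sums.
    # Returns len(x) - N + 1 values.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N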
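
To see why `running_mean` works, and why it returns fewer points than it is given, here is a small check on a toy array. The input values are made up for illustration, and `np.convolve` is used only as an independent reference.

```python
import numpy as np

def running_mean(x, N):
    # Prepend 0 so that cumsum[i] is the sum of the first i values of x.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # Each window sum is the difference of two cumulative sums N apart.
    return (cumsum[N:] - cumsum[:-N]) / N

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy data, not from the training run
smoothed = running_mean(x, 3)

# Independent reference: a plain moving average via convolution.
reference = np.convolve(x, np.ones(3) / 3, mode='valid')

print(smoothed)                          # [2. 3. 4.]
print(np.allclose(smoothed, reference))  # True
print(len(x), len(smoothed))             # 5 3
```

The output has `len(x) - N + 1` entries, which is why the plots index the episode axis with `eps[-len(smoothed_arr):]`.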

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(g_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(d_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

In [None]:
eps, arr = np.array(d_lossR_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses R')

In [None]:
eps, arr = np.array(d_lossQ_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses Q')
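
The five plotting cells above repeat one pattern: unpack a list of `(episode, value)` pairs, smooth the values, and align the smoothed curve with the last episodes. A minimal sketch of that pattern as a single function follows. The name `smoothed_series` and the toy `history` list are hypothetical, and the sketch assumes each `*_list` holds `(episode, value)` pairs, which is what the `.T` unpacking implies.

```python
import numpy as np

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

def smoothed_series(pairs, window=10):
    """Return (eps, values, smoothed_eps, smoothed_values) for plotting.

    `pairs` is assumed to be a sequence of (episode, value) tuples.
    """
    eps, values = np.array(pairs).T
    smoothed = running_mean(values, window)
    # The rolling mean has window - 1 fewer points, so keep the last episodes.
    return eps, values, eps[-len(smoothed):], smoothed

# Toy history: 12 episodes with a constant reward of 5.0 (made-up data).
history = [(ep, 5.0) for ep in range(12)]
eps, values, s_eps, s_values = smoothed_series(history, window=10)
print(len(eps), len(s_eps))  # 12 3
print(s_eps, s_values)
```

With this in place, each plot reduces to `plt.plot(s_eps, s_values)` followed by `plt.plot(eps, values, color='grey', alpha=0.3)`.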

## Testing

Let's check out how our trained agent plays the game.

In [34]:
import gym
# env = gym.make('CartPole-v0')  # superseded by CartPole-v1 on the next line
env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Episodes/epochs
    for _ in range(100):
    #while True:
        state = env.reset()
        total_reward = 0

        # Steps/batches
        #for _ in range(111111111111111111):
        while True:
            env.render()
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: np.reshape(state, [1, -1])})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        # Print and break condition
        print('total_reward: {}'.format(total_reward))
        # if total_reward == 500:
        #     break
                
# Closing the env
env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
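
The test loop above can be factored into a small function that works with any environment following the same `reset`/`step` interface. The sketch below is illustrative only: `evaluate`, `FixedLengthEnv`, and `always_left` are hypothetical names, and the stand-in environment and policy replace Gym and the restored network so the example runs on its own. CartPole-v1 caps an episode at 500 steps, which is why 500.0 is the best score the agent can reach.

```python
def evaluate(env, policy, episodes):
    """Run `policy` for a number of episodes and return the total reward of each."""
    totals = []
    for _ in range(episodes):
        state = env.reset()
        total_reward = 0.0
        done = False
        while not done:
            action = policy(state)
            # Old Gym API, as used in this notebook: step returns a 4-tuple.
            state, reward, done, _ = env.step(action)
            total_reward += reward
        totals.append(total_reward)
    return totals


class FixedLengthEnv:
    """Hypothetical stand-in environment: +1 reward per step, ends after max_steps."""

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_steps
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}


def always_left(state):
    # Stand-in for taking np.argmax over the network's action logits.
    return 0


totals = evaluate(FixedLengthEnv(max_steps=500), always_left, episodes=3)
print(totals)  # [500.0, 500.0, 500.0]
```

In the notebook itself, `policy` would wrap the `sess.run(model.actions_logits, ...)` call followed by `np.argmax`.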


## Extending this

Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of the small state vector we're using here, though, you'd want to use convolutional layers to extract the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)
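
Before screen images reach the convolutional layers, they are usually shrunk and stacked so the network can see motion. The sketch below shows one rough way to do that with NumPy. It is a simplification, not the paper's exact pipeline: the DQN paper rescales frames to 84x84 and stacks the last 4, whereas this version just keeps every second pixel. The names `to_grayscale`, `preprocess`, and `FrameStack` are hypothetical, and the screen is random data.

```python
from collections import deque
import numpy as np

def to_grayscale(frame):
    """Convert an (H, W, 3) uint8 RGB frame to an (H, W) float32 image scaled by 1/255."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # standard luma weights
    return (frame.astype(np.float32) @ weights) / 255.0

def preprocess(frame):
    # Grayscale, then halve the resolution by keeping every second pixel.
    return to_grayscale(frame)[::2, ::2]

class FrameStack:
    """Keep the last k preprocessed frames and expose them as one (H, W, k) state."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # Fill the stack with copies of the first frame of the episode.
        first = preprocess(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(first)
        return self.state()

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(preprocess(frame))
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=-1)

# Fake 210x160 RGB screen (the Atari frame size), random pixels for illustration.
rng = np.random.default_rng(0)
screen = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)

stack = FrameStack(k=4)
state = stack.reset(screen)
state = stack.push(screen)
print(state.shape, state.dtype)  # (105, 80, 4) float32
```

The stacked array would then take the place of the state vector fed to `model.states`, with the first dense layers swapped for convolutional ones.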

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, which will get you started: [Human-level control through deep reinforcement learning](http://www.davidqiu.com:8888/research/nature14236.pdf).