# Deep cortical reinforcement learning: Policy gradients + Q-learning + GAN


In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
# Check the TensorFlow version and whether TensorFlow can see a GPU
# (an empty device name below means it will run on the CPU)
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included `gym` as a submodule, so you can run `git submodule update --init --recursive` to pull the contents into the `gym` directory.

>**Note:** Once OpenAI Gym is cloned, install it with `pip install -e gym/[all]`.

In [2]:
import gym

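# Create the Cart-Pole game environment.
# The second call replaces the first, so the rest of the notebook runs on CartPole-v1,
# which caps an episode at 500 steps (CartPole-v0 caps it at 200).
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')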

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




We interact with the simulation through `env`. `env.render()` draws one frame, and `env.step(action)`, given an integer action, advances the simulation by one step and returns the next state, the reward, a `done` flag, and an `info` dictionary. `env.action_space` shows which actions are possible, and `env.action_space.sample()` draws a random one. This interface is common to all Gym environments. In the Cart-Pole game there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to step through the simulation with random actions. Uncomment the `env.render()` line to watch it run.

In [3]:
env.reset()
batch = []
for _ in range(1111):
    #env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    batch.append([action, state, reward, done, info])
    #print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

To shut the window showing the simulation, use `env.close()`.

In [4]:
# env.close()

If you ran the simulation above, we can look at the collected transitions and their rewards:

In [5]:
batch[0], 
batch[0][1].shape, state.shape

((4,), (4,))

In [6]:
import numpy as np
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
infos = np.array([each[4] for each in batch])

In [7]:
# print(rewards[-20:])
print('shapes:', np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print('dtypes:', np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('states:', np.max(np.array(states)), np.min(np.array(states)))
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print('rewards:', np.max(np.array(rewards)), np.min(np.array(rewards)))

shapes: (1111,) (1111, 4) (1111,) (1111,)
dtypes: float64 float64 int64 bool
states: 2.208870795028335 -2.5781655146810603
actions: 1 0
rewards: 1.0 1.0


In [8]:
actions[:10]

array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1])

In [9]:
rewards[:10]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [10]:
# import numpy as np
def sigmoid(x, derivative=False):
    # With derivative=True, x must already be a sigmoid output s, since d(sigmoid)/dz = s * (1 - s)
    return x*(1-x) if derivative else 1/(1+np.exp(-x))

In [11]:
sigmoid(np.max(np.array(rewards))), sigmoid(np.min(np.array(rewards)))

(0.7310585786300049, 0.7310585786300049)

In [12]:
print('rewards:', np.max(np.array(rewards))/100, np.min(np.array(rewards))/100)

rewards: 0.01 0.01


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.
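
Since every frame pays 1.0, the total reward of an episode is simply its length in frames. As a minimal sketch (the helper name `episode_returns` is made up for illustration), the flat `rewards` and `dones` sequences collected above can be split back into per-episode totals like this:

```python
def episode_returns(rewards, dones):
    """Sum rewards between resets; a trailing unfinished episode is dropped."""
    returns, running = [], 0.0
    for reward, done in zip(rewards, dones):
        running += reward
        if done:  # the environment was reset after this step
            returns.append(running)
            running = 0.0
    return returns

# Toy data: two episodes, lasting 3 and 4 frames
toy_rewards = [1.0] * 7
toy_dones = [False, False, True, False, False, False, True]
print(episode_returns(toy_rewards, toy_dones))  # [3.0, 4.0]
```

With a random policy the totals stay small, which is the baseline the trained agent has to beat.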

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
$$

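where $s$ is a state, $a$ is an action, $r$ is the reward received for taking $a$ in $s$, $\gamma$ is the discount factor, and $s'$ is the next state reached from state $s$ by action $a$. The maximum is taken over the actions $a'$ available in $s'$.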
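
To get a feel for the recursion, it can be unrolled along one trajectory. The sketch below assumes the actions taken were the greedy ones (so the max can be dropped) and uses a discount factor of 0.99, the value that appears in the training cell. The function name `discounted_return` is illustrative only.

```python
def discounted_return(rewards, gamma=0.99):
    """Apply Q_t = r_t + gamma * Q_{t+1} backwards from the end of an episode."""
    value = 0.0  # nothing is earned after the final step
    for reward in reversed(rewards):
        value = reward + gamma * value
    return value

# Three more frames at 1.0 each: 1 + 0.99 + 0.99**2
three_step = discounted_return([1.0, 1.0, 1.0])
print(three_step)  # approximately 2.9701
```

With a reward of 1.0 per frame, the value of $T$ remaining frames is $(1 - \gamma^T) / (1 - \gamma)$, so with $\gamma = 0.99$ no Q-value can exceed 100.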

Before, we used this equation to learn values for a Q-_table_. For this game, however, the number of states is huge. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so, ignoring the limits of floating-point precision, there are practically infinitely many states. Instead of a table, then, we'll use a neural network that approximates the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now our Q-value, $Q(s, a)$, is calculated by passing a state into the network. The network has fully connected hidden layers, and its output is one Q-value for each available action.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.
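
The target and loss described above can be sketched in NumPy for a small batch. This is an illustration under stated assumptions, not the notebook's TensorFlow code: the helper names `td_targets` and `squared_td_loss` are made up, and the `dones` mask (which zeroes the bootstrap term for terminal transitions) mirrors the `(1-dones)` factor used in the training cell.

```python
import numpy as np

def td_targets(rewards, next_qs, dones, gamma=0.99):
    """Q-hat = r + gamma * max_a' Q(s', a'), with no bootstrap after a terminal step."""
    max_next = next_qs.max(axis=1) * (1.0 - dones)
    return rewards + gamma * max_next

def squared_td_loss(qs, actions, targets):
    """Mean of (Q-hat(s, a) - Q(s, a))^2 over the actions that were actually taken."""
    chosen = qs[np.arange(len(actions)), actions]
    return np.mean((targets - chosen) ** 2)

# Two transitions with two actions each; the second one ended its episode
rewards = np.array([1.0, 1.0])
next_qs = np.array([[1.0, 2.0], [3.0, 0.0]])
dones = np.array([0.0, 1.0])
targets = td_targets(rewards, next_qs, dones)  # [1 + 0.99 * 2, 1]

qs = np.array([[2.0, 0.0], [0.0, 1.5]])
actions = np.array([0, 1])
loss = squared_td_loss(qs, actions, targets)   # mean of 0.98**2 and 0.5**2
```

In a real training step the targets are treated as constants, and only `qs` depends on the weights being updated.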

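Below is my implementation. Rather than a single Q-network, it uses two networks: a generator (actor) that outputs action logits, and a discriminator (critic) that outputs a Q-value for a state and its action logits. Each one is a fully connected layer, a GRU, and a fully connected output layer. The commented-out MLP versions use two fully connected hidden layers with batch normalization and leaky ReLU activations. Two hidden layers seem to be good enough; three might be better. Feel free to try it out.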

In [34]:
# Data of the model
def model_input(state_size, lstm_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    # GRU: Gated Recurrent Units
    gru = tf.nn.rnn_cell.GRUCell(lstm_size) # hidden size
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    g_initial_state = cell.zero_state(batch_size, tf.float32) # feedback or lateral/recurrent connection from output
    d_initial_state = cell.zero_state(batch_size, tf.float32) # feedback or lateral/recurrent connection from output
    return states, actions, targetQs, cell, g_initial_state, d_initial_state

In [35]:
# How to use batch-norm
#   x_norm = tf.layers.batch_normalization(x, training=training)

#   # ...

#   update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
#   with tf.control_dependencies(update_ops):
#     train_op = optimizer.minimize(loss)

In [36]:
# training: Either a Python boolean, or a TensorFlow boolean scalar tensor (e.g. a placeholder). 
# Whether to return the output in: 
# training mode (normalized with statistics of the current batch) or 
# inference mode (normalized with moving statistics). 
# NOTE: make sure to set this parameter correctly, or else your training/inference will not work properly.

In [37]:
# MLP & Conv
# # Generator/Controller: Generating/predicting the actions
# def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
#     with tf.variable_scope('generator', reuse=reuse):
#         # First fully connected layer
#         h1 = tf.layers.dense(inputs=states, units=hidden_size)
#         bn1 = tf.layers.batch_normalization(h1, training=training)        
#         nl1 = tf.maximum(alpha * bn1, bn1)
        
#         # Second fully connected layer
#         h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
#         bn2 = tf.layers.batch_normalization(h2, training=training)        
#         nl2 = tf.maximum(alpha * bn2, bn2)
        
#         # Output layer
#         logits = tf.layers.dense(inputs=nl2, units=action_size)        
#         #predictions = tf.nn.softmax(logits)

#         # return actions logits
#         return logits

In [38]:
# RNN generator or sequence generator
def generator(states, initial_state, cell, lstm_size, num_classes, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state
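

The reshape in `generator` deserves a note. By default `tf.nn.dynamic_rnn` expects inputs shaped `[batch, time, depth]`, so turning the `N x H` matrix into `1 x N x H` makes the GRU read the N rows as N time steps of a single sequence. That is why `model_input` builds the zero state with `batch_size=1`. A NumPy sketch of the same round trip, with made-up sizes:

```python
import numpy as np

N, H = 5, 3  # 5 consecutive states, hidden size 3 (illustrative values)
inputs = np.arange(N * H, dtype=np.float32).reshape(N, H)

inputs_rnn = inputs.reshape(1, -1, H)  # NxH -> 1xNxH: one sequence of N steps
outputs = inputs_rnn.reshape(-1, H)    # 1xNxH -> NxH: back to one row per step

print(inputs_rnn.shape, outputs.shape)  # (1, 5, 3) (5, 3)
```

Because rows become time steps, the order of the rows fed to the network matters: they should be consecutive states from the same episode.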

In [39]:
# MLP & Conv
# # Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
# def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
#     with tf.variable_scope('discriminator', reuse=reuse):
#         # Fusion/merge states and actions/ SA/ SM
#         x_fused = tf.concat(axis=1, values=[states, actions])
        
#         # First fully connected layer
#         h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
#         bn1 = tf.layers.batch_normalization(h1, training=training)        
#         nl1 = tf.maximum(alpha * bn1, bn1)
        
#         # Second fully connected layer
#         h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
#         bn2 = tf.layers.batch_normalization(h2, training=training)        
#         nl2 = tf.maximum(alpha * bn2, bn2)
        
#         # Output layer
#         logits = tf.layers.dense(inputs=nl2, units=1)        
#         #predictions = tf.nn.softmax(logits)

#         # return rewards logits
#         return logits

In [40]:
# RNN discriminator or sequence discriminator
def discriminator(states, actions, initial_state, cell, lstm_size, reuse=False): 
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        inputs = tf.layers.dense(inputs=x_fused, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=1)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the Q-value logits
        return logits, final_state

In [41]:
def model_loss(action_size, hidden_size, states, actions, targetQs,
               cell, g_initial_state, d_initial_state):
    # G/Actor
    #actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_logits, g_final_state = generator(states=states, num_classes=action_size, 
                                              cell=cell, initial_state=g_initial_state, lstm_size=hidden_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                      labels=actions_labels)
    g_loss = tf.reduce_mean(neg_log_prob_actions * targetQs)
    
    # D/Critic
    #Qs_logits = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    Qs_logits, d_final_state = discriminator(states=states, actions=actions_logits, 
                                             cell=cell, initial_state=d_initial_state, lstm_size=hidden_size)
    d_lossQ = tf.reduce_mean(tf.square(tf.reshape(Qs_logits, [-1]) - targetQs))
    d_lossQ_sigm = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=tf.reshape(Qs_logits, [-1]),
                                                                          labels=tf.nn.sigmoid(targetQs)))
    d_loss = d_lossQ_sigm + d_lossQ

    return actions_logits, Qs_logits, g_final_state, d_final_state, g_loss, d_loss, d_lossQ, d_lossQ_sigm
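

To make the three loss terms in `model_loss` concrete, here is a NumPy sketch of the same arithmetic on a tiny batch. It is a re-implementation for illustration only: the helper names are made up, and the sigmoid cross-entropy uses the numerically stable form `max(x, 0) - x*z + log(1 + exp(-|x|))`.

```python
import numpy as np

def softmax_cross_entropy(logits, labels_onehot):
    """Row-wise -sum(labels * log_softmax(logits))."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels_onehot * log_probs).sum(axis=1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_cross_entropy(logits, labels):
    """Stable form of -z*log(sigmoid(x)) - (1-z)*log(1-sigmoid(x))."""
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

def losses(action_logits, actions, q_logits, target_qs):
    onehot = np.eye(action_logits.shape[1])[actions]
    # Generator: negative log-probability of the taken action, weighted by the target Q
    g_loss = np.mean(softmax_cross_entropy(action_logits, onehot) * target_qs)
    # Discriminator: squared error plus cross-entropy against the squashed target
    d_loss_q = np.mean((q_logits - target_qs) ** 2)
    d_loss_sigm = np.mean(sigmoid_cross_entropy(q_logits, sigmoid(target_qs)))
    return g_loss, d_loss_q, d_loss_sigm

# Uniform action logits give -log(0.5) per row; the critic matches its targets exactly
g_loss, d_loss_q, d_loss_sigm = losses(
    action_logits=np.zeros((2, 2)), actions=np.array([0, 1]),
    q_logits=np.array([1.0, 3.0]), target_qs=np.array([1.0, 3.0]))
```

Even when the squared-error term is zero, the cross-entropy term stays positive, because its labels `sigmoid(targetQs)` are soft rather than 0 or 1.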

In [42]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    """
    Get the optimization operations
    :param g_loss: Generator loss Tensor for action prediction
    :param d_loss: Discriminator loss Tensor for reward prediction for generated/prob/logits action
    :param learning_rate: Learning rate (a Python float, as passed in by Model)
    :return: A tuple of (generator training op, discriminator training op)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize RNN
    # g_grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(g_loss, g_vars), clip_norm=5) # usually around 1-5
    # d_grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(d_loss, d_vars), clip_norm=5) # usually around 1-5
    g_grads=tf.gradients(g_loss, g_vars)
    d_grads=tf.gradients(d_loss, d_vars)
    g_opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(g_grads, g_vars))
    d_opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(d_grads, d_vars))
    
    # # Optimize MLP & CNN
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    #     g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
    #     d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [43]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cell, self.g_initial_state, self.d_initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_final_state, self.d_final_state, self.g_loss, self.d_loss, self.d_lossQ, self.d_lossQ_sigm = model_loss(
            action_size=action_size, hidden_size=hidden_size,
            states=self.states, actions=self.actions, cell=cell, targetQs=self.targetQs,
            g_initial_state=self.g_initial_state, d_initial_state=self.d_initial_state)
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [44]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)
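

`Memory` relies on `deque(maxlen=...)`, which silently drops the oldest item once the buffer is full. A small standalone sketch of that behavior, with a toy size of 3:

```python
from collections import deque

buffer = deque(maxlen=3)
for step in range(5):
    buffer.append(step)  # once full, each append evicts the oldest entry

print(list(buffer))  # [2, 3, 4]
```

Later in this notebook the memory is created with `max_size=batch_size`, and each training step reads the whole buffer, so the network always trains on the most recent `batch_size` transitions in order.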

## Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're also tuning the simulation.

In [45]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1111, 4) actions:(1111,)
action size:2


In [46]:
# Training parameters
# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
action_size = 2                # number of units for the output actions -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation
batch_size = 32                # number of samples in the memory/ experience as mini-batch size
learning_rate = 0.001          # learning rate for adam

In [47]:
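# Reset/init the graph/session
# tf.reset_default_graph() returns None, so fetch the fresh default graph explicitly
tf.reset_default_graph()
graph = tf.get_default_graph()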

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 2)
(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 1)


In [48]:
state = env.reset()
for _ in range(batch_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done:
        state = env.reset()

In [49]:
memory.buffer[0]

[array([-0.01069676, -0.00322705,  0.03619071, -0.03418689]),
 0,
 array([-0.0107613 , -0.19884879,  0.03550697,  0.26969134]),
 1.0,
 0.0]

## Training the model

Below we'll train our agent. If you want to watch it train, add an `env.render()` call inside the step loop. Rendering slows training down considerably, because drawing each frame takes longer than a training step. But it's cool to watch the agent get better at the game.
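
The training loop below stops once the mean total reward over the last 100 episodes reaches 500. A minimal sketch of that stopping rule in isolation (the function name `first_solved_episode` is made up for illustration):

```python
from collections import deque

def first_solved_episode(episode_totals, window=100, threshold=500.0):
    """Return the index of the first episode at which the running mean hits the threshold."""
    recent = deque(maxlen=window)  # keeps only the last `window` totals
    for index, total in enumerate(episode_totals):
        recent.append(total)
        if sum(recent) / len(recent) >= threshold:
            return index
    return None  # never solved

print(first_solved_episode([100.0, 500.0, 500.0], window=2))  # 2
```

Note that the mean is taken over however many episodes are in the window so far, so the rule can trigger before 100 episodes have been played.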

In [None]:
from collections import deque
episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
saver = tf.train.Saver()
rewards_list, g_loss_list, d_loss_list = [], [], []
rates_list, d_lossQ_list, d_lossQsigm_list = [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    for ep in range(11111):
        batch = [] # every data batch
        total_reward = 0
        state = env.reset() # env first state
        g_initial_state = sess.run(model.g_initial_state)
        d_initial_state = sess.run(model.d_initial_state)

        # Training steps/batches
        while True:
            # Testing/inference
            action_logits, g_final_state, d_final_state = sess.run(
                fetches=[model.actions_logits, model.g_final_state, model.d_final_state], 
                feed_dict={model.states: np.reshape(state, [1, -1]),
                           model.g_initial_state: g_initial_state,
                           model.d_initial_state: d_initial_state})
            action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append([g_initial_state, g_final_state,
                                  d_initial_state, d_final_state])
            total_reward += reward
            g_initial_state = g_final_state
            d_initial_state = d_final_state
            state = next_state
            
            # Training
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            rnn_states = memory.states
            g_initial_states = np.array([each[0] for each in rnn_states])
            g_final_states = np.array([each[1] for each in rnn_states])
            d_initial_states = np.array([each[2] for each in rnn_states])
            d_final_states = np.array([each[3] for each in rnn_states])
            nextQs_logits = sess.run(fetches = model.Qs_logits,
                                     feed_dict = {model.states: next_states, 
                                                  model.g_initial_state: g_final_states[0].reshape([1, -1]),
                                                  model.d_initial_state: d_final_states[0].reshape([1, -1])})
            nextQs = nextQs_logits.reshape([-1]) * (1-dones) # exploit
            targetQs = rewards + (0.99 * nextQs)
            g_loss, d_loss, d_lossQ, d_lossQsigm, _, _ = sess.run(
                fetches=[model.g_loss, model.d_loss, 
                         model.d_lossQ, model.d_lossQ_sigm,
                         model.g_opt, model.d_opt], 
                feed_dict = {model.states: states, model.actions: actions,
                             model.targetQs: targetQs,
                             model.g_initial_state: g_initial_states[0].reshape([1, -1]),
                             model.d_initial_state: d_initial_states[0].reshape([1, -1])})

            if done:
                break

        # Episode total reward and success rate/prob
        episode_reward.append(total_reward) # stopping criteria
        rate = total_reward/ 500 # success is 500 points: 0-1
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'rate:{:.4f}'.format(rate),
              'gloss:{:.4f}'.format(g_loss),
              'dloss:{:.4f}'.format(d_loss),
              'dlossQ:{:.4f}'.format(d_lossQ),
              'dlossQsigm:{:.4f}'.format(d_lossQsigm))
        # Collect values for plotting
        rewards_list.append([ep, np.mean(episode_reward)])
        rates_list.append([ep, rate])
        g_loss_list.append([ep, g_loss])
        d_loss_list.append([ep, d_loss])
        d_lossQ_list.append([ep, d_lossQ])
        d_lossQsigm_list.append([ep, d_lossQsigm])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:48.0000 rate:0.0960 gloss:0.5152 dloss:1.4261 dlossQ:0.9713 dlossQsigm:0.4548
Episode:1 meanR:54.0000 rate:0.1200 gloss:0.1059 dloss:3.5117 dlossQ:3.4335 dlossQsigm:0.0782
Episode:2 meanR:59.6667 rate:0.1420 gloss:0.0066 dloss:8.2521 dlossQ:8.1162 dlossQsigm:0.1359
Episode:3 meanR:60.7500 rate:0.1280 gloss:0.0627 dloss:9.2201 dlossQ:9.0784 dlossQsigm:0.1416
Episode:4 meanR:56.0000 rate:0.0740 gloss:0.2940 dloss:3.9147 dlossQ:3.8604 dlossQsigm:0.0543
Episode:5 meanR:62.0000 rate:0.1840 gloss:0.0487 dloss:18.0800 dlossQ:17.8812 dlossQsigm:0.1988
Episode:6 meanR:60.4286 rate:0.1020 gloss:0.1382 dloss:9.5126 dlossQ:9.3767 dlossQsigm:0.1359
Episode:7 meanR:59.5000 rate:0.1060 gloss:0.1514 dloss:5.9219 dlossQ:5.7190 dlossQsigm:0.2029
Episode:8 meanR:57.3333 rate:0.0800 gloss:-14.5985 dloss:17.7668 dlossQ:17.2592 dlossQsigm:0.5076
Episode:9 meanR:57.4000 rate:0.1160 gloss:0.0268 dloss:11.9522 dlossQ:11.5416 dlossQsigm:0.4106
Episode:10 meanR:71.7273 rate:0.4300 gloss:0.0039 dl

Episode:85 meanR:148.3256 rate:0.3600 gloss:0.0047 dloss:38.6580 dlossQ:38.3657 dlossQsigm:0.2922
Episode:86 meanR:152.3678 rate:1.0000 gloss:1.2108 dloss:256.4392 dlossQ:255.6707 dlossQsigm:0.7685
Episode:87 meanR:156.3182 rate:1.0000 gloss:0.4264 dloss:252.4097 dlossQ:251.6474 dlossQsigm:0.7624
Episode:88 meanR:157.3708 rate:0.5000 gloss:1.0918 dloss:138.2171 dlossQ:137.6527 dlossQsigm:0.5644
Episode:89 meanR:161.1778 rate:1.0000 gloss:0.0297 dloss:254.4239 dlossQ:253.6584 dlossQsigm:0.7654
Episode:90 meanR:164.9011 rate:1.0000 gloss:1.4100 dloss:197.7912 dlossQ:197.1163 dlossQsigm:0.6749
Episode:91 meanR:168.5435 rate:1.0000 gloss:1.1271 dloss:235.4438 dlossQ:234.7076 dlossQsigm:0.7363
Episode:92 meanR:172.1075 rate:1.0000 gloss:0.1981 dloss:241.4702 dlossQ:240.7243 dlossQsigm:0.7459
Episode:93 meanR:172.5957 rate:0.4360 gloss:1.3309 dloss:91.2905 dlossQ:90.8382 dlossQsigm:0.4523
Episode:94 meanR:174.5684 rate:0.7200 gloss:5.1505 dloss:185.7843 dlossQ:185.1308 dlossQsigm:0.6535

Episode:169 meanR:200.6100 rate:0.3380 gloss:0.0000 dloss:8.8003 dlossQ:8.6855 dlossQsigm:0.1148
Episode:170 meanR:203.4400 rate:1.0000 gloss:0.2001 dloss:240.8342 dlossQ:240.0894 dlossQsigm:0.7448
Episode:171 meanR:202.5600 rate:0.2240 gloss:0.0003 dloss:53.1096 dlossQ:52.9648 dlossQsigm:0.1448
Episode:172 meanR:202.5000 rate:0.2500 gloss:0.0003 dloss:38.9590 dlossQ:38.8438 dlossQsigm:0.1153
Episode:173 meanR:205.6800 rate:1.0000 gloss:0.4299 dloss:183.6344 dlossQ:182.9998 dlossQsigm:0.6346
Episode:174 meanR:208.9000 rate:1.0000 gloss:0.0000 dloss:181.7921 dlossQ:181.1445 dlossQsigm:0.6477
Episode:175 meanR:207.7100 rate:0.2740 gloss:0.0000 dloss:179.5241 dlossQ:179.1520 dlossQsigm:0.3720
Episode:176 meanR:206.5000 rate:0.2980 gloss:0.0033 dloss:94.3131 dlossQ:93.8826 dlossQsigm:0.4304
Episode:177 meanR:205.7900 rate:0.3720 gloss:0.0000 dloss:61.4041 dlossQ:61.0289 dlossQsigm:0.3752
Episode:178 meanR:208.7100 rate:1.0000 gloss:0.1163 dloss:232.9000 dlossQ:232.1681 dlossQsigm:0.7319

Episode:251 meanR:343.5800 rate:0.3740 gloss:0.0004 dloss:192.5699 dlossQ:191.9044 dlossQsigm:0.6654
Episode:252 meanR:347.6600 rate:1.0000 gloss:0.1996 dloss:199.9042 dlossQ:199.2254 dlossQsigm:0.6788
Episode:253 meanR:348.4300 rate:0.3420 gloss:0.0000 dloss:160.6407 dlossQ:160.0313 dlossQsigm:0.6094
Episode:254 meanR:352.2600 rate:1.0000 gloss:0.0000 dloss:191.3875 dlossQ:190.7227 dlossQsigm:0.6648
Episode:255 meanR:353.6600 rate:0.4740 gloss:0.0000 dloss:189.1950 dlossQ:188.5349 dlossQsigm:0.6602
Episode:256 meanR:354.2800 rate:0.3500 gloss:0.0000 dloss:169.7405 dlossQ:169.1141 dlossQsigm:0.6263
Episode:257 meanR:357.4400 rate:0.8800 gloss:0.0000 dloss:188.2865 dlossQ:187.6269 dlossQsigm:0.6595
Episode:258 meanR:357.9200 rate:0.3240 gloss:0.0000 dloss:173.3106 dlossQ:172.6776 dlossQsigm:0.6330
Episode:259 meanR:358.3200 rate:0.3320 gloss:0.0000 dloss:172.0706 dlossQ:171.4398 dlossQsigm:0.6307
Episode:260 meanR:358.4800 rate:0.3400 gloss:0.0005 dloss:162.8668 dlossQ:162.2529 dlossQsi

Episode:333 meanR:254.6300 rate:0.5160 gloss:0.0000 dloss:128.2273 dlossQ:127.6820 dlossQsigm:0.5453
Episode:334 meanR:250.9100 rate:0.2560 gloss:0.0003 dloss:123.1977 dlossQ:122.6631 dlossQsigm:0.5346
Episode:335 meanR:247.2100 rate:0.2600 gloss:0.0000 dloss:124.0608 dlossQ:123.5243 dlossQsigm:0.5365
Episode:336 meanR:243.4600 rate:0.2500 gloss:0.0000 dloss:119.2823 dlossQ:118.7561 dlossQsigm:0.5262
Episode:337 meanR:239.6500 rate:0.2380 gloss:0.0000 dloss:114.9189 dlossQ:114.4023 dlossQsigm:0.5166
Episode:338 meanR:238.9900 rate:0.2240 gloss:0.0000 dloss:110.3908 dlossQ:109.8844 dlossQsigm:0.5064
Episode:339 meanR:235.3300 rate:0.2680 gloss:0.0002 dloss:109.7796 dlossQ:109.2749 dlossQsigm:0.5047
Episode:340 meanR:231.6200 rate:0.2580 gloss:0.0000 dloss:109.0042 dlossQ:108.5009 dlossQsigm:0.5033
Episode:341 meanR:227.6900 rate:0.2140 gloss:0.0000 dloss:105.1620 dlossQ:104.6676 dlossQsigm:0.4944
Episode:342 meanR:227.5000 rate:0.3900 gloss:0.0001 dloss:113.1517 dlossQ:112.6395 dlossQsi

Episode:415 meanR:139.5300 rate:0.2160 gloss:0.0000 dloss:44.1564 dlossQ:43.8932 dlossQsigm:0.2633
Episode:416 meanR:138.9300 rate:0.2040 gloss:0.0000 dloss:45.3275 dlossQ:45.0721 dlossQsigm:0.2554
Episode:417 meanR:138.2400 rate:0.1640 gloss:0.0168 dloss:43.1972 dlossQ:42.1803 dlossQsigm:1.0169
Episode:418 meanR:137.7800 rate:0.2180 gloss:0.0057 dloss:118.6990 dlossQ:118.2887 dlossQsigm:0.4103
Episode:419 meanR:137.1500 rate:0.2120 gloss:0.0000 dloss:99.9956 dlossQ:99.5390 dlossQsigm:0.4566
Episode:420 meanR:136.6200 rate:0.2040 gloss:0.0000 dloss:106.3712 dlossQ:105.8896 dlossQsigm:0.4815
Episode:421 meanR:132.5800 rate:0.1920 gloss:0.0000 dloss:117.3470 dlossQ:116.8368 dlossQsigm:0.5102
Episode:422 meanR:131.5700 rate:0.2040 gloss:0.0015 dloss:42.6710 dlossQ:42.1993 dlossQsigm:0.4717
Episode:423 meanR:135.0900 rate:1.0000 gloss:1.7626 dloss:222.2460 dlossQ:221.5492 dlossQsigm:0.6968
Episode:424 meanR:134.6000 rate:0.1880 gloss:0.0000 dloss:69.0389 dlossQ:68.6453 dlossQsigm:0.3936

Episode:497 meanR:188.1600 rate:0.2540 gloss:0.0000 dloss:150.3830 dlossQ:149.7939 dlossQsigm:0.5891
Episode:498 meanR:188.0400 rate:0.1740 gloss:0.0000 dloss:80.0243 dlossQ:79.5931 dlossQsigm:0.4312
Episode:499 meanR:192.2100 rate:1.0000 gloss:0.0417 dloss:227.8918 dlossQ:227.1721 dlossQsigm:0.7197
Episode:500 meanR:192.6300 rate:0.2820 gloss:0.0000 dloss:132.7177 dlossQ:132.1688 dlossQsigm:0.5489
Episode:501 meanR:192.1900 rate:0.2200 gloss:0.0000 dloss:183.8793 dlossQ:183.2380 dlossQsigm:0.6413
Episode:502 meanR:192.1200 rate:0.1760 gloss:0.0000 dloss:166.5753 dlossQ:165.9578 dlossQsigm:0.6176
Episode:503 meanR:194.2100 rate:0.6080 gloss:0.0000 dloss:129.8027 dlossQ:129.2696 dlossQsigm:0.5331
Episode:504 meanR:194.4800 rate:0.2240 gloss:0.0000 dloss:141.1375 dlossQ:140.5830 dlossQsigm:0.5545
Episode:505 meanR:196.8400 rate:0.6480 gloss:0.0008 dloss:195.2205 dlossQ:194.5558 dlossQsigm:0.6647
Episode:506 meanR:196.6000 rate:0.2280 gloss:0.0000 dloss:114.0190 dlossQ:113.5099 dlossQsigm

Episode:579 meanR:198.7600 rate:0.5300 gloss:0.0000 dloss:126.5390 dlossQ:126.0645 dlossQsigm:0.4745
Episode:580 meanR:198.7900 rate:0.2500 gloss:0.0000 dloss:222.5580 dlossQ:221.8481 dlossQsigm:0.7098
Episode:581 meanR:202.4500 rate:1.0000 gloss:0.0318 dloss:218.2520 dlossQ:217.5433 dlossQsigm:0.7087
Episode:582 meanR:202.3000 rate:0.1760 gloss:0.0000 dloss:108.6745 dlossQ:108.1944 dlossQsigm:0.4801
Episode:583 meanR:206.4600 rate:1.0000 gloss:0.0028 dloss:220.4199 dlossQ:219.7075 dlossQsigm:0.7124
Episode:584 meanR:210.2800 rate:1.0000 gloss:0.0001 dloss:227.0298 dlossQ:226.3063 dlossQsigm:0.7235
Episode:585 meanR:213.9100 rate:1.0000 gloss:0.0013 dloss:226.5764 dlossQ:225.8536 dlossQsigm:0.7228
Episode:586 meanR:213.4800 rate:0.9140 gloss:0.0001 dloss:156.2690 dlossQ:155.6737 dlossQsigm:0.5953
Episode:587 meanR:209.6600 rate:0.2360 gloss:0.0000 dloss:233.2751 dlossQ:232.5570 dlossQsigm:0.7181
Episode:588 meanR:213.7200 rate:1.0000 gloss:0.0037 dloss:196.1691 dlossQ:195.4960 dlossQsi

Episode:661 meanR:277.5000 rate:0.1820 gloss:0.0040 dloss:120.2917 dlossQ:119.7650 dlossQsigm:0.5267
Episode:662 meanR:277.1400 rate:0.1980 gloss:0.0001 dloss:106.1621 dlossQ:105.6917 dlossQsigm:0.4704
Episode:663 meanR:273.4700 rate:0.2660 gloss:0.0000 dloss:106.9925 dlossQ:106.5017 dlossQsigm:0.4908
Episode:664 meanR:277.5600 rate:1.0000 gloss:0.0000 dloss:169.9553 dlossQ:169.3283 dlossQsigm:0.6270
Episode:665 meanR:277.9700 rate:0.2740 gloss:0.0000 dloss:149.1266 dlossQ:148.5419 dlossQsigm:0.5847
Episode:666 meanR:277.9700 rate:1.0000 gloss:0.0007 dloss:182.6054 dlossQ:181.9558 dlossQsigm:0.6496
Episode:667 meanR:282.0200 rate:1.0000 gloss:0.0189 dloss:207.9608 dlossQ:207.2687 dlossQsigm:0.6921
Episode:668 meanR:281.9700 rate:0.1940 gloss:0.0002 dloss:266.9171 dlossQ:266.1570 dlossQsigm:0.7601
Episode:669 meanR:286.1000 rate:1.0000 gloss:0.0881 dloss:159.2474 dlossQ:158.6416 dlossQsigm:0.6058
Episode:670 meanR:290.1100 rate:1.0000 gloss:0.0023 dloss:197.4972 dlossQ:196.8225 dlossQsi

Episode:743 meanR:226.6200 rate:1.0000 gloss:0.0000 dloss:230.2201 dlossQ:229.4915 dlossQsigm:0.7286
Episode:744 meanR:226.3300 rate:0.9420 gloss:0.0000 dloss:230.4700 dlossQ:229.7410 dlossQsigm:0.7290
Episode:745 meanR:226.4300 rate:0.1600 gloss:0.0000 dloss:133.2056 dlossQ:132.6910 dlossQsigm:0.5145
Episode:746 meanR:228.5200 rate:0.5880 gloss:0.0000 dloss:200.1514 dlossQ:199.4716 dlossQsigm:0.6798
Episode:747 meanR:230.1100 rate:1.0000 gloss:0.0011 dloss:213.5403 dlossQ:212.8384 dlossQsigm:0.7020
Episode:748 meanR:231.2000 rate:1.0000 gloss:0.0005 dloss:222.4349 dlossQ:221.7186 dlossQsigm:0.7163
Episode:749 meanR:227.3800 rate:0.2360 gloss:0.0000 dloss:204.0484 dlossQ:203.3627 dlossQsigm:0.6857
Episode:750 meanR:223.2600 rate:0.1760 gloss:0.0001 dloss:199.8983 dlossQ:199.2779 dlossQsigm:0.6204
Episode:751 meanR:227.5200 rate:1.0000 gloss:0.0003 dloss:196.3666 dlossQ:195.6931 dlossQsigm:0.6734
Episode:752 meanR:230.9600 rate:1.0000 gloss:0.0044 dloss:211.5327 dlossQ:210.8341 dlossQsi

Episode:825 meanR:252.3200 rate:0.2000 gloss:0.0000 dloss:146.7959 dlossQ:146.2128 dlossQsigm:0.5831
Episode:826 meanR:252.5300 rate:0.2060 gloss:0.0000 dloss:109.2681 dlossQ:108.7701 dlossQsigm:0.4980
Episode:827 meanR:252.4400 rate:0.1600 gloss:0.0000 dloss:75.7660 dlossQ:75.3653 dlossQsigm:0.4007
Episode:828 meanR:252.4200 rate:0.1560 gloss:0.0000 dloss:76.9880 dlossQ:76.5993 dlossQsigm:0.3886
Episode:829 meanR:256.3800 rate:1.0000 gloss:0.0060 dloss:167.1447 dlossQ:166.5236 dlossQsigm:0.6211
Episode:830 meanR:256.8100 rate:0.2200 gloss:0.0000 dloss:115.3489 dlossQ:114.8544 dlossQsigm:0.4944
Episode:831 meanR:256.8800 rate:0.1940 gloss:0.0001 dloss:117.3978 dlossQ:116.9021 dlossQsigm:0.4957
Episode:832 meanR:252.8300 rate:0.1900 gloss:0.0000 dloss:60.4399 dlossQ:60.1497 dlossQsigm:0.2902
Episode:833 meanR:250.0700 rate:0.2360 gloss:0.0000 dloss:112.6155 dlossQ:112.1168 dlossQsigm:0.4988
Episode:834 meanR:250.6300 rate:0.2680 gloss:0.0000 dloss:74.4878 dlossQ:74.1092 dlossQsigm:0.378

Episode:907 meanR:208.8100 rate:0.5640 gloss:0.0000 dloss:159.1091 dlossQ:158.5023 dlossQsigm:0.6068
Episode:908 meanR:208.9700 rate:0.1800 gloss:0.0000 dloss:142.9810 dlossQ:142.4054 dlossQsigm:0.5756
Episode:909 meanR:208.7800 rate:0.1400 gloss:0.0000 dloss:128.3510 dlossQ:127.8054 dlossQsigm:0.5456
Episode:910 meanR:209.0800 rate:1.0000 gloss:0.2743 dloss:167.8333 dlossQ:167.2117 dlossQsigm:0.6217
Episode:911 meanR:209.2700 rate:0.1800 gloss:0.0000 dloss:149.2862 dlossQ:148.6982 dlossQsigm:0.5880
Episode:912 meanR:209.2700 rate:1.0000 gloss:0.0054 dloss:182.3827 dlossQ:181.7335 dlossQsigm:0.6492
Episode:913 meanR:209.1000 rate:0.2140 gloss:0.0000 dloss:162.6781 dlossQ:162.0646 dlossQsigm:0.6135
Episode:914 meanR:210.6100 rate:0.8100 gloss:0.0000 dloss:181.0288 dlossQ:180.3819 dlossQsigm:0.6468
Episode:915 meanR:210.1900 rate:0.1120 gloss:0.0000 dloss:141.2732 dlossQ:140.7010 dlossQsigm:0.5721
Episode:916 meanR:209.8600 rate:0.1620 gloss:0.0000 dloss:128.1710 dlossQ:127.6257 dlossQsi

Episode:989 meanR:141.5300 rate:0.1500 gloss:0.0003 dloss:57.6283 dlossQ:57.2724 dlossQsigm:0.3559
Episode:990 meanR:145.5000 rate:1.0000 gloss:0.4109 dloss:129.1706 dlossQ:128.6330 dlossQsigm:0.5375
Episode:991 meanR:145.7500 rate:0.1620 gloss:0.0000 dloss:110.2079 dlossQ:109.7021 dlossQsigm:0.5059
Episode:992 meanR:145.7100 rate:0.2140 gloss:0.0000 dloss:105.7358 dlossQ:105.2415 dlossQsigm:0.4943
Episode:993 meanR:141.4600 rate:0.1500 gloss:0.0000 dloss:97.5042 dlossQ:97.0286 dlossQsigm:0.4757
Episode:994 meanR:141.6700 rate:0.2040 gloss:0.0000 dloss:82.1534 dlossQ:81.7158 dlossQsigm:0.4376
Episode:995 meanR:141.7000 rate:0.1840 gloss:0.0003 dloss:84.5976 dlossQ:84.1553 dlossQsigm:0.4424
Episode:996 meanR:141.5600 rate:0.1540 gloss:0.0000 dloss:72.2756 dlossQ:71.8659 dlossQsigm:0.4097
Episode:997 meanR:144.7000 rate:0.7660 gloss:0.0000 dloss:109.1162 dlossQ:108.6138 dlossQsigm:0.5023
Episode:998 meanR:144.8000 rate:0.1460 gloss:0.0006 dloss:99.2666 dlossQ:98.7893 dlossQsigm:0.4774
Ep

Episode:1071 meanR:126.7600 rate:0.2240 gloss:0.0000 dloss:67.2344 dlossQ:66.8381 dlossQsigm:0.3963
Episode:1072 meanR:126.7100 rate:0.1700 gloss:0.0000 dloss:54.9580 dlossQ:54.6197 dlossQsigm:0.3383
Episode:1073 meanR:126.8100 rate:0.1820 gloss:0.0000 dloss:64.0987 dlossQ:63.7147 dlossQsigm:0.3840
Episode:1074 meanR:126.9000 rate:0.1420 gloss:0.0000 dloss:47.5205 dlossQ:47.1920 dlossQsigm:0.3285
Episode:1075 meanR:126.5200 rate:0.1480 gloss:0.0000 dloss:59.4361 dlossQ:59.0642 dlossQsigm:0.3719
Episode:1076 meanR:130.2800 rate:1.0000 gloss:0.0000 dloss:108.2648 dlossQ:107.7649 dlossQsigm:0.4999
Episode:1077 meanR:130.0300 rate:0.1600 gloss:0.0000 dloss:44.4487 dlossQ:44.1260 dlossQsigm:0.3227
Episode:1078 meanR:130.0400 rate:0.2660 gloss:0.0000 dloss:63.1159 dlossQ:62.7406 dlossQsigm:0.3753
Episode:1079 meanR:129.7500 rate:0.1480 gloss:0.0000 dloss:60.4305 dlossQ:60.0572 dlossQsigm:0.3733
Episode:1080 meanR:129.9600 rate:0.1660 gloss:0.0000 dloss:47.8113 dlossQ:47.5062 dlossQsigm:0.305

Episode:1152 meanR:190.5100 rate:0.2800 gloss:0.0000 dloss:191.3011 dlossQ:190.6364 dlossQsigm:0.6648
Episode:1153 meanR:194.5800 rate:1.0000 gloss:0.0003 dloss:177.0163 dlossQ:176.3766 dlossQsigm:0.6397
Episode:1154 meanR:196.0000 rate:0.4680 gloss:0.0000 dloss:172.8199 dlossQ:172.1877 dlossQsigm:0.6322
Episode:1155 meanR:200.2700 rate:1.0000 gloss:0.0007 dloss:197.7639 dlossQ:197.0882 dlossQsigm:0.6757
Episode:1156 meanR:204.5800 rate:1.0000 gloss:0.0005 dloss:269.1860 dlossQ:268.3991 dlossQsigm:0.7869
Episode:1157 meanR:205.3400 rate:0.5920 gloss:0.0027 dloss:191.9079 dlossQ:191.2486 dlossQsigm:0.6594
Episode:1158 meanR:207.1300 rate:0.5900 gloss:0.0000 dloss:231.0869 dlossQ:230.3569 dlossQsigm:0.7300
Episode:1159 meanR:209.4700 rate:0.6260 gloss:0.0000 dloss:210.0822 dlossQ:209.3860 dlossQsigm:0.6961
Episode:1160 meanR:210.0700 rate:0.2500 gloss:0.0002 dloss:175.1837 dlossQ:174.5860 dlossQsigm:0.5978
Episode:1161 meanR:211.3600 rate:0.7420 gloss:0.0000 dloss:199.7994 dlossQ:199.120

Episode:1233 meanR:254.4100 rate:0.1960 gloss:0.0000 dloss:108.6458 dlossQ:108.1433 dlossQsigm:0.5025
Episode:1234 meanR:258.1600 rate:1.0000 gloss:0.0002 dloss:163.9279 dlossQ:163.3121 dlossQsigm:0.6158
Episode:1235 meanR:257.2500 rate:0.2160 gloss:0.0000 dloss:144.7393 dlossQ:144.1603 dlossQsigm:0.5790
Episode:1236 meanR:261.1400 rate:1.0000 gloss:0.0000 dloss:151.8867 dlossQ:151.2936 dlossQsigm:0.5930
Episode:1237 meanR:260.2400 rate:0.2300 gloss:0.0000 dloss:144.9955 dlossQ:144.4159 dlossQsigm:0.5796
Episode:1238 meanR:257.8700 rate:0.5260 gloss:0.0004 dloss:153.2549 dlossQ:152.6593 dlossQsigm:0.5957
Episode:1239 meanR:255.3300 rate:0.1680 gloss:0.0000 dloss:137.8050 dlossQ:137.2398 dlossQsigm:0.5652
Episode:1240 meanR:253.6000 rate:0.2060 gloss:0.0000 dloss:128.5096 dlossQ:127.9637 dlossQsigm:0.5460
Episode:1241 meanR:249.6300 rate:0.2060 gloss:0.0000 dloss:120.6713 dlossQ:120.1420 dlossQsigm:0.5292
Episode:1242 meanR:249.7600 rate:0.1860 gloss:0.0002 dloss:113.3280 dlossQ:112.815

Episode:1314 meanR:168.8100 rate:0.2240 gloss:0.0000 dloss:96.6398 dlossQ:96.1656 dlossQsigm:0.4742
Episode:1315 meanR:168.5000 rate:0.2360 gloss:0.0000 dloss:73.1864 dlossQ:72.7844 dlossQsigm:0.4020
Episode:1316 meanR:171.8300 rate:1.0000 gloss:0.0000 dloss:146.1535 dlossQ:145.5717 dlossQsigm:0.5818
Episode:1317 meanR:171.5900 rate:0.1520 gloss:0.0000 dloss:131.0112 dlossQ:130.4601 dlossQsigm:0.5512
Episode:1318 meanR:171.8100 rate:0.2600 gloss:0.0000 dloss:133.1328 dlossQ:132.5880 dlossQsigm:0.5448
Episode:1319 meanR:171.6000 rate:0.1660 gloss:0.0000 dloss:164.5894 dlossQ:163.9890 dlossQsigm:0.6005
Episode:1320 meanR:171.9400 rate:0.3320 gloss:0.0000 dloss:149.3546 dlossQ:148.7665 dlossQsigm:0.5881
Episode:1321 meanR:168.1000 rate:0.2320 gloss:0.0091 dloss:153.6682 dlossQ:153.0992 dlossQsigm:0.5689
Episode:1322 meanR:168.4900 rate:0.2720 gloss:0.0000 dloss:133.7682 dlossQ:133.2113 dlossQsigm:0.5569
Episode:1323 meanR:166.3600 rate:0.1680 gloss:0.0000 dloss:124.6092 dlossQ:124.0715 dl

Episode:1395 meanR:223.6400 rate:0.5000 gloss:0.0000 dloss:164.8125 dlossQ:164.1951 dlossQsigm:0.6175
Episode:1396 meanR:224.0200 rate:0.2580 gloss:0.0000 dloss:152.4893 dlossQ:151.8951 dlossQsigm:0.5942
Episode:1397 meanR:222.9200 rate:0.2280 gloss:0.0000 dloss:141.6074 dlossQ:141.0346 dlossQsigm:0.5728
Episode:1398 meanR:226.6200 rate:1.0000 gloss:0.0005 dloss:176.1871 dlossQ:175.5488 dlossQsigm:0.6382
Episode:1399 meanR:227.0700 rate:0.2080 gloss:0.0000 dloss:158.5345 dlossQ:157.9287 dlossQsigm:0.6057
Episode:1400 meanR:227.0700 rate:1.0000 gloss:0.0003 dloss:187.1898 dlossQ:186.5322 dlossQsigm:0.6576
Episode:1401 meanR:228.7600 rate:0.7900 gloss:0.0000 dloss:206.3619 dlossQ:205.6718 dlossQsigm:0.6902
Episode:1402 meanR:228.9700 rate:0.1560 gloss:0.0000 dloss:178.8992 dlossQ:178.2562 dlossQsigm:0.6431
Episode:1403 meanR:233.2600 rate:1.0000 gloss:0.0039 dloss:241.4452 dlossQ:240.6993 dlossQsigm:0.7459
Episode:1404 meanR:237.3400 rate:1.0000 gloss:0.0000 dloss:198.8836 dlossQ:198.205

Episode:1476 meanR:250.9700 rate:0.1680 gloss:0.0000 dloss:157.4353 dlossQ:156.8318 dlossQsigm:0.6035
Episode:1477 meanR:249.5600 rate:0.2100 gloss:0.0000 dloss:141.9666 dlossQ:141.3934 dlossQsigm:0.5733
Episode:1478 meanR:245.3100 rate:0.1260 gloss:0.0000 dloss:127.2456 dlossQ:126.7023 dlossQsigm:0.5433
Episode:1479 meanR:241.2100 rate:0.1800 gloss:0.0000 dloss:87.4920 dlossQ:87.0406 dlossQsigm:0.4514
Episode:1480 meanR:240.8900 rate:0.1260 gloss:0.0000 dloss:96.4634 dlossQ:96.0261 dlossQsigm:0.4373
Episode:1481 meanR:244.6100 rate:1.0000 gloss:0.3146 dloss:127.1069 dlossQ:126.5643 dlossQsigm:0.5426
Episode:1482 meanR:243.5400 rate:0.4180 gloss:0.0000 dloss:132.1916 dlossQ:131.6379 dlossQsigm:0.5536
Episode:1483 meanR:241.1200 rate:0.5160 gloss:0.0000 dloss:141.7778 dlossQ:141.2046 dlossQsigm:0.5732
Episode:1484 meanR:241.1200 rate:1.0000 gloss:0.0000 dloss:178.5196 dlossQ:177.8772 dlossQsigm:0.6424
Episode:1485 meanR:237.9500 rate:0.2380 gloss:0.0000 dloss:163.6515 dlossQ:163.0434 dl

Episode:1557 meanR:239.2200 rate:0.4700 gloss:0.0001 dloss:189.9317 dlossQ:189.2699 dlossQsigm:0.6618
Episode:1558 meanR:239.3400 rate:0.1860 gloss:0.0000 dloss:167.8440 dlossQ:167.2209 dlossQsigm:0.6231
Episode:1559 meanR:241.1900 rate:0.5940 gloss:0.0000 dloss:171.9298 dlossQ:171.2992 dlossQsigm:0.6305
Episode:1560 meanR:240.4000 rate:0.6220 gloss:0.0004 dloss:177.1414 dlossQ:176.5016 dlossQsigm:0.6399
Episode:1561 meanR:240.8700 rate:0.4760 gloss:0.0000 dloss:172.7108 dlossQ:172.0789 dlossQsigm:0.6320
Episode:1562 meanR:240.6200 rate:0.1940 gloss:0.0001 dloss:155.6037 dlossQ:155.0036 dlossQsigm:0.6002
Episode:1563 meanR:240.4600 rate:0.5920 gloss:0.0000 dloss:163.1930 dlossQ:162.5786 dlossQsigm:0.6145
Episode:1564 meanR:242.5200 rate:0.5880 gloss:0.0000 dloss:172.1257 dlossQ:171.4948 dlossQsigm:0.6309
Episode:1565 meanR:244.7000 rate:0.7720 gloss:0.0000 dloss:176.6714 dlossQ:176.0502 dlossQsigm:0.6213
Episode:1566 meanR:245.6900 rate:0.3620 gloss:0.0000 dloss:161.6077 dlossQ:160.996

## Visualizing training

Below I'll plot the total reward for each episode in grey, with a 10-episode rolling average drawn on top in blue (matplotlib's default line color).

In [None]:
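import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def running_mean(x, N):
    # Rolling mean over a window of N values, computed from cumulative sums.
    # Returns len(x) - N + 1 values.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N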
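
To see what the cumulative-sum trick in `running_mean` does, here is a small worked example. The `rewards` array is made-up data for illustration only, and the cross-check with `np.convolve` is just one way to confirm the result.

```python
import numpy as np

def running_mean(x, N):
    # Same cumulative-sum trick as the cell above.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Made-up rewards, purely for illustration.
rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = running_mean(rewards, 2)
print(smoothed)       # [1.5 2.5 3.5 4.5]
print(len(smoothed))  # 4, i.e. len(rewards) - N + 1

# Cross-check against a direct sliding-window average.
direct = np.convolve(rewards, np.ones(2) / 2, mode='valid')
assert np.allclose(smoothed, direct)
```

Because the smoothed array is shorter than the input by `N - 1` values, the plotting cells line it up with the last `len(smoothed_arr)` episodes.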

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(g_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(d_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

In [None]:
eps, arr = np.array(d_lossR_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses R')

In [None]:
eps, arr = np.array(d_lossQ_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses Q')
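
The plotting cells above all follow the same pattern, so they could be folded into one helper. This is only a sketch: `smoothed_series` and `plot_metric` are hypothetical names, and it assumes each list holds `(episode, value)` pairs, which is what the `np.array(...).T` unpacking above implies.

```python
import numpy as np

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

def smoothed_series(pairs, window=10):
    """Split (episode, value) pairs into raw and smoothed series."""
    eps, arr = np.array(pairs, dtype=float).T
    smoothed = running_mean(arr, window)
    # Align the smoothed curve with the last len(smoothed) episodes.
    # Slicing from an explicit start index also handles an empty result,
    # where eps[-0:] would wrongly return every episode.
    smoothed_eps = eps[len(eps) - len(smoothed):]
    return eps, arr, smoothed_eps, smoothed

def plot_metric(pairs, ylabel, window=10):
    """Plot the raw series in grey and its rolling mean on top."""
    import matplotlib.pyplot as plt  # imported here so smoothed_series works without it
    eps, arr, smoothed_eps, smoothed = smoothed_series(pairs, window)
    plt.figure()
    plt.plot(smoothed_eps, smoothed)
    plt.plot(eps, arr, color='grey', alpha=0.3)
    plt.xlabel('Episode')
    plt.ylabel(ylabel)
```

With this in place, each plot becomes a single call such as `plot_metric(rewards_list, 'Total rewards')` or `plot_metric(g_loss_list, 'G losses')`.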

## Testing

Let's check out how our trained agent plays the game.

In [34]:
import gym
# env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Play 100 evaluation episodes
    for _ in range(100):
        state = env.reset()
        total_reward = 0

        # Step until the environment signals the end of the episode
        while True:
            env.render()
            # Add a batch dimension, then act greedily on the predicted logits
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: np.reshape(state, [1, -1])})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        print('total_reward: {}'.format(total_reward))

# Closing the env
env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
total_reward: 500.0
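
The test loop above boils down to: reshape the state into a batch of one, ask the network for action logits, and take the `argmax`. The sketch below isolates that logic so it runs without Gym or a checkpoint. `ToyEnv`, `greedy_action`, `run_episode`, and `fake_logits` are hypothetical stand-ins, not part of this notebook's model or of Gym.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in for a Gym environment (4-tuple step API)."""
    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        self.t += 1
        state = np.full(4, self.t, dtype=np.float32)
        done = self.t >= self.horizon
        return state, 1.0, done, {}

def greedy_action(logits):
    """Pick the highest-scoring action from a (1, num_actions) array."""
    return int(np.argmax(logits))

def run_episode(env, logits_fn):
    """Play one episode greedily and return the total reward."""
    state = env.reset()
    total_reward = 0.0
    while True:
        # The network expects a batch, so add a leading batch dimension.
        logits = logits_fn(np.reshape(state, [1, -1]))
        state, reward, done, _ = env.step(greedy_action(logits))
        total_reward += reward
        if done:
            return total_reward

# Stand-in for sess.run(model.actions_logits, ...): always prefers action 1.
fake_logits = lambda batch: np.array([[0.1, 0.9]])
print(run_episode(ToyEnv(horizon=5), fake_logits))  # 5.0
```

The same reading applies to the output above: CartPole-v1 gives a reward of 1 per step and ends an episode after 500 steps, so `500.0` is the maximum score for an episode.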


## Extending this

So, Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of a small state vector like the one we're using here, though, you'd feed in the screen images and use convolutional layers to extract the state from them.

![Deep Q-Learning Atari](assets/atari-network.png)
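
Before screen images reach the convolutional layers, they are usually shrunk and stacked so the network can see motion. The following is a minimal NumPy sketch of that idea, not the exact preprocessing from the DQN paper: `preprocess_frame` and `FrameStack` are hypothetical names, the downsampling is a plain stride of 2, and the 210x160 frame size is assumed to match typical Gym Atari screens.

```python
from collections import deque
import numpy as np

def preprocess_frame(frame):
    """Turn an RGB uint8 frame (H, W, 3) into a half-resolution grayscale image."""
    # Weighted sum of the colour channels (standard luminance weights).
    gray = frame.astype(np.float32) @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    # Keep every second row and column, then scale to roughly [0, 1].
    return gray[::2, ::2] / 255.0

class FrameStack:
    """Keep the most recent frames so the state carries motion information."""
    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, frame):
        # Fill the stack with copies of the first frame of the episode.
        processed = preprocess_frame(frame)
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)
        return self.state()

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(preprocess_frame(frame))
        return self.state()

    def state(self):
        # Shape (H/2, W/2, num_frames), ready for a convolutional layer.
        return np.stack(list(self.frames), axis=-1)

# A random frame stands in for a real game screen.
frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
stack = FrameStack(num_frames=4)
print(stack.reset(frame).shape)  # (105, 80, 4)
```

The stacked array would then take the place of the 4-number Cart-Pole state as the network input.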

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper (Mnih et al., "Human-level control through deep reinforcement learning", Nature, 2015), which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.