# Policy gradients


In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
# Check which TensorFlow version is installed and whether a GPU is available
# (TensorFlow falls back to the CPU when no GPU is found)
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included `gym` as a submodule, so you can run `git submodule update --init --recursive` to pull the contents into the `gym` directory.

>**Note:** Once OpenAI Gym is cloned, install it by running `pip install -e gym/[all]`.

In [2]:
import gym

## Create the Cart-Pole game environment
# env = gym.make('CartPole-v0')  # episodes are capped at 200 steps
env = gym.make('CartPole-v1')    # episodes are capped at 500 steps (used below)
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('MountainCarContinuous-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing an action as an integer to `env.step` advances the simulation by one step. You can see how many actions are possible from `env.action_space`, and you can get a random action with `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [3]:
env.reset()
batch = []
for _ in range(1111):
    #env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    batch.append([action, state, reward, done, info])
    #print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

To shut the window showing the simulation, use `env.close()`.

In [4]:
# env.close()

If you ran the simulation above, we can look at the data it collected, including the rewards:

In [5]:
# Compare the shape of a stored state with the shape of the latest state
batch[0][1].shape, state.shape

((4,), (4,))

In [6]:
import numpy as np
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
infos = np.array([each[4] for each in batch])

In [7]:
# print(rewards[-20:])
print('shapes:', np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print('dtypes:', np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('states:', np.max(np.array(states)), np.min(np.array(states)))
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print('rewards:', np.max(np.array(rewards)), np.min(np.array(rewards)))

shapes: (1111,) (1111, 4) (1111,) (1111,)
dtypes: float64 float64 int64 bool
states: 2.440401649885929 -2.9888261858931062
actions: 1 0
rewards: 1.0 1.0


In [8]:
actions[:10]

array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1])

In [9]:
rewards[:10]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [10]:
# import numpy as np
def sigmoid(x, derivative=False):
    # With derivative=True, x is expected to already be a sigmoid output s, giving s * (1 - s)
    return x * (1 - x) if derivative else 1 / (1 + np.exp(-x))

In [11]:
sigmoid(np.max(np.array(rewards))), sigmoid(np.min(np.array(rewards)))

(0.7310585786300049, 0.7310585786300049)

In [12]:
print('rewards:', np.max(np.array(rewards))/100, np.min(np.array(rewards))/100)

rewards: 0.01 0.01


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.
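
Because every step yields a reward of 1.0, an episode's total reward is simply its length. The following is a minimal sketch of how per-episode totals could be recovered from flat `rewards` and `dones` arrays like the ones collected above. The helper name `episode_returns` and the sample arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def episode_returns(rewards, dones):
    """Sum rewards per episode; an episode ends at each step where done is True.

    Steps after the last done flag belong to an unfinished episode and are ignored.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)
    totals, start = [], 0
    for end in np.flatnonzero(dones):        # index of the final step of each episode
        totals.append(rewards[start:end + 1].sum())
        start = end + 1
    return np.array(totals)

# Hypothetical data: three finished episodes (lengths 3, 2, 4) and one unfinished step.
rewards = np.ones(10)
dones = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0], dtype=bool)
returns = episode_returns(rewards, dones)
print(returns)
```

With the arrays from the random rollout, the same helper would give one total per finished episode, which is the quantity the agent is trying to increase.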

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
$$

where $s$ is a state, $a$ is an action, $r$ is the reward, $\gamma$ is the discount factor, and $s'$ is the next state reached from state $s$ by taking action $a$.
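
To make the right-hand side of the equation concrete, here is a small worked sketch of computing the target for a single transition. The function name `td_target`, the value of `gamma`, and the sample Q-values are assumptions chosen for illustration. Dropping the bootstrap term when the episode has ended is a common convention that the equation above does not spell out.

```python
import numpy as np

def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Return r + gamma * max over a' of Q(s', a').

    If the episode ended at this step there is no next state, so only r is returned
    (an assumed convention, see the lead-in).
    """
    if done:
        return float(reward)
    return float(reward + gamma * np.max(next_q_values))

next_q = np.array([1.5, 2.0])    # hypothetical Q(s', a') for actions 0 and 1
target = td_target(reward=1.0, next_q_values=next_q, gamma=0.9)   # 1.0 + 0.9 * 2.0
terminal_target = td_target(reward=1.0, next_q_values=next_q, gamma=0.9, done=True)
print(target, terminal_target)
```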

Before, we used this equation to learn values for a Q-_table_. However, for this game there is a huge number of possible states. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so ignoring floating point precision, you have practically infinite states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q-value, $Q(s, a)$, is calculated by passing a state into the network, which is built from fully connected hidden layers. The output is one Q-value for each available action.
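
As a rough illustration of that mapping (a state in, one value per action out), here is a NumPy-only forward pass with randomly initialised weights. It is a sketch under assumptions: the helper names, the weight scale, and the sample states are made up, the leaky ReLU mirrors the `tf.maximum(alpha * x, x)` pattern used in this notebook's code, and batch normalization is left out for brevity.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: pass positive values through, scale negative values by alpha."""
    return np.maximum(alpha * x, x)

def init_params(state_size, hidden_size, action_size, seed=0):
    """Small random weights and zero biases for two hidden layers plus an output layer."""
    rng = np.random.RandomState(seed)
    sizes = [state_size, hidden_size, hidden_size, action_size]
    return [(0.1 * rng.randn(n_in, n_out), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def q_values(states, params):
    """Map a batch of states, shape (batch, state_size), to shape (batch, action_size)."""
    h = states
    for weights, bias in params[:-1]:
        h = leaky_relu(h @ weights + bias)
    weights, bias = params[-1]
    return h @ weights + bias        # linear output layer: one value per action

params = init_params(state_size=4, hidden_size=64, action_size=2)
states = np.random.RandomState(1).uniform(-1, 1, size=(3, 4))   # three made-up states
q = q_values(states, params)
greedy_actions = np.argmax(q, axis=1)    # index of the highest-valued action per state
print(q.shape, greedy_actions)
```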

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.
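
The target-then-loss procedure described above can be sketched for a small batch in plain NumPy. This is an illustration only: the function name `q_learning_loss`, the `dones` mask that switches off bootstrapping on terminal steps, and all sample numbers are assumptions. In a real training step the targets would be treated as constants while the optimizer adjusts the network that produced `q_values`.

```python
import numpy as np

def q_learning_loss(q_values, actions, rewards, next_q_values, dones, gamma=0.99):
    """Mean squared error between targets r + gamma * max Q(s', a') and Q(s, a)."""
    q_taken = q_values[np.arange(len(actions)), actions]          # Q(s, a) for the actions taken
    bootstrap = np.max(next_q_values, axis=1) * (1.0 - dones)     # zero on terminal steps
    targets = rewards + gamma * bootstrap
    loss = np.mean((targets - q_taken) ** 2)
    return loss, targets

# Two made-up transitions; the second one ends its episode.
q_vals = np.array([[1.0, 2.0], [0.5, 0.0]])
actions = np.array([1, 0])
rewards = np.array([1.0, 1.0])
next_q = np.array([[2.0, 3.0], [4.0, 1.0]])
dones = np.array([0.0, 1.0])

loss, targets = q_learning_loss(q_vals, actions, rewards, next_q, dones, gamma=0.5)
print(loss, targets)
```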

Below is my implementation of the Q-network. I used two fully connected hidden layers, each followed by batch normalization and a leaky ReLU activation. Two seems to be good enough; three might be better. Feel free to try it out.

In [13]:
# Data of the model
def model_input(state_size):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    reward = tf.placeholder(tf.float32, [], name='reward')
    return states, actions, targetQs, reward

In [14]:
# Generator/Controller: Generating/predicting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits

In [15]:
# Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [16]:
def model_loss(action_size, hidden_size, states, actions, targetQs, reward):
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                      labels=actions_labels)
    
    Qs_logits = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    d_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs_logits,
                                                                    labels=reward*tf.ones_like(Qs_logits)))
    g_lossQ = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs_logits[:-1],
                                                                     labels=tf.reshape(tf.nn.sigmoid(targetQs[1:]), 
                                                                         shape=[-1, 1])))
    #g_lossP = tf.reduce_mean(neg_log_prob_actions * tf.nn.sigmoid(targetQs))
    #g_lossP = tf.reduce_mean(neg_log_prob_actions * targetQs)
    g_lossP = tf.reduce_mean(neg_log_prob_actions[:-1] * targetQs[1:])
    #g_loss = g_lossQ + g_lossP
    g_loss = g_lossP
    d_loss += g_lossQ

    return actions_logits, Qs_logits, g_loss, d_loss, g_lossQ, g_lossP
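
The `g_lossP` term in the cell above multiplies each step's negative log-probability of the chosen action (the softmax cross-entropy against a one-hot label) by a per-step weight. Here is a NumPy sketch of that general pattern, with the weights left as generic per-step numbers. The function names and sample values are assumptions, and the sketch does not reproduce the one-step shift (`[:-1]` against `[1:]`) used in the notebook.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax along the action axis."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))

def policy_gradient_loss(logits, actions, weights):
    """Mean of weight * (-log pi(a|s)) over the steps in the batch."""
    log_probs = log_softmax(logits)
    neg_log_prob = -log_probs[np.arange(len(actions)), actions]
    return np.mean(neg_log_prob * weights), neg_log_prob

logits = np.array([[0.0, 0.0], [2.0, 0.0]])   # made-up action logits for two steps
actions = np.array([1, 0])                    # actions that were taken
weights = np.array([1.0, 0.5])                # hypothetical per-step weights

pg_loss, neg_log_prob = policy_gradient_loss(logits, actions, weights)
print(pg_loss, neg_log_prob)
```

Minimizing this quantity raises the probability of actions that carry large positive weights, which is why the choice of weight matters so much in practice.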

In [17]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    """
    Get optimization operations in order
    :param g_loss: Generator loss Tensor for action prediction
    :param d_loss: Discriminator loss Tensor for reward prediction for generated/prob/logits action
    :param learning_rate: Learning rate (a float) for the Adam optimizers
    :return: A tuple of (generator training operation, discriminator training operation)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [18]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, self.reward = model_input(state_size=state_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_loss, self.g_lossQ, self.g_lossP = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, # model input
            targetQs=self.targetQs, reward=self.reward) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

## Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're also tuning the simulation.

In [19]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1111, 4) actions:(1111,)
action size:2


In [20]:
# Training parameters
# Network parameters
state_size = 4               # number of units for the input state/observation -- simulation
action_size = 2              # number of units for the output actions -- simulation
hidden_size = 64             # number of units in each Q-network hidden layer -- simulation
learning_rate = 0.001          # learning rate for adam

In [21]:
# Reset/init the graph/session
tf.reset_default_graph()  # returns None, so fetch the fresh default graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

## Training the model

Below we'll train our agent. If you want to watch it train, add an `env.render()` call inside the episode loop. This is slow because rendering the frames takes longer than training the network. But it's cool to watch the agent get better at the game.
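
The training loop below stops once the average total reward over the most recent episodes reaches 500, using a `deque` with a fixed `maxlen` as the sliding window. The stopping rule on its own can be sketched as follows. The helper name `solved` and the sample episode totals are assumptions, and the window is shortened to 3 entries (the notebook uses 100) to keep the example small.

```python
from collections import deque

def solved(recent_totals, threshold=500.0):
    """True once the window is non-empty and its mean reaches the threshold."""
    if len(recent_totals) == 0:
        return False
    return sum(recent_totals) / len(recent_totals) >= threshold

window = deque(maxlen=3)    # oldest entries are dropped automatically once the window is full
history = []
for total in [100.0, 500.0, 500.0, 500.0]:    # made-up episode totals
    window.append(total)
    history.append(solved(window))
print(history)
```

Note that before the window fills up, the mean is taken over fewer episodes, so early averages are noisier than later ones.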

In [22]:
# import gym

# ## Create the Cart-Pole game environment
# env = gym.make('CartPole-v0')
# env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# # env = gym.make('MountainCar-v0')
# # env = gym.make('Pendulum-v0')
# # env = gym.make('Blackjack-v0')
# # env = gym.make('FrozenLake-v0')
# # env = gym.make('AirRaid-ram-v0')
# # env = gym.make('AirRaid-v0')
# # env = gym.make('BipedalWalker-v2')
# # env = gym.make('Copy-v0')
# # env = gym.make('CarRacing-v0')
# # env = gym.make('Ant-v2') #mujoco
# # env = gym.make('FetchPickAndPlace-v1') # mujoco required!

In [None]:
from collections import deque
episodes_total_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
saver = tf.train.Saver()
rewards_list, g_loss_list, d_loss_list = [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    for ep in range(11111):
        batch = [] # every data batch
        total_reward = 0
        state = env.reset() # env first state

        # Training steps/batches
        while True:
            action_logits, Q_logits = sess.run(fetches=[model.actions_logits, model.Qs_logits], 
                                               feed_dict={model.states: np.reshape(state, [1, -1])})
            action = np.argmax(action_logits)
            batch.append([state, action, Q_logits])
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done: # episode ended success/failure
                episodes_total_reward.append(total_reward) # stopping criteria
                rate = total_reward/ 500 # success is 500 points
                break

        # Training using batches
        #batch = memory.buffer
        states = np.array([each[0] for each in batch])
        actions = np.array([each[1] for each in batch])
        targetQs = np.array([each[2] for each in batch])
        g_lossQ, g_lossP, g_loss, d_loss, _, _ = sess.run([model.g_lossQ, model.g_lossP, 
                                                           model.g_loss, model.d_loss, 
                                                           model.g_opt, model.d_opt],
                                                          feed_dict = {model.states: states, 
                                                                       model.actions: actions,
                                                                       model.reward: rate,
                                                                       model.targetQs: targetQs.reshape([-1])})
        # Average 100 episode total reward
        # Print out
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episodes_total_reward)),
              'glossQ:{:.4f}'.format(g_lossQ),
              'glossP:{:.4f}'.format(g_lossP),
              'gloss:{:.4f}'.format(g_loss),
              'dloss:{:.4f}'.format(d_loss))
        # Plotting out
        rewards_list.append([ep, np.mean(episodes_total_reward)])
        g_loss_list.append([ep, g_loss])
        d_loss_list.append([ep, d_loss])
        # Break episode/epoch loop
        if np.mean(episodes_total_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:52.0000 glossQ:0.6928 glossP:0.0546 gloss:0.0546 dloss:1.4188
Episode:1 meanR:49.5000 glossQ:0.6932 glossP:0.0358 gloss:0.0358 dloss:1.4081
Episode:2 meanR:41.0000 glossQ:0.6925 glossP:0.0600 gloss:0.0600 dloss:1.4244
Episode:3 meanR:36.2500 glossQ:0.6931 glossP:0.0349 gloss:0.0349 dloss:1.4080
Episode:4 meanR:31.0000 glossQ:0.6923 glossP:0.0578 gloss:0.0578 dloss:1.4197
Episode:5 meanR:27.5000 glossQ:0.6920 glossP:0.0445 gloss:0.0445 dloss:1.4104
Episode:6 meanR:25.1429 glossQ:0.6926 glossP:-0.0029 gloss:-0.0029 dloss:1.3799
Episode:7 meanR:23.1250 glossQ:0.6917 glossP:-0.0613 gloss:-0.0613 dloss:1.3349
Episode:8 meanR:21.5556 glossQ:0.6890 glossP:-0.1059 gloss:-0.1059 dloss:1.3006
Episode:9 meanR:20.5000 glossQ:0.6842 glossP:-0.1593 gloss:-0.1593 dloss:1.2629
Episode:10 meanR:19.5455 glossQ:0.6785 glossP:-0.2131 gloss:-0.2131 dloss:1.2295
Episode:11 meanR:18.8333 glossQ:0.6719 glossP:-0.2757 gloss:-0.2757 dloss:1.1974
Episode:12 meanR:18.6154 glossQ:0.6795 glossP:-0.2

Episode:102 meanR:40.5700 glossQ:0.3421 glossP:-1.4682 gloss:-1.4682 dloss:1.8704
Episode:103 meanR:42.1300 glossQ:0.3621 glossP:-1.4007 gloss:-1.4007 dloss:1.2129
Episode:104 meanR:43.1000 glossQ:0.3413 glossP:-1.4748 gloss:-1.4748 dloss:0.9163
Episode:105 meanR:43.1300 glossQ:0.3372 glossP:-1.4885 gloss:-1.4885 dloss:0.5093
Episode:106 meanR:43.1400 glossQ:0.3056 glossP:-1.6062 gloss:-1.6062 dloss:0.4652
Episode:107 meanR:43.1500 glossQ:0.2955 glossP:-1.6554 gloss:-1.6554 dloss:0.4469
Episode:108 meanR:43.1700 glossQ:0.2855 glossP:-1.6976 gloss:-1.6976 dloss:0.4386
Episode:109 meanR:43.1700 glossQ:0.3028 glossP:-1.6277 gloss:-1.6277 dloss:0.4613
Episode:110 meanR:43.6400 glossQ:0.4126 glossP:-1.2324 gloss:-1.2324 dloss:0.7751
Episode:111 meanR:48.5300 glossQ:0.4568 glossP:-1.0940 gloss:-1.0940 dloss:2.2339
Episode:112 meanR:49.1400 glossQ:0.4191 glossP:-1.2226 gloss:-1.2226 dloss:0.8590
Episode:113 meanR:50.2000 glossQ:0.4444 glossP:-1.1351 gloss:-1.1351 dloss:1.0440

Episode:207 meanR:33.3800 glossQ:0.1980 glossP:-2.0377 gloss:-2.0377 dloss:0.6511
Episode:208 meanR:33.6200 glossQ:0.2020 glossP:-2.0086 gloss:-2.0086 dloss:0.4592
Episode:209 meanR:33.9800 glossQ:0.2153 glossP:-1.9396 gloss:-1.9396 dloss:0.5388
Episode:210 meanR:33.8200 glossQ:0.2150 glossP:-1.9407 gloss:-1.9407 dloss:0.5046
Episode:211 meanR:29.1100 glossQ:0.2163 glossP:-1.9365 gloss:-1.9365 dloss:0.4379
Episode:212 meanR:28.7100 glossQ:0.2205 glossP:-1.9182 gloss:-1.9182 dloss:0.4868
Episode:213 meanR:27.7700 glossQ:0.2235 glossP:-1.9030 gloss:-1.9030 dloss:0.4673
Episode:214 meanR:27.6500 glossQ:0.2222 glossP:-1.9100 gloss:-1.9100 dloss:0.4050
Episode:215 meanR:27.6800 glossQ:0.2333 glossP:-1.8590 gloss:-1.8590 dloss:0.5257
Episode:216 meanR:27.5700 glossQ:0.2199 glossP:-1.9280 gloss:-1.9280 dloss:0.4362
Episode:217 meanR:27.4600 glossQ:0.2271 glossP:-1.8887 gloss:-1.8887 dloss:0.4380
Episode:218 meanR:27.2400 glossQ:0.2041 glossP:-2.0064 gloss:-2.0064 dloss:0.3454

Episode:308 meanR:59.7800 glossQ:0.4041 glossP:-1.3433 gloss:-1.3433 dloss:0.7226
Episode:309 meanR:59.8400 glossQ:0.5118 glossP:-0.9325 gloss:-0.9325 dloss:0.8873
Episode:310 meanR:59.8500 glossQ:0.5180 glossP:-0.9106 gloss:-0.9106 dloss:0.8675
Episode:311 meanR:59.8000 glossQ:0.5259 glossP:-0.8800 gloss:-0.8800 dloss:0.8337
Episode:312 meanR:59.6800 glossQ:0.5188 glossP:-0.9021 gloss:-0.9021 dloss:0.8240
Episode:313 meanR:59.5600 glossQ:0.5145 glossP:-0.9163 gloss:-0.9163 dloss:0.8059
Episode:314 meanR:59.5100 glossQ:0.5137 glossP:-0.9194 gloss:-0.9194 dloss:0.7942
Episode:315 meanR:59.2700 glossQ:0.5049 glossP:-0.9479 gloss:-0.9479 dloss:0.7812
Episode:316 meanR:59.1900 glossQ:0.4946 glossP:-0.9795 gloss:-0.9795 dloss:0.7690
Episode:317 meanR:59.0700 glossQ:0.4868 glossP:-1.0043 gloss:-1.0043 dloss:0.7421
Episode:318 meanR:59.0700 glossQ:0.4764 glossP:-1.0371 gloss:-1.0371 dloss:0.7244
Episode:319 meanR:59.1200 glossQ:0.4654 glossP:-1.0714 gloss:-1.0714 dloss:0.7097

Episode:413 meanR:34.1300 glossQ:0.2688 glossP:-1.7428 gloss:-1.7428 dloss:0.4687
Episode:414 meanR:34.2100 glossQ:0.2615 glossP:-1.7848 gloss:-1.7848 dloss:0.4676
Episode:415 meanR:34.1700 glossQ:0.2069 glossP:-2.0310 gloss:-2.0310 dloss:0.3457
Episode:416 meanR:34.1600 glossQ:0.2427 glossP:-1.8664 gloss:-1.8664 dloss:0.4148
Episode:417 meanR:34.1500 glossQ:0.2382 glossP:-1.8884 gloss:-1.8884 dloss:0.3828
Episode:418 meanR:34.9200 glossQ:0.2613 glossP:-1.7681 gloss:-1.7681 dloss:0.8066
Episode:419 meanR:35.3100 glossQ:0.2411 glossP:-1.8664 gloss:-1.8664 dloss:0.6056
Episode:420 meanR:35.4000 glossQ:0.2529 glossP:-1.8079 gloss:-1.8079 dloss:0.4347
Episode:421 meanR:35.4000 glossQ:0.1968 glossP:-2.0821 gloss:-2.0821 dloss:0.3627
Episode:422 meanR:35.4000 glossQ:0.1624 glossP:-2.2793 gloss:-2.2793 dloss:0.2894
Episode:423 meanR:35.4300 glossQ:0.1705 glossP:-2.2190 gloss:-2.2190 dloss:0.3161
Episode:424 meanR:35.4300 glossQ:0.1642 glossP:-2.2605 gloss:-2.2605 dloss:0.2785

Episode:515 meanR:37.1200 glossQ:0.2975 glossP:-1.6334 gloss:-1.6334 dloss:0.7387
Episode:516 meanR:37.1400 glossQ:0.2800 glossP:-1.7014 gloss:-1.7014 dloss:0.4675
Episode:517 meanR:37.1300 glossQ:0.1875 glossP:-2.1361 gloss:-2.1361 dloss:0.3186
Episode:518 meanR:36.9300 glossQ:0.2987 glossP:-1.6298 gloss:-1.6298 dloss:0.7303
Episode:519 meanR:37.4900 glossQ:0.2975 glossP:-1.6332 gloss:-1.6332 dloss:0.9134
Episode:520 meanR:38.7200 glossQ:0.3217 glossP:-1.5399 gloss:-1.5399 dloss:1.0659
Episode:521 meanR:43.5300 glossQ:0.3219 glossP:-1.5346 gloss:-1.5346 dloss:2.6399
Episode:522 meanR:46.0200 glossQ:0.3373 glossP:-1.4828 gloss:-1.4828 dloss:1.5707
Episode:523 meanR:46.4000 glossQ:0.2940 glossP:-1.6642 gloss:-1.6642 dloss:0.6460
Episode:524 meanR:46.8200 glossQ:0.3094 glossP:-1.6120 gloss:-1.6120 dloss:0.6563
Episode:525 meanR:46.8300 glossQ:0.3349 glossP:-1.4934 gloss:-1.4934 dloss:0.5116
Episode:526 meanR:46.5900 glossQ:0.2828 glossP:-1.6982 gloss:-1.6982 dloss:0.4195

Episode:621 meanR:64.4600 glossQ:0.2638 glossP:-1.7718 gloss:-1.7718 dloss:0.4437
Episode:622 meanR:62.0400 glossQ:0.2548 glossP:-1.8079 gloss:-1.8079 dloss:0.4330
Episode:623 meanR:61.6400 glossQ:0.2342 glossP:-1.8964 gloss:-1.8964 dloss:0.3776
Episode:624 meanR:61.3200 glossQ:0.2474 glossP:-1.8414 gloss:-1.8414 dloss:0.4297
Episode:625 meanR:61.3400 glossQ:0.2431 glossP:-1.8585 gloss:-1.8585 dloss:0.4038
Episode:626 meanR:61.4300 glossQ:0.2542 glossP:-1.8124 gloss:-1.8124 dloss:0.4222
Episode:627 meanR:61.7800 glossQ:0.2847 glossP:-1.6916 gloss:-1.6916 dloss:0.5865
Episode:628 meanR:63.1900 glossQ:0.3452 glossP:-1.4650 gloss:-1.4650 dloss:1.2066
Episode:629 meanR:63.5800 glossQ:0.3644 glossP:-1.4042 gloss:-1.4042 dloss:0.9015
Episode:630 meanR:62.1000 glossQ:0.3846 glossP:-1.3373 gloss:-1.3373 dloss:0.9075
Episode:631 meanR:60.6200 glossQ:0.3322 glossP:-1.5098 gloss:-1.5098 dloss:0.5545
Episode:632 meanR:59.4500 glossQ:0.3279 glossP:-1.5265 gloss:-1.5265 dloss:0.5141

Episode:721 meanR:137.0000 glossQ:0.6287 glossP:-0.5222 gloss:-0.5222 dloss:1.0906
Episode:722 meanR:137.2400 glossQ:0.6215 glossP:-0.5497 gloss:-0.5497 dloss:1.0696
Episode:723 meanR:137.5800 glossQ:0.6182 glossP:-0.5643 gloss:-0.5643 dloss:1.0680
Episode:724 meanR:138.0500 glossQ:0.6206 glossP:-0.5544 gloss:-0.5544 dloss:1.1047
Episode:725 meanR:138.9400 glossQ:0.6199 glossP:-0.5561 gloss:-0.5561 dloss:1.1646
Episode:726 meanR:138.9100 glossQ:0.5422 glossP:-0.8365 gloss:-0.8365 dloss:0.8542
Episode:727 meanR:138.6700 glossQ:0.5646 glossP:-0.7570 gloss:-0.7570 dloss:0.9117
Episode:728 meanR:137.0400 glossQ:0.5236 glossP:-0.8970 gloss:-0.8970 dloss:0.8169
Episode:729 meanR:137.9100 glossQ:0.6159 glossP:-0.5755 gloss:-0.5755 dloss:1.2915
Episode:730 meanR:137.2800 glossQ:0.5799 glossP:-0.7020 gloss:-0.7020 dloss:0.9760
Episode:731 meanR:137.4000 glossQ:0.5828 glossP:-0.6965 gloss:-0.6965 dloss:0.9783
Episode:732 meanR:137.6400 glossQ:0.5713 glossP:-0.7365 gloss:-0.7365 dloss:0.9655

Episode:821 meanR:85.7700 glossQ:0.5100 glossP:-0.9703 gloss:-0.9703 dloss:0.8277
Episode:822 meanR:85.5200 glossQ:0.4866 glossP:-1.0436 gloss:-1.0436 dloss:0.7531
Episode:823 meanR:85.2200 glossQ:0.4786 glossP:-1.0707 gloss:-1.0707 dloss:0.7362
Episode:824 meanR:84.7200 glossQ:0.4775 glossP:-1.0740 gloss:-1.0740 dloss:0.7344
Episode:825 meanR:83.8700 glossQ:0.4654 glossP:-1.1122 gloss:-1.1122 dloss:0.7236
Episode:826 meanR:83.9200 glossQ:0.4560 glossP:-1.1434 gloss:-1.1434 dloss:0.7065
Episode:827 meanR:83.9400 glossQ:0.4636 glossP:-1.1199 gloss:-1.1199 dloss:0.7237
Episode:828 meanR:84.0400 glossQ:0.4641 glossP:-1.1209 gloss:-1.1209 dloss:0.7277
Episode:829 meanR:82.4100 glossQ:0.4775 glossP:-1.0789 gloss:-1.0789 dloss:0.7543
Episode:830 meanR:82.6700 glossQ:0.4355 glossP:-1.2571 gloss:-1.2571 dloss:0.8443
Episode:831 meanR:82.8800 glossQ:0.4468 glossP:-1.2251 gloss:-1.2251 dloss:0.8446
Episode:832 meanR:82.7400 glossQ:0.4680 glossP:-1.1077 gloss:-1.1077 dloss:0.7496

Episode:921 meanR:65.0100 glossQ:0.4803 glossP:-1.0462 gloss:-1.0462 dloss:0.9389
Episode:922 meanR:65.1300 glossQ:0.4380 glossP:-1.1948 gloss:-1.1948 dloss:0.7171
Episode:923 meanR:65.1600 glossQ:0.3938 glossP:-1.3384 gloss:-1.3384 dloss:0.6187
Episode:924 meanR:65.1500 glossQ:0.3301 glossP:-1.5525 gloss:-1.5525 dloss:0.5186
Episode:925 meanR:65.0800 glossQ:0.2984 glossP:-1.6692 gloss:-1.6692 dloss:0.4662
Episode:926 meanR:65.0100 glossQ:0.2930 glossP:-1.6926 gloss:-1.6926 dloss:0.4559
Episode:927 meanR:64.9500 glossQ:0.2767 glossP:-1.7593 gloss:-1.7593 dloss:0.4479
Episode:928 meanR:64.8400 glossQ:0.2894 glossP:-1.7074 gloss:-1.7074 dloss:0.4470
Episode:929 meanR:64.7700 glossQ:0.3220 glossP:-1.5875 gloss:-1.5875 dloss:0.5122
Episode:930 meanR:64.4600 glossQ:0.3798 glossP:-1.3904 gloss:-1.3904 dloss:0.6423
Episode:931 meanR:64.1800 glossQ:0.3995 glossP:-1.3251 gloss:-1.3251 dloss:0.6651
Episode:932 meanR:64.5700 glossQ:0.4278 glossP:-1.2263 gloss:-1.2263 dloss:0.8282

Episode:1022 meanR:108.0200 glossQ:0.3132 glossP:-1.6261 gloss:-1.6261 dloss:0.4941
Episode:1023 meanR:107.9400 glossQ:0.3123 glossP:-1.6315 gloss:-1.6315 dloss:0.4932
Episode:1024 meanR:107.9200 glossQ:0.3024 glossP:-1.6676 gloss:-1.6676 dloss:0.4864
Episode:1025 meanR:107.9100 glossQ:0.2820 glossP:-1.7561 gloss:-1.7561 dloss:0.4538
Episode:1026 meanR:107.9200 glossQ:0.2911 glossP:-1.7203 gloss:-1.7203 dloss:0.4689
Episode:1027 meanR:108.1200 glossQ:0.4252 glossP:-1.2340 gloss:-1.2340 dloss:0.7289
Episode:1028 meanR:108.1100 glossQ:0.3335 glossP:-1.5719 gloss:-1.5719 dloss:0.5244
Episode:1029 meanR:108.4800 glossQ:0.4184 glossP:-1.2738 gloss:-1.2738 dloss:0.7938
Episode:1030 meanR:108.4100 glossQ:0.3206 glossP:-1.7828 gloss:-1.7828 dloss:0.5768
Episode:1031 meanR:108.3500 glossQ:0.3098 glossP:-1.8502 gloss:-1.8502 dloss:0.5711
Episode:1032 meanR:107.9900 glossQ:0.4704 glossP:-1.0761 gloss:-1.0761 dloss:0.7682
Episode:1033 meanR:103.1200 glossQ:0.4266 glossP:-1.2120 gloss:-1.2120 dloss

Episode:1125 meanR:70.3000 glossQ:0.4801 glossP:-1.0677 gloss:-1.0677 dloss:1.3169
Episode:1126 meanR:73.4400 glossQ:0.4898 glossP:-1.0292 gloss:-1.0292 dloss:1.6784
Episode:1127 meanR:78.0800 glossQ:0.5041 glossP:-0.9721 gloss:-0.9721 dloss:2.1288
Episode:1128 meanR:82.9700 glossQ:0.5140 glossP:-0.9374 gloss:-0.9374 dloss:2.0982
Episode:1129 meanR:87.4200 glossQ:0.5292 glossP:-0.8851 gloss:-0.8851 dloss:2.0538
Episode:1130 meanR:92.1800 glossQ:0.5454 glossP:-0.8431 gloss:-0.8431 dloss:2.0233
Episode:1131 meanR:96.9300 glossQ:0.5634 glossP:-0.7846 gloss:-0.7846 dloss:1.9769
Episode:1132 meanR:101.6200 glossQ:0.5735 glossP:-0.7487 gloss:-0.7487 dloss:1.9483
Episode:1133 meanR:106.4900 glossQ:0.5883 glossP:-0.6956 gloss:-0.6956 dloss:1.9061
Episode:1134 meanR:111.3400 glossQ:0.6029 glossP:-0.6305 gloss:-0.6305 dloss:1.8522
Episode:1135 meanR:116.2200 glossQ:0.6174 glossP:-0.5811 gloss:-0.5811 dloss:1.8167
Episode:1136 meanR:121.1200 glossQ:0.6335 glossP:-0.5163 gloss:-0.5163 dloss:1.7685

Episode:1224 meanR:334.9100 glossQ:0.6863 glossP:0.1626 gloss:0.1626 dloss:1.2690
Episode:1225 meanR:337.8700 glossQ:0.6832 glossP:0.1963 gloss:0.1963 dloss:1.2447
Episode:1226 meanR:339.5900 glossQ:0.6812 glossP:0.2154 gloss:0.2154 dloss:1.2310
Episode:1227 meanR:339.5900 glossQ:0.6802 glossP:0.2246 gloss:0.2246 dloss:1.2243
Episode:1228 meanR:339.5900 glossQ:0.6780 glossP:0.2431 gloss:0.2431 dloss:1.2111
Episode:1229 meanR:339.5900 glossQ:0.6764 glossP:0.2561 gloss:0.2561 dloss:1.2017
Episode:1230 meanR:339.5900 glossQ:0.6717 glossP:0.2901 gloss:0.2901 dloss:1.1773
Episode:1231 meanR:339.5900 glossQ:0.6702 glossP:0.3006 gloss:0.3006 dloss:1.1698
Episode:1232 meanR:339.5900 glossQ:0.6661 glossP:0.3270 gloss:0.3270 dloss:1.1509
Episode:1233 meanR:339.5900 glossQ:0.6572 glossP:0.3783 gloss:0.3783 dloss:1.1142
Episode:1234 meanR:339.5900 glossQ:0.6538 glossP:0.3970 gloss:0.3970 dloss:1.1010
Episode:1235 meanR:339.5900 glossQ:0.6508 glossP:0.4124 gloss:0.4124 dloss:1.0900

Episode:1324 meanR:360.8200 glossQ:0.6525 glossP:-0.0995 gloss:-0.0995 dloss:1.3456
Episode:1325 meanR:356.1400 glossQ:0.6330 glossP:-0.4141 gloss:-0.4141 dloss:1.1369
Episode:1326 meanR:351.9600 glossQ:0.6198 glossP:-0.3078 gloss:-0.3078 dloss:1.2568
Episode:1327 meanR:347.4500 glossQ:0.6000 glossP:-0.5245 gloss:-0.5245 dloss:1.1023
Episode:1328 meanR:342.9900 glossQ:0.5567 glossP:-0.6986 gloss:-0.6986 dloss:1.0226
Episode:1329 meanR:338.6000 glossQ:0.5644 glossP:-0.6730 gloss:-0.6730 dloss:1.0541
Episode:1330 meanR:334.1600 glossQ:0.5045 glossP:-0.9018 gloss:-0.9018 dloss:0.9326
Episode:1331 meanR:330.6700 glossQ:0.6970 glossP:0.0381 gloss:0.0381 dloss:1.4047
Episode:1332 meanR:326.6000 glossQ:0.5395 glossP:-0.7758 gloss:-0.7758 dloss:1.0970
Episode:1333 meanR:322.4700 glossQ:0.4863 glossP:-0.9824 gloss:-0.9824 dloss:1.0013
Episode:1334 meanR:319.2600 glossQ:0.5692 glossP:-0.7011 gloss:-0.7011 dloss:1.2960
Episode:1335 meanR:315.5900 glossQ:0.5373 glossP:-0.8120 gloss:-0.8120 dloss:1

Episode:1427 meanR:50.6100 glossQ:0.2973 glossP:-1.6466 gloss:-1.6466 dloss:0.4875
Episode:1428 meanR:50.1700 glossQ:0.1451 glossP:-2.6903 gloss:-2.6903 dloss:0.2699
Episode:1429 meanR:49.7200 glossQ:0.2757 glossP:-1.7337 gloss:-1.7337 dloss:0.4427
Episode:1430 meanR:49.3200 glossQ:0.2606 glossP:-1.7915 gloss:-1.7915 dloss:0.4235
Episode:1431 meanR:47.9400 glossQ:0.2342 glossP:-1.9120 gloss:-1.9120 dloss:0.3772
Episode:1432 meanR:47.1400 glossQ:0.2290 glossP:-1.9436 gloss:-1.9436 dloss:0.3725
Episode:1433 meanR:46.3900 glossQ:0.1809 glossP:-2.2009 gloss:-2.2009 dloss:0.3096
Episode:1434 meanR:44.7000 glossQ:0.1711 glossP:-2.2723 gloss:-2.2723 dloss:0.2877
Episode:1435 meanR:43.5000 glossQ:0.1805 glossP:-2.1929 gloss:-2.1929 dloss:0.3134
Episode:1436 meanR:41.9100 glossQ:0.2254 glossP:-2.0144 gloss:-2.0144 dloss:0.4260
Episode:1437 meanR:41.3600 glossQ:0.2162 glossP:-2.1105 gloss:-2.1105 dloss:0.4142
Episode:1438 meanR:40.9700 glossQ:0.2463 glossP:-1.9136 gloss:-1.9136 dloss:0.4483
Epis

Episode:1526 meanR:77.3000 glossQ:0.5489 glossP:-0.8151 gloss:-0.8151 dloss:1.9952
Episode:1527 meanR:77.4800 glossQ:0.4784 glossP:-1.0595 gloss:-1.0595 dloss:0.8121
Episode:1528 meanR:77.6900 glossQ:0.4898 glossP:-1.0241 gloss:-1.0241 dloss:0.8028
Episode:1529 meanR:77.6700 glossQ:0.3381 glossP:-1.5409 gloss:-1.5409 dloss:0.5300
Episode:1530 meanR:77.7900 glossQ:0.4643 glossP:-1.1130 gloss:-1.1130 dloss:0.7665
Episode:1531 meanR:82.6600 glossQ:0.5359 glossP:-0.8627 gloss:-0.8627 dloss:2.0357
Episode:1532 meanR:83.8500 glossQ:0.5341 glossP:-0.8633 gloss:-0.8633 dloss:1.1238
Episode:1533 meanR:84.5900 glossQ:0.4868 glossP:-1.0211 gloss:-1.0211 dloss:0.9582
Episode:1534 meanR:84.6100 glossQ:0.2561 glossP:-2.0710 gloss:-2.0710 dloss:0.4315
Episode:1535 meanR:86.5700 glossQ:0.5377 glossP:-0.8513 gloss:-0.8513 dloss:1.3238
Episode:1536 meanR:86.5500 glossQ:0.4727 glossP:-1.0681 gloss:-1.0681 dloss:0.7533
Episode:1537 meanR:86.4800 glossQ:0.3809 glossP:-1.3643 gloss:-1.3643 dloss:0.5895
Epis

Episode:1625 meanR:132.7800 glossQ:0.6798 glossP:-0.1675 gloss:-0.1675 dloss:1.3406
Episode:1626 meanR:132.7800 glossQ:0.6914 glossP:-0.1004 gloss:-0.1004 dloss:1.4599
Episode:1627 meanR:137.4000 glossQ:0.6940 glossP:-0.0378 gloss:-0.0378 dloss:1.4152
Episode:1628 meanR:142.0900 glossQ:0.6936 glossP:0.0134 gloss:0.0134 dloss:1.3776
Episode:1629 meanR:144.3400 glossQ:0.6839 glossP:-0.0075 gloss:-0.0075 dloss:1.3901
Episode:1630 meanR:146.0900 glossQ:0.6834 glossP:-0.0151 gloss:-0.0151 dloss:1.3882
Episode:1631 meanR:142.7300 glossQ:0.6803 glossP:-0.0393 gloss:-0.0393 dloss:1.3818
Episode:1632 meanR:142.7600 glossQ:0.6803 glossP:-0.0669 gloss:-0.0669 dloss:1.3683
Episode:1633 meanR:142.0000 glossQ:0.6375 glossP:-0.4665 gloss:-0.4665 dloss:1.0993
Episode:1634 meanR:142.0200 glossQ:0.6127 glossP:-0.5595 gloss:-0.5595 dloss:1.0440
Episode:1635 meanR:141.0100 glossQ:0.6922 glossP:-0.0234 gloss:-0.0234 dloss:1.3785
Episode:1636 meanR:142.2000 glossQ:0.6845 glossP:-0.0781 gloss:-0.0781 dloss:1

Episode:1723 meanR:168.9100 glossQ:0.6703 glossP:-0.2807 gloss:-0.2807 dloss:1.4220
Episode:1724 meanR:170.4200 glossQ:0.6663 glossP:-0.2909 gloss:-0.2909 dloss:1.3767
Episode:1725 meanR:171.7000 glossQ:0.6699 glossP:-0.2943 gloss:-0.2943 dloss:1.4168
Episode:1726 meanR:168.5500 glossQ:0.6675 glossP:-0.3238 gloss:-0.3238 dloss:1.3321
Episode:1727 meanR:163.9700 glossQ:0.6091 glossP:-0.5209 gloss:-0.5209 dloss:1.1218
Episode:1728 meanR:159.3100 glossQ:0.5119 glossP:-0.9763 gloss:-0.9763 dloss:0.8949
Episode:1729 meanR:157.3800 glossQ:0.6095 glossP:-0.5438 gloss:-0.5438 dloss:1.1179
Episode:1730 meanR:158.0000 glossQ:0.6759 glossP:-0.2661 gloss:-0.2661 dloss:1.4020
Episode:1731 meanR:159.2600 glossQ:0.6736 glossP:-0.2594 gloss:-0.2594 dloss:1.4246
Episode:1732 meanR:162.9100 glossQ:0.6834 glossP:-0.2236 gloss:-0.2236 dloss:1.5522
Episode:1733 meanR:167.8100 glossQ:0.6833 glossP:-0.2275 gloss:-0.2275 dloss:1.5550
Episode:1734 meanR:172.6700 glossQ:0.6843 glossP:-0.2135 gloss:-0.2135 dloss

Episode:1822 meanR:288.9700 glossQ:0.6387 glossP:0.4542 gloss:0.4542 dloss:1.3782
Episode:1823 meanR:288.9500 glossQ:0.6350 glossP:0.4758 gloss:0.4758 dloss:1.3505
Episode:1824 meanR:289.4500 glossQ:0.6324 glossP:0.4904 gloss:0.4904 dloss:1.3581
Episode:1825 meanR:289.3700 glossQ:0.6306 glossP:0.4958 gloss:0.4958 dloss:1.3677
Episode:1826 meanR:290.3700 glossQ:0.6292 glossP:0.5045 gloss:0.5045 dloss:1.3399
Episode:1827 meanR:292.5900 glossQ:0.6279 glossP:0.5097 gloss:0.5097 dloss:1.3703
Episode:1828 meanR:294.7700 glossQ:0.6285 glossP:0.5060 gloss:0.5060 dloss:1.3877
Episode:1829 meanR:297.3900 glossQ:0.6330 glossP:0.4871 gloss:0.4871 dloss:1.3090
Episode:1830 meanR:298.1200 glossQ:0.6290 glossP:0.5093 gloss:0.5093 dloss:1.2611
Episode:1831 meanR:298.1800 glossQ:0.6362 glossP:0.4650 gloss:0.4650 dloss:1.3285
Episode:1832 meanR:296.5600 glossQ:0.6323 glossP:0.4933 gloss:0.4933 dloss:1.2648
Episode:1833 meanR:294.8500 glossQ:0.6328 glossP:0.4902 gloss:0.4902 dloss:1.2782
Episode:1834 mea

Episode:1923 meanR:264.3800 glossQ:0.4354 glossP:-1.2638 gloss:-1.2638 dloss:1.2898
Episode:1924 meanR:266.3800 glossQ:0.6779 glossP:-0.1265 gloss:-0.1265 dloss:1.5023
Episode:1925 meanR:266.1700 glossQ:0.6975 glossP:0.1218 gloss:0.1218 dloss:1.4241
Episode:1926 meanR:265.3300 glossQ:0.6873 glossP:0.1065 gloss:0.1065 dloss:1.4283
Episode:1927 meanR:264.3800 glossQ:0.6809 glossP:0.0568 gloss:0.0568 dloss:1.4230
Episode:1928 meanR:262.7400 glossQ:0.6808 glossP:-0.2148 gloss:-0.2148 dloss:1.2974
Episode:1929 meanR:259.8600 glossQ:0.6585 glossP:-0.3674 gloss:-0.3674 dloss:1.1427
Episode:1930 meanR:256.6500 glossQ:0.6427 glossP:-0.4636 gloss:-0.4636 dloss:1.0794
Episode:1931 meanR:253.8600 glossQ:0.6117 glossP:-0.5975 gloss:-0.5975 dloss:0.9941
Episode:1932 meanR:250.6100 glossQ:0.5759 glossP:-0.7287 gloss:-0.7287 dloss:0.9062
Episode:1933 meanR:247.4600 glossQ:0.5548 glossP:-0.8016 gloss:-0.8016 dloss:0.8662
Episode:1934 meanR:244.5500 glossQ:0.5221 glossP:-0.9088 gloss:-0.9088 dloss:0.804

Episode:2022 meanR:109.6700 glossQ:0.6157 glossP:-0.5821 gloss:-0.5821 dloss:1.3146
Episode:2023 meanR:107.9500 glossQ:0.2527 glossP:-2.1457 gloss:-2.1457 dloss:0.4508
Episode:2024 meanR:103.6700 glossQ:0.5549 glossP:-0.8145 gloss:-0.8145 dloss:0.9640
Episode:2025 meanR:101.3500 glossQ:0.3815 glossP:-1.4078 gloss:-1.4078 dloss:0.6067
Episode:2026 meanR:99.6600 glossQ:0.5429 glossP:-0.8473 gloss:-0.8473 dloss:0.9045
Episode:2027 meanR:98.9500 glossQ:0.6053 glossP:-0.6121 gloss:-0.6121 dloss:1.1327
Episode:2028 meanR:99.6300 glossQ:0.6367 glossP:-0.4743 gloss:-0.4743 dloss:1.2668
Episode:2029 meanR:103.7900 glossQ:0.6304 glossP:-0.5129 gloss:-0.5129 dloss:1.6769
Episode:2030 meanR:104.5200 glossQ:0.5227 glossP:-0.9034 gloss:-0.9034 dloss:1.0169
Episode:2031 meanR:105.7000 glossQ:0.5625 glossP:-0.7816 gloss:-0.7816 dloss:1.1676
Episode:2032 meanR:106.8000 glossQ:0.5474 glossP:-0.8248 gloss:-0.8248 dloss:1.1249
Episode:2033 meanR:107.7500 glossQ:0.5174 glossP:-0.9126 gloss:-0.9126 dloss:1.

Episode:2120 meanR:90.2900 glossQ:0.5228 glossP:-0.9027 gloss:-0.9027 dloss:1.1720
Episode:2121 meanR:91.4600 glossQ:0.5232 glossP:-0.8991 gloss:-0.8991 dloss:1.1686
Episode:2122 meanR:92.9300 glossQ:0.5596 glossP:-0.7956 gloss:-0.7956 dloss:1.6097
Episode:2123 meanR:97.8200 glossQ:0.5797 glossP:-0.7315 gloss:-0.7315 dloss:1.9372
Episode:2124 meanR:100.9600 glossQ:0.5931 glossP:-0.6675 gloss:-0.6675 dloss:1.6129
Episode:2125 meanR:103.8300 glossQ:0.5993 glossP:-0.6365 gloss:-0.6365 dloss:1.4916
Episode:2126 meanR:105.3200 glossQ:0.5958 glossP:-0.6494 gloss:-0.6494 dloss:1.2736
Episode:2127 meanR:105.9000 glossQ:0.5993 glossP:-0.6336 gloss:-0.6336 dloss:1.2291
Episode:2128 meanR:105.5100 glossQ:0.5943 glossP:-0.6443 gloss:-0.6443 dloss:1.1533
Episode:2129 meanR:102.5700 glossQ:0.5976 glossP:-0.6430 gloss:-0.6430 dloss:1.2015
Episode:2130 meanR:103.2200 glossQ:0.6006 glossP:-0.6286 gloss:-0.6286 dloss:1.2293
Episode:2131 meanR:102.7900 glossQ:0.5736 glossP:-0.7271 gloss:-0.7271 dloss:1.0

Episode:2221 meanR:120.4100 glossQ:0.5024 glossP:-0.9797 gloss:-0.9797 dloss:0.8570
Episode:2222 meanR:117.8400 glossQ:0.4872 glossP:-1.0607 gloss:-1.0607 dloss:0.9515
Episode:2223 meanR:113.2000 glossQ:0.4813 glossP:-1.0557 gloss:-1.0557 dloss:0.8145
Episode:2224 meanR:110.0300 glossQ:0.4986 glossP:-0.9961 gloss:-0.9961 dloss:0.8633
Episode:2225 meanR:107.5700 glossQ:0.4984 glossP:-1.0119 gloss:-1.0119 dloss:0.8973
Episode:2226 meanR:106.1300 glossQ:0.5020 glossP:-0.9806 gloss:-0.9806 dloss:0.8548
Episode:2227 meanR:104.9600 glossQ:0.5075 glossP:-0.9637 gloss:-0.9637 dloss:0.8679
Episode:2228 meanR:104.4500 glossQ:0.5171 glossP:-0.9668 gloss:-0.9668 dloss:0.9605
Episode:2229 meanR:103.7100 glossQ:0.4501 glossP:-1.1640 gloss:-1.1640 dloss:0.8791
Episode:2230 meanR:102.5700 glossQ:0.4992 glossP:-0.9954 gloss:-0.9954 dloss:0.8632
Episode:2231 meanR:102.3500 glossQ:0.4884 glossP:-1.0294 gloss:-1.0294 dloss:0.9219
Episode:2232 meanR:102.1400 glossQ:0.5053 glossP:-0.9693 gloss:-0.9693 dloss

Episode:2321 meanR:135.7200 glossQ:0.4542 glossP:-1.1431 gloss:-1.1431 dloss:0.7132
Episode:2322 meanR:135.1100 glossQ:0.4676 glossP:-1.1007 gloss:-1.1007 dloss:0.7366
Episode:2323 meanR:134.9300 glossQ:0.4536 glossP:-1.1462 gloss:-1.1462 dloss:0.7197
Episode:2324 meanR:134.7100 glossQ:0.4715 glossP:-1.0933 gloss:-1.0933 dloss:0.7552
Episode:2325 meanR:134.3600 glossQ:0.4727 glossP:-1.0895 gloss:-1.0895 dloss:0.7514
Episode:2326 meanR:134.2400 glossQ:0.4882 glossP:-1.0383 gloss:-1.0383 dloss:0.7908
Episode:2327 meanR:134.2000 glossQ:0.4890 glossP:-1.0334 gloss:-1.0334 dloss:0.8228
Episode:2328 meanR:134.0800 glossQ:0.4593 glossP:-1.1242 gloss:-1.1242 dloss:0.8446
Episode:2329 meanR:134.1600 glossQ:0.4106 glossP:-1.2790 gloss:-1.2790 dloss:0.8637
Episode:2330 meanR:134.9000 glossQ:0.5149 glossP:-0.9203 gloss:-0.9203 dloss:1.0700
Episode:2331 meanR:135.9900 glossQ:0.5898 glossP:-0.6597 gloss:-0.6597 dloss:1.2723
Episode:2332 meanR:140.1800 glossQ:0.6582 glossP:-0.4053 gloss:-0.4053 dloss

Episode:2419 meanR:197.9700 glossQ:0.6819 glossP:-0.1437 gloss:-0.1437 dloss:1.4164
Episode:2420 meanR:202.7400 glossQ:0.6944 glossP:-0.0282 gloss:-0.0282 dloss:1.4088
Episode:2421 meanR:207.5800 glossQ:0.6962 glossP:-0.0137 gloss:-0.0137 dloss:1.4010
Episode:2422 meanR:210.2300 glossQ:0.6896 glossP:-0.0841 gloss:-0.0841 dloss:1.3958
Episode:2423 meanR:212.4100 glossQ:0.6897 glossP:-0.0723 gloss:-0.0723 dloss:1.3852
Episode:2424 meanR:213.1400 glossQ:0.6827 glossP:-0.1621 gloss:-0.1621 dloss:1.3156
Episode:2425 meanR:213.8000 glossQ:0.6827 glossP:-0.1563 gloss:-0.1563 dloss:1.3151
Episode:2426 meanR:214.4700 glossQ:0.6820 glossP:-0.1689 gloss:-0.1689 dloss:1.3141
Episode:2427 meanR:214.9300 glossQ:0.6812 glossP:-0.1654 gloss:-0.1654 dloss:1.3112
Episode:2428 meanR:215.2300 glossQ:0.6807 glossP:-0.1717 gloss:-0.1717 dloss:1.3089
Episode:2429 meanR:215.3100 glossQ:0.6793 glossP:-0.1843 gloss:-0.1843 dloss:1.3024
Episode:2430 meanR:215.1500 glossQ:0.6802 glossP:-0.1826 gloss:-0.1826 dloss

Episode:2518 meanR:155.6800 glossQ:0.6208 glossP:-0.5184 gloss:-0.5184 dloss:1.3471
Episode:2519 meanR:157.5500 glossQ:0.6877 glossP:-0.2074 gloss:-0.2074 dloss:1.5434
Episode:2520 meanR:155.0100 glossQ:0.6715 glossP:-0.2317 gloss:-0.2317 dloss:1.3944
Episode:2521 meanR:151.7900 glossQ:0.6477 glossP:-0.3878 gloss:-0.3878 dloss:1.3183
Episode:2522 meanR:150.6600 glossQ:0.6210 glossP:-0.5271 gloss:-0.5271 dloss:1.2717
Episode:2523 meanR:148.4400 glossQ:0.4586 glossP:-1.1237 gloss:-1.1237 dloss:0.7343
Episode:2524 meanR:147.6100 glossQ:0.3843 glossP:-1.3886 gloss:-1.3886 dloss:0.6210
Episode:2525 meanR:146.8900 glossQ:0.4151 glossP:-1.2737 gloss:-1.2737 dloss:0.6765
Episode:2526 meanR:147.4000 glossQ:0.5644 glossP:-0.7280 gloss:-0.7280 dloss:1.1868
Episode:2527 meanR:148.1800 glossQ:0.6138 glossP:-0.5619 gloss:-0.5619 dloss:1.2489
Episode:2528 meanR:149.1700 glossQ:0.6266 glossP:-0.5075 gloss:-0.5075 dloss:1.2975
Episode:2529 meanR:150.3400 glossQ:0.6401 glossP:-0.4500 gloss:-0.4500 dloss

Episode:2616 meanR:209.7700 glossQ:0.4955 glossP:-0.9763 gloss:-0.9763 dloss:0.8402
Episode:2617 meanR:211.3700 glossQ:0.5994 glossP:-0.5829 gloss:-0.5829 dloss:1.2724
Episode:2618 meanR:209.4000 glossQ:0.4342 glossP:-1.1965 gloss:-1.1965 dloss:0.7562
Episode:2619 meanR:204.5300 glossQ:0.4289 glossP:-1.2056 gloss:-1.2056 dloss:0.7379
Episode:2620 meanR:203.1000 glossQ:0.6363 glossP:-0.4299 gloss:-0.4299 dloss:1.2188
Episode:2621 meanR:203.0000 glossQ:0.6554 glossP:-0.2722 gloss:-0.2722 dloss:1.3335
Episode:2622 meanR:203.4200 glossQ:0.6770 glossP:-0.0786 gloss:-0.0786 dloss:1.3853
Episode:2623 meanR:206.7000 glossQ:0.6831 glossP:0.0446 gloss:0.0446 dloss:1.3813
Episode:2624 meanR:211.6000 glossQ:0.6981 glossP:0.0605 gloss:0.0605 dloss:1.3506
Episode:2625 meanR:214.7400 glossQ:0.6858 glossP:-0.0466 gloss:-0.0466 dloss:1.4022
Episode:2626 meanR:214.4900 glossQ:0.6715 glossP:-0.2260 gloss:-0.2260 dloss:1.3104
Episode:2627 meanR:213.7200 glossQ:0.6607 glossP:-0.3352 gloss:-0.3352 dloss:1.2

Episode:2714 meanR:138.9500 glossQ:0.6794 glossP:-0.2243 gloss:-0.2243 dloss:1.4076
Episode:2715 meanR:140.5300 glossQ:0.6841 glossP:-0.1981 gloss:-0.1981 dloss:1.4469
Episode:2716 meanR:145.3800 glossQ:0.6981 glossP:-0.1594 gloss:-0.1594 dloss:1.5207
Episode:2717 meanR:148.6700 glossQ:0.6924 glossP:-0.2830 gloss:-0.2830 dloss:1.6179
Episode:2718 meanR:152.9300 glossQ:0.6392 glossP:-0.4449 gloss:-0.4449 dloss:1.6585
Episode:2719 meanR:155.5300 glossQ:0.6306 glossP:-0.4884 gloss:-0.4884 dloss:1.4337
Episode:2720 meanR:156.1600 glossQ:0.5807 glossP:-0.7021 gloss:-0.7021 dloss:1.2408
Episode:2721 meanR:155.6600 glossQ:0.5353 glossP:-0.8599 gloss:-0.8599 dloss:1.0989
Episode:2722 meanR:154.1000 glossQ:0.4037 glossP:-1.2975 gloss:-1.2975 dloss:0.7743
Episode:2723 meanR:150.8600 glossQ:0.4682 glossP:-1.1015 gloss:-1.1015 dloss:0.7599
Episode:2724 meanR:145.9700 glossQ:0.3880 glossP:-1.3690 gloss:-1.3690 dloss:0.6260
Episode:2725 meanR:142.8200 glossQ:0.3775 glossP:-1.4135 gloss:-1.4135 dloss

## Visualizing training

Below I'll plot the total reward for each episode in grey, with a 10-episode rolling average drawn on top in blue.

In [None]:
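import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def running_mean(x, N):
    # Rolling average over a window of N values, computed from a cumulative sum.
    # The result has len(x) - N + 1 entries.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N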
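
To see what `running_mean` does, and why the plotting cells slice the episode axis with `eps[-len(smoothed_arr):]`, here is a small check on made-up numbers. The `rewards` array is purely illustrative and does not come from the training run.

```python
import numpy as np

def running_mean(x, N):
    # Same cumulative-sum trick as in the notebook cell
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Hypothetical per-episode rewards, for illustration only
rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

smoothed = running_mean(rewards, 2)
print(smoothed)       # [1.5 2.5 3.5 4.5]
print(len(smoothed))  # 4, i.e. len(rewards) - N + 1
```

Because the smoothed series is `N - 1` entries shorter than the input, it is plotted against only the last `len(smoothed_arr)` episode numbers.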

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(g_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(d_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

## Testing

Let's check out how our trained agent plays the game.

In [28]:
import gym
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Load the most recent weights saved during training
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))

    # Number of test episodes to play
    for _ in range(1):
        state = env.reset()
        total_reward = 0

        # Step through the episode until the environment reports it is done
        while True:
            env.render()
            # Greedy policy: pick the action with the largest logit
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: np.reshape(state, [1, -1])})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break

        print('total_reward: {}'.format(total_reward))
        # 500 is the maximum episode return in CartPole-v1, so stop once it is reached
        if total_reward == 500:
            break

# Close the environment (and its render window)
env.close()



WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
INFO:tensorflow:Restoring parameters from checkpoints/model2.ckpt
total_reward: 500.0
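
The test loop above picks actions greedily, taking the `argmax` of the logits. As a rough sketch of what that means, the snippet below compares greedy selection with sampling from the softmax of the same logits. The sampling variant is not used in this notebook, and the helper names and the `example_logits` values are assumptions made up for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def greedy_action(logits):
    # What the test cell does: take the index of the largest logit
    return int(np.argmax(logits))

def sampled_action(logits, rng):
    # Alternative (not in the notebook): sample an action from the softmax distribution
    probs = softmax(logits).ravel()
    return int(rng.choice(len(probs), p=probs))

# Hypothetical logits for the two Cart-Pole actions, shaped [1, 2] like the network output
example_logits = np.array([[0.2, 1.3]])
rng = np.random.default_rng(0)

greedy = greedy_action(example_logits)
sampled = sampled_action(example_logits, rng)
print(greedy, sampled)
```

Greedy selection is deterministic, which suits evaluation. Sampling keeps some randomness and is more commonly used while an agent is still exploring.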


## Extending this

Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of a small state vector like the one we're using here, though, you'd want to use convolutional layers to extract the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)
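
Before screen images can be fed to convolutional layers, they are usually shrunk and stacked so the network can see motion. Here is a minimal NumPy sketch of that preprocessing step. The frame size (210x160 RGB), the 2x downsampling, the stack of 4 frames, and the names `preprocess_frame` and `FrameStack` are all assumptions for illustration, not code from this notebook.

```python
import numpy as np
from collections import deque

def preprocess_frame(frame):
    """Turn an RGB frame (H, W, 3) into a half-resolution grayscale array in [0, 1]."""
    gray = frame.astype(np.float32).mean(axis=2)  # simple channel average
    small = gray[::2, ::2]                        # keep every second row and column
    return small / 255.0

class FrameStack:
    """Keep the last k preprocessed frames and expose them as one (H, W, k) state."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # Fill the stack with copies of the first frame of the episode
        processed = preprocess_frame(frame)
        for _ in range(self.k):
            self.frames.append(processed)
        return self.state()

    def step(self, frame):
        # Appending drops the oldest frame automatically (maxlen=k)
        self.frames.append(preprocess_frame(frame))
        return self.state()

    def state(self):
        return np.stack(self.frames, axis=-1)

# Hypothetical blank Atari-sized frame, for illustration only
dummy_frame = np.zeros((210, 160, 3), dtype=np.uint8)
stack = FrameStack(k=4)
state = stack.reset(dummy_frame)
print(state.shape)  # (105, 80, 4)
```

The stacked array would then take the place of the flat state vector as the input to the convolutional layers.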

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf