# DPG for BipedalWalker

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.11.0
Default GPU Device: /device:GPU:0


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

##### >**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [2]:
import gym
env = gym.make('BipedalWalker-v2')

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m


We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [3]:
env.observation_space, env.action_space

(Box(24,), Box(4,))

In [4]:
state = env.reset()
batch = []
for _ in range(1111):
    #env.render()
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action) # take a random action
    batch.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done:
        state = env.reset()

To shut the window showing the simulation, use `env.close()`.

In [5]:
# env.close()

If you ran the simulation above, we can look at the rewards:

In [6]:
import numpy as np
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
infos = np.array([each[4] for each in batch])

In [7]:
print('states:', np.max(np.array(states)), np.min(np.array(states)))
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
print('rewards:', np.max(np.array(rewards)), np.min(np.array(rewards)))

states: 0.99992806 -0.9998551
actions: 2.253050168355306 -2.222696304321289
rewards: 2.253050168355306 -2.222696304321289


In [8]:
env.action_space.high, env.action_space.low

(array([1., 1., 1., 1.], dtype=float32),
 array([-1., -1., -1., -1.], dtype=float32))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [11]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_shape], name='states')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.float32, [None, action_size], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    rewards = tf.placeholder(tf.float32, [None], name='rewards')
    training = tf.placeholder(tf.bool, [], name='training')
    return states, actions, targetQs, rewards, training

In [12]:
# Generator/Controller: Generating/prediting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        predictions = tf.nn.tanh(logits) # [-1, +1]

        # return actions logits
        return predictions

In [13]:
# Discriminator/Dopamine: Reward function/planner/naviator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=action_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        fused = tf.concat(axis=1, values=[nl1, actions])
        h2 = tf.layers.dense(inputs=fused, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
                
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [14]:
# # DDPG
# def model_loss(action_size, hidden_size, states, actions, targetQs, rates):
#     actions_preds = generator(states=states, hidden_size=hidden_size, action_size=action_size)
#     gQs = discriminator(actions=actions_preds, hidden_size=hidden_size, states=states) # nextQs/targetQs
#     gloss = -tf.reduce_mean(gQs)
#     dQs = discriminator(actions=actions, hidden_size=hidden_size, states=states, reuse=True) # Qs
#     targetQs = tf.reshape(targetQs, shape=[-1, 1]) # gQs
#     dloss = tf.reduce_mean(tf.square(dQs - targetQs)) # DQN
#     rates = tf.reshape(rates, shape=[-1, 1]) # [-1,+1]
#     dloss += tf.reduce_mean(tf.square(tf.nn.tanh(dQs) - rates)) # DQN
#     return actions_preds, gQs, gloss, dloss

In [17]:
# Adverseial Q-learning
def model_loss(action_size, hidden_size, states, actions, targetQs, rewards, training):
    actions_preds = generator(states=states, hidden_size=hidden_size, action_size=action_size, training=training)
    gQs = discriminator(actions=actions_preds, hidden_size=hidden_size, states=states, training=training) #nextQs
    dQs = discriminator(actions=actions, hidden_size=hidden_size, states=states, training=training, reuse=True)#Qs
    targetQs = tf.reshape(targetQs, shape=[-1, 1]) # gQs
    gloss = tf.reduce_mean(tf.square(gQs - targetQs)) # DQN
    dloss = tf.reduce_mean(tf.square(dQs - targetQs)) # DQN
    dloss += tf.reduce_mean(tf.square(dQs - rewards)) # DQN
    return actions_preds, gQs, gloss, dloss

In [18]:
# Optimizating/training/learning G & D
def model_opt(g_loss, d_loss, g_learning_rate, d_learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(g_learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(d_learning_rate).minimize(d_loss, var_list=d_vars)
    return g_opt, d_opt

In [19]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, g_learning_rate, d_learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, self.rewards, self.training = model_input(
            state_size=state_size, action_size=action_size)

        # Create the Model: calculating the loss and forwad pass
        self.actions_preds, self.Qs_logits, self.g_loss, self.d_loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, targetQs=self.targetQs, 
            rewards=self.rewards, training=self.training) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, 
                                           d_loss=self.d_loss,
                                           g_learning_rate=g_learning_rate, 
                                           d_learning_rate=d_learning_rate)

In [20]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [21]:
env.observation_space, env.action_space

(Box(24,), Box(4,))

In [22]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01           # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 24
action_size = 4
hidden_size = 24*2             # number of units in each Q-network hidden layer
g_learning_rate = 1e-4         # Q-network learning rate
d_learning_rate = 1e-4         # Q-network learning rate

# Memory parameters
memory_size = int(1e5)            # memory capacity
batch_size = int(1e2)             # experience mini-batch size == one episode size is 1000/int(1e3) steps
gamma = 0.99                   # future reward discount

In [23]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size,
              g_learning_rate=g_learning_rate, d_learning_rate=d_learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [24]:
env.action_space.high, env.action_space.low, env.action_space.shape, \
env.reward_range, env.action_space

(array([1., 1., 1., 1.], dtype=float32),
 array([-1., -1., -1., -1.], dtype=float32),
 (4,),
 (-inf, inf),
 Box(4,))

In [25]:
print('states:', np.max(np.array(states)), np.min(np.array(states)))
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
print('rewards:', np.max(np.array(rewards)), np.min(np.array(rewards)))

states: 0.99992806 -0.9998551
actions: 2.253050168355306 -2.222696304321289
rewards: 2.253050168355306 -2.222696304321289


In [26]:
state = env.reset()
total_reward = 0
num_step = 0
for each_step in range(memory_size):
    action = env.action_space.sample() # randomness
    action = np.clip(action, -1, 1) # clipped: [-1, +1]
    next_state, reward, done, _ = env.step(action)
    rate = -1 # success rate: [-1, +1]
    memory.buffer.append([state, action, next_state, reward, float(done), rate])
    num_step += 1 # memory updated
    total_reward += reward # max reward 300
    state = next_state
    if done is True:
        print('Progress:', each_step/memory_size)
        state = env.reset()
        # Best 100-episode average reward was 220.62 ± 0.69. 
        # (BipedalWalker-v2 is considered "solved" 
        #  when the agent obtains an average reward of at least 300 over 100 consecutive episodes.)        
        rate = total_reward/300
        rate = np.clip(rate, -1, 1) 
        total_reward = 0 # reset
        for idx in range(num_step): # episode length
            if memory.buffer[-1-idx][-1] == -1:
                memory.buffer[-1-idx][-1] = rate
        num_step = 0 # reset

Progress: 0.01599
Progress: 0.01939
Progress: 0.02001
Progress: 0.02127
Progress: 0.02207
Progress: 0.03807
Progress: 0.03923
Progress: 0.05523
Progress: 0.05646
Progress: 0.05746
Progress: 0.05838
Progress: 0.05911
Progress: 0.05992
Progress: 0.07592
Progress: 0.07651
Progress: 0.09251
Progress: 0.09308
Progress: 0.09393
Progress: 0.09488
Progress: 0.11088
Progress: 0.11204
Progress: 0.12804
Progress: 0.12857
Progress: 0.12929
Progress: 0.14529
Progress: 0.14582
Progress: 0.14684
Progress: 0.16284
Progress: 0.17884
Progress: 0.17987
Progress: 0.18092
Progress: 0.18146
Progress: 0.18221
Progress: 0.18293
Progress: 0.18375
Progress: 0.18456
Progress: 0.18524
Progress: 0.18593
Progress: 0.20193
Progress: 0.21793
Progress: 0.21854
Progress: 0.21926
Progress: 0.23526
Progress: 0.23599
Progress: 0.23719
Progress: 0.23806
Progress: 0.23891
Progress: 0.25491
Progress: 0.25597
Progress: 0.25699
Progress: 0.25795
Progress: 0.25854
Progress: 0.25964
Progress: 0.26024
Progress: 0.26098
Progress: 

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list, gloss_list, dloss_list = [], [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model2.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        num_step = 0
        gloss_batch, dloss_batch = [], []
        state = env.reset()

        # Training steps/batches
        while True:
            action_preds = sess.run(model.actions_preds, feed_dict={model.states: state.reshape([1, -1]), 
                                                                    model.training: False})
            noise = np.random.normal(loc=0, scale=0.1, size=action_size) # randomness
            action = action_preds.reshape([-1]) + noise
            #print(action.shape, action_logits.shape, noise.shape)
            action = np.clip(action, -1, 1) # clipped
            next_state, reward, done, _ = env.step(action)
            rate = -1 # success rate: -1 to +1
            memory.buffer.append([state, action, next_state, reward, float(done), rate])
            num_step += 1 # memory updated
            total_reward += reward # max reward 300
            state = next_state
            
            if done is True:
                # Best 100-episode average reward was 220.62 ± 0.69. 
                # (BipedalWalker-v2 is considered "solved" 
                #  when the agent obtains an average reward of at least 300 over 100 consecutive episodes.)        
                rate = total_reward/300
                rate = np.clip(rate, -1, 1)
                for idx in range(num_step): # episode length
                    if memory.buffer[-1-idx][-1] == -1:
                        memory.buffer[-1-idx][-1] = rate
                        
            # Training
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            #rates = np.array([each[5] for each in batch])
            nextQs_logits = sess.run(model.Qs_logits, feed_dict = {model.states: next_states, 
                                                                   model.training: False})
            #nextQs = np.max(nextQs_logits, axis=1) * (1-dones) # discrete DQN
            nextQs = nextQs_logits.reshape([-1]) * (1-dones) # continuous DPG
            targetQs = rewards + (gamma * nextQs)
            gloss, dloss, _, _ = sess.run([model.g_loss, model.d_loss, model.g_opt, model.d_opt],
                                          feed_dict = {model.states: states, 
                                                       model.actions: actions,
                                                       model.targetQs: targetQs, 
                                                       model.rewards: rewards, 
                                                       model.training: True})
            gloss_batch.append(gloss)
            dloss_batch.append(dloss)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dloss:{:.4f}'.format(np.mean(dloss_batch)))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        # Break episode/epoch loop
        # Did not solve the environment. 
        # Best 100-episode average reward was 220.62 ± 0.69. 
        # (BipedalWalker-v2 is considered "solved" 
        #  when the agent obtains an average reward of at least 300 over 100 consecutive episodes.)        
        if np.mean(episode_reward) >= 300:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:-121.5387 R:-121.5387 gloss:15.0686 dloss:30.9303
Episode:1 meanR:-112.5238 R:-103.5088 gloss:10.5824 dloss:21.4712
Episode:2 meanR:-110.7832 R:-107.3021 gloss:14.3477 dloss:29.0787
Episode:3 meanR:-109.5724 R:-105.9397 gloss:11.9405 dloss:24.1308
Episode:4 meanR:-116.9562 R:-146.4917 gloss:10.7301 dloss:21.7678
Episode:5 meanR:-116.0979 R:-111.8065 gloss:8.9250 dloss:18.1928
Episode:6 meanR:-114.9556 R:-108.1016 gloss:6.2803 dloss:12.7687
Episode:7 meanR:-113.9792 R:-107.1444 gloss:12.3230 dloss:24.9873
Episode:8 meanR:-120.4068 R:-171.8278 gloss:10.3868 dloss:21.1026
Episode:9 meanR:-118.8849 R:-105.1880 gloss:8.8043 dloss:17.9101
Episode:10 meanR:-117.4303 R:-102.8840 gloss:14.8935 dloss:30.2764
Episode:11 meanR:-116.5814 R:-107.2432 gloss:9.6568 dloss:19.6784
Episode:12 meanR:-115.4829 R:-102.3006 gloss:13.1001 dloss:26.6146
Episode:13 meanR:-114.7152 R:-104.7349 gloss:10.8661 dloss:22.1534
Episode:14 meanR:-114.0127 R:-104.1782 gloss:5.7906 dloss:11.7912
Episode:15

Episode:123 meanR:-113.0862 R:-134.0903 gloss:18.5711 dloss:43.8839
Episode:124 meanR:-113.4373 R:-136.3668 gloss:23.3709 dloss:55.5921
Episode:125 meanR:-113.7528 R:-133.3325 gloss:10.7791 dloss:26.0888
Episode:126 meanR:-114.0398 R:-132.4066 gloss:27.2516 dloss:64.7860
Episode:127 meanR:-114.3406 R:-132.7044 gloss:22.0260 dloss:53.0313
Episode:128 meanR:-114.3854 R:-108.1004 gloss:26.0584 dloss:63.4180
Episode:129 meanR:-114.6957 R:-133.4283 gloss:12.2271 dloss:30.0559
Episode:130 meanR:-115.0187 R:-136.1895 gloss:12.1583 dloss:29.3942
Episode:131 meanR:-115.3080 R:-132.5337 gloss:23.4739 dloss:55.9023
Episode:132 meanR:-115.5839 R:-132.2448 gloss:20.1185 dloss:48.8712
Episode:133 meanR:-115.9085 R:-134.6576 gloss:17.6676 dloss:43.2666
Episode:134 meanR:-116.1947 R:-132.0057 gloss:14.2514 dloss:34.4620
Episode:135 meanR:-116.4841 R:-132.0100 gloss:13.1595 dloss:31.7312
Episode:136 meanR:-116.7493 R:-131.2472 gloss:15.1882 dloss:36.2729
Episode:137 meanR:-117.0121 R:-130.3144 gloss:21

Episode:244 meanR:-115.6661 R:-98.1094 gloss:16.1383 dloss:55.3515
Episode:245 meanR:-115.3117 R:-97.9777 gloss:17.5843 dloss:59.8073
Episode:246 meanR:-114.9600 R:-98.5891 gloss:14.4877 dloss:51.1116
Episode:247 meanR:-114.6449 R:-100.9594 gloss:20.7569 dloss:74.1512
Episode:248 meanR:-114.3674 R:-104.9520 gloss:12.7569 dloss:44.5850
Episode:249 meanR:-114.0626 R:-98.8324 gloss:18.6215 dloss:64.5589
Episode:250 meanR:-113.7985 R:-103.3901 gloss:13.2598 dloss:49.6411
Episode:251 meanR:-113.5598 R:-101.0813 gloss:12.7820 dloss:47.0969
Episode:252 meanR:-113.2736 R:-103.9842 gloss:23.5848 dloss:81.1778
Episode:253 meanR:-113.0481 R:-104.0563 gloss:11.5423 dloss:44.1997
Episode:254 meanR:-112.8378 R:-104.3241 gloss:14.9650 dloss:52.9399
Episode:255 meanR:-112.5809 R:-97.0733 gloss:14.3504 dloss:49.8211
Episode:256 meanR:-112.2016 R:-95.1385 gloss:19.0912 dloss:64.4395
Episode:257 meanR:-112.0162 R:-103.8047 gloss:17.8694 dloss:62.0498
Episode:258 meanR:-111.6769 R:-95.6320 gloss:15.4165 d

Episode:365 meanR:-102.4089 R:-100.6547 gloss:12.8687 dloss:52.1486
Episode:366 meanR:-102.4207 R:-100.9175 gloss:23.6589 dloss:93.3959
Episode:367 meanR:-102.4461 R:-102.3459 gloss:23.5891 dloss:90.9168
Episode:368 meanR:-102.4496 R:-100.9734 gloss:19.3322 dloss:79.0731
Episode:369 meanR:-102.4792 R:-102.4399 gloss:17.7022 dloss:74.3754
Episode:370 meanR:-102.4832 R:-100.9421 gloss:19.8399 dloss:75.9188
Episode:371 meanR:-102.5122 R:-106.0679 gloss:21.8593 dloss:85.1624
Episode:372 meanR:-102.5269 R:-100.9744 gloss:19.3674 dloss:79.6624
Episode:373 meanR:-102.5188 R:-102.7655 gloss:11.2538 dloss:47.9685
Episode:374 meanR:-102.5802 R:-106.6720 gloss:19.4381 dloss:79.5245
Episode:375 meanR:-102.6273 R:-106.3118 gloss:17.9349 dloss:80.3664
Episode:376 meanR:-102.6957 R:-106.7977 gloss:20.8926 dloss:80.3797
Episode:377 meanR:-102.6908 R:-101.2354 gloss:20.4885 dloss:80.1452
Episode:378 meanR:-102.7480 R:-106.2860 gloss:16.6891 dloss:76.3107
Episode:379 meanR:-102.7911 R:-106.3752 gloss:13

Episode:486 meanR:-102.5569 R:-100.4406 gloss:19.9031 dloss:90.7032
Episode:487 meanR:-102.5828 R:-103.9304 gloss:20.1763 dloss:82.9073
Episode:488 meanR:-102.5245 R:-100.0372 gloss:25.6371 dloss:111.4434
Episode:489 meanR:-102.4635 R:-100.0066 gloss:13.5566 dloss:63.9469
Episode:490 meanR:-102.4169 R:-101.5687 gloss:29.3322 dloss:119.7818
Episode:491 meanR:-102.4269 R:-102.2182 gloss:19.5041 dloss:80.7162
Episode:492 meanR:-102.4208 R:-100.8687 gloss:26.3634 dloss:112.1206
Episode:493 meanR:-102.4237 R:-101.5783 gloss:20.6156 dloss:87.7947
Episode:494 meanR:-102.4264 R:-106.4553 gloss:27.5549 dloss:107.2141
Episode:495 meanR:-102.3747 R:-100.4580 gloss:17.8665 dloss:82.2806
Episode:496 meanR:-102.3735 R:-101.0353 gloss:28.3105 dloss:116.0817
Episode:497 meanR:-102.4289 R:-106.4704 gloss:19.2517 dloss:92.6679
Episode:498 meanR:-102.4548 R:-103.2671 gloss:27.3159 dloss:118.7784
Episode:499 meanR:-102.4686 R:-102.5261 gloss:25.2461 dloss:105.8193
Episode:500 meanR:-102.8718 R:-140.9148 g

Episode:606 meanR:-102.7942 R:-100.2577 gloss:27.1440 dloss:118.9712
Episode:607 meanR:-102.7769 R:-100.6226 gloss:21.0995 dloss:95.6794
Episode:608 meanR:-102.7671 R:-99.9217 gloss:26.0762 dloss:113.1711
Episode:609 meanR:-102.7222 R:-100.9697 gloss:24.8782 dloss:117.0200
Episode:610 meanR:-102.7124 R:-100.0271 gloss:20.0421 dloss:98.0119
Episode:611 meanR:-102.6542 R:-100.1166 gloss:28.6987 dloss:124.4914
Episode:612 meanR:-102.6416 R:-99.9263 gloss:21.1263 dloss:103.7091
Episode:613 meanR:-102.6298 R:-99.9563 gloss:28.1885 dloss:127.5518
Episode:614 meanR:-102.5729 R:-101.7682 gloss:26.9786 dloss:124.6219
Episode:615 meanR:-102.5553 R:-99.7531 gloss:19.0525 dloss:90.1292
Episode:616 meanR:-102.5334 R:-99.9775 gloss:25.0351 dloss:107.6707
Episode:617 meanR:-102.5906 R:-105.8595 gloss:23.5040 dloss:107.2979
Episode:618 meanR:-102.5969 R:-101.1313 gloss:24.1181 dloss:110.4191
Episode:619 meanR:-102.5980 R:-100.8106 gloss:21.4004 dloss:101.2084
Episode:620 meanR:-102.5967 R:-100.9337 gl

Episode:726 meanR:-103.5468 R:-101.0771 gloss:34.9149 dloss:156.3664
Episode:727 meanR:-103.4902 R:-100.9321 gloss:27.4058 dloss:128.1090
Episode:728 meanR:-103.4924 R:-103.3262 gloss:29.4934 dloss:134.2032
Episode:729 meanR:-103.4474 R:-101.6148 gloss:32.9188 dloss:148.4872
Episode:730 meanR:-103.4232 R:-103.4280 gloss:26.5160 dloss:131.5021
Episode:731 meanR:-103.4054 R:-101.1107 gloss:26.8499 dloss:128.9547
Episode:732 meanR:-103.3754 R:-103.1070 gloss:25.7264 dloss:118.5680
Episode:733 meanR:-103.3574 R:-101.8118 gloss:32.2186 dloss:150.1515
Episode:734 meanR:-102.9285 R:-101.3443 gloss:28.2972 dloss:132.2530
Episode:735 meanR:-102.9342 R:-100.8274 gloss:24.9501 dloss:122.9206
Episode:736 meanR:-102.9538 R:-102.0718 gloss:25.5102 dloss:131.1608
Episode:737 meanR:-102.9737 R:-101.5850 gloss:27.4230 dloss:123.6033
Episode:738 meanR:-102.9729 R:-101.0340 gloss:30.4637 dloss:138.4927
Episode:739 meanR:-102.9905 R:-101.2732 gloss:29.4052 dloss:143.6748
Episode:740 meanR:-102.9799 R:-102

Episode:845 meanR:-102.8792 R:-103.3665 gloss:27.8580 dloss:135.4600
Episode:846 meanR:-102.8652 R:-101.9916 gloss:28.7338 dloss:150.0809
Episode:847 meanR:-102.8689 R:-103.8703 gloss:36.0071 dloss:171.6311
Episode:848 meanR:-102.8657 R:-103.1499 gloss:36.2568 dloss:186.8722
Episode:849 meanR:-102.8479 R:-101.4223 gloss:25.5417 dloss:126.1295
Episode:850 meanR:-102.8478 R:-103.5829 gloss:31.2465 dloss:159.1684
Episode:851 meanR:-102.8162 R:-104.2935 gloss:29.5765 dloss:140.0642
Episode:852 meanR:-102.8116 R:-102.8975 gloss:28.0671 dloss:150.4772
Episode:853 meanR:-102.8125 R:-103.5500 gloss:32.4259 dloss:158.6785
Episode:854 meanR:-102.8072 R:-102.7751 gloss:29.4441 dloss:145.6523
Episode:855 meanR:-102.8085 R:-103.2862 gloss:28.8980 dloss:127.6828
Episode:856 meanR:-102.7866 R:-101.3500 gloss:30.0124 dloss:145.3821
Episode:857 meanR:-102.7843 R:-101.1512 gloss:25.0586 dloss:127.0197
Episode:858 meanR:-102.7615 R:-101.2685 gloss:30.6451 dloss:146.7716
Episode:859 meanR:-102.7542 R:-103

Episode:964 meanR:-102.4398 R:-100.9028 gloss:33.1398 dloss:172.9754
Episode:965 meanR:-102.4430 R:-102.6303 gloss:30.0711 dloss:158.6935
Episode:966 meanR:-102.4383 R:-102.7996 gloss:33.2951 dloss:167.0712
Episode:967 meanR:-102.4238 R:-101.1752 gloss:33.2519 dloss:160.2786
Episode:968 meanR:-102.3997 R:-101.0205 gloss:26.1258 dloss:142.3241
Episode:969 meanR:-102.3727 R:-100.9699 gloss:25.9984 dloss:143.9075
Episode:970 meanR:-102.3731 R:-103.5734 gloss:36.4741 dloss:167.5810
Episode:971 meanR:-102.3490 R:-100.9464 gloss:43.7018 dloss:190.9235
Episode:972 meanR:-102.3624 R:-103.7315 gloss:26.7323 dloss:142.7333
Episode:973 meanR:-102.3382 R:-100.9126 gloss:43.2587 dloss:206.0961
Episode:974 meanR:-102.3203 R:-101.4254 gloss:31.1431 dloss:155.4464
Episode:975 meanR:-102.3133 R:-102.8918 gloss:38.2808 dloss:197.2522
Episode:976 meanR:-102.2916 R:-100.8951 gloss:35.7196 dloss:169.7434
Episode:977 meanR:-102.2898 R:-101.2225 gloss:32.6946 dloss:170.2240
Episode:978 meanR:-102.2539 R:-101

Episode:1082 meanR:-103.9770 R:-103.7275 gloss:40.5069 dloss:193.6398
Episode:1083 meanR:-104.0060 R:-103.9479 gloss:41.9118 dloss:194.9802
Episode:1084 meanR:-103.9982 R:-103.8113 gloss:33.2003 dloss:172.7464
Episode:1085 meanR:-103.9949 R:-103.5874 gloss:35.9025 dloss:185.7915
Episode:1086 meanR:-103.9343 R:-101.0752 gloss:32.1363 dloss:169.0248
Episode:1087 meanR:-103.9366 R:-103.9731 gloss:35.0361 dloss:179.6791
Episode:1088 meanR:-103.9512 R:-103.8212 gloss:43.6998 dloss:204.8712
Episode:1089 meanR:-103.9596 R:-103.0374 gloss:44.4152 dloss:227.8216
Episode:1090 meanR:-103.9579 R:-103.0577 gloss:39.8998 dloss:197.0068
Episode:1091 meanR:-103.9642 R:-103.8221 gloss:40.7092 dloss:217.4673
Episode:1092 meanR:-103.9653 R:-103.4791 gloss:32.1373 dloss:173.3514
Episode:1093 meanR:-103.9442 R:-101.1018 gloss:28.8588 dloss:144.1006
Episode:1094 meanR:-103.9213 R:-101.3482 gloss:39.0626 dloss:196.7954
Episode:1095 meanR:-103.8987 R:-101.0662 gloss:42.4459 dloss:205.0241
Episode:1096 meanR:-

Episode:1200 meanR:-102.6900 R:-102.9018 gloss:35.0174 dloss:181.8422
Episode:1201 meanR:-102.6962 R:-101.6601 gloss:37.2324 dloss:197.0632
Episode:1202 meanR:-102.7182 R:-103.2235 gloss:51.6459 dloss:244.0455
Episode:1203 meanR:-102.7416 R:-103.4667 gloss:41.0897 dloss:214.5443
Episode:1204 meanR:-102.7460 R:-101.4283 gloss:37.9334 dloss:188.6633
Episode:1205 meanR:-102.7631 R:-102.8001 gloss:52.4427 dloss:242.7166
Episode:1206 meanR:-102.7674 R:-101.5798 gloss:45.0178 dloss:238.4305
Episode:1207 meanR:-102.7674 R:-103.2770 gloss:47.1373 dloss:241.1972
Episode:1208 meanR:-102.7348 R:-101.5321 gloss:35.1642 dloss:187.8728
Episode:1209 meanR:-102.7219 R:-101.1722 gloss:44.2097 dloss:218.1673
Episode:1210 meanR:-102.6743 R:-103.6356 gloss:49.9374 dloss:241.7059
Episode:1211 meanR:-102.6059 R:-101.4413 gloss:40.7511 dloss:205.5784
Episode:1212 meanR:-102.6076 R:-103.4086 gloss:44.6002 dloss:215.0901
Episode:1213 meanR:-102.5926 R:-101.6428 gloss:40.0654 dloss:200.1942
Episode:1214 meanR:-

Episode:1318 meanR:-102.7022 R:-103.9175 gloss:48.4602 dloss:237.9121
Episode:1319 meanR:-102.7025 R:-103.4988 gloss:39.1671 dloss:201.9849
Episode:1320 meanR:-102.7363 R:-104.7955 gloss:44.6508 dloss:222.8578
Episode:1321 meanR:-102.7389 R:-101.9808 gloss:60.3071 dloss:278.2435
Episode:1322 meanR:-102.7422 R:-101.6255 gloss:36.2688 dloss:205.2984
Episode:1323 meanR:-102.7733 R:-104.7067 gloss:46.6683 dloss:234.5948
Episode:1324 meanR:-102.8043 R:-104.6521 gloss:43.1189 dloss:214.8105
Episode:1325 meanR:-102.8097 R:-101.9223 gloss:54.3496 dloss:274.9986
Episode:1326 meanR:-102.7939 R:-101.8489 gloss:54.6823 dloss:267.7875
Episode:1327 meanR:-102.7940 R:-101.7522 gloss:45.5209 dloss:222.5234
Episode:1328 meanR:-102.7941 R:-103.2408 gloss:42.3616 dloss:214.1233
Episode:1329 meanR:-102.8187 R:-103.9924 gloss:58.5967 dloss:268.3235
Episode:1330 meanR:-102.8003 R:-101.8041 gloss:52.1037 dloss:248.9737
Episode:1331 meanR:-102.7729 R:-101.8968 gloss:33.6645 dloss:186.3628
Episode:1332 meanR:-

Episode:1436 meanR:-102.4058 R:-101.9331 gloss:44.8993 dloss:238.7331
Episode:1437 meanR:-102.4643 R:-107.3569 gloss:37.3123 dloss:199.6142
Episode:1438 meanR:-102.4799 R:-103.2660 gloss:46.1527 dloss:235.5731
Episode:1439 meanR:-102.4839 R:-104.0936 gloss:42.7711 dloss:222.7188
Episode:1440 meanR:-102.5474 R:-107.8770 gloss:43.0327 dloss:231.8843
Episode:1441 meanR:-102.5553 R:-104.4061 gloss:48.4250 dloss:251.1646
Episode:1442 meanR:-102.5800 R:-104.0542 gloss:46.7435 dloss:244.6575
Episode:1443 meanR:-102.5993 R:-103.9904 gloss:41.3019 dloss:224.2234
Episode:1444 meanR:-102.6185 R:-103.8927 gloss:41.3479 dloss:220.0291
Episode:1445 meanR:-102.6552 R:-107.2929 gloss:39.7814 dloss:208.5297
Episode:1446 meanR:-102.7142 R:-107.5711 gloss:56.9688 dloss:276.3275
Episode:1447 meanR:-102.7807 R:-108.2340 gloss:43.5661 dloss:239.6534
Episode:1448 meanR:-102.8242 R:-107.9539 gloss:46.1142 dloss:242.0787
Episode:1449 meanR:-102.8771 R:-107.1236 gloss:44.0272 dloss:227.7260
Episode:1450 meanR:-

Episode:1554 meanR:-104.8908 R:-103.5185 gloss:44.8449 dloss:241.2391
Episode:1555 meanR:-104.9034 R:-104.0905 gloss:38.2266 dloss:220.1562
Episode:1556 meanR:-104.8647 R:-104.1476 gloss:43.9831 dloss:233.5984
Episode:1557 meanR:-104.8828 R:-103.6046 gloss:43.2544 dloss:236.9717
Episode:1558 meanR:-104.8796 R:-103.6205 gloss:42.5418 dloss:228.4591
Episode:1559 meanR:-104.8333 R:-101.9031 gloss:51.9722 dloss:289.9796
Episode:1560 meanR:-104.8462 R:-104.0460 gloss:42.9023 dloss:242.7815
Episode:1561 meanR:-104.8090 R:-103.7680 gloss:47.4735 dloss:277.9037
Episode:1562 meanR:-104.8057 R:-103.6418 gloss:39.6300 dloss:223.4050
Episode:1563 meanR:-104.8005 R:-103.6587 gloss:48.7948 dloss:267.5623
Episode:1564 meanR:-104.8246 R:-103.9330 gloss:38.5464 dloss:225.7413
Episode:1565 meanR:-104.7857 R:-103.4776 gloss:43.6162 dloss:248.5753
Episode:1566 meanR:-104.7451 R:-104.1775 gloss:43.6426 dloss:241.9231
Episode:1567 meanR:-104.7555 R:-102.6121 gloss:42.7599 dloss:238.2529
Episode:1568 meanR:-

Episode:1672 meanR:-106.4746 R:-107.0567 gloss:50.4644 dloss:276.1987
Episode:1673 meanR:-106.5131 R:-107.7423 gloss:40.0479 dloss:229.1865
Episode:1674 meanR:-106.5475 R:-107.6082 gloss:38.0492 dloss:211.6228
Episode:1675 meanR:-106.5760 R:-107.2833 gloss:41.3152 dloss:236.0216
Episode:1676 meanR:-106.6174 R:-108.3016 gloss:43.2413 dloss:241.4338
Episode:1677 meanR:-106.6063 R:-102.6784 gloss:52.3132 dloss:263.0181
Episode:1678 meanR:-106.6272 R:-106.0896 gloss:51.8333 dloss:268.1422
Episode:1679 meanR:-106.6109 R:-102.1176 gloss:42.8293 dloss:231.1048
Episode:1680 meanR:-106.6113 R:-103.9918 gloss:45.9814 dloss:253.6709
Episode:1681 meanR:-106.5952 R:-102.0892 gloss:46.7786 dloss:259.5798
Episode:1682 meanR:-106.5792 R:-102.4078 gloss:43.9217 dloss:238.6962
Episode:1683 meanR:-106.5791 R:-104.4908 gloss:45.0742 dloss:247.7062
Episode:1684 meanR:-106.5855 R:-107.5507 gloss:47.7566 dloss:256.0517
Episode:1685 meanR:-106.6343 R:-108.6285 gloss:49.0815 dloss:263.5505
Episode:1686 meanR:-

Episode:1790 meanR:-110.7664 R:-106.2117 gloss:54.7986 dloss:268.9694
Episode:1791 meanR:-110.7137 R:-103.4711 gloss:45.9692 dloss:244.1793
Episode:1792 meanR:-110.7038 R:-135.8298 gloss:50.6010 dloss:264.7799
Episode:1793 meanR:-110.7092 R:-107.8001 gloss:47.7991 dloss:248.6425
Episode:1794 meanR:-110.6592 R:-103.3308 gloss:49.0984 dloss:246.6903
Episode:1795 meanR:-110.9185 R:-134.4684 gloss:49.3410 dloss:252.8020
Episode:1796 meanR:-111.1985 R:-135.8355 gloss:46.5361 dloss:248.2590
Episode:1797 meanR:-111.1405 R:-101.9900 gloss:44.2094 dloss:233.4526
Episode:1798 meanR:-111.0761 R:-101.8973 gloss:43.8685 dloss:220.4983
Episode:1799 meanR:-111.0426 R:-103.9598 gloss:39.8806 dloss:205.1540
Episode:1800 meanR:-110.9981 R:-103.9590 gloss:55.4984 dloss:264.4752
Episode:1801 meanR:-110.9397 R:-102.4475 gloss:51.4520 dloss:276.6253
Episode:1802 meanR:-110.8858 R:-102.6228 gloss:40.8947 dloss:211.9288
Episode:1803 meanR:-110.8231 R:-102.0286 gloss:52.9832 dloss:268.3921
Episode:1804 meanR:-

Episode:1908 meanR:-115.3255 R:-107.2100 gloss:53.1211 dloss:272.7126
Episode:1909 meanR:-115.2773 R:-102.6672 gloss:45.6407 dloss:231.2599
Episode:1910 meanR:-115.2282 R:-102.0295 gloss:49.8549 dloss:257.6722
Episode:1911 meanR:-115.2324 R:-105.7879 gloss:54.3316 dloss:273.4620
Episode:1912 meanR:-115.1896 R:-103.3904 gloss:53.8801 dloss:279.4720
Episode:1913 meanR:-114.8656 R:-108.7620 gloss:42.9750 dloss:231.4273
Episode:1914 meanR:-114.8362 R:-132.1400 gloss:49.2821 dloss:254.7802
Episode:1915 meanR:-115.1909 R:-139.0613 gloss:45.5484 dloss:232.0231
Episode:1916 meanR:-115.4866 R:-135.4940 gloss:45.4195 dloss:234.5274
Episode:1917 meanR:-115.1622 R:-102.5517 gloss:51.5011 dloss:250.7741
Episode:1918 meanR:-115.1231 R:-104.3211 gloss:46.8293 dloss:242.8297
Episode:1919 meanR:-115.0854 R:-101.4714 gloss:44.1358 dloss:232.2101
Episode:1920 meanR:-115.1214 R:-107.4721 gloss:42.4919 dloss:225.9317
Episode:1921 meanR:-115.1046 R:-102.0156 gloss:45.3679 dloss:232.9097
Episode:1922 meanR:-

Episode:2026 meanR:-110.8652 R:-101.8986 gloss:45.3488 dloss:228.8841
Episode:2027 meanR:-110.8344 R:-102.9254 gloss:40.3756 dloss:203.4209
Episode:2028 meanR:-110.8709 R:-105.8675 gloss:47.4860 dloss:227.8472
Episode:2029 meanR:-110.8133 R:-102.5573 gloss:40.6263 dloss:196.7216
Episode:2030 meanR:-110.4998 R:-102.2660 gloss:55.2632 dloss:268.9727
Episode:2031 meanR:-110.1717 R:-102.2518 gloss:39.1155 dloss:205.0057
Episode:2032 meanR:-110.1091 R:-102.1362 gloss:47.0169 dloss:231.8772
Episode:2033 meanR:-110.0976 R:-102.3418 gloss:47.4364 dloss:228.8996
Episode:2034 meanR:-109.7561 R:-102.1956 gloss:48.4481 dloss:242.2426
Episode:2035 meanR:-109.7930 R:-106.0260 gloss:44.4026 dloss:223.6733
Episode:2036 meanR:-109.5955 R:-116.6555 gloss:48.0130 dloss:242.2682
Episode:2037 meanR:-109.2719 R:-101.2003 gloss:54.7740 dloss:268.1470
Episode:2038 meanR:-109.3061 R:-106.4114 gloss:47.6592 dloss:246.9375
Episode:2039 meanR:-109.2828 R:-104.7767 gloss:52.9670 dloss:266.4637
Episode:2040 meanR:-

Episode:2144 meanR:-104.1942 R:-103.1480 gloss:50.5195 dloss:243.0753
Episode:2145 meanR:-104.1475 R:-103.0356 gloss:50.4132 dloss:243.3562
Episode:2146 meanR:-103.8109 R:-102.1556 gloss:52.2122 dloss:257.5460
Episode:2147 meanR:-103.8581 R:-106.1613 gloss:44.8997 dloss:217.5421
Episode:2148 meanR:-103.8953 R:-105.6837 gloss:46.7240 dloss:244.4368
Episode:2149 meanR:-103.5383 R:-101.9618 gloss:45.1206 dloss:217.2365
Episode:2150 meanR:-103.5403 R:-101.8980 gloss:55.4164 dloss:259.6176
Episode:2151 meanR:-103.5361 R:-101.8390 gloss:44.7496 dloss:231.5821
Episode:2152 meanR:-103.5337 R:-106.1610 gloss:45.8443 dloss:234.3082
Episode:2153 meanR:-103.5305 R:-101.7644 gloss:45.5501 dloss:225.3607
Episode:2154 meanR:-103.5470 R:-104.3375 gloss:55.3199 dloss:284.9768
Episode:2155 meanR:-103.4931 R:-102.0346 gloss:47.9127 dloss:233.1966
Episode:2156 meanR:-103.4439 R:-101.7827 gloss:41.4339 dloss:207.6706
Episode:2157 meanR:-103.4311 R:-102.1267 gloss:47.7136 dloss:235.5369
Episode:2158 meanR:-

Episode:2262 meanR:-103.3275 R:-104.1020 gloss:61.6813 dloss:304.9070
Episode:2263 meanR:-103.3324 R:-103.3711 gloss:46.1345 dloss:250.6414
Episode:2264 meanR:-103.3336 R:-101.8848 gloss:46.9463 dloss:240.4747
Episode:2265 meanR:-103.2984 R:-101.9841 gloss:52.9177 dloss:254.6155
Episode:2266 meanR:-103.3197 R:-104.5243 gloss:55.1620 dloss:287.5439
Episode:2267 meanR:-103.3147 R:-102.5075 gloss:42.7059 dloss:216.5253
Episode:2268 meanR:-103.3549 R:-106.3572 gloss:43.8743 dloss:219.4717
Episode:2269 meanR:-103.3601 R:-102.1495 gloss:47.9921 dloss:242.6031
Episode:2270 meanR:-103.3534 R:-102.1202 gloss:49.0359 dloss:246.0470
Episode:2271 meanR:-103.3547 R:-102.8280 gloss:51.0688 dloss:259.0297
Episode:2272 meanR:-103.3164 R:-102.2623 gloss:45.7361 dloss:240.9114
Episode:2273 meanR:-103.3466 R:-104.9495 gloss:49.1642 dloss:259.8603
Episode:2274 meanR:-103.3540 R:-102.1766 gloss:44.3128 dloss:234.7034
Episode:2275 meanR:-103.4052 R:-106.4490 gloss:49.8981 dloss:247.8972
Episode:2276 meanR:-

Episode:2380 meanR:-103.1842 R:-101.4203 gloss:51.8111 dloss:274.0887
Episode:2381 meanR:-103.4273 R:-131.0352 gloss:48.2550 dloss:244.5912
Episode:2382 meanR:-103.3814 R:-102.1079 gloss:54.3167 dloss:266.9533
Episode:2383 meanR:-103.4154 R:-106.4748 gloss:42.9060 dloss:233.2170
Episode:2384 meanR:-103.3597 R:-101.0522 gloss:49.2654 dloss:247.3414
Episode:2385 meanR:-103.3953 R:-105.7856 gloss:49.8390 dloss:250.5961
Episode:2386 meanR:-103.7225 R:-134.8893 gloss:49.6047 dloss:250.9867
Episode:2387 meanR:-103.7171 R:-102.4919 gloss:45.3179 dloss:231.8934
Episode:2388 meanR:-103.7432 R:-105.4419 gloss:39.8130 dloss:215.4687
Episode:2389 meanR:-103.7349 R:-101.0921 gloss:54.6504 dloss:272.4432
Episode:2390 meanR:-103.7188 R:-100.9495 gloss:48.2860 dloss:246.8325
Episode:2391 meanR:-103.6990 R:-100.4044 gloss:44.0551 dloss:228.3575
Episode:2392 meanR:-103.6745 R:-100.4006 gloss:47.7021 dloss:235.8895
Episode:2393 meanR:-103.6677 R:-101.3668 gloss:53.4741 dloss:271.3164
Episode:2394 meanR:-

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(gloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(dloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

## Testing

Let's checkout how our trained agent plays the game.

In [85]:
import gym
env = gym.make('BipedalWalker-v2')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Episodes/epochs
    for _ in range(1):
        state = env.reset()
        total_reward = 0

        # Steps/batches
        while True:
            env.render()
            action_preds = sess.run(model.actions_preds, feed_dict={model.states: state.reshape([1, -1])})
            action = np.reshape(action_preds, [-1]) # For continuous action space
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                print('total_reward: {}'.format(total_reward))
                break
# End the env                
env.close()

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt
total_reward: -103.43347330792993


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.