# Deep cortical reinforcement learning: Policy gradients + Q-learning + GAN


In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [63]:
# Check which TensorFlow version is installed and whether a GPU is available.
# If no GPU is found, TensorFlow runs on the CPU.
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included `gym` as a submodule, so you can run `git submodule update --init --recursive` to pull the contents into the `gym` directory.

>**Note:** Make sure you have OpenAI Gym cloned. Then run this command: `pip install -e gym/[all]`.

In [64]:
import gym

## Create the Cart-Pole game environment
# env = gym.make('CartPole-v0')
# env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('MountainCarContinuous-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

# Discrete (int) or continuous (float) action space?
env.action_space.dtype

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




dtype('float32')

We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing an action to `env.step` generates the next step in the simulation. You can see which actions are possible from `env.action_space`, and you can get a random action with `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as the integers 0 and 1. In an environment with a continuous action space, such as the `BipedalWalker-v2` environment created above, an action is instead an array of floats (here, four values).

Run the code below to watch the simulation run.

In [65]:
env.observation_space

Box(24,)

In [66]:
state = env.reset()
batch = []
for _ in range(1111):
    #env.render()
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action) # take a random action
    batch.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done:
        state = env.reset()

To shut the window showing the simulation, use `env.close()`.

In [67]:
# env.close()

If you ran the simulation above, we can look at the rewards:

In [68]:
batch[0], 
batch[0][1].shape, state.shape

((4,), (24,))

In [69]:
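import numpy as np
# NOTE: each row of `batch` is [state, action, next_state, reward, done],
# so the names below are shifted relative to that layout:
actions = np.array([each[0] for each in batch])  # holds the states (24 values each)
states = np.array([each[1] for each in batch])   # holds the actions (4 values each)
rewards = np.array([each[2] for each in batch])  # holds the next states
dones = np.array([each[3] for each in batch])    # holds the scalar rewards
infos = np.array([each[4] for each in batch])    # holds the done flags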

In [70]:
# print(rewards[-20:])
print('shapes:', np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print('dtypes:', np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('states:', np.max(np.array(states)), np.min(np.array(states)))
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print('rewards:', np.max(np.array(rewards)), np.min(np.array(rewards)))

shapes: (1111, 24) (1111, 4) (1111, 24) (1111,)
dtypes: float64 float32 float64 float64
states: 0.9999491 -0.99980396
actions: 2.0423901875813804 -1.386185646057129
rewards: 2.0423901875813804 -1.386185646057129


In [71]:
actions[:10]

array([[ 2.74739973e-03, -1.31684646e-05,  1.02436312e-03,
        -1.59999108e-02,  9.19683278e-02, -1.35180948e-03,
         8.60266164e-01,  2.40187673e-03,  1.00000000e+00,
         3.23748849e-02, -1.35171739e-03,  8.53814155e-01,
         9.56013566e-04,  1.00000000e+00,  4.40814108e-01,
         4.45820212e-01,  4.61422890e-01,  4.89550292e-01,
         5.34102917e-01,  6.02461159e-01,  7.09149063e-01,
         8.85932028e-01,  1.00000000e+00,  1.00000000e+00],
       [ 1.81099847e-02,  2.27325392e-02,  1.50036299e-02,
         1.87298584e-02, -3.19512904e-01, -9.25155520e-01,
         1.48349273e+00,  9.93954817e-01,  1.00000000e+00,
         2.91940302e-01,  1.52260810e-01,  1.36669636e-01,
        -9.99908447e-01,  1.00000000e+00,  4.51477766e-01,
         4.56604987e-01,  4.72585082e-01,  5.01392901e-01,
         5.47023296e-01,  6.17035210e-01,  7.26303935e-01,
         9.07363415e-01,  1.00000000e+00,  1.00000000e+00],
       [ 3.07344068e-02,  2.62205172e-02,  1.24771661e

In [72]:
rewards[:10]

array([[ 1.81099847e-02,  2.27325392e-02,  1.50036299e-02,
         1.87298584e-02, -3.19512904e-01, -9.25155520e-01,
         1.48349273e+00,  9.93954817e-01,  1.00000000e+00,
         2.91940302e-01,  1.52260810e-01,  1.36669636e-01,
        -9.99908447e-01,  1.00000000e+00,  4.51477766e-01,
         4.56604987e-01,  4.72585082e-01,  5.01392901e-01,
         5.47023296e-01,  6.17035210e-01,  7.26303935e-01,
         9.07363415e-01,  1.00000000e+00,  1.00000000e+00],
       [ 3.07344068e-02,  2.62205172e-02,  1.24771661e-02,
        -8.54406476e-03, -8.12581480e-02, -8.26262236e-01,
         1.10302541e+00,  0.00000000e+00,  1.00000000e+00,
         3.13992411e-01,  3.77926439e-01,  4.68376875e-02,
        -1.00000008e+00,  0.00000000e+00,  4.53954518e-01,
         4.59109843e-01,  4.75177616e-01,  5.04143476e-01,
         5.50024211e-01,  6.20420158e-01,  7.30288327e-01,
         9.12341118e-01,  1.00000000e+00,  1.00000000e+00],
       [ 3.42181735e-02,  6.00428879e-03,  1.74115118e

In [73]:
# import numpy as np
def sigmoid(x, derivative=False):
    # With derivative=True, x is expected to already be a sigmoid output s, giving s * (1 - s)
    return x * (1 - x) if derivative else 1 / (1 + np.exp(-x))

In [74]:
sigmoid(np.max(np.array(rewards))), sigmoid(np.min(np.array(rewards)))

(0.8851764282630248, 0.20001739497735352)

In [75]:
print('rewards:', np.max(np.array(rewards))/100, np.min(np.array(rewards))/100)

rewards: 0.020423901875813805 -0.013861856460571288


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
$$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount factor for future rewards, and $s'$ is the next state reached from state $s$ by taking action $a$.
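
To make the right-hand side of the equation concrete, here is a minimal sketch in plain Python that computes $r + \gamma \max_{a'} Q(s', a')$ for one transition. The function name `td_target` and the numbers are illustrative assumptions, not part of this notebook's model; the `done` flag reflects the usual convention that a terminal state has no future value.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Return r + gamma * max_a' Q(s', a'), or just r if the episode ended."""
    if done:
        # No next state to bootstrap from once the episode is over.
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical transition: reward 1.0, and the next state has two actions
# whose current Q estimates are 2.0 and 3.5.
target = td_target(reward=1.0, next_q_values=[2.0, 3.5], gamma=0.99)
print(target)  # roughly 1.0 + 0.99 * 3.5
```

The larger of the two next-state estimates is the one that enters the target, since the agent is assumed to act greedily from $s'$ onward.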

Previously, we used this equation to learn values for a Q-_table_. For this game, however, the number of possible states is huge. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so, ignoring floating-point precision, there are practically infinitely many states. Instead of using a table, we'll use a neural network that approximates the Q-table lookup function.
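
For comparison, a tabular update can be sketched as below. This is an illustration under assumptions: states are small hashable values, the table is a `defaultdict`, and the step size `lr` is a hypothetical parameter that does not appear in this notebook.

```python
from collections import defaultdict


def q_table_update(q_table, state, action, reward, next_state, actions,
                   lr=0.1, gamma=0.99):
    """Move Q(state, action) a fraction lr toward the Bellman target."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_table[(state, action)] += lr * (target - q_table[(state, action)])
    return q_table[(state, action)]


# Every entry starts at 0.0; one update after receiving reward 1.0.
q_table = defaultdict(float)
new_value = q_table_update(q_table, state=0, action=1, reward=1.0,
                           next_state=1, actions=[0, 1])
print(new_value)
```

A table like this needs one entry per (state, action) pair, which is exactly what stops working once the state is a vector of real numbers.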

<img src="assets/deep-q-learning.png" width=450px>

Now our Q-value, $Q(s, a)$, is calculated by passing a state into the network. The network uses fully connected hidden layers, and its output is one Q-value for each available action.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
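
Over a batch of transitions, the quantity being minimized is the mean of these squared differences. A small sketch in plain Python, with a hypothetical helper name and made-up numbers:

```python
def mean_squared_td_error(q_values, targets):
    """Average of (target - Q)^2 over a batch of transitions."""
    if len(q_values) != len(targets):
        raise ValueError("q_values and targets must have the same length")
    squared_errors = [(t - q) ** 2 for q, t in zip(q_values, targets)]
    return sum(squared_errors) / len(squared_errors)


# Three hypothetical transitions: current estimates vs. their targets.
loss = mean_squared_td_error(q_values=[1.0, 2.0, 0.5],
                             targets=[1.5, 2.0, -0.5])
print(loss)
```

The targets are treated as fixed numbers here; only the estimates $Q(s,a)$ would be adjusted by the optimizer.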

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

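Below is my implementation. The generator proposes actions and the discriminator plays the role of the Q-network; each uses two fully connected hidden layers with batch normalization and leaky ReLU activations. Two hidden layers seem to be good enough; three might be better. Feel free to try it out.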

In [76]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.float32, [None, action_size], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    return states, actions, targetQs

In [77]:
# How to use batch-norm
#   x_norm = tf.layers.batch_normalization(x, training=training)

#   # ...

#   update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
#   with tf.control_dependencies(update_ops):
#     train_op = optimizer.minimize(loss)

In [78]:
# training: Either a Python boolean, or a TensorFlow boolean scalar tensor (e.g. a placeholder). 
# Whether to return the output in: 
# training mode (normalized with statistics of the current batch) or 
# inference mode (normalized with moving statistics). 
# NOTE: make sure to set this parameter correctly, or else your training/inference will not work properly.
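
The difference between the two modes can be illustrated with a simplified, plain-Python version of batch normalization for a single feature. This sketch leaves out the learned scale and offset, and the `momentum` and `eps` defaults are assumed to mirror those of `tf.layers.batch_normalization`; the function itself is hypothetical.

```python
import math


def batch_norm(values, moving_mean, moving_var, training,
               momentum=0.99, eps=1e-3):
    """Normalize one feature; return (normalized, moving_mean, moving_var)."""
    if training:
        # Training mode: use the statistics of the current batch
        # and fold them into the moving averages.
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        moving_mean = momentum * moving_mean + (1 - momentum) * mean
        moving_var = momentum * moving_var + (1 - momentum) * var
    else:
        # Inference mode: use the stored moving statistics unchanged.
        mean, var = moving_mean, moving_var
    normalized = [(v - mean) / math.sqrt(var + eps) for v in values]
    return normalized, moving_mean, moving_var


batch_values = [10.0, 12.0, 14.0]
train_out, new_mean, new_var = batch_norm(batch_values, 0.0, 1.0, training=True)
infer_out, same_mean, same_var = batch_norm(batch_values, 0.0, 1.0, training=False)
print(train_out)  # centered on zero
print(infer_out)  # barely changed, since the moving stats are still 0 and 1
```

With freshly initialized moving statistics (mean 0, variance 1), inference mode leaves the inputs almost untouched, which is worth keeping in mind when a network is built with `training=False`.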

In [79]:
# Generator/Controller: Generating/predicting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits
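
The expression `tf.maximum(alpha * bn1, bn1)` used above is a leaky ReLU. A scalar sketch in plain Python (the function name is illustrative) shows what it does for a slope `alpha` between 0 and 1:

```python
def leaky_relu(x, alpha=0.1):
    """max(alpha * x, x): identity for x >= 0, slope alpha for x < 0."""
    return max(alpha * x, x)


for value in (-2.0, -0.5, 0.0, 3.0):
    print(value, leaky_relu(value))
```

Unlike a plain ReLU, negative inputs are scaled down rather than zeroed, so some gradient still flows through units whose pre-activation is negative.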

In [80]:
# Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [81]:
batch[1][1], batch[2][1], np.square(batch[1][1] - batch[2][1])

(array([-0.6857498 , -0.1863178 ,  0.5463643 , -0.48135957], dtype=float32),
 array([ 0.76391673,  0.2900372 , -0.20683765, -0.2392354 ], dtype=float32),
 array([2.101533  , 0.22691411, 0.5673132 , 0.05862411], dtype=float32))

In [82]:
def model_loss(action_size, hidden_size, states, actions, targetQs):
    # G
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    #actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    actions_labels = tf.nn.sigmoid(actions)
    # neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
    #                                                                   labels=actions_labels)
    neg_log_prob_actions = tf.nn.sigmoid_cross_entropy_with_logits(logits=actions_logits, 
                                                                   labels=actions_labels)
    #g_loss = tf.reduce_mean(neg_log_prob_actions * targetQs) # error!
    
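    # D: Qs has shape [batch, 1] because the discriminator's output layer has one unit
    Qs = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    # Flatten Qs to [batch] before comparing with targetQs; subtracting [batch] from
    # [batch, 1] directly would broadcast to [batch, batch]
    d_loss = tf.reduce_mean(tf.square(tf.reshape(Qs, [-1]) - targetQs))
    g_loss = tf.reduce_mean(neg_log_prob_actions * Qs)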
    return actions_logits, Qs, g_loss, d_loss
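
One detail worth checking in a loss like this is array shape. The discriminator's output layer has `units=1`, so `Qs` has shape `(N, 1)`, while `targetQs` is fed as a vector of shape `(N,)`. Subtracting arrays with these two shapes broadcasts to `(N, N)` instead of comparing element by element, so the two need to share a shape before the squared error is taken. The NumPy sketch below uses made-up values to show the effect; TensorFlow follows the same broadcasting rules.

```python
import numpy as np

q_values = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), like a units=1 dense output
targets = np.array([1.0, 2.0, 3.0])         # shape (3,), like the targetQs placeholder

# Broadcasting pairs every Q-value with every target: shape (3, 3).
broadcast_diff = q_values - targets
# Flattening first gives the intended element-wise comparison: shape (3,).
aligned_diff = q_values.reshape(-1) - targets

broadcast_loss = np.mean(np.square(broadcast_diff))
aligned_loss = np.mean(np.square(aligned_diff))
print(broadcast_diff.shape, aligned_diff.shape)
print(broadcast_loss, aligned_loss)
```

Here the predictions equal the targets exactly, so the aligned loss is zero, while the broadcast version reports a non-zero loss.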

In [90]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [91]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs = model_input(state_size=state_size, action_size=action_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, targetQs=self.targetQs) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [92]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
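
The `Memory` class above only stores experiences. A uniform mini-batch sampler, like the `memory.sample(batch_size)` call that appears commented out in the training loop, could look like the sketch below. The class name `ReplayMemory` and the `add` method are hypothetical and are not used elsewhere in this notebook.

```python
import random
from collections import deque


class ReplayMemory:
    def __init__(self, max_size=1000):
        # A deque with maxlen drops the oldest experience once it is full.
        self.buffer = deque(maxlen=max_size)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Draw distinct positions uniformly at random, without replacement.
        batch_size = min(batch_size, len(self.buffer))
        indices = random.sample(range(len(self.buffer)), batch_size)
        return [self.buffer[i] for i in indices]


memory = ReplayMemory(max_size=5)
for step in range(8):
    # (state, action, next_state, reward, done) with placeholder values
    memory.add((step, "action", step + 1, 0.0, 0.0))
mini_batch = memory.sample(3)
print(len(memory.buffer), len(mini_batch))
```

Sampling at random breaks up the correlation between consecutive steps, which is the usual reason for training on a replay buffer rather than on the most recent transitions.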

## Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're also tuning the simulation.

In [93]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1111, 4) actions:(1111, 24)
action size:4.42857583363851


In [94]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
# state_size = 37
# state_size_ = (84, 84, 3)
state_size = 24
action_size = 4
hidden_size = 24*2             # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
gamma = 0.99                   # future reward discount
memory_size = 1000            # memory capacity
batch_size = 1000             # experience mini-batch size
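
To get a feel for the exploration parameters, the sketch below evaluates the exponential decay formula that the training loop applies to `explore_start`, `explore_stop`, and `decay_rate`. The helper name `explore_probability` and the chosen step counts are illustrative.

```python
import math

explore_start = 1.0   # exploration probability at start
explore_stop = 0.01   # minimum exploration probability
decay_rate = 0.0001   # exponential decay rate for exploration prob


def explore_probability(total_step):
    """Probability of taking a random action after total_step environment steps."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * total_step)


for step in (0, 1600, 10000, 50000):
    print(step, round(explore_probability(step), 4))
```

At step 1600 this gives about 0.8536, which would match the `exploreP` value logged after episode 0 if that episode ran for 1600 steps. The probability then decays toward `explore_stop` and never drops below it.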

In [95]:
# Reset/init the graph/session
tf.reset_default_graph()         # returns None, so fetch the fresh default graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [96]:
state = env.reset()
for _ in range(memory_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done:
        state = env.reset()

## Training the model

Below we'll train our agent. If you want to watch it train, add an `env.render()` call inside the step loop. This slows things down because rendering frames is slower than training the network. But it's cool to watch the agent get better at the game.

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list, gloss_list, dloss_list = [], [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        gloss_batch, dloss_batch = [], []
        state = env.reset()

        # Training steps/batches
        while True:
            # Explore (Env) or Exploit (Model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                action = env.action_space.sample()
            else:
                action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
                action = np.reshape(action_logits, [-1]) # For continuous action space
                #action = np.argmax(action_logits) # For discrete action space
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            total_reward += reward
            state = next_state

            # Training
            #batch = memory.sample(batch_size)
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            nextQs_logits = sess.run(model.Qs_logits, feed_dict = {model.states: next_states})
            #nextQs = np.max(nextQs_logits, axis=1) * (1-dones)
            nextQs = nextQs_logits.reshape([-1]) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            gloss, dloss, _, _ = sess.run([model.g_loss, model.d_loss, model.g_opt, model.d_opt],
                                            feed_dict = {model.states: states, 
                                                         model.actions: actions,
                                                         model.targetQs: targetQs})
            gloss_batch.append(gloss)
            dloss_batch.append(dloss)
            if done:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dloss:{:.4f}'.format(np.mean(dloss_batch)),
              'exploreP:{:.4f}'.format(explore_p))
        # Plotting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) <= -150:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model2.ckpt')

Episode:0 meanR:-82.3067 R:-82.3067 gloss:-134.2165 dloss:1.8306 exploreP:0.8536
Episode:1 meanR:-91.3119 R:-100.3171 gloss:-388.0928 dloss:0.2071 exploreP:0.8484
Episode:2 meanR:-100.4618 R:-118.7616 gloss:-753.1843 dloss:9.0228 exploreP:0.8412
Episode:3 meanR:-100.9111 R:-102.2588 gloss:-2239.4990 dloss:8.6724 exploreP:0.7183
Episode:4 meanR:-106.8413 R:-130.5623 gloss:-1098.4009 dloss:0.0550 exploreP:0.6136
Episode:5 meanR:-107.8460 R:-112.8692 gloss:-1225.1321 dloss:0.1985 exploreP:0.6091
Episode:6 meanR:-108.6487 R:-113.4653 gloss:-2143.2043 dloss:8.4064 exploreP:0.6054
Episode:7 meanR:-109.5934 R:-116.2061 gloss:-3168.9819 dloss:14.7852 exploreP:0.6002
Episode:8 meanR:-109.6470 R:-110.0754 gloss:-4279.8823 dloss:19.6476 exploreP:0.5953
Episode:9 meanR:-109.9836 R:-113.0138 gloss:-5097.3779 dloss:23.9260 exploreP:0.5917
Episode:10 meanR:-109.9658 R:-109.7870 gloss:-5920.8799 dloss:27.3604 exploreP:0.5894
Episode:11 meanR:-110.5111 R:-116.5096 gloss:-6912.3984 dloss:29.5833 explore

Episode:95 meanR:-117.6296 R:-118.9551 gloss:-33751.5273 dloss:19.4492 exploreP:0.1591
Episode:96 meanR:-117.6009 R:-114.8476 gloss:-33846.7148 dloss:18.8796 exploreP:0.1573
Episode:97 meanR:-117.5728 R:-114.8409 gloss:-32216.5254 dloss:19.1349 exploreP:0.1559
Episode:98 meanR:-117.5868 R:-118.9664 gloss:-30184.0801 dloss:18.5370 exploreP:0.1535
Episode:99 meanR:-117.5937 R:-118.2697 gloss:-29020.0664 dloss:18.4548 exploreP:0.1514
Episode:100 meanR:-117.9392 R:-116.8619 gloss:-27615.7617 dloss:18.9935 exploreP:0.1490
Episode:101 meanR:-118.1338 R:-119.7713 gloss:-24376.9941 dloss:18.3554 exploreP:0.1470
Episode:102 meanR:-118.0810 R:-113.4856 gloss:-22511.0645 dloss:18.6199 exploreP:0.1461
Episode:103 meanR:-118.2127 R:-115.4286 gloss:-23178.4844 dloss:19.1605 exploreP:0.1449
Episode:104 meanR:-118.0761 R:-116.9018 gloss:-22717.4043 dloss:18.8696 exploreP:0.1431
Episode:105 meanR:-118.1156 R:-116.8202 gloss:-23596.7637 dloss:18.7408 exploreP:0.1411
Episode:106 meanR:-118.2102 R:-122.92

Episode:189 meanR:-134.7846 R:-181.0995 gloss:-274.9923 dloss:0.0759 exploreP:0.0115
Episode:190 meanR:-135.3347 R:-171.9956 gloss:-1089.4066 dloss:0.3037 exploreP:0.0115
Episode:191 meanR:-135.4302 R:-130.7280 gloss:-5541.2095 dloss:8.1978 exploreP:0.0114
Episode:192 meanR:-135.4824 R:-119.1651 gloss:-6424.5894 dloss:16.1300 exploreP:0.0114
Episode:193 meanR:-135.5225 R:-129.8688 gloss:-17639.8809 dloss:22.8825 exploreP:0.0114
Episode:194 meanR:-135.6330 R:-129.9176 gloss:-42471.4414 dloss:26.8103 exploreP:0.0114
Episode:195 meanR:-135.6413 R:-119.7900 gloss:-60710.2500 dloss:33.3036 exploreP:0.0114
Episode:196 meanR:-135.6923 R:-119.9498 gloss:-58488.7148 dloss:29.9767 exploreP:0.0114
Episode:197 meanR:-135.7440 R:-120.0108 gloss:-54161.4219 dloss:33.4855 exploreP:0.0114
Episode:198 meanR:-135.7545 R:-120.0103 gloss:-55681.3789 dloss:35.8326 exploreP:0.0114
Episode:199 meanR:-135.7714 R:-119.9577 gloss:-57414.9219 dloss:32.4272 exploreP:0.0114
Episode:200 meanR:-135.8021 R:-119.9403 

Episode:284 meanR:-128.9313 R:-123.5110 gloss:-4116.3955 dloss:7.3530 exploreP:0.0107
Episode:285 meanR:-128.3659 R:-123.4096 gloss:-4614.1499 dloss:5.0619 exploreP:0.0107
Episode:286 meanR:-127.7930 R:-123.5606 gloss:-4726.7275 dloss:7.2756 exploreP:0.0107
Episode:287 meanR:-127.2231 R:-123.5904 gloss:-4993.6323 dloss:5.8186 exploreP:0.0107
Episode:288 meanR:-126.6591 R:-124.2789 gloss:-4771.0806 dloss:6.7263 exploreP:0.0107
Episode:289 meanR:-126.0837 R:-123.5595 gloss:-4821.4277 dloss:5.0775 exploreP:0.0107
Episode:290 meanR:-125.5994 R:-123.5582 gloss:-4722.2983 dloss:5.0578 exploreP:0.0107
Episode:291 meanR:-125.5266 R:-123.4568 gloss:-5101.2012 dloss:6.6891 exploreP:0.0107
Episode:292 meanR:-125.5707 R:-123.5694 gloss:-5311.0103 dloss:5.4038 exploreP:0.0107
Episode:293 meanR:-125.5074 R:-123.5440 gloss:-5316.4429 dloss:7.1010 exploreP:0.0107
Episode:294 meanR:-125.4450 R:-123.6750 gloss:-5106.9434 dloss:5.3758 exploreP:0.0107
Episode:295 meanR:-125.4833 R:-123.6199 gloss:-4971.72

Episode:379 meanR:-123.4578 R:-123.5285 gloss:-4149.0620 dloss:7.8656 exploreP:0.0105
Episode:380 meanR:-123.4540 R:-123.5626 gloss:-4164.6509 dloss:6.7130 exploreP:0.0105
Episode:381 meanR:-123.4508 R:-123.5891 gloss:-4174.3540 dloss:6.9395 exploreP:0.0105
Episode:382 meanR:-123.4469 R:-123.5667 gloss:-4218.5254 dloss:8.0204 exploreP:0.0105
Episode:383 meanR:-123.4475 R:-123.5610 gloss:-4486.2627 dloss:7.4002 exploreP:0.0105
Episode:384 meanR:-123.4477 R:-123.5370 gloss:-4159.6094 dloss:7.0571 exploreP:0.0105
Episode:385 meanR:-123.4488 R:-123.5133 gloss:-3589.9290 dloss:6.8889 exploreP:0.0105
Episode:386 meanR:-123.4338 R:-122.0626 gloss:-4105.4668 dloss:7.2557 exploreP:0.0105
Episode:387 meanR:-123.4329 R:-123.5041 gloss:-4085.1243 dloss:18.9537 exploreP:0.0105
Episode:388 meanR:-123.3207 R:-113.0544 gloss:-4720.0303 dloss:27.7959 exploreP:0.0104
Episode:389 meanR:-123.3203 R:-123.5256 gloss:-4476.3755 dloss:32.5143 exploreP:0.0104
Episode:390 meanR:-123.3200 R:-123.5262 gloss:-4006

Episode:474 meanR:-127.8834 R:-122.3754 gloss:-5610.1060 dloss:18.5470 exploreP:0.0101
Episode:475 meanR:-127.8477 R:-119.9690 gloss:-5234.2212 dloss:19.8820 exploreP:0.0101
Episode:476 meanR:-127.8178 R:-120.4510 gloss:-5779.4463 dloss:21.3239 exploreP:0.0101
Episode:477 meanR:-127.7828 R:-119.9762 gloss:-6035.4858 dloss:18.2361 exploreP:0.0101
Episode:478 meanR:-127.7470 R:-120.0046 gloss:-5441.0098 dloss:19.8518 exploreP:0.0101
Episode:479 meanR:-127.7167 R:-120.4974 gloss:-5638.6938 dloss:19.4225 exploreP:0.0101
Episode:480 meanR:-127.6808 R:-119.9775 gloss:-5921.9736 dloss:19.5766 exploreP:0.0101
Episode:481 meanR:-127.6446 R:-119.9658 gloss:-5720.7236 dloss:19.1818 exploreP:0.0101
Episode:482 meanR:-127.6091 R:-120.0112 gloss:-5595.4419 dloss:26.9110 exploreP:0.0101
Episode:483 meanR:-127.5732 R:-119.9785 gloss:-5405.3291 dloss:20.6690 exploreP:0.0101
Episode:484 meanR:-127.5423 R:-120.4448 gloss:-5526.2222 dloss:19.5873 exploreP:0.0101
Episode:485 meanR:-127.5063 R:-119.9151 glo

Episode:568 meanR:-124.8729 R:-119.9807 gloss:-6676.3428 dloss:18.7046 exploreP:0.0101
Episode:569 meanR:-124.7429 R:-119.8845 gloss:-3698.7131 dloss:16.7519 exploreP:0.0101
Episode:570 meanR:-124.6860 R:-124.4775 gloss:-5161.1685 dloss:16.3219 exploreP:0.0101
Episode:571 meanR:-124.6901 R:-130.2228 gloss:-6046.9365 dloss:16.6836 exploreP:0.0101
Episode:572 meanR:-124.6903 R:-130.5727 gloss:-5895.8096 dloss:16.3136 exploreP:0.0101
Episode:573 meanR:-124.6205 R:-122.9920 gloss:-6036.1543 dloss:16.6048 exploreP:0.0101
Episode:574 meanR:-124.6970 R:-130.0243 gloss:-7399.1216 dloss:16.4182 exploreP:0.0101
Episode:575 meanR:-124.9397 R:-144.2443 gloss:-7536.6636 dloss:15.1298 exploreP:0.0101
Episode:576 meanR:-125.0366 R:-130.1371 gloss:-5993.4722 dloss:15.0194 exploreP:0.0101
Episode:577 meanR:-125.1148 R:-127.7952 gloss:-5491.2061 dloss:14.7680 exploreP:0.0101
Episode:578 meanR:-125.2143 R:-129.9566 gloss:-6589.7646 dloss:15.2925 exploreP:0.0101
Episode:579 meanR:-125.3100 R:-130.0705 glo

Episode:663 meanR:-126.0622 R:-127.3173 gloss:-7251.0181 dloss:21.2783 exploreP:0.0100
Episode:664 meanR:-126.1373 R:-127.5016 gloss:-7111.3076 dloss:21.5213 exploreP:0.0100
Episode:665 meanR:-126.1825 R:-127.4932 gloss:-6621.8335 dloss:21.4214 exploreP:0.0100
Episode:666 meanR:-126.2560 R:-127.3620 gloss:-6240.5181 dloss:21.2639 exploreP:0.0100
Episode:667 meanR:-126.3314 R:-127.5101 gloss:-6137.0942 dloss:21.3385 exploreP:0.0100
Episode:668 meanR:-126.4066 R:-127.5052 gloss:-6210.1016 dloss:21.5497 exploreP:0.0100
Episode:669 meanR:-126.4884 R:-128.0651 gloss:-5875.4912 dloss:22.5859 exploreP:0.0100
Episode:670 meanR:-126.5239 R:-128.0229 gloss:-5887.8164 dloss:34.0436 exploreP:0.0100
Episode:671 meanR:-126.4949 R:-127.3304 gloss:-5439.1978 dloss:22.2426 exploreP:0.0100
Episode:672 meanR:-126.4639 R:-127.4648 gloss:-5455.0918 dloss:22.3087 exploreP:0.0100
Episode:673 meanR:-126.5085 R:-127.4537 gloss:-5454.4131 dloss:23.9621 exploreP:0.0100
Episode:674 meanR:-126.4835 R:-127.5242 glo

Episode:758 meanR:-127.6375 R:-127.9043 gloss:-6950.0908 dloss:5.7106 exploreP:0.0100
Episode:759 meanR:-127.6383 R:-128.0961 gloss:-5255.4800 dloss:5.2642 exploreP:0.0100
Episode:760 meanR:-127.6483 R:-128.9901 gloss:-5336.2158 dloss:7.8199 exploreP:0.0100
Episode:761 meanR:-127.6548 R:-128.1197 gloss:-5009.9614 dloss:5.4331 exploreP:0.0100
Episode:762 meanR:-127.6555 R:-128.0786 gloss:-4923.8428 dloss:9.7801 exploreP:0.0100
Episode:763 meanR:-127.6609 R:-127.8575 gloss:-5008.9263 dloss:5.5881 exploreP:0.0100
Episode:764 meanR:-127.6642 R:-127.8354 gloss:-6187.5557 dloss:5.3461 exploreP:0.0100
Episode:765 meanR:-127.6698 R:-128.0557 gloss:-6924.0117 dloss:15.1519 exploreP:0.0100
Episode:766 meanR:-127.6768 R:-128.0605 gloss:-7660.7227 dloss:6.2426 exploreP:0.0100
Episode:767 meanR:-127.6805 R:-127.8772 gloss:-7450.4858 dloss:5.2489 exploreP:0.0100
Episode:768 meanR:-127.6863 R:-128.0893 gloss:-7741.5103 dloss:5.1548 exploreP:0.0100
Episode:769 meanR:-127.6857 R:-128.0065 gloss:-7540.2

Episode:852 meanR:-127.1294 R:-123.9055 gloss:-11765.8779 dloss:11.3964 exploreP:0.0100
Episode:853 meanR:-127.0883 R:-123.9269 gloss:-11748.7529 dloss:12.6244 exploreP:0.0100
Episode:854 meanR:-127.0479 R:-124.0290 gloss:-11776.3301 dloss:7.9688 exploreP:0.0100
Episode:855 meanR:-127.0356 R:-123.9035 gloss:-12025.8408 dloss:17.2796 exploreP:0.0100
Episode:856 meanR:-126.9961 R:-124.0197 gloss:-11509.0146 dloss:8.6767 exploreP:0.0100
Episode:857 meanR:-126.9529 R:-123.6386 gloss:-11527.5283 dloss:7.6450 exploreP:0.0100
Episode:858 meanR:-126.9101 R:-123.6242 gloss:-11535.0244 dloss:15.6281 exploreP:0.0100
Episode:859 meanR:-126.8640 R:-123.4897 gloss:-12699.0029 dloss:8.7248 exploreP:0.0100
Episode:860 meanR:-126.7839 R:-120.9782 gloss:-12942.3428 dloss:16.2609 exploreP:0.0100
Episode:861 meanR:-126.7377 R:-123.4990 gloss:-13599.2158 dloss:9.3465 exploreP:0.0100
Episode:862 meanR:-126.6912 R:-123.4303 gloss:-16386.4316 dloss:10.3255 exploreP:0.0100
Episode:863 meanR:-126.6480 R:-123.53

Episode:946 meanR:-123.6551 R:-123.6202 gloss:-14905.0918 dloss:10.8820 exploreP:0.0100
Episode:947 meanR:-123.6347 R:-122.0209 gloss:-16106.0322 dloss:10.0197 exploreP:0.0100
Episode:948 meanR:-123.6398 R:-123.5812 gloss:-15208.7305 dloss:14.8062 exploreP:0.0100
Episode:949 meanR:-123.6371 R:-123.6368 gloss:-13318.2236 dloss:57.3782 exploreP:0.0100
Episode:950 meanR:-123.6318 R:-123.5275 gloss:-13847.7109 dloss:12.1383 exploreP:0.0100
Episode:951 meanR:-123.6220 R:-123.4965 gloss:-9822.3984 dloss:10.5415 exploreP:0.0100
Episode:952 meanR:-123.6206 R:-123.7617 gloss:-9824.7217 dloss:12.0715 exploreP:0.0100
Episode:953 meanR:-123.6284 R:-124.7109 gloss:-11407.8887 dloss:12.2245 exploreP:0.0100
Episode:954 meanR:-123.6245 R:-123.6375 gloss:-10863.6650 dloss:12.0215 exploreP:0.0100
Episode:955 meanR:-123.6032 R:-121.7771 gloss:-12294.6631 dloss:10.9425 exploreP:0.0100
Episode:956 meanR:-123.5895 R:-122.6469 gloss:-13557.9053 dloss:7.8454 exploreP:0.0100
Episode:957 meanR:-123.5937 R:-124.

Episode:1039 meanR:-122.9292 R:-123.9989 gloss:-16142.2529 dloss:35.9157 exploreP:0.0100
Episode:1040 meanR:-122.9340 R:-123.9938 gloss:-16021.7070 dloss:12.9924 exploreP:0.0100
Episode:1041 meanR:-122.9387 R:-124.0462 gloss:-16083.0215 dloss:8.1909 exploreP:0.0100
Episode:1042 meanR:-122.9436 R:-123.9817 gloss:-16181.0430 dloss:22.5647 exploreP:0.0100
Episode:1043 meanR:-122.9469 R:-123.9297 gloss:-15649.5029 dloss:16.1772 exploreP:0.0100
Episode:1044 meanR:-122.9504 R:-123.9874 gloss:-13532.8320 dloss:7.8452 exploreP:0.0100
Episode:1045 meanR:-122.9538 R:-123.9286 gloss:-12195.8604 dloss:38.7159 exploreP:0.0100
Episode:1046 meanR:-122.9569 R:-123.9243 gloss:-14082.5625 dloss:10.1797 exploreP:0.0100
Episode:1047 meanR:-122.9767 R:-124.0066 gloss:-13183.3340 dloss:5.2812 exploreP:0.0100
Episode:1048 meanR:-122.9815 R:-124.0596 gloss:-13132.2959 dloss:5.4514 exploreP:0.0100
Episode:1049 meanR:-122.9746 R:-122.9410 gloss:-15143.1484 dloss:29.9577 exploreP:0.0100
Episode:1050 meanR:-122.9

Episode:1132 meanR:-123.4594 R:-119.9903 gloss:-18125.6348 dloss:24.6015 exploreP:0.0100
Episode:1133 meanR:-123.4230 R:-119.9909 gloss:-18251.9805 dloss:36.3324 exploreP:0.0100
Episode:1134 meanR:-123.3874 R:-119.9904 gloss:-18697.7891 dloss:47.7583 exploreP:0.0100
Episode:1135 meanR:-123.3472 R:-120.0112 gloss:-18810.1641 dloss:18.1842 exploreP:0.0100
Episode:1136 meanR:-123.3069 R:-119.9900 gloss:-19203.7676 dloss:19.3000 exploreP:0.0100
Episode:1137 meanR:-123.2663 R:-119.9691 gloss:-19313.2480 dloss:94.1610 exploreP:0.0100
Episode:1138 meanR:-123.2271 R:-119.9810 gloss:-22158.4316 dloss:20.4523 exploreP:0.0100
Episode:1139 meanR:-123.1865 R:-119.9394 gloss:-23727.5508 dloss:17.9919 exploreP:0.0100
Episode:1140 meanR:-123.1467 R:-120.0120 gloss:-24397.5820 dloss:17.0624 exploreP:0.0100
Episode:1141 meanR:-123.1057 R:-119.9495 gloss:-23192.7695 dloss:42.6267 exploreP:0.0100
Episode:1142 meanR:-123.0659 R:-119.9946 gloss:-21086.8418 dloss:18.1734 exploreP:0.0100
Episode:1143 meanR:-1

Episode:1224 meanR:-121.5412 R:-119.9565 gloss:-226407.5156 dloss:47.7790 exploreP:0.0100
Episode:1225 meanR:-121.5413 R:-120.0148 gloss:-202677.0938 dloss:44.8937 exploreP:0.0100
Episode:1226 meanR:-121.5713 R:-123.0172 gloss:-95077.3359 dloss:24.6470 exploreP:0.0100
Episode:1227 meanR:-121.6702 R:-129.8640 gloss:-26417.9023 dloss:14.8240 exploreP:0.0100
Episode:1228 meanR:-121.7998 R:-132.9569 gloss:-28643.1855 dloss:14.0229 exploreP:0.0100
Episode:1229 meanR:-121.8412 R:-124.1571 gloss:-25097.1582 dloss:14.6062 exploreP:0.0100
Episode:1230 meanR:-121.8731 R:-123.1781 gloss:-21378.8789 dloss:16.3762 exploreP:0.0100
Episode:1231 meanR:-121.8460 R:-117.7354 gloss:-25189.6191 dloss:15.5879 exploreP:0.0100
Episode:1232 meanR:-121.9271 R:-128.1022 gloss:-24644.3223 dloss:16.4211 exploreP:0.0100
Episode:1233 meanR:-121.9584 R:-123.1133 gloss:-23249.3750 dloss:16.4129 exploreP:0.0100
Episode:1234 meanR:-122.0596 R:-130.1111 gloss:-22450.0605 dloss:16.5182 exploreP:0.0100
Episode:1235 meanR:

Episode:1316 meanR:-121.0590 R:-113.2187 gloss:-92816.6094 dloss:26.6560 exploreP:0.0100
Episode:1317 meanR:-120.9929 R:-113.3550 gloss:-50782.0312 dloss:24.4899 exploreP:0.0100
Episode:1318 meanR:-120.9283 R:-113.5572 gloss:-19597.3242 dloss:21.7414 exploreP:0.0100
Episode:1319 meanR:-120.8559 R:-112.7129 gloss:-19798.7090 dloss:22.1126 exploreP:0.0100
Episode:1320 meanR:-120.7919 R:-113.4456 gloss:-18017.8906 dloss:22.3919 exploreP:0.0100
Episode:1321 meanR:-120.7264 R:-113.4393 gloss:-17226.5352 dloss:22.8304 exploreP:0.0100
Episode:1322 meanR:-120.6619 R:-113.5592 gloss:-17247.8574 dloss:23.7938 exploreP:0.0100
Episode:1323 meanR:-120.5970 R:-113.5198 gloss:-17017.4102 dloss:23.6663 exploreP:0.0100
Episode:1324 meanR:-120.5317 R:-113.4238 gloss:-18493.7734 dloss:24.2382 exploreP:0.0100
Episode:1325 meanR:-120.4655 R:-113.3988 gloss:-18802.7070 dloss:23.4073 exploreP:0.0100
Episode:1326 meanR:-120.3684 R:-113.3035 gloss:-16632.7539 dloss:31.6017 exploreP:0.0100
Episode:1327 meanR:-1

Episode:1409 meanR:-115.4848 R:-113.5484 gloss:-32586.0566 dloss:15.6524 exploreP:0.0100
Episode:1410 meanR:-115.4570 R:-113.4499 gloss:-28917.5078 dloss:17.4608 exploreP:0.0100
Episode:1411 meanR:-115.4262 R:-113.3119 gloss:-31091.7227 dloss:19.3163 exploreP:0.0100
Episode:1412 meanR:-115.3992 R:-113.6152 gloss:-36470.5781 dloss:20.4687 exploreP:0.0100
Episode:1413 meanR:-115.3748 R:-113.4602 gloss:-29517.6543 dloss:30.2577 exploreP:0.0100
Episode:1414 meanR:-115.3643 R:-114.9400 gloss:-57242.7578 dloss:40.2252 exploreP:0.0100
Episode:1415 meanR:-115.4664 R:-123.1675 gloss:-148114.5312 dloss:66.9733 exploreP:0.0100
Episode:1416 meanR:-115.5651 R:-123.0913 gloss:-133203.3125 dloss:31.0399 exploreP:0.0100
Episode:1417 meanR:-115.6627 R:-123.1165 gloss:-155181.4062 dloss:153.4011 exploreP:0.0100
Episode:1418 meanR:-115.7580 R:-123.0920 gloss:-177946.2031 dloss:37.5628 exploreP:0.0100
Episode:1419 meanR:-115.8626 R:-123.1724 gloss:-161912.3438 dloss:34.1072 exploreP:0.0100
Episode:1420 me

Episode:1502 meanR:-124.4352 R:-115.0171 gloss:-29658.3945 dloss:27.4137 exploreP:0.0100
Episode:1503 meanR:-124.3521 R:-114.8668 gloss:-33625.9141 dloss:24.0940 exploreP:0.0100
Episode:1504 meanR:-124.2900 R:-116.9577 gloss:-34521.6211 dloss:25.4322 exploreP:0.0100
Episode:1505 meanR:-124.2098 R:-115.1544 gloss:-37777.0117 dloss:38.8762 exploreP:0.0100
Episode:1506 meanR:-123.4426 R:-114.9994 gloss:-34373.7266 dloss:34.1090 exploreP:0.0100
Episode:1507 meanR:-123.4580 R:-114.8363 gloss:-29645.5898 dloss:22.0265 exploreP:0.0100
Episode:1508 meanR:-123.4768 R:-115.1106 gloss:-28642.6133 dloss:20.6919 exploreP:0.0100
Episode:1509 meanR:-123.4909 R:-114.9601 gloss:-25278.4082 dloss:22.4330 exploreP:0.0100
Episode:1510 meanR:-123.5091 R:-115.2667 gloss:-24233.4141 dloss:41.1317 exploreP:0.0100
Episode:1511 meanR:-123.5298 R:-115.3873 gloss:-23814.0352 dloss:20.9697 exploreP:0.0100
Episode:1512 meanR:-123.5387 R:-114.4989 gloss:-25453.9707 dloss:20.5773 exploreP:0.0100
Episode:1513 meanR:-1

Episode:1595 meanR:-119.5907 R:-179.7669 gloss:-36544.9844 dloss:19.9578 exploreP:0.0100
Episode:1596 meanR:-120.2118 R:-176.9319 gloss:-8197.5908 dloss:6.8634 exploreP:0.0100
Episode:1597 meanR:-120.8052 R:-176.8564 gloss:-6956.3281 dloss:2.7249 exploreP:0.0100
Episode:1598 meanR:-121.4288 R:-177.3815 gloss:-3799.9951 dloss:3.1242 exploreP:0.0100
Episode:1599 meanR:-122.0410 R:-176.6157 gloss:-7353.5405 dloss:2.0670 exploreP:0.0100
Episode:1600 meanR:-122.6541 R:-176.2701 gloss:-5653.8193 dloss:2.0217 exploreP:0.0100
Episode:1601 meanR:-122.5755 R:-107.1684 gloss:-8423.0957 dloss:15.1710 exploreP:0.0100
Episode:1602 meanR:-122.5075 R:-108.2214 gloss:-29112.8555 dloss:23.4694 exploreP:0.0100
Episode:1603 meanR:-123.1274 R:-176.8581 gloss:-22381.6523 dloss:13.3666 exploreP:0.0100
Episode:1604 meanR:-123.7301 R:-177.2276 gloss:-3263.3450 dloss:3.0674 exploreP:0.0100
Episode:1605 meanR:-124.3466 R:-176.8006 gloss:-3539.1106 dloss:2.0862 exploreP:0.0100
Episode:1606 meanR:-124.9717 R:-177.

Episode:1688 meanR:-112.2704 R:-102.6564 gloss:-16486.0117 dloss:26.9168 exploreP:0.0100
Episode:1689 meanR:-112.0641 R:-102.5399 gloss:-17650.5371 dloss:26.3241 exploreP:0.0100
Episode:1690 meanR:-111.8578 R:-102.5413 gloss:-16592.3496 dloss:26.4556 exploreP:0.0100
Episode:1691 meanR:-111.6524 R:-102.5364 gloss:-14064.9551 dloss:26.0435 exploreP:0.0100
Episode:1692 meanR:-111.5247 R:-102.5015 gloss:-13908.7764 dloss:128.1593 exploreP:0.0100
Episode:1693 meanR:-111.3978 R:-102.2184 gloss:-15653.4512 dloss:37.0246 exploreP:0.0100
Episode:1694 meanR:-111.2751 R:-102.5395 gloss:-16221.1689 dloss:20.5966 exploreP:0.0100
Episode:1695 meanR:-110.5038 R:-102.6383 gloss:-16721.5879 dloss:26.2479 exploreP:0.0100
Episode:1696 meanR:-109.7589 R:-102.4432 gloss:-17104.3691 dloss:60.0094 exploreP:0.0100
Episode:1697 meanR:-109.0155 R:-102.5121 gloss:-15541.3633 dloss:27.0155 exploreP:0.0100
Episode:1698 meanR:-108.2672 R:-102.5532 gloss:-14756.0596 dloss:108.9003 exploreP:0.0100
Episode:1699 meanR:

Episode:1781 meanR:-102.6381 R:-103.1204 gloss:-15179.0039 dloss:25.0641 exploreP:0.0100
Episode:1782 meanR:-102.6386 R:-102.5391 gloss:-16877.9297 dloss:26.9426 exploreP:0.0100
Episode:1783 meanR:-102.6283 R:-102.5391 gloss:-15060.0215 dloss:24.2703 exploreP:0.0100
Episode:1784 meanR:-102.6367 R:-102.5432 gloss:-11945.1592 dloss:21.9219 exploreP:0.0100
Episode:1785 meanR:-102.6238 R:-102.5516 gloss:-10205.3555 dloss:22.0316 exploreP:0.0100
Episode:1786 meanR:-102.6238 R:-102.4852 gloss:-9663.7930 dloss:22.3287 exploreP:0.0100
Episode:1787 meanR:-102.6286 R:-103.0213 gloss:-10479.0723 dloss:23.2248 exploreP:0.0100
Episode:1788 meanR:-102.6273 R:-102.5221 gloss:-11321.1328 dloss:28.4577 exploreP:0.0100
Episode:1789 meanR:-102.6272 R:-102.5352 gloss:-10695.1484 dloss:30.2094 exploreP:0.0100
Episode:1790 meanR:-102.6268 R:-102.4955 gloss:-10740.9375 dloss:30.0002 exploreP:0.0100
Episode:1791 meanR:-102.6268 R:-102.5402 gloss:-9806.7627 dloss:29.4351 exploreP:0.0100
Episode:1792 meanR:-102

Episode:1873 meanR:-119.1777 R:-123.5208 gloss:-32437.9941 dloss:13.3692 exploreP:0.0100
Episode:1874 meanR:-119.3879 R:-123.5701 gloss:-31333.2500 dloss:9.2693 exploreP:0.0100
Episode:1875 meanR:-119.6027 R:-123.8846 gloss:-28396.4473 dloss:10.1897 exploreP:0.0100
Episode:1876 meanR:-119.8425 R:-126.4622 gloss:-30091.4980 dloss:10.1962 exploreP:0.0100
Episode:1877 meanR:-120.0582 R:-124.0568 gloss:-31554.4238 dloss:12.2521 exploreP:0.0100
Episode:1878 meanR:-120.2732 R:-124.0477 gloss:-30069.9961 dloss:12.3797 exploreP:0.0100
Episode:1879 meanR:-120.4877 R:-123.9712 gloss:-29444.4277 dloss:12.5253 exploreP:0.0100
Episode:1880 meanR:-120.7022 R:-123.9897 gloss:-29510.4277 dloss:13.1336 exploreP:0.0100
Episode:1881 meanR:-120.9112 R:-124.0214 gloss:-28274.3438 dloss:13.4400 exploreP:0.0100
Episode:1882 meanR:-121.1211 R:-123.5270 gloss:-28240.1133 dloss:14.0135 exploreP:0.0100
Episode:1883 meanR:-121.3308 R:-123.5068 gloss:-26578.4902 dloss:13.9598 exploreP:0.0100
Episode:1884 meanR:-12

Episode:1965 meanR:-122.5025 R:-115.7484 gloss:-76634.3516 dloss:21.0524 exploreP:0.0100
Episode:1966 meanR:-122.4256 R:-115.8601 gloss:-35530.1719 dloss:21.3786 exploreP:0.0100
Episode:1967 meanR:-122.3602 R:-117.0806 gloss:-27584.7559 dloss:21.1756 exploreP:0.0100
Episode:1968 meanR:-122.2852 R:-116.1306 gloss:-25729.1699 dloss:20.7929 exploreP:0.0100
Episode:1969 meanR:-122.2135 R:-116.3822 gloss:-27540.2441 dloss:21.3284 exploreP:0.0100
Episode:1970 meanR:-122.1343 R:-115.5661 gloss:-27172.8438 dloss:23.5816 exploreP:0.0100
Episode:1971 meanR:-122.0598 R:-116.0977 gloss:-29911.9570 dloss:19.9970 exploreP:0.0100
Episode:1972 meanR:-121.9831 R:-115.7420 gloss:-28446.5781 dloss:31.0359 exploreP:0.0100
Episode:1973 meanR:-121.9109 R:-116.3041 gloss:-23653.9766 dloss:19.6635 exploreP:0.0100
Episode:1974 meanR:-121.8334 R:-115.8128 gloss:-28369.9355 dloss:19.6582 exploreP:0.0100
Episode:1975 meanR:-121.7715 R:-117.6967 gloss:-28575.0410 dloss:21.1021 exploreP:0.0100
Episode:1976 meanR:-1

Episode:2057 meanR:-120.1575 R:-123.1678 gloss:-20784.6445 dloss:28.5304 exploreP:0.0100
Episode:2058 meanR:-120.1978 R:-117.3057 gloss:-26403.7070 dloss:36.0769 exploreP:0.0100
Episode:2059 meanR:-120.3662 R:-131.3645 gloss:-29463.7793 dloss:28.9941 exploreP:0.0100
Episode:2060 meanR:-120.5311 R:-129.9848 gloss:-31788.7188 dloss:77.7710 exploreP:0.0100
Episode:2061 meanR:-120.5520 R:-115.4310 gloss:-40119.6719 dloss:28.6721 exploreP:0.0100
Episode:2062 meanR:-120.6272 R:-120.9714 gloss:-43320.0938 dloss:29.3182 exploreP:0.0100
Episode:2063 meanR:-120.8789 R:-138.7312 gloss:-44473.1016 dloss:30.0183 exploreP:0.0100
Episode:2064 meanR:-121.0211 R:-130.5381 gloss:-45880.7695 dloss:31.1932 exploreP:0.0100
Episode:2065 meanR:-121.2111 R:-134.7478 gloss:-45018.8359 dloss:38.0872 exploreP:0.0100
Episode:2066 meanR:-121.3853 R:-133.2846 gloss:-40323.0117 dloss:34.9804 exploreP:0.0100
Episode:2067 meanR:-121.5145 R:-129.9974 gloss:-36529.3086 dloss:38.8027 exploreP:0.0100
Episode:2068 meanR:-1

Episode:2149 meanR:-120.1740 R:-102.7388 gloss:-22060.2480 dloss:23.9355 exploreP:0.0100
Episode:2150 meanR:-119.9041 R:-102.4815 gloss:-17417.5449 dloss:23.5913 exploreP:0.0100
Episode:2151 meanR:-119.6424 R:-102.5293 gloss:-16531.0898 dloss:22.7145 exploreP:0.0100
Episode:2152 meanR:-119.3684 R:-102.5228 gloss:-15798.9775 dloss:50.3665 exploreP:0.0100
Episode:2153 meanR:-119.0936 R:-102.5527 gloss:-16077.0098 dloss:23.3781 exploreP:0.0100
Episode:2154 meanR:-118.8180 R:-102.5400 gloss:-15352.2822 dloss:21.9617 exploreP:0.0100
Episode:2155 meanR:-118.6101 R:-102.5361 gloss:-13867.9297 dloss:21.8307 exploreP:0.0100
Episode:2156 meanR:-118.3961 R:-102.6024 gloss:-12130.5293 dloss:50.3757 exploreP:0.0100
Episode:2157 meanR:-118.1901 R:-102.5590 gloss:-11580.8535 dloss:23.9282 exploreP:0.0100
Episode:2158 meanR:-118.0422 R:-102.5240 gloss:-11527.8125 dloss:22.0174 exploreP:0.0100
Episode:2159 meanR:-117.7541 R:-102.5459 gloss:-11543.8623 dloss:23.5882 exploreP:0.0100
Episode:2160 meanR:-1

Episode:2242 meanR:-104.2142 R:-127.9395 gloss:-550966.1875 dloss:149.9440 exploreP:0.0100
Episode:2243 meanR:-104.4694 R:-128.0719 gloss:-577037.6250 dloss:26.4515 exploreP:0.0100
Episode:2244 meanR:-104.7149 R:-127.9040 gloss:-590196.5000 dloss:25.8449 exploreP:0.0100
Episode:2245 meanR:-104.9685 R:-127.8929 gloss:-584936.3750 dloss:61.2751 exploreP:0.0100
Episode:2246 meanR:-105.2218 R:-128.0870 gloss:-848638.1875 dloss:950.6832 exploreP:0.0100
Episode:2247 meanR:-105.4550 R:-125.8632 gloss:-383591.3438 dloss:1599.1367 exploreP:0.0100
Episode:2248 meanR:-105.7098 R:-128.0185 gloss:-575200.9375 dloss:133.7741 exploreP:0.0100
Episode:2249 meanR:-105.9611 R:-127.8620 gloss:-485843.5938 dloss:80.9043 exploreP:0.0100
Episode:2250 meanR:-106.2159 R:-127.9626 gloss:-544850.6250 dloss:65.6795 exploreP:0.0100
Episode:2251 meanR:-106.4652 R:-127.4612 gloss:-513965.8750 dloss:51.2417 exploreP:0.0100
Episode:2252 meanR:-106.7204 R:-128.0405 gloss:-452934.6562 dloss:49.5334 exploreP:0.0100
Episo

Episode:2334 meanR:-127.7052 R:-127.4520 gloss:-34093.4648 dloss:9.8610 exploreP:0.0100
Episode:2335 meanR:-127.9529 R:-127.4674 gloss:-31608.8750 dloss:16.5246 exploreP:0.0100
Episode:2336 meanR:-128.2072 R:-130.8458 gloss:-32336.0859 dloss:11.5037 exploreP:0.0100
Episode:2337 meanR:-128.2016 R:-127.4545 gloss:-30736.0781 dloss:15.2505 exploreP:0.0100
Episode:2338 meanR:-128.1961 R:-127.5409 gloss:-29184.6504 dloss:13.6744 exploreP:0.0100
Episode:2339 meanR:-128.1913 R:-127.4203 gloss:-27789.4258 dloss:16.6876 exploreP:0.0100
Episode:2340 meanR:-128.1878 R:-127.4121 gloss:-27268.8301 dloss:14.3901 exploreP:0.0100
Episode:2341 meanR:-128.1814 R:-127.4121 gloss:-25911.4785 dloss:10.7609 exploreP:0.0100
Episode:2342 meanR:-128.1764 R:-127.4410 gloss:-25365.3438 dloss:10.2338 exploreP:0.0100
Episode:2343 meanR:-128.1691 R:-127.3499 gloss:-25567.5605 dloss:23.7761 exploreP:0.0100
Episode:2344 meanR:-128.1709 R:-128.0753 gloss:-25472.5508 dloss:11.2659 exploreP:0.0100
Episode:2345 meanR:-12

Episode:2428 meanR:-145.2487 R:-180.8662 gloss:-3611.7656 dloss:0.2558 exploreP:0.0100
Episode:2429 meanR:-145.7752 R:-180.7568 gloss:-3016.8818 dloss:0.6367 exploreP:0.0100
Episode:2430 meanR:-146.3037 R:-180.7505 gloss:-2283.8425 dloss:0.2937 exploreP:0.0100
Episode:2431 meanR:-146.8174 R:-179.3042 gloss:-6882.1294 dloss:0.9582 exploreP:0.0100
Episode:2432 meanR:-147.3461 R:-180.8216 gloss:-2769.7681 dloss:0.2496 exploreP:0.0100
Episode:2433 meanR:-147.3320 R:-126.0320 gloss:-20096.3574 dloss:37.4209 exploreP:0.0100
Episode:2434 meanR:-147.1024 R:-104.4850 gloss:-43776.8359 dloss:18.2823 exploreP:0.0100
Episode:2435 meanR:-146.8450 R:-101.7352 gloss:-91747.6484 dloss:127.8971 exploreP:0.0100
Episode:2436 meanR:-146.5610 R:-102.4462 gloss:-262219.6875 dloss:332.7137 exploreP:0.0100
Episode:2437 meanR:-146.3048 R:-101.8289 gloss:-131513.3750 dloss:193.5925 exploreP:0.0100
Episode:2438 meanR:-146.0561 R:-102.6679 gloss:-233053.3594 dloss:116.1310 exploreP:0.0100
Episode:2439 meanR:-145.

Episode:2520 meanR:-113.2894 R:-102.9093 gloss:-22427.5508 dloss:47.4101 exploreP:0.0100
Episode:2521 meanR:-112.5012 R:-101.8295 gloss:-25171.4980 dloss:47.9769 exploreP:0.0100
Episode:2522 meanR:-111.7105 R:-101.7802 gloss:-26448.9863 dloss:48.7381 exploreP:0.0100
Episode:2523 meanR:-110.9316 R:-102.6701 gloss:-27465.7422 dloss:58.1551 exploreP:0.0100
Episode:2524 meanR:-110.1394 R:-101.8298 gloss:-26728.6289 dloss:65.3227 exploreP:0.0100
Episode:2525 meanR:-109.3496 R:-101.8280 gloss:-22569.4004 dloss:54.4296 exploreP:0.0100
Episode:2526 meanR:-108.5573 R:-102.4424 gloss:-23164.1699 dloss:30.0681 exploreP:0.0100
Episode:2527 meanR:-107.8196 R:-102.0507 gloss:-23144.4434 dloss:30.9049 exploreP:0.0100
Episode:2528 meanR:-107.0293 R:-101.8287 gloss:-24612.7734 dloss:38.0321 exploreP:0.0100
Episode:2529 meanR:-106.2495 R:-102.7773 gloss:-24230.2383 dloss:28.7365 exploreP:0.0100
Episode:2530 meanR:-105.4664 R:-102.4427 gloss:-22404.2305 dloss:26.2948 exploreP:0.0100
Episode:2531 meanR:-1

Episode:2613 meanR:-139.9519 R:-185.9456 gloss:-25174.1621 dloss:6.7229 exploreP:0.0100
Episode:2614 meanR:-140.7558 R:-187.3165 gloss:-3167.6001 dloss:0.8231 exploreP:0.0100
Episode:2615 meanR:-141.6190 R:-189.4258 gloss:-2070.5881 dloss:1.0902 exploreP:0.0100
Episode:2616 meanR:-142.4225 R:-186.4768 gloss:-2063.9365 dloss:0.4269 exploreP:0.0100
Episode:2617 meanR:-143.2782 R:-187.4092 gloss:-3873.8525 dloss:0.6648 exploreP:0.0100
Episode:2618 meanR:-143.3985 R:-113.8584 gloss:-6237.3384 dloss:22.3854 exploreP:0.0100
Episode:2619 meanR:-143.5060 R:-113.2316 gloss:-9360.5693 dloss:30.1670 exploreP:0.0100
Episode:2620 meanR:-143.6468 R:-116.9905 gloss:-15217.6689 dloss:51.4719 exploreP:0.0100
Episode:2621 meanR:-143.7844 R:-115.5952 gloss:-14164.0410 dloss:51.9561 exploreP:0.0100
Episode:2622 meanR:-144.4180 R:-165.1389 gloss:-28784.8711 dloss:38.6738 exploreP:0.0100
Episode:2623 meanR:-144.5299 R:-113.8596 gloss:-26545.8438 dloss:30.2831 exploreP:0.0100
Episode:2624 meanR:-144.6718 R:-

Episode:2705 meanR:-128.4613 R:-123.4765 gloss:-34952.3086 dloss:37.4046 exploreP:0.0100
Episode:2706 meanR:-127.8320 R:-123.5883 gloss:-36600.5469 dloss:35.8350 exploreP:0.0100
Episode:2707 meanR:-127.2078 R:-123.4968 gloss:-36240.1445 dloss:37.6138 exploreP:0.0100
Episode:2708 meanR:-126.4696 R:-111.5777 gloss:-40249.5664 dloss:37.5310 exploreP:0.0100
Episode:2709 meanR:-125.8277 R:-123.5016 gloss:-42145.9297 dloss:34.1383 exploreP:0.0100
Episode:2710 meanR:-125.1933 R:-123.5029 gloss:-42026.6797 dloss:44.8883 exploreP:0.0100
Episode:2711 meanR:-124.5764 R:-123.6169 gloss:-40449.8320 dloss:33.4026 exploreP:0.0100
Episode:2712 meanR:-122.8850 R:-123.4819 gloss:-41968.1680 dloss:33.3280 exploreP:0.0100
Episode:2713 meanR:-122.2618 R:-123.6263 gloss:-42644.2148 dloss:43.1387 exploreP:0.0100
Episode:2714 meanR:-121.6246 R:-123.5925 gloss:-39624.5195 dloss:32.9006 exploreP:0.0100
Episode:2715 meanR:-120.9702 R:-123.9892 gloss:-38050.5508 dloss:35.0226 exploreP:0.0100
Episode:2716 meanR:-1

Episode:2798 meanR:-123.3822 R:-123.8974 gloss:-20780.8496 dloss:35.5526 exploreP:0.0100
Episode:2799 meanR:-123.3809 R:-123.8964 gloss:-20840.5840 dloss:21.8149 exploreP:0.0100
Episode:2800 meanR:-123.3789 R:-123.8562 gloss:-22496.3652 dloss:37.0010 exploreP:0.0100
Episode:2801 meanR:-123.4626 R:-124.0097 gloss:-23998.4961 dloss:21.5770 exploreP:0.0100
Episode:2802 meanR:-123.4666 R:-123.9198 gloss:-23923.0273 dloss:44.6143 exploreP:0.0100
Episode:2803 meanR:-123.4724 R:-123.9914 gloss:-21805.8652 dloss:22.8853 exploreP:0.0100
Episode:2804 meanR:-123.4770 R:-123.9675 gloss:-22181.8496 dloss:19.9417 exploreP:0.0100
Episode:2805 meanR:-123.4821 R:-123.9849 gloss:-23435.2676 dloss:35.8453 exploreP:0.0100
Episode:2806 meanR:-123.4854 R:-123.9171 gloss:-23898.9414 dloss:21.0387 exploreP:0.0100
Episode:2807 meanR:-123.4905 R:-124.0089 gloss:-24360.2598 dloss:47.2614 exploreP:0.0100
Episode:2808 meanR:-123.6142 R:-123.9443 gloss:-20517.6133 dloss:22.9549 exploreP:0.0100
Episode:2809 meanR:-1

Episode:2890 meanR:-126.1433 R:-127.8797 gloss:-10189.3633 dloss:68.1297 exploreP:0.0100
Episode:2891 meanR:-126.1874 R:-127.9427 gloss:-12367.3809 dloss:96.9693 exploreP:0.0100
Episode:2892 meanR:-126.2312 R:-127.9622 gloss:-13742.4971 dloss:52.2310 exploreP:0.0100
Episode:2893 meanR:-126.2768 R:-128.0802 gloss:-12850.1865 dloss:176.5732 exploreP:0.0100
Episode:2894 meanR:-126.3208 R:-127.9617 gloss:-13807.9229 dloss:54.9818 exploreP:0.0100
Episode:2895 meanR:-126.3653 R:-128.4925 gloss:-17601.0117 dloss:56.8419 exploreP:0.0100
Episode:2896 meanR:-126.4059 R:-128.0408 gloss:-18848.5117 dloss:93.0440 exploreP:0.0100
Episode:2897 meanR:-126.4462 R:-127.9363 gloss:-20113.5117 dloss:69.1016 exploreP:0.0100
Episode:2898 meanR:-126.4858 R:-127.8598 gloss:-22555.7266 dloss:81.9646 exploreP:0.0100
Episode:2899 meanR:-126.5270 R:-128.0213 gloss:-22183.6055 dloss:70.6015 exploreP:0.0100
Episode:2900 meanR:-126.5692 R:-128.0736 gloss:-21863.1895 dloss:61.9322 exploreP:0.0100
Episode:2901 meanR:-

Episode:2982 meanR:-125.4994 R:-124.0431 gloss:-37167.6328 dloss:19.9902 exploreP:0.0100
Episode:2983 meanR:-125.4463 R:-122.5978 gloss:-38935.0234 dloss:19.5069 exploreP:0.0100
Episode:2984 meanR:-124.9148 R:-124.0424 gloss:-40574.5273 dloss:20.5271 exploreP:0.0100
Episode:2985 meanR:-124.8742 R:-123.9794 gloss:-41978.8047 dloss:39.6673 exploreP:0.0100
Episode:2986 meanR:-124.8325 R:-123.8844 gloss:-40800.4570 dloss:22.3063 exploreP:0.0100
Episode:2987 meanR:-124.7799 R:-122.6519 gloss:-36356.7227 dloss:23.2515 exploreP:0.0100
Episode:2988 meanR:-124.7398 R:-123.9870 gloss:-33629.6289 dloss:88.3343 exploreP:0.0100
Episode:2989 meanR:-124.6953 R:-123.5368 gloss:-31612.4727 dloss:19.0586 exploreP:0.0100
Episode:2990 meanR:-124.6522 R:-123.5645 gloss:-33293.5352 dloss:18.4370 exploreP:0.0100
Episode:2991 meanR:-124.6087 R:-123.5924 gloss:-34563.7578 dloss:18.4347 exploreP:0.0100
Episode:2992 meanR:-124.5594 R:-123.0376 gloss:-35691.4062 dloss:19.7840 exploreP:0.0100
Episode:2993 meanR:-1

Episode:3074 meanR:-127.7233 R:-180.0388 gloss:-127004.9297 dloss:16.4147 exploreP:0.0100
Episode:3075 meanR:-128.3427 R:-185.9539 gloss:-4195.9233 dloss:0.5387 exploreP:0.0100
Episode:3076 meanR:-128.9579 R:-185.4262 gloss:-5619.1289 dloss:1.4392 exploreP:0.0100
Episode:3077 meanR:-129.4963 R:-177.7440 gloss:-12974.3779 dloss:2.6930 exploreP:0.0100
Episode:3078 meanR:-130.1195 R:-186.2268 gloss:-9728.3398 dloss:4.6723 exploreP:0.0100
Episode:3079 meanR:-130.7350 R:-185.5946 gloss:-5054.3423 dloss:1.2802 exploreP:0.0100
Episode:3080 meanR:-131.3656 R:-187.0438 gloss:-7505.6274 dloss:4.5159 exploreP:0.0100
Episode:3081 meanR:-131.9930 R:-186.3754 gloss:-4019.4019 dloss:0.1535 exploreP:0.0100
Episode:3082 meanR:-132.6198 R:-186.7261 gloss:-4967.9175 dloss:0.1618 exploreP:0.0100
Episode:3083 meanR:-133.2530 R:-185.9205 gloss:-5242.6475 dloss:0.1613 exploreP:0.0100
Episode:3084 meanR:-133.8689 R:-185.6328 gloss:-5221.2251 dloss:0.1419 exploreP:0.0100
Episode:3085 meanR:-134.4806 R:-185.142

Episode:3167 meanR:-128.7175 R:-113.4420 gloss:-47967.1055 dloss:20.8698 exploreP:0.0100
Episode:3168 meanR:-128.5724 R:-113.5578 gloss:-43243.2891 dloss:40.3077 exploreP:0.0100
Episode:3169 meanR:-128.4260 R:-113.3270 gloss:-33649.9961 dloss:22.0408 exploreP:0.0100
Episode:3170 meanR:-128.3061 R:-116.0648 gloss:-27351.7422 dloss:21.8284 exploreP:0.0100
Episode:3171 meanR:-128.1589 R:-113.3271 gloss:-27986.1406 dloss:21.7470 exploreP:0.0100
Episode:3172 meanR:-128.0100 R:-113.2047 gloss:-31140.5879 dloss:21.5542 exploreP:0.0100
Episode:3173 meanR:-127.8669 R:-113.2160 gloss:-31591.6621 dloss:21.2851 exploreP:0.0100
Episode:3174 meanR:-127.2006 R:-113.4080 gloss:-30818.5801 dloss:21.1671 exploreP:0.0100
Episode:3175 meanR:-126.4755 R:-113.4412 gloss:-27849.3105 dloss:21.2041 exploreP:0.0100
Episode:3176 meanR:-125.7568 R:-113.5568 gloss:-27888.7324 dloss:21.1123 exploreP:0.0100
Episode:3177 meanR:-125.1146 R:-113.5228 gloss:-28693.3184 dloss:22.6148 exploreP:0.0100
Episode:3178 meanR:-1

Episode:3259 meanR:-119.1620 R:-123.5733 gloss:-81552.4609 dloss:47.0144 exploreP:0.0100
Episode:3260 meanR:-119.2654 R:-123.6249 gloss:-81465.9141 dloss:43.0392 exploreP:0.0100
Episode:3261 meanR:-119.3077 R:-117.5517 gloss:-85605.7109 dloss:52.3524 exploreP:0.0100
Episode:3262 meanR:-119.4085 R:-123.5208 gloss:-88669.6562 dloss:84.9983 exploreP:0.0100
Episode:3263 meanR:-119.4869 R:-123.4902 gloss:-84812.5781 dloss:43.9344 exploreP:0.0100
Episode:3264 meanR:-119.5894 R:-123.5807 gloss:-77524.4141 dloss:37.2918 exploreP:0.0100
Episode:3265 meanR:-119.6893 R:-123.5523 gloss:-76531.7969 dloss:53.7259 exploreP:0.0100
Episode:3266 meanR:-119.7898 R:-123.4989 gloss:-66114.8438 dloss:30.7689 exploreP:0.0100
Episode:3267 meanR:-119.8913 R:-123.5863 gloss:-65027.0195 dloss:54.7932 exploreP:0.0100
Episode:3268 meanR:-119.9914 R:-123.5694 gloss:-62288.0781 dloss:30.1903 exploreP:0.0100
Episode:3269 meanR:-120.0931 R:-123.4943 gloss:-55727.8203 dloss:44.7000 exploreP:0.0100
Episode:3270 meanR:-1

Episode:3351 meanR:-120.0591 R:-102.5094 gloss:-214649.9531 dloss:39.2848 exploreP:0.0100
Episode:3352 meanR:-119.8467 R:-102.3851 gloss:-172290.1250 dloss:38.7039 exploreP:0.0100
Episode:3353 meanR:-119.6398 R:-102.7642 gloss:-127342.3828 dloss:38.8527 exploreP:0.0100
Episode:3354 meanR:-119.4356 R:-103.7092 gloss:-87539.2891 dloss:38.3678 exploreP:0.0100
Episode:3355 meanR:-119.2257 R:-102.5137 gloss:-54201.3789 dloss:37.6587 exploreP:0.0100
Episode:3356 meanR:-119.0157 R:-102.4773 gloss:-50805.4336 dloss:37.2316 exploreP:0.0100
Episode:3357 meanR:-118.8113 R:-103.2354 gloss:-52546.6055 dloss:36.9757 exploreP:0.0100
Episode:3358 meanR:-118.6097 R:-103.3540 gloss:-52819.6289 dloss:36.9546 exploreP:0.0100
Episode:3359 meanR:-118.3995 R:-102.5513 gloss:-53842.0938 dloss:37.6687 exploreP:0.0100
Episode:3360 meanR:-118.1882 R:-102.5022 gloss:-52410.4844 dloss:66.8038 exploreP:0.0100
Episode:3361 meanR:-118.0390 R:-102.6226 gloss:-57243.0547 dloss:49.5419 exploreP:0.0100
Episode:3362 meanR

Episode:3443 meanR:-102.8626 R:-102.5373 gloss:-25888.9473 dloss:123.9569 exploreP:0.0100
Episode:3444 meanR:-102.8625 R:-102.5364 gloss:-19756.5801 dloss:25.9834 exploreP:0.0100
Episode:3445 meanR:-102.8627 R:-102.5215 gloss:-16432.5039 dloss:24.8156 exploreP:0.0100
Episode:3446 meanR:-102.8617 R:-102.3868 gloss:-20017.5977 dloss:48.5066 exploreP:0.0100
Episode:3447 meanR:-102.8546 R:-102.4968 gloss:-21930.6895 dloss:257.1576 exploreP:0.0100
Episode:3448 meanR:-102.8500 R:-102.5407 gloss:-22153.6133 dloss:25.7098 exploreP:0.0100
Episode:3449 meanR:-102.8505 R:-102.5522 gloss:-22131.4570 dloss:24.2241 exploreP:0.0100
Episode:3450 meanR:-102.8514 R:-102.5209 gloss:-22164.8555 dloss:24.1257 exploreP:0.0100
Episode:3451 meanR:-102.8512 R:-102.4962 gloss:-20775.3633 dloss:23.6538 exploreP:0.0100
Episode:3452 meanR:-102.8621 R:-103.4748 gloss:-22222.3105 dloss:25.7734 exploreP:0.0100
Episode:3453 meanR:-102.8595 R:-102.4973 gloss:-22889.1875 dloss:50.3112 exploreP:0.0100
Episode:3454 meanR:

Episode:3535 meanR:-102.7221 R:-102.9900 gloss:-30296.0488 dloss:35.0489 exploreP:0.0100
Episode:3536 meanR:-102.7223 R:-102.5523 gloss:-29463.2246 dloss:33.5103 exploreP:0.0100
Episode:3537 meanR:-102.7218 R:-102.5393 gloss:-29185.0469 dloss:106.4362 exploreP:0.0100
Episode:3538 meanR:-102.7214 R:-103.4004 gloss:-33094.9727 dloss:139.7803 exploreP:0.0100
Episode:3539 meanR:-102.7317 R:-103.5885 gloss:-31327.7949 dloss:39.6095 exploreP:0.0100
Episode:3540 meanR:-102.7457 R:-104.7476 gloss:-88802.6250 dloss:52.4094 exploreP:0.0100
Episode:3541 meanR:-102.7488 R:-102.8440 gloss:-160275.7969 dloss:46.0270 exploreP:0.0100
Episode:3542 meanR:-102.7909 R:-106.7590 gloss:-165856.4062 dloss:40.6591 exploreP:0.0100
Episode:3543 meanR:-102.7973 R:-103.1760 gloss:-170235.2656 dloss:38.5718 exploreP:0.0100
Episode:3544 meanR:-102.9248 R:-115.2864 gloss:-272963.8750 dloss:97.5480 exploreP:0.0100
Episode:3545 meanR:-103.6946 R:-179.4982 gloss:-144951.0000 dloss:26.0590 exploreP:0.0100
Episode:3546 m

Episode:3628 meanR:-131.5544 R:-103.4335 gloss:-83117.5547 dloss:35.2061 exploreP:0.0100
Episode:3629 meanR:-131.5990 R:-107.0139 gloss:-73801.1562 dloss:37.0594 exploreP:0.0100
Episode:3630 meanR:-131.6179 R:-104.1790 gloss:-66209.2812 dloss:38.9387 exploreP:0.0100
Episode:3631 meanR:-131.6420 R:-104.8990 gloss:-61675.5156 dloss:39.9640 exploreP:0.0100
Episode:3632 meanR:-131.6337 R:-102.5657 gloss:-48121.2383 dloss:41.3438 exploreP:0.0100
Episode:3633 meanR:-131.6424 R:-103.3787 gloss:-38915.2969 dloss:43.0994 exploreP:0.0100
Episode:3634 meanR:-131.6410 R:-103.2956 gloss:-36062.9336 dloss:44.7429 exploreP:0.0100
Episode:3635 meanR:-131.6295 R:-101.8364 gloss:-36500.2930 dloss:45.6668 exploreP:0.0100
Episode:3636 meanR:-131.6538 R:-104.9822 gloss:-36873.8672 dloss:45.6116 exploreP:0.0100
Episode:3637 meanR:-131.6532 R:-102.4766 gloss:-39065.8672 dloss:38.6661 exploreP:0.0100
Episode:3638 meanR:-131.6471 R:-102.7959 gloss:-41588.3164 dloss:35.5672 exploreP:0.0100
Episode:3639 meanR:-1

## Visualizing training

Below I'll plot the total reward for each episode in grey, with a 10-episode rolling average on top in blue. The same kind of plot is then repeated for the G and D losses.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def running_mean(x, N):
    # Mean over a sliding window of N values, computed from a cumulative sum
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
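
To see why the cumulative-sum trick in `running_mean` gives a rolling average, here is a small worked sketch. The sample `rewards` array is made up for illustration, and the comparison against `np.convolve` is only a cross-check, not something the notebook itself uses.

```python
import numpy as np

def running_mean(x, N):
    # Prepend a 0 so that cumsum[i] is the sum of the first i values
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # The difference of two cumulative sums N apart is the sum of one window
    return (cumsum[N:] - cumsum[:-N]) / N

# Made-up rewards, only for illustration
rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
smoothed = running_mean(rewards, 3)

# Cross-check against a plain moving-average convolution
reference = np.convolve(rewards, np.ones(3) / 3, mode='valid')

print(smoothed)   # one value per full window: len(rewards) - N + 1 entries
print(reference)
```

The result has `len(x) - N + 1` entries because only full windows are averaged.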

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(gloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(dloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')
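
The plotting cells above all unpack their log with `np.array(...).T`, which only works if every entry is an `(episode, value)` pair. Here is a minimal sketch of that layout and of how the x-axis is lined up with the shorter smoothed curve. It assumes the lists really are built as pairs, and `log` with its values is a hypothetical stand-in for `episode_rewards_list` and friends.

```python
import numpy as np

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Hypothetical log of (episode, value) pairs; the values are made up
log = [(ep, float(ep % 4)) for ep in range(20)]

# Transposing the (20, 2) array gives one row of episodes and one row of values
eps, arr = np.array(log).T

window = 10
smoothed = running_mean(arr, window)

# The smoothed curve is shorter, so pair it with the last len(smoothed) episodes
x_smoothed = eps[-len(smoothed):]

print(len(arr), len(smoothed), x_smoothed[0])
```

Each smoothed point is plotted at the last episode of its window, so the blue curve starts `window - 1` episodes after the grey one.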

## Testing

Let's check out how our trained agent plays the game.

In [36]:
import gym
# env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Episodes/epochs
    for _ in range(1):
        state = env.reset()
        total_reward = 0

        # Steps/batches
        while True:
            env.render()
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                print('total_reward: {}'.format(total_reward))
                break
                
env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward: 500.0
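
A single episode is a noisy estimate of how good the agent is, so it can help to average over several. Below is a rough sketch of the same greedy test loop wrapped in a function. `ToyEnv` and `greedy_policy` are hypothetical stand-ins (for the Gym environment and for the `sess.run(model.actions_logits, ...)` call) so that the sketch runs without Gym or a checkpoint; only the loop structure mirrors the cell above.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in for a Gym env using the 4-tuple step API."""
    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        self.t += 1
        state = np.full(4, self.t, dtype=np.float32)
        done = self.t >= self.horizon
        return state, 1.0, done, {}

def greedy_policy(state):
    # Hypothetical stand-in for the network: fixed linear logits over 2 actions
    weights = np.array([[0.1, -0.1]] * 4)
    logits = state.reshape([1, -1]) @ weights
    # Greedy action: index of the largest logit, as in the test cell
    return int(np.argmax(logits))

def evaluate(env, policy, n_episodes=3):
    totals = []
    for _ in range(n_episodes):
        state = env.reset()
        total_reward, done = 0.0, False
        while not done:
            state, reward, done, _ = env.step(policy(state))
            total_reward += reward
        totals.append(total_reward)
    return totals

totals = evaluate(ToyEnv(horizon=5), greedy_policy, n_episodes=3)
print('mean total_reward: {:.1f}'.format(np.mean(totals)))
```

Note that `np.argmax` only makes sense for a discrete action space like Cart-Pole's. For an environment with continuous actions, the network output would have to be used as the action itself.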


## Extending this

So, Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of the small state vector we're using here, though, you'd feed in the screen images and use convolutional layers to extract the state from them.

![Deep Q-Learning Atari](assets/atari-network.png)
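
To get a feel for the shapes involved, here is a small NumPy-only sketch. It stacks four grayscale frames into one state and then works out the feature-map sizes for the three convolutional layers described in the DQN paper (8x8 stride 4, 4x4 stride 2, 3x3 stride 1, on an 84x84x4 input). The random frames, the channel-averaging grayscale conversion, and the helper names `to_gray` and `conv_out` are my own assumptions for illustration, not the paper's exact preprocessing.

```python
import numpy as np
from collections import deque

def to_gray(frame):
    # Average the RGB channels and scale to [0, 1]; a simplification
    return frame.mean(axis=2).astype(np.float32) / 255.0

def conv_out(size, kernel, stride):
    # Output width of an unpadded ('valid') convolution
    return (size - kernel) // stride + 1

# Hypothetical screen frames, assumed to be already resized to 84x84 RGB
frames = deque(maxlen=4)
for _ in range(4):
    raw = np.random.randint(0, 256, size=(84, 84, 3), dtype=np.uint8)
    frames.append(to_gray(raw))

# Stacking the last 4 frames lets the network see motion
state = np.stack(frames, axis=-1)

# Follow the spatial size through the three conv layers
size = state.shape[0]
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)

# The last conv layer has 64 filters, so this is the flattened feature count
flat_features = size * size * 64
print(state.shape, size, flat_features)
```

The flattened features would then go into fully connected layers that output one Q-value per action, replacing the small dense network used for Cart-Pole.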

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. The original paper will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf