# Deep cortical reinforcement learning: Policy gradients + Q-learning + GAN


In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [2]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

##### >**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [3]:
import gym

## Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('MountainCarContinuous-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [4]:
env.reset()
batch = []
for _ in range(1111):
    #env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    batch.append([action, state, reward, done, info])
    #print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

To shut the window showing the simulation, use `env.close()`.

In [5]:
# env.close()

If you ran the simulation above, we can look at the rewards:

In [6]:
batch[0], 
batch[0][1].shape, state.shape

((4,), (4,))

In [7]:
import numpy as np
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
infos = np.array([each[4] for each in batch])

In [8]:
# print(rewards[-20:])
print('shapes:', np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print('dtypes:', np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('states:', np.max(np.array(states)), np.min(np.array(states)))
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print('rewards:', np.max(np.array(rewards)), np.min(np.array(rewards)))

shapes: (1111,) (1111, 4) (1111,) (1111,)
dtypes: float64 float64 int64 bool
states: 2.349380847965382 -2.602122511534467
actions: 1 0
rewards: 1.0 1.0


In [9]:
actions[:10]

array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1])

In [10]:
rewards[:10]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [11]:
# import numpy as np
def sigmoid(x, derivative=False):
  return x*(1-x) if derivative else 1/(1+np.exp(-x))

In [12]:
sigmoid(np.max(np.array(rewards))), sigmoid(np.min(np.array(rewards)))

(0.7310585786300049, 0.7310585786300049)

In [13]:
print('rewards:', np.max(np.array(rewards))/100, np.min(np.array(rewards))/100)

rewards: 0.01 0.01


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [14]:
def model_input(state_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    return states, actions, targetQs

In [15]:
# How to use batch-norm
#   x_norm = tf.layers.batch_normalization(x, training=training)

#   # ...

#   update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
#   with tf.control_dependencies(update_ops):
#     train_op = optimizer.minimize(loss)

In [16]:
# training: Either a Python boolean, or a TensorFlow boolean scalar tensor (e.g. a placeholder). 
# Whether to return the output in: 
# training mode (normalized with statistics of the current batch) or 
# inference mode (normalized with moving statistics). 
# NOTE: make sure to set this parameter correctly, or else your training/inference will not work properly.

In [83]:
# Generator/Controller: Generating/prediting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits

In [84]:
# Discriminator/Dopamine: Reward function/planner/naviator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [85]:
def model_loss(action_size, hidden_size, states, actions, targetQs):
    # G
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                      labels=actions_labels)
    g_loss = tf.reduce_mean(neg_log_prob_actions * targetQs)
    #g_loss = -tf.reduce_mean(Qs)
    
    # D
    Qs = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    d_loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, Qs, g_loss, d_loss

In [86]:
# Optimizating/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [87]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs = model_input(state_size=state_size)

        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, targetQs=self.targetQs) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [88]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [89]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1000, 4) actions:(1000,)
action size:2


In [96]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
# state_size = 37
# state_size_ = (84, 84, 3)
state_size = 4
action_size = 2
hidden_size = 64             # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
gamma = 0.99                   # future reward discount
memory_size = 1000            # memory capacity
batch_size = 1000             # experience mini-batch size

In [97]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [98]:
state = env.reset()
#for _ in range(batch_size):
for _ in range(memory_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done is True:
        state = env.reset()

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list, gloss_list, dloss_list = [], [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        gloss_batch, dloss_batch = [], []
        state = env.reset()

        # Training steps/batches
        while True:
            # Explore (Env) or Exploit (Model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                action = env.action_space.sample()
            else:
                action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
                action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            total_reward += reward
            state = next_state

            # Training
            #batch = memory.sample(batch_size)
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            nextQs_logits = sess.run(model.Qs_logits, feed_dict = {model.states: next_states})
            #nextQs = np.max(nextQs_logits, axis=1) * (1-dones)
            nextQs = nextQs_logits.reshape([-1]) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            gloss, dloss, _, _ = sess.run([model.g_loss, model.d_loss, model.g_opt, model.d_opt],
                                            feed_dict = {model.states: states, 
                                                         model.actions: actions,
                                                         model.targetQs: targetQs})
            gloss_batch.append(gloss)
            dloss_batch.append(dloss)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dloss:{:.4f}'.format(np.mean(dloss_batch)),
              'exploreP:{:.4f}'.format(explore_p))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:17.0000 R:17.0 gloss:0.6995 dloss:1.0029 exploreP:0.9983
Episode:1 meanR:14.0000 R:11.0 gloss:0.7397 dloss:0.9983 exploreP:0.9972
Episode:2 meanR:13.6667 R:13.0 gloss:0.7764 dloss:0.9974 exploreP:0.9959
Episode:3 meanR:14.7500 R:18.0 gloss:0.8250 dloss:1.0005 exploreP:0.9942
Episode:4 meanR:14.0000 R:11.0 gloss:0.8725 dloss:1.0073 exploreP:0.9931
Episode:5 meanR:13.3333 R:10.0 gloss:0.9091 dloss:1.0145 exploreP:0.9921
Episode:6 meanR:15.4286 R:28.0 gloss:0.9780 dloss:1.0336 exploreP:0.9894
Episode:7 meanR:15.5000 R:16.0 gloss:1.0668 dloss:1.0634 exploreP:0.9878
Episode:8 meanR:17.5556 R:34.0 gloss:1.1835 dloss:1.1067 exploreP:0.9845
Episode:9 meanR:18.0000 R:22.0 gloss:1.3335 dloss:1.1752 exploreP:0.9823
Episode:10 meanR:19.6364 R:36.0 gloss:1.5212 dloss:1.2884 exploreP:0.9788
Episode:11 meanR:18.8333 R:10.0 gloss:1.6844 dloss:1.4028 exploreP:0.9779
Episode:12 meanR:18.6923 R:17.0 gloss:1.7944 dloss:1.5063 exploreP:0.9762
Episode:13 meanR:19.7143 R:33.0 gloss:2.0090 dlo

Episode:109 meanR:22.9400 R:23.0 gloss:12.7152 dloss:15.2435 exploreP:0.7830
Episode:110 meanR:22.9900 R:41.0 gloss:12.6760 dloss:15.1259 exploreP:0.7799
Episode:111 meanR:23.1300 R:24.0 gloss:12.8144 dloss:15.1220 exploreP:0.7780
Episode:112 meanR:23.1600 R:20.0 gloss:12.8668 dloss:15.2661 exploreP:0.7765
Episode:113 meanR:23.1900 R:36.0 gloss:13.0036 dloss:15.2785 exploreP:0.7737
Episode:114 meanR:23.3300 R:26.0 gloss:13.1341 dloss:15.3569 exploreP:0.7717
Episode:115 meanR:23.1700 R:19.0 gloss:13.2686 dloss:15.8176 exploreP:0.7703
Episode:116 meanR:23.1300 R:15.0 gloss:13.2091 dloss:16.1829 exploreP:0.7692
Episode:117 meanR:23.1600 R:11.0 gloss:13.0833 dloss:15.8950 exploreP:0.7683
Episode:118 meanR:23.0900 R:25.0 gloss:12.9167 dloss:15.8223 exploreP:0.7664
Episode:119 meanR:23.0300 R:10.0 gloss:12.7687 dloss:15.7991 exploreP:0.7657
Episode:120 meanR:23.0400 R:32.0 gloss:12.4883 dloss:15.2803 exploreP:0.7633
Episode:121 meanR:23.2900 R:37.0 gloss:12.4689 dloss:14.8850 exploreP:0.7605

Episode:216 meanR:24.5400 R:34.0 gloss:13.1648 dloss:16.6589 exploreP:0.6040
Episode:217 meanR:24.6000 R:17.0 gloss:13.2167 dloss:16.7117 exploreP:0.6029
Episode:218 meanR:24.4600 R:11.0 gloss:13.1961 dloss:16.9715 exploreP:0.6023
Episode:219 meanR:24.4700 R:11.0 gloss:13.0788 dloss:16.7509 exploreP:0.6016
Episode:220 meanR:24.2700 R:12.0 gloss:12.8684 dloss:16.7371 exploreP:0.6009
Episode:221 meanR:24.2700 R:37.0 gloss:12.6662 dloss:16.1550 exploreP:0.5988
Episode:222 meanR:24.3100 R:14.0 gloss:12.8151 dloss:16.5046 exploreP:0.5979
Episode:223 meanR:24.3200 R:32.0 gloss:12.7402 dloss:16.3624 exploreP:0.5960
Episode:224 meanR:24.5000 R:35.0 gloss:12.9939 dloss:16.4979 exploreP:0.5940
Episode:225 meanR:24.5000 R:10.0 gloss:13.3630 dloss:17.2178 exploreP:0.5934
Episode:226 meanR:25.1300 R:74.0 gloss:13.3023 dloss:16.8726 exploreP:0.5891
Episode:227 meanR:25.2400 R:29.0 gloss:13.6339 dloss:17.4952 exploreP:0.5874
Episode:228 meanR:25.2000 R:21.0 gloss:13.4390 dloss:16.9777 exploreP:0.5862

Episode:323 meanR:36.6800 R:72.0 gloss:13.6003 dloss:20.3463 exploreP:0.4161
Episode:324 meanR:37.0300 R:70.0 gloss:13.5597 dloss:20.6116 exploreP:0.4133
Episode:325 meanR:37.2800 R:35.0 gloss:13.2605 dloss:20.5697 exploreP:0.4119
Episode:326 meanR:37.4100 R:87.0 gloss:13.5185 dloss:20.5977 exploreP:0.4084
Episode:327 meanR:37.8600 R:74.0 gloss:13.5631 dloss:20.9808 exploreP:0.4054
Episode:328 meanR:38.2900 R:64.0 gloss:13.8831 dloss:20.9112 exploreP:0.4029
Episode:329 meanR:39.0000 R:83.0 gloss:14.5756 dloss:21.9077 exploreP:0.3997
Episode:330 meanR:38.8200 R:21.0 gloss:14.5045 dloss:22.3210 exploreP:0.3989
Episode:331 meanR:39.0200 R:56.0 gloss:14.2327 dloss:21.4943 exploreP:0.3967
Episode:332 meanR:38.8300 R:24.0 gloss:14.7648 dloss:21.8332 exploreP:0.3958
Episode:333 meanR:39.1800 R:50.0 gloss:14.4786 dloss:21.5456 exploreP:0.3938
Episode:334 meanR:39.7200 R:66.0 gloss:13.9311 dloss:21.4666 exploreP:0.3913
Episode:335 meanR:39.6600 R:22.0 gloss:13.7931 dloss:21.2300 exploreP:0.3905

Episode:430 meanR:35.5500 R:43.0 gloss:9.2953 dloss:17.5232 exploreP:0.2825
Episode:431 meanR:35.4900 R:50.0 gloss:9.2967 dloss:17.6757 exploreP:0.2812
Episode:432 meanR:35.8700 R:62.0 gloss:9.4646 dloss:18.0042 exploreP:0.2795
Episode:433 meanR:35.6200 R:25.0 gloss:9.5457 dloss:18.3695 exploreP:0.2788
Episode:434 meanR:35.3100 R:35.0 gloss:9.5134 dloss:18.0277 exploreP:0.2779
Episode:435 meanR:35.4400 R:35.0 gloss:9.5653 dloss:18.1101 exploreP:0.2769
Episode:436 meanR:35.5600 R:46.0 gloss:9.6875 dloss:18.5578 exploreP:0.2757
Episode:437 meanR:35.5500 R:15.0 gloss:9.5815 dloss:18.6030 exploreP:0.2753
Episode:438 meanR:35.9300 R:50.0 gloss:9.2979 dloss:18.0256 exploreP:0.2740
Episode:439 meanR:35.9500 R:33.0 gloss:9.3050 dloss:18.2557 exploreP:0.2731
Episode:440 meanR:35.5800 R:13.0 gloss:9.4716 dloss:18.6649 exploreP:0.2728
Episode:441 meanR:35.3500 R:42.0 gloss:9.4406 dloss:17.9791 exploreP:0.2717
Episode:442 meanR:35.3800 R:39.0 gloss:9.2838 dloss:17.8713 exploreP:0.2707
Episode:443 

Episode:538 meanR:41.8400 R:42.0 gloss:8.3528 dloss:21.5183 exploreP:0.1837
Episode:539 meanR:41.9300 R:42.0 gloss:8.3547 dloss:21.6328 exploreP:0.1830
Episode:540 meanR:42.5000 R:70.0 gloss:8.1746 dloss:21.4485 exploreP:0.1818
Episode:541 meanR:42.5500 R:47.0 gloss:8.1238 dloss:21.9226 exploreP:0.1810
Episode:542 meanR:42.6800 R:52.0 gloss:8.0146 dloss:21.4201 exploreP:0.1801
Episode:543 meanR:42.8300 R:57.0 gloss:8.1376 dloss:21.7653 exploreP:0.1791
Episode:544 meanR:42.8700 R:44.0 gloss:8.4035 dloss:21.8031 exploreP:0.1784
Episode:545 meanR:42.8000 R:57.0 gloss:8.2157 dloss:21.8785 exploreP:0.1774
Episode:546 meanR:43.1400 R:46.0 gloss:8.1526 dloss:21.9065 exploreP:0.1767
Episode:547 meanR:43.4000 R:54.0 gloss:8.2631 dloss:21.5255 exploreP:0.1758
Episode:548 meanR:43.0300 R:25.0 gloss:8.3616 dloss:22.2420 exploreP:0.1754
Episode:549 meanR:43.2000 R:43.0 gloss:8.6622 dloss:21.2838 exploreP:0.1746
Episode:550 meanR:43.2500 R:39.0 gloss:8.2307 dloss:21.9770 exploreP:0.1740
Episode:551 

Episode:646 meanR:54.5200 R:42.0 gloss:6.6903 dloss:22.2436 exploreP:0.1066
Episode:647 meanR:54.5800 R:60.0 gloss:6.2213 dloss:22.2802 exploreP:0.1060
Episode:648 meanR:54.8400 R:51.0 gloss:5.6114 dloss:22.2023 exploreP:0.1056
Episode:649 meanR:54.8200 R:41.0 gloss:5.5190 dloss:22.1243 exploreP:0.1052
Episode:650 meanR:54.6500 R:22.0 gloss:5.5403 dloss:22.4142 exploreP:0.1050
Episode:651 meanR:54.6100 R:52.0 gloss:5.3222 dloss:21.7953 exploreP:0.1045
Episode:652 meanR:54.8100 R:43.0 gloss:5.4279 dloss:22.0737 exploreP:0.1041
Episode:653 meanR:54.8200 R:35.0 gloss:5.2054 dloss:21.4659 exploreP:0.1037
Episode:654 meanR:54.8700 R:70.0 gloss:5.1916 dloss:21.3462 exploreP:0.1031
Episode:655 meanR:54.6500 R:46.0 gloss:5.2394 dloss:21.4093 exploreP:0.1026
Episode:656 meanR:54.6000 R:51.0 gloss:5.0846 dloss:21.2146 exploreP:0.1022
Episode:657 meanR:54.6100 R:51.0 gloss:4.9385 dloss:21.0890 exploreP:0.1017
Episode:658 meanR:55.0800 R:80.0 gloss:5.2063 dloss:21.3202 exploreP:0.1010
Episode:659 

Episode:754 meanR:55.7600 R:44.0 gloss:4.2007 dloss:21.7653 exploreP:0.0633
Episode:755 meanR:55.8700 R:57.0 gloss:4.1102 dloss:21.8529 exploreP:0.0630
Episode:756 meanR:55.7200 R:36.0 gloss:4.0688 dloss:22.1559 exploreP:0.0628
Episode:757 meanR:55.6900 R:48.0 gloss:3.6616 dloss:21.7399 exploreP:0.0625
Episode:758 meanR:55.2900 R:40.0 gloss:3.3758 dloss:21.8804 exploreP:0.0623
Episode:759 meanR:55.6800 R:79.0 gloss:3.3838 dloss:21.8379 exploreP:0.0619
Episode:760 meanR:55.7800 R:61.0 gloss:3.5990 dloss:21.6367 exploreP:0.0616
Episode:761 meanR:55.6900 R:66.0 gloss:3.6333 dloss:21.3557 exploreP:0.0613
Episode:762 meanR:56.0000 R:91.0 gloss:3.8457 dloss:21.9959 exploreP:0.0608
Episode:763 meanR:56.1400 R:65.0 gloss:4.5757 dloss:22.2202 exploreP:0.0605
Episode:764 meanR:55.8200 R:39.0 gloss:4.5841 dloss:22.4681 exploreP:0.0603
Episode:765 meanR:55.6600 R:37.0 gloss:4.5799 dloss:22.6262 exploreP:0.0601
Episode:766 meanR:55.4100 R:62.0 gloss:4.7196 dloss:22.5138 exploreP:0.0598
Episode:767 

Episode:862 meanR:52.2000 R:38.0 gloss:2.6215 dloss:22.3348 exploreP:0.0401
Episode:863 meanR:52.5800 R:103.0 gloss:2.6286 dloss:22.0282 exploreP:0.0398
Episode:864 meanR:52.5900 R:40.0 gloss:2.5821 dloss:22.5250 exploreP:0.0397
Episode:865 meanR:52.6500 R:43.0 gloss:2.4106 dloss:22.5123 exploreP:0.0396
Episode:866 meanR:52.7400 R:71.0 gloss:2.3593 dloss:22.4115 exploreP:0.0394
Episode:867 meanR:52.5600 R:45.0 gloss:2.2241 dloss:22.5443 exploreP:0.0392
Episode:868 meanR:52.5900 R:69.0 gloss:2.3464 dloss:22.5309 exploreP:0.0390
Episode:869 meanR:52.1600 R:42.0 gloss:2.1871 dloss:22.9997 exploreP:0.0389
Episode:870 meanR:52.4400 R:94.0 gloss:2.0504 dloss:22.5098 exploreP:0.0387
Episode:871 meanR:53.0400 R:96.0 gloss:2.2836 dloss:22.8828 exploreP:0.0384
Episode:872 meanR:53.3300 R:68.0 gloss:2.1218 dloss:23.2248 exploreP:0.0382
Episode:873 meanR:53.4300 R:41.0 gloss:2.1359 dloss:23.3144 exploreP:0.0381
Episode:874 meanR:53.9300 R:81.0 gloss:1.8935 dloss:22.9838 exploreP:0.0378
Episode:875

Episode:970 meanR:59.1100 R:43.0 gloss:1.3462 dloss:21.9702 exploreP:0.0259
Episode:971 meanR:58.7400 R:59.0 gloss:1.4876 dloss:21.5966 exploreP:0.0258
Episode:972 meanR:58.5100 R:45.0 gloss:1.3486 dloss:21.8831 exploreP:0.0257
Episode:973 meanR:58.7600 R:66.0 gloss:1.2540 dloss:21.8353 exploreP:0.0256
Episode:974 meanR:58.5100 R:56.0 gloss:1.2579 dloss:21.5747 exploreP:0.0255
Episode:975 meanR:58.3800 R:44.0 gloss:1.2924 dloss:21.8101 exploreP:0.0254
Episode:976 meanR:58.3600 R:42.0 gloss:1.4926 dloss:21.7925 exploreP:0.0254
Episode:977 meanR:58.4800 R:73.0 gloss:1.6963 dloss:21.4616 exploreP:0.0253
Episode:978 meanR:57.9900 R:43.0 gloss:1.4707 dloss:21.7983 exploreP:0.0252
Episode:979 meanR:58.0600 R:49.0 gloss:1.3784 dloss:21.7676 exploreP:0.0251
Episode:980 meanR:58.4900 R:84.0 gloss:1.2826 dloss:21.8308 exploreP:0.0250
Episode:981 meanR:58.4000 R:49.0 gloss:1.2175 dloss:22.2019 exploreP:0.0249
Episode:982 meanR:58.2800 R:47.0 gloss:1.1862 dloss:22.2945 exploreP:0.0249
Episode:983 

Episode:1077 meanR:76.6200 R:56.0 gloss:3.1849 dloss:25.0638 exploreP:0.0171
Episode:1078 meanR:76.7700 R:58.0 gloss:1.9247 dloss:24.9669 exploreP:0.0171
Episode:1079 meanR:78.5400 R:226.0 gloss:2.3008 dloss:24.1629 exploreP:0.0169
Episode:1080 meanR:78.2000 R:50.0 gloss:1.8988 dloss:25.1681 exploreP:0.0169
Episode:1081 meanR:78.2000 R:49.0 gloss:1.6219 dloss:25.0855 exploreP:0.0168
Episode:1082 meanR:78.2500 R:52.0 gloss:1.4808 dloss:24.6589 exploreP:0.0168
Episode:1083 meanR:79.1700 R:153.0 gloss:1.7425 dloss:24.0776 exploreP:0.0167
Episode:1084 meanR:79.1000 R:54.0 gloss:2.6306 dloss:24.4907 exploreP:0.0167
Episode:1085 meanR:80.3400 R:203.0 gloss:3.9172 dloss:24.3009 exploreP:0.0165
Episode:1086 meanR:80.1200 R:45.0 gloss:3.9565 dloss:25.2757 exploreP:0.0165
Episode:1087 meanR:81.3700 R:179.0 gloss:4.2883 dloss:24.1295 exploreP:0.0164
Episode:1088 meanR:80.7900 R:58.0 gloss:4.5718 dloss:24.8733 exploreP:0.0163
Episode:1089 meanR:80.5600 R:43.0 gloss:3.7927 dloss:25.3295 exploreP:0.

Episode:1183 meanR:120.2700 R:500.0 gloss:5.0335 dloss:13.6263 exploreP:0.0120
Episode:1184 meanR:124.7300 R:500.0 gloss:5.3266 dloss:13.6254 exploreP:0.0119
Episode:1185 meanR:127.7000 R:500.0 gloss:4.9734 dloss:13.6264 exploreP:0.0118
Episode:1186 meanR:131.5500 R:430.0 gloss:4.7188 dloss:13.6426 exploreP:0.0117
Episode:1187 meanR:134.4000 R:464.0 gloss:4.0317 dloss:14.2068 exploreP:0.0117
Episode:1188 meanR:138.3100 R:449.0 gloss:5.2048 dloss:14.5366 exploreP:0.0116
Episode:1189 meanR:142.4800 R:460.0 gloss:2.8478 dloss:14.3610 exploreP:0.0115
Episode:1190 meanR:144.4900 R:242.0 gloss:1.7149 dloss:15.0911 exploreP:0.0115
Episode:1191 meanR:145.8500 R:224.0 gloss:1.5776 dloss:17.6003 exploreP:0.0114
Episode:1192 meanR:144.3300 R:69.0 gloss:3.9890 dloss:20.6440 exploreP:0.0114
Episode:1193 meanR:147.9700 R:443.0 gloss:1.1458 dloss:20.0496 exploreP:0.0114
Episode:1194 meanR:149.6400 R:217.0 gloss:1.5070 dloss:20.2248 exploreP:0.0113
Episode:1195 meanR:150.3900 R:135.0 gloss:2.2115 dlos

Episode:1288 meanR:157.5600 R:150.0 gloss:0.8742 dloss:23.7902 exploreP:0.0103
Episode:1289 meanR:153.4900 R:53.0 gloss:0.7861 dloss:23.7916 exploreP:0.0103
Episode:1290 meanR:151.7900 R:72.0 gloss:0.7403 dloss:24.7255 exploreP:0.0103
Episode:1291 meanR:151.3700 R:182.0 gloss:0.7340 dloss:24.0098 exploreP:0.0103
Episode:1292 meanR:151.5500 R:87.0 gloss:0.9394 dloss:24.6351 exploreP:0.0103
Episode:1293 meanR:147.9200 R:80.0 gloss:0.8371 dloss:24.3573 exploreP:0.0103
Episode:1294 meanR:147.0600 R:131.0 gloss:0.7084 dloss:24.7253 exploreP:0.0103
Episode:1295 meanR:146.8200 R:111.0 gloss:0.6176 dloss:24.0949 exploreP:0.0103
Episode:1296 meanR:146.8300 R:130.0 gloss:0.7759 dloss:24.7206 exploreP:0.0103
Episode:1297 meanR:146.8100 R:209.0 gloss:1.5216 dloss:24.1957 exploreP:0.0103
Episode:1298 meanR:147.8300 R:156.0 gloss:1.7590 dloss:23.8337 exploreP:0.0103
Episode:1299 meanR:145.0700 R:224.0 gloss:1.6166 dloss:23.6857 exploreP:0.0103
Episode:1300 meanR:145.9900 R:171.0 gloss:3.1944 dloss:2

Episode:1393 meanR:133.8800 R:193.0 gloss:2.3619 dloss:23.8286 exploreP:0.0101
Episode:1394 meanR:133.8000 R:123.0 gloss:2.4862 dloss:24.5426 exploreP:0.0101
Episode:1395 meanR:133.5700 R:88.0 gloss:1.6331 dloss:24.3558 exploreP:0.0101
Episode:1396 meanR:133.8700 R:160.0 gloss:1.5162 dloss:24.3119 exploreP:0.0101
Episode:1397 meanR:132.8700 R:109.0 gloss:1.4122 dloss:24.2105 exploreP:0.0101
Episode:1398 meanR:133.0100 R:170.0 gloss:1.1125 dloss:23.9571 exploreP:0.0101
Episode:1399 meanR:131.9400 R:117.0 gloss:1.2200 dloss:23.9794 exploreP:0.0101
Episode:1400 meanR:131.3500 R:112.0 gloss:1.1335 dloss:23.9371 exploreP:0.0101
Episode:1401 meanR:131.4100 R:160.0 gloss:1.0928 dloss:24.1130 exploreP:0.0101
Episode:1402 meanR:129.1000 R:73.0 gloss:0.8904 dloss:24.8371 exploreP:0.0101
Episode:1403 meanR:128.7400 R:67.0 gloss:0.8947 dloss:24.2567 exploreP:0.0101
Episode:1404 meanR:129.2500 R:157.0 gloss:0.9668 dloss:24.2766 exploreP:0.0101
Episode:1405 meanR:127.4300 R:59.0 gloss:0.9317 dloss:2

Episode:1498 meanR:111.6700 R:68.0 gloss:1.9544 dloss:24.8838 exploreP:0.0100
Episode:1499 meanR:111.0100 R:51.0 gloss:2.4545 dloss:24.4678 exploreP:0.0100
Episode:1500 meanR:111.0400 R:115.0 gloss:2.0702 dloss:24.4736 exploreP:0.0100
Episode:1501 meanR:109.9900 R:55.0 gloss:1.6739 dloss:24.4959 exploreP:0.0100
Episode:1502 meanR:110.4200 R:116.0 gloss:1.6340 dloss:24.8071 exploreP:0.0100
Episode:1503 meanR:110.8700 R:112.0 gloss:1.6355 dloss:24.4833 exploreP:0.0100
Episode:1504 meanR:110.1300 R:83.0 gloss:1.4813 dloss:24.4816 exploreP:0.0100
Episode:1505 meanR:110.3800 R:84.0 gloss:1.2238 dloss:24.4933 exploreP:0.0100
Episode:1506 meanR:111.0700 R:150.0 gloss:0.9883 dloss:24.5970 exploreP:0.0100
Episode:1507 meanR:109.4100 R:97.0 gloss:0.9512 dloss:24.2568 exploreP:0.0100
Episode:1508 meanR:110.2300 R:155.0 gloss:0.8385 dloss:24.3079 exploreP:0.0100
Episode:1509 meanR:109.1000 R:112.0 gloss:0.9475 dloss:24.4394 exploreP:0.0100
Episode:1510 meanR:108.9100 R:86.0 gloss:2.0271 dloss:24.4

Episode:1604 meanR:90.8300 R:97.0 gloss:1.1569 dloss:23.9474 exploreP:0.0100
Episode:1605 meanR:91.2600 R:127.0 gloss:0.8685 dloss:24.0628 exploreP:0.0100
Episode:1606 meanR:90.6900 R:93.0 gloss:0.8336 dloss:24.0117 exploreP:0.0100
Episode:1607 meanR:90.1700 R:45.0 gloss:0.7865 dloss:24.9725 exploreP:0.0100
Episode:1608 meanR:89.6100 R:99.0 gloss:0.7096 dloss:24.2043 exploreP:0.0100
Episode:1609 meanR:89.0200 R:53.0 gloss:0.7112 dloss:24.2244 exploreP:0.0100
Episode:1610 meanR:89.1100 R:95.0 gloss:0.7585 dloss:24.2596 exploreP:0.0100
Episode:1611 meanR:88.8500 R:78.0 gloss:1.4449 dloss:24.2127 exploreP:0.0100
Episode:1612 meanR:88.8300 R:117.0 gloss:2.5689 dloss:24.0219 exploreP:0.0100
Episode:1613 meanR:89.4500 R:111.0 gloss:6.3716 dloss:24.3476 exploreP:0.0100
Episode:1614 meanR:88.8900 R:90.0 gloss:3.6779 dloss:24.2554 exploreP:0.0100
Episode:1615 meanR:89.2900 R:107.0 gloss:2.0159 dloss:24.3466 exploreP:0.0100
Episode:1616 meanR:89.3700 R:104.0 gloss:1.3369 dloss:24.6647 exploreP:0

Episode:1710 meanR:96.7300 R:114.0 gloss:1.9926 dloss:24.2341 exploreP:0.0100
Episode:1711 meanR:97.9000 R:195.0 gloss:1.5698 dloss:24.2975 exploreP:0.0100
Episode:1712 meanR:97.5600 R:83.0 gloss:1.2527 dloss:24.2451 exploreP:0.0100
Episode:1713 meanR:97.5200 R:107.0 gloss:0.9026 dloss:24.3305 exploreP:0.0100
Episode:1714 meanR:97.5600 R:94.0 gloss:0.5988 dloss:24.3477 exploreP:0.0100
Episode:1715 meanR:97.4400 R:95.0 gloss:0.8049 dloss:24.0182 exploreP:0.0100
Episode:1716 meanR:97.4200 R:102.0 gloss:0.7634 dloss:24.4755 exploreP:0.0100
Episode:1717 meanR:97.6400 R:141.0 gloss:0.8339 dloss:24.7077 exploreP:0.0100
Episode:1718 meanR:97.6600 R:65.0 gloss:0.8289 dloss:24.4608 exploreP:0.0100
Episode:1719 meanR:97.6000 R:96.0 gloss:1.1194 dloss:24.4596 exploreP:0.0100
Episode:1720 meanR:97.8200 R:104.0 gloss:1.0271 dloss:24.4689 exploreP:0.0100
Episode:1721 meanR:97.3100 R:60.0 gloss:0.8253 dloss:25.0886 exploreP:0.0100
Episode:1722 meanR:96.9800 R:52.0 gloss:0.7478 dloss:25.0902 exploreP:

Episode:1816 meanR:105.0600 R:90.0 gloss:0.6081 dloss:24.8780 exploreP:0.0100
Episode:1817 meanR:104.4300 R:78.0 gloss:0.5052 dloss:24.4599 exploreP:0.0100
Episode:1818 meanR:105.2000 R:142.0 gloss:1.4197 dloss:24.1298 exploreP:0.0100
Episode:1819 meanR:105.2500 R:101.0 gloss:1.6618 dloss:24.4952 exploreP:0.0100
Episode:1820 meanR:105.8800 R:167.0 gloss:1.5368 dloss:24.3085 exploreP:0.0100
Episode:1821 meanR:106.5700 R:129.0 gloss:1.4709 dloss:24.3592 exploreP:0.0100
Episode:1822 meanR:108.1400 R:209.0 gloss:1.6084 dloss:24.0669 exploreP:0.0100
Episode:1823 meanR:108.6100 R:119.0 gloss:1.2099 dloss:24.1054 exploreP:0.0100
Episode:1824 meanR:108.7000 R:121.0 gloss:1.1439 dloss:23.9923 exploreP:0.0100
Episode:1825 meanR:108.4200 R:70.0 gloss:1.5072 dloss:23.8647 exploreP:0.0100
Episode:1826 meanR:108.2200 R:104.0 gloss:1.3315 dloss:24.1986 exploreP:0.0100
Episode:1827 meanR:108.9500 R:183.0 gloss:1.6047 dloss:23.9388 exploreP:0.0100
Episode:1828 meanR:109.9300 R:174.0 gloss:1.7623 dloss:

Episode:1921 meanR:115.2500 R:43.0 gloss:1.7020 dloss:24.3308 exploreP:0.0100
Episode:1922 meanR:114.3200 R:116.0 gloss:1.5440 dloss:24.5867 exploreP:0.0100
Episode:1923 meanR:114.0500 R:92.0 gloss:1.5556 dloss:24.7049 exploreP:0.0100
Episode:1924 meanR:113.8000 R:96.0 gloss:1.1793 dloss:24.4713 exploreP:0.0100
Episode:1925 meanR:113.6600 R:56.0 gloss:0.9082 dloss:25.1510 exploreP:0.0100
Episode:1926 meanR:113.2100 R:59.0 gloss:0.8989 dloss:25.0338 exploreP:0.0100
Episode:1927 meanR:111.8800 R:50.0 gloss:1.7521 dloss:24.5254 exploreP:0.0100
Episode:1928 meanR:110.9000 R:76.0 gloss:2.0566 dloss:24.6099 exploreP:0.0100
Episode:1929 meanR:110.7500 R:52.0 gloss:2.3026 dloss:24.6665 exploreP:0.0100
Episode:1930 meanR:111.7200 R:224.0 gloss:2.5655 dloss:23.9302 exploreP:0.0100
Episode:1931 meanR:112.4400 R:152.0 gloss:2.0048 dloss:23.9970 exploreP:0.0100
Episode:1932 meanR:112.6100 R:88.0 gloss:1.7920 dloss:24.4430 exploreP:0.0100
Episode:1933 meanR:112.2900 R:69.0 gloss:1.7070 dloss:24.5453

Episode:2026 meanR:106.4900 R:88.0 gloss:2.1069 dloss:24.9084 exploreP:0.0100
Episode:2027 meanR:106.5800 R:59.0 gloss:2.1403 dloss:24.7055 exploreP:0.0100
Episode:2028 meanR:106.2600 R:44.0 gloss:2.5933 dloss:24.9208 exploreP:0.0100
Episode:2029 meanR:106.1100 R:37.0 gloss:2.1230 dloss:24.7055 exploreP:0.0100
Episode:2030 meanR:104.5700 R:70.0 gloss:1.3503 dloss:24.5589 exploreP:0.0100
Episode:2031 meanR:104.3100 R:126.0 gloss:3.4556 dloss:24.2630 exploreP:0.0100
Episode:2032 meanR:106.0500 R:262.0 gloss:3.4593 dloss:24.0175 exploreP:0.0100
Episode:2033 meanR:106.5100 R:115.0 gloss:2.5704 dloss:24.4913 exploreP:0.0100
Episode:2034 meanR:106.7400 R:71.0 gloss:2.6362 dloss:24.4736 exploreP:0.0100
Episode:2035 meanR:105.4000 R:87.0 gloss:4.5064 dloss:24.8982 exploreP:0.0100
Episode:2036 meanR:107.2000 R:248.0 gloss:4.8650 dloss:23.9950 exploreP:0.0100
Episode:2037 meanR:107.5400 R:112.0 gloss:4.3546 dloss:23.3612 exploreP:0.0100
Episode:2038 meanR:107.6200 R:118.0 gloss:3.7164 dloss:23.7

Episode:2131 meanR:142.8400 R:164.0 gloss:2.1096 dloss:23.0918 exploreP:0.0100
Episode:2132 meanR:141.6800 R:146.0 gloss:1.7827 dloss:23.1872 exploreP:0.0100
Episode:2133 meanR:141.8700 R:134.0 gloss:1.5160 dloss:23.1681 exploreP:0.0100
Episode:2134 meanR:144.1000 R:294.0 gloss:1.7363 dloss:21.9719 exploreP:0.0100
Episode:2135 meanR:144.7700 R:154.0 gloss:1.2420 dloss:22.8517 exploreP:0.0100
Episode:2136 meanR:145.3400 R:305.0 gloss:2.5948 dloss:21.9038 exploreP:0.0100
Episode:2137 meanR:144.9800 R:76.0 gloss:4.2065 dloss:22.5053 exploreP:0.0100
Episode:2138 meanR:144.0600 R:26.0 gloss:3.6872 dloss:24.9192 exploreP:0.0100
Episode:2139 meanR:145.3800 R:233.0 gloss:3.8091 dloss:22.4078 exploreP:0.0100
Episode:2140 meanR:143.8300 R:51.0 gloss:4.0138 dloss:24.0709 exploreP:0.0100
Episode:2141 meanR:142.3100 R:8.0 gloss:4.1122 dloss:26.8130 exploreP:0.0100
Episode:2142 meanR:142.9500 R:164.0 gloss:3.0299 dloss:24.4742 exploreP:0.0100
Episode:2143 meanR:142.8900 R:182.0 gloss:2.4273 dloss:24

Episode:2236 meanR:74.6500 R:103.0 gloss:2.8775 dloss:23.0329 exploreP:0.0100
Episode:2237 meanR:74.3400 R:45.0 gloss:3.0030 dloss:24.3705 exploreP:0.0100
Episode:2238 meanR:74.7000 R:62.0 gloss:2.7407 dloss:23.8554 exploreP:0.0100
Episode:2239 meanR:72.5000 R:13.0 gloss:2.5569 dloss:24.6007 exploreP:0.0100
Episode:2240 meanR:73.9100 R:192.0 gloss:2.1705 dloss:23.1165 exploreP:0.0100
Episode:2241 meanR:74.8900 R:106.0 gloss:1.4625 dloss:23.6939 exploreP:0.0100
Episode:2242 meanR:73.5300 R:28.0 gloss:1.3685 dloss:24.0189 exploreP:0.0100
Episode:2243 meanR:72.8900 R:118.0 gloss:1.2569 dloss:23.0724 exploreP:0.0100
Episode:2244 meanR:73.4400 R:187.0 gloss:1.0717 dloss:23.0375 exploreP:0.0100
Episode:2245 meanR:73.2800 R:51.0 gloss:0.8812 dloss:24.6615 exploreP:0.0100
Episode:2246 meanR:73.0100 R:29.0 gloss:0.7862 dloss:24.4118 exploreP:0.0100
Episode:2247 meanR:71.8300 R:11.0 gloss:0.7609 dloss:23.7722 exploreP:0.0100
Episode:2248 meanR:71.5100 R:128.0 gloss:0.6286 dloss:23.7251 exploreP:

Episode:2343 meanR:53.8000 R:30.0 gloss:0.4241 dloss:20.5112 exploreP:0.0100
Episode:2344 meanR:52.1600 R:23.0 gloss:0.4111 dloss:20.9241 exploreP:0.0100
Episode:2345 meanR:52.0700 R:42.0 gloss:0.4070 dloss:21.3721 exploreP:0.0100
Episode:2346 meanR:51.9800 R:20.0 gloss:0.3884 dloss:21.6245 exploreP:0.0100
Episode:2347 meanR:51.9600 R:9.0 gloss:0.3814 dloss:21.6445 exploreP:0.0100
Episode:2348 meanR:51.0600 R:38.0 gloss:0.3604 dloss:20.5294 exploreP:0.0100
Episode:2349 meanR:51.0800 R:11.0 gloss:0.3512 dloss:20.6259 exploreP:0.0100
Episode:2350 meanR:51.8400 R:95.0 gloss:0.3102 dloss:19.8917 exploreP:0.0100
Episode:2351 meanR:52.4600 R:79.0 gloss:0.2708 dloss:20.6082 exploreP:0.0100
Episode:2352 meanR:51.1100 R:42.0 gloss:0.2006 dloss:20.8532 exploreP:0.0100
Episode:2353 meanR:50.8100 R:28.0 gloss:0.1980 dloss:20.8613 exploreP:0.0100
Episode:2354 meanR:51.4500 R:86.0 gloss:0.1949 dloss:20.7929 exploreP:0.0100
Episode:2355 meanR:51.3800 R:46.0 gloss:0.1844 dloss:20.6461 exploreP:0.0100


Episode:2450 meanR:46.2400 R:47.0 gloss:1.2651 dloss:21.4668 exploreP:0.0100
Episode:2451 meanR:45.8000 R:35.0 gloss:1.2726 dloss:21.5283 exploreP:0.0100
Episode:2452 meanR:46.0600 R:68.0 gloss:1.2949 dloss:22.1803 exploreP:0.0100
Episode:2453 meanR:47.0600 R:128.0 gloss:1.0895 dloss:23.3121 exploreP:0.0100
Episode:2454 meanR:46.3300 R:13.0 gloss:0.8507 dloss:24.8865 exploreP:0.0100
Episode:2455 meanR:46.8300 R:96.0 gloss:0.7471 dloss:23.2617 exploreP:0.0100
Episode:2456 meanR:46.1600 R:17.0 gloss:2.1924 dloss:23.1913 exploreP:0.0100
Episode:2457 meanR:46.1300 R:11.0 gloss:2.0656 dloss:24.1708 exploreP:0.0100
Episode:2458 meanR:45.8700 R:13.0 gloss:1.9084 dloss:23.9541 exploreP:0.0100
Episode:2459 meanR:45.4300 R:17.0 gloss:1.6582 dloss:23.0852 exploreP:0.0100
Episode:2460 meanR:45.4600 R:118.0 gloss:1.3753 dloss:21.6913 exploreP:0.0100
Episode:2461 meanR:44.9800 R:13.0 gloss:1.2487 dloss:22.8304 exploreP:0.0100
Episode:2462 meanR:45.6800 R:89.0 gloss:1.0980 dloss:21.4702 exploreP:0.01

Episode:2557 meanR:47.1600 R:15.0 gloss:0.3773 dloss:19.3253 exploreP:0.0100
Episode:2558 meanR:48.5700 R:154.0 gloss:0.3457 dloss:18.4755 exploreP:0.0100
Episode:2559 meanR:48.5200 R:12.0 gloss:0.3422 dloss:19.1842 exploreP:0.0100
Episode:2560 meanR:47.5200 R:18.0 gloss:0.3426 dloss:18.6104 exploreP:0.0100
Episode:2561 meanR:48.0500 R:66.0 gloss:0.3251 dloss:18.3231 exploreP:0.0100
Episode:2562 meanR:47.3200 R:16.0 gloss:0.3004 dloss:18.6406 exploreP:0.0100
Episode:2563 meanR:47.8500 R:88.0 gloss:0.3224 dloss:18.6880 exploreP:0.0100
Episode:2564 meanR:47.0400 R:10.0 gloss:0.3415 dloss:21.1725 exploreP:0.0100
Episode:2565 meanR:47.2800 R:36.0 gloss:0.7312 dloss:20.1836 exploreP:0.0100
Episode:2566 meanR:47.5400 R:54.0 gloss:0.7868 dloss:19.9822 exploreP:0.0100
Episode:2567 meanR:47.4000 R:14.0 gloss:0.7192 dloss:20.2205 exploreP:0.0100
Episode:2568 meanR:46.6400 R:26.0 gloss:0.6974 dloss:19.2667 exploreP:0.0100
Episode:2569 meanR:46.4800 R:20.0 gloss:0.7083 dloss:19.4730 exploreP:0.010

Episode:2664 meanR:38.6200 R:9.0 gloss:1.0065 dloss:18.9128 exploreP:0.0100
Episode:2665 meanR:38.4000 R:14.0 gloss:1.1331 dloss:18.8266 exploreP:0.0100
Episode:2666 meanR:38.3500 R:49.0 gloss:1.5634 dloss:18.5531 exploreP:0.0100
Episode:2667 meanR:38.3800 R:17.0 gloss:1.5624 dloss:19.1360 exploreP:0.0100
Episode:2668 meanR:38.5400 R:42.0 gloss:1.4792 dloss:19.1990 exploreP:0.0100
Episode:2669 meanR:38.7600 R:42.0 gloss:1.4153 dloss:19.8545 exploreP:0.0100
Episode:2670 meanR:38.7700 R:18.0 gloss:1.4009 dloss:20.8093 exploreP:0.0100
Episode:2671 meanR:38.3800 R:24.0 gloss:1.2787 dloss:20.2748 exploreP:0.0100
Episode:2672 meanR:38.4100 R:28.0 gloss:1.1990 dloss:19.4356 exploreP:0.0100
Episode:2673 meanR:38.7700 R:66.0 gloss:1.1749 dloss:19.8249 exploreP:0.0100
Episode:2674 meanR:38.9300 R:29.0 gloss:1.1166 dloss:20.2182 exploreP:0.0100
Episode:2675 meanR:38.8800 R:30.0 gloss:1.0780 dloss:19.5313 exploreP:0.0100
Episode:2676 meanR:37.5900 R:10.0 gloss:1.0748 dloss:20.1139 exploreP:0.0100


Episode:2771 meanR:43.2700 R:11.0 gloss:2.0652 dloss:21.0792 exploreP:0.0100
Episode:2772 meanR:43.7300 R:74.0 gloss:1.9573 dloss:19.9819 exploreP:0.0100
Episode:2773 meanR:43.7700 R:70.0 gloss:1.8929 dloss:20.6224 exploreP:0.0100
Episode:2774 meanR:43.7500 R:27.0 gloss:2.2035 dloss:21.3730 exploreP:0.0100
Episode:2775 meanR:44.1100 R:66.0 gloss:1.5411 dloss:20.6644 exploreP:0.0100
Episode:2776 meanR:44.1000 R:9.0 gloss:1.2580 dloss:20.8110 exploreP:0.0100
Episode:2777 meanR:43.8600 R:10.0 gloss:1.2059 dloss:20.9527 exploreP:0.0100
Episode:2778 meanR:43.7200 R:14.0 gloss:1.1740 dloss:20.7938 exploreP:0.0100
Episode:2779 meanR:43.9300 R:46.0 gloss:1.1191 dloss:19.9960 exploreP:0.0100
Episode:2780 meanR:43.8500 R:17.0 gloss:1.1152 dloss:20.5315 exploreP:0.0100
Episode:2781 meanR:44.4600 R:85.0 gloss:1.1195 dloss:20.0513 exploreP:0.0100
Episode:2782 meanR:44.3300 R:12.0 gloss:1.0917 dloss:21.6123 exploreP:0.0100
Episode:2783 meanR:44.3200 R:44.0 gloss:1.0695 dloss:20.4868 exploreP:0.0100


Episode:2878 meanR:54.7600 R:130.0 gloss:1.1238 dloss:22.6525 exploreP:0.0100
Episode:2879 meanR:54.7600 R:46.0 gloss:1.0690 dloss:23.0531 exploreP:0.0100
Episode:2880 meanR:54.6800 R:9.0 gloss:1.0209 dloss:23.9487 exploreP:0.0100
Episode:2881 meanR:54.1700 R:34.0 gloss:1.0787 dloss:22.7769 exploreP:0.0100
Episode:2882 meanR:54.9800 R:93.0 gloss:1.2670 dloss:22.9916 exploreP:0.0100
Episode:2883 meanR:54.9000 R:36.0 gloss:1.0852 dloss:22.8624 exploreP:0.0100
Episode:2884 meanR:54.8400 R:79.0 gloss:0.9434 dloss:22.5196 exploreP:0.0100
Episode:2885 meanR:55.0600 R:35.0 gloss:0.9138 dloss:22.5565 exploreP:0.0100
Episode:2886 meanR:53.6300 R:65.0 gloss:0.9785 dloss:22.7414 exploreP:0.0100
Episode:2887 meanR:53.8600 R:53.0 gloss:1.1597 dloss:22.2945 exploreP:0.0100
Episode:2888 meanR:53.8100 R:11.0 gloss:1.2245 dloss:24.0244 exploreP:0.0100
Episode:2889 meanR:54.5100 R:106.0 gloss:1.3990 dloss:22.5854 exploreP:0.0100
Episode:2890 meanR:54.1300 R:31.0 gloss:2.0332 dloss:23.2098 exploreP:0.010

Episode:2985 meanR:61.1400 R:34.0 gloss:1.3379 dloss:22.2178 exploreP:0.0100
Episode:2986 meanR:61.9700 R:148.0 gloss:1.0832 dloss:22.5952 exploreP:0.0100
Episode:2987 meanR:63.1600 R:172.0 gloss:0.5950 dloss:23.0477 exploreP:0.0100
Episode:2988 meanR:64.0600 R:101.0 gloss:1.1817 dloss:23.1678 exploreP:0.0100
Episode:2989 meanR:63.1200 R:12.0 gloss:2.1513 dloss:24.9648 exploreP:0.0100
Episode:2990 meanR:63.0700 R:26.0 gloss:1.6115 dloss:24.3705 exploreP:0.0100
Episode:2991 meanR:62.8500 R:75.0 gloss:0.8437 dloss:22.9853 exploreP:0.0100
Episode:2992 meanR:63.1900 R:70.0 gloss:0.5999 dloss:22.5006 exploreP:0.0100
Episode:2993 meanR:64.1000 R:146.0 gloss:0.5067 dloss:22.6952 exploreP:0.0100
Episode:2994 meanR:62.2100 R:11.0 gloss:0.4371 dloss:23.8063 exploreP:0.0100
Episode:2995 meanR:61.8100 R:19.0 gloss:0.4271 dloss:24.6663 exploreP:0.0100
Episode:2996 meanR:61.5900 R:26.0 gloss:0.4153 dloss:23.7595 exploreP:0.0100
Episode:2997 meanR:61.3700 R:10.0 gloss:0.3925 dloss:23.7393 exploreP:0.

Episode:3092 meanR:49.8400 R:29.0 gloss:0.6267 dloss:21.5857 exploreP:0.0100
Episode:3093 meanR:48.5400 R:16.0 gloss:0.6214 dloss:21.8900 exploreP:0.0100
Episode:3094 meanR:49.0400 R:61.0 gloss:0.5642 dloss:20.9513 exploreP:0.0100
Episode:3095 meanR:48.9400 R:9.0 gloss:0.5589 dloss:21.2910 exploreP:0.0100
Episode:3096 meanR:48.8100 R:13.0 gloss:0.5542 dloss:22.1305 exploreP:0.0100
Episode:3097 meanR:49.1400 R:43.0 gloss:0.5239 dloss:20.8694 exploreP:0.0100
Episode:3098 meanR:49.4000 R:47.0 gloss:0.4995 dloss:20.8222 exploreP:0.0100
Episode:3099 meanR:49.5000 R:20.0 gloss:0.4709 dloss:20.9056 exploreP:0.0100
Episode:3100 meanR:50.7100 R:131.0 gloss:0.4287 dloss:20.0880 exploreP:0.0100
Episode:3101 meanR:50.9000 R:30.0 gloss:0.4240 dloss:20.9787 exploreP:0.0100
Episode:3102 meanR:51.0300 R:37.0 gloss:0.4560 dloss:21.7847 exploreP:0.0100
Episode:3103 meanR:50.9500 R:31.0 gloss:0.4348 dloss:20.8561 exploreP:0.0100
Episode:3104 meanR:50.1500 R:9.0 gloss:0.4362 dloss:22.2986 exploreP:0.0100


Episode:3199 meanR:47.6300 R:29.0 gloss:0.4186 dloss:23.1664 exploreP:0.0100
Episode:3200 meanR:46.5800 R:26.0 gloss:0.8214 dloss:23.5292 exploreP:0.0100
Episode:3201 meanR:46.3700 R:9.0 gloss:2.4975 dloss:24.1396 exploreP:0.0100
Episode:3202 meanR:47.1500 R:115.0 gloss:2.0624 dloss:22.3021 exploreP:0.0100
Episode:3203 meanR:47.2200 R:38.0 gloss:1.7815 dloss:22.7040 exploreP:0.0100
Episode:3204 meanR:47.4800 R:35.0 gloss:2.0561 dloss:22.7454 exploreP:0.0100
Episode:3205 meanR:48.6400 R:130.0 gloss:1.8322 dloss:21.8665 exploreP:0.0100
Episode:3206 meanR:49.0500 R:68.0 gloss:1.9604 dloss:22.1778 exploreP:0.0100
Episode:3207 meanR:48.7900 R:41.0 gloss:1.9334 dloss:22.5671 exploreP:0.0100
Episode:3208 meanR:48.9400 R:82.0 gloss:1.8994 dloss:21.9069 exploreP:0.0100
Episode:3209 meanR:49.1500 R:33.0 gloss:1.7620 dloss:21.9355 exploreP:0.0100
Episode:3210 meanR:49.7700 R:96.0 gloss:1.5605 dloss:20.9572 exploreP:0.0100
Episode:3211 meanR:49.9500 R:31.0 gloss:1.5043 dloss:21.1550 exploreP:0.010

Episode:3306 meanR:55.4200 R:23.0 gloss:0.4183 dloss:19.9202 exploreP:0.0100
Episode:3307 meanR:55.2200 R:21.0 gloss:0.4072 dloss:19.8504 exploreP:0.0100
Episode:3308 meanR:54.5700 R:17.0 gloss:0.4091 dloss:19.7182 exploreP:0.0100
Episode:3309 meanR:54.4100 R:17.0 gloss:0.4298 dloss:19.8895 exploreP:0.0100
Episode:3310 meanR:53.6400 R:19.0 gloss:0.4200 dloss:19.8741 exploreP:0.0100
Episode:3311 meanR:53.7200 R:39.0 gloss:0.3772 dloss:18.8583 exploreP:0.0100
Episode:3312 meanR:53.7300 R:31.0 gloss:0.3479 dloss:19.5718 exploreP:0.0100
Episode:3313 meanR:51.8600 R:18.0 gloss:0.3403 dloss:19.1152 exploreP:0.0100
Episode:3314 meanR:51.7900 R:31.0 gloss:0.3265 dloss:19.0991 exploreP:0.0100
Episode:3315 meanR:52.0600 R:74.0 gloss:0.3054 dloss:18.5849 exploreP:0.0100
Episode:3316 meanR:52.5000 R:87.0 gloss:0.3049 dloss:18.1912 exploreP:0.0100
Episode:3317 meanR:52.4300 R:31.0 gloss:0.3232 dloss:18.1817 exploreP:0.0100
Episode:3318 meanR:52.3500 R:31.0 gloss:0.3321 dloss:18.3235 exploreP:0.0100

Episode:3413 meanR:43.0700 R:237.0 gloss:2.0209 dloss:20.1458 exploreP:0.0100
Episode:3414 meanR:44.0200 R:126.0 gloss:2.1733 dloss:20.8569 exploreP:0.0100
Episode:3415 meanR:44.0700 R:79.0 gloss:2.0387 dloss:21.7308 exploreP:0.0100
Episode:3416 meanR:43.9200 R:72.0 gloss:2.2410 dloss:22.0248 exploreP:0.0100
Episode:3417 meanR:43.7600 R:15.0 gloss:2.6942 dloss:22.7884 exploreP:0.0100
Episode:3418 meanR:43.7800 R:33.0 gloss:2.4936 dloss:22.8433 exploreP:0.0100
Episode:3419 meanR:44.0400 R:36.0 gloss:2.2499 dloss:22.3807 exploreP:0.0100
Episode:3420 meanR:43.9200 R:19.0 gloss:2.1011 dloss:22.0556 exploreP:0.0100
Episode:3421 meanR:44.0200 R:29.0 gloss:2.0688 dloss:21.7018 exploreP:0.0100
Episode:3422 meanR:44.2600 R:43.0 gloss:2.3576 dloss:21.4809 exploreP:0.0100
Episode:3423 meanR:44.0800 R:15.0 gloss:2.4017 dloss:22.4447 exploreP:0.0100
Episode:3424 meanR:46.4800 R:255.0 gloss:1.9873 dloss:22.8230 exploreP:0.0100
Episode:3425 meanR:46.8200 R:82.0 gloss:1.8724 dloss:23.9120 exploreP:0.0

Episode:3520 meanR:41.8200 R:89.0 gloss:0.9338 dloss:18.9998 exploreP:0.0100
Episode:3521 meanR:41.7800 R:25.0 gloss:0.9092 dloss:19.5113 exploreP:0.0100
Episode:3522 meanR:41.5900 R:24.0 gloss:0.9060 dloss:19.2301 exploreP:0.0100
Episode:3523 meanR:41.7500 R:31.0 gloss:0.8956 dloss:18.9679 exploreP:0.0100
Episode:3524 meanR:41.7500 R:255.0 gloss:0.8464 dloss:19.6779 exploreP:0.0100
Episode:3525 meanR:41.1500 R:22.0 gloss:0.8238 dloss:20.7384 exploreP:0.0100
Episode:3526 meanR:41.2500 R:37.0 gloss:0.7818 dloss:19.8512 exploreP:0.0100
Episode:3527 meanR:41.1600 R:10.0 gloss:0.7912 dloss:20.5839 exploreP:0.0100
Episode:3528 meanR:40.9200 R:14.0 gloss:0.7523 dloss:20.2593 exploreP:0.0100
Episode:3529 meanR:41.1600 R:53.0 gloss:0.6595 dloss:19.3297 exploreP:0.0100
Episode:3530 meanR:40.9300 R:19.0 gloss:0.6977 dloss:20.6231 exploreP:0.0100
Episode:3531 meanR:41.1300 R:55.0 gloss:0.6386 dloss:19.9474 exploreP:0.0100
Episode:3532 meanR:41.1500 R:43.0 gloss:0.5961 dloss:20.2770 exploreP:0.010

Episode:3627 meanR:39.4000 R:56.0 gloss:1.0333 dloss:22.7343 exploreP:0.0100
Episode:3628 meanR:39.5600 R:30.0 gloss:0.9883 dloss:23.5463 exploreP:0.0100
Episode:3629 meanR:39.1500 R:12.0 gloss:0.9445 dloss:25.0330 exploreP:0.0100
Episode:3630 meanR:39.4000 R:44.0 gloss:0.8256 dloss:23.6575 exploreP:0.0100
Episode:3631 meanR:39.0200 R:17.0 gloss:0.7616 dloss:23.9016 exploreP:0.0100
Episode:3632 meanR:39.2000 R:61.0 gloss:0.8572 dloss:22.6978 exploreP:0.0100
Episode:3633 meanR:39.2700 R:32.0 gloss:1.1073 dloss:22.6654 exploreP:0.0100
Episode:3634 meanR:39.8600 R:70.0 gloss:0.9917 dloss:21.8671 exploreP:0.0100
Episode:3635 meanR:40.0800 R:68.0 gloss:0.9871 dloss:21.3630 exploreP:0.0100
Episode:3636 meanR:39.9100 R:18.0 gloss:1.0263 dloss:21.2522 exploreP:0.0100
Episode:3637 meanR:40.1100 R:41.0 gloss:1.0453 dloss:22.3116 exploreP:0.0100
Episode:3638 meanR:39.8400 R:33.0 gloss:1.0205 dloss:21.8841 exploreP:0.0100
Episode:3639 meanR:39.9700 R:26.0 gloss:1.0101 dloss:22.1621 exploreP:0.0100

Episode:3734 meanR:38.3000 R:35.0 gloss:0.7122 dloss:18.7060 exploreP:0.0100
Episode:3735 meanR:38.2500 R:63.0 gloss:0.8715 dloss:18.8764 exploreP:0.0100
Episode:3736 meanR:38.5400 R:47.0 gloss:0.9531 dloss:19.2040 exploreP:0.0100
Episode:3737 meanR:38.3400 R:21.0 gloss:0.8623 dloss:19.4312 exploreP:0.0100
Episode:3738 meanR:38.3600 R:35.0 gloss:0.8028 dloss:19.2941 exploreP:0.0100
Episode:3739 meanR:38.4400 R:34.0 gloss:1.0790 dloss:18.8854 exploreP:0.0100
Episode:3740 meanR:37.9800 R:9.0 gloss:1.0059 dloss:19.1681 exploreP:0.0100
Episode:3741 meanR:38.2200 R:37.0 gloss:0.8713 dloss:18.7506 exploreP:0.0100
Episode:3742 meanR:38.3100 R:24.0 gloss:0.7689 dloss:18.3323 exploreP:0.0100
Episode:3743 meanR:38.2200 R:27.0 gloss:0.7315 dloss:18.2731 exploreP:0.0100
Episode:3744 meanR:36.7200 R:49.0 gloss:0.6957 dloss:18.0893 exploreP:0.0100
Episode:3745 meanR:36.6800 R:9.0 gloss:0.6777 dloss:18.9781 exploreP:0.0100
Episode:3746 meanR:36.8500 R:46.0 gloss:0.6546 dloss:17.8927 exploreP:0.0100
E

Episode:3841 meanR:41.8200 R:60.0 gloss:0.3099 dloss:18.8478 exploreP:0.0100
Episode:3842 meanR:41.8400 R:26.0 gloss:0.2998 dloss:19.2896 exploreP:0.0100
Episode:3843 meanR:42.1100 R:54.0 gloss:0.2790 dloss:19.2327 exploreP:0.0100
Episode:3844 meanR:41.8300 R:21.0 gloss:0.2681 dloss:19.7056 exploreP:0.0100
Episode:3845 meanR:42.0900 R:35.0 gloss:0.2608 dloss:19.0295 exploreP:0.0100
Episode:3846 meanR:44.0900 R:246.0 gloss:0.8679 dloss:19.2874 exploreP:0.0100
Episode:3847 meanR:46.1400 R:232.0 gloss:1.5230 dloss:21.2051 exploreP:0.0100
Episode:3848 meanR:46.8900 R:120.0 gloss:1.6547 dloss:23.0350 exploreP:0.0100
Episode:3849 meanR:48.4400 R:167.0 gloss:1.9876 dloss:23.1954 exploreP:0.0100
Episode:3850 meanR:48.5500 R:39.0 gloss:2.4172 dloss:23.3499 exploreP:0.0100
Episode:3851 meanR:49.5600 R:118.0 gloss:2.5545 dloss:24.0172 exploreP:0.0100
Episode:3852 meanR:49.5100 R:24.0 gloss:2.6720 dloss:25.6164 exploreP:0.0100
Episode:3853 meanR:49.1300 R:10.0 gloss:2.5142 dloss:24.7594 exploreP:0

Episode:3948 meanR:39.1000 R:37.0 gloss:0.5565 dloss:16.4845 exploreP:0.0100
Episode:3949 meanR:37.8500 R:42.0 gloss:0.5245 dloss:16.7502 exploreP:0.0100
Episode:3950 meanR:37.7500 R:29.0 gloss:0.6361 dloss:16.5817 exploreP:0.0100
Episode:3951 meanR:36.9900 R:42.0 gloss:0.6023 dloss:16.2613 exploreP:0.0100
Episode:3952 meanR:37.0500 R:30.0 gloss:0.5194 dloss:16.5566 exploreP:0.0100
Episode:3953 meanR:37.3400 R:39.0 gloss:0.4186 dloss:16.1355 exploreP:0.0100
Episode:3954 meanR:36.9400 R:10.0 gloss:0.3931 dloss:16.3925 exploreP:0.0100
Episode:3955 meanR:37.0600 R:72.0 gloss:0.3678 dloss:16.2291 exploreP:0.0100
Episode:3956 meanR:37.1500 R:18.0 gloss:0.3280 dloss:16.4084 exploreP:0.0100
Episode:3957 meanR:36.2000 R:22.0 gloss:0.3169 dloss:16.3457 exploreP:0.0100
Episode:3958 meanR:36.9100 R:94.0 gloss:0.2716 dloss:16.6196 exploreP:0.0100
Episode:3959 meanR:34.8600 R:13.0 gloss:0.2455 dloss:17.1326 exploreP:0.0100
Episode:3960 meanR:34.8400 R:20.0 gloss:0.2423 dloss:16.6034 exploreP:0.0100

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(gloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(dloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

## Testing

Let's checkout how our trained agent plays the game.

In [36]:
import gym
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Episodes/epochs
    for _ in range(1):
        state = env.reset()
        total_reward = 0

        # Steps/batches
        while True:
            env.render()
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                print('total_reward: {}'.format(total_reward))
                break
                
env.close()

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward: 500.0


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.