# DQN for continuous action space


In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
# Check which devices (GPU or CPU) TensorFlow can use
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included `gym` as a submodule, so you can run `git submodule update --init --recursive` to pull the contents into the `gym` directory.

>**Note:** Once OpenAI Gym is cloned, install it by running `pip install -e gym/[all]`.

In [2]:
import gym

## Create the Cart-Pole game environment
# env = gym.make('CartPole-v0')  # disabled: the next line would overwrite it anyway
env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('MountainCarContinuous-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing an action as an integer to `env.step` advances the simulation by one step. You can see how many actions are possible from `env.action_space`, and `env.action_space.sample()` returns a random action. This is general to all Gym games. In the Cart-Pole game there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [3]:
env.reset()
batch = []
for _ in range(1111):
    #env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    batch.append([action, state, reward, done, info])
    #print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

To shut the window showing the simulation, use `env.close()`.

In [4]:
# env.close()

If you ran the simulation above, we can look at the rewards:

In [5]:
batch[0], 
batch[0][1].shape, state.shape

((4,), (4,))

In [6]:
import numpy as np
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
infos = np.array([each[4] for each in batch])

In [7]:
# print(rewards[-20:])
print('shapes:', np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print('dtypes:', np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('states:', np.max(np.array(states)), np.min(np.array(states)))
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print('rewards:', np.max(np.array(rewards)), np.min(np.array(rewards)))

shapes: (1111,) (1111, 4) (1111,) (1111,)
dtypes: float64 float64 int64 bool
states: 2.339260107412202 -2.8400300900230797
actions: 1 0
rewards: 1.0 1.0


In [8]:
actions[:10]

array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1])

In [9]:
rewards[:10]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [10]:
# import numpy as np
def sigmoid(x, derivative=False):
  return x*(1-x) if derivative else 1/(1+np.exp(-x))

In [11]:
sigmoid(np.max(np.array(rewards))), sigmoid(np.min(np.array(rewards)))

(0.7310585786300049, 0.7310585786300049)

In [12]:
print('rewards:', np.max(np.array(rewards))/100, np.min(np.array(rewards))/100)

rewards: 0.01 0.01


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.
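
To make "the longer the game runs, the more reward we get" concrete, here is a minimal sketch of the discounted return of an episode. The helper name `discounted_return` and the two example episodes are assumptions for illustration; the discount `gamma = 0.99` matches the value set in the hyperparameters later in this notebook.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Cart-Pole pays 1.0 per frame, so an episode of n frames has n rewards of 1.0.
short_episode = [1.0] * 10    # hypothetical episode that ends after 10 frames
long_episode = [1.0] * 100    # hypothetical episode that lasts 100 frames

print(discounted_return(short_episode))
print(discounted_return(long_episode))
```

The longer episode always yields the larger return, which is the quantity the agent is trying to maximize.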

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
$$

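where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount factor, $s'$ is the next state reached from state $s$ by taking action $a$, and $a'$ ranges over the actions available in $s'$.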

Previously, we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so ignoring floating-point precision, you practically have infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.
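
As a rough illustration of why a table does not scale, the sketch below counts the cells a Q-table would need if each of the four state values were discretized into the same number of bins. The function `table_entries` and the bin counts are assumptions chosen for illustration only; the notebook itself does not discretize the state.

```python
def table_entries(bins_per_dimension, state_dimensions=4, num_actions=2):
    """Number of Q-table cells when every state variable is split into equal bin counts."""
    return (bins_per_dimension ** state_dimensions) * num_actions

for bins in (10, 100, 1000):
    print(bins, table_entries(bins))
# 10 20000
# 100 200000000
# 1000 2000000000000
```

Even a coarse grid grows with the fourth power of the resolution, whereas the network keeps a fixed number of weights regardless of how finely the state is measured.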

<img src="assets/deep-q-learning.png" width=450px>

Now the Q-value $Q(s, a)$ is computed by passing a state through a network with fully connected hidden layers. The output is one Q-value for each available action.
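
Given one Q-value per action, choosing an action comes down to an argmax, optionally mixed with random exploration as the training loop later does. The following is a minimal sketch in plain Python; the function names and the example Q-values are hypothetical and not part of the notebook's model.

```python
import random

def greedy_action(q_values):
    """Index of the largest Q-value (the first one wins on ties)."""
    return max(range(len(q_values)), key=lambda i: q_values[i])

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)

q_values = [0.3, 1.2]  # hypothetical network output for actions 0 and 1
print(greedy_action(q_values))
print(epsilon_greedy_action(q_values, epsilon=0.1, rng=random.Random(0)))
```

With `epsilon=0.0` the agent always exploits its current estimates, and with `epsilon=1.0` it acts purely at random.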

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
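
Here is a small numeric sketch of the target and the squared-error loss for a batch of transitions. All numbers are made up for illustration, and the helper names are assumptions. Dropping the bootstrap term after a terminal step mirrors the `(1-dones)` factor used in the training loop further down.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Targets r + gamma * max_a' Q(s', a'), with no bootstrap after terminal steps."""
    targets = []
    for reward, next_qs, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else max(next_qs)
        targets.append(reward + gamma * bootstrap)
    return targets

def mean_squared_error(targets, predictions):
    """Average of (target - prediction)**2 over the batch."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

# Hypothetical batch of two transitions; the second one ends the episode.
rewards = [1.0, 1.0]
next_q_values = [[2.0, 3.0], [5.0, 4.0]]
dones = [False, True]
current_q = [3.5, 1.5]  # hypothetical Q(s, a) for the actions actually taken

targets = td_targets(rewards, next_q_values, dones)
print(targets)
print(mean_squared_error(targets, current_q))
```

The first target bootstraps from the best next-state value, while the second is just the reward because the episode has ended.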

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected hidden layers, each followed by batch normalization and a leaky ReLU activation. Two seems to be good enough; three might be better. Feel free to try it out.
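
The layers in the following cells compute their activation as `tf.maximum(alpha * bn, bn)`, which is the leaky variant of ReLU. The scalar sketch below, written in plain Python rather than TensorFlow, shows how that expression behaves compared with a standard ReLU; the function names are assumptions for illustration.

```python
def leaky_relu(x, alpha=0.1):
    """max(alpha * x, x): identity for x >= 0, slope alpha for x < 0 (for 0 < alpha < 1)."""
    return max(alpha * x, x)

def relu(x):
    """Standard ReLU: negative inputs are clipped to zero."""
    return max(0.0, x)

for value in (-2.0, 0.0, 3.0):
    print(value, relu(value), leaky_relu(value))
```

Negative inputs keep a small slope instead of being zeroed, so a gradient still flows through units whose pre-activation is negative.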

In [13]:
def model_input(state_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float64, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float64, [None], name='targetQs')
    return states, actions, targetQs

In [14]:
# Generator/Controller: Generating/predicting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits

In [15]:
# Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [20]:
def model_loss(action_size, hidden_size, states, actions, targetQs):
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                      labels=actions_labels)
    Qs = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states) # nextQs
    targetQs = tf.reshape(targetQs, shape=[-1, 1])
    gloss = tf.reduce_mean(neg_log_prob_actions * targetQs) # DPG
    gloss += tf.reduce_mean(neg_log_prob_actions * Qs) # DPG
    dloss = tf.reduce_mean(tf.square(Qs - targetQs)) # DQN
    # dloss = tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs,
    #                                                 labels=tf.nn.sigmoid(targetQs))
    dQs = discriminator(actions=actions_labels, hidden_size=hidden_size, states=states, reuse=True) # Qs
    gloss += tf.reduce_mean(neg_log_prob_actions * dQs) # DPG
    dloss += tf.reduce_mean(tf.square(dQs - targetQs)) # DQN
    # dloss += tf.nn.sigmoid_cross_entropy_with_logits(logits=dQs,
    #                                                  labels=tf.nn.sigmoid(targetQs))
    gloss1 = tf.reduce_mean(neg_log_prob_actions)
    gloss2 = tf.reduce_mean(Qs)
    gloss3 = tf.reduce_mean(dQs)
    gloss4 = tf.reduce_mean(targetQs)
    return actions_logits, Qs, gloss, dloss, gloss1, gloss2, gloss3, gloss4

In [21]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [22]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs = model_input(state_size=state_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_loss, self.g_loss1, self.g_loss2, self.g_loss3, self.g_loss4 = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, targetQs=self.targetQs) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [23]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
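
To see how such a memory behaves, here is a minimal standalone sketch using only the standard library. It uses `random.sample` as a stand-in for the `np.random.choice(..., replace=False)` call above, and the string transitions and tiny capacity are placeholders chosen for illustration.

```python
import random
from collections import deque

buffer = deque(maxlen=3)          # tiny capacity so eviction is easy to see
for transition in ["t0", "t1", "t2", "t3", "t4"]:
    buffer.append(transition)     # once full, the oldest item is dropped

print(list(buffer))               # ['t2', 't3', 't4']

# Draw a mini-batch of distinct transitions (sampling without replacement).
rng = random.Random(0)
batch = rng.sample(list(buffer), 2)
print(batch)
```

Sampling without replacement needs at least `batch_size` stored transitions, which is presumably why the memory is pre-filled with random-action experience before training starts.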

## Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're also tuning the simulation.

In [24]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1111, 4) actions:(1111,)
action size:2


In [None]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
action_size = 2
state_size = 4
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 50000            # memory capacity
batch_size = 500                # experience mini-batch size
gamma = 0.99                   # future reward discount
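
The three exploration parameters feed the schedule used in the training loop, `explore_stop + (explore_start - explore_stop) * exp(-decay_rate * step)`. The sketch below evaluates that formula on its own so the effect of `decay_rate` is easier to see. The function name `explore_probability` and the sample step values are assumptions for illustration.

```python
import math

def explore_probability(step, explore_start=1.0, explore_stop=0.01, decay_rate=0.0001):
    """Exponential decay from explore_start toward explore_stop as step grows."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * step)

for step in (0, 1000, 10000, 100000):
    print(step, round(explore_probability(step), 4))
```

The probability only moves noticeably away from `explore_start` once `decay_rate * step` is no longer small, so how quickly the step counter grows matters as much as `decay_rate` itself.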

In [None]:
# Reset/init the graph/session
tf.reset_default_graph()  # returns None, so fetch the fresh default graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [None]:
state = env.reset()
for _ in range(memory_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done:
        state = env.reset()

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list, gloss_list, dloss_list = [], [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        gloss_batch, dloss_batch = [], []
        gloss1_batch, gloss2_batch, gloss3_batch, gloss4_batch = [], [], [], []
        state = env.reset()

        # Training steps/batches
        for num_steps in range(1111111111):
            # Explore (Env) or Exploit (Model)
            total_step += 0.001 # small increment per step, so explore_p decays very slowly
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                action = env.action_space.sample()
            else:
                action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
                action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            total_reward += reward
            state = next_state

            # Training
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            nextQs_logits = sess.run(model.Qs_logits, feed_dict = {model.states: next_states})
            #nextQs = np.max(nextQs_logits, axis=1) * (1-dones)
            nextQs = nextQs_logits.reshape([-1]) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            gloss, dloss, gloss1, gloss2, gloss3, gloss4, _, _ = sess.run([model.g_loss, model.d_loss,
                                                                           model.g_loss1, model.g_loss2, 
                                                                           model.g_loss3, model.g_loss4,
                                                                           model.g_opt, model.d_opt],
                                                                          feed_dict = {model.states: states, 
                                                                                       model.actions: actions,
                                                                                       model.targetQs: targetQs})
            gloss_batch.append(gloss)
            dloss_batch.append(dloss)
            gloss1_batch.append(gloss1)
            gloss2_batch.append(gloss2)
            gloss3_batch.append(gloss3)
            gloss4_batch.append(gloss4)
            if done:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'Steps:{}'.format(num_steps),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'gloss1-lgP:{:.4f}'.format(np.mean(gloss1_batch)), #-logp
              'gloss2-gQs:{:.4f}'.format(np.mean(gloss2_batch)),#gQs
              'gloss3-dQs:{:.4f}'.format(np.mean(gloss3_batch)),#dQs
              'gloss4-tgtQ:{:.4f}'.format(np.mean(gloss4_batch)),#tgtQs
              'dloss:{:.4f}'.format(np.mean(dloss_batch)),
              'exploreP:{:.4f}'.format(explore_p))

        # Record results for plotting
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        # Break episode/epoch loop
        ## Stop once the running mean reward over the last 100 episodes reaches 500,
        ## the maximum episode length (and total reward) in CartPole-v1.
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 Steps:14 meanR:15.0000 R:15.0000 gloss:0.6979 gloss1-lgP:0.6977 gloss2-gQs:0.0376 gloss3-dQs:-0.0733 gloss4-tgtQ:1.0360 dloss:2.2325 exploreP:1.0000
Episode:1 Steps:11 meanR:13.5000 R:12.0000 gloss:0.7891 gloss1-lgP:0.6952 gloss2-gQs:0.0761 gloss3-dQs:-0.0152 gloss4-tgtQ:1.0744 dloss:2.1906 exploreP:1.0000
Episode:2 Steps:28 meanR:18.6667 R:29.0000 gloss:0.9428 gloss1-lgP:0.6947 gloss2-gQs:0.1420 gloss3-dQs:0.0781 gloss4-tgtQ:1.1371 dloss:2.1265 exploreP:1.0000
Episode:3 Steps:10 meanR:16.7500 R:11.0000 gloss:1.0776 gloss1-lgP:0.6927 gloss2-gQs:0.1994 gloss3-dQs:0.1612 gloss4-tgtQ:1.1950 dloss:2.0814 exploreP:1.0000
Episode:4 Steps:14 meanR:16.4000 R:15.0000 gloss:1.1848 gloss1-lgP:0.6936 gloss2-gQs:0.2446 gloss3-dQs:0.2270 gloss4-tgtQ:1.2366 dloss:2.0345 exploreP:1.0000
Episode:5 Steps:19 meanR:17.0000 R:20.0000 gloss:1.3333 gloss1-lgP:0.6938 gloss2-gQs:0.3100 gloss3-dQs:0.3136 gloss4-tgtQ:1.2982 dloss:1.9888 exploreP:1.0000
Episode:6 Steps:100 meanR:29.0000 R:101.0000 gloss

Episode:53 Steps:24 meanR:22.2222 R:25.0000 gloss:26.0957 gloss1-lgP:0.6928 gloss2-gQs:12.3962 gloss3-dQs:12.6896 gloss4-tgtQ:12.5813 dloss:14.5889 exploreP:0.9999
Episode:54 Steps:14 meanR:22.0909 R:15.0000 gloss:26.3004 gloss1-lgP:0.6926 gloss2-gQs:12.4983 gloss3-dQs:12.7365 gloss4-tgtQ:12.7409 dloss:12.7895 exploreP:0.9999
Episode:55 Steps:35 meanR:22.3393 R:36.0000 gloss:26.5550 gloss1-lgP:0.6923 gloss2-gQs:12.5770 gloss3-dQs:12.9551 gloss4-tgtQ:12.8259 dloss:12.5972 exploreP:0.9999
Episode:56 Steps:11 meanR:22.1579 R:12.0000 gloss:26.5400 gloss1-lgP:0.6916 gloss2-gQs:12.6016 gloss3-dQs:13.0340 gloss4-tgtQ:12.7384 dloss:14.9610 exploreP:0.9999
Episode:57 Steps:15 meanR:22.0517 R:16.0000 gloss:26.2433 gloss1-lgP:0.6927 gloss2-gQs:12.4313 gloss3-dQs:12.8532 gloss4-tgtQ:12.6029 dloss:13.4149 exploreP:0.9999
Episode:58 Steps:17 meanR:21.9831 R:18.0000 gloss:26.0254 gloss1-lgP:0.6923 gloss2-gQs:12.2926 gloss3-dQs:12.7342 gloss4-tgtQ:12.5639 dloss:11.7371 exploreP:0.9999
Episode:59 Steps

Episode:104 Steps:27 meanR:22.0000 R:28.0000 gloss:28.6366 gloss1-lgP:0.6919 gloss2-gQs:13.7490 gloss3-dQs:13.9031 gloss4-tgtQ:13.7332 dloss:10.3301 exploreP:0.9998
Episode:105 Steps:10 meanR:21.9100 R:11.0000 gloss:28.2950 gloss1-lgP:0.6930 gloss2-gQs:13.7027 gloss3-dQs:13.4401 gloss4-tgtQ:13.6857 dloss:10.3607 exploreP:0.9998
Episode:106 Steps:42 meanR:21.3300 R:43.0000 gloss:28.5463 gloss1-lgP:0.6932 gloss2-gQs:13.6805 gloss3-dQs:13.7873 gloss4-tgtQ:13.7154 dloss:9.4510 exploreP:0.9998
Episode:107 Steps:17 meanR:21.2500 R:18.0000 gloss:28.0100 gloss1-lgP:0.6917 gloss2-gQs:13.5151 gloss3-dQs:13.5313 gloss4-tgtQ:13.4475 dloss:10.1294 exploreP:0.9998
Episode:108 Steps:20 meanR:21.3000 R:21.0000 gloss:28.0017 gloss1-lgP:0.6937 gloss2-gQs:13.4419 gloss3-dQs:13.3802 gloss4-tgtQ:13.5422 dloss:9.5357 exploreP:0.9998
Episode:109 Steps:30 meanR:21.5100 R:31.0000 gloss:30.4013 gloss1-lgP:0.6927 gloss2-gQs:14.6800 gloss3-dQs:14.5188 gloss4-tgtQ:14.6904 dloss:9.5128 exploreP:0.9998
Episode:110 S

Episode:154 Steps:34 meanR:23.0400 R:35.0000 gloss:27.2543 gloss1-lgP:0.6940 gloss2-gQs:13.0343 gloss3-dQs:13.1185 gloss4-tgtQ:13.1158 dloss:7.5426 exploreP:0.9997
Episode:155 Steps:21 meanR:22.9000 R:22.0000 gloss:26.5372 gloss1-lgP:0.6927 gloss2-gQs:12.5898 gloss3-dQs:13.0568 gloss4-tgtQ:12.6635 dloss:7.8645 exploreP:0.9996
Episode:156 Steps:30 meanR:23.0900 R:31.0000 gloss:26.2079 gloss1-lgP:0.6949 gloss2-gQs:12.4303 gloss3-dQs:12.7454 gloss4-tgtQ:12.5422 dloss:7.6166 exploreP:0.9996
Episode:157 Steps:32 meanR:23.2600 R:33.0000 gloss:27.1258 gloss1-lgP:0.6940 gloss2-gQs:12.9587 gloss3-dQs:13.0839 gloss4-tgtQ:13.0400 dloss:8.6485 exploreP:0.9996
Episode:158 Steps:28 meanR:23.3700 R:29.0000 gloss:26.3424 gloss1-lgP:0.6937 gloss2-gQs:12.6629 gloss3-dQs:12.5482 gloss4-tgtQ:12.7595 dloss:8.3322 exploreP:0.9996
Episode:159 Steps:20 meanR:23.4300 R:21.0000 gloss:27.0947 gloss1-lgP:0.6913 gloss2-gQs:12.9937 gloss3-dQs:13.1168 gloss4-tgtQ:13.0822 dloss:6.4355 exploreP:0.9996
Episode:160 Step

Episode:204 Steps:15 meanR:23.4800 R:16.0000 gloss:26.7042 gloss1-lgP:0.6952 gloss2-gQs:12.6311 gloss3-dQs:13.0528 gloss4-tgtQ:12.7317 dloss:7.4369 exploreP:0.9995
Episode:205 Steps:13 meanR:23.5100 R:14.0000 gloss:27.4754 gloss1-lgP:0.6940 gloss2-gQs:13.2545 gloss3-dQs:12.9642 gloss4-tgtQ:13.3669 dloss:7.5591 exploreP:0.9995
Episode:206 Steps:17 meanR:23.2600 R:18.0000 gloss:28.1931 gloss1-lgP:0.6935 gloss2-gQs:13.5030 gloss3-dQs:13.5426 gloss4-tgtQ:13.6038 dloss:7.8438 exploreP:0.9995
Episode:207 Steps:37 meanR:23.4600 R:38.0000 gloss:28.2940 gloss1-lgP:0.6950 gloss2-gQs:13.4674 gloss3-dQs:13.6958 gloss4-tgtQ:13.5442 dloss:8.3638 exploreP:0.9995
Episode:208 Steps:18 meanR:23.4400 R:19.0000 gloss:27.9548 gloss1-lgP:0.6942 gloss2-gQs:13.4285 gloss3-dQs:13.3984 gloss4-tgtQ:13.4429 dloss:8.0691 exploreP:0.9995
Episode:209 Steps:26 meanR:23.4000 R:27.0000 gloss:27.7849 gloss1-lgP:0.6951 gloss2-gQs:13.1566 gloss3-dQs:13.5397 gloss4-tgtQ:13.2741 dloss:7.4381 exploreP:0.9995
Episode:210 Step

Episode:255 Steps:17 meanR:21.7100 R:18.0000 gloss:28.6052 gloss1-lgP:0.6942 gloss2-gQs:13.6828 gloss3-dQs:13.7514 gloss4-tgtQ:13.7724 dloss:8.6679 exploreP:0.9994
Episode:256 Steps:17 meanR:21.5800 R:18.0000 gloss:28.3164 gloss1-lgP:0.6936 gloss2-gQs:13.5416 gloss3-dQs:13.6757 gloss4-tgtQ:13.6082 dloss:7.1147 exploreP:0.9994
Episode:257 Steps:10 meanR:21.3600 R:11.0000 gloss:29.1303 gloss1-lgP:0.6936 gloss2-gQs:14.0508 gloss3-dQs:13.8032 gloss4-tgtQ:14.1442 dloss:6.9111 exploreP:0.9994
Episode:258 Steps:17 meanR:21.2500 R:18.0000 gloss:28.1418 gloss1-lgP:0.6921 gloss2-gQs:13.3965 gloss3-dQs:13.8704 gloss4-tgtQ:13.3857 dloss:7.8436 exploreP:0.9994
Episode:259 Steps:19 meanR:21.2400 R:20.0000 gloss:28.5993 gloss1-lgP:0.6958 gloss2-gQs:13.7622 gloss3-dQs:13.5207 gloss4-tgtQ:13.8152 dloss:9.2626 exploreP:0.9994
Episode:260 Steps:14 meanR:21.2600 R:15.0000 gloss:28.4216 gloss1-lgP:0.6942 gloss2-gQs:13.5084 gloss3-dQs:13.8379 gloss4-tgtQ:13.5900 dloss:7.7789 exploreP:0.9994
Episode:261 Step

Episode:305 Steps:59 meanR:21.6100 R:60.0000 gloss:26.9678 gloss1-lgP:0.6945 gloss2-gQs:12.9313 gloss3-dQs:12.9063 gloss4-tgtQ:12.9889 dloss:7.3714 exploreP:0.9993
Episode:306 Steps:40 meanR:21.8400 R:41.0000 gloss:27.5274 gloss1-lgP:0.6947 gloss2-gQs:13.2315 gloss3-dQs:13.1338 gloss4-tgtQ:13.2577 dloss:7.6991 exploreP:0.9993
Episode:307 Steps:17 meanR:21.6400 R:18.0000 gloss:28.2022 gloss1-lgP:0.6938 gloss2-gQs:13.6070 gloss3-dQs:13.4036 gloss4-tgtQ:13.6355 dloss:6.7744 exploreP:0.9993
Episode:308 Steps:15 meanR:21.6100 R:16.0000 gloss:27.5089 gloss1-lgP:0.6940 gloss2-gQs:13.0279 gloss3-dQs:13.5128 gloss4-tgtQ:13.0978 dloss:9.3300 exploreP:0.9993
Episode:309 Steps:49 meanR:21.8400 R:50.0000 gloss:27.9793 gloss1-lgP:0.6950 gloss2-gQs:13.3742 gloss3-dQs:13.4481 gloss4-tgtQ:13.4380 dloss:7.3596 exploreP:0.9993
Episode:310 Steps:9 meanR:21.7600 R:10.0000 gloss:28.9424 gloss1-lgP:0.6949 gloss2-gQs:13.9826 gloss3-dQs:13.6352 gloss4-tgtQ:14.0276 dloss:7.9215 exploreP:0.9993
Episode:311 Steps

Episode:355 Steps:44 meanR:22.1600 R:45.0000 gloss:28.4116 gloss1-lgP:0.6949 gloss2-gQs:13.6265 gloss3-dQs:13.5868 gloss4-tgtQ:13.6696 dloss:8.1895 exploreP:0.9992
Episode:356 Steps:9 meanR:22.0800 R:10.0000 gloss:28.7872 gloss1-lgP:0.6913 gloss2-gQs:14.0770 gloss3-dQs:13.5488 gloss4-tgtQ:14.0134 dloss:7.1757 exploreP:0.9992
Episode:357 Steps:11 meanR:22.0900 R:12.0000 gloss:27.3968 gloss1-lgP:0.6935 gloss2-gQs:12.8947 gloss3-dQs:13.6234 gloss4-tgtQ:12.9849 dloss:7.7206 exploreP:0.9992
Episode:358 Steps:23 meanR:22.1500 R:24.0000 gloss:28.1113 gloss1-lgP:0.6945 gloss2-gQs:13.3930 gloss3-dQs:13.6403 gloss4-tgtQ:13.4421 dloss:8.5633 exploreP:0.9992
Episode:359 Steps:15 meanR:22.1100 R:16.0000 gloss:27.6377 gloss1-lgP:0.6976 gloss2-gQs:13.1250 gloss3-dQs:13.3350 gloss4-tgtQ:13.1555 dloss:13.7594 exploreP:0.9992
Episode:360 Steps:25 meanR:22.2200 R:26.0000 gloss:27.6217 gloss1-lgP:0.6953 gloss2-gQs:13.2544 gloss3-dQs:13.1623 gloss4-tgtQ:13.3165 dloss:9.1377 exploreP:0.9992
Episode:361 Step

Episode:405 Steps:21 meanR:22.8000 R:22.0000 gloss:27.3067 gloss1-lgP:0.6947 gloss2-gQs:13.0209 gloss3-dQs:13.1920 gloss4-tgtQ:13.0890 dloss:6.7875 exploreP:0.9991
Episode:406 Steps:14 meanR:22.5400 R:15.0000 gloss:27.2492 gloss1-lgP:0.6960 gloss2-gQs:12.9100 gloss3-dQs:13.2023 gloss4-tgtQ:13.0344 dloss:7.2059 exploreP:0.9991
Episode:407 Steps:21 meanR:22.5800 R:22.0000 gloss:28.3189 gloss1-lgP:0.6923 gloss2-gQs:13.7531 gloss3-dQs:13.3684 gloss4-tgtQ:13.7858 dloss:6.8389 exploreP:0.9991
Episode:408 Steps:24 meanR:22.6700 R:25.0000 gloss:28.6609 gloss1-lgP:0.6957 gloss2-gQs:13.6652 gloss3-dQs:13.7948 gloss4-tgtQ:13.7343 dloss:6.8166 exploreP:0.9991
Episode:409 Steps:19 meanR:22.3700 R:20.0000 gloss:28.4765 gloss1-lgP:0.6960 gloss2-gQs:13.4783 gloss3-dQs:13.8905 gloss4-tgtQ:13.5470 dloss:6.6448 exploreP:0.9991
Episode:410 Steps:17 meanR:22.4500 R:18.0000 gloss:28.5527 gloss1-lgP:0.6933 gloss2-gQs:13.7349 gloss3-dQs:13.6995 gloss4-tgtQ:13.7478 dloss:6.1942 exploreP:0.9991
Episode:411 Step

Episode:455 Steps:8 meanR:22.3900 R:9.0000 gloss:27.8611 gloss1-lgP:0.6963 gloss2-gQs:13.0853 gloss3-dQs:13.6976 gloss4-tgtQ:13.2248 dloss:10.7317 exploreP:0.9990
Episode:456 Steps:26 meanR:22.5600 R:27.0000 gloss:29.4070 gloss1-lgP:0.6952 gloss2-gQs:14.2809 gloss3-dQs:13.7103 gloss4-tgtQ:14.3127 dloss:9.3167 exploreP:0.9990
Episode:457 Steps:15 meanR:22.6000 R:16.0000 gloss:28.9191 gloss1-lgP:0.6949 gloss2-gQs:13.8112 gloss3-dQs:13.9653 gloss4-tgtQ:13.8349 dloss:6.8297 exploreP:0.9990
Episode:458 Steps:28 meanR:22.6500 R:29.0000 gloss:28.0711 gloss1-lgP:0.6936 gloss2-gQs:13.3186 gloss3-dQs:13.8335 gloss4-tgtQ:13.3143 dloss:8.2655 exploreP:0.9990
Episode:459 Steps:32 meanR:22.8200 R:33.0000 gloss:28.0624 gloss1-lgP:0.6968 gloss2-gQs:13.3553 gloss3-dQs:13.4732 gloss4-tgtQ:13.4431 dloss:9.5315 exploreP:0.9990
Episode:460 Steps:60 meanR:23.1700 R:61.0000 gloss:28.2426 gloss1-lgP:0.6949 gloss2-gQs:13.5499 gloss3-dQs:13.5217 gloss4-tgtQ:13.5701 dloss:7.1589 exploreP:0.9990
Episode:461 Steps

Episode:506 Steps:11 meanR:22.3800 R:12.0000 gloss:28.3502 gloss1-lgP:0.6978 gloss2-gQs:13.6192 gloss3-dQs:13.3895 gloss4-tgtQ:13.6181 dloss:6.4402 exploreP:0.9989
Episode:507 Steps:49 meanR:22.6600 R:50.0000 gloss:27.7538 gloss1-lgP:0.6967 gloss2-gQs:13.2276 gloss3-dQs:13.3070 gloss4-tgtQ:13.2922 dloss:8.4945 exploreP:0.9989
Episode:508 Steps:12 meanR:22.5400 R:13.0000 gloss:28.2406 gloss1-lgP:0.6970 gloss2-gQs:13.3199 gloss3-dQs:13.7733 gloss4-tgtQ:13.4078 dloss:12.3158 exploreP:0.9989
Episode:509 Steps:27 meanR:22.6200 R:28.0000 gloss:27.3554 gloss1-lgP:0.6962 gloss2-gQs:12.9753 gloss3-dQs:13.3065 gloss4-tgtQ:13.0115 dloss:7.5953 exploreP:0.9989
Episode:510 Steps:14 meanR:22.5900 R:15.0000 gloss:27.8703 gloss1-lgP:0.6932 gloss2-gQs:13.4436 gloss3-dQs:13.2732 gloss4-tgtQ:13.4855 dloss:5.7307 exploreP:0.9989
Episode:511 Steps:10 meanR:22.5900 R:11.0000 gloss:28.4268 gloss1-lgP:0.6949 gloss2-gQs:13.6586 gloss3-dQs:13.5094 gloss4-tgtQ:13.7435 dloss:5.3034 exploreP:0.9989
Episode:512 Ste

Episode:557 Steps:15 meanR:23.6400 R:16.0000 gloss:28.5794 gloss1-lgP:0.6956 gloss2-gQs:13.6641 gloss3-dQs:13.7477 gloss4-tgtQ:13.6753 dloss:6.4280 exploreP:0.9988
Episode:558 Steps:23 meanR:23.5900 R:24.0000 gloss:28.0677 gloss1-lgP:0.6937 gloss2-gQs:13.4078 gloss3-dQs:13.5964 gloss4-tgtQ:13.4538 dloss:5.6505 exploreP:0.9988
Episode:559 Steps:18 meanR:23.4500 R:19.0000 gloss:29.1138 gloss1-lgP:0.6948 gloss2-gQs:14.0443 gloss3-dQs:13.7563 gloss4-tgtQ:14.0943 dloss:7.2426 exploreP:0.9988
Episode:560 Steps:12 meanR:22.9700 R:13.0000 gloss:29.9857 gloss1-lgP:0.6974 gloss2-gQs:14.5060 gloss3-dQs:13.9311 gloss4-tgtQ:14.5566 dloss:7.9393 exploreP:0.9988
Episode:561 Steps:27 meanR:23.0200 R:28.0000 gloss:30.7429 gloss1-lgP:0.6960 gloss2-gQs:14.7947 gloss3-dQs:14.5201 gloss4-tgtQ:14.8482 dloss:9.4214 exploreP:0.9987
Episode:562 Steps:11 meanR:22.7400 R:12.0000 gloss:29.9182 gloss1-lgP:0.6939 gloss2-gQs:14.2904 gloss3-dQs:14.5336 gloss4-tgtQ:14.2886 dloss:8.9493 exploreP:0.9987
Episode:563 Step

Episode:607 Steps:30 meanR:23.8200 R:31.0000 gloss:27.4050 gloss1-lgP:0.6951 gloss2-gQs:12.9841 gloss3-dQs:13.4333 gloss4-tgtQ:13.0173 dloss:10.7782 exploreP:0.9986
Episode:608 Steps:20 meanR:23.9000 R:21.0000 gloss:27.1817 gloss1-lgP:0.6949 gloss2-gQs:12.9728 gloss3-dQs:13.1393 gloss4-tgtQ:13.0003 dloss:7.1188 exploreP:0.9986
Episode:609 Steps:17 meanR:23.8000 R:18.0000 gloss:27.7930 gloss1-lgP:0.6968 gloss2-gQs:13.3260 gloss3-dQs:13.1831 gloss4-tgtQ:13.3804 dloss:6.8798 exploreP:0.9986
Episode:610 Steps:9 meanR:23.7500 R:10.0000 gloss:27.8019 gloss1-lgP:0.6933 gloss2-gQs:13.4914 gloss3-dQs:13.0991 gloss4-tgtQ:13.5108 dloss:6.5282 exploreP:0.9986
Episode:611 Steps:15 meanR:23.8000 R:16.0000 gloss:27.6566 gloss1-lgP:0.6956 gloss2-gQs:13.0386 gloss3-dQs:13.5895 gloss4-tgtQ:13.1379 dloss:7.1823 exploreP:0.9986
Episode:612 Steps:13 meanR:23.4200 R:14.0000 gloss:27.6046 gloss1-lgP:0.6944 gloss2-gQs:13.2805 gloss3-dQs:13.1947 gloss4-tgtQ:13.2751 dloss:5.8163 exploreP:0.9986
Episode:613 Step

Episode:657 Steps:16 meanR:21.2100 R:17.0000 gloss:27.6537 gloss1-lgP:0.6962 gloss2-gQs:13.2437 gloss3-dQs:13.1927 gloss4-tgtQ:13.2920 dloss:7.9674 exploreP:0.9985
Episode:658 Steps:26 meanR:21.2400 R:27.0000 gloss:27.9444 gloss1-lgP:0.6938 gloss2-gQs:13.5048 gloss3-dQs:13.2557 gloss4-tgtQ:13.5125 dloss:6.5061 exploreP:0.9985
Episode:659 Steps:9 meanR:21.1500 R:10.0000 gloss:29.6835 gloss1-lgP:0.6951 gloss2-gQs:14.4903 gloss3-dQs:13.6549 gloss4-tgtQ:14.5570 dloss:7.6362 exploreP:0.9985
Episode:660 Steps:37 meanR:21.4000 R:38.0000 gloss:27.0815 gloss1-lgP:0.6975 gloss2-gQs:12.7546 gloss3-dQs:13.2807 gloss4-tgtQ:12.7994 dloss:13.1338 exploreP:0.9985
Episode:661 Steps:15 meanR:21.2800 R:16.0000 gloss:27.9749 gloss1-lgP:0.6957 gloss2-gQs:13.4635 gloss3-dQs:13.2206 gloss4-tgtQ:13.5151 dloss:10.5799 exploreP:0.9985
Episode:662 Steps:17 meanR:21.3400 R:18.0000 gloss:28.2361 gloss1-lgP:0.6959 gloss2-gQs:13.6158 gloss3-dQs:13.3516 gloss4-tgtQ:13.6027 dloss:8.5559 exploreP:0.9985
Episode:663 Ste

Episode:707 Steps:57 meanR:19.9900 R:58.0000 gloss:27.5887 gloss1-lgP:0.6968 gloss2-gQs:13.2460 gloss3-dQs:13.0424 gloss4-tgtQ:13.2987 dloss:6.9869 exploreP:0.9984
Episode:708 Steps:11 meanR:19.9000 R:12.0000 gloss:26.2243 gloss1-lgP:0.6943 gloss2-gQs:12.3692 gloss3-dQs:13.0476 gloss4-tgtQ:12.3557 dloss:8.1351 exploreP:0.9984
Episode:709 Steps:14 meanR:19.8700 R:15.0000 gloss:27.8410 gloss1-lgP:0.6954 gloss2-gQs:13.5405 gloss3-dQs:12.9147 gloss4-tgtQ:13.5787 dloss:10.4425 exploreP:0.9984
Episode:710 Steps:25 meanR:20.0300 R:26.0000 gloss:27.9769 gloss1-lgP:0.6957 gloss2-gQs:13.5111 gloss3-dQs:13.1621 gloss4-tgtQ:13.5521 dloss:12.0568 exploreP:0.9984
Episode:711 Steps:15 meanR:20.0300 R:16.0000 gloss:27.5743 gloss1-lgP:0.6973 gloss2-gQs:13.0201 gloss3-dQs:13.4016 gloss4-tgtQ:13.1165 dloss:13.9163 exploreP:0.9984
Episode:712 Steps:9 meanR:19.9900 R:10.0000 gloss:29.4915 gloss1-lgP:0.6949 gloss2-gQs:14.3927 gloss3-dQs:13.5671 gloss4-tgtQ:14.4694 dloss:9.7636 exploreP:0.9984
Episode:713 St

Episode:757 Steps:12 meanR:20.8200 R:13.0000 gloss:26.9657 gloss1-lgP:0.6925 gloss2-gQs:12.9392 gloss3-dQs:13.0747 gloss4-tgtQ:12.9255 dloss:5.0566 exploreP:0.9983
Episode:758 Steps:17 meanR:20.7300 R:18.0000 gloss:28.6562 gloss1-lgP:0.6971 gloss2-gQs:13.7465 gloss3-dQs:13.5351 gloss4-tgtQ:13.8248 dloss:5.4815 exploreP:0.9983
Episode:759 Steps:15 meanR:20.7900 R:16.0000 gloss:26.0111 gloss1-lgP:0.6949 gloss2-gQs:12.1158 gloss3-dQs:13.1599 gloss4-tgtQ:12.1534 dloss:7.9460 exploreP:0.9983
Episode:760 Steps:11 meanR:20.5300 R:12.0000 gloss:26.6013 gloss1-lgP:0.6924 gloss2-gQs:12.7438 gloss3-dQs:12.9504 gloss4-tgtQ:12.7184 dloss:5.7145 exploreP:0.9983
Episode:761 Steps:26 meanR:20.6400 R:27.0000 gloss:26.4211 gloss1-lgP:0.6971 gloss2-gQs:12.4838 gloss3-dQs:12.8749 gloss4-tgtQ:12.5404 dloss:6.7697 exploreP:0.9983
Episode:762 Steps:53 meanR:21.0000 R:54.0000 gloss:25.5550 gloss1-lgP:0.6971 gloss2-gQs:12.1350 gloss3-dQs:12.3421 gloss4-tgtQ:12.1791 dloss:13.5404 exploreP:0.9983
Episode:763 Ste

Episode:807 Steps:60 meanR:20.9800 R:61.0000 gloss:26.7287 gloss1-lgP:0.6958 gloss2-gQs:12.8311 gloss3-dQs:12.7057 gloss4-tgtQ:12.8789 dloss:6.2361 exploreP:0.9982
Episode:808 Steps:21 meanR:21.0800 R:22.0000 gloss:26.8382 gloss1-lgP:0.6988 gloss2-gQs:12.7341 gloss3-dQs:12.9106 gloss4-tgtQ:12.7651 dloss:6.3962 exploreP:0.9982
Episode:809 Steps:23 meanR:21.1700 R:24.0000 gloss:27.8264 gloss1-lgP:0.6974 gloss2-gQs:13.3645 gloss3-dQs:13.0940 gloss4-tgtQ:13.4421 dloss:9.1894 exploreP:0.9982
Episode:810 Steps:20 meanR:21.1200 R:21.0000 gloss:27.0855 gloss1-lgP:0.6965 gloss2-gQs:12.8208 gloss3-dQs:13.1930 gloss4-tgtQ:12.8749 dloss:6.2936 exploreP:0.9982
Episode:811 Steps:11 meanR:21.0800 R:12.0000 gloss:27.8210 gloss1-lgP:0.6923 gloss2-gQs:13.4243 gloss3-dQs:13.3336 gloss4-tgtQ:13.4336 dloss:6.2219 exploreP:0.9982
Episode:812 Steps:19 meanR:21.1800 R:20.0000 gloss:27.2903 gloss1-lgP:0.6958 gloss2-gQs:12.9764 gloss3-dQs:13.2388 gloss4-tgtQ:13.0110 dloss:9.1565 exploreP:0.9982
Episode:813 Step

Episode:857 Steps:22 meanR:21.7100 R:23.0000 gloss:28.6481 gloss1-lgP:0.6964 gloss2-gQs:13.9167 gloss3-dQs:13.2856 gloss4-tgtQ:13.9189 dloss:12.6665 exploreP:0.9981
Episode:858 Steps:24 meanR:21.7800 R:25.0000 gloss:27.2230 gloss1-lgP:0.6964 gloss2-gQs:12.9165 gloss3-dQs:13.2238 gloss4-tgtQ:12.9500 dloss:5.8051 exploreP:0.9981
Episode:859 Steps:15 meanR:21.7800 R:16.0000 gloss:27.0505 gloss1-lgP:0.6946 gloss2-gQs:12.9282 gloss3-dQs:13.0732 gloss4-tgtQ:12.9432 dloss:5.9964 exploreP:0.9981
Episode:860 Steps:10 meanR:21.7700 R:11.0000 gloss:28.4964 gloss1-lgP:0.7027 gloss2-gQs:13.7052 gloss3-dQs:13.1011 gloss4-tgtQ:13.7479 dloss:6.8084 exploreP:0.9981
Episode:861 Steps:27 meanR:21.7800 R:28.0000 gloss:27.6887 gloss1-lgP:0.6963 gloss2-gQs:13.2274 gloss3-dQs:13.2984 gloss4-tgtQ:13.2383 dloss:6.7634 exploreP:0.9981
Episode:862 Steps:18 meanR:21.4300 R:19.0000 gloss:26.5214 gloss1-lgP:0.6956 gloss2-gQs:12.5004 gloss3-dQs:13.0582 gloss4-tgtQ:12.5613 dloss:8.4651 exploreP:0.9981
Episode:863 Ste

Episode:908 Steps:10 meanR:22.6300 R:11.0000 gloss:26.8243 gloss1-lgP:0.6940 gloss2-gQs:12.8733 gloss3-dQs:12.8941 gloss4-tgtQ:12.8841 dloss:5.2327 exploreP:0.9980
Episode:909 Steps:19 meanR:22.5900 R:20.0000 gloss:26.2193 gloss1-lgP:0.6947 gloss2-gQs:12.4682 gloss3-dQs:12.7953 gloss4-tgtQ:12.4706 dloss:6.6532 exploreP:0.9980
Episode:910 Steps:18 meanR:22.5700 R:19.0000 gloss:25.0938 gloss1-lgP:0.6951 gloss2-gQs:11.8507 gloss3-dQs:12.3845 gloss4-tgtQ:11.8666 dloss:6.5816 exploreP:0.9980
Episode:911 Steps:21 meanR:22.6700 R:22.0000 gloss:25.4386 gloss1-lgP:0.6966 gloss2-gQs:12.1912 gloss3-dQs:12.0736 gloss4-tgtQ:12.2557 dloss:5.8299 exploreP:0.9980
Episode:912 Steps:26 meanR:22.7400 R:27.0000 gloss:25.7028 gloss1-lgP:0.6954 gloss2-gQs:12.2303 gloss3-dQs:12.4552 gloss4-tgtQ:12.2731 dloss:6.4174 exploreP:0.9980
Episode:913 Steps:9 meanR:22.6900 R:10.0000 gloss:26.7361 gloss1-lgP:0.6970 gloss2-gQs:12.8573 gloss3-dQs:12.6152 gloss4-tgtQ:12.8867 dloss:4.7961 exploreP:0.9980
Episode:914 Steps

Episode:958 Steps:15 meanR:22.7400 R:16.0000 gloss:27.3655 gloss1-lgP:0.6966 gloss2-gQs:13.0426 gloss3-dQs:13.1856 gloss4-tgtQ:13.0522 dloss:6.8560 exploreP:0.9979
Episode:959 Steps:14 meanR:22.7300 R:15.0000 gloss:26.3850 gloss1-lgP:0.6942 gloss2-gQs:12.5080 gloss3-dQs:12.9716 gloss4-tgtQ:12.5267 dloss:5.4085 exploreP:0.9979
Episode:960 Steps:15 meanR:22.7800 R:16.0000 gloss:25.6502 gloss1-lgP:0.6946 gloss2-gQs:12.0159 gloss3-dQs:12.8613 gloss4-tgtQ:12.0507 dloss:6.6765 exploreP:0.9979
Episode:961 Steps:39 meanR:22.9000 R:40.0000 gloss:26.3586 gloss1-lgP:0.6957 gloss2-gQs:12.6491 gloss3-dQs:12.5511 gloss4-tgtQ:12.6894 dloss:5.3337 exploreP:0.9979
Episode:962 Steps:18 meanR:22.9000 R:19.0000 gloss:27.0449 gloss1-lgP:0.6959 gloss2-gQs:13.0176 gloss3-dQs:12.8281 gloss4-tgtQ:13.0188 dloss:6.6395 exploreP:0.9979
Episode:963 Steps:11 meanR:22.8400 R:12.0000 gloss:25.4305 gloss1-lgP:0.6948 gloss2-gQs:12.0080 gloss3-dQs:12.5398 gloss4-tgtQ:12.0454 dloss:6.1792 exploreP:0.9979
Episode:964 Step

Episode:1009 Steps:32 meanR:22.9500 R:33.0000 gloss:26.3538 gloss1-lgP:0.6957 gloss2-gQs:12.4992 gloss3-dQs:12.8375 gloss4-tgtQ:12.5427 dloss:5.4953 exploreP:0.9978
Episode:1010 Steps:25 meanR:23.0200 R:26.0000 gloss:25.1891 gloss1-lgP:0.6954 gloss2-gQs:11.8420 gloss3-dQs:12.4971 gloss4-tgtQ:11.8855 dloss:10.5590 exploreP:0.9978
Episode:1011 Steps:9 meanR:22.9000 R:10.0000 gloss:25.8579 gloss1-lgP:0.6939 gloss2-gQs:12.5298 gloss3-dQs:12.1788 gloss4-tgtQ:12.5597 dloss:7.4821 exploreP:0.9978
Episode:1012 Steps:26 meanR:22.9000 R:27.0000 gloss:25.8655 gloss1-lgP:0.6975 gloss2-gQs:12.3587 gloss3-dQs:12.2959 gloss4-tgtQ:12.4297 dloss:10.4244 exploreP:0.9978
Episode:1013 Steps:33 meanR:23.1400 R:34.0000 gloss:26.3759 gloss1-lgP:0.6949 gloss2-gQs:12.6797 gloss3-dQs:12.5623 gloss4-tgtQ:12.7192 dloss:20.2840 exploreP:0.9978
Episode:1014 Steps:27 meanR:23.2300 R:28.0000 gloss:26.0530 gloss1-lgP:0.6946 gloss2-gQs:12.4095 gloss3-dQs:12.6752 gloss4-tgtQ:12.4210 dloss:5.7525 exploreP:0.9978
Episode:

Episode:1059 Steps:23 meanR:23.1900 R:24.0000 gloss:24.6385 gloss1-lgP:0.6961 gloss2-gQs:11.5481 gloss3-dQs:12.2316 gloss4-tgtQ:11.6159 dloss:12.4580 exploreP:0.9977
Episode:1060 Steps:66 meanR:23.7000 R:67.0000 gloss:24.9592 gloss1-lgP:0.6968 gloss2-gQs:11.9026 gloss3-dQs:11.9202 gloss4-tgtQ:11.9873 dloss:7.8299 exploreP:0.9977
Episode:1061 Steps:15 meanR:23.4600 R:16.0000 gloss:25.1607 gloss1-lgP:0.6937 gloss2-gQs:12.0638 gloss3-dQs:12.1434 gloss4-tgtQ:12.0612 dloss:9.4426 exploreP:0.9977
Episode:1062 Steps:19 meanR:23.4700 R:20.0000 gloss:25.3633 gloss1-lgP:0.6955 gloss2-gQs:12.0698 gloss3-dQs:12.2639 gloss4-tgtQ:12.1331 dloss:6.2600 exploreP:0.9977
Episode:1063 Steps:44 meanR:23.8000 R:45.0000 gloss:26.3199 gloss1-lgP:0.6963 gloss2-gQs:12.6281 gloss3-dQs:12.4898 gloss4-tgtQ:12.6758 dloss:6.9019 exploreP:0.9977
Episode:1064 Steps:18 meanR:23.8400 R:19.0000 gloss:26.0962 gloss1-lgP:0.6963 gloss2-gQs:12.3618 gloss3-dQs:12.6591 gloss4-tgtQ:12.4562 dloss:7.0786 exploreP:0.9977
Episode:1

Episode:1109 Steps:23 meanR:24.0300 R:24.0000 gloss:27.0616 gloss1-lgP:0.7006 gloss2-gQs:12.8501 gloss3-dQs:12.8554 gloss4-tgtQ:12.9257 dloss:9.5121 exploreP:0.9975
Episode:1110 Steps:11 meanR:23.8900 R:12.0000 gloss:26.7250 gloss1-lgP:0.6967 gloss2-gQs:12.7358 gloss3-dQs:12.8226 gloss4-tgtQ:12.8042 dloss:8.4096 exploreP:0.9975
Episode:1111 Steps:20 meanR:24.0000 R:21.0000 gloss:27.9566 gloss1-lgP:0.6960 gloss2-gQs:13.5645 gloss3-dQs:12.9981 gloss4-tgtQ:13.6043 dloss:7.5393 exploreP:0.9975
Episode:1112 Steps:21 meanR:23.9500 R:22.0000 gloss:27.8658 gloss1-lgP:0.6960 gloss2-gQs:13.3618 gloss3-dQs:13.2741 gloss4-tgtQ:13.4005 dloss:6.3699 exploreP:0.9975
Episode:1113 Steps:11 meanR:23.7300 R:12.0000 gloss:26.7231 gloss1-lgP:0.6979 gloss2-gQs:12.4384 gloss3-dQs:13.3529 gloss4-tgtQ:12.5021 dloss:9.2618 exploreP:0.9975
Episode:1114 Steps:20 meanR:23.6600 R:21.0000 gloss:28.6586 gloss1-lgP:0.6994 gloss2-gQs:13.7303 gloss3-dQs:13.4627 gloss4-tgtQ:13.7691 dloss:11.2776 exploreP:0.9975
Episode:1

Episode:1160 Steps:26 meanR:22.5900 R:27.0000 gloss:27.5612 gloss1-lgP:0.6974 gloss2-gQs:13.2088 gloss3-dQs:13.0201 gloss4-tgtQ:13.2910 dloss:9.0209 exploreP:0.9974
Episode:1161 Steps:21 meanR:22.6500 R:22.0000 gloss:27.4422 gloss1-lgP:0.6974 gloss2-gQs:13.0087 gloss3-dQs:13.2950 gloss4-tgtQ:13.0468 dloss:8.4939 exploreP:0.9974
Episode:1162 Steps:12 meanR:22.5800 R:13.0000 gloss:29.3497 gloss1-lgP:0.6955 gloss2-gQs:14.2522 gloss3-dQs:13.6425 gloss4-tgtQ:14.2953 dloss:9.7819 exploreP:0.9974
Episode:1163 Steps:53 meanR:22.6700 R:54.0000 gloss:28.3371 gloss1-lgP:0.6956 gloss2-gQs:13.5609 gloss3-dQs:13.5952 gloss4-tgtQ:13.5802 dloss:9.8052 exploreP:0.9974
Episode:1164 Steps:26 meanR:22.7500 R:27.0000 gloss:26.8463 gloss1-lgP:0.6967 gloss2-gQs:12.5629 gloss3-dQs:13.3678 gloss4-tgtQ:12.6077 dloss:8.8894 exploreP:0.9974
Episode:1165 Steps:33 meanR:22.9200 R:34.0000 gloss:26.8654 gloss1-lgP:0.6963 gloss2-gQs:12.8568 gloss3-dQs:12.7956 gloss4-tgtQ:12.9244 dloss:8.6166 exploreP:0.9974
Episode:11

Episode:1211 Steps:11 meanR:21.4400 R:12.0000 gloss:28.3522 gloss1-lgP:0.6946 gloss2-gQs:13.5796 gloss3-dQs:13.6547 gloss4-tgtQ:13.5831 dloss:5.9141 exploreP:0.9973
Episode:1212 Steps:13 meanR:21.3600 R:14.0000 gloss:29.0844 gloss1-lgP:0.6987 gloss2-gQs:13.9557 gloss3-dQs:13.6621 gloss4-tgtQ:14.0074 dloss:7.9709 exploreP:0.9973
Episode:1213 Steps:19 meanR:21.4400 R:20.0000 gloss:27.8306 gloss1-lgP:0.6946 gloss2-gQs:13.1251 gloss3-dQs:13.7964 gloss4-tgtQ:13.1423 dloss:6.3883 exploreP:0.9973
Episode:1214 Steps:12 meanR:21.3600 R:13.0000 gloss:26.2019 gloss1-lgP:0.6949 gloss2-gQs:12.1901 gloss3-dQs:13.3220 gloss4-tgtQ:12.1917 dloss:7.9839 exploreP:0.9973
Episode:1215 Steps:25 meanR:21.3000 R:26.0000 gloss:27.2283 gloss1-lgP:0.6976 gloss2-gQs:13.0351 gloss3-dQs:12.8880 gloss4-tgtQ:13.0977 dloss:11.3357 exploreP:0.9973
Episode:1216 Steps:45 meanR:21.5700 R:46.0000 gloss:27.9749 gloss1-lgP:0.6975 gloss2-gQs:13.4690 gloss3-dQs:13.1368 gloss4-tgtQ:13.5022 dloss:11.0817 exploreP:0.9973
Episode:

Episode:1261 Steps:14 meanR:22.4500 R:15.0000 gloss:27.9517 gloss1-lgP:0.6965 gloss2-gQs:13.3245 gloss3-dQs:13.4466 gloss4-tgtQ:13.3620 dloss:6.2217 exploreP:0.9972
Episode:1262 Steps:22 meanR:22.5500 R:23.0000 gloss:27.5480 gloss1-lgP:0.6984 gloss2-gQs:12.9609 gloss3-dQs:13.4863 gloss4-tgtQ:13.0012 dloss:10.1636 exploreP:0.9972
Episode:1263 Steps:12 meanR:22.1400 R:13.0000 gloss:27.0639 gloss1-lgP:0.6970 gloss2-gQs:12.7266 gloss3-dQs:13.3513 gloss4-tgtQ:12.7489 dloss:8.7643 exploreP:0.9972
Episode:1264 Steps:11 meanR:21.9900 R:12.0000 gloss:29.9067 gloss1-lgP:0.6995 gloss2-gQs:14.6549 gloss3-dQs:13.3556 gloss4-tgtQ:14.7273 dloss:18.4428 exploreP:0.9972
Episode:1265 Steps:11 meanR:21.7700 R:12.0000 gloss:27.1131 gloss1-lgP:0.6967 gloss2-gQs:12.8552 gloss3-dQs:13.1946 gloss4-tgtQ:12.8739 dloss:9.8799 exploreP:0.9972
Episode:1266 Steps:23 meanR:21.8800 R:24.0000 gloss:25.6196 gloss1-lgP:0.6994 gloss2-gQs:11.7561 gloss3-dQs:13.0467 gloss4-tgtQ:11.8317 dloss:14.0516 exploreP:0.9972
Episode

Episode:1311 Steps:19 meanR:22.8400 R:20.0000 gloss:26.4426 gloss1-lgP:0.6974 gloss2-gQs:12.3712 gloss3-dQs:13.1273 gloss4-tgtQ:12.4194 dloss:10.1009 exploreP:0.9971
Episode:1312 Steps:43 meanR:23.1400 R:44.0000 gloss:27.9915 gloss1-lgP:0.6992 gloss2-gQs:13.3889 gloss3-dQs:13.1957 gloss4-tgtQ:13.4526 dloss:9.6877 exploreP:0.9971
Episode:1313 Steps:20 meanR:23.1500 R:21.0000 gloss:28.2372 gloss1-lgP:0.6983 gloss2-gQs:13.4657 gloss3-dQs:13.4606 gloss4-tgtQ:13.4969 dloss:19.4545 exploreP:0.9971
Episode:1314 Steps:37 meanR:23.4000 R:38.0000 gloss:28.9394 gloss1-lgP:0.7005 gloss2-gQs:13.7065 gloss3-dQs:13.8854 gloss4-tgtQ:13.7252 dloss:30.7623 exploreP:0.9971
Episode:1315 Steps:30 meanR:23.4500 R:31.0000 gloss:28.1336 gloss1-lgP:0.6967 gloss2-gQs:13.4100 gloss3-dQs:13.5247 gloss4-tgtQ:13.4421 dloss:6.6602 exploreP:0.9971
Episode:1316 Steps:11 meanR:23.1100 R:12.0000 gloss:27.6890 gloss1-lgP:0.6939 gloss2-gQs:13.3018 gloss3-dQs:13.2827 gloss4-tgtQ:13.3159 dloss:5.0316 exploreP:0.9971
Episode

Episode:1362 Steps:23 meanR:22.9600 R:24.0000 gloss:27.8869 gloss1-lgP:0.6981 gloss2-gQs:13.0667 gloss3-dQs:13.7503 gloss4-tgtQ:13.1201 dloss:11.4662 exploreP:0.9970
Episode:1363 Steps:40 meanR:23.2400 R:41.0000 gloss:28.5558 gloss1-lgP:0.6954 gloss2-gQs:13.6589 gloss3-dQs:13.7129 gloss4-tgtQ:13.6901 dloss:6.5327 exploreP:0.9970
Episode:1364 Steps:10 meanR:23.2300 R:11.0000 gloss:28.4979 gloss1-lgP:0.6947 gloss2-gQs:13.7397 gloss3-dQs:13.5519 gloss4-tgtQ:13.7205 dloss:6.9732 exploreP:0.9970
Episode:1365 Steps:28 meanR:23.4000 R:29.0000 gloss:28.3858 gloss1-lgP:0.6963 gloss2-gQs:13.4990 gloss3-dQs:13.7252 gloss4-tgtQ:13.5369 dloss:9.6760 exploreP:0.9970
Episode:1366 Steps:21 meanR:23.3800 R:22.0000 gloss:28.9666 gloss1-lgP:0.6975 gloss2-gQs:13.9027 gloss3-dQs:13.7000 gloss4-tgtQ:13.9256 dloss:6.5850 exploreP:0.9970
Episode:1367 Steps:26 meanR:23.4700 R:27.0000 gloss:29.4080 gloss1-lgP:0.6963 gloss2-gQs:14.1985 gloss3-dQs:13.8127 gloss4-tgtQ:14.2259 dloss:10.9201 exploreP:0.9970
Episode:

Episode:1412 Steps:37 meanR:23.6200 R:38.0000 gloss:26.5375 gloss1-lgP:0.6954 gloss2-gQs:12.4794 gloss3-dQs:13.1556 gloss4-tgtQ:12.5316 dloss:13.0620 exploreP:0.9969
Episode:1413 Steps:9 meanR:23.5100 R:10.0000 gloss:28.7012 gloss1-lgP:0.6958 gloss2-gQs:14.0814 gloss3-dQs:13.0308 gloss4-tgtQ:14.1287 dloss:12.1979 exploreP:0.9969
Episode:1414 Steps:34 meanR:23.4800 R:35.0000 gloss:27.2134 gloss1-lgP:0.6960 gloss2-gQs:13.1339 gloss3-dQs:12.7929 gloss4-tgtQ:13.1759 dloss:8.2477 exploreP:0.9969
Episode:1415 Steps:9 meanR:23.2700 R:10.0000 gloss:25.6294 gloss1-lgP:0.6952 gloss2-gQs:11.8657 gloss3-dQs:13.1176 gloss4-tgtQ:11.8805 dloss:8.9877 exploreP:0.9969
Episode:1416 Steps:34 meanR:23.5000 R:35.0000 gloss:27.1306 gloss1-lgP:0.6954 gloss2-gQs:12.9591 gloss3-dQs:13.0453 gloss4-tgtQ:13.0147 dloss:7.0323 exploreP:0.9969
Episode:1417 Steps:37 meanR:23.3500 R:38.0000 gloss:29.2463 gloss1-lgP:0.7000 gloss2-gQs:14.0881 gloss3-dQs:13.5315 gloss4-tgtQ:14.1630 dloss:12.1073 exploreP:0.9969
Episode:1

Episode:1462 Steps:39 meanR:23.3000 R:40.0000 gloss:27.5820 gloss1-lgP:0.6961 gloss2-gQs:13.1315 gloss3-dQs:13.2908 gloss4-tgtQ:13.2000 dloss:24.5584 exploreP:0.9968
Episode:1463 Steps:41 meanR:23.3100 R:42.0000 gloss:29.1841 gloss1-lgP:0.6966 gloss2-gQs:14.0834 gloss3-dQs:13.7027 gloss4-tgtQ:14.1021 dloss:10.5279 exploreP:0.9968
Episode:1464 Steps:20 meanR:23.4100 R:21.0000 gloss:29.3186 gloss1-lgP:0.6986 gloss2-gQs:14.0437 gloss3-dQs:13.8418 gloss4-tgtQ:14.0797 dloss:9.3304 exploreP:0.9967
Episode:1465 Steps:27 meanR:23.4000 R:28.0000 gloss:30.0431 gloss1-lgP:0.6962 gloss2-gQs:14.4938 gloss3-dQs:14.1325 gloss4-tgtQ:14.5282 dloss:7.3243 exploreP:0.9967
Episode:1466 Steps:18 meanR:23.3700 R:19.0000 gloss:29.2814 gloss1-lgP:0.6942 gloss2-gQs:13.9376 gloss3-dQs:14.3042 gloss4-tgtQ:13.9373 dloss:6.0662 exploreP:0.9967
Episode:1467 Steps:14 meanR:23.2500 R:15.0000 gloss:28.8806 gloss1-lgP:0.6938 gloss2-gQs:13.7435 gloss3-dQs:14.1670 gloss4-tgtQ:13.7199 dloss:6.9106 exploreP:0.9967
Episode:

Episode:1513 Steps:15 meanR:22.0600 R:16.0000 gloss:28.9481 gloss1-lgP:0.6967 gloss2-gQs:13.9092 gloss3-dQs:13.6559 gloss4-tgtQ:13.9754 dloss:7.8998 exploreP:0.9966
Episode:1514 Steps:11 meanR:21.8300 R:12.0000 gloss:28.1540 gloss1-lgP:0.6948 gloss2-gQs:13.3577 gloss3-dQs:13.7967 gloss4-tgtQ:13.3593 dloss:6.3344 exploreP:0.9966
Episode:1515 Steps:51 meanR:22.2500 R:52.0000 gloss:26.1817 gloss1-lgP:0.6957 gloss2-gQs:12.2584 gloss3-dQs:13.0662 gloss4-tgtQ:12.3110 dloss:13.6388 exploreP:0.9966
Episode:1516 Steps:14 meanR:22.0500 R:15.0000 gloss:28.0498 gloss1-lgP:0.7004 gloss2-gQs:13.5761 gloss3-dQs:12.8032 gloss4-tgtQ:13.6529 dloss:19.7057 exploreP:0.9966
Episode:1517 Steps:8 meanR:21.7600 R:9.0000 gloss:27.1276 gloss1-lgP:0.6958 gloss2-gQs:12.9656 gloss3-dQs:13.0152 gloss4-tgtQ:13.0104 dloss:6.7895 exploreP:0.9966
Episode:1518 Steps:10 meanR:21.6700 R:11.0000 gloss:27.7755 gloss1-lgP:0.6944 gloss2-gQs:13.5087 gloss3-dQs:12.9274 gloss4-tgtQ:13.5551 dloss:8.4908 exploreP:0.9966
Episode:15

Episode:1563 Steps:24 meanR:22.9300 R:25.0000 gloss:27.7631 gloss1-lgP:0.6957 gloss2-gQs:13.1753 gloss3-dQs:13.5568 gloss4-tgtQ:13.1745 dloss:7.5415 exploreP:0.9965
Episode:1564 Steps:19 meanR:22.9200 R:20.0000 gloss:26.9657 gloss1-lgP:0.6957 gloss2-gQs:12.7096 gloss3-dQs:13.2987 gloss4-tgtQ:12.7413 dloss:12.0889 exploreP:0.9965
Episode:1565 Steps:9 meanR:22.7400 R:10.0000 gloss:27.4108 gloss1-lgP:0.7007 gloss2-gQs:13.1372 gloss3-dQs:12.7653 gloss4-tgtQ:13.1993 dloss:13.9721 exploreP:0.9965
Episode:1566 Steps:13 meanR:22.6900 R:14.0000 gloss:26.4821 gloss1-lgP:0.6981 gloss2-gQs:12.4855 gloss3-dQs:12.9078 gloss4-tgtQ:12.5480 dloss:9.1801 exploreP:0.9965
Episode:1567 Steps:47 meanR:23.0200 R:48.0000 gloss:27.4107 gloss1-lgP:0.6979 gloss2-gQs:13.0363 gloss3-dQs:13.1402 gloss4-tgtQ:13.0959 dloss:9.4669 exploreP:0.9965
Episode:1568 Steps:8 meanR:22.7400 R:9.0000 gloss:28.2500 gloss1-lgP:0.6940 gloss2-gQs:13.9453 gloss3-dQs:12.8949 gloss4-tgtQ:13.8545 dloss:8.7731 exploreP:0.9965
Episode:156

Episode:1614 Steps:9 meanR:22.7700 R:10.0000 gloss:25.1415 gloss1-lgP:0.6934 gloss2-gQs:11.6374 gloss3-dQs:12.9761 gloss4-tgtQ:11.6356 dloss:10.3291 exploreP:0.9964
Episode:1615 Steps:14 meanR:22.4000 R:15.0000 gloss:27.9742 gloss1-lgP:0.6997 gloss2-gQs:13.4237 gloss3-dQs:13.0378 gloss4-tgtQ:13.5186 dloss:7.2157 exploreP:0.9964
Episode:1616 Steps:15 meanR:22.4100 R:16.0000 gloss:29.1283 gloss1-lgP:0.6970 gloss2-gQs:14.1949 gloss3-dQs:13.3602 gloss4-tgtQ:14.2346 dloss:11.6184 exploreP:0.9964
Episode:1617 Steps:14 meanR:22.4700 R:15.0000 gloss:26.1340 gloss1-lgP:0.6960 gloss2-gQs:12.0911 gloss3-dQs:13.3316 gloss4-tgtQ:12.1220 dloss:17.2251 exploreP:0.9964
Episode:1618 Steps:17 meanR:22.5400 R:18.0000 gloss:27.1766 gloss1-lgP:0.6956 gloss2-gQs:12.9211 gloss3-dQs:13.1329 gloss4-tgtQ:13.0130 dloss:11.3160 exploreP:0.9964
Episode:1619 Steps:15 meanR:22.5800 R:16.0000 gloss:27.7222 gloss1-lgP:0.6971 gloss2-gQs:13.3539 gloss3-dQs:12.9929 gloss4-tgtQ:13.4188 dloss:6.4891 exploreP:0.9964
Episode

Episode:1664 Steps:17 meanR:21.0500 R:18.0000 gloss:27.8478 gloss1-lgP:0.6988 gloss2-gQs:13.2137 gloss3-dQs:13.3632 gloss4-tgtQ:13.2777 dloss:8.1571 exploreP:0.9963
Episode:1665 Steps:22 meanR:21.1800 R:23.0000 gloss:28.8597 gloss1-lgP:0.6971 gloss2-gQs:13.9912 gloss3-dQs:13.3966 gloss4-tgtQ:14.0163 dloss:9.5720 exploreP:0.9963
Episode:1666 Steps:11 meanR:21.1600 R:12.0000 gloss:25.9409 gloss1-lgP:0.6937 gloss2-gQs:12.0101 gloss3-dQs:13.4179 gloss4-tgtQ:11.9457 dloss:19.1837 exploreP:0.9963
Episode:1667 Steps:9 meanR:20.7800 R:10.0000 gloss:30.0449 gloss1-lgP:0.6964 gloss2-gQs:14.8967 gloss3-dQs:13.3201 gloss4-tgtQ:14.9219 dloss:15.1394 exploreP:0.9963
Episode:1668 Steps:17 meanR:20.8700 R:18.0000 gloss:27.3506 gloss1-lgP:0.6940 gloss2-gQs:13.0437 gloss3-dQs:13.3033 gloss4-tgtQ:13.0561 dloss:8.6809 exploreP:0.9963
Episode:1669 Steps:11 meanR:20.7200 R:12.0000 gloss:26.9824 gloss1-lgP:0.6974 gloss2-gQs:12.6901 gloss3-dQs:13.2463 gloss4-tgtQ:12.7383 dloss:5.9096 exploreP:0.9963
Episode:1

Episode:1714 Steps:24 meanR:21.2300 R:25.0000 gloss:28.1232 gloss1-lgP:0.6977 gloss2-gQs:13.3718 gloss3-dQs:13.5292 gloss4-tgtQ:13.4017 dloss:11.3388 exploreP:0.9962
Episode:1715 Steps:27 meanR:21.3600 R:28.0000 gloss:28.6753 gloss1-lgP:0.6961 gloss2-gQs:13.7468 gloss3-dQs:13.6752 gloss4-tgtQ:13.7711 dloss:8.0693 exploreP:0.9962
Episode:1716 Steps:14 meanR:21.3500 R:15.0000 gloss:30.7634 gloss1-lgP:0.6973 gloss2-gQs:15.0867 gloss3-dQs:13.9368 gloss4-tgtQ:15.1092 dloss:14.5649 exploreP:0.9962
Episode:1717 Steps:18 meanR:21.3900 R:19.0000 gloss:28.4608 gloss1-lgP:0.6978 gloss2-gQs:13.3588 gloss3-dQs:14.0514 gloss4-tgtQ:13.3638 dloss:16.5222 exploreP:0.9962
Episode:1718 Steps:15 meanR:21.3700 R:16.0000 gloss:29.3522 gloss1-lgP:0.7003 gloss2-gQs:13.9054 gloss3-dQs:14.1325 gloss4-tgtQ:13.8866 dloss:13.7169 exploreP:0.9962
Episode:1719 Steps:18 meanR:21.4000 R:19.0000 gloss:28.0461 gloss1-lgP:0.6973 gloss2-gQs:13.2043 gloss3-dQs:13.7418 gloss4-tgtQ:13.2584 dloss:13.7166 exploreP:0.9962
Episo

Episode:1764 Steps:29 meanR:20.7400 R:30.0000 gloss:28.0148 gloss1-lgP:0.6954 gloss2-gQs:13.4744 gloss3-dQs:13.3181 gloss4-tgtQ:13.4940 dloss:5.9935 exploreP:0.9961
Episode:1765 Steps:13 meanR:20.6500 R:14.0000 gloss:28.1196 gloss1-lgP:0.6999 gloss2-gQs:13.3445 gloss3-dQs:13.4745 gloss4-tgtQ:13.3596 dloss:7.4646 exploreP:0.9961
Episode:1766 Steps:17 meanR:20.7100 R:18.0000 gloss:28.8213 gloss1-lgP:0.6947 gloss2-gQs:13.9464 gloss3-dQs:13.5577 gloss4-tgtQ:13.9676 dloss:7.8497 exploreP:0.9961
Episode:1767 Steps:35 meanR:20.9700 R:36.0000 gloss:28.2002 gloss1-lgP:0.6978 gloss2-gQs:13.3784 gloss3-dQs:13.6219 gloss4-tgtQ:13.4120 dloss:7.6955 exploreP:0.9961
Episode:1768 Steps:27 meanR:21.0700 R:28.0000 gloss:28.6326 gloss1-lgP:0.6953 gloss2-gQs:13.8285 gloss3-dQs:13.5251 gloss4-tgtQ:13.8326 dloss:18.4770 exploreP:0.9961
Episode:1769 Steps:11 meanR:21.0700 R:12.0000 gloss:24.8464 gloss1-lgP:0.6959 gloss2-gQs:11.1449 gloss3-dQs:13.3663 gloss4-tgtQ:11.1955 dloss:17.1922 exploreP:0.9961
Episode:

Episode:1815 Steps:17 meanR:20.4200 R:18.0000 gloss:26.7233 gloss1-lgP:0.6956 gloss2-gQs:12.7724 gloss3-dQs:12.8057 gloss4-tgtQ:12.8414 dloss:5.9528 exploreP:0.9960
Episode:1816 Steps:11 meanR:20.3900 R:12.0000 gloss:26.2331 gloss1-lgP:0.6940 gloss2-gQs:12.4978 gloss3-dQs:12.8363 gloss4-tgtQ:12.4640 dloss:10.4265 exploreP:0.9960
Episode:1817 Steps:11 meanR:20.3200 R:12.0000 gloss:26.1186 gloss1-lgP:0.6969 gloss2-gQs:12.4679 gloss3-dQs:12.4941 gloss4-tgtQ:12.5079 dloss:6.1528 exploreP:0.9960
Episode:1818 Steps:20 meanR:20.3700 R:21.0000 gloss:27.2194 gloss1-lgP:0.7010 gloss2-gQs:12.9699 gloss3-dQs:12.8127 gloss4-tgtQ:13.0462 dloss:8.2939 exploreP:0.9960
Episode:1819 Steps:38 meanR:20.5700 R:39.0000 gloss:26.5181 gloss1-lgP:0.6960 gloss2-gQs:12.6547 gloss3-dQs:12.7674 gloss4-tgtQ:12.6805 dloss:7.1869 exploreP:0.9960
Episode:1820 Steps:31 meanR:20.7300 R:32.0000 gloss:26.9214 gloss1-lgP:0.6966 gloss2-gQs:12.9023 gloss3-dQs:12.8241 gloss4-tgtQ:12.9248 dloss:9.1238 exploreP:0.9960
Episode:1

Episode:1866 Steps:9 meanR:22.1600 R:10.0000 gloss:26.5528 gloss1-lgP:0.6960 gloss2-gQs:12.7081 gloss3-dQs:12.7068 gloss4-tgtQ:12.7314 dloss:9.7229 exploreP:0.9959
Episode:1867 Steps:24 meanR:22.0500 R:25.0000 gloss:25.5969 gloss1-lgP:0.6965 gloss2-gQs:12.0553 gloss3-dQs:12.6051 gloss4-tgtQ:12.1001 dloss:9.5620 exploreP:0.9959
Episode:1868 Steps:17 meanR:21.9500 R:18.0000 gloss:25.3085 gloss1-lgP:0.6935 gloss2-gQs:12.1151 gloss3-dQs:12.2747 gloss4-tgtQ:12.0989 dloss:7.3873 exploreP:0.9959
Episode:1869 Steps:34 meanR:22.1800 R:35.0000 gloss:24.9831 gloss1-lgP:0.6970 gloss2-gQs:11.8927 gloss3-dQs:12.0186 gloss4-tgtQ:11.9363 dloss:5.6258 exploreP:0.9959
Episode:1870 Steps:15 meanR:22.1700 R:16.0000 gloss:26.1019 gloss1-lgP:0.6971 gloss2-gQs:12.6494 gloss3-dQs:12.1342 gloss4-tgtQ:12.6517 dloss:8.2380 exploreP:0.9959
Episode:1871 Steps:18 meanR:22.1400 R:19.0000 gloss:25.0331 gloss1-lgP:0.6961 gloss2-gQs:11.8634 gloss3-dQs:12.2035 gloss4-tgtQ:11.8961 dloss:5.7832 exploreP:0.9959
Episode:187

Episode:1916 Steps:11 meanR:22.5400 R:12.0000 gloss:24.0312 gloss1-lgP:0.6947 gloss2-gQs:11.3149 gloss3-dQs:11.9207 gloss4-tgtQ:11.3757 dloss:13.8747 exploreP:0.9958
Episode:1917 Steps:36 meanR:22.7900 R:37.0000 gloss:25.1984 gloss1-lgP:0.6957 gloss2-gQs:12.0787 gloss3-dQs:11.9977 gloss4-tgtQ:12.1351 dloss:8.6152 exploreP:0.9958
Episode:1918 Steps:16 meanR:22.7500 R:17.0000 gloss:26.1522 gloss1-lgP:0.6962 gloss2-gQs:12.6412 gloss3-dQs:12.2500 gloss4-tgtQ:12.6754 dloss:6.2647 exploreP:0.9958
Episode:1919 Steps:9 meanR:22.4600 R:10.0000 gloss:27.0295 gloss1-lgP:0.6943 gloss2-gQs:13.2420 gloss3-dQs:12.4493 gloss4-tgtQ:13.2365 dloss:7.1299 exploreP:0.9958
Episode:1920 Steps:10 meanR:22.2500 R:11.0000 gloss:27.3624 gloss1-lgP:0.7008 gloss2-gQs:13.1635 gloss3-dQs:12.6468 gloss4-tgtQ:13.2332 dloss:7.0804 exploreP:0.9958
Episode:1921 Steps:11 meanR:22.1800 R:12.0000 gloss:25.3979 gloss1-lgP:0.6959 gloss2-gQs:11.8978 gloss3-dQs:12.6429 gloss4-tgtQ:11.9643 dloss:9.7741 exploreP:0.9958
Episode:19

Episode:1967 Steps:10 meanR:20.7800 R:11.0000 gloss:26.5403 gloss1-lgP:0.7056 gloss2-gQs:12.0602 gloss3-dQs:13.4656 gloss4-tgtQ:12.0709 dloss:13.3407 exploreP:0.9957
Episode:1968 Steps:20 meanR:20.8100 R:21.0000 gloss:26.9562 gloss1-lgP:0.6973 gloss2-gQs:12.9399 gloss3-dQs:12.7816 gloss4-tgtQ:12.9330 dloss:11.8268 exploreP:0.9957
Episode:1969 Steps:41 meanR:20.8800 R:42.0000 gloss:26.5490 gloss1-lgP:0.6987 gloss2-gQs:12.6534 gloss3-dQs:12.6433 gloss4-tgtQ:12.6928 dloss:16.4154 exploreP:0.9957
Episode:1970 Steps:39 meanR:21.1200 R:40.0000 gloss:27.3036 gloss1-lgP:0.6974 gloss2-gQs:13.0723 gloss3-dQs:12.9837 gloss4-tgtQ:13.0923 dloss:7.6589 exploreP:0.9957
Episode:1971 Steps:20 meanR:21.1400 R:21.0000 gloss:27.5849 gloss1-lgP:0.6953 gloss2-gQs:13.3123 gloss3-dQs:13.0082 gloss4-tgtQ:13.3483 dloss:5.6014 exploreP:0.9957
Episode:1972 Steps:49 meanR:21.4200 R:50.0000 gloss:27.7891 gloss1-lgP:0.6962 gloss2-gQs:13.2436 gloss3-dQs:13.4229 gloss4-tgtQ:13.2500 dloss:6.8446 exploreP:0.9957
Episode

Episode:2018 Steps:11 meanR:20.8200 R:12.0000 gloss:26.1793 gloss1-lgP:0.6951 gloss2-gQs:12.5384 gloss3-dQs:12.5765 gloss4-tgtQ:12.5530 dloss:8.3836 exploreP:0.9956
Episode:2019 Steps:14 meanR:20.8700 R:15.0000 gloss:25.9278 gloss1-lgP:0.6971 gloss2-gQs:12.1690 gloss3-dQs:12.8261 gloss4-tgtQ:12.2039 dloss:17.0592 exploreP:0.9956
Episode:2020 Steps:18 meanR:20.9500 R:19.0000 gloss:26.8621 gloss1-lgP:0.6969 gloss2-gQs:12.9569 gloss3-dQs:12.6334 gloss4-tgtQ:12.9658 dloss:14.9017 exploreP:0.9956
Episode:2021 Steps:20 meanR:21.0400 R:21.0000 gloss:25.6580 gloss1-lgP:0.6926 gloss2-gQs:12.2345 gloss3-dQs:12.5405 gloss4-tgtQ:12.2635 dloss:8.9980 exploreP:0.9956
Episode:2022 Steps:26 meanR:21.2000 R:27.0000 gloss:26.2620 gloss1-lgP:0.6956 gloss2-gQs:12.5437 gloss3-dQs:12.6157 gloss4-tgtQ:12.5927 dloss:6.8606 exploreP:0.9956
Episode:2023 Steps:53 meanR:21.4300 R:54.0000 gloss:25.3624 gloss1-lgP:0.6945 gloss2-gQs:12.0876 gloss3-dQs:12.3157 gloss4-tgtQ:12.1105 dloss:6.2585 exploreP:0.9956
Episode:

Episode:2068 Steps:11 meanR:21.8500 R:12.0000 gloss:26.0511 gloss1-lgP:0.6956 gloss2-gQs:12.4649 gloss3-dQs:12.4814 gloss4-tgtQ:12.5057 dloss:5.9582 exploreP:0.9955
Episode:2069 Steps:36 meanR:21.8000 R:37.0000 gloss:25.7217 gloss1-lgP:0.6944 gloss2-gQs:12.3291 gloss3-dQs:12.3849 gloss4-tgtQ:12.3258 dloss:5.7971 exploreP:0.9955
Episode:2070 Steps:14 meanR:21.5500 R:15.0000 gloss:28.0749 gloss1-lgP:0.6981 gloss2-gQs:13.7194 gloss3-dQs:12.7426 gloss4-tgtQ:13.7642 dloss:9.2024 exploreP:0.9955
Episode:2071 Steps:12 meanR:21.4700 R:13.0000 gloss:27.3320 gloss1-lgP:0.7007 gloss2-gQs:12.9165 gloss3-dQs:13.1032 gloss4-tgtQ:12.9770 dloss:4.9699 exploreP:0.9955
Episode:2072 Steps:35 meanR:21.3300 R:36.0000 gloss:27.2851 gloss1-lgP:0.7067 gloss2-gQs:12.8566 gloss3-dQs:12.8888 gloss4-tgtQ:12.8555 dloss:11.3833 exploreP:0.9955
Episode:2073 Steps:11 meanR:21.3500 R:12.0000 gloss:25.5749 gloss1-lgP:0.6982 gloss2-gQs:11.8323 gloss3-dQs:12.9268 gloss4-tgtQ:11.8773 dloss:9.6465 exploreP:0.9955
Episode:2

Episode:2119 Steps:8 meanR:20.6100 R:9.0000 gloss:26.1618 gloss1-lgP:0.6925 gloss2-gQs:12.5779 gloss3-dQs:12.5756 gloss4-tgtQ:12.6324 dloss:7.9434 exploreP:0.9954
Episode:2120 Steps:37 meanR:20.8000 R:38.0000 gloss:25.8927 gloss1-lgP:0.6946 gloss2-gQs:12.3628 gloss3-dQs:12.5615 gloss4-tgtQ:12.3609 dloss:8.4802 exploreP:0.9954
Episode:2121 Steps:19 meanR:20.7900 R:20.0000 gloss:24.9008 gloss1-lgP:0.6901 gloss2-gQs:11.8990 gloss3-dQs:12.2801 gloss4-tgtQ:11.9177 dloss:8.8891 exploreP:0.9954
Episode:2122 Steps:14 meanR:20.6700 R:15.0000 gloss:24.4547 gloss1-lgP:0.6975 gloss2-gQs:11.5101 gloss3-dQs:11.9559 gloss4-tgtQ:11.5735 dloss:16.4198 exploreP:0.9954
Episode:2123 Steps:37 meanR:20.5100 R:38.0000 gloss:26.6044 gloss1-lgP:0.6959 gloss2-gQs:12.8763 gloss3-dQs:12.4256 gloss4-tgtQ:12.9429 dloss:12.9716 exploreP:0.9954
Episode:2124 Steps:30 meanR:20.6300 R:31.0000 gloss:25.8003 gloss1-lgP:0.6943 gloss2-gQs:12.2605 gloss3-dQs:12.6286 gloss4-tgtQ:12.2712 dloss:6.1372 exploreP:0.9954
Episode:21

Episode:2169 Steps:24 meanR:21.4800 R:25.0000 gloss:25.9161 gloss1-lgP:0.6927 gloss2-gQs:12.5008 gloss3-dQs:12.3945 gloss4-tgtQ:12.5078 dloss:7.0636 exploreP:0.9952
Episode:2170 Steps:11 meanR:21.4500 R:12.0000 gloss:26.5510 gloss1-lgP:0.6929 gloss2-gQs:12.8331 gloss3-dQs:12.6480 gloss4-tgtQ:12.8381 dloss:5.4400 exploreP:0.9952
Episode:2171 Steps:14 meanR:21.4700 R:15.0000 gloss:25.9364 gloss1-lgP:0.6937 gloss2-gQs:12.4247 gloss3-dQs:12.5453 gloss4-tgtQ:12.4313 dloss:7.4596 exploreP:0.9952
Episode:2172 Steps:15 meanR:21.2700 R:16.0000 gloss:25.5393 gloss1-lgP:0.6967 gloss2-gQs:12.0473 gloss3-dQs:12.5368 gloss4-tgtQ:12.0746 dloss:7.3811 exploreP:0.9952
Episode:2173 Steps:8 meanR:21.2400 R:9.0000 gloss:24.8153 gloss1-lgP:0.6928 gloss2-gQs:11.7122 gloss3-dQs:12.3532 gloss4-tgtQ:11.7576 dloss:4.2592 exploreP:0.9952
Episode:2174 Steps:8 meanR:21.2400 R:9.0000 gloss:25.2860 gloss1-lgP:0.6957 gloss2-gQs:12.1068 gloss3-dQs:12.1306 gloss4-tgtQ:12.1127 dloss:5.1676 exploreP:0.9952
Episode:2175 S

Episode:2220 Steps:18 meanR:22.4600 R:19.0000 gloss:26.0242 gloss1-lgP:0.6993 gloss2-gQs:12.5441 gloss3-dQs:12.0705 gloss4-tgtQ:12.5934 dloss:10.0305 exploreP:0.9951
Episode:2221 Steps:9 meanR:22.3600 R:10.0000 gloss:27.7842 gloss1-lgP:0.6967 gloss2-gQs:13.7250 gloss3-dQs:12.4423 gloss4-tgtQ:13.7143 dloss:9.0517 exploreP:0.9951
Episode:2222 Steps:10 meanR:22.3200 R:11.0000 gloss:27.7785 gloss1-lgP:0.7005 gloss2-gQs:13.5137 gloss3-dQs:12.5682 gloss4-tgtQ:13.5695 dloss:9.2340 exploreP:0.9951
Episode:2223 Steps:18 meanR:22.1300 R:19.0000 gloss:27.8198 gloss1-lgP:0.7008 gloss2-gQs:13.2396 gloss3-dQs:13.1739 gloss4-tgtQ:13.2867 dloss:8.1431 exploreP:0.9951
Episode:2224 Steps:20 meanR:22.0300 R:21.0000 gloss:26.7625 gloss1-lgP:0.6970 gloss2-gQs:12.5889 gloss3-dQs:13.2216 gloss4-tgtQ:12.5879 dloss:6.8106 exploreP:0.9951
Episode:2225 Steps:54 meanR:22.4300 R:55.0000 gloss:26.8243 gloss1-lgP:0.6956 gloss2-gQs:12.8380 gloss3-dQs:12.8695 gloss4-tgtQ:12.8519 dloss:6.0407 exploreP:0.9951
Episode:22

Episode:2271 Steps:18 meanR:20.7600 R:19.0000 gloss:25.7571 gloss1-lgP:0.6961 gloss2-gQs:12.3705 gloss3-dQs:12.1978 gloss4-tgtQ:12.4287 dloss:4.7455 exploreP:0.9950
Episode:2272 Steps:34 meanR:20.9500 R:35.0000 gloss:25.5518 gloss1-lgP:0.6939 gloss2-gQs:12.2384 gloss3-dQs:12.3387 gloss4-tgtQ:12.2424 dloss:6.1550 exploreP:0.9950
Episode:2273 Steps:21 meanR:21.0800 R:22.0000 gloss:25.5329 gloss1-lgP:0.6928 gloss2-gQs:12.2198 gloss3-dQs:12.3739 gloss4-tgtQ:12.2674 dloss:7.4937 exploreP:0.9950
Episode:2274 Steps:24 meanR:21.2400 R:25.0000 gloss:26.7897 gloss1-lgP:0.6994 gloss2-gQs:12.8948 gloss3-dQs:12.4571 gloss4-tgtQ:12.9468 dloss:10.3938 exploreP:0.9950
Episode:2275 Steps:15 meanR:20.9900 R:16.0000 gloss:27.0992 gloss1-lgP:0.6938 gloss2-gQs:13.0730 gloss3-dQs:12.8851 gloss4-tgtQ:13.1110 dloss:7.3951 exploreP:0.9950
Episode:2276 Steps:24 meanR:20.9800 R:25.0000 gloss:27.3725 gloss1-lgP:0.6959 gloss2-gQs:13.2519 gloss3-dQs:12.8305 gloss4-tgtQ:13.2562 dloss:9.4283 exploreP:0.9950
...

Episode:2322 Steps:22 meanR:20.8900 R:23.0000 gloss:24.9912 gloss1-lgP:0.6969 gloss2-gQs:11.8024 gloss3-dQs:12.2104 gloss4-tgtQ:11.8394 dloss:6.3533 exploreP:0.9949
Episode:2323 Steps:18 meanR:20.8900 R:19.0000 gloss:27.3263 gloss1-lgP:0.6954 gloss2-gQs:13.4007 gloss3-dQs:12.4913 gloss4-tgtQ:13.4024 dloss:7.1883 exploreP:0.9949
Episode:2324 Steps:40 meanR:21.0900 R:41.0000 gloss:26.7143 gloss1-lgP:0.6964 gloss2-gQs:12.8322 gloss3-dQs:12.6842 gloss4-tgtQ:12.8420 dloss:8.9714 exploreP:0.9949
Episode:2325 Steps:33 meanR:20.8800 R:34.0000 gloss:25.9919 gloss1-lgP:0.6970 gloss2-gQs:12.2729 gloss3-dQs:12.7279 gloss4-tgtQ:12.2986 dloss:8.1189 exploreP:0.9949
Episode:2326 Steps:23 meanR:20.9200 R:24.0000 gloss:25.1267 gloss1-lgP:0.6988 gloss2-gQs:11.7726 gloss3-dQs:12.3533 gloss4-tgtQ:11.8269 dloss:7.6819 exploreP:0.9949
Episode:2327 Steps:10 meanR:20.9100 R:11.0000 gloss:25.5078 gloss1-lgP:0.7014 gloss2-gQs:12.0799 gloss3-dQs:12.1854 gloss4-tgtQ:12.1237 dloss:10.3627 exploreP:0.9949
...

Episode:2373 Steps:28 meanR:20.7700 R:29.0000 gloss:25.4340 gloss1-lgP:0.6945 gloss2-gQs:12.0513 gloss3-dQs:12.5030 gloss4-tgtQ:12.0692 dloss:5.8609 exploreP:0.9948
Episode:2374 Steps:21 meanR:20.7400 R:22.0000 gloss:26.1428 gloss1-lgP:0.6955 gloss2-gQs:12.6868 gloss3-dQs:12.2115 gloss4-tgtQ:12.6902 dloss:5.5217 exploreP:0.9948
Episode:2375 Steps:34 meanR:20.9300 R:35.0000 gloss:26.3672 gloss1-lgP:0.6954 gloss2-gQs:12.6956 gloss3-dQs:12.5151 gloss4-tgtQ:12.7010 dloss:7.6272 exploreP:0.9948
Episode:2376 Steps:30 meanR:20.9900 R:31.0000 gloss:26.6061 gloss1-lgP:0.7003 gloss2-gQs:12.6214 gloss3-dQs:12.7202 gloss4-tgtQ:12.6659 dloss:8.1964 exploreP:0.9948
Episode:2377 Steps:22 meanR:20.9100 R:23.0000 gloss:26.1619 gloss1-lgP:0.6945 gloss2-gQs:12.4547 gloss3-dQs:12.7415 gloss4-tgtQ:12.4671 dloss:5.1257 exploreP:0.9948
Episode:2378 Steps:48 meanR:21.2100 R:49.0000 gloss:25.3480 gloss1-lgP:0.6945 gloss2-gQs:12.1003 gloss3-dQs:12.2743 gloss4-tgtQ:12.1248 dloss:6.7536 exploreP:0.9948
...

Episode:2424 Steps:22 meanR:21.8200 R:23.0000 gloss:26.7171 gloss1-lgP:0.6953 gloss2-gQs:12.8157 gloss3-dQs:12.7764 gloss4-tgtQ:12.8213 dloss:8.8370 exploreP:0.9947
Episode:2425 Steps:14 meanR:21.6300 R:15.0000 gloss:26.9601 gloss1-lgP:0.6999 gloss2-gQs:12.8188 gloss3-dQs:12.8780 gloss4-tgtQ:12.8312 dloss:8.0138 exploreP:0.9947
Episode:2426 Steps:31 meanR:21.7100 R:32.0000 gloss:25.9183 gloss1-lgP:0.6955 gloss2-gQs:12.2807 gloss3-dQs:12.6941 gloss4-tgtQ:12.2906 dloss:7.1159 exploreP:0.9947
Episode:2427 Steps:35 meanR:21.9600 R:36.0000 gloss:26.2204 gloss1-lgP:0.6935 gloss2-gQs:12.6181 gloss3-dQs:12.5615 gloss4-tgtQ:12.6231 dloss:6.3207 exploreP:0.9947
Episode:2428 Steps:26 meanR:22.0600 R:27.0000 gloss:27.4409 gloss1-lgP:0.7001 gloss2-gQs:13.2294 gloss3-dQs:12.7450 gloss4-tgtQ:13.2348 dloss:11.3402 exploreP:0.9947
Episode:2429 Steps:16 meanR:22.0900 R:17.0000 gloss:24.6464 gloss1-lgP:0.7001 gloss2-gQs:11.3013 gloss3-dQs:12.5628 gloss4-tgtQ:11.3295 dloss:29.4276 exploreP:0.9947
...

Episode:2474 Steps:46 meanR:23.2800 R:47.0000 gloss:26.7373 gloss1-lgP:0.6971 gloss2-gQs:12.6127 gloss3-dQs:13.0945 gloss4-tgtQ:12.6438 dloss:9.2569 exploreP:0.9946
Episode:2475 Steps:12 meanR:23.0600 R:13.0000 gloss:27.6864 gloss1-lgP:0.6998 gloss2-gQs:13.2698 gloss3-dQs:12.9669 gloss4-tgtQ:13.3033 dloss:11.9048 exploreP:0.9946
Episode:2476 Steps:23 meanR:22.9900 R:24.0000 gloss:25.5466 gloss1-lgP:0.6934 gloss2-gQs:12.1086 gloss3-dQs:12.6504 gloss4-tgtQ:12.0831 dloss:6.1889 exploreP:0.9946
Episode:2477 Steps:16 meanR:22.9300 R:17.0000 gloss:25.3870 gloss1-lgP:0.6956 gloss2-gQs:12.0065 gloss3-dQs:12.4188 gloss4-tgtQ:12.0477 dloss:9.2563 exploreP:0.9946
Episode:2478 Steps:19 meanR:22.6400 R:20.0000 gloss:26.1725 gloss1-lgP:0.6946 gloss2-gQs:12.6809 gloss3-dQs:12.2790 gloss4-tgtQ:12.7234 dloss:7.3999 exploreP:0.9946
Episode:2479 Steps:10 meanR:22.4800 R:11.0000 gloss:25.7585 gloss1-lgP:0.6948 gloss2-gQs:12.2417 gloss3-dQs:12.5466 gloss4-tgtQ:12.2995 dloss:6.2024 exploreP:0.9946
...

Episode:2524 Steps:11 meanR:22.1500 R:12.0000 gloss:25.9705 gloss1-lgP:0.6965 gloss2-gQs:12.4160 gloss3-dQs:12.5335 gloss4-tgtQ:12.3541 dloss:13.6514 exploreP:0.9945
Episode:2525 Steps:30 meanR:22.3100 R:31.0000 gloss:26.0957 gloss1-lgP:0.6957 gloss2-gQs:12.4076 gloss3-dQs:12.6599 gloss4-tgtQ:12.4491 dloss:9.3004 exploreP:0.9945
Episode:2526 Steps:16 meanR:22.1600 R:17.0000 gloss:25.0398 gloss1-lgP:0.6946 gloss2-gQs:11.9351 gloss3-dQs:12.1422 gloss4-tgtQ:11.9676 dloss:9.1270 exploreP:0.9945
Episode:2527 Steps:24 meanR:22.0500 R:25.0000 gloss:26.0888 gloss1-lgP:0.6965 gloss2-gQs:12.5628 gloss3-dQs:12.2832 gloss4-tgtQ:12.6126 dloss:6.4631 exploreP:0.9945
Episode:2528 Steps:41 meanR:22.2000 R:42.0000 gloss:27.0881 gloss1-lgP:0.6967 gloss2-gQs:13.0754 gloss3-dQs:12.7008 gloss4-tgtQ:13.1020 dloss:7.4294 exploreP:0.9945
Episode:2529 Steps:17 meanR:22.2100 R:18.0000 gloss:27.5145 gloss1-lgP:0.6947 gloss2-gQs:13.2480 gloss3-dQs:13.1576 gloss4-tgtQ:13.1947 dloss:7.2591 exploreP:0.9945
...

Episode:2574 Steps:55 meanR:21.6800 R:56.0000 gloss:26.3445 gloss1-lgP:0.6947 gloss2-gQs:12.5983 gloss3-dQs:12.7104 gloss4-tgtQ:12.6088 dloss:5.7733 exploreP:0.9944
Episode:2575 Steps:12 meanR:21.6800 R:13.0000 gloss:27.5538 gloss1-lgP:0.6992 gloss2-gQs:13.2541 gloss3-dQs:12.8529 gloss4-tgtQ:13.2880 dloss:7.5638 exploreP:0.9944
Episode:2576 Steps:22 meanR:21.6700 R:23.0000 gloss:26.1784 gloss1-lgP:0.6957 gloss2-gQs:12.4183 gloss3-dQs:12.8001 gloss4-tgtQ:12.4317 dloss:11.2063 exploreP:0.9944
Episode:2577 Steps:26 meanR:21.7700 R:27.0000 gloss:26.6634 gloss1-lgP:0.6977 gloss2-gQs:12.7317 gloss3-dQs:12.7155 gloss4-tgtQ:12.7658 dloss:7.5111 exploreP:0.9944
Episode:2578 Steps:13 meanR:21.7100 R:14.0000 gloss:26.7281 gloss1-lgP:0.6904 gloss2-gQs:13.0479 gloss3-dQs:12.6439 gloss4-tgtQ:13.0116 dloss:8.2685 exploreP:0.9944
Episode:2579 Steps:12 meanR:21.7300 R:13.0000 gloss:25.1668 gloss1-lgP:0.7010 gloss2-gQs:11.5109 gloss3-dQs:12.7407 gloss4-tgtQ:11.6454 dloss:7.6508 exploreP:0.9944
...

Episode:2624 Steps:23 meanR:21.9900 R:24.0000 gloss:26.4939 gloss1-lgP:0.6943 gloss2-gQs:12.7589 gloss3-dQs:12.6402 gloss4-tgtQ:12.7661 dloss:7.2381 exploreP:0.9943
Episode:2625 Steps:11 meanR:21.8000 R:12.0000 gloss:27.0194 gloss1-lgP:0.7011 gloss2-gQs:12.9103 gloss3-dQs:12.6980 gloss4-tgtQ:12.9307 dloss:6.4331 exploreP:0.9943
Episode:2626 Steps:29 meanR:21.9300 R:30.0000 gloss:26.1222 gloss1-lgP:0.6948 gloss2-gQs:12.5034 gloss3-dQs:12.5790 gloss4-tgtQ:12.5134 dloss:5.4585 exploreP:0.9943
Episode:2627 Steps:16 meanR:21.8500 R:17.0000 gloss:26.2097 gloss1-lgP:0.6945 gloss2-gQs:12.4567 gloss3-dQs:12.7737 gloss4-tgtQ:12.5035 dloss:5.4398 exploreP:0.9943
Episode:2628 Steps:11 meanR:21.5500 R:12.0000 gloss:25.5608 gloss1-lgP:0.6948 gloss2-gQs:12.1934 gloss3-dQs:12.4204 gloss4-tgtQ:12.1789 dloss:6.3512 exploreP:0.9943
Episode:2629 Steps:10 meanR:21.4800 R:11.0000 gloss:25.7382 gloss1-lgP:0.6930 gloss2-gQs:12.3838 gloss3-dQs:12.3410 gloss4-tgtQ:12.4121 dloss:4.3338 exploreP:0.9943
...

Episode:2674 Steps:34 meanR:22.0900 R:35.0000 gloss:27.0288 gloss1-lgP:0.6930 gloss2-gQs:13.1390 gloss3-dQs:12.7318 gloss4-tgtQ:13.1268 dloss:8.3508 exploreP:0.9942
Episode:2675 Steps:17 meanR:22.1400 R:18.0000 gloss:26.2964 gloss1-lgP:0.6949 gloss2-gQs:12.5173 gloss3-dQs:12.8116 gloss4-tgtQ:12.5227 dloss:8.8104 exploreP:0.9942
Episode:2676 Steps:46 meanR:22.3800 R:47.0000 gloss:26.4481 gloss1-lgP:0.6952 gloss2-gQs:12.6156 gloss3-dQs:12.7910 gloss4-tgtQ:12.6302 dloss:8.9260 exploreP:0.9942
Episode:2677 Steps:17 meanR:22.2900 R:18.0000 gloss:26.7273 gloss1-lgP:0.6983 gloss2-gQs:12.8353 gloss3-dQs:12.5402 gloss4-tgtQ:12.8979 dloss:8.6557 exploreP:0.9942
Episode:2678 Steps:16 meanR:22.3200 R:17.0000 gloss:27.2659 gloss1-lgP:0.6950 gloss2-gQs:13.1456 gloss3-dQs:12.9352 gloss4-tgtQ:13.1502 dloss:5.1357 exploreP:0.9942
Episode:2679 Steps:46 meanR:22.6600 R:47.0000 gloss:27.7263 gloss1-lgP:0.6959 gloss2-gQs:13.3532 gloss3-dQs:13.1500 gloss4-tgtQ:13.3479 dloss:6.6111 exploreP:0.9942
...

Episode:2724 Steps:32 meanR:23.4300 R:33.0000 gloss:27.1964 gloss1-lgP:0.6935 gloss2-gQs:13.0114 gloss3-dQs:13.2067 gloss4-tgtQ:12.9984 dloss:5.4060 exploreP:0.9940
Episode:2725 Steps:26 meanR:23.5800 R:27.0000 gloss:26.1524 gloss1-lgP:0.6948 gloss2-gQs:12.3329 gloss3-dQs:12.9638 gloss4-tgtQ:12.3351 dloss:13.7774 exploreP:0.9940
Episode:2726 Steps:10 meanR:23.3900 R:11.0000 gloss:25.7488 gloss1-lgP:0.6989 gloss2-gQs:12.2398 gloss3-dQs:12.2941 gloss4-tgtQ:12.3241 dloss:22.5162 exploreP:0.9940
Episode:2727 Steps:14 meanR:23.3700 R:15.0000 gloss:25.2685 gloss1-lgP:0.6972 gloss2-gQs:11.8836 gloss3-dQs:12.4316 gloss4-tgtQ:11.9287 dloss:27.9038 exploreP:0.9940
Episode:2728 Steps:12 meanR:23.3800 R:13.0000 gloss:26.9038 gloss1-lgP:0.6970 gloss2-gQs:13.1039 gloss3-dQs:12.3900 gloss4-tgtQ:13.1145 dloss:9.4646 exploreP:0.9940
Episode:2729 Steps:19 meanR:23.4700 R:20.0000 gloss:25.7802 gloss1-lgP:0.6933 gloss2-gQs:12.2890 gloss3-dQs:12.5820 gloss4-tgtQ:12.3096 dloss:6.2077 exploreP:0.9940
...

Episode:2774 Steps:24 meanR:22.3100 R:25.0000 gloss:28.6444 gloss1-lgP:0.6951 gloss2-gQs:14.1145 gloss3-dQs:12.9278 gloss4-tgtQ:14.1593 dloss:14.5072 exploreP:0.9940
Episode:2775 Steps:10 meanR:22.2400 R:11.0000 gloss:25.9431 gloss1-lgP:0.6920 gloss2-gQs:12.2206 gloss3-dQs:13.0266 gloss4-tgtQ:12.2417 dloss:6.1418 exploreP:0.9939
Episode:2776 Steps:13 meanR:21.9100 R:14.0000 gloss:27.1512 gloss1-lgP:0.6917 gloss2-gQs:13.0222 gloss3-dQs:13.2127 gloss4-tgtQ:13.0160 dloss:6.0688 exploreP:0.9939
Episode:2777 Steps:20 meanR:21.9400 R:21.0000 gloss:26.9300 gloss1-lgP:0.6951 gloss2-gQs:12.8660 gloss3-dQs:12.9746 gloss4-tgtQ:12.8985 dloss:4.2008 exploreP:0.9939
Episode:2778 Steps:20 meanR:21.9800 R:21.0000 gloss:27.3632 gloss1-lgP:0.6969 gloss2-gQs:13.1576 gloss3-dQs:12.9260 gloss4-tgtQ:13.1732 dloss:5.9710 exploreP:0.9939
Episode:2779 Steps:14 meanR:21.6600 R:15.0000 gloss:27.3808 gloss1-lgP:0.6931 gloss2-gQs:13.1834 gloss3-dQs:13.1195 gloss4-tgtQ:13.1993 dloss:5.2472 exploreP:0.9939
...

Episode:2825 Steps:11 meanR:20.4600 R:12.0000 gloss:26.0213 gloss1-lgP:0.6964 gloss2-gQs:12.1263 gloss3-dQs:13.1021 gloss4-tgtQ:12.1547 dloss:14.0964 exploreP:0.9938
Episode:2826 Steps:24 meanR:20.6000 R:25.0000 gloss:25.9081 gloss1-lgP:0.6946 gloss2-gQs:12.2895 gloss3-dQs:12.7186 gloss4-tgtQ:12.2896 dloss:8.4823 exploreP:0.9938
Episode:2827 Steps:16 meanR:20.6200 R:17.0000 gloss:26.1293 gloss1-lgP:0.6948 gloss2-gQs:12.4399 gloss3-dQs:12.6581 gloss4-tgtQ:12.5083 dloss:9.7063 exploreP:0.9938
Episode:2828 Steps:21 meanR:20.7100 R:22.0000 gloss:27.3421 gloss1-lgP:0.6980 gloss2-gQs:13.1790 gloss3-dQs:12.7765 gloss4-tgtQ:13.2152 dloss:9.1853 exploreP:0.9938
Episode:2829 Steps:12 meanR:20.6400 R:13.0000 gloss:26.3307 gloss1-lgP:0.6947 gloss2-gQs:12.6369 gloss3-dQs:12.6137 gloss4-tgtQ:12.6519 dloss:6.4716 exploreP:0.9938
Episode:2830 Steps:28 meanR:20.7300 R:29.0000 gloss:26.9611 gloss1-lgP:0.6969 gloss2-gQs:12.9429 gloss3-dQs:12.7670 gloss4-tgtQ:12.9726 dloss:8.2680 exploreP:0.9938
...

Episode:2876 Steps:18 meanR:21.8200 R:19.0000 gloss:27.2325 gloss1-lgP:0.6974 gloss2-gQs:13.1174 gloss3-dQs:12.7875 gloss4-tgtQ:13.1488 dloss:6.5905 exploreP:0.9937
Episode:2877 Steps:10 meanR:21.7200 R:11.0000 gloss:26.8007 gloss1-lgP:0.6972 gloss2-gQs:12.7657 gloss3-dQs:12.9038 gloss4-tgtQ:12.7717 dloss:5.1915 exploreP:0.9937
Episode:2878 Steps:17 meanR:21.6900 R:18.0000 gloss:27.4953 gloss1-lgP:0.6969 gloss2-gQs:13.2255 gloss3-dQs:12.9803 gloss4-tgtQ:13.2441 dloss:5.7841 exploreP:0.9937
Episode:2879 Steps:16 meanR:21.7100 R:17.0000 gloss:26.5041 gloss1-lgP:0.6948 gloss2-gQs:12.6506 gloss3-dQs:12.8308 gloss4-tgtQ:12.6679 dloss:7.9590 exploreP:0.9937
Episode:2880 Steps:18 meanR:21.7600 R:19.0000 gloss:27.1945 gloss1-lgP:0.6951 gloss2-gQs:13.0729 gloss3-dQs:12.9673 gloss4-tgtQ:13.0803 dloss:8.4857 exploreP:0.9937
Episode:2881 Steps:42 meanR:21.8700 R:43.0000 gloss:27.9583 gloss1-lgP:0.6960 gloss2-gQs:13.4258 gloss3-dQs:13.2808 gloss4-tgtQ:13.4514 dloss:9.0904 exploreP:0.9937
...

Episode:2926 Steps:25 meanR:21.4600 R:26.0000 gloss:26.1277 gloss1-lgP:0.7002 gloss2-gQs:12.2213 gloss3-dQs:12.7670 gloss4-tgtQ:12.2966 dloss:12.8558 exploreP:0.9936
Episode:2927 Steps:11 meanR:21.4100 R:12.0000 gloss:25.5124 gloss1-lgP:0.6947 gloss2-gQs:12.1250 gloss3-dQs:12.5096 gloss4-tgtQ:12.1051 dloss:19.7659 exploreP:0.9936
Episode:2928 Steps:10 meanR:21.3000 R:11.0000 gloss:27.3727 gloss1-lgP:0.6962 gloss2-gQs:13.3676 gloss3-dQs:12.6159 gloss4-tgtQ:13.3332 dloss:8.5076 exploreP:0.9936
Episode:2929 Steps:24 meanR:21.4200 R:25.0000 gloss:25.4211 gloss1-lgP:0.6943 gloss2-gQs:12.0348 gloss3-dQs:12.5005 gloss4-tgtQ:12.0782 dloss:6.7924 exploreP:0.9936
Episode:2930 Steps:21 meanR:21.3500 R:22.0000 gloss:26.0025 gloss1-lgP:0.6964 gloss2-gQs:12.5210 gloss3-dQs:12.2714 gloss4-tgtQ:12.5522 dloss:12.3576 exploreP:0.9936
Episode:2931 Steps:14 meanR:20.9800 R:15.0000 gloss:25.7398 gloss1-lgP:0.6926 gloss2-gQs:12.4110 gloss3-dQs:12.3568 gloss4-tgtQ:12.3988 dloss:10.6203 exploreP:0.9936
...

## Visualizing training

Below I'll plot the total rewards for each episode, then the G and D losses. In each plot the raw per-episode values are drawn in grey and the 10-episode rolling average in blue.

In [None]:
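import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def running_mean(x, N):
    # Rolling mean over a window of N via cumulative sums; returns len(x) - N + 1 values
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N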
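
The cumulative-sum trick in `running_mean` can be checked on a tiny input. The sketch below uses made-up values (not data from this notebook) and compares the result against an explicit window average computed with `np.convolve`. Both yield `len(x) - N + 1` points, which is why the plotting cells slice the episode axis with `eps[-len(smoothed_arr):]`.

```python
import numpy as np

def running_mean(x, N):
    # Same definition as in the cell above.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up rewards, for illustration only
fast = running_mean(x, 2)

# Reference: average each window of 2 values explicitly.
slow = np.convolve(x, np.ones(2) / 2, mode='valid')

print(fast)  # [1.5 2.5 3.5 4.5]
```

The difference `cumsum[N:] - cumsum[:-N]` is the sum of each window of `N` consecutive values, so dividing by `N` gives the window mean without an explicit loop.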

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(gloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(dloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')
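
The four plotting cells above differ only in the list they read and the y-axis label, so they could be folded into one helper. This is a minimal sketch, not part of the original notebook: `split_and_smooth` and `plot_metric` are hypothetical names, and it assumes each list holds `(episode, value)` pairs with at least `window` entries.

```python
import numpy as np

def split_and_smooth(pairs, window=10):
    """Split (episode, value) pairs and compute the rolling mean of the values.

    Returns (eps, values, smoothed_eps, smoothed_values).
    Assumes len(pairs) >= window.
    """
    eps, values = np.array(pairs, dtype=float).T
    cumsum = np.cumsum(np.insert(values, 0, 0))
    smoothed = (cumsum[window:] - cumsum[:-window]) / window
    return eps, values, eps[-len(smoothed):], smoothed

def plot_metric(pairs, ylabel, window=10):
    """Hypothetical helper: raw series in grey plus its rolling average."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    eps, values, s_eps, smoothed = split_and_smooth(pairs, window)
    plt.plot(s_eps, smoothed)
    plt.plot(eps, values, color='grey', alpha=0.3)
    plt.xlabel('Episode')
    plt.ylabel(ylabel)

# Made-up data standing in for a list such as episode_rewards_list.
demo = [(ep, float(ep % 5)) for ep in range(20)]
eps, values, s_eps, smoothed = split_and_smooth(demo, window=10)
```

With this in place, a call such as `plot_metric(gloss_list, 'G losses')` would reproduce the corresponding cell above.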

## Testing

Let's check out how our trained agent plays the game.

In [36]:
import gym
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Load the trained weights saved during training
    saver.restore(sess, 'checkpoints/model.ckpt')
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))

    # Episodes/epochs
    for _ in range(1):
        state = env.reset()
        total_reward = 0

        # Steps/batches
        while True:
            env.render()
            # Act greedily: pick the action with the largest logit (no exploration at test time)
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                print('total_reward: {}'.format(total_reward))
                break

env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt
total_reward: 500.0
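
The test above plays a single episode. Since Cart-Pole starts each episode from a slightly different state, averaging the greedy policy over several episodes may give a steadier picture. The sketch below shows that loop without TensorFlow or Gym: `ToyEnv` and `fake_logits` are hypothetical stand-ins for the real environment and for the `sess.run(model.actions_logits, ...)` call, and the `step` method follows the same 4-tuple return used in this notebook.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in for a Gym environment (4-tuple step API)."""
    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4)

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_steps
        # Reward of 1 per step, like Cart-Pole.
        return np.full(4, float(self.t)), 1.0, done, {}

def evaluate(env, logits_fn, n_episodes=3):
    """Run the greedy policy and return the total reward of each episode."""
    totals = []
    for _ in range(n_episodes):
        state, total, done = env.reset(), 0.0, False
        while not done:
            # Greedy action: index of the largest logit.
            action = int(np.argmax(logits_fn(state.reshape([1, -1]))))
            state, reward, done, _ = env.step(action)
            total += reward
        totals.append(total)
    return totals

# Made-up scoring function standing in for the trained network.
fake_logits = lambda batch: np.array([[0.1, 0.9]])
totals = evaluate(ToyEnv(), fake_logits)
print(np.mean(totals))  # 5.0
```

In the notebook itself, `logits_fn` would wrap the session call and `env` would be the Gym environment created above.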


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated, like Pong or Space Invaders. There, instead of a small state vector like the one we're using here, you'd use convolutional layers to extract the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)
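
Before screen images reach the convolutional layers, they are usually shrunk and several recent frames are stacked so the network can see motion. The sketch below illustrates that idea with NumPy only. The 84x84 size and the 4-frame stack follow the setup I recall from the DQN paper; treat them, the plain channel average used for grayscale, the crude index-based downsampling, and the names `preprocess` and `FrameStack` as assumptions made for illustration.

```python
import numpy as np
from collections import deque

def preprocess(frame, size=84):
    """Convert an RGB frame (H, W, 3) to a size x size grayscale image in [0, 1]."""
    gray = frame.astype(np.float32).mean(axis=2) / 255.0
    # Crude downsampling: keep evenly spaced rows and columns.
    rows = np.linspace(0, gray.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, size).astype(int)
    return gray[np.ix_(rows, cols)]

class FrameStack:
    """Keep the last k preprocessed frames as one (size, size, k) state."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # Fill the stack with copies of the first frame of the episode.
        first = preprocess(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(first)
        return self.state()

    def push(self, frame):
        # Append the newest frame; the oldest one is dropped automatically.
        self.frames.append(preprocess(frame))
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=-1)

# Random pixels standing in for an Atari screen (210 x 160 RGB).
rng = np.random.default_rng(0)
screen = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
stack = FrameStack()
state = stack.reset(screen)
print(state.shape)  # (84, 84, 4)
```

The stacked array would take the place of the state vector fed to `model.states`, with the first dense layers of the network swapped for convolutional ones.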

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, "Human-level control through deep reinforcement learning" (Mnih et al., Nature, 2015), which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.