# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```
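
If you switch between machines, the choice from the list above can be made in code. The following is a minimal sketch, assuming the folder layout shown in the list; the helper name `reacher_path` and the `base_dir` argument are hypothetical and not part of the project files.

```python
import platform

# Hypothetical helper: folder and binary names mirror the list above,
# but the function itself and `base_dir` are assumptions for illustration.
def reacher_path(base_dir, system=None, machine=None, headless=False):
    system = system or platform.system()      # 'Darwin', 'Windows' or 'Linux'
    machine = machine or platform.machine()   # e.g. 'x86_64', 'AMD64', 'i686'
    is_64bit = machine.lower() in ("x86_64", "amd64")
    if system == "Darwin":
        return f"{base_dir}/Reacher.app"
    if system == "Windows":
        folder = "Reacher_Windows_x86_64" if is_64bit else "Reacher_Windows_x86"
        return f"{base_dir}/{folder}/Reacher.exe"
    if system == "Linux":
        folder = "Reacher_Linux_NoVis" if headless else "Reacher_Linux"
        binary = "Reacher.x86_64" if is_64bit else "Reacher.x86"
        return f"{base_dir}/{folder}/{binary}"
    raise ValueError(f"Unsupported platform: {system}")

print(reacher_path("path/to", system="Linux", machine="x86_64", headless=True))
```

The returned string can then be passed as `file_name` to `UnityEnvironment`.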

In [2]:
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_v1/Reacher.x86_64')
env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_OneAgent/Reacher_Linux/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.  Each action is a vector with four numbers, corresponding to torque applicable to two joints.  Every entry in the action vector must be a number between `-1` and `1`.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726671e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]
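
Because every entry of an action must lie between `-1` and `1`, unbounded values (Gaussian samples or raw network outputs) should be bounded before they are sent to the environment. The sketch below compares two common options using only NumPy: hard clipping, and `tanh` squashing. The `tanh` variant is an assumption for illustration and is not used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for unbounded outputs with shape (num_agents, action_size) = (1, 4)
raw = 2.0 * rng.standard_normal((1, 4))

clipped = np.clip(raw, -1.0, 1.0)   # hard clipping: values outside the range saturate at the bounds
squashed = np.tanh(raw)             # smooth squashing (assumption, not part of this notebook)

print('raw     :', raw)
print('clipped :', clipped)
print('squashed:', squashed)
```

Clipping leaves in-range values untouched, while `tanh` compresses every value but stays differentiable, which can matter if the bound is applied inside a network.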


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance as it selects an action at random at each time step.  A window should pop up that allows you to observe the agent as it moves through the environment.  

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.0


When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
# Testing the train mode
env_info = env.reset(train_mode=True)[brain_name]     # reset the environment
state = env_info.vector_observations[0]               # get the current state (first agent)
#scores = np.zeros(num_agents)                        # initialize the score (for each agent)
while True:
    action = np.random.randn(num_agents, action_size) # select a random action (for each agent)
    #action = np.clip(action, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(action)[brain_name]           # send the action to the environment
    next_state = env_info.vector_observations[0]      # get next state (first agent)
    reward = env_info.rewards[0]                      # get reward (first agent)
    done = env_info.local_done[0]                     # see if episode finished
    #scores += env_info.rewards                       # update the score (for each agent)
    state = next_state                                # roll over state to next time step
    if done:                                          # exit loop if episode finished
        print(action.shape, reward)
        print(done)
        break
# print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

(1, 4) 0.0
True


## Option 1: Solve the First Version
The task is episodic, and in order to solve the environment, your agent must get an average score of +30 over 100 consecutive episodes.
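
The solve criterion can be checked with a fixed-size window of recent episode scores. The following is a minimal sketch; the function name `episodes_to_solve` and the synthetic scores are assumptions for illustration. Unlike the check in the training loop later in this notebook, it waits until a full window of 100 episodes is available before testing the average.

```python
from collections import deque
import numpy as np

def episodes_to_solve(scores, target=30.0, window=100):
    """Return the index of the first episode at which the mean of the last
    `window` scores reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)          # keeps only the most recent `window` scores
    for i, score in enumerate(scores):
        recent.append(score)
        if len(recent) == window and np.mean(recent) >= target:
            return i
    return None

# Synthetic scores for illustration only: a linear ramp from 0 to 40.
fake_scores = np.linspace(0.0, 40.0, 400)
print('Solved at episode index:', episodes_to_solve(fake_scores))
```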

In [8]:
# Import TensorFlow and check whether a GPU is available (otherwise the CPU is used)
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [9]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.float32, [None, action_size], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    return states, actions, targetQs

In [10]:
# Generator/Controller: Generating/predicting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits
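
The expression `tf.maximum(alpha * bn, bn)` used after each batch-normalization layer acts as a leaky ReLU when `0 < alpha < 1`: positive inputs pass through unchanged and negative inputs are scaled by `alpha`. A small NumPy sketch of the same operation (the function name `leaky_relu` is chosen here for illustration):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # For 0 < alpha < 1: x > 0 gives max(alpha*x, x) = x, and x < 0 gives alpha*x
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))   # negatives are scaled by 0.1, non-negatives are unchanged
```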

In [11]:
# Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [12]:
def model_loss(action_size, hidden_size, states, actions, targetQs):
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    neg_log_prob_actions = tf.nn.sigmoid_cross_entropy_with_logits(logits=actions_logits, 
                                                                   labels=tf.nn.sigmoid(actions))
    Qs = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    #Qs = discriminator(actions=actions, hidden_size=hidden_size, states=states)
    g_loss = tf.reduce_mean(neg_log_prob_actions * Qs)
    g_loss1 = tf.reduce_mean(neg_log_prob_actions)
    g_loss2 = tf.reduce_mean(Qs)
    d_loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, Qs, g_loss, d_loss, g_loss1, g_loss2
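
The `d_loss` term regresses `Qs` toward `targetQs`, which the training loop later in this notebook builds as the reward plus the discounted next-state value, with the next-state value zeroed on terminal transitions. A minimal NumPy sketch of that computation on made-up numbers (all arrays below are illustrative, not taken from a run):

```python
import numpy as np

gamma = 0.99                              # discount factor, as in the hyperparameters
rewards = np.array([0.0, 0.1, 0.1])       # made-up batch of rewards
next_qs = np.array([0.5, 0.2, 0.8])       # made-up critic outputs for the next states
dones   = np.array([0.0, 0.0, 1.0])       # 1.0 marks a terminal transition

# One-step TD target: r + gamma * Q(s', a'), with no bootstrap on terminal steps
td_targets = rewards + gamma * next_qs * (1.0 - dones)

qs = np.array([0.4, 0.3, 0.2])            # made-up critic outputs for the current states
critic_loss = np.mean(np.square(qs - td_targets))   # mean squared TD error

print(td_targets)
print(critic_loss)
```

For the terminal transition the target reduces to the reward alone, so the critic is never asked to predict value beyond the end of an episode.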

In [13]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [20]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs = model_input(state_size=state_size, action_size=action_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_loss, self.g_loss1, self.g_loss2 = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, targetQs=self.targetQs) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [21]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
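
A short usage sketch of the buffer follows; the class is repeated so that the snippet runs on its own, and the transitions are dummy data. It illustrates two properties worth knowing: the `deque` silently drops the oldest entries once `max_size` is reached, and sampling without replacement raises a `ValueError` if `batch_size` exceeds the number of stored transitions.

```python
from collections import deque
import numpy as np

class Memory:
    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)),
                               size=batch_size,
                               replace=False)
        return [self.buffer[ii] for ii in idx]

memory = Memory(max_size=5)
for t in range(8):
    # Dummy transition: [state, action, next_state, reward, done]
    memory.buffer.append([np.full(33, t), np.zeros(4), np.full(33, t + 1), 0.0, 0.0])

print(len(memory.buffer))                 # capped at max_size; the oldest entries were dropped

batch = memory.sample(3)
states = np.array([each[0] for each in batch])
print(states.shape)                       # (batch_size, state_size)

try:
    memory.sample(6)                      # more samples than stored transitions
except ValueError as err:
    print('cannot sample:', err)
```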

In [22]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1, 33) actions:(1, 4)
action size:2.90378975973461


In [23]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 33
action_size = 4
hidden_size = 33*2             # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
gamma = 0.99                   # future reward discount
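
The three exploration parameters define an exponentially decaying probability, using the formula that appears in the training loop later in this notebook. The sketch below evaluates it at a few step counts; the function name `explore_prob` is chosen here for illustration.

```python
import numpy as np

explore_start, explore_stop, decay_rate = 1.0, 0.01, 0.0001

def explore_prob(total_step):
    # Decays from explore_start toward explore_stop as total_step grows
    return explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step)

for step in (0, 1001, 10000, 100000):
    print(step, round(float(explore_prob(step)), 4))
```

The value at step 1001 rounds to `0.9057`, the `exploreP` printed for episode 0 in the training log, which is consistent with episodes of 1001 steps. Note that in the training loop the exploration branch is commented out, so this value is logged but does not affect action selection.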

In [24]:
# Reset/init the graph/session
tf.reset_default_graph()        # returns None, so fetch the graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [25]:
# Pre-fill the memory buffer with random-action transitions
env_info = env.reset(train_mode=True)[brain_name]     # reset the environment
state = env_info.vector_observations[0]               # get the current state (first agent)
for _ in range(memory_size):
    action = np.random.randn(num_agents, action_size) # select a random action (for each agent)
    #action = np.clip(action, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(action)[brain_name]           # send the action to the environment
    next_state = env_info.vector_observations[0]      # get next state (first agent)
    reward = env_info.rewards[0]                      # get reward (first agent)
    done = env_info.local_done[0]                     # see if episode finished
    memory.buffer.append([state, action.reshape([-1]), next_state, reward, float(done)])
    #print(state.shape, action.reshape([-1]).shape, reward, float(done))
    state = next_state                                # roll over state to next time step
    if done:                                          # stop pre-filling after the first finished episode
        print(done)
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment
        state = env_info.vector_observations[0]               # get the current state (first agent)
        break

True


In [26]:
# len(memory.buffer), memory.buffer[100]

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list, gloss_list, dloss_list = [], [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        gloss_batch, dloss_batch = [], []
        gloss1_batch, gloss2_batch = [], []
        #state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        state = env_info.vector_observations[0]                  # get the current state (for each agent)

        # Training steps/batches
        while True:
            # Explore (Env) or Exploit (Model): NO
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            # if explore_p > np.random.rand():
            #     #action = env.action_space.sample()
            #     action = np.random.randn(num_agents, action_size) # select an action (for each agent)
            # else:
            action = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
            #print(action.shape)
            #action = np.reshape(action_logits, [-1]) # For continuous action space
            #action = np.argmax(action_logits) # For discrete action space
            #next_state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]           # send the action to the environment
            next_state = env_info.vector_observations[0]         # get next state (for each agent)
            reward = env_info.rewards[0]                         # get reward (for each agent)
            done = env_info.local_done[0]                        # see if episode finished
            memory.buffer.append([state, action.reshape([-1]), next_state, reward, float(done)])
            total_reward += reward
            state = next_state

            # Training
            batch = memory.sample(batch_size)
            #batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            #print(actions.shape, actions.dtype)
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            nextQs_logits = sess.run(model.Qs_logits, feed_dict = {model.states: next_states})
            #nextQs = np.max(nextQs_logits, axis=1) * (1-dones)
            nextQs = nextQs_logits.reshape([-1]) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            #print(nextQs_logits.shape, targetQs.shape)
            gloss, dloss, gloss1, gloss2, _, _ = sess.run([model.g_loss, model.d_loss, 
                                                           model.g_loss1, model.g_loss2,
                                                           model.g_opt, model.d_opt],
                                                          feed_dict = {model.states: states, 
                                                                       model.actions: actions,
                                                                       model.targetQs: targetQs})
            gloss_batch.append(gloss)
            dloss_batch.append(dloss)
            gloss1_batch.append(gloss1)
            gloss2_batch.append(gloss2)
            if done:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dloss:{:.4f}'.format(np.mean(dloss_batch)),
                # g_loss1 = tf.reduce_mean(neg_log_prob_actions)
                # g_loss2 = tf.reduce_mean(Qs)
              'gloss1-As:{:.4f}'.format(np.mean(gloss1_batch)),
              'gloss2-Qs:{:.4f}'.format(np.mean(gloss2_batch)),
              'exploreP:{:.4f}'.format(explore_p))
        # Store values for plotting
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        # Break episode/epoch loop
        ## Option 1: Solve the First Version
        #The task is episodic, and in order to solve the environment, 
        #your agent must get an average score of +30 over 100 consecutive episodes.        
        if np.mean(episode_reward) >= +30:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:0.5000 R:0.5000 gloss:-0.0060 dloss:0.1196 gloss1-As:1.6245 gloss2-Qs:0.0216 exploreP:0.9057
Episode:1 meanR:0.4450 R:0.3900 gloss:-0.0767 dloss:0.0343 gloss1-As:2.0516 gloss2-Qs:-0.0346 exploreP:0.8204
Episode:2 meanR:0.4467 R:0.4500 gloss:0.0787 dloss:0.0272 gloss1-As:2.5895 gloss2-Qs:0.0337 exploreP:0.7432
Episode:3 meanR:0.4125 R:0.3100 gloss:0.0825 dloss:0.0175 gloss1-As:2.0652 gloss2-Qs:0.0450 exploreP:0.6734
Episode:4 meanR:0.4260 R:0.4800 gloss:0.0464 dloss:0.0130 gloss1-As:1.4764 gloss2-Qs:0.0327 exploreP:0.6102
Episode:5 meanR:0.4117 R:0.3400 gloss:0.0109 dloss:0.0100 gloss1-As:1.3720 gloss2-Qs:0.0155 exploreP:0.5530
Episode:6 meanR:0.4286 R:0.5300 gloss:0.0176 dloss:0.0087 gloss1-As:1.1610 gloss2-Qs:0.0215 exploreP:0.5013
Episode:7 meanR:0.4587 R:0.6700 gloss:-0.0107 dloss:0.0075 gloss1-As:1.2199 gloss2-Qs:-0.0034 exploreP:0.4545
Episode:8 meanR:0.5244 R:1.0500 gloss:0.0079 dloss:0.0089 gloss1-As:1.9174 gloss2-Qs:0.0065 exploreP:0.4121
Episode:9 meanR:0.5440 

Episode:76 meanR:0.9483 R:0.7100 gloss:0.0076 dloss:0.0025 gloss1-As:0.1300 gloss2-Qs:0.0581 exploreP:0.0104
Episode:77 meanR:0.9605 R:1.9000 gloss:0.0076 dloss:0.0018 gloss1-As:0.1258 gloss2-Qs:0.0605 exploreP:0.0104
Episode:78 meanR:0.9716 R:1.8400 gloss:0.0100 dloss:0.0035 gloss1-As:0.1281 gloss2-Qs:0.0768 exploreP:0.0104
Episode:79 meanR:0.9675 R:0.6400 gloss:0.0102 dloss:0.0012 gloss1-As:0.1258 gloss2-Qs:0.0814 exploreP:0.0103
Episode:80 meanR:0.9622 R:0.5400 gloss:0.0093 dloss:0.0029 gloss1-As:0.1254 gloss2-Qs:0.0730 exploreP:0.0103
Episode:81 meanR:0.9705 R:1.6400 gloss:0.0092 dloss:0.0011 gloss1-As:0.1270 gloss2-Qs:0.0724 exploreP:0.0103
Episode:82 meanR:0.9653 R:0.5400 gloss:0.0127 dloss:0.0019 gloss1-As:0.1295 gloss2-Qs:0.0977 exploreP:0.0102
Episode:83 meanR:0.9565 R:0.2300 gloss:0.0115 dloss:0.0011 gloss1-As:0.1267 gloss2-Qs:0.0906 exploreP:0.0102
Episode:84 meanR:0.9486 R:0.2800 gloss:0.0106 dloss:0.0025 gloss1-As:0.1263 gloss2-Qs:0.0843 exploreP:0.0102
Episode:85 meanR:0.

Episode:151 meanR:1.0269 R:1.0000 gloss:0.0105 dloss:0.0007 gloss1-As:0.1060 gloss2-Qs:0.0990 exploreP:0.0100
Episode:152 meanR:1.0324 R:2.0900 gloss:0.0110 dloss:0.0005 gloss1-As:0.1077 gloss2-Qs:0.1017 exploreP:0.0100
Episode:153 meanR:1.0385 R:1.6300 gloss:0.0110 dloss:0.0004 gloss1-As:0.1073 gloss2-Qs:0.1029 exploreP:0.0100
Episode:154 meanR:1.0361 R:1.6100 gloss:0.0121 dloss:0.0016 gloss1-As:0.1083 gloss2-Qs:0.1109 exploreP:0.0100
Episode:155 meanR:1.0379 R:1.2100 gloss:0.0118 dloss:0.0003 gloss1-As:0.1055 gloss2-Qs:0.1123 exploreP:0.0100
Episode:156 meanR:1.0450 R:1.3500 gloss:0.0114 dloss:0.0002 gloss1-As:0.1050 gloss2-Qs:0.1087 exploreP:0.0100
Episode:157 meanR:1.0435 R:1.1200 gloss:0.0113 dloss:0.0002 gloss1-As:0.1034 gloss2-Qs:0.1098 exploreP:0.0100
Episode:158 meanR:1.0448 R:0.9800 gloss:0.0122 dloss:0.0006 gloss1-As:0.1068 gloss2-Qs:0.1142 exploreP:0.0100
Episode:159 meanR:1.0499 R:0.9000 gloss:0.0124 dloss:0.0001 gloss1-As:0.1058 gloss2-Qs:0.1176 exploreP:0.0100
Episode:16

Episode:226 meanR:1.0666 R:0.2800 gloss:0.0129 dloss:0.0001 gloss1-As:0.1180 gloss2-Qs:0.1093 exploreP:0.0100
Episode:227 meanR:1.0604 R:0.7000 gloss:0.0115 dloss:0.0001 gloss1-As:0.1164 gloss2-Qs:0.0985 exploreP:0.0100
Episode:228 meanR:1.0579 R:1.5100 gloss:0.0117 dloss:0.0001 gloss1-As:0.1157 gloss2-Qs:0.1013 exploreP:0.0100
Episode:229 meanR:1.0573 R:0.8400 gloss:0.0121 dloss:0.0001 gloss1-As:0.1147 gloss2-Qs:0.1056 exploreP:0.0100
Episode:230 meanR:1.0581 R:0.7000 gloss:0.0114 dloss:0.0001 gloss1-As:0.1142 gloss2-Qs:0.1001 exploreP:0.0100
Episode:231 meanR:1.0517 R:0.0700 gloss:0.0105 dloss:0.0001 gloss1-As:0.1145 gloss2-Qs:0.0917 exploreP:0.0100
Episode:232 meanR:1.0489 R:0.4400 gloss:0.0089 dloss:0.0001 gloss1-As:0.1146 gloss2-Qs:0.0778 exploreP:0.0100
Episode:233 meanR:1.0577 R:1.0800 gloss:0.0087 dloss:0.0001 gloss1-As:0.1165 gloss2-Qs:0.0746 exploreP:0.0100
Episode:234 meanR:1.0510 R:0.7800 gloss:0.0085 dloss:0.0000 gloss1-As:0.1176 gloss2-Qs:0.0718 exploreP:0.0100
Episode:23

Episode:301 meanR:0.9426 R:1.5800 gloss:0.0109 dloss:0.0001 gloss1-As:0.1269 gloss2-Qs:0.0860 exploreP:0.0100
Episode:302 meanR:0.9378 R:0.9200 gloss:0.0122 dloss:0.0001 gloss1-As:0.1288 gloss2-Qs:0.0944 exploreP:0.0100
Episode:303 meanR:0.9345 R:0.5300 gloss:0.0123 dloss:0.0001 gloss1-As:0.1300 gloss2-Qs:0.0944 exploreP:0.0100
Episode:304 meanR:0.9355 R:0.9400 gloss:0.0114 dloss:0.0001 gloss1-As:0.1320 gloss2-Qs:0.0867 exploreP:0.0100
Episode:305 meanR:0.9429 R:1.3000 gloss:0.0118 dloss:0.0000 gloss1-As:0.1340 gloss2-Qs:0.0878 exploreP:0.0100
Episode:306 meanR:0.9383 R:0.6700 gloss:0.0114 dloss:0.0001 gloss1-As:0.1340 gloss2-Qs:0.0847 exploreP:0.0100
Episode:307 meanR:0.9446 R:1.0100 gloss:0.0104 dloss:0.0001 gloss1-As:0.1338 gloss2-Qs:0.0775 exploreP:0.0100
Episode:308 meanR:0.9455 R:1.8600 gloss:0.0120 dloss:0.0000 gloss1-As:0.1329 gloss2-Qs:0.0900 exploreP:0.0100
Episode:309 meanR:0.9378 R:1.0100 gloss:0.0130 dloss:0.0001 gloss1-As:0.1357 gloss2-Qs:0.0955 exploreP:0.0100
Episode:31

Episode:376 meanR:0.9001 R:1.4700 gloss:0.0094 dloss:0.0000 gloss1-As:0.1301 gloss2-Qs:0.0726 exploreP:0.0100
Episode:377 meanR:0.9076 R:1.2500 gloss:0.0110 dloss:0.0000 gloss1-As:0.1315 gloss2-Qs:0.0839 exploreP:0.0100
Episode:378 meanR:0.9022 R:0.1600 gloss:0.0111 dloss:0.0000 gloss1-As:0.1340 gloss2-Qs:0.0828 exploreP:0.0100
Episode:379 meanR:0.9040 R:0.5400 gloss:0.0105 dloss:0.0000 gloss1-As:0.1335 gloss2-Qs:0.0787 exploreP:0.0100
Episode:380 meanR:0.9133 R:1.2600 gloss:0.0106 dloss:0.0000 gloss1-As:0.1362 gloss2-Qs:0.0779 exploreP:0.0100
Episode:381 meanR:0.9181 R:0.8100 gloss:0.0109 dloss:0.0000 gloss1-As:0.1369 gloss2-Qs:0.0794 exploreP:0.0100
Episode:382 meanR:0.9170 R:1.9100 gloss:0.0117 dloss:0.0000 gloss1-As:0.1381 gloss2-Qs:0.0844 exploreP:0.0100
Episode:383 meanR:0.9135 R:0.8200 gloss:0.0115 dloss:0.0000 gloss1-As:0.1355 gloss2-Qs:0.0847 exploreP:0.0100
Episode:384 meanR:0.9206 R:1.1000 gloss:0.0127 dloss:0.0001 gloss1-As:0.1363 gloss2-Qs:0.0935 exploreP:0.0100
Episode:38

Episode:451 meanR:0.8975 R:1.2300 gloss:0.0090 dloss:0.0000 gloss1-As:0.1266 gloss2-Qs:0.0711 exploreP:0.0100
Episode:452 meanR:0.9022 R:1.1000 gloss:0.0110 dloss:0.0000 gloss1-As:0.1277 gloss2-Qs:0.0858 exploreP:0.0100
Episode:453 meanR:0.9077 R:0.7200 gloss:0.0114 dloss:0.0000 gloss1-As:0.1292 gloss2-Qs:0.0883 exploreP:0.0100
Episode:454 meanR:0.9013 R:1.2200 gloss:0.0105 dloss:0.0000 gloss1-As:0.1277 gloss2-Qs:0.0822 exploreP:0.0100
Episode:455 meanR:0.9001 R:0.7600 gloss:0.0105 dloss:0.0000 gloss1-As:0.1291 gloss2-Qs:0.0816 exploreP:0.0100
Episode:456 meanR:0.9065 R:1.5500 gloss:0.0122 dloss:0.0000 gloss1-As:0.1330 gloss2-Qs:0.0915 exploreP:0.0100
Episode:457 meanR:0.9162 R:1.2700 gloss:0.0130 dloss:0.0001 gloss1-As:0.1313 gloss2-Qs:0.0992 exploreP:0.0100
Episode:458 meanR:0.9101 R:0.6100 gloss:0.0141 dloss:0.0001 gloss1-As:0.1335 gloss2-Qs:0.1055 exploreP:0.0100
Episode:459 meanR:0.9042 R:0.3700 gloss:0.0136 dloss:0.0001 gloss1-As:0.1350 gloss2-Qs:0.1008 exploreP:0.0100
Episode:46

Episode:526 meanR:0.9550 R:0.3800 gloss:0.0096 dloss:0.0000 gloss1-As:0.1323 gloss2-Qs:0.0725 exploreP:0.0100
Episode:527 meanR:0.9494 R:0.3300 gloss:0.0092 dloss:0.0000 gloss1-As:0.1301 gloss2-Qs:0.0708 exploreP:0.0100
Episode:528 meanR:0.9564 R:1.1600 gloss:0.0095 dloss:0.0000 gloss1-As:0.1308 gloss2-Qs:0.0723 exploreP:0.0100
Episode:529 meanR:0.9547 R:0.3300 gloss:0.0090 dloss:0.0000 gloss1-As:0.1317 gloss2-Qs:0.0685 exploreP:0.0100
Episode:530 meanR:0.9472 R:0.6100 gloss:0.0090 dloss:0.0000 gloss1-As:0.1323 gloss2-Qs:0.0678 exploreP:0.0100
Episode:531 meanR:0.9478 R:0.9000 gloss:0.0084 dloss:0.0000 gloss1-As:0.1347 gloss2-Qs:0.0622 exploreP:0.0100
Episode:532 meanR:0.9583 R:1.3400 gloss:0.0090 dloss:0.0000 gloss1-As:0.1338 gloss2-Qs:0.0673 exploreP:0.0100
Episode:533 meanR:0.9517 R:0.9500 gloss:0.0104 dloss:0.0000 gloss1-As:0.1347 gloss2-Qs:0.0774 exploreP:0.0100
Episode:534 meanR:0.9504 R:0.4400 gloss:0.0097 dloss:0.0000 gloss1-As:0.1336 gloss2-Qs:0.0728 exploreP:0.0100
Episode:53

Episode:601 meanR:0.9494 R:1.1600 gloss:0.0134 dloss:0.0000 gloss1-As:0.1295 gloss2-Qs:0.1038 exploreP:0.0100
Episode:602 meanR:0.9451 R:0.3600 gloss:0.0125 dloss:0.0000 gloss1-As:0.1297 gloss2-Qs:0.0961 exploreP:0.0100
Episode:603 meanR:0.9379 R:0.7900 gloss:0.0114 dloss:0.0000 gloss1-As:0.1291 gloss2-Qs:0.0883 exploreP:0.0100
Episode:604 meanR:0.9404 R:0.8600 gloss:0.0114 dloss:0.0000 gloss1-As:0.1279 gloss2-Qs:0.0889 exploreP:0.0100
Episode:605 meanR:0.9496 R:1.4600 gloss:0.0112 dloss:0.0000 gloss1-As:0.1290 gloss2-Qs:0.0870 exploreP:0.0100
Episode:606 meanR:0.9573 R:0.7700 gloss:0.0121 dloss:0.0000 gloss1-As:0.1309 gloss2-Qs:0.0924 exploreP:0.0100
Episode:607 meanR:0.9665 R:1.6200 gloss:0.0132 dloss:0.0001 gloss1-As:0.1306 gloss2-Qs:0.1011 exploreP:0.0100
Episode:608 meanR:0.9584 R:0.7900 gloss:0.0129 dloss:0.0000 gloss1-As:0.1340 gloss2-Qs:0.0965 exploreP:0.0100
Episode:609 meanR:0.9554 R:1.1500 gloss:0.0129 dloss:0.0000 gloss1-As:0.1325 gloss2-Qs:0.0977 exploreP:0.0100
Episode:61

Episode:676 meanR:0.9603 R:0.9200 gloss:0.0124 dloss:0.0000 gloss1-As:0.1344 gloss2-Qs:0.0925 exploreP:0.0100
Episode:677 meanR:0.9577 R:0.5500 gloss:0.0131 dloss:0.0001 gloss1-As:0.1330 gloss2-Qs:0.0988 exploreP:0.0100
Episode:678 meanR:0.9612 R:0.9600 gloss:0.0138 dloss:0.0000 gloss1-As:0.1337 gloss2-Qs:0.1033 exploreP:0.0100
Episode:679 meanR:0.9562 R:0.6200 gloss:0.0126 dloss:0.0000 gloss1-As:0.1305 gloss2-Qs:0.0967 exploreP:0.0100
Episode:680 meanR:0.9539 R:0.7900 gloss:0.0120 dloss:0.0000 gloss1-As:0.1315 gloss2-Qs:0.0915 exploreP:0.0100
Episode:681 meanR:0.9552 R:0.5500 gloss:0.0116 dloss:0.0000 gloss1-As:0.1310 gloss2-Qs:0.0889 exploreP:0.0100
Episode:682 meanR:0.9383 R:1.1000 gloss:0.0121 dloss:0.0000 gloss1-As:0.1307 gloss2-Qs:0.0923 exploreP:0.0100
Episode:683 meanR:0.9412 R:1.0600 gloss:0.0120 dloss:0.0000 gloss1-As:0.1325 gloss2-Qs:0.0909 exploreP:0.0100
Episode:684 meanR:0.9469 R:1.1200 gloss:0.0120 dloss:0.0000 gloss1-As:0.1327 gloss2-Qs:0.0902 exploreP:0.0100
Episode:68

Episode:751 meanR:0.9609 R:1.2100 gloss:0.0130 dloss:0.0000 gloss1-As:0.1360 gloss2-Qs:0.0957 exploreP:0.0100
Episode:752 meanR:0.9653 R:1.3900 gloss:0.0132 dloss:0.0000 gloss1-As:0.1348 gloss2-Qs:0.0982 exploreP:0.0100
Episode:753 meanR:0.9700 R:0.8400 gloss:0.0140 dloss:0.0001 gloss1-As:0.1373 gloss2-Qs:0.1017 exploreP:0.0100
Episode:754 meanR:0.9704 R:0.9300 gloss:0.0142 dloss:0.0001 gloss1-As:0.1365 gloss2-Qs:0.1043 exploreP:0.0100
Episode:755 meanR:0.9670 R:0.8000 gloss:0.0135 dloss:0.0001 gloss1-As:0.1381 gloss2-Qs:0.0981 exploreP:0.0100
Episode:756 meanR:0.9717 R:0.9100 gloss:0.0140 dloss:0.0000 gloss1-As:0.1361 gloss2-Qs:0.1032 exploreP:0.0100
Episode:757 meanR:0.9735 R:0.9900 gloss:0.0135 dloss:0.0001 gloss1-As:0.1367 gloss2-Qs:0.0986 exploreP:0.0100
Episode:758 meanR:0.9644 R:0.7000 gloss:0.0133 dloss:0.0000 gloss1-As:0.1342 gloss2-Qs:0.0992 exploreP:0.0100
Episode:759 meanR:0.9691 R:1.3000 gloss:0.0122 dloss:0.0000 gloss1-As:0.1366 gloss2-Qs:0.0895 exploreP:0.0100
...
Episode:826 meanR:0.9182 R:0.4700 gloss:0.0094 dloss:0.0000 gloss1-As:0.1364 gloss2-Qs:0.0688 exploreP:0.0100
Episode:827 meanR:0.9326 R:1.7500 gloss:0.0094 dloss:0.0000 gloss1-As:0.1331 gloss2-Qs:0.0708 exploreP:0.0100
Episode:828 meanR:0.9305 R:0.7800 gloss:0.0103 dloss:0.0000 gloss1-As:0.1351 gloss2-Qs:0.0760 exploreP:0.0100
Episode:829 meanR:0.9234 R:0.4700 gloss:0.0107 dloss:0.0000 gloss1-As:0.1328 gloss2-Qs:0.0803 exploreP:0.0100
Episode:830 meanR:0.9160 R:0.9900 gloss:0.0104 dloss:0.0000 gloss1-As:0.1346 gloss2-Qs:0.0772 exploreP:0.0100
Episode:831 meanR:0.9173 R:0.7500 gloss:0.0107 dloss:0.0000 gloss1-As:0.1319 gloss2-Qs:0.0808 exploreP:0.0100
Episode:832 meanR:0.9250 R:1.3200 gloss:0.0108 dloss:0.0000 gloss1-As:0.1319 gloss2-Qs:0.0822 exploreP:0.0100
Episode:833 meanR:0.9240 R:0.4700 gloss:0.0108 dloss:0.0000 gloss1-As:0.1314 gloss2-Qs:0.0824 exploreP:0.0100
Episode:834 meanR:0.9501 R:3.2700 gloss:0.0110 dloss:0.0000 gloss1-As:0.1299 gloss2-Qs:0.0845 exploreP:0.0100
...
Episode:901 meanR:1.0185 R:0.6500 gloss:0.0135 dloss:0.0000 gloss1-As:0.1368 gloss2-Qs:0.0989 exploreP:0.0100
Episode:902 meanR:1.0000 R:0.4300 gloss:0.0130 dloss:0.0001 gloss1-As:0.1334 gloss2-Qs:0.0974 exploreP:0.0100
Episode:903 meanR:1.0091 R:1.5700 gloss:0.0122 dloss:0.0000 gloss1-As:0.1331 gloss2-Qs:0.0916 exploreP:0.0100
Episode:904 meanR:1.0134 R:0.9900 gloss:0.0117 dloss:0.0000 gloss1-As:0.1340 gloss2-Qs:0.0876 exploreP:0.0100
Episode:905 meanR:1.0152 R:0.4800 gloss:0.0111 dloss:0.0000 gloss1-As:0.1360 gloss2-Qs:0.0813 exploreP:0.0100
Episode:906 meanR:1.0189 R:0.8300 gloss:0.0111 dloss:0.0000 gloss1-As:0.1351 gloss2-Qs:0.0823 exploreP:0.0100
Episode:907 meanR:1.0062 R:0.5000 gloss:0.0116 dloss:0.0000 gloss1-As:0.1357 gloss2-Qs:0.0854 exploreP:0.0100
Episode:908 meanR:1.0153 R:1.6500 gloss:0.0102 dloss:0.0000 gloss1-As:0.1342 gloss2-Qs:0.0758 exploreP:0.0100
Episode:909 meanR:1.0122 R:1.1400 gloss:0.0110 dloss:0.0000 gloss1-As:0.1355 gloss2-Qs:0.0812 exploreP:0.0100
...
Episode:976 meanR:1.0496 R:1.9500 gloss:0.0106 dloss:0.0000 gloss1-As:0.1354 gloss2-Qs:0.0781 exploreP:0.0100
Episode:977 meanR:1.0386 R:0.5800 gloss:0.0106 dloss:0.0000 gloss1-As:0.1334 gloss2-Qs:0.0794 exploreP:0.0100
Episode:978 meanR:1.0372 R:0.8800 gloss:0.0111 dloss:0.0000 gloss1-As:0.1337 gloss2-Qs:0.0832 exploreP:0.0100
Episode:979 meanR:1.0305 R:0.4500 gloss:0.0105 dloss:0.0000 gloss1-As:0.1341 gloss2-Qs:0.0781 exploreP:0.0100
Episode:980 meanR:1.0208 R:0.5000 gloss:0.0105 dloss:0.0000 gloss1-As:0.1366 gloss2-Qs:0.0769 exploreP:0.0100
Episode:981 meanR:1.0308 R:2.5100 gloss:0.0109 dloss:0.0000 gloss1-As:0.1351 gloss2-Qs:0.0805 exploreP:0.0100
Episode:982 meanR:1.0351 R:1.6100 gloss:0.0129 dloss:0.0000 gloss1-As:0.1376 gloss2-Qs:0.0938 exploreP:0.0100
Episode:983 meanR:1.0279 R:0.2800 gloss:0.0130 dloss:0.0000 gloss1-As:0.1372 gloss2-Qs:0.0949 exploreP:0.0100
Episode:984 meanR:1.0369 R:1.5300 gloss:0.0117 dloss:0.0000 gloss1-As:0.1343 gloss2-Qs:0.0873 exploreP:0.0100
...
Episode:1051 meanR:1.0493 R:1.2900 gloss:0.0118 dloss:0.0000 gloss1-As:0.1360 gloss2-Qs:0.0867 exploreP:0.0100
Episode:1052 meanR:1.0412 R:0.9400 gloss:0.0116 dloss:0.0000 gloss1-As:0.1354 gloss2-Qs:0.0860 exploreP:0.0100
Episode:1053 meanR:1.0506 R:1.4000 gloss:0.0125 dloss:0.0000 gloss1-As:0.1357 gloss2-Qs:0.0921 exploreP:0.0100
Episode:1054 meanR:1.0542 R:0.9000 gloss:0.0126 dloss:0.0000 gloss1-As:0.1377 gloss2-Qs:0.0914 exploreP:0.0100
Episode:1055 meanR:1.0530 R:0.7800 gloss:0.0116 dloss:0.0000 gloss1-As:0.1375 gloss2-Qs:0.0840 exploreP:0.0100
Episode:1056 meanR:1.0505 R:1.6500 gloss:0.0119 dloss:0.0000 gloss1-As:0.1382 gloss2-Qs:0.0861 exploreP:0.0100
Episode:1057 meanR:1.0364 R:0.4400 gloss:0.0118 dloss:0.0000 gloss1-As:0.1375 gloss2-Qs:0.0861 exploreP:0.0100
Episode:1058 meanR:1.0332 R:0.9100 gloss:0.0117 dloss:0.0000 gloss1-As:0.1366 gloss2-Qs:0.0857 exploreP:0.0100
Episode:1059 meanR:1.0352 R:0.9100 gloss:0.0123 dloss:0.0000 gloss1-As:0.1351 gloss2-Qs:0.0910 exploreP:0.0100
...
Episode:1125 meanR:1.0348 R:0.3900 gloss:0.0148 dloss:0.0001 gloss1-As:0.1424 gloss2-Qs:0.1041 exploreP:0.0100
Episode:1126 meanR:1.0290 R:0.7300 gloss:0.0141 dloss:0.0000 gloss1-As:0.1393 gloss2-Qs:0.1012 exploreP:0.0100
Episode:1127 meanR:1.0371 R:1.0500 gloss:0.0129 dloss:0.0000 gloss1-As:0.1385 gloss2-Qs:0.0932 exploreP:0.0100
Episode:1128 meanR:1.0475 R:1.6800 gloss:0.0117 dloss:0.0000 gloss1-As:0.1380 gloss2-Qs:0.0849 exploreP:0.0100
Episode:1129 meanR:1.0447 R:0.2600 gloss:0.0114 dloss:0.0000 gloss1-As:0.1385 gloss2-Qs:0.0825 exploreP:0.0100
Episode:1130 meanR:1.0356 R:0.4000 gloss:0.0118 dloss:0.0000 gloss1-As:0.1406 gloss2-Qs:0.0837 exploreP:0.0100
Episode:1131 meanR:1.0349 R:0.9900 gloss:0.0111 dloss:0.0000 gloss1-As:0.1381 gloss2-Qs:0.0801 exploreP:0.0100
Episode:1132 meanR:1.0341 R:0.6100 gloss:0.0102 dloss:0.0000 gloss1-As:0.1382 gloss2-Qs:0.0739 exploreP:0.0100
Episode:1133 meanR:1.0582 R:2.5500 gloss:0.0110 dloss:0.0000 gloss1-As:0.1373 gloss2-Qs:0.0801 exploreP:0.0100
...
Episode:1199 meanR:1.0177 R:0.6300 gloss:0.0108 dloss:0.0000 gloss1-As:0.1329 gloss2-Qs:0.0810 exploreP:0.0100
Episode:1200 meanR:1.0161 R:0.6000 gloss:0.0098 dloss:0.0000 gloss1-As:0.1341 gloss2-Qs:0.0731 exploreP:0.0100
Episode:1201 meanR:1.0154 R:1.4300 gloss:0.0094 dloss:0.0000 gloss1-As:0.1370 gloss2-Qs:0.0683 exploreP:0.0100
Episode:1202 meanR:1.0181 R:0.7800 gloss:0.0103 dloss:0.0000 gloss1-As:0.1357 gloss2-Qs:0.0758 exploreP:0.0100
Episode:1203 meanR:1.0173 R:0.8400 gloss:0.0090 dloss:0.0000 gloss1-As:0.1365 gloss2-Qs:0.0659 exploreP:0.0100
Episode:1204 meanR:1.0154 R:0.5100 gloss:0.0097 dloss:0.0000 gloss1-As:0.1388 gloss2-Qs:0.0701 exploreP:0.0100
Episode:1205 meanR:1.0022 R:0.5700 gloss:0.0090 dloss:0.0000 gloss1-As:0.1380 gloss2-Qs:0.0649 exploreP:0.0100
Episode:1206 meanR:1.0004 R:0.2300 gloss:0.0090 dloss:0.0000 gloss1-As:0.1392 gloss2-Qs:0.0648 exploreP:0.0100
Episode:1207 meanR:1.0082 R:1.0000 gloss:0.0091 dloss:0.0000 gloss1-As:0.1366 gloss2-Qs:0.0667 exploreP:0.0100
...
Episode:1273 meanR:1.0106 R:1.4000 gloss:0.0127 dloss:0.0000 gloss1-As:0.1344 gloss2-Qs:0.0946 exploreP:0.0100
Episode:1274 meanR:1.0131 R:0.8900 gloss:0.0126 dloss:0.0000 gloss1-As:0.1354 gloss2-Qs:0.0932 exploreP:0.0100
Episode:1275 meanR:0.9935 R:0.4400 gloss:0.0118 dloss:0.0000 gloss1-As:0.1339 gloss2-Qs:0.0881 exploreP:0.0100
Episode:1276 meanR:1.0009 R:1.2900 gloss:0.0118 dloss:0.0000 gloss1-As:0.1351 gloss2-Qs:0.0875 exploreP:0.0100
Episode:1277 meanR:1.0067 R:1.1100 gloss:0.0109 dloss:0.0000 gloss1-As:0.1343 gloss2-Qs:0.0811 exploreP:0.0100
Episode:1278 meanR:1.0048 R:0.3300 gloss:0.0098 dloss:0.0000 gloss1-As:0.1318 gloss2-Qs:0.0747 exploreP:0.0100
Episode:1279 meanR:0.9990 R:0.9600 gloss:0.0101 dloss:0.0000 gloss1-As:0.1308 gloss2-Qs:0.0774 exploreP:0.0100
Episode:1280 meanR:0.9897 R:0.5300 gloss:0.0116 dloss:0.0000 gloss1-As:0.1347 gloss2-Qs:0.0860 exploreP:0.0100
Episode:1281 meanR:0.9892 R:0.9000 gloss:0.0110 dloss:0.0000 gloss1-As:0.1326 gloss2-Qs:0.0832 exploreP:0.0100
...
Episode:1347 meanR:0.9808 R:1.7400 gloss:0.0109 dloss:0.0000 gloss1-As:0.1421 gloss2-Qs:0.0766 exploreP:0.0100
Episode:1348 meanR:0.9766 R:0.6500 gloss:0.0121 dloss:0.0000 gloss1-As:0.1433 gloss2-Qs:0.0845 exploreP:0.0100
Episode:1349 meanR:0.9835 R:1.1600 gloss:0.0114 dloss:0.0000 gloss1-As:0.1417 gloss2-Qs:0.0802 exploreP:0.0100
Episode:1350 meanR:0.9793 R:1.4800 gloss:0.0124 dloss:0.0000 gloss1-As:0.1415 gloss2-Qs:0.0878 exploreP:0.0100
Episode:1351 meanR:0.9789 R:1.5600 gloss:0.0142 dloss:0.0000 gloss1-As:0.1415 gloss2-Qs:0.1005 exploreP:0.0100
Episode:1352 meanR:0.9863 R:1.6000 gloss:0.0140 dloss:0.0001 gloss1-As:0.1400 gloss2-Qs:0.1000 exploreP:0.0100
Episode:1353 meanR:0.9884 R:0.7800 gloss:0.0144 dloss:0.0001 gloss1-As:0.1378 gloss2-Qs:0.1047 exploreP:0.0100
Episode:1354 meanR:0.9686 R:0.5500 gloss:0.0140 dloss:0.0000 gloss1-As:0.1366 gloss2-Qs:0.1024 exploreP:0.0100
Episode:1355 meanR:0.9851 R:1.7600 gloss:0.0134 dloss:0.0001 gloss1-As:0.1362 gloss2-Qs:0.0987 exploreP:0.0100
...
Episode:1421 meanR:1.1097 R:1.8400 gloss:0.0157 dloss:0.0001 gloss1-As:0.1384 gloss2-Qs:0.1136 exploreP:0.0100
Episode:1422 meanR:1.1038 R:0.0000 gloss:0.0160 dloss:0.0001 gloss1-As:0.1362 gloss2-Qs:0.1176 exploreP:0.0100
Episode:1423 meanR:1.0976 R:0.7600 gloss:0.0154 dloss:0.0001 gloss1-As:0.1344 gloss2-Qs:0.1149 exploreP:0.0100
Episode:1424 meanR:1.0855 R:0.6300 gloss:0.0157 dloss:0.0001 gloss1-As:0.1337 gloss2-Qs:0.1173 exploreP:0.0100
Episode:1425 meanR:1.0851 R:0.4600 gloss:0.0151 dloss:0.0001 gloss1-As:0.1353 gloss2-Qs:0.1117 exploreP:0.0100
Episode:1426 meanR:1.0739 R:1.0000 gloss:0.0145 dloss:0.0001 gloss1-As:0.1343 gloss2-Qs:0.1082 exploreP:0.0100
Episode:1427 meanR:1.0708 R:0.9200 gloss:0.0141 dloss:0.0001 gloss1-As:0.1334 gloss2-Qs:0.1055 exploreP:0.0100
Episode:1428 meanR:1.0857 R:2.0400 gloss:0.0132 dloss:0.0001 gloss1-As:0.1343 gloss2-Qs:0.0984 exploreP:0.0100
Episode:1429 meanR:1.0735 R:1.7600 gloss:0.0134 dloss:0.0001 gloss1-As:0.1366 gloss2-Qs:0.0979 exploreP:0.0100
...
Episode:1495 meanR:0.9083 R:0.5300 gloss:0.0109 dloss:0.0000 gloss1-As:0.1369 gloss2-Qs:0.0798 exploreP:0.0100
Episode:1496 meanR:0.9076 R:1.0100 gloss:0.0110 dloss:0.0000 gloss1-As:0.1372 gloss2-Qs:0.0800 exploreP:0.0100
Episode:1497 meanR:0.9041 R:0.4800 gloss:0.0107 dloss:0.0000 gloss1-As:0.1380 gloss2-Qs:0.0773 exploreP:0.0100
Episode:1498 meanR:0.8944 R:0.9800 gloss:0.0111 dloss:0.0000 gloss1-As:0.1362 gloss2-Qs:0.0819 exploreP:0.0100
Episode:1499 meanR:0.8940 R:0.8700 gloss:0.0119 dloss:0.0000 gloss1-As:0.1380 gloss2-Qs:0.0861 exploreP:0.0100
Episode:1500 meanR:0.8917 R:1.0200 gloss:0.0120 dloss:0.0000 gloss1-As:0.1375 gloss2-Qs:0.0870 exploreP:0.0100
Episode:1501 meanR:0.8993 R:1.2300 gloss:0.0124 dloss:0.0000 gloss1-As:0.1372 gloss2-Qs:0.0905 exploreP:0.0100
Episode:1502 meanR:0.9121 R:2.0500 gloss:0.0125 dloss:0.0000 gloss1-As:0.1364 gloss2-Qs:0.0920 exploreP:0.0100
Episode:1503 meanR:0.9034 R:0.3800 gloss:0.0132 dloss:0.0000 gloss1-As:0.1350 gloss2-Qs:0.0975 exploreP:0.0100
...
Episode:1569 meanR:0.9034 R:0.9000 gloss:0.0110 dloss:0.0000 gloss1-As:0.1362 gloss2-Qs:0.0809 exploreP:0.0100
Episode:1570 meanR:0.9008 R:1.2600 gloss:0.0120 dloss:0.0000 gloss1-As:0.1385 gloss2-Qs:0.0868 exploreP:0.0100
Episode:1571 meanR:0.8987 R:0.5200 gloss:0.0124 dloss:0.0000 gloss1-As:0.1369 gloss2-Qs:0.0905 exploreP:0.0100
Episode:1572 meanR:0.8964 R:0.5400 gloss:0.0110 dloss:0.0000 gloss1-As:0.1374 gloss2-Qs:0.0803 exploreP:0.0100
Episode:1573 meanR:0.8985 R:0.7700 gloss:0.0100 dloss:0.0000 gloss1-As:0.1370 gloss2-Qs:0.0732 exploreP:0.0100
Episode:1574 meanR:0.9089 R:1.6200 gloss:0.0106 dloss:0.0000 gloss1-As:0.1391 gloss2-Qs:0.0764 exploreP:0.0100
Episode:1575 meanR:0.9045 R:1.2800 gloss:0.0117 dloss:0.0000 gloss1-As:0.1386 gloss2-Qs:0.0845 exploreP:0.0100
Episode:1576 meanR:0.9001 R:0.5600 gloss:0.0134 dloss:0.0000 gloss1-As:0.1413 gloss2-Qs:0.0946 exploreP:0.0100
Episode:1577 meanR:0.8935 R:0.2300 gloss:0.0128 dloss:0.0000 gloss1-As:0.1419 gloss2-Qs:0.0902 exploreP:0.0100
...
Episode:1643 meanR:0.9532 R:0.6500 gloss:0.0146 dloss:0.0001 gloss1-As:0.1399 gloss2-Qs:0.1040 exploreP:0.0100
Episode:1644 meanR:0.9646 R:1.2600 gloss:0.0142 dloss:0.0001 gloss1-As:0.1423 gloss2-Qs:0.1001 exploreP:0.0100
Episode:1645 meanR:0.9598 R:0.2400 gloss:0.0144 dloss:0.0001 gloss1-As:0.1441 gloss2-Qs:0.0998 exploreP:0.0100
Episode:1646 meanR:0.9651 R:1.6400 gloss:0.0141 dloss:0.0001 gloss1-As:0.1441 gloss2-Qs:0.0981 exploreP:0.0100
Episode:1647 meanR:0.9720 R:1.2600 gloss:0.0146 dloss:0.0001 gloss1-As:0.1454 gloss2-Qs:0.1005 exploreP:0.0100
Episode:1648 meanR:0.9708 R:0.4400 gloss:0.0142 dloss:0.0001 gloss1-As:0.1502 gloss2-Qs:0.0947 exploreP:0.0100
Episode:1649 meanR:0.9706 R:1.3100 gloss:0.0133 dloss:0.0000 gloss1-As:0.1470 gloss2-Qs:0.0903 exploreP:0.0100
Episode:1650 meanR:0.9617 R:0.8100 gloss:0.0131 dloss:0.0000 gloss1-As:0.1466 gloss2-Qs:0.0891 exploreP:0.0100
Episode:1651 meanR:0.9701 R:1.3200 gloss:0.0133 dloss:0.0000 gloss1-As:0.1441 gloss2-Qs:0.0923 exploreP:0.0100
...
Episode:1717 meanR:0.9652 R:0.5700 gloss:0.0122 dloss:0.0000 gloss1-As:0.1432 gloss2-Qs:0.0853 exploreP:0.0100
Episode:1718 meanR:0.9677 R:0.7800 gloss:0.0113 dloss:0.0000 gloss1-As:0.1402 gloss2-Qs:0.0805 exploreP:0.0100
Episode:1719 meanR:0.9818 R:2.0200 gloss:0.0114 dloss:0.0000 gloss1-As:0.1409 gloss2-Qs:0.0811 exploreP:0.0100
Episode:1720 meanR:0.9877 R:0.7500 gloss:0.0127 dloss:0.0000 gloss1-As:0.1408 gloss2-Qs:0.0901 exploreP:0.0100
Episode:1721 meanR:0.9797 R:0.3400 gloss:0.0124 dloss:0.0000 gloss1-As:0.1364 gloss2-Qs:0.0910 exploreP:0.0100
Episode:1722 meanR:0.9915 R:1.5800 gloss:0.0124 dloss:0.0000 gloss1-As:0.1366 gloss2-Qs:0.0908 exploreP:0.0100
Episode:1723 meanR:0.9910 R:1.3400 gloss:0.0128 dloss:0.0000 gloss1-As:0.1388 gloss2-Qs:0.0925 exploreP:0.0100
Episode:1724 meanR:0.9841 R:0.4400 gloss:0.0136 dloss:0.0001 gloss1-As:0.1395 gloss2-Qs:0.0971 exploreP:0.0100
Episode:1725 meanR:0.9866 R:1.4400 gloss:0.0134 dloss:0.0000 gloss1-As:0.1397 gloss2-Qs:0.0960 exploreP:0.0100
...
Episode:1791 meanR:0.9594 R:0.5400 gloss:0.0107 dloss:0.0000 gloss1-As:0.1398 gloss2-Qs:0.0766 exploreP:0.0100
Episode:1792 meanR:0.9354 R:0.4500 gloss:0.0106 dloss:0.0000 gloss1-As:0.1405 gloss2-Qs:0.0753 exploreP:0.0100
Episode:1793 meanR:0.9293 R:0.8000 gloss:0.0108 dloss:0.0000 gloss1-As:0.1409 gloss2-Qs:0.0769 exploreP:0.0100
Episode:1794 meanR:0.9286 R:0.4300 gloss:0.0113 dloss:0.0000 gloss1-As:0.1433 gloss2-Qs:0.0789 exploreP:0.0100
Episode:1795 meanR:0.9345 R:0.9300 gloss:0.0099 dloss:0.0000 gloss1-As:0.1420 gloss2-Qs:0.0700 exploreP:0.0100
Episode:1796 meanR:0.9300 R:0.7900 gloss:0.0086 dloss:0.0000 gloss1-As:0.1448 gloss2-Qs:0.0593 exploreP:0.0100
Episode:1797 meanR:0.9320 R:1.1200 gloss:0.0079 dloss:0.0000 gloss1-As:0.1454 gloss2-Qs:0.0546 exploreP:0.0100
Episode:1798 meanR:0.9337 R:0.8200 gloss:0.0091 dloss:0.0000 gloss1-As:0.1431 gloss2-Qs:0.0639 exploreP:0.0100
Episode:1799 meanR:0.9312 R:0.6400 gloss:0.0093 dloss:0.0000 gloss1-As:0.1440 gloss2-Qs:0.0643 exploreP:0.0100
...
Episode:1865 meanR:0.9371 R:1.8600 gloss:0.0139 dloss:0.0001 gloss1-As:0.1417 gloss2-Qs:0.0984 exploreP:0.0100
Episode:1866 meanR:0.9398 R:1.3700 gloss:0.0147 dloss:0.0001 gloss1-As:0.1413 gloss2-Qs:0.1044 exploreP:0.0100
Episode:1867 meanR:0.9427 R:1.1700 gloss:0.0158 dloss:0.0001 gloss1-As:0.1402 gloss2-Qs:0.1129 exploreP:0.0100
Episode:1868 meanR:0.9390 R:0.3800 gloss:0.0159 dloss:0.0001 gloss1-As:0.1413 gloss2-Qs:0.1122 exploreP:0.0100
Episode:1869 meanR:0.9254 R:0.3600 gloss:0.0153 dloss:0.0001 gloss1-As:0.1422 gloss2-Qs:0.1079 exploreP:0.0100
Episode:1870 meanR:0.9193 R:0.8400 gloss:0.0155 dloss:0.0001 gloss1-As:0.1430 gloss2-Qs:0.1085 exploreP:0.0100
Episode:1871 meanR:0.9241 R:1.0400 gloss:0.0140 dloss:0.0000 gloss1-As:0.1416 gloss2-Qs:0.0988 exploreP:0.0100
Episode:1872 meanR:0.9310 R:1.4400 gloss:0.0128 dloss:0.0000 gloss1-As:0.1413 gloss2-Qs:0.0906 exploreP:0.0100
Episode:1873 meanR:0.9265 R:0.3000 gloss:0.0131 dloss:0.0000 gloss1-As:0.1422 gloss2-Qs:0.0920 exploreP:0.0100
...
Episode:1939 meanR:1.0434 R:0.7800 gloss:0.0122 dloss:0.0000 gloss1-As:0.1485 gloss2-Qs:0.0820 exploreP:0.0100
Episode:1940 meanR:1.0431 R:1.6400 gloss:0.0138 dloss:0.0000 gloss1-As:0.1464 gloss2-Qs:0.0944 exploreP:0.0100
Episode:1941 meanR:1.0316 R:0.4200 gloss:0.0146 dloss:0.0000 gloss1-As:0.1450 gloss2-Qs:0.1005 exploreP:0.0100
Episode:1942 meanR:1.0286 R:0.7400 gloss:0.0137 dloss:0.0000 gloss1-As:0.1457 gloss2-Qs:0.0941 exploreP:0.0100
Episode:1943 meanR:1.0274 R:0.8400 gloss:0.0119 dloss:0.0000 gloss1-As:0.1469 gloss2-Qs:0.0809 exploreP:0.0100
Episode:1944 meanR:1.0236 R:0.5400 gloss:0.0128 dloss:0.0000 gloss1-As:0.1472 gloss2-Qs:0.0871 exploreP:0.0100
Episode:1945 meanR:1.0262 R:1.1500 gloss:0.0123 dloss:0.0000 gloss1-As:0.1463 gloss2-Qs:0.0843 exploreP:0.0100
Episode:1946 meanR:1.0373 R:1.3900 gloss:0.0134 dloss:0.0001 gloss1-As:0.1455 gloss2-Qs:0.0920 exploreP:0.0100
Episode:1947 meanR:1.0318 R:0.5400 gloss:0.0136 dloss:0.0000 gloss1-As:0.1452 gloss2-Qs:0.0938 exploreP:0.0100
...
Episode:2013 meanR:0.8641 R:0.7600 gloss:0.0086 dloss:0.0000 gloss1-As:0.1422 gloss2-Qs:0.0602 exploreP:0.0100
Episode:2014 meanR:0.8604 R:0.2500 gloss:0.0079 dloss:0.0000 gloss1-As:0.1414 gloss2-Qs:0.0557 exploreP:0.0100
Episode:2015 meanR:0.8607 R:0.7900 gloss:0.0080 dloss:0.0000 gloss1-As:0.1403 gloss2-Qs:0.0573 exploreP:0.0100
Episode:2016 meanR:0.8494 R:0.2900 gloss:0.0082 dloss:0.0000 gloss1-As:0.1410 gloss2-Qs:0.0582 exploreP:0.0100
Episode:2017 meanR:0.8633 R:2.0700 gloss:0.0083 dloss:0.0000 gloss1-As:0.1398 gloss2-Qs:0.0591 exploreP:0.0100
Episode:2018 meanR:0.8729 R:1.2100 gloss:0.0082 dloss:0.0000 gloss1-As:0.1410 gloss2-Qs:0.0582 exploreP:0.0100
Episode:2019 meanR:0.8757 R:0.9100 gloss:0.0090 dloss:0.0000 gloss1-As:0.1424 gloss2-Qs:0.0633 exploreP:0.0100
Episode:2020 meanR:0.8874 R:1.8900 gloss:0.0098 dloss:0.0000 gloss1-As:0.1415 gloss2-Qs:0.0694 exploreP:0.0100
Episode:2021 meanR:0.8841 R:0.5700 gloss:0.0114 dloss:0.0000 gloss1-As:0.1396 gloss2-Qs:0.0815 exploreP:0.0100
...
Episode:2087 meanR:0.9108 R:0.4900 gloss:0.0103 dloss:0.0000 gloss1-As:0.1410 gloss2-Qs:0.0733 exploreP:0.0100
Episode:2088 meanR:0.9084 R:0.2300 gloss:0.0107 dloss:0.0000 gloss1-As:0.1407 gloss2-Qs:0.0762 exploreP:0.0100
Episode:2089 meanR:0.9114 R:1.6100 gloss:0.0113 dloss:0.0000 gloss1-As:0.1409 gloss2-Qs:0.0804 exploreP:0.0100
Episode:2090 meanR:0.9144 R:1.0600 gloss:0.0106 dloss:0.0000 gloss1-As:0.1418 gloss2-Qs:0.0746 exploreP:0.0100
Episode:2091 meanR:0.9093 R:0.3600 gloss:0.0105 dloss:0.0000 gloss1-As:0.1409 gloss2-Qs:0.0748 exploreP:0.0100
Episode:2092 meanR:0.9018 R:1.0000 gloss:0.0109 dloss:0.0000 gloss1-As:0.1442 gloss2-Qs:0.0755 exploreP:0.0100
Episode:2093 meanR:0.9090 R:0.9300 gloss:0.0108 dloss:0.0000 gloss1-As:0.1454 gloss2-Qs:0.0745 exploreP:0.0100
Episode:2094 meanR:0.9186 R:1.3400 gloss:0.0109 dloss:0.0000 gloss1-As:0.1449 gloss2-Qs:0.0750 exploreP:0.0100
Episode:2095 meanR:0.9114 R:0.3800 gloss:0.0109 dloss:0.0000 gloss1-As:0.1442 gloss2-Qs:0.0758 exploreP:0.0100
...
Episode:2161 meanR:0.9087 R:1.1500 gloss:0.0128 dloss:0.0000 gloss1-As:0.1445 gloss2-Qs:0.0885 exploreP:0.0100
Episode:2162 meanR:0.9104 R:0.3500 gloss:0.0126 dloss:0.0000 gloss1-As:0.1417 gloss2-Qs:0.0890 exploreP:0.0100
Episode:2163 meanR:0.9028 R:0.5300 gloss:0.0120 dloss:0.0000 gloss1-As:0.1443 gloss2-Qs:0.0834 exploreP:0.0100
Episode:2164 meanR:0.9001 R:0.2900 gloss:0.0111 dloss:0.0000 gloss1-As:0.1422 gloss2-Qs:0.0780 exploreP:0.0100
Episode:2165 meanR:0.9065 R:1.4900 gloss:0.0093 dloss:0.0000 gloss1-As:0.1415 gloss2-Qs:0.0659 exploreP:0.0100
Episode:2166 meanR:0.8960 R:0.4900 gloss:0.0098 dloss:0.0000 gloss1-As:0.1409 gloss2-Qs:0.0697 exploreP:0.0100
Episode:2167 meanR:0.8938 R:0.9200 gloss:0.0098 dloss:0.0000 gloss1-As:0.1438 gloss2-Qs:0.0680 exploreP:0.0100
Episode:2168 meanR:0.8873 R:0.2300 gloss:0.0100 dloss:0.0000 gloss1-As:0.1418 gloss2-Qs:0.0704 exploreP:0.0100
Episode:2169 meanR:0.8851 R:0.8100 gloss:0.0092 dloss:0.0000 gloss1-As:0.1410 gloss2-Qs:0.0652 exploreP:0.0100
...
Episode:2235 meanR:0.9077 R:0.8200 gloss:0.0106 dloss:0.0000 gloss1-As:0.1447 gloss2-Qs:0.0732 exploreP:0.0100
Episode:2236 meanR:0.9046 R:0.7900 gloss:0.0105 dloss:0.0000 gloss1-As:0.1471 gloss2-Qs:0.0713 exploreP:0.0100
Episode:2237 meanR:0.8960 R:0.9400 gloss:0.0108 dloss:0.0000 gloss1-As:0.1445 gloss2-Qs:0.0750 exploreP:0.0100
Episode:2238 meanR:0.8973 R:0.7200 gloss:0.0113 dloss:0.0000 gloss1-As:0.1440 gloss2-Qs:0.0788 exploreP:0.0100
Episode:2239 meanR:0.8895 R:0.5400 gloss:0.0107 dloss:0.0000 gloss1-As:0.1440 gloss2-Qs:0.0747 exploreP:0.0100
Episode:2240 meanR:0.8910 R:0.8400 gloss:0.0103 dloss:0.0000 gloss1-As:0.1426 gloss2-Qs:0.0723 exploreP:0.0100
Episode:2241 meanR:0.8815 R:0.6800 gloss:0.0112 dloss:0.0000 gloss1-As:0.1455 gloss2-Qs:0.0767 exploreP:0.0100
Episode:2242 meanR:0.8828 R:0.7300 gloss:0.0103 dloss:0.0000 gloss1-As:0.1460 gloss2-Qs:0.0705 exploreP:0.0100
Episode:2243 meanR:0.8672 R:0.4000 gloss:0.0102 dloss:0.0000 gloss1-As:0.1460 gloss2-Qs:0.0701 exploreP:0.0100
...
Episode:2309 meanR:0.9326 R:0.7700 gloss:0.0108 dloss:0.0000 gloss1-As:0.1428 gloss2-Qs:0.0757 exploreP:0.0100
Episode:2310 meanR:0.9168 R:0.7700 gloss:0.0100 dloss:0.0000 gloss1-As:0.1427 gloss2-Qs:0.0700 exploreP:0.0100
Episode:2311 meanR:0.9196 R:0.6100 gloss:0.0094 dloss:0.0000 gloss1-As:0.1434 gloss2-Qs:0.0656 exploreP:0.0100
Episode:2312 meanR:0.9157 R:0.0900 gloss:0.0091 dloss:0.0000 gloss1-As:0.1425 gloss2-Qs:0.0636 exploreP:0.0100
Episode:2313 meanR:0.9211 R:0.9200 gloss:0.0083 dloss:0.0000 gloss1-As:0.1433 gloss2-Qs:0.0578 exploreP:0.0100
Episode:2314 meanR:0.9209 R:0.6400 gloss:0.0080 dloss:0.0000 gloss1-As:0.1449 gloss2-Qs:0.0552 exploreP:0.0100
Episode:2315 meanR:0.9161 R:0.3900 gloss:0.0083 dloss:0.0000 gloss1-As:0.1436 gloss2-Qs:0.0576 exploreP:0.0100
Episode:2316 meanR:0.9295 R:1.5900 gloss:0.0079 dloss:0.0000 gloss1-As:0.1433 gloss2-Qs:0.0550 exploreP:0.0100
Episode:2317 meanR:0.9293 R:0.9600 gloss:0.0094 dloss:0.0000 gloss1-As:0.1427 gloss2-Qs:0.0660 exploreP:0.0100
...
Episode:2383 meanR:0.9288 R:0.3800 gloss:0.0145 dloss:0.0000 gloss1-As:0.1513 gloss2-Qs:0.0959 exploreP:0.0100
Episode:2384 meanR:0.9286 R:0.9800 gloss:0.0142 dloss:0.0001 gloss1-As:0.1494 gloss2-Qs:0.0954 exploreP:0.0100
Episode:2385 meanR:0.9222 R:1.0300 gloss:0.0143 dloss:0.0001 gloss1-As:0.1477 gloss2-Qs:0.0967 exploreP:0.0100
Episode:2386 meanR:0.9397 R:2.0500 gloss:0.0154 dloss:0.0001 gloss1-As:0.1490 gloss2-Qs:0.1035 exploreP:0.0100
Episode:2387 meanR:0.9466 R:1.4900 gloss:0.0173 dloss:0.0001 gloss1-As:0.1485 gloss2-Qs:0.1168 exploreP:0.0100
Episode:2388 meanR:0.9387 R:0.4800 gloss:0.0186 dloss:0.0001 gloss1-As:0.1485 gloss2-Qs:0.1253 exploreP:0.0100
Episode:2389 meanR:0.9426 R:0.8800 gloss:0.0177 dloss:0.0001 gloss1-As:0.1469 gloss2-Qs:0.1202 exploreP:0.0100
Episode:2390 meanR:0.9480 R:1.3600 gloss:0.0156 dloss:0.0001 gloss1-As:0.1484 gloss2-Qs:0.1048 exploreP:0.0100
Episode:2391 meanR:0.9520 R:0.5900 gloss:0.0146 dloss:0.0001 gloss1-As:0.1471 gloss2-Qs:0.0993 exploreP:0.0100
...
Episode:2457 meanR:1.0349 R:0.8800 gloss:0.0149 dloss:0.0001 gloss1-As:0.1399 gloss2-Qs:0.1061 exploreP:0.0100
Episode:2458 meanR:1.0278 R:0.1500 gloss:0.0136 dloss:0.0001 gloss1-As:0.1401 gloss2-Qs:0.0969 exploreP:0.0100
Episode:2459 meanR:1.0270 R:0.9200 gloss:0.0128 dloss:0.0000 gloss1-As:0.1405 gloss2-Qs:0.0911 exploreP:0.0100
Episode:2460 meanR:1.0282 R:0.7400 gloss:0.0127 dloss:0.0000 gloss1-As:0.1427 gloss2-Qs:0.0888 exploreP:0.0100
Episode:2461 meanR:1.0284 R:0.7700 gloss:0.0115 dloss:0.0000 gloss1-As:0.1432 gloss2-Qs:0.0804 exploreP:0.0100
Episode:2462 meanR:1.0316 R:0.5100 gloss:0.0106 dloss:0.0000 gloss1-As:0.1474 gloss2-Qs:0.0719 exploreP:0.0100
Episode:2463 meanR:1.0315 R:0.8400 gloss:0.0102 dloss:0.0000 gloss1-As:0.1457 gloss2-Qs:0.0698 exploreP:0.0100
Episode:2464 meanR:1.0337 R:0.9700 gloss:0.0099 dloss:0.0000 gloss1-As:0.1457 gloss2-Qs:0.0679 exploreP:0.0100
Episode:2465 meanR:1.0320 R:0.6400 gloss:0.0093 dloss:0.0000 gloss1-As:0.1443 gloss2-Qs:0.0643 exploreP:0.0100
...
Episode:2531 meanR:0.9516 R:0.2900 gloss:0.0128 dloss:0.0000 gloss1-As:0.1390 gloss2-Qs:0.0922 exploreP:0.0100
Episode:2532 meanR:0.9448 R:0.8000 gloss:0.0121 dloss:0.0000 gloss1-As:0.1382 gloss2-Qs:0.0875 exploreP:0.0100
Episode:2533 meanR:0.9541 R:2.0100 gloss:0.0130 dloss:0.0000 gloss1-As:0.1390 gloss2-Qs:0.0935 exploreP:0.0100
Episode:2534 meanR:0.9483 R:1.0600 gloss:0.0140 dloss:0.0001 gloss1-As:0.1426 gloss2-Qs:0.0983 exploreP:0.0100
Episode:2535 meanR:0.9459 R:0.6500 gloss:0.0145 dloss:0.0001 gloss1-As:0.1447 gloss2-Qs:0.1004 exploreP:0.0100
Episode:2536 meanR:0.9451 R:0.7700 gloss:0.0135 dloss:0.0000 gloss1-As:0.1452 gloss2-Qs:0.0930 exploreP:0.0100
Episode:2537 meanR:0.9521 R:1.1300 gloss:0.0139 dloss:0.0000 gloss1-As:0.1462 gloss2-Qs:0.0950 exploreP:0.0100
Episode:2538 meanR:0.9666 R:2.0700 gloss:0.0135 dloss:0.0000 gloss1-As:0.1471 gloss2-Qs:0.0917 exploreP:0.0100
Episode:2539 meanR:0.9557 R:0.6500 gloss:0.0130 dloss:0.0000 gloss1-As:0.1465 gloss2-Qs:0.0889 exploreP:0.0100
...
Episode:2605 meanR:0.9990 R:1.6000 gloss:0.0130 dloss:0.0000 gloss1-As:0.1468 gloss2-Qs:0.0883 exploreP:0.0100
Episode:2606 meanR:0.9867 R:0.0900 gloss:0.0123 dloss:0.0000 gloss1-As:0.1437 gloss2-Qs:0.0856 exploreP:0.0100
Episode:2607 meanR:0.9830 R:0.8400 gloss:0.0113 dloss:0.0000 gloss1-As:0.1425 gloss2-Qs:0.0792 exploreP:0.0100
Episode:2608 meanR:0.9849 R:0.9100 gloss:0.0101 dloss:0.0000 gloss1-As:0.1428 gloss2-Qs:0.0706 exploreP:0.0100
Episode:2609 meanR:0.9954 R:1.5200 gloss:0.0112 dloss:0.0000 gloss1-As:0.1431 gloss2-Qs:0.0783 exploreP:0.0100
Episode:2610 meanR:0.9913 R:0.8600 gloss:0.0112 dloss:0.0000 gloss1-As:0.1435 gloss2-Qs:0.0783 exploreP:0.0100
Episode:2611 meanR:0.9981 R:1.0200 gloss:0.0128 dloss:0.0001 gloss1-As:0.1418 gloss2-Qs:0.0903 exploreP:0.0100
Episode:2612 meanR:0.9980 R:1.3400 gloss:0.0129 dloss:0.0000 gloss1-As:0.1426 gloss2-Qs:0.0901 exploreP:0.0100
Episode:2613 meanR:0.9960 R:0.4900 gloss:0.0132 dloss:0.0000 gloss1-As:0.1421 gloss2-Qs:0.0926 exploreP:0.0100
...
Episode:2679 meanR:0.9324 R:0.5900 gloss:0.0123 dloss:0.0000 gloss1-As:0.1471 gloss2-Qs:0.0838 exploreP:0.0100
Episode:2680 meanR:0.9282 R:0.4300 gloss:0.0126 dloss:0.0000 gloss1-As:0.1451 gloss2-Qs:0.0871 exploreP:0.0100
Episode:2681 meanR:0.9207 R:0.8500 gloss:0.0130 dloss:0.0000 gloss1-As:0.1471 gloss2-Qs:0.0881 exploreP:0.0100
Episode:2682 meanR:0.9226 R:1.3800 gloss:0.0129 dloss:0.0000 gloss1-As:0.1445 gloss2-Qs:0.0893 exploreP:0.0100
Episode:2683 meanR:0.9360 R:1.7100 gloss:0.0144 dloss:0.0001 gloss1-As:0.1463 gloss2-Qs:0.0984 exploreP:0.0100
Episode:2684 meanR:0.9335 R:0.5800 gloss:0.0142 dloss:0.0000 gloss1-As:0.1478 gloss2-Qs:0.0959 exploreP:0.0100
Episode:2685 meanR:0.9473 R:1.5800 gloss:0.0145 dloss:0.0001 gloss1-As:0.1499 gloss2-Qs:0.0967 exploreP:0.0100
Episode:2686 meanR:0.9486 R:0.4300 gloss:0.0136 dloss:0.0000 gloss1-As:0.1468 gloss2-Qs:0.0923 exploreP:0.0100
Episode:2687 meanR:0.9580 R:1.8300 gloss:0.0132 dloss:0.0001 gloss1-As:0.1480 gloss2-Qs:0.0889 exploreP:0.0100
...
Episode:2753 meanR:0.9680 R:0.9200 gloss:0.0141 dloss:0.0000 gloss1-As:0.1416 gloss2-Qs:0.0996 exploreP:0.0100
Episode:2754 meanR:0.9685 R:1.0600 gloss:0.0131 dloss:0.0000 gloss1-As:0.1410 gloss2-Qs:0.0926 exploreP:0.0100
Episode:2755 meanR:0.9627 R:0.3500 gloss:0.0132 dloss:0.0001 gloss1-As:0.1409 gloss2-Qs:0.0936 exploreP:0.0100
Episode:2756 meanR:0.9650 R:0.5500 gloss:0.0119 dloss:0.0000 gloss1-As:0.1382 gloss2-Qs:0.0862 exploreP:0.0100
Episode:2757 meanR:0.9647 R:0.8400 gloss:0.0113 dloss:0.0000 gloss1-As:0.1398 gloss2-Qs:0.0810 exploreP:0.0100
Episode:2758 meanR:0.9700 R:1.2800 gloss:0.0109 dloss:0.0000 gloss1-As:0.1416 gloss2-Qs:0.0767 exploreP:0.0100
Episode:2759 meanR:0.9789 R:1.3700 gloss:0.0116 dloss:0.0000 gloss1-As:0.1427 gloss2-Qs:0.0815 exploreP:0.0100
Episode:2760 meanR:0.9638 R:0.3600 gloss:0.0123 dloss:0.0000 gloss1-As:0.1479 gloss2-Qs:0.0834 exploreP:0.0100
Episode:2761 meanR:0.9653 R:1.4000 gloss:0.0120 dloss:0.0000 gloss1-As:0.1466 gloss2-Qs:0.0820 exploreP:0.0100
...
Episode:2827 meanR:0.9250 R:0.1000 gloss:0.0143 dloss:0.0001 gloss1-As:0.1484 gloss2-Qs:0.0965 exploreP:0.0100
Episode:2828 meanR:0.9251 R:1.0500 gloss:0.0140 dloss:0.0001 gloss1-As:0.1454 gloss2-Qs:0.0966 exploreP:0.0100
Episode:2829 meanR:0.9175 R:0.1800 gloss:0.0137 dloss:0.0000 gloss1-As:0.1438 gloss2-Qs:0.0956 exploreP:0.0100
Episode:2830 meanR:0.9107 R:0.3300 gloss:0.0139 dloss:0.0001 gloss1-As:0.1443 gloss2-Qs:0.0963 exploreP:0.0100
Episode:2831 meanR:0.9241 R:1.3900 gloss:0.0137 dloss:0.0000 gloss1-As:0.1473 gloss2-Qs:0.0930 exploreP:0.0100
Episode:2832 meanR:0.9226 R:0.8400 gloss:0.0138 dloss:0.0000 gloss1-As:0.1476 gloss2-Qs:0.0935 exploreP:0.0100
Episode:2833 meanR:0.9235 R:1.4000 gloss:0.0137 dloss:0.0000 gloss1-As:0.1453 gloss2-Qs:0.0944 exploreP:0.0100
Episode:2834 meanR:0.9295 R:0.8100 gloss:0.0128 dloss:0.0000 gloss1-As:0.1428 gloss2-Qs:0.0895 exploreP:0.0100
Episode:2835 meanR:0.9342 R:1.3400 gloss:0.0123 dloss:0.0000 gloss1-As:0.1424 gloss2-Qs:0.0861 exploreP:0.0100
...
Episode:2901 meanR:0.9501 R:0.6300 gloss:0.0096 dloss:0.0000 gloss1-As:0.1458 gloss2-Qs:0.0660 exploreP:0.0100
Episode:2902 meanR:0.9407 R:0.4800 gloss:0.0093 dloss:0.0000 gloss1-As:0.1431 gloss2-Qs:0.0653 exploreP:0.0100
Episode:2903 meanR:0.9462 R:0.6900 gloss:0.0095 dloss:0.0000 gloss1-As:0.1427 gloss2-Qs:0.0666 exploreP:0.0100
Episode:2904 meanR:0.9498 R:0.4100 gloss:0.0097 dloss:0.0000 gloss1-As:0.1456 gloss2-Qs:0.0667 exploreP:0.0100
Episode:2905 meanR:0.9452 R:0.3200 gloss:0.0085 dloss:0.0000 gloss1-As:0.1473 gloss2-Qs:0.0578 exploreP:0.0100
Episode:2906 meanR:0.9451 R:0.6300 gloss:0.0083 dloss:0.0000 gloss1-As:0.1478 gloss2-Qs:0.0560 exploreP:0.0100
Episode:2907 meanR:0.9535 R:1.2900 gloss:0.0084 dloss:0.0000 gloss1-As:0.1464 gloss2-Qs:0.0573 exploreP:0.0100
Episode:2908 meanR:0.9655 R:1.3900 gloss:0.0095 dloss:0.0000 gloss1-As:0.1489 gloss2-Qs:0.0640 exploreP:0.0100
Episode:2909 meanR:0.9665 R:1.0000 gloss:0.0098 dloss:0.0000 gloss1-As:0.1498 gloss2-Qs:0.0652 exploreP:0.0100
...
Episode:2975 meanR:0.9761 R:1.2100 gloss:0.0113 dloss:0.0000 gloss1-As:0.1454 gloss2-Qs:0.0780 exploreP:0.0100
Episode:2976 meanR:0.9672 R:0.2800 gloss:0.0113 dloss:0.0000 gloss1-As:0.1444 gloss2-Qs:0.0781 exploreP:0.0100
Episode:2977 meanR:0.9514 R:0.4700 gloss:0.0106 dloss:0.0000 gloss1-As:0.1459 gloss2-Qs:0.0723 exploreP:0.0100
Episode:2978 meanR:0.9484 R:0.2200 gloss:0.0089 dloss:0.0000 gloss1-As:0.1483 gloss2-Qs:0.0597 exploreP:0.0100
Episode:2979 meanR:0.9361 R:0.4900 gloss:0.0082 dloss:0.0000 gloss1-As:0.1464 gloss2-Qs:0.0557 exploreP:0.0100
Episode:2980 meanR:0.9378 R:0.9800 gloss:0.0090 dloss:0.0000 gloss1-As:0.1471 gloss2-Qs:0.0609 exploreP:0.0100
Episode:2981 meanR:0.9315 R:1.1300 gloss:0.0088 dloss:0.0000 gloss1-As:0.1470 gloss2-Qs:0.0598 exploreP:0.0100
Episode:2982 meanR:0.9244 R:0.7700 gloss:0.0091 dloss:0.0000 gloss1-As:0.1463 gloss2-Qs:0.0623 exploreP:0.0100
Episode:2983 meanR:0.9331 R:1.8300 gloss:0.0094 dloss:0.0000 gloss1-As:0.1455 gloss2-Qs:0.0649 exploreP:0.0100
...