# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```

In [2]:
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_v1/Reacher.x86_64')
env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_OneAgent/Reacher_Linux/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.  Each action is a vector with four numbers, corresponding to torque applicable to two joints.  Every entry in the action vector must be a number between `-1` and `1`.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726671e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance, if it selects an action at random with each time step.  A window should pop up that allows you to observe the agent, as it moves through the environment.  

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to tne environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.0


When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [8]:
# Testing the train mode
env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
state = env_info.vector_observations[0]                  # get the current state (for each agent)
#scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    action = np.random.randn(num_agents, action_size) # select an action (for each agent)
    #actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(action)[brain_name]           # send all actions to tne environment
    next_state = env_info.vector_observations[0]         # get next state (for each agent)
    reward = env_info.rewards[0]                         # get reward (for each agent)
    done = env_info.local_done[0]                        # see if episode finished
    #scores += env_info.rewards                         # update the score (for each agent)
    state = next_state                               # roll over states to next time step
    if done is True:                                  # exit loop if episode finished
        print(action.shape, reward)
        print(done)
        break
# print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

(1, 4) 0.0
True


## Option 1: Solve the First Version
The task is episodic, and in order to solve the environment, your agent must get an average score of +30 over 100 consecutive episodes.

In [9]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [10]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.float32, [None, action_size], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    return states, actions, targetQs

In [11]:
# Generator/Controller: Generating/prediting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits

In [12]:
# Discriminator/Dopamine: Reward function/planner/naviator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [13]:
def model_loss(action_size, hidden_size, states, actions, targetQs):
    # G
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    #actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    actions_labels = tf.nn.sigmoid(actions)
    # neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
    #                                                                   labels=actions_labels)
    neg_log_prob_actions = tf.nn.sigmoid_cross_entropy_with_logits(logits=actions_logits, 
                                                                   labels=actions_labels)
    #g_loss = tf.reduce_mean(neg_log_prob_actions * targetQs) # error!
    
    # D
    Qs = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    d_loss = tf.reduce_mean(tf.square(Qs - targetQs))
    g_loss = tf.reduce_mean(neg_log_prob_actions * Qs)
    return actions_logits, Qs, g_loss, d_loss

In [14]:
# Optimizating/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [15]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs = model_input(state_size=state_size, action_size=action_size)

        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, targetQs=self.targetQs) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [16]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)

In [17]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1, 33) actions:(1, 4)
action size:3.0


In [18]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
# state_size = 37
# state_size_ = (84, 84, 3)
state_size = 33
action_size = 4
hidden_size = 33*2             # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
gamma = 0.99                   # future reward discount
memory_size = 1000            # memory capacity
batch_size = 1000             # experience mini-batch size

In [19]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [20]:
# Initializing the memory buffer
env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
state = env_info.vector_observations[0]                  # get the current state (for each agent)
for _ in range(memory_size):
    action = np.random.randn(num_agents, action_size) # select an action (for each agent)
    #action = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(action)[brain_name]           # send all actions to tne environment
    next_state = env_info.vector_observations[0]         # get next state (for each agent)
    reward = env_info.rewards[0]                         # get reward (for each agent)
    done = env_info.local_done[0]                        # see if episode finished
    memory.buffer.append([state, action.reshape([-1]), next_state, reward, float(done)])
    #print(state.shape, action.reshape([-1]).shape, reward, float(done))
    state = next_state                               # roll over states to next time step
    if done is True:                                  # exit loop if episode finished
        print(done)
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        state = env_info.vector_observations[0]                  # get the current state (for each agent)
        break

In [21]:
# len(memory.buffer), memory.buffer[100]

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list, gloss_list, dloss_list = [], [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model-reacher-Continuous_Control.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        gloss_batch, dloss_batch = [], []
        #state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        state = env_info.vector_observations[0]                  # get the current state (for each agent)

        # Training steps/batches
        while True:
            # Explore (Env) or Exploit (Model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                #action = env.action_space.sample()
                action = np.random.randn(num_agents, action_size) # select an action (for each agent)
            else:
                action = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
                #print(action.shape)
                #action = np.reshape(action_logits, [-1]) # For continuous action space
                #action = np.argmax(action_logits) # For discrete action space
            #next_state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]           # send all actions to tne environment
            next_state = env_info.vector_observations[0]         # get next state (for each agent)
            reward = env_info.rewards[0]                         # get reward (for each agent)
            done = env_info.local_done[0]                        # see if episode finished
            memory.buffer.append([state, action.reshape([-1]), next_state, reward, float(done)])
            total_reward += reward
            state = next_state

            # Training
            #batch = memory.sample(batch_size)
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            nextQs_logits = sess.run(model.Qs_logits, feed_dict = {model.states: next_states})
            #nextQs = np.max(nextQs_logits, axis=1) * (1-dones)
            nextQs = nextQs_logits.reshape([-1]) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            #print(targetQs.shape)
            gloss, dloss, _, _ = sess.run([model.g_loss, model.d_loss, model.g_opt, model.d_opt],
                                            feed_dict = {model.states: states, 
                                                         model.actions: actions,
                                                         model.targetQs: targetQs})
            gloss_batch.append(gloss)
            dloss_batch.append(dloss)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dloss:{:.4f}'.format(np.mean(dloss_batch)),
              'exploreP:{:.4f}'.format(explore_p))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        # Break episode/epoch loop
        ## Option 1: Solve the First Version
        #The task is episodic, and in order to solve the environment, 
        #your agent must get an average score of +30 over 100 consecutive episodes.        
        if np.mean(episode_reward) >= +30:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model-reacher-Continuous_Control.ckpt')

Episode:0 meanR:0.0700 R:0.0700 gloss:-0.5387 dloss:0.0776 exploreP:0.9057
Episode:1 meanR:0.2200 R:0.3700 gloss:1.4941 dloss:0.0100 exploreP:0.8204
Episode:2 meanR:0.4133 R:0.8000 gloss:4.5103 dloss:0.0052 exploreP:0.7432
Episode:3 meanR:0.3700 R:0.2400 gloss:7.0446 dloss:0.0052 exploreP:0.6734
Episode:4 meanR:0.3920 R:0.4800 gloss:1.7065 dloss:0.0094 exploreP:0.6102
Episode:5 meanR:0.4600 R:0.8000 gloss:6.8126 dloss:0.0071 exploreP:0.5530
Episode:6 meanR:0.5229 R:0.9000 gloss:5.9324 dloss:0.0056 exploreP:0.5013
Episode:7 meanR:0.4925 R:0.2800 gloss:3.6661 dloss:0.0102 exploreP:0.4545
Episode:8 meanR:0.4722 R:0.3100 gloss:3.1590 dloss:0.0248 exploreP:0.4121
Episode:9 meanR:0.4940 R:0.6900 gloss:3.6598 dloss:0.0309 exploreP:0.3738
Episode:10 meanR:0.5064 R:0.6300 gloss:8.2172 dloss:0.0329 exploreP:0.3392
Episode:11 meanR:0.4642 R:0.0000 gloss:3.7638 dloss:0.0406 exploreP:0.3078
Episode:12 meanR:0.4423 R:0.1800 gloss:1.1668 dloss:0.1349 exploreP:0.2795
Episode:13 meanR:0.4271 R:0.2300 g

Episode:109 meanR:0.4601 R:0.0000 gloss:-0.2024 dloss:0.0470 exploreP:0.0100
Episode:110 meanR:0.4538 R:0.0000 gloss:-0.5011 dloss:0.0955 exploreP:0.0100
Episode:111 meanR:0.4554 R:0.1600 gloss:15.0748 dloss:0.6286 exploreP:0.0100
Episode:112 meanR:0.4574 R:0.3800 gloss:20.3729 dloss:0.2324 exploreP:0.0100
Episode:113 meanR:0.4661 R:1.1000 gloss:-0.1174 dloss:0.0832 exploreP:0.0100
Episode:114 meanR:0.4654 R:0.0000 gloss:0.0936 dloss:0.0601 exploreP:0.0100
Episode:115 meanR:0.4686 R:0.3200 gloss:0.0251 dloss:0.0326 exploreP:0.0100
Episode:116 meanR:0.4655 R:0.0600 gloss:-0.0970 dloss:0.0340 exploreP:0.0100
Episode:117 meanR:0.4690 R:0.3500 gloss:-0.2038 dloss:0.0606 exploreP:0.0100
Episode:118 meanR:0.4774 R:0.8400 gloss:-0.2867 dloss:0.0411 exploreP:0.0100
Episode:119 meanR:0.4871 R:0.9700 gloss:0.2189 dloss:0.0361 exploreP:0.0100
Episode:120 meanR:0.4899 R:0.6600 gloss:0.0201 dloss:0.0326 exploreP:0.0100
Episode:121 meanR:0.4994 R:0.9500 gloss:-0.2218 dloss:0.0341 exploreP:0.0100
Epi

Episode:217 meanR:0.9772 R:1.2000 gloss:0.0214 dloss:0.0000 exploreP:0.0100
Episode:218 meanR:0.9733 R:0.4500 gloss:0.0182 dloss:0.0000 exploreP:0.0100
Episode:219 meanR:0.9701 R:0.6500 gloss:0.0119 dloss:0.0000 exploreP:0.0100
Episode:220 meanR:0.9743 R:1.0800 gloss:0.0238 dloss:0.0000 exploreP:0.0100
Episode:221 meanR:0.9744 R:0.9600 gloss:0.0290 dloss:0.0001 exploreP:0.0100
Episode:222 meanR:0.9721 R:0.9100 gloss:0.0182 dloss:0.0000 exploreP:0.0100
Episode:223 meanR:0.9759 R:1.0000 gloss:0.0367 dloss:0.0001 exploreP:0.0100
Episode:224 meanR:0.9753 R:0.3200 gloss:0.0106 dloss:0.0000 exploreP:0.0100
Episode:225 meanR:0.9940 R:1.8700 gloss:0.0355 dloss:0.0001 exploreP:0.0100
Episode:226 meanR:0.9984 R:0.8300 gloss:0.0350 dloss:0.0001 exploreP:0.0100
Episode:227 meanR:0.9854 R:0.0000 gloss:0.0082 dloss:0.0000 exploreP:0.0100
Episode:228 meanR:0.9805 R:0.5200 gloss:0.0070 dloss:0.0000 exploreP:0.0100
Episode:229 meanR:0.9831 R:0.3300 gloss:0.0144 dloss:0.0000 exploreP:0.0100
Episode:230 

Episode:325 meanR:1.1658 R:0.7600 gloss:0.0810 dloss:0.0001 exploreP:0.0100
Episode:326 meanR:1.1671 R:0.9600 gloss:0.0240 dloss:0.0000 exploreP:0.0100
Episode:327 meanR:1.1740 R:0.6900 gloss:0.0521 dloss:0.0001 exploreP:0.0100
Episode:328 meanR:1.1826 R:1.3800 gloss:0.0278 dloss:0.0000 exploreP:0.0100
Episode:329 meanR:1.1966 R:1.7300 gloss:0.0906 dloss:0.0001 exploreP:0.0100
Episode:330 meanR:1.2130 R:2.4700 gloss:0.0707 dloss:0.0001 exploreP:0.0100
Episode:331 meanR:1.2006 R:0.1500 gloss:0.0818 dloss:0.0001 exploreP:0.0100
Episode:332 meanR:1.2036 R:0.9400 gloss:0.0289 dloss:0.0000 exploreP:0.0100
Episode:333 meanR:1.2110 R:1.1200 gloss:0.0363 dloss:0.0000 exploreP:0.0100
Episode:334 meanR:1.2152 R:0.5900 gloss:0.0501 dloss:0.0000 exploreP:0.0100
Episode:335 meanR:1.2197 R:1.3900 gloss:0.0258 dloss:0.0000 exploreP:0.0100
Episode:336 meanR:1.2196 R:0.1800 gloss:0.0551 dloss:0.0001 exploreP:0.0100
Episode:337 meanR:1.2213 R:1.0700 gloss:0.0182 dloss:0.0000 exploreP:0.0100
Episode:338 

Episode:433 meanR:0.9399 R:0.5400 gloss:0.0168 dloss:0.0000 exploreP:0.0100
Episode:434 meanR:0.9404 R:0.6400 gloss:0.0153 dloss:0.0000 exploreP:0.0100
Episode:435 meanR:0.9345 R:0.8000 gloss:0.0224 dloss:0.0000 exploreP:0.0100
Episode:436 meanR:0.9452 R:1.2500 gloss:0.0310 dloss:0.0000 exploreP:0.0100
Episode:437 meanR:0.9430 R:0.8500 gloss:0.0254 dloss:0.0000 exploreP:0.0100
Episode:438 meanR:0.9332 R:0.2500 gloss:0.0282 dloss:0.0000 exploreP:0.0100
Episode:439 meanR:0.9081 R:0.9000 gloss:0.0187 dloss:0.0000 exploreP:0.0100
Episode:440 meanR:0.8883 R:0.4200 gloss:0.0256 dloss:0.0000 exploreP:0.0100
Episode:441 meanR:0.8795 R:0.6200 gloss:0.0200 dloss:0.0000 exploreP:0.0100
Episode:442 meanR:0.8789 R:1.2200 gloss:0.0333 dloss:0.0000 exploreP:0.0100
Episode:443 meanR:0.8743 R:1.5800 gloss:0.0613 dloss:0.0001 exploreP:0.0100
Episode:444 meanR:0.8747 R:0.5800 gloss:0.0186 dloss:0.0000 exploreP:0.0100
Episode:445 meanR:0.8727 R:0.9600 gloss:0.0353 dloss:0.0000 exploreP:0.0100
Episode:446 

Episode:541 meanR:0.8468 R:1.4900 gloss:0.0502 dloss:0.0001 exploreP:0.0100
Episode:542 meanR:0.8527 R:1.8100 gloss:0.0662 dloss:0.0001 exploreP:0.0100
Episode:543 meanR:0.8386 R:0.1700 gloss:0.0493 dloss:0.0001 exploreP:0.0100
Episode:544 meanR:0.8443 R:1.1500 gloss:0.0329 dloss:0.0000 exploreP:0.0100
Episode:545 meanR:0.8394 R:0.4700 gloss:0.0268 dloss:0.0000 exploreP:0.0100
Episode:546 meanR:0.8431 R:1.3100 gloss:0.0332 dloss:0.0000 exploreP:0.0100
Episode:547 meanR:0.8556 R:1.3300 gloss:0.0712 dloss:0.0001 exploreP:0.0100
Episode:548 meanR:0.8609 R:1.8800 gloss:0.0749 dloss:0.0001 exploreP:0.0100
Episode:549 meanR:0.8651 R:1.0000 gloss:0.0617 dloss:0.0001 exploreP:0.0100
Episode:550 meanR:0.8542 R:0.4900 gloss:0.0409 dloss:0.0000 exploreP:0.0100
Episode:551 meanR:0.8648 R:1.7900 gloss:0.0753 dloss:0.0001 exploreP:0.0100
Episode:552 meanR:0.8767 R:2.2200 gloss:0.0688 dloss:0.0001 exploreP:0.0100
Episode:553 meanR:0.8822 R:1.7700 gloss:0.0820 dloss:0.0001 exploreP:0.0100
Episode:554 

Episode:649 meanR:1.0649 R:0.1000 gloss:0.0125 dloss:0.0000 exploreP:0.0100
Episode:650 meanR:1.0600 R:0.0000 gloss:0.0019 dloss:0.0000 exploreP:0.0100
Episode:651 meanR:1.0428 R:0.0700 gloss:0.0024 dloss:0.0000 exploreP:0.0100
Episode:652 meanR:1.0404 R:1.9800 gloss:0.0348 dloss:0.0000 exploreP:0.0100
Episode:653 meanR:1.0311 R:0.8400 gloss:0.0669 dloss:0.0001 exploreP:0.0100
Episode:654 meanR:1.0387 R:1.1700 gloss:0.0436 dloss:0.0000 exploreP:0.0100
Episode:655 meanR:1.0330 R:0.8300 gloss:0.0591 dloss:0.0001 exploreP:0.0100
Episode:656 meanR:1.0238 R:0.3000 gloss:0.0195 dloss:0.0000 exploreP:0.0100
Episode:657 meanR:1.0358 R:1.7100 gloss:0.0370 dloss:0.0000 exploreP:0.0100
Episode:658 meanR:1.0312 R:1.0300 gloss:0.0424 dloss:0.0001 exploreP:0.0100
Episode:659 meanR:1.0206 R:1.2700 gloss:0.0409 dloss:0.0000 exploreP:0.0100
Episode:660 meanR:1.0310 R:1.9900 gloss:0.0715 dloss:0.0001 exploreP:0.0100
Episode:661 meanR:1.0180 R:0.2900 gloss:0.0608 dloss:0.0001 exploreP:0.0100
Episode:662 

Episode:757 meanR:0.9395 R:0.6000 gloss:0.0363 dloss:0.0000 exploreP:0.0100
Episode:758 meanR:0.9632 R:3.4000 gloss:0.1047 dloss:0.0001 exploreP:0.0100
Episode:759 meanR:0.9656 R:1.5100 gloss:0.1084 dloss:0.0001 exploreP:0.0100
Episode:760 meanR:0.9564 R:1.0700 gloss:0.0707 dloss:0.0001 exploreP:0.0100
Episode:761 meanR:0.9747 R:2.1200 gloss:0.0674 dloss:0.0001 exploreP:0.0100
Episode:762 meanR:0.9790 R:0.9200 gloss:0.0929 dloss:0.0001 exploreP:0.0100
Episode:763 meanR:1.0011 R:2.5600 gloss:0.0708 dloss:0.0001 exploreP:0.0100
Episode:764 meanR:1.0144 R:1.4100 gloss:0.0939 dloss:0.0001 exploreP:0.0100
Episode:765 meanR:1.0160 R:1.1500 gloss:0.0723 dloss:0.0001 exploreP:0.0100
Episode:766 meanR:1.0139 R:0.5600 gloss:0.0327 dloss:0.0000 exploreP:0.0100
Episode:767 meanR:1.0288 R:1.9400 gloss:0.0546 dloss:0.0001 exploreP:0.0100
Episode:768 meanR:1.0132 R:0.3800 gloss:0.0715 dloss:0.0001 exploreP:0.0100
Episode:769 meanR:1.0224 R:1.3300 gloss:0.0365 dloss:0.0000 exploreP:0.0100
Episode:770 

Episode:865 meanR:0.9802 R:0.9700 gloss:0.0585 dloss:0.0001 exploreP:0.0100
Episode:866 meanR:0.9777 R:0.3100 gloss:0.0285 dloss:0.0000 exploreP:0.0100
Episode:867 meanR:0.9684 R:1.0100 gloss:0.0187 dloss:0.0000 exploreP:0.0100
Episode:868 meanR:0.9691 R:0.4500 gloss:0.0383 dloss:0.0000 exploreP:0.0100
Episode:869 meanR:0.9681 R:1.2300 gloss:0.0228 dloss:0.0000 exploreP:0.0100
Episode:870 meanR:0.9572 R:1.5000 gloss:0.0626 dloss:0.0001 exploreP:0.0100
Episode:871 meanR:0.9477 R:0.5000 gloss:0.0404 dloss:0.0000 exploreP:0.0100
Episode:872 meanR:0.9556 R:1.1300 gloss:0.0300 dloss:0.0000 exploreP:0.0100
Episode:873 meanR:0.9478 R:0.5800 gloss:0.0392 dloss:0.0000 exploreP:0.0100
Episode:874 meanR:0.9347 R:0.6600 gloss:0.0397 dloss:0.0000 exploreP:0.0100
Episode:875 meanR:0.9166 R:0.4900 gloss:0.0258 dloss:0.0000 exploreP:0.0100
Episode:876 meanR:0.9179 R:0.8100 gloss:0.0174 dloss:0.0000 exploreP:0.0100
Episode:877 meanR:0.9182 R:0.3600 gloss:0.0370 dloss:0.0000 exploreP:0.0100
Episode:878 

Episode:973 meanR:0.9938 R:0.6000 gloss:0.0377 dloss:0.0000 exploreP:0.0100
Episode:974 meanR:0.9886 R:0.1400 gloss:0.0275 dloss:0.0000 exploreP:0.0100
Episode:975 meanR:0.9979 R:1.4200 gloss:0.0407 dloss:0.0000 exploreP:0.0100
Episode:976 meanR:0.9993 R:0.9500 gloss:0.0593 dloss:0.0001 exploreP:0.0100
Episode:977 meanR:1.0002 R:0.4500 gloss:0.0413 dloss:0.0000 exploreP:0.0100
Episode:978 meanR:1.0050 R:1.4100 gloss:0.0597 dloss:0.0001 exploreP:0.0100
Episode:979 meanR:1.0228 R:1.9000 gloss:0.0788 dloss:0.0001 exploreP:0.0100
Episode:980 meanR:1.0245 R:0.8100 gloss:0.0745 dloss:0.0001 exploreP:0.0100
Episode:981 meanR:1.0331 R:1.2700 gloss:0.0613 dloss:0.0001 exploreP:0.0100
Episode:982 meanR:1.0462 R:1.8000 gloss:0.0816 dloss:0.0001 exploreP:0.0100
Episode:983 meanR:1.0458 R:0.6300 gloss:0.0641 dloss:0.0001 exploreP:0.0100
Episode:984 meanR:1.0454 R:0.5600 gloss:0.0316 dloss:0.0000 exploreP:0.0100
Episode:985 meanR:1.0571 R:1.1700 gloss:0.0361 dloss:0.0000 exploreP:0.0100
Episode:986 

Episode:1080 meanR:0.8893 R:0.5200 gloss:0.0494 dloss:0.0000 exploreP:0.0100
Episode:1081 meanR:0.8769 R:0.0300 gloss:0.0135 dloss:0.0000 exploreP:0.0100
Episode:1082 meanR:0.8589 R:0.0000 gloss:0.0001 dloss:0.0000 exploreP:0.0100
Episode:1083 meanR:0.8697 R:1.7100 gloss:0.0352 dloss:0.0000 exploreP:0.0100
Episode:1084 meanR:0.8641 R:0.0000 gloss:0.0360 dloss:0.0000 exploreP:0.0100
Episode:1085 meanR:0.8524 R:0.0000 gloss:0.0000 dloss:0.0000 exploreP:0.0100
Episode:1086 meanR:0.8468 R:0.2600 gloss:0.0597 dloss:0.0000 exploreP:0.0100
Episode:1087 meanR:0.8303 R:0.0000 gloss:0.0028 dloss:0.0000 exploreP:0.0100
Episode:1088 meanR:0.8347 R:1.2500 gloss:0.0237 dloss:0.0000 exploreP:0.0100
Episode:1089 meanR:0.8308 R:0.7000 gloss:0.0265 dloss:0.0000 exploreP:0.0100
Episode:1090 meanR:0.8323 R:0.9500 gloss:0.0336 dloss:0.0000 exploreP:0.0100
Episode:1091 meanR:0.8222 R:0.8600 gloss:0.0232 dloss:0.0000 exploreP:0.0100
Episode:1092 meanR:0.8109 R:0.1300 gloss:0.0194 dloss:0.0000 exploreP:0.0100

Episode:1187 meanR:1.1629 R:1.1300 gloss:0.0613 dloss:0.0001 exploreP:0.0100
Episode:1188 meanR:1.1616 R:1.1200 gloss:0.0480 dloss:0.0000 exploreP:0.0100
Episode:1189 meanR:1.1571 R:0.2500 gloss:0.0471 dloss:0.0000 exploreP:0.0100
Episode:1190 meanR:1.1618 R:1.4200 gloss:0.0331 dloss:0.0000 exploreP:0.0100
Episode:1191 meanR:1.1708 R:1.7600 gloss:0.0929 dloss:0.0001 exploreP:0.0100
Episode:1192 meanR:1.1742 R:0.4700 gloss:0.0592 dloss:0.0001 exploreP:0.0100
Episode:1193 meanR:1.1895 R:1.7600 gloss:0.0613 dloss:0.0001 exploreP:0.0100
Episode:1194 meanR:1.1941 R:0.9200 gloss:0.0759 dloss:0.0001 exploreP:0.0100
Episode:1195 meanR:1.2083 R:1.7000 gloss:0.0566 dloss:0.0000 exploreP:0.0100
Episode:1196 meanR:1.2167 R:1.1100 gloss:0.0815 dloss:0.0001 exploreP:0.0100
Episode:1197 meanR:1.2164 R:1.3900 gloss:0.0696 dloss:0.0001 exploreP:0.0100
Episode:1198 meanR:1.2131 R:1.0300 gloss:0.0714 dloss:0.0001 exploreP:0.0100
Episode:1199 meanR:1.2122 R:1.7600 gloss:0.0702 dloss:0.0001 exploreP:0.0100

Episode:1294 meanR:0.9988 R:0.1600 gloss:0.0226 dloss:0.0000 exploreP:0.0100
Episode:1295 meanR:1.0013 R:1.9500 gloss:0.0417 dloss:0.0000 exploreP:0.0100
Episode:1296 meanR:1.0077 R:1.7500 gloss:0.1215 dloss:0.0001 exploreP:0.0100
Episode:1297 meanR:1.0152 R:2.1400 gloss:0.1059 dloss:0.0001 exploreP:0.0100
Episode:1298 meanR:1.0111 R:0.6200 gloss:0.0530 dloss:0.0000 exploreP:0.0100
Episode:1299 meanR:1.0127 R:1.9200 gloss:0.0617 dloss:0.0001 exploreP:0.0100
Episode:1300 meanR:1.0105 R:0.5800 gloss:0.0854 dloss:0.0001 exploreP:0.0100
Episode:1301 meanR:1.0270 R:3.3000 gloss:0.1129 dloss:0.0001 exploreP:0.0100
Episode:1302 meanR:1.0361 R:1.8800 gloss:0.1413 dloss:0.0002 exploreP:0.0100
Episode:1303 meanR:1.0377 R:0.8400 gloss:0.0719 dloss:0.0001 exploreP:0.0100
Episode:1304 meanR:1.0231 R:0.7400 gloss:0.0366 dloss:0.0000 exploreP:0.0100
Episode:1305 meanR:1.0171 R:0.8700 gloss:0.0403 dloss:0.0000 exploreP:0.0100
Episode:1306 meanR:1.0243 R:1.2800 gloss:0.0739 dloss:0.0001 exploreP:0.0100

Episode:1401 meanR:0.8357 R:0.4400 gloss:0.0117 dloss:0.0000 exploreP:0.0100
Episode:1402 meanR:0.8182 R:0.1300 gloss:0.0131 dloss:0.0000 exploreP:0.0100
Episode:1403 meanR:0.8241 R:1.4300 gloss:0.0279 dloss:0.0000 exploreP:0.0100
Episode:1404 meanR:0.8183 R:0.1600 gloss:0.0266 dloss:0.0000 exploreP:0.0100
Episode:1405 meanR:0.8173 R:0.7700 gloss:0.0177 dloss:0.0000 exploreP:0.0100
Episode:1406 meanR:0.8117 R:0.7200 gloss:0.0237 dloss:0.0000 exploreP:0.0100
Episode:1407 meanR:0.8053 R:0.6300 gloss:0.0291 dloss:0.0000 exploreP:0.0100
Episode:1408 meanR:0.8138 R:0.9600 gloss:0.0359 dloss:0.0000 exploreP:0.0100
Episode:1409 meanR:0.7922 R:0.5600 gloss:0.0258 dloss:0.0000 exploreP:0.0100
Episode:1410 meanR:0.7930 R:0.5900 gloss:0.0204 dloss:0.0000 exploreP:0.0100
Episode:1411 meanR:0.7906 R:0.4900 gloss:0.0100 dloss:0.0000 exploreP:0.0100
Episode:1412 meanR:0.7880 R:0.8600 gloss:0.0277 dloss:0.0000 exploreP:0.0100
Episode:1413 meanR:0.7824 R:1.1300 gloss:0.0490 dloss:0.0001 exploreP:0.0100

Episode:1508 meanR:1.1333 R:1.3100 gloss:0.0544 dloss:0.0000 exploreP:0.0100
Episode:1509 meanR:1.1366 R:0.8900 gloss:0.0575 dloss:0.0001 exploreP:0.0100
Episode:1510 meanR:1.1481 R:1.7400 gloss:0.0449 dloss:0.0000 exploreP:0.0100
Episode:1511 meanR:1.1545 R:1.1300 gloss:0.0870 dloss:0.0001 exploreP:0.0100
Episode:1512 meanR:1.1508 R:0.4900 gloss:0.0395 dloss:0.0000 exploreP:0.0100
Episode:1513 meanR:1.1467 R:0.7200 gloss:0.0440 dloss:0.0000 exploreP:0.0100
Episode:1514 meanR:1.1497 R:1.4600 gloss:0.0488 dloss:0.0000 exploreP:0.0100
Episode:1515 meanR:1.1549 R:1.2700 gloss:0.0664 dloss:0.0001 exploreP:0.0100
Episode:1516 meanR:1.1508 R:0.1800 gloss:0.0403 dloss:0.0000 exploreP:0.0100
Episode:1517 meanR:1.1479 R:0.7200 gloss:0.0172 dloss:0.0000 exploreP:0.0100
Episode:1518 meanR:1.1597 R:2.2200 gloss:0.0900 dloss:0.0001 exploreP:0.0100
Episode:1519 meanR:1.1647 R:1.6000 gloss:0.1116 dloss:0.0001 exploreP:0.0100
Episode:1520 meanR:1.1650 R:1.0700 gloss:0.0454 dloss:0.0000 exploreP:0.0100

Episode:1615 meanR:0.9838 R:0.6700 gloss:0.0363 dloss:0.0000 exploreP:0.0100
Episode:1616 meanR:0.9934 R:1.1400 gloss:0.0589 dloss:0.0001 exploreP:0.0100
Episode:1617 meanR:0.9946 R:0.8400 gloss:0.0385 dloss:0.0000 exploreP:0.0100
Episode:1618 meanR:0.9800 R:0.7600 gloss:0.0434 dloss:0.0000 exploreP:0.0100
Episode:1619 meanR:0.9671 R:0.3100 gloss:0.0266 dloss:0.0000 exploreP:0.0100
Episode:1620 meanR:0.9611 R:0.4700 gloss:0.0154 dloss:0.0000 exploreP:0.0100
Episode:1621 meanR:0.9547 R:0.8600 gloss:0.0477 dloss:0.0000 exploreP:0.0100
Episode:1622 meanR:0.9556 R:1.3000 gloss:0.0482 dloss:0.0000 exploreP:0.0100
Episode:1623 meanR:0.9510 R:0.7100 gloss:0.0518 dloss:0.0000 exploreP:0.0100
Episode:1624 meanR:0.9584 R:1.6400 gloss:0.0557 dloss:0.0000 exploreP:0.0100
Episode:1625 meanR:0.9557 R:0.2100 gloss:0.0489 dloss:0.0000 exploreP:0.0100
Episode:1626 meanR:0.9514 R:1.5600 gloss:0.0300 dloss:0.0000 exploreP:0.0100
Episode:1627 meanR:0.9352 R:0.0000 gloss:0.0610 dloss:0.0001 exploreP:0.0100

Episode:1722 meanR:0.8836 R:1.1500 gloss:0.0869 dloss:0.0001 exploreP:0.0100
Episode:1723 meanR:0.9086 R:3.2100 gloss:0.1287 dloss:0.0001 exploreP:0.0100
Episode:1724 meanR:0.9055 R:1.3300 gloss:0.1112 dloss:0.0001 exploreP:0.0100
Episode:1725 meanR:0.9087 R:0.5300 gloss:0.0360 dloss:0.0000 exploreP:0.0100
Episode:1726 meanR:0.9116 R:1.8500 gloss:0.0574 dloss:0.0001 exploreP:0.0100
Episode:1727 meanR:0.9283 R:1.6700 gloss:0.0988 dloss:0.0001 exploreP:0.0100
Episode:1728 meanR:0.9294 R:0.6200 gloss:0.0493 dloss:0.0000 exploreP:0.0100
Episode:1729 meanR:0.9349 R:1.2200 gloss:0.0470 dloss:0.0000 exploreP:0.0100
Episode:1730 meanR:0.9256 R:1.0100 gloss:0.0695 dloss:0.0001 exploreP:0.0100
Episode:1731 meanR:0.9337 R:1.2800 gloss:0.0518 dloss:0.0000 exploreP:0.0100
Episode:1732 meanR:0.9481 R:2.0300 gloss:0.0871 dloss:0.0001 exploreP:0.0100
Episode:1733 meanR:0.9571 R:1.3500 gloss:0.0912 dloss:0.0001 exploreP:0.0100
Episode:1734 meanR:0.9583 R:0.6000 gloss:0.0367 dloss:0.0000 exploreP:0.0100

Episode:1829 meanR:0.8147 R:1.6500 gloss:0.0333 dloss:0.0000 exploreP:0.0100
Episode:1830 meanR:0.8046 R:0.0000 gloss:0.0274 dloss:0.0000 exploreP:0.0100
Episode:1831 meanR:0.7918 R:0.0000 gloss:-0.0000 dloss:0.0000 exploreP:0.0100
Episode:1832 meanR:0.7715 R:0.0000 gloss:0.0000 dloss:0.0000 exploreP:0.0100
Episode:1833 meanR:0.7580 R:0.0000 gloss:-0.0000 dloss:0.0000 exploreP:0.0100
Episode:1834 meanR:0.7520 R:0.0000 gloss:0.0000 dloss:0.0000 exploreP:0.0100
Episode:1835 meanR:0.7396 R:0.0000 gloss:0.0000 dloss:0.0000 exploreP:0.0100
Episode:1836 meanR:0.7362 R:0.0000 gloss:0.0000 dloss:0.0000 exploreP:0.0100
Episode:1837 meanR:0.7282 R:0.0000 gloss:-0.0000 dloss:0.0000 exploreP:0.0100
Episode:1838 meanR:0.7267 R:0.0000 gloss:-0.0007 dloss:0.0000 exploreP:0.0100
Episode:1839 meanR:0.7094 R:0.3200 gloss:0.0050 dloss:0.0000 exploreP:0.0100
Episode:1840 meanR:0.6990 R:0.4200 gloss:0.0061 dloss:0.0000 exploreP:0.0100
Episode:1841 meanR:0.6967 R:0.5900 gloss:0.0102 dloss:0.0000 exploreP:0.

Episode:1936 meanR:0.8268 R:0.4600 gloss:0.0251 dloss:0.0000 exploreP:0.0100
Episode:1937 meanR:0.8303 R:0.3500 gloss:0.0101 dloss:0.0000 exploreP:0.0100
Episode:1938 meanR:0.8439 R:1.3600 gloss:0.0376 dloss:0.0000 exploreP:0.0100
Episode:1939 meanR:0.8473 R:0.6600 gloss:0.0432 dloss:0.0000 exploreP:0.0100
Episode:1940 meanR:0.8462 R:0.3100 gloss:0.0218 dloss:0.0000 exploreP:0.0100
Episode:1941 meanR:0.8447 R:0.4400 gloss:0.0199 dloss:0.0000 exploreP:0.0100
Episode:1942 meanR:0.8410 R:1.0800 gloss:0.0345 dloss:0.0000 exploreP:0.0100
Episode:1943 meanR:0.8370 R:0.3500 gloss:0.0280 dloss:0.0000 exploreP:0.0100
Episode:1944 meanR:0.8367 R:0.3900 gloss:0.0124 dloss:0.0000 exploreP:0.0100
Episode:1945 meanR:0.8303 R:0.4200 gloss:0.0169 dloss:0.0000 exploreP:0.0100
Episode:1946 meanR:0.8371 R:0.9700 gloss:0.0266 dloss:0.0000 exploreP:0.0100
Episode:1947 meanR:0.8430 R:0.9800 gloss:0.0414 dloss:0.0000 exploreP:0.0100
Episode:1948 meanR:0.8477 R:1.6500 gloss:0.0587 dloss:0.0001 exploreP:0.0100

Episode:2043 meanR:0.8459 R:1.2000 gloss:0.0347 dloss:0.0000 exploreP:0.0100
Episode:2044 meanR:0.8443 R:0.2300 gloss:0.0377 dloss:0.0000 exploreP:0.0100
Episode:2045 meanR:0.8442 R:0.4100 gloss:0.0042 dloss:0.0000 exploreP:0.0100
Episode:2046 meanR:0.8439 R:0.9400 gloss:0.0252 dloss:0.0000 exploreP:0.0100
Episode:2047 meanR:0.8404 R:0.6300 gloss:0.0389 dloss:0.0000 exploreP:0.0100
Episode:2048 meanR:0.8266 R:0.2700 gloss:0.0062 dloss:0.0000 exploreP:0.0100
Episode:2049 meanR:0.8342 R:1.1700 gloss:0.0259 dloss:0.0000 exploreP:0.0100
Episode:2050 meanR:0.8324 R:0.6800 gloss:0.0231 dloss:0.0000 exploreP:0.0100
Episode:2051 meanR:0.8345 R:0.9500 gloss:0.0340 dloss:0.0000 exploreP:0.0100
Episode:2052 meanR:0.8425 R:0.9100 gloss:0.0302 dloss:0.0000 exploreP:0.0100
Episode:2053 meanR:0.8465 R:0.7000 gloss:0.0332 dloss:0.0000 exploreP:0.0100
Episode:2054 meanR:0.8300 R:0.5700 gloss:0.0201 dloss:0.0000 exploreP:0.0100
Episode:2055 meanR:0.8344 R:0.7800 gloss:0.0259 dloss:0.0000 exploreP:0.0100

Episode:2150 meanR:0.7969 R:1.3500 gloss:0.0305 dloss:0.0000 exploreP:0.0100
Episode:2151 meanR:0.8120 R:2.4600 gloss:0.0674 dloss:0.0001 exploreP:0.0100
Episode:2152 meanR:0.8103 R:0.7400 gloss:0.0708 dloss:0.0001 exploreP:0.0100
Episode:2153 meanR:0.8081 R:0.4800 gloss:0.0358 dloss:0.0000 exploreP:0.0100
Episode:2154 meanR:0.8075 R:0.5100 gloss:0.0130 dloss:0.0000 exploreP:0.0100
Episode:2155 meanR:0.8061 R:0.6400 gloss:0.0313 dloss:0.0000 exploreP:0.0100
Episode:2156 meanR:0.8284 R:2.3900 gloss:0.0555 dloss:0.0001 exploreP:0.0100
Episode:2157 meanR:0.8286 R:0.7000 gloss:0.0627 dloss:0.0001 exploreP:0.0100
Episode:2158 meanR:0.8222 R:0.0000 gloss:0.0159 dloss:0.0000 exploreP:0.0100
Episode:2159 meanR:0.8173 R:0.1600 gloss:0.0018 dloss:0.0000 exploreP:0.0100
Episode:2160 meanR:0.8125 R:0.7000 gloss:0.0213 dloss:0.0000 exploreP:0.0100
Episode:2161 meanR:0.8073 R:0.0200 gloss:0.0049 dloss:0.0000 exploreP:0.0100
Episode:2162 meanR:0.8058 R:0.4400 gloss:0.0061 dloss:0.0000 exploreP:0.0100

Episode:2257 meanR:0.8438 R:0.9800 gloss:0.0207 dloss:0.0000 exploreP:0.0100
Episode:2258 meanR:0.8521 R:0.8300 gloss:0.0307 dloss:0.0000 exploreP:0.0100
Episode:2259 meanR:0.8521 R:0.1600 gloss:0.0231 dloss:0.0000 exploreP:0.0100
Episode:2260 meanR:0.8505 R:0.5400 gloss:0.0060 dloss:0.0000 exploreP:0.0100
Episode:2261 meanR:0.8549 R:0.4600 gloss:0.0212 dloss:0.0000 exploreP:0.0100
Episode:2262 meanR:0.8604 R:0.9900 gloss:0.0266 dloss:0.0000 exploreP:0.0100
Episode:2263 meanR:0.8467 R:0.3800 gloss:0.0266 dloss:0.0000 exploreP:0.0100
Episode:2264 meanR:0.8504 R:1.2200 gloss:0.0293 dloss:0.0000 exploreP:0.0100
Episode:2265 meanR:0.8542 R:1.4200 gloss:0.0448 dloss:0.0001 exploreP:0.0100
Episode:2266 meanR:0.8485 R:0.0800 gloss:0.0412 dloss:0.0000 exploreP:0.0100
Episode:2267 meanR:0.8547 R:1.2700 gloss:0.0322 dloss:0.0000 exploreP:0.0100
Episode:2268 meanR:0.8511 R:0.2600 gloss:0.0283 dloss:0.0000 exploreP:0.0100
Episode:2269 meanR:0.8579 R:0.8400 gloss:0.0240 dloss:0.0000 exploreP:0.0100

Episode:2364 meanR:0.5704 R:0.5000 gloss:0.0113 dloss:0.0000 exploreP:0.0100
Episode:2365 meanR:0.5564 R:0.0200 gloss:0.0132 dloss:0.0000 exploreP:0.0100
Episode:2366 meanR:0.5583 R:0.2700 gloss:0.0053 dloss:0.0000 exploreP:0.0100
Episode:2367 meanR:0.5532 R:0.7600 gloss:0.0137 dloss:0.0000 exploreP:0.0100
Episode:2368 meanR:0.5558 R:0.5200 gloss:0.0269 dloss:0.0000 exploreP:0.0100
Episode:2369 meanR:0.5566 R:0.9200 gloss:0.0276 dloss:0.0000 exploreP:0.0100
Episode:2370 meanR:0.5414 R:0.2300 gloss:0.0218 dloss:0.0000 exploreP:0.0100
Episode:2371 meanR:0.5401 R:0.6100 gloss:0.0104 dloss:0.0000 exploreP:0.0100
Episode:2372 meanR:0.5348 R:1.1000 gloss:0.0481 dloss:0.0001 exploreP:0.0100
Episode:2373 meanR:0.5272 R:0.0000 gloss:0.0100 dloss:0.0000 exploreP:0.0100
Episode:2374 meanR:0.5273 R:1.2000 gloss:0.0192 dloss:0.0000 exploreP:0.0100
Episode:2375 meanR:0.5189 R:0.6600 gloss:0.0299 dloss:0.0000 exploreP:0.0100
Episode:2376 meanR:0.5335 R:1.9800 gloss:0.0379 dloss:0.0001 exploreP:0.0100

Episode:2471 meanR:0.8898 R:2.4400 gloss:0.0695 dloss:0.0001 exploreP:0.0100
Episode:2472 meanR:0.8866 R:0.7800 gloss:0.0514 dloss:0.0001 exploreP:0.0100
Episode:2473 meanR:0.8952 R:0.8600 gloss:0.0292 dloss:0.0000 exploreP:0.0100
Episode:2474 meanR:0.8888 R:0.5600 gloss:0.0183 dloss:0.0000 exploreP:0.0100
Episode:2475 meanR:0.8899 R:0.7700 gloss:0.0248 dloss:0.0000 exploreP:0.0100
Episode:2476 meanR:0.8726 R:0.2500 gloss:0.0141 dloss:0.0000 exploreP:0.0100
Episode:2477 meanR:0.8726 R:0.4200 gloss:0.0129 dloss:0.0000 exploreP:0.0100
Episode:2478 meanR:0.8789 R:0.7400 gloss:0.0110 dloss:0.0000 exploreP:0.0100
Episode:2479 meanR:0.8742 R:0.8700 gloss:0.0249 dloss:0.0000 exploreP:0.0100
Episode:2480 meanR:0.8853 R:1.5000 gloss:0.0327 dloss:0.0000 exploreP:0.0100
Episode:2481 meanR:0.8857 R:0.6300 gloss:0.0455 dloss:0.0001 exploreP:0.0100
Episode:2482 meanR:0.9015 R:1.7900 gloss:0.0152 dloss:0.0000 exploreP:0.0100
Episode:2483 meanR:0.8978 R:0.7700 gloss:0.0626 dloss:0.0001 exploreP:0.0100

Episode:2578 meanR:0.5900 R:0.0900 gloss:0.0056 dloss:0.0000 exploreP:0.0100
Episode:2579 meanR:0.5854 R:0.4100 gloss:0.0042 dloss:0.0000 exploreP:0.0100
Episode:2580 meanR:0.5704 R:0.0000 gloss:0.0071 dloss:0.0000 exploreP:0.0100
Episode:2581 meanR:0.5650 R:0.0900 gloss:0.0004 dloss:0.0000 exploreP:0.0100
Episode:2582 meanR:0.5580 R:1.0900 gloss:0.0075 dloss:0.0000 exploreP:0.0100
Episode:2583 meanR:0.5569 R:0.6600 gloss:0.0303 dloss:0.0001 exploreP:0.0100
Episode:2584 meanR:0.5739 R:1.7000 gloss:0.0178 dloss:0.0000 exploreP:0.0100
Episode:2585 meanR:0.5697 R:0.3000 gloss:0.0298 dloss:0.0001 exploreP:0.0100
Episode:2586 meanR:0.5703 R:0.6700 gloss:0.0114 dloss:0.0000 exploreP:0.0100
Episode:2587 meanR:0.5709 R:1.0500 gloss:0.0142 dloss:0.0000 exploreP:0.0100
Episode:2588 meanR:0.5784 R:1.0900 gloss:0.0326 dloss:0.0001 exploreP:0.0100
Episode:2589 meanR:0.5876 R:0.9200 gloss:0.0146 dloss:0.0000 exploreP:0.0100
Episode:2590 meanR:0.5887 R:0.2400 gloss:0.0126 dloss:0.0000 exploreP:0.0100

Episode:2685 meanR:0.6929 R:0.0300 gloss:0.0085 dloss:0.0000 exploreP:0.0100
Episode:2686 meanR:0.6935 R:0.7300 gloss:0.0168 dloss:0.0000 exploreP:0.0100
Episode:2687 meanR:0.6905 R:0.7500 gloss:0.0142 dloss:0.0000 exploreP:0.0100
Episode:2688 meanR:0.6947 R:1.5100 gloss:0.0366 dloss:0.0001 exploreP:0.0100
Episode:2689 meanR:0.6865 R:0.1000 gloss:0.0088 dloss:0.0000 exploreP:0.0100
Episode:2690 meanR:0.6885 R:0.4400 gloss:0.0084 dloss:0.0000 exploreP:0.0100
Episode:2691 meanR:0.6930 R:1.3000 gloss:0.0232 dloss:0.0000 exploreP:0.0100
Episode:2692 meanR:0.6848 R:0.1900 gloss:0.0211 dloss:0.0000 exploreP:0.0100
Episode:2693 meanR:0.6824 R:0.5400 gloss:0.0070 dloss:0.0000 exploreP:0.0100
Episode:2694 meanR:0.6897 R:1.1900 gloss:0.0207 dloss:0.0000 exploreP:0.0100
Episode:2695 meanR:0.7003 R:1.7900 gloss:0.0509 dloss:0.0001 exploreP:0.0100
Episode:2696 meanR:0.6996 R:0.0000 gloss:0.0244 dloss:0.0000 exploreP:0.0100
Episode:2697 meanR:0.6971 R:0.8300 gloss:0.0090 dloss:0.0000 exploreP:0.0100

Episode:2792 meanR:0.5046 R:0.3000 gloss:0.0067 dloss:0.0000 exploreP:0.0100
Episode:2793 meanR:0.5030 R:0.3800 gloss:0.0090 dloss:0.0000 exploreP:0.0100
Episode:2794 meanR:0.5001 R:0.9000 gloss:0.0112 dloss:0.0000 exploreP:0.0100
Episode:2795 meanR:0.4822 R:0.0000 gloss:0.0084 dloss:0.0000 exploreP:0.0100
Episode:2796 meanR:0.4856 R:0.3400 gloss:0.0006 dloss:0.0000 exploreP:0.0100
Episode:2797 meanR:0.4864 R:0.9100 gloss:0.0100 dloss:0.0000 exploreP:0.0100
Episode:2798 meanR:0.4952 R:1.4400 gloss:0.0290 dloss:0.0001 exploreP:0.0100
Episode:2799 meanR:0.4930 R:0.1300 gloss:0.0169 dloss:0.0000 exploreP:0.0100
Episode:2800 meanR:0.4890 R:0.8500 gloss:0.0091 dloss:0.0000 exploreP:0.0100
Episode:2801 meanR:0.4743 R:0.1800 gloss:0.0126 dloss:0.0000 exploreP:0.0100
Episode:2802 meanR:0.4849 R:1.3100 gloss:0.0166 dloss:0.0000 exploreP:0.0100
Episode:2803 meanR:0.4792 R:0.4600 gloss:0.0126 dloss:0.0000 exploreP:0.0100
Episode:2804 meanR:0.4761 R:0.1800 gloss:0.0089 dloss:0.0000 exploreP:0.0100

Episode:2899 meanR:0.4962 R:0.3000 gloss:0.0087 dloss:0.0000 exploreP:0.0100
Episode:2900 meanR:0.4927 R:0.5000 gloss:0.0054 dloss:0.0000 exploreP:0.0100
Episode:2901 meanR:0.4949 R:0.4000 gloss:0.0051 dloss:0.0000 exploreP:0.0100
Episode:2902 meanR:0.4865 R:0.4700 gloss:0.0110 dloss:0.0000 exploreP:0.0100
Episode:2903 meanR:0.4819 R:0.0000 gloss:0.0062 dloss:0.0000 exploreP:0.0100
Episode:2904 meanR:0.4848 R:0.4700 gloss:0.0032 dloss:0.0000 exploreP:0.0100
Episode:2905 meanR:0.4868 R:0.3300 gloss:0.0104 dloss:0.0000 exploreP:0.0100
Episode:2906 meanR:0.4905 R:0.7800 gloss:0.0124 dloss:0.0000 exploreP:0.0100
Episode:2907 meanR:0.5030 R:1.2500 gloss:0.0255 dloss:0.0001 exploreP:0.0100
Episode:2908 meanR:0.4978 R:0.1300 gloss:0.0121 dloss:0.0000 exploreP:0.0100
Episode:2909 meanR:0.4891 R:0.2200 gloss:0.0026 dloss:0.0000 exploreP:0.0100
Episode:2910 meanR:0.4913 R:0.3500 gloss:0.0140 dloss:0.0000 exploreP:0.0100
Episode:2911 meanR:0.4878 R:0.0500 gloss:0.0014 dloss:0.0000 exploreP:0.0100

Episode:3006 meanR:0.4735 R:0.0000 gloss:0.0048 dloss:0.0000 exploreP:0.0100
Episode:3007 meanR:0.4677 R:0.6700 gloss:0.0085 dloss:0.0000 exploreP:0.0100
Episode:3008 meanR:0.4798 R:1.3400 gloss:0.0262 dloss:0.0000 exploreP:0.0100
Episode:3009 meanR:0.4938 R:1.6200 gloss:0.0440 dloss:0.0001 exploreP:0.0100
Episode:3010 meanR:0.4914 R:0.1100 gloss:0.0175 dloss:0.0000 exploreP:0.0100
Episode:3011 meanR:0.4989 R:0.8000 gloss:0.0203 dloss:0.0000 exploreP:0.0100
Episode:3012 meanR:0.5004 R:0.4900 gloss:0.0113 dloss:0.0000 exploreP:0.0100
Episode:3013 meanR:0.4992 R:0.4000 gloss:0.0117 dloss:0.0000 exploreP:0.0100
Episode:3014 meanR:0.5026 R:0.9600 gloss:0.0212 dloss:0.0000 exploreP:0.0100
Episode:3015 meanR:0.4975 R:0.0700 gloss:0.0035 dloss:0.0000 exploreP:0.0100
Episode:3016 meanR:0.4925 R:0.0800 gloss:0.0028 dloss:0.0000 exploreP:0.0100
Episode:3017 meanR:0.4923 R:0.1300 gloss:0.0008 dloss:0.0000 exploreP:0.0100
Episode:3018 meanR:0.4942 R:0.4200 gloss:0.0054 dloss:0.0000 exploreP:0.0100

Episode:3113 meanR:0.5256 R:1.0000 gloss:0.0296 dloss:0.0000 exploreP:0.0100
Episode:3114 meanR:0.5201 R:0.4100 gloss:0.0257 dloss:0.0000 exploreP:0.0100
Episode:3115 meanR:0.5194 R:0.0000 gloss:0.0025 dloss:0.0000 exploreP:0.0100
Episode:3116 meanR:0.5193 R:0.0700 gloss:0.0018 dloss:0.0000 exploreP:0.0100
Episode:3117 meanR:0.5274 R:0.9400 gloss:0.0129 dloss:0.0000 exploreP:0.0100
Episode:3118 meanR:0.5264 R:0.3200 gloss:0.0214 dloss:0.0000 exploreP:0.0100
Episode:3119 meanR:0.5234 R:0.6100 gloss:0.0119 dloss:0.0000 exploreP:0.0100
Episode:3120 meanR:0.5317 R:0.8600 gloss:0.0255 dloss:0.0000 exploreP:0.0100
Episode:3121 meanR:0.5360 R:0.4300 gloss:0.0219 dloss:0.0000 exploreP:0.0100
Episode:3122 meanR:0.5524 R:1.6400 gloss:0.0220 dloss:0.0000 exploreP:0.0100
