# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```

In [2]:
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_v1/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_OneAgent/Reacher_Linux/Reacher.x86_64')
env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_NoVis_OneAgent/Reacher_Linux_NoVis/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.  Each action is a vector with four numbers, corresponding to torque applicable to two joints.  Every entry in the action vector must be a number between `-1` and `1`.
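
As a minimal sketch of the action format described above, the snippet below samples a random action for one agent and clips every entry to the range `[-1, 1]`. The single-agent shape and the variable names are assumptions for illustration; the Unity environment is not needed to run it.

```python
import numpy as np

num_agents = 1    # assumption: single-agent version of the environment
action_size = 4   # four action entries per agent, as described above

# Sample from a standard normal, then clip every entry into [-1, 1]
actions = np.clip(np.random.randn(num_agents, action_size), -1, 1)
print(actions.shape)
```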

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726671e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects a random action at each time step.  A window should pop up that lets you observe the agent as it moves through the environment; with a headless (`NoVis`) build, no window appears.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.0


When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
# Testing the train mode (single agent)
env_info = env.reset(train_mode=True)[brain_name]     # reset the environment
state = env_info.vector_observations[0]               # get the current state (first agent)
num_steps = 0
while True:
    num_steps += 1
    action = np.random.randn(num_agents, action_size) # select a random action
    action = np.clip(action, -1, 1)                   # all actions between -1 and 1
    env_info = env.step(action)[brain_name]           # send the action to the environment
    next_state = env_info.vector_observations[0]      # get next state (first agent)
    reward = env_info.rewards[0]                      # get reward (first agent)
    done = env_info.local_done[0]                     # see if episode finished
    state = next_state                                # roll over state to next time step
    if done:                                          # exit loop if episode finished
        break
# NOTE: `scores` is not updated in this cell; it still holds the values from the previous cell
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))
num_steps

Total score (averaged over agents) this episode: 0.0


1001

## Option 1: Solve the First Version
The task is episodic, and in order to solve the environment, your agent must get an average score of +30 over 100 consecutive episodes.
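
The solve criterion above can be sketched with a fixed-length window of recent episode scores. This is only an illustration: the helper name `is_solved` and the sample scores are hypothetical, not part of the project code.

```python
from collections import deque

def is_solved(scores_window, target=30.0, window=100):
    """Return True once `window` scores are available and their mean reaches `target`."""
    return len(scores_window) == window and sum(scores_window) / window >= target

recent_scores = deque(maxlen=100)    # keeps only the 100 most recent episode scores
for episode_score in [31.0] * 100:   # hypothetical scores for illustration
    recent_scores.append(episode_score)
print(is_solved(recent_scores))
```

Note that the training loop later in this notebook stops as soon as the running mean reaches +30, even if fewer than 100 episodes have been recorded; the sketch additionally requires a full window.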

In [8]:
# Check the TensorFlow version and whether a GPU is available
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: /device:GPU:0


In [9]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float64, [None, state_size], name='states')
    actions = tf.placeholder(tf.float64, [None, action_size], name='actions')
    targetQs = tf.placeholder(tf.float64, [None], name='targetQs')
    return states, actions, targetQs

In [10]:
# Generator/Controller: generating/predicting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits
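
The expression `tf.maximum(alpha * x, x)` used after each batch-normalization layer is, for `0 < alpha < 1`, a leaky ReLU. A small NumPy sketch of the same expression follows; the function name `leaky_relu` and the sample input are ours, not part of the notebook.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # For 0 < alpha < 1 this returns x where x >= 0 and alpha * x where x < 0
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))
```

Negative inputs are scaled by `alpha` instead of being zeroed, so their gradient does not vanish.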

In [11]:
# Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [12]:
def model_loss(action_size, hidden_size, states, actions, targetQs):
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    neg_log_prob_actions = tf.nn.sigmoid_cross_entropy_with_logits(logits=actions_logits,
                                                                   labels=tf.nn.sigmoid(actions))
    targetQs = tf.reshape(targetQs, shape=[-1, 1])
    gloss = tf.reduce_mean(neg_log_prob_actions * targetQs) # DPG
    gQs = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states) # nextQs
    dloss = tf.reduce_mean(tf.square(gQs - targetQs)) # DQN
    dQs = discriminator(actions=actions, hidden_size=hidden_size, states=states, reuse=True) #Qs
    dloss += tf.reduce_mean(tf.square(dQs - targetQs)) # DQN
    dloss /= 2 # if dQs==gQs
    gloss1 = tf.reduce_mean(neg_log_prob_actions)
    gloss2 = tf.reduce_mean(gQs)
    gloss3 = tf.reduce_mean(dQs)
    gloss4 = tf.reduce_mean(targetQs)
    return actions_logits, gQs, gloss, dloss, gloss1, gloss2, gloss3, gloss4
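
The `targetQs` placeholder consumed by `model_loss` is fed from the training loop further below, where it is computed as `rewards + gamma * nextQs` with `nextQs` zeroed for terminal transitions. A minimal NumPy sketch of that one-step target, using a made-up mini-batch (the names `next_qs` and `target_qs` and all numbers are illustrative assumptions):

```python
import numpy as np

gamma = 0.99  # discount factor, same value as in the hyperparameter cell

# Hypothetical mini-batch of three transitions
rewards = np.array([0.0, 0.1, 0.0])
next_qs = np.array([0.5, 0.2, 0.8])   # value estimates for the next states
dones = np.array([0.0, 0.0, 1.0])     # 1.0 marks a terminal transition

# Terminal transitions do not bootstrap from the next state
target_qs = rewards + gamma * next_qs * (1.0 - dones)
print(target_qs)
```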

In [13]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, g_learning_rate, d_learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(g_learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(d_learning_rate).minimize(d_loss, var_list=d_vars)
        
    return g_opt, d_opt

In [14]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, glearning_rate, dlearning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs = model_input(state_size=state_size, action_size=action_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_loss, self.g_loss1, self.g_loss2, self.g_loss3, self.g_loss4 = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, targetQs=self.targetQs) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, 
                                           g_learning_rate=glearning_rate, d_learning_rate=dlearning_rate)

In [15]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
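
A short usage sketch of this replay memory, restated here so the snippet runs on its own. The stored items are plain integers standing in for transitions, which is an assumption made purely for illustration.

```python
from collections import deque
import numpy as np

class Memory:
    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)  # oldest items are evicted first

    def sample(self, batch_size):
        # Draw distinct indices uniformly at random
        idx = np.random.choice(len(self.buffer), size=batch_size, replace=False)
        return [self.buffer[i] for i in idx]

memory = Memory(max_size=5)
for step in range(8):            # hypothetical transitions, stored as integers
    memory.buffer.append(step)

print(list(memory.buffer))       # only the five most recent items remain
batch = memory.sample(batch_size=3)
```

Because sampling uses `replace=False`, `sample` raises a `ValueError` when `batch_size` exceeds the number of stored items, so the buffer has to be pre-filled before training starts.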

In [16]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1, 33) actions:(1, 4)
action size:2.697119985869685


In [17]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 33
action_size = 4
hidden_size = 33*2             # number of units in each hidden layer
glearning_rate = 0.0001         # generator learning rate
dlearning_rate = 0.0001         # discriminator learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 100              # experience mini-batch size
gamma = 0.99                   # future reward discount
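
To see how the exploration parameters above interact, the following sketch evaluates the decay schedule used in the training loop, `explore_stop + (explore_start - explore_stop) * exp(-decay_rate * total_step)`, at a few step counts. The helper name `explore_probability` and the chosen step counts are hypothetical.

```python
import math

explore_start = 1.0     # exploration probability at start
explore_stop = 0.01     # minimum exploration probability
decay_rate = 0.0001     # exponential decay rate

def explore_probability(total_step):
    # Decays exponentially from explore_start towards explore_stop
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * total_step)

for step in (0, 1000, 10000, 100000):
    print(step, round(explore_probability(step), 4))
```

With roughly 1000 steps per episode, the probability drops to about 0.91 after the first episode and is close to its floor of 0.01 after about 100 episodes.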

In [18]:
# Reset/init the graph/session
tf.reset_default_graph()          # returns None, so fetch the new default graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size, 
              glearning_rate=glearning_rate, dlearning_rate=dlearning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [19]:
# Initializing the memory buffer
env_info = env.reset(train_mode=True)[brain_name]     # reset the environment
state = env_info.vector_observations[0]               # get the current state (first agent)
num_steps = 0
for _ in range(memory_size):
    num_steps += 1
    action = np.random.randn(num_agents, action_size) # select a random action
    #action = np.clip(action, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(action)[brain_name]           # send the action to the environment
    next_state = env_info.vector_observations[0]      # get next state (first agent)
    reward = env_info.rewards[0]                      # get reward (first agent)
    done = env_info.local_done[0]                     # see if episode finished
    memory.buffer.append([state, action.reshape([-1]), next_state, reward, float(done)])
    state = next_state                                # roll over state to next time step
    if done:                                          # stop filling after the first episode ends
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment
        state = env_info.vector_observations[0]               # get the current state (first agent)
        break
num_steps

1001

In [20]:
# len(memory.buffer), memory.buffer[100]

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list, gloss_list, dloss_list = [], [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        gloss_batch, dloss_batch = [], []
        gloss1_batch, gloss2_batch, gloss3_batch, gloss4_batch = [], [], [], []
        #state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        state = env_info.vector_observations[0]                  # get the current state (for each agent)

        # Training steps/batches
        for num_steps in range(1111111111):
            # Explore (Env) or Exploit (Model)
            total_step += 1 # global step counter that drives the exploration decay
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p >= np.random.rand():
                #action = env.action_space.sample()
                action = np.random.randn(num_agents, action_size) # select an action (for each agent)
            else:
                action = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
            #next_state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]           # send the action to the environment
            next_state = env_info.vector_observations[0]      # get next state (first agent)
            reward = env_info.rewards[0]                      # get reward (first agent)
            done = env_info.local_done[0]                     # see if episode finished
            memory.buffer.append([state, action.reshape([-1]), next_state, reward, float(done)])
            total_reward += reward
            state = next_state

            # Training
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            nextQs_logits = sess.run(model.Qs_logits, feed_dict = {model.states: next_states})
            #nextQs = np.max(nextQs_logits, axis=1) * (1-dones)
            nextQs = nextQs_logits.reshape([-1]) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            gloss, dloss, gloss1, gloss2, gloss3, gloss4, _, _ = sess.run([model.g_loss, model.d_loss,
                                                                           model.g_loss1, model.g_loss2, 
                                                                           model.g_loss3, model.g_loss4,
                                                                           model.g_opt, model.d_opt],
                                                                          feed_dict = {model.states: states, 
                                                                                       model.actions: actions,
                                                                                       model.targetQs: targetQs})
            gloss_batch.append(gloss)
            dloss_batch.append(dloss)
            gloss1_batch.append(gloss1)
            gloss2_batch.append(gloss2)
            gloss3_batch.append(gloss3)
            gloss4_batch.append(gloss4)
            if done:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'Steps:{}'.format(num_steps),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'gloss1-lgP:{:.4f}'.format(np.mean(gloss1_batch)), #-logp
              'gloss2gQs:{:.4f}'.format(np.mean(gloss2_batch)),#gQs
              'gloss3dQs:{:.4f}'.format(np.mean(gloss3_batch)),#dQs
              'gloss4tgtQ:{:.4f}'.format(np.mean(gloss4_batch)),#tgtQs
              'dloss:{:.4f}'.format(np.mean(dloss_batch)),
              'exploreP:{:.4f}'.format(explore_p))

        # Store values for plotting
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        # Break episode/epoch loop
        ## Option 1: Solve the First Version
        #The task is episodic, and in order to solve the environment, 
        #your agent must get an average score of +30 over 100 consecutive episodes.        
        if np.mean(episode_reward) >= +30:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 Steps:1000 meanR:0.0000 R:0.0000 gloss:0.5825 gloss1-lgP:0.8450 gloss2gQs:0.7646 gloss3dQs:0.7457 gloss4tgtQ:0.7574 dloss:0.0589 exploreP:0.9057
Episode:1 Steps:1000 meanR:0.0600 R:0.1200 gloss:0.4479 gloss1-lgP:0.8812 gloss2gQs:0.5601 gloss3dQs:0.5463 gloss4tgtQ:0.5538 dloss:0.0374 exploreP:0.8204
Episode:2 Steps:1000 meanR:0.0400 R:0.0000 gloss:0.2329 gloss1-lgP:0.7268 gloss2gQs:0.3202 gloss3dQs:0.3170 gloss4tgtQ:0.3165 dloss:0.0117 exploreP:0.7432
Episode:3 Steps:1000 meanR:0.0575 R:0.1100 gloss:0.0395 gloss1-lgP:0.7081 gloss2gQs:0.0574 gloss3dQs:0.0566 gloss4tgtQ:0.0569 dloss:0.0067 exploreP:0.6734
Episode:4 Steps:1000 meanR:0.0460 R:0.0000 gloss:0.0389 gloss1-lgP:0.7056 gloss2gQs:0.0566 gloss3dQs:0.0556 gloss4tgtQ:0.0563 dloss:0.0049 exploreP:0.6102
Episode:5 Steps:1000 meanR:0.0383 R:0.0000 gloss:0.0188 gloss1-lgP:0.7080 gloss2gQs:0.0281 gloss3dQs:0.0269 gloss4tgtQ:0.0276 dloss:0.0036 exploreP:0.5530
Episode:6 Steps:1000 meanR:0.0471 R:0.1000 gloss:0.0236 gloss1-lgP:0.7

Episode:53 Steps:1000 meanR:0.2852 R:0.9100 gloss:0.0170 gloss1-lgP:0.6387 gloss2gQs:0.0263 gloss3dQs:0.0263 gloss4tgtQ:0.0263 dloss:0.0000 exploreP:0.0144
Episode:54 Steps:1000 meanR:0.2800 R:0.0000 gloss:0.0178 gloss1-lgP:0.6459 gloss2gQs:0.0275 gloss3dQs:0.0275 gloss4tgtQ:0.0275 dloss:0.0001 exploreP:0.0140
Episode:55 Steps:1000 meanR:0.2750 R:0.0000 gloss:0.0182 gloss1-lgP:0.6449 gloss2gQs:0.0283 gloss3dQs:0.0283 gloss4tgtQ:0.0283 dloss:0.0001 exploreP:0.0136
Episode:56 Steps:1000 meanR:0.2853 R:0.8600 gloss:0.1464 gloss1-lgP:0.7928 gloss2gQs:0.1825 gloss3dQs:0.1789 gloss4tgtQ:0.1808 dloss:0.0037 exploreP:0.0133
Episode:57 Steps:1000 meanR:0.2803 R:0.0000 gloss:0.0253 gloss1-lgP:0.6332 gloss2gQs:0.0399 gloss3dQs:0.0399 gloss4tgtQ:0.0397 dloss:0.0001 exploreP:0.0130
Episode:58 Steps:1000 meanR:0.2834 R:0.4600 gloss:0.0124 gloss1-lgP:0.6323 gloss2gQs:0.0193 gloss3dQs:0.0192 gloss4tgtQ:0.0193 dloss:0.0001 exploreP:0.0127
Episode:59 Steps:1000 meanR:0.2880 R:0.5600 gloss:0.0142 gloss1-

Episode:106 Steps:1000 meanR:0.3633 R:0.7600 gloss:0.0171 gloss1-lgP:0.6578 gloss2gQs:0.0289 gloss3dQs:0.0291 gloss4tgtQ:0.0291 dloss:0.0001 exploreP:0.0100
Episode:107 Steps:1000 meanR:0.3745 R:1.1200 gloss:0.0237 gloss1-lgP:0.5881 gloss2gQs:0.0403 gloss3dQs:0.0404 gloss4tgtQ:0.0404 dloss:0.0001 exploreP:0.0100
Episode:108 Steps:1000 meanR:0.3875 R:1.3000 gloss:0.0280 gloss1-lgP:0.5935 gloss2gQs:0.0473 gloss3dQs:0.0475 gloss4tgtQ:0.0474 dloss:0.0002 exploreP:0.0100
Episode:109 Steps:1000 meanR:0.3997 R:1.2200 gloss:0.0323 gloss1-lgP:0.5870 gloss2gQs:0.0547 gloss3dQs:0.0547 gloss4tgtQ:0.0548 dloss:0.0001 exploreP:0.0100
Episode:110 Steps:1000 meanR:0.4087 R:1.3700 gloss:0.0416 gloss1-lgP:0.5956 gloss2gQs:0.0712 gloss3dQs:0.0711 gloss4tgtQ:0.0712 dloss:0.0001 exploreP:0.0100
Episode:111 Steps:1000 meanR:0.4189 R:1.0200 gloss:0.0442 gloss1-lgP:0.5768 gloss2gQs:0.0764 gloss3dQs:0.0764 gloss4tgtQ:0.0765 dloss:0.0001 exploreP:0.0100
Episode:112 Steps:1000 meanR:0.4275 R:1.0600 gloss:0.0503 

Episode:159 Steps:1000 meanR:0.5935 R:0.6200 gloss:0.0332 gloss1-lgP:0.5950 gloss2gQs:0.0560 gloss3dQs:0.0560 gloss4tgtQ:0.0560 dloss:0.0001 exploreP:0.0100
Episode:160 Steps:1000 meanR:0.5949 R:0.1400 gloss:0.0339 gloss1-lgP:0.5950 gloss2gQs:0.0572 gloss3dQs:0.0572 gloss4tgtQ:0.0572 dloss:0.0001 exploreP:0.0100
Episode:161 Steps:1000 meanR:0.5926 R:0.3300 gloss:0.0281 gloss1-lgP:0.5936 gloss2gQs:0.0474 gloss3dQs:0.0474 gloss4tgtQ:0.0474 dloss:0.0001 exploreP:0.0100
Episode:162 Steps:1000 meanR:0.5877 R:0.1500 gloss:0.0247 gloss1-lgP:0.5930 gloss2gQs:0.0416 gloss3dQs:0.0416 gloss4tgtQ:0.0416 dloss:0.0000 exploreP:0.0100
Episode:163 Steps:1000 meanR:0.5929 R:0.6100 gloss:0.0217 gloss1-lgP:0.6012 gloss2gQs:0.0366 gloss3dQs:0.0365 gloss4tgtQ:0.0365 dloss:0.0000 exploreP:0.0100
Episode:164 Steps:1000 meanR:0.5836 R:0.7800 gloss:0.0244 gloss1-lgP:0.5941 gloss2gQs:0.0411 gloss3dQs:0.0410 gloss4tgtQ:0.0411 dloss:0.0000 exploreP:0.0100
Episode:165 Steps:1000 meanR:0.5836 R:0.3900 gloss:0.0281 

Episode:212 Steps:1000 meanR:0.7006 R:0.7600 gloss:0.0377 gloss1-lgP:0.6089 gloss2gQs:0.0620 gloss3dQs:0.0620 gloss4tgtQ:0.0620 dloss:0.0001 exploreP:0.0100
Episode:213 Steps:1000 meanR:0.7006 R:0.9700 gloss:0.0402 gloss1-lgP:0.6090 gloss2gQs:0.0662 gloss3dQs:0.0662 gloss4tgtQ:0.0662 dloss:0.0001 exploreP:0.0100
Episode:214 Steps:1000 meanR:0.6955 R:0.4700 gloss:0.0423 gloss1-lgP:0.6103 gloss2gQs:0.0697 gloss3dQs:0.0697 gloss4tgtQ:0.0697 dloss:0.0001 exploreP:0.0100
Episode:215 Steps:1000 meanR:0.6926 R:0.3800 gloss:0.0388 gloss1-lgP:0.6112 gloss2gQs:0.0634 gloss3dQs:0.0634 gloss4tgtQ:0.0634 dloss:0.0001 exploreP:0.0100
Episode:216 Steps:1000 meanR:0.6961 R:0.6900 gloss:0.0387 gloss1-lgP:0.6097 gloss2gQs:0.0636 gloss3dQs:0.0636 gloss4tgtQ:0.0635 dloss:0.0001 exploreP:0.0100
Episode:217 Steps:1000 meanR:0.6906 R:0.0000 gloss:0.0389 gloss1-lgP:0.6080 gloss2gQs:0.0638 gloss3dQs:0.0638 gloss4tgtQ:0.0639 dloss:0.0001 exploreP:0.0100
Episode:218 Steps:1000 meanR:0.6933 R:0.6000 gloss:0.0349 

Episode:265 Steps:1000 meanR:0.6839 R:0.3900 gloss:0.0418 gloss1-lgP:0.6234 gloss2gQs:0.0670 gloss3dQs:0.0670 gloss4tgtQ:0.0670 dloss:0.0001 exploreP:0.0100
Episode:266 Steps:1000 meanR:0.6838 R:1.1900 gloss:0.0405 gloss1-lgP:0.6235 gloss2gQs:0.0649 gloss3dQs:0.0649 gloss4tgtQ:0.0649 dloss:0.0001 exploreP:0.0100
Episode:267 Steps:1000 meanR:0.6764 R:0.5900 gloss:0.0427 gloss1-lgP:0.6242 gloss2gQs:0.0683 gloss3dQs:0.0683 gloss4tgtQ:0.0683 dloss:0.0001 exploreP:0.0100
Episode:268 Steps:1000 meanR:0.6783 R:0.7300 gloss:0.0383 gloss1-lgP:0.6255 gloss2gQs:0.0610 gloss3dQs:0.0610 gloss4tgtQ:0.0610 dloss:0.0001 exploreP:0.0100
Episode:269 Steps:1000 meanR:0.6758 R:0.3300 gloss:0.0396 gloss1-lgP:0.6262 gloss2gQs:0.0631 gloss3dQs:0.0631 gloss4tgtQ:0.0631 dloss:0.0001 exploreP:0.0100
Episode:270 Steps:1000 meanR:0.6751 R:0.5800 gloss:0.0427 gloss1-lgP:0.6279 gloss2gQs:0.0681 gloss3dQs:0.0681 gloss4tgtQ:0.0681 dloss:0.0001 exploreP:0.0100
Episode:271 Steps:1000 meanR:0.6818 R:1.0100 gloss:0.0412 

Episode:318 Steps:1000 meanR:0.6451 R:0.2900 gloss:0.0345 gloss1-lgP:0.6279 gloss2gQs:0.0549 gloss3dQs:0.0548 gloss4tgtQ:0.0548 dloss:0.0001 exploreP:0.0100
Episode:319 Steps:1000 meanR:0.6399 R:0.7400 gloss:0.0412 gloss1-lgP:0.6259 gloss2gQs:0.0657 gloss3dQs:0.0657 gloss4tgtQ:0.0657 dloss:0.0001 exploreP:0.0100
Episode:320 Steps:1000 meanR:0.6491 R:1.2300 gloss:0.0405 gloss1-lgP:0.6253 gloss2gQs:0.0646 gloss3dQs:0.0646 gloss4tgtQ:0.0646 dloss:0.0001 exploreP:0.0100
Episode:321 Steps:1000 meanR:0.6581 R:1.3900 gloss:0.0397 gloss1-lgP:0.6253 gloss2gQs:0.0629 gloss3dQs:0.0629 gloss4tgtQ:0.0630 dloss:0.0001 exploreP:0.0100
Episode:322 Steps:1000 meanR:0.6581 R:0.5600 gloss:0.0480 gloss1-lgP:0.6325 gloss2gQs:0.0758 gloss3dQs:0.0757 gloss4tgtQ:0.0758 dloss:0.0001 exploreP:0.0100
Episode:323 Steps:1000 meanR:0.6511 R:0.2800 gloss:0.0467 gloss1-lgP:0.6312 gloss2gQs:0.0736 gloss3dQs:0.0735 gloss4tgtQ:0.0735 dloss:0.0001 exploreP:0.0100
Episode:324 Steps:1000 meanR:0.6634 R:1.6300 gloss:0.0495 

Episode:371 Steps:1000 meanR:0.7233 R:1.4600 gloss:0.0501 gloss1-lgP:0.6375 gloss2gQs:0.0785 gloss3dQs:0.0785 gloss4tgtQ:0.0785 dloss:0.0001 exploreP:0.0100
Episode:372 Steps:1000 meanR:0.7213 R:0.7600 gloss:0.0473 gloss1-lgP:0.6369 gloss2gQs:0.0741 gloss3dQs:0.0741 gloss4tgtQ:0.0741 dloss:0.0001 exploreP:0.0100
Episode:373 Steps:1000 meanR:0.7232 R:0.7100 gloss:0.0492 gloss1-lgP:0.6359 gloss2gQs:0.0771 gloss3dQs:0.0771 gloss4tgtQ:0.0771 dloss:0.0001 exploreP:0.0100
Episode:374 Steps:1000 meanR:0.7243 R:0.3200 gloss:0.0496 gloss1-lgP:0.6401 gloss2gQs:0.0777 gloss3dQs:0.0777 gloss4tgtQ:0.0777 dloss:0.0001 exploreP:0.0100
Episode:375 Steps:1000 meanR:0.7251 R:0.7400 gloss:0.0475 gloss1-lgP:0.6332 gloss2gQs:0.0749 gloss3dQs:0.0749 gloss4tgtQ:0.0749 dloss:0.0001 exploreP:0.0100
Episode:376 Steps:1000 meanR:0.7247 R:0.9200 gloss:0.0475 gloss1-lgP:0.6331 gloss2gQs:0.0747 gloss3dQs:0.0747 gloss4tgtQ:0.0747 dloss:0.0001 exploreP:0.0100
Episode:377 Steps:1000 meanR:0.7360 R:1.4700 gloss:0.0468 

Episode:424 Steps:1000 meanR:0.7124 R:0.4300 gloss:-1.3585 gloss1-lgP:4.9334 gloss2gQs:-0.2669 gloss3dQs:-0.2473 gloss4tgtQ:-0.2635 dloss:0.0101 exploreP:0.0100
Episode:425 Steps:1000 meanR:0.7111 R:0.8700 gloss:-4530060.7435 gloss1-lgP:178.0617 gloss2gQs:-7762.4654 gloss3dQs:-429.7463 gloss4tgtQ:-7664.1698 dloss:110818960.1933 exploreP:0.0100
Episode:426 Steps:1000 meanR:0.7036 R:0.4400 gloss:-9437694.1851 gloss1-lgP:681.1457 gloss2gQs:-6697.3540 gloss3dQs:-492.2535 gloss4tgtQ:-6612.1132 dloss:61509487.3629 exploreP:0.0100
Episode:427 Steps:1000 meanR:0.7022 R:0.2300 gloss:-2641758.2040 gloss1-lgP:554.8315 gloss2gQs:-2555.6513 gloss3dQs:-405.0683 gloss4tgtQ:-2527.4517 dloss:7896045.5252 exploreP:0.0100
Episode:428 Steps:1000 meanR:0.6918 R:0.1600 gloss:-513607.0323 gloss1-lgP:423.9400 gloss2gQs:-1109.9803 gloss3dQs:-332.7251 gloss4tgtQ:-1099.3663 dloss:1217430.6506 exploreP:0.0100
Episode:429 Steps:1000 meanR:0.6853 R:0.3200 gloss:-110497.3811 gloss1-lgP:469.3274 gloss2gQs:-126.9294 g

Episode:464 Steps:1000 meanR:0.6205 R:0.1200 gloss:-61172595303055106048.0000 gloss1-lgP:1532043.8402 gloss2gQs:-36470523866575.1797 gloss3dQs:-15702266002203.4414 gloss4tgtQ:-36043292304171.5469 dloss:349677707511588803852632064.0000 exploreP:0.0100
Episode:465 Steps:1000 meanR:0.6263 R:1.3000 gloss:-62773817861606694912.0000 gloss1-lgP:1236858.1117 gloss2gQs:-51949836735086.8203 gloss3dQs:-24123847367139.8398 gloss4tgtQ:-51337957877482.7969 dloss:642504667822589929798500352.0000 exploreP:0.0100
Episode:466 Steps:1000 meanR:0.6231 R:0.4900 gloss:-61585371975228850176.0000 gloss1-lgP:969141.1595 gloss2gQs:-71496503064679.4062 gloss3dQs:-35715494721588.4766 gloss4tgtQ:-70645178362906.2188 dloss:1075586346540685414551781376.0000 exploreP:0.0100
Episode:467 Steps:1000 meanR:0.6131 R:0.5900 gloss:-59236660916680851456.0000 gloss1-lgP:717574.3678 gloss2gQs:-96326773973761.4219 gloss3dQs:-52282841965634.7500 gloss4tgtQ:-95236663841349.8438 dloss:1663186191739746790997491712.0000 exploreP:0.0

Episode:496 Steps:1000 meanR:0.5486 R:0.3100 gloss:-57078371794124877594624.0000 gloss1-lgP:1766488.3446 gloss2gQs:-24419032208122844.0000 gloss3dQs:-18109829881161100.0000 gloss4tgtQ:-24128788624429884.0000 dloss:39253739489828084327179817582592.0000 exploreP:0.0100
Episode:497 Steps:1000 meanR:0.5522 R:0.5100 gloss:-62403369001556862042112.0000 gloss1-lgP:1733960.3832 gloss2gQs:-27809915629088436.0000 gloss3dQs:-20807040070530576.0000 gloss4tgtQ:-27473339613812796.0000 dloss:49580658287707576124750291795968.0000 exploreP:0.0100
Episode:498 Steps:1000 meanR:0.5512 R:0.4900 gloss:-71768068995239852900352.0000 gloss1-lgP:1776636.2926 gloss2gQs:-31202195932580092.0000 gloss3dQs:-23599809193760600.0000 gloss4tgtQ:-30831497592582452.0000 dloss:60801269328663570224361710288896.0000 exploreP:0.0100
Episode:499 Steps:1000 meanR:0.5452 R:0.2100 gloss:-82836101708176485777408.0000 gloss1-lgP:1828124.9071 gloss2gQs:-35177357805297532.0000 gloss3dQs:-26943005582798560.0000 gloss4tgtQ:-34758292949

Episode:527 Steps:1000 meanR:0.5658 R:0.1700 gloss:-4570197963924361336324096.0000 gloss1-lgP:5784517.5140 gloss2gQs:-686683570986719104.0000 gloss3dQs:-567945526083252736.0000 gloss4tgtQ:-679094849240891264.0000 dloss:18917258018837394334782242963849216.0000 exploreP:0.0100
Episode:528 Steps:1000 meanR:0.5670 R:0.2800 gloss:-5720446688713580954517504.0000 gloss1-lgP:6563813.8983 gloss2gQs:-755437423437328768.0000 gloss3dQs:-625616058353912192.0000 gloss4tgtQ:-746885384289138816.0000 dloss:22442342898435452510137489219911680.0000 exploreP:0.0100
Episode:529 Steps:1000 meanR:0.5659 R:0.2100 gloss:-5992420106050872088199168.0000 gloss1-lgP:6369738.9683 gloss2gQs:-813949846114775936.0000 gloss3dQs:-674218896390951552.0000 gloss4tgtQ:-804534468215406464.0000 dloss:25520748345763877992806005061189632.0000 exploreP:0.0100
Episode:530 Steps:1000 meanR:0.5728 R:0.9300 gloss:-6369959969266938968801280.0000 gloss1-lgP:6316198.6117 gloss2gQs:-882014489117150848.0000 gloss3dQs:-730355037116145536.