# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```

In [2]:
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_v1/Reacher.x86_64')
env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_OneAgent/Reacher_Linux/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.  Each action is a vector with four numbers, corresponding to torque applicable to two joints.  Every entry in the action vector must be a number between `-1` and `1`.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726671e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance, if it selects an action at random with each time step.  A window should pop up that allows you to observe the agent, as it moves through the environment.  

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to tne environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.0


When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [8]:
# Testing the train mode
env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
state = env_info.vector_observations[0]                  # get the current state (for each agent)
#scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    action = np.random.randn(num_agents, action_size) # select an action (for each agent)
    #actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(action)[brain_name]           # send all actions to tne environment
    next_state = env_info.vector_observations[0]         # get next state (for each agent)
    reward = env_info.rewards[0]                         # get reward (for each agent)
    done = env_info.local_done[0]                        # see if episode finished
    #scores += env_info.rewards                         # update the score (for each agent)
    state = next_state                               # roll over states to next time step
    if done is True:                                  # exit loop if episode finished
        print(action.shape, reward)
        print(done)
        break
# print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

(1, 4) 0.0
True


## Option 1: Solve the First Version
The task is episodic, and in order to solve the environment, your agent must get an average score of +30 over 100 consecutive episodes.

In [9]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [10]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.float32, [None, action_size], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    return states, actions, targetQs

In [11]:
# Generator/Controller: Generating/prediting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits

In [12]:
# Discriminator/Dopamine: Reward function/planner/naviator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [13]:
def model_loss(action_size, hidden_size, states, actions, targetQs):
    # G
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    #actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    actions_labels = tf.nn.sigmoid(actions)
    # neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
    #                                                                   labels=actions_labels)
    neg_log_prob_actions = tf.nn.sigmoid_cross_entropy_with_logits(logits=actions_logits, 
                                                                   labels=actions_labels)
    #g_loss = tf.reduce_mean(neg_log_prob_actions * targetQs) # error!
    
    # D
    Qs = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    d_loss = tf.reduce_mean(tf.square(Qs - targetQs))
    g_loss = tf.reduce_mean(neg_log_prob_actions * Qs)
    return actions_logits, Qs, g_loss, d_loss

In [14]:
# Optimizating/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [15]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs = model_input(state_size=state_size, action_size=action_size)

        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, targetQs=self.targetQs) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [16]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)

In [17]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1, 33) actions:(1, 4)
action size:3.0


In [18]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
# state_size = 37
# state_size_ = (84, 84, 3)
state_size = 33
action_size = 4
hidden_size = 33*2             # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
gamma = 0.99                   # future reward discount
memory_size = 1000            # memory capacity
batch_size = 1000             # experience mini-batch size

In [19]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [20]:
# Initializing the memory buffer
env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
state = env_info.vector_observations[0]                  # get the current state (for each agent)
for _ in range(memory_size):
    action = np.random.randn(num_agents, action_size) # select an action (for each agent)
    #action = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(action)[brain_name]           # send all actions to tne environment
    next_state = env_info.vector_observations[0]         # get next state (for each agent)
    reward = env_info.rewards[0]                         # get reward (for each agent)
    done = env_info.local_done[0]                        # see if episode finished
    memory.buffer.append([state, action.reshape([-1]), next_state, reward, float(done)])
    #print(state.shape, action.reshape([-1]).shape, reward, float(done))
    state = next_state                               # roll over states to next time step
    if done is True:                                  # exit loop if episode finished
        print(done)
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        state = env_info.vector_observations[0]                  # get the current state (for each agent)
        break

In [21]:
# len(memory.buffer), memory.buffer[100]

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list, gloss_list, dloss_list = [], [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model-reacher-Continuous_Control.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        gloss_batch, dloss_batch = [], []
        #state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        state = env_info.vector_observations[0]                  # get the current state (for each agent)

        # Training steps/batches
        while True:
            # Explore (Env) or Exploit (Model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                #action = env.action_space.sample()
                action = np.random.randn(num_agents, action_size) # select an action (for each agent)
            else:
                action = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
                #print(action.shape)
                #action = np.reshape(action_logits, [-1]) # For continuous action space
                #action = np.argmax(action_logits) # For discrete action space
            #next_state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]           # send all actions to tne environment
            next_state = env_info.vector_observations[0]         # get next state (for each agent)
            reward = env_info.rewards[0]                         # get reward (for each agent)
            done = env_info.local_done[0]                        # see if episode finished
            memory.buffer.append([state, action.reshape([-1]), next_state, reward, float(done)])
            total_reward += reward
            state = next_state

            # Training
            #batch = memory.sample(batch_size)
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            nextQs_logits = sess.run(model.Qs_logits, feed_dict = {model.states: next_states})
            #nextQs = np.max(nextQs_logits, axis=1) * (1-dones)
            nextQs = nextQs_logits.reshape([-1]) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            #print(targetQs.shape)
            gloss, dloss, _, _ = sess.run([model.g_loss, model.d_loss, model.g_opt, model.d_opt],
                                            feed_dict = {model.states: states, 
                                                         model.actions: actions,
                                                         model.targetQs: targetQs})
            gloss_batch.append(gloss)
            dloss_batch.append(dloss)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dloss:{:.4f}'.format(np.mean(dloss_batch)),
              'exploreP:{:.4f}'.format(explore_p))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        # Break episode/epoch loop
        ## Option 1: Solve the First Version
        #The task is episodic, and in order to solve the environment, 
        #your agent must get an average score of +30 over 100 consecutive episodes.        
        if np.mean(episode_reward) >= +30:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model-reacher-Continuous_Control.ckpt')

Episode:0 meanR:0.0700 R:0.0700 gloss:-0.5387 dloss:0.0776 exploreP:0.9057
Episode:1 meanR:0.2200 R:0.3700 gloss:1.4941 dloss:0.0100 exploreP:0.8204
Episode:2 meanR:0.4133 R:0.8000 gloss:4.5103 dloss:0.0052 exploreP:0.7432
Episode:3 meanR:0.3700 R:0.2400 gloss:7.0446 dloss:0.0052 exploreP:0.6734
Episode:4 meanR:0.3920 R:0.4800 gloss:1.7065 dloss:0.0094 exploreP:0.6102
Episode:5 meanR:0.4600 R:0.8000 gloss:6.8126 dloss:0.0071 exploreP:0.5530
Episode:6 meanR:0.5229 R:0.9000 gloss:5.9324 dloss:0.0056 exploreP:0.5013
Episode:7 meanR:0.4925 R:0.2800 gloss:3.6661 dloss:0.0102 exploreP:0.4545
Episode:8 meanR:0.4722 R:0.3100 gloss:3.1590 dloss:0.0248 exploreP:0.4121
Episode:9 meanR:0.4940 R:0.6900 gloss:3.6598 dloss:0.0309 exploreP:0.3738
Episode:10 meanR:0.5064 R:0.6300 gloss:8.2172 dloss:0.0329 exploreP:0.3392
Episode:11 meanR:0.4642 R:0.0000 gloss:3.7638 dloss:0.0406 exploreP:0.3078
Episode:12 meanR:0.4423 R:0.1800 gloss:1.1668 dloss:0.1349 exploreP:0.2795
Episode:13 meanR:0.4271 R:0.2300 g

Episode:109 meanR:0.4601 R:0.0000 gloss:-0.2024 dloss:0.0470 exploreP:0.0100
Episode:110 meanR:0.4538 R:0.0000 gloss:-0.5011 dloss:0.0955 exploreP:0.0100
Episode:111 meanR:0.4554 R:0.1600 gloss:15.0748 dloss:0.6286 exploreP:0.0100
Episode:112 meanR:0.4574 R:0.3800 gloss:20.3729 dloss:0.2324 exploreP:0.0100
Episode:113 meanR:0.4661 R:1.1000 gloss:-0.1174 dloss:0.0832 exploreP:0.0100
Episode:114 meanR:0.4654 R:0.0000 gloss:0.0936 dloss:0.0601 exploreP:0.0100
Episode:115 meanR:0.4686 R:0.3200 gloss:0.0251 dloss:0.0326 exploreP:0.0100
Episode:116 meanR:0.4655 R:0.0600 gloss:-0.0970 dloss:0.0340 exploreP:0.0100
Episode:117 meanR:0.4690 R:0.3500 gloss:-0.2038 dloss:0.0606 exploreP:0.0100
Episode:118 meanR:0.4774 R:0.8400 gloss:-0.2867 dloss:0.0411 exploreP:0.0100
Episode:119 meanR:0.4871 R:0.9700 gloss:0.2189 dloss:0.0361 exploreP:0.0100
Episode:120 meanR:0.4899 R:0.6600 gloss:0.0201 dloss:0.0326 exploreP:0.0100
Episode:121 meanR:0.4994 R:0.9500 gloss:-0.2218 dloss:0.0341 exploreP:0.0100
Epi

Episode:217 meanR:0.9772 R:1.2000 gloss:0.0214 dloss:0.0000 exploreP:0.0100
Episode:218 meanR:0.9733 R:0.4500 gloss:0.0182 dloss:0.0000 exploreP:0.0100
Episode:219 meanR:0.9701 R:0.6500 gloss:0.0119 dloss:0.0000 exploreP:0.0100
Episode:220 meanR:0.9743 R:1.0800 gloss:0.0238 dloss:0.0000 exploreP:0.0100
Episode:221 meanR:0.9744 R:0.9600 gloss:0.0290 dloss:0.0001 exploreP:0.0100
Episode:222 meanR:0.9721 R:0.9100 gloss:0.0182 dloss:0.0000 exploreP:0.0100
Episode:223 meanR:0.9759 R:1.0000 gloss:0.0367 dloss:0.0001 exploreP:0.0100
Episode:224 meanR:0.9753 R:0.3200 gloss:0.0106 dloss:0.0000 exploreP:0.0100
Episode:225 meanR:0.9940 R:1.8700 gloss:0.0355 dloss:0.0001 exploreP:0.0100
Episode:226 meanR:0.9984 R:0.8300 gloss:0.0350 dloss:0.0001 exploreP:0.0100
Episode:227 meanR:0.9854 R:0.0000 gloss:0.0082 dloss:0.0000 exploreP:0.0100
Episode:228 meanR:0.9805 R:0.5200 gloss:0.0070 dloss:0.0000 exploreP:0.0100
Episode:229 meanR:0.9831 R:0.3300 gloss:0.0144 dloss:0.0000 exploreP:0.0100
Episode:230 

Episode:325 meanR:1.1658 R:0.7600 gloss:0.0810 dloss:0.0001 exploreP:0.0100
Episode:326 meanR:1.1671 R:0.9600 gloss:0.0240 dloss:0.0000 exploreP:0.0100
Episode:327 meanR:1.1740 R:0.6900 gloss:0.0521 dloss:0.0001 exploreP:0.0100
Episode:328 meanR:1.1826 R:1.3800 gloss:0.0278 dloss:0.0000 exploreP:0.0100
Episode:329 meanR:1.1966 R:1.7300 gloss:0.0906 dloss:0.0001 exploreP:0.0100
Episode:330 meanR:1.2130 R:2.4700 gloss:0.0707 dloss:0.0001 exploreP:0.0100
Episode:331 meanR:1.2006 R:0.1500 gloss:0.0818 dloss:0.0001 exploreP:0.0100
Episode:332 meanR:1.2036 R:0.9400 gloss:0.0289 dloss:0.0000 exploreP:0.0100
Episode:333 meanR:1.2110 R:1.1200 gloss:0.0363 dloss:0.0000 exploreP:0.0100
Episode:334 meanR:1.2152 R:0.5900 gloss:0.0501 dloss:0.0000 exploreP:0.0100
Episode:335 meanR:1.2197 R:1.3900 gloss:0.0258 dloss:0.0000 exploreP:0.0100
Episode:336 meanR:1.2196 R:0.1800 gloss:0.0551 dloss:0.0001 exploreP:0.0100
Episode:337 meanR:1.2213 R:1.0700 gloss:0.0182 dloss:0.0000 exploreP:0.0100
Episode:338 

Episode:433 meanR:0.9399 R:0.5400 gloss:0.0168 dloss:0.0000 exploreP:0.0100
Episode:434 meanR:0.9404 R:0.6400 gloss:0.0153 dloss:0.0000 exploreP:0.0100
Episode:435 meanR:0.9345 R:0.8000 gloss:0.0224 dloss:0.0000 exploreP:0.0100
Episode:436 meanR:0.9452 R:1.2500 gloss:0.0310 dloss:0.0000 exploreP:0.0100
Episode:437 meanR:0.9430 R:0.8500 gloss:0.0254 dloss:0.0000 exploreP:0.0100
Episode:438 meanR:0.9332 R:0.2500 gloss:0.0282 dloss:0.0000 exploreP:0.0100
Episode:439 meanR:0.9081 R:0.9000 gloss:0.0187 dloss:0.0000 exploreP:0.0100
Episode:440 meanR:0.8883 R:0.4200 gloss:0.0256 dloss:0.0000 exploreP:0.0100
Episode:441 meanR:0.8795 R:0.6200 gloss:0.0200 dloss:0.0000 exploreP:0.0100
Episode:442 meanR:0.8789 R:1.2200 gloss:0.0333 dloss:0.0000 exploreP:0.0100
Episode:443 meanR:0.8743 R:1.5800 gloss:0.0613 dloss:0.0001 exploreP:0.0100
Episode:444 meanR:0.8747 R:0.5800 gloss:0.0186 dloss:0.0000 exploreP:0.0100
Episode:445 meanR:0.8727 R:0.9600 gloss:0.0353 dloss:0.0000 exploreP:0.0100
Episode:446 

Episode:541 meanR:0.8468 R:1.4900 gloss:0.0502 dloss:0.0001 exploreP:0.0100
Episode:542 meanR:0.8527 R:1.8100 gloss:0.0662 dloss:0.0001 exploreP:0.0100
Episode:543 meanR:0.8386 R:0.1700 gloss:0.0493 dloss:0.0001 exploreP:0.0100
Episode:544 meanR:0.8443 R:1.1500 gloss:0.0329 dloss:0.0000 exploreP:0.0100
Episode:545 meanR:0.8394 R:0.4700 gloss:0.0268 dloss:0.0000 exploreP:0.0100
Episode:546 meanR:0.8431 R:1.3100 gloss:0.0332 dloss:0.0000 exploreP:0.0100
Episode:547 meanR:0.8556 R:1.3300 gloss:0.0712 dloss:0.0001 exploreP:0.0100
Episode:548 meanR:0.8609 R:1.8800 gloss:0.0749 dloss:0.0001 exploreP:0.0100
Episode:549 meanR:0.8651 R:1.0000 gloss:0.0617 dloss:0.0001 exploreP:0.0100
Episode:550 meanR:0.8542 R:0.4900 gloss:0.0409 dloss:0.0000 exploreP:0.0100
Episode:551 meanR:0.8648 R:1.7900 gloss:0.0753 dloss:0.0001 exploreP:0.0100
Episode:552 meanR:0.8767 R:2.2200 gloss:0.0688 dloss:0.0001 exploreP:0.0100
Episode:553 meanR:0.8822 R:1.7700 gloss:0.0820 dloss:0.0001 exploreP:0.0100
Episode:554 