# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```python
env = UnityEnvironment(file_name="Reacher.app")
```

In [2]:
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_v1/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_OneAgent/Reacher_Linux/Reacher.x86_64')
env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_NoVis_OneAgent/Reacher_Linux_NoVis/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/aras/unity-envs/Reacher_Linux_NoVis/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.  Each action is a vector with four numbers, corresponding to torque applicable to two joints.  Every entry in the action vector must be a number between `-1` and `1`.
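
As a minimal sketch of the action constraint described above, the snippet below draws a random four-number action and clamps each entry to `[-1, 1]`. It uses only the standard library so that it runs without the Unity environment; `sample_clipped_action` is a hypothetical helper, not part of the ML-Agents API.

```python
import random

def sample_clipped_action(action_size=4, low=-1.0, high=1.0, rng=random):
    """Draw one action from a standard normal and clamp each entry to [low, high]."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(action_size)]
    return [min(high, max(low, value)) for value in raw]

# One random action for a single agent (values differ on every run)
print(sample_clipped_action())
```

The notebook's own cells do the same thing with `np.random.randn` followed by `np.clip`.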

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726671e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance as it selects an action at random at each time step.  A window should pop up that allows you to observe the agent as it moves through the environment (no window appears if you use a headless `NoVis` build).

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.0


When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
# Testing the train mode with a single agent
env_info = env.reset(train_mode=True)[brain_name]       # reset the environment
state = env_info.vector_observations[0]                 # get the current state (first agent)
total_reward = 0.0                                      # score of this episode (not the stale `scores` from the previous cell)
num_steps = 0
while True:
    num_steps += 1
    action = np.random.randn(num_agents, action_size)   # select a random action (for each agent)
    action = np.clip(action, -1, 1)                     # all actions between -1 and 1
    env_info = env.step(action)[brain_name]             # send all actions to the environment
    next_state = env_info.vector_observations[0]        # get next state (first agent)
    reward = env_info.rewards[0]                        # get reward (first agent)
    done = env_info.local_done[0]                       # see if episode finished
    total_reward += reward                              # update the score
    state = next_state                                  # roll over state to next time step
    if done:                                            # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(total_reward))
num_steps

Total score (averaged over agents) this episode: 0.0


1001

## Option 1: Solve the First Version
The task is episodic, and in order to solve the environment, your agent must get an average score of +30 over 100 consecutive episodes.
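
The solving criterion above can be sketched as a running average over a fixed window. This is an illustrative helper under stated assumptions: `is_solved` is a hypothetical name, and it requires a full window of 100 scores before reporting success.

```python
from collections import deque

def is_solved(episode_scores, target=30.0, window=100):
    """Return True once the mean of the last `window` scores reaches `target`."""
    recent = deque(episode_scores, maxlen=window)  # keeps only the newest `window` scores
    if len(recent) < window:
        return False                               # not enough episodes played yet
    return sum(recent) / len(recent) >= target

print(is_solved([40.0] * 99))    # fewer than 100 episodes
print(is_solved([40.0] * 100))   # full window above the target
```

Requiring a full window avoids declaring success after a few lucky early episodes.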

In [8]:
# Check whether TensorFlow can see a GPU; it falls back to the CPU otherwise
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [9]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.float32, [None, action_size], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    rates = tf.placeholder(tf.float32, [None], name='rates')
    return states, actions, targetQs, rates

In [10]:
# Generator/Controller: Generating/predicting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits
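
The `tf.maximum(alpha * x, x)` pattern used in this cell is a leaky ReLU: for `0 < alpha < 1`, positive inputs pass through unchanged and negative inputs are scaled by `alpha`. A plain-Python sketch of the same element-wise rule follows; the `leaky_relu` name is illustrative only.

```python
def leaky_relu(x, alpha=0.1):
    """Element-wise leaky ReLU for a single float, valid for 0 < alpha < 1."""
    return max(alpha * x, x)

for value in (-2.0, 0.0, 3.0):
    print(value, leaky_relu(value))
```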

In [11]:
# Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
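        # First fully connected layer: project the states down to action_size units
        # (note: action_size is read from the notebook's global scope, it is not an argument)
        h1 = tf.layers.dense(inputs=states, units=action_size)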
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        fused = tf.concat(axis=1, values=[nl1, actions])
        h2 = tf.layers.dense(inputs=fused, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
                
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [12]:
def model_loss(action_size, hidden_size, states, actions, targetQs, rates):
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    neg_log_prob = tf.nn.sigmoid_cross_entropy_with_logits(logits=actions_logits, # DPG
                                                           labels=actions) # 0-1
    targetQs = tf.reshape(targetQs, shape=[-1, 1])
    gloss = tf.reduce_mean(neg_log_prob * targetQs) # DPG: r+(gamma*nextQ)
    gQs = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    dQs = discriminator(actions=actions, hidden_size=hidden_size, states=states, reuse=True) # Qs
    rates = tf.reshape(rates, shape=[-1, 1])
    dlossA = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=gQs, # GAN
                                                                    labels=rates)) # 0-1
    dlossA += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=dQs, # GAN
                                                                     labels=rates)) # 0-1
    dlossA /= 2
    dlossQ = tf.reduce_mean(tf.square(gQs - targetQs)) # DQN
    dlossQ += tf.reduce_mean(tf.square(dQs - targetQs)) # DQN
    dlossQ /= 2
    #return tf.nn.sigmoid(actions_logits), gQs, gloss, dlossA, dlossQ
    return actions_logits, gQs, gloss, dlossA, dlossQ

In [13]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_lossA, d_lossQ, g_learning_rate, d_learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(g_learning_rate).minimize(g_loss, var_list=g_vars)
        d_optA = tf.train.AdamOptimizer(d_learning_rate).minimize(d_lossA, var_list=d_vars)
        d_optQ = tf.train.AdamOptimizer(d_learning_rate).minimize(d_lossQ, var_list=d_vars)

    return g_opt, d_optA, d_optQ

In [14]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, g_learning_rate, d_learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, self.rates = model_input(state_size=state_size, 
                                                                           action_size=action_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_lossA, self.d_lossQ = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, rates=self.rates) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_optA, self.d_optQ = model_opt(g_loss=self.g_loss, 
                                                         d_lossA=self.d_lossA, 
                                                         d_lossQ=self.d_lossQ, 
                                                         g_learning_rate=g_learning_rate, 
                                                         d_learning_rate=d_learning_rate)

In [15]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size) # data batch
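
The `Memory` class above only wraps a bounded `deque`. A common extension, shown here as a sketch and not as this notebook's method, adds a uniform random `sample` helper. `ReplayMemory`, `add`, and `sample` are hypothetical names.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)  # oldest entries are evicted automatically

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        """Return up to `batch_size` experiences drawn uniformly without replacement."""
        batch_size = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), batch_size)

demo_memory = ReplayMemory(max_size=3)
for step in range(5):
    demo_memory.add((step, step + 1))
print(list(demo_memory.buffer))  # only the three newest entries remain
print(demo_memory.sample(2))
```

The training loop in this notebook instead slices a contiguous block of the buffer and filters it by episode rating, so the uniform sampler is an alternative, not a drop-in replacement.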

In [None]:
# env_info.vector_observations.shape, env_info.previous_vector_actions.shape, \
# brain.vector_action_space_size, brain.number_visual_observations, \
brain.vector_action_space_size, brain.vector_observation_space_size

(4, 33)

In [None]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01           # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 33
action_size = 4
hidden_size = 33*2             # number of units in each hidden layer
g_learning_rate = 1e-4         # generator (actor) learning rate
d_learning_rate = 1e-4         # discriminator (critic) learning rate

# Memory parameters
memory_size = int(1e5)            # memory capacity
batch_size = int(1e3)             # experience mini-batch size (roughly one episode, which lasts about 1000 steps)
gamma = 0.99                   # future reward discount
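
The exploration parameters above feed the schedule used in the training loop, `explore_stop + (explore_start - explore_stop) * exp(-decay_rate * total_step)`. The sketch below evaluates that formula on its own; `explore_probability` is a hypothetical helper name, and the defaults mirror the values in this cell.

```python
import math

def explore_probability(total_step, explore_start=1.0, explore_stop=0.01, decay_rate=0.0001):
    """Exponentially decay the exploration probability from start towards stop."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * total_step)

for step in (0, 1001, 10000, 100000):
    print(step, round(explore_probability(step), 4))
```

After one episode of 1001 steps the value is about 0.9057, which agrees with the `exploreP` printed for episode 0 in the training log.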

In [None]:
# Reset/init the graph/session
tf.reset_default_graph()         # returns None, so fetch the fresh default graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size,
              g_learning_rate=g_learning_rate, d_learning_rate=d_learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [None]:
#state = env.reset()
env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
state = env_info.vector_observations[0]                  # get the current state (for each agent)
total_reward = 0
num_step = 0
for each_step in range(memory_size):
    # action = env.action_space.sample()
    # next_state, reward, done, _ = env.step(action)
    action = np.random.randn(num_agents, action_size) # select an action (for each agent)
    action_clipped = np.clip(action, -1, 1) # all actions between -1 and 1
    action_normed = (action_clipped+1)/2 # normalizing to 0-1
    env_info = env.step(action_clipped)[brain_name]           # send all actions to the environment
    next_state = env_info.vector_observations[0]         # get next state (for each agent)
    reward = env_info.rewards[0]                         # get reward (for each agent)
    done = env_info.local_done[0]                        # see if episode finished
    memory.buffer.append([state, action_normed.reshape([-1]), next_state, reward, float(done), -1])
    num_step += 1 # memory incremented
    total_reward += reward
    state = next_state
    if done is True:
        print('each_step:', each_step, 'Left percentage:', each_step/memory_size)
        #state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        state = env_info.vector_observations[0]                  # get the current state (for each agent)
        rate = total_reward/30 # Goal-required rewards
        rate_clipped = np.clip(rate, 0, 1)
        total_reward = 0 # reset
        for idx in range(num_step): # episode length
            if memory.buffer[-1-idx][-1] == -1:
                memory.buffer[-1-idx][-1] = rate_clipped
        num_step = 0 # reset

each_step: 1000 Left percentage: 0.01
each_step: 2001 Left percentage: 0.02001
each_step: 3002 Left percentage: 0.03002
each_step: 4003 Left percentage: 0.04003
each_step: 5004 Left percentage: 0.05004
each_step: 6005 Left percentage: 0.06005
each_step: 7006 Left percentage: 0.07006
each_step: 8007 Left percentage: 0.08007
each_step: 9008 Left percentage: 0.09008
each_step: 10009 Left percentage: 0.10009
each_step: 11010 Left percentage: 0.1101
each_step: 12011 Left percentage: 0.12011
each_step: 13012 Left percentage: 0.13012
each_step: 14013 Left percentage: 0.14013
each_step: 15014 Left percentage: 0.15014
each_step: 16015 Left percentage: 0.16015
each_step: 17016 Left percentage: 0.17016
each_step: 18017 Left percentage: 0.18017
each_step: 19018 Left percentage: 0.19018
each_step: 20019 Left percentage: 0.20019
each_step: 21020 Left percentage: 0.2102
each_step: 22021 Left percentage: 0.22021
each_step: 23022 Left percentage: 0.23022
each_step: 24023 Left percentage: 0.24023
each_s
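
The pre-fill loop above labels every transition of a finished episode with a rating: the episode's total reward divided by the goal of 30, clipped to `[0, 1]`. Entries that have not been rated yet carry `-1`. A standalone sketch of that labelling step follows; `rate_episode` is a hypothetical helper and the transitions are dummy lists.

```python
def rate_episode(buffer, num_step, total_reward, goal=30.0):
    """Write the clipped episode rating into the last `num_step` unrated transitions."""
    rate = min(1.0, max(0.0, total_reward / goal))
    for idx in range(num_step):
        if buffer[-1 - idx][-1] == -1:      # only touch entries that are still unrated
            buffer[-1 - idx][-1] = rate
    return rate

demo_buffer = [
    ["s0", "a0", "s1", 0.0, 0.0, 0.5],   # already rated by an earlier episode
    ["s1", "a1", "s2", 0.1, 0.0, -1],    # unrated
    ["s2", "a2", "s3", 0.2, 1.0, -1],    # unrated, terminal
]
rate = rate_episode(demo_buffer, num_step=2, total_reward=0.3)
print(rate, [entry[-1] for entry in demo_buffer])
```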

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list = [] # goal
rewards_list, gloss_list, dlossA_list, dlossQ_list = [], [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window

    # Training episodes/epochs
    for ep in range(1111):
        gloss_batch, dlossA_batch, dlossQ_batch= [], [], []
        #state = env.reset() # each episode
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        state = env_info.vector_observations[0]                  # get the current state (for each agent)
        num_step = 0 # each episode
        total_reward = 0 # each episode

        # Training steps/batches
        while True:
            # Explore (Env) or Exploit (Model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                #action = env.action_space.sample()
                action = np.random.randn(num_agents, action_size) # select an action (for each agent)
            else:
                action = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
            #next_state, reward, done, _ = env.step(action_clipped)
            action_clipped = np.clip(action, -1, 1) # all actions between -1 and 1; for env
            action_normed = (action_clipped+1)/2 # normalizing to 0-1; for training
            env_info = env.step(action_clipped)[brain_name]           # send all actions to the environment
            next_state = env_info.vector_observations[0]         # get next state (for each agent)
            reward = env_info.rewards[0]                         # get reward (for each agent)
            done = env_info.local_done[0]                        # see if episode finished
            memory.buffer.append([state, action_normed.reshape([-1]), next_state, reward, float(done), -1])
            num_step += 1 # memory added
            total_reward += reward
            state = next_state
            
            # Rating the last played episode
            if done is True:
                rate = total_reward/30 # Goal-required rewards
                rate_clipped = np.clip(rate, 0, 1)
                for idx in range(num_step): # episode length
                    if memory.buffer[-1-idx][-1] == -1:
                        memory.buffer[-1-idx][-1] = rate_clipped
            
            # Training on the top-rated samples of a randomly chosen memory slice
            while True:
                idx = np.random.choice(np.arange(memory_size// batch_size))
                batch = np.array(memory.buffer)[idx*batch_size:(idx+1)*batch_size]
                rates = np.array([each[5] for each in batch])
                if (np.max(rates)*0.9) > 0: # resample if no entry in the slice has a positive rating (unrated entries are -1)
                    break
            batch = batch[rates >= (np.max(rates)*0.9)] # keep samples rated at least 90% of the slice's best
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            rates = np.array([each[5] for each in batch])            
            #print(states.shape, actions.shape, next_states.shape, rewards.shape, dones.shape, rates.shape)
            nextQs_logits = sess.run(model.Qs_logits, feed_dict = {model.states: next_states})
            #nextQs = np.max(nextQs_logits, axis=1) * (1-dones) # DQN
            nextQs = nextQs_logits.reshape([-1]) * (1-dones) # DPG
            targetQs = rewards + (gamma * nextQs)
            gloss, dlossA, dlossQ, _, _, _ = sess.run([model.g_loss, model.d_lossA, model.d_lossQ, 
                                                       model.g_opt, model.d_optA, model.d_optQ],
                                                      feed_dict = {model.states: states, 
                                                                   model.actions: actions,
                                                                   model.targetQs: targetQs, 
                                                                   model.rates: rates})
            gloss_batch.append(gloss)
            dlossA_batch.append(dlossA)
            dlossQ_batch.append(dlossQ)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'rate:{:.4f}'.format(rate),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dlossA:{:.4f}'.format(np.mean(dlossA_batch)),
              'dlossQ:{:.4f}'.format(np.mean(dlossQ_batch)),
              'exploreP:{:.4f}'.format(explore_p))

        # Plotting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dlossA_list.append([ep, np.mean(dlossA_batch)])
        dlossQ_list.append([ep, np.mean(dlossQ_batch)])
        
        # Break episode/epoch loop
        ## Option 1: Solve the First Version
        #The task is episodic, and in order to solve the environment, 
        #your agent must get an average score of +30 over 100 consecutive episodes.
        if np.mean(episode_reward) >= 30:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:0.0000 R:0.0000 rate:0.0000 gloss:-228.1849 dlossA:0.2053 dlossQ:2.0681 exploreP:0.9057
Episode:1 meanR:0.3700 R:0.7400 rate:0.0247 gloss:-5910.7075 dlossA:0.1143 dlossQ:6.5927 exploreP:0.8204
Episode:2 meanR:0.2467 R:0.0000 rate:0.0000 gloss:-18858.0332 dlossA:0.2535 dlossQ:20.1431 exploreP:0.7432
Episode:3 meanR:0.1850 R:0.0000 rate:0.0000 gloss:-75870.5547 dlossA:0.5636 dlossQ:204.9388 exploreP:0.6734
Episode:4 meanR:0.3120 R:0.8200 rate:0.0273 gloss:-486767.4062 dlossA:1.3853 dlossQ:1167.1914 exploreP:0.6102
Episode:5 meanR:0.4000 R:0.8400 rate:0.0280 gloss:-710808.8750 dlossA:1.4197 dlossQ:642.4069 exploreP:0.5530
Episode:6 meanR:0.3871 R:0.3100 rate:0.0103 gloss:-1676342.5000 dlossA:3.3548 dlossQ:6032.1714 exploreP:0.5013
Episode:7 meanR:0.4175 R:0.6300 rate:0.0210 gloss:-3224989.5000 dlossA:4.2110 dlossQ:10904.8350 exploreP:0.4545
Episode:8 meanR:0.4167 R:0.4100 rate:0.0137 gloss:-13316800.0000 dlossA:5.8433 dlossQ:41421.2305 exploreP:0.4121
Episode:9 meanR:0.431

Episode:68 meanR:0.5867 R:0.6300 rate:0.0210 gloss:-652963610624.0000 dlossA:3518.1802 dlossQ:5258274304.0000 exploreP:0.0110
Episode:69 meanR:0.5813 R:0.2100 rate:0.0070 gloss:-539048181760.0000 dlossA:1916.6689 dlossQ:2664715520.0000 exploreP:0.0109
Episode:70 meanR:0.5754 R:0.1600 rate:0.0053 gloss:-575328092160.0000 dlossA:3686.7856 dlossQ:8401902080.0000 exploreP:0.0108
Episode:71 meanR:0.5865 R:1.3800 rate:0.0460 gloss:-647510360064.0000 dlossA:3015.0962 dlossQ:8217395200.0000 exploreP:0.0107
Episode:72 meanR:0.5836 R:0.3700 rate:0.0123 gloss:-615644135424.0000 dlossA:2730.6479 dlossQ:4502477312.0000 exploreP:0.0107
Episode:73 meanR:0.5757 R:0.0000 rate:0.0000 gloss:-503513645056.0000 dlossA:2020.4854 dlossQ:1815625088.0000 exploreP:0.0106
Episode:74 meanR:0.5719 R:0.2900 rate:0.0097 gloss:-385604583424.0000 dlossA:1872.5297 dlossQ:1256276608.0000 exploreP:0.0105
Episode:75 meanR:0.5643 R:0.0000 rate:0.0000 gloss:-478232117248.0000 dlossA:6177.5884 dlossQ:9249180672.0000 exploreP

Episode:132 meanR:0.5440 R:0.7200 rate:0.0240 gloss:-595279609856.0000 dlossA:20216.5684 dlossQ:340759019520.0000 exploreP:0.0100
Episode:133 meanR:0.5435 R:0.4200 rate:0.0140 gloss:-2015381422080.0000 dlossA:32032.3516 dlossQ:2333508108288.0000 exploreP:0.0100
Episode:134 meanR:0.5451 R:0.8500 rate:0.0283 gloss:-1184275234816.0000 dlossA:15435.2607 dlossQ:675460153344.0000 exploreP:0.0100
Episode:135 meanR:0.5516 R:0.8900 rate:0.0297 gloss:-1059190407168.0000 dlossA:9860.5762 dlossQ:246696509440.0000 exploreP:0.0100
Episode:136 meanR:0.5427 R:0.1800 rate:0.0060 gloss:-1008568041472.0000 dlossA:10975.0186 dlossQ:244171522048.0000 exploreP:0.0100
Episode:137 meanR:0.5477 R:1.1400 rate:0.0380 gloss:-842649436160.0000 dlossA:10407.0752 dlossQ:187269660672.0000 exploreP:0.0100
Episode:138 meanR:0.5404 R:0.5500 rate:0.0183 gloss:-764476653568.0000 dlossA:14185.6123 dlossQ:606412013568.0000 exploreP:0.0100
Episode:139 meanR:0.5413 R:0.2800 rate:0.0093 gloss:-902598819840.0000 dlossA:39535.15

Episode:195 meanR:0.5528 R:0.2300 rate:0.0077 gloss:-6797793951744.0000 dlossA:83818.9688 dlossQ:3241198747648.0000 exploreP:0.0100
Episode:196 meanR:0.5534 R:0.3600 rate:0.0120 gloss:-8310220652544.0000 dlossA:31371.0273 dlossQ:1865994993664.0000 exploreP:0.0100
Episode:197 meanR:0.5553 R:1.0600 rate:0.0353 gloss:-16560552935424.0000 dlossA:57037.7500 dlossQ:8397163855872.0000 exploreP:0.0100
Episode:198 meanR:0.5490 R:0.1900 rate:0.0063 gloss:-17901117505536.0000 dlossA:39864.9180 dlossQ:7436615286784.0000 exploreP:0.0100
Episode:199 meanR:0.5531 R:0.4700 rate:0.0157 gloss:-19473033592832.0000 dlossA:44563.4414 dlossQ:6822155517952.0000 exploreP:0.0100
Episode:200 meanR:0.5480 R:0.4000 rate:0.0133 gloss:-19526204784640.0000 dlossA:53434.4141 dlossQ:6272003866624.0000 exploreP:0.0100
Episode:201 meanR:0.5456 R:0.1400 rate:0.0047 gloss:-19294683398144.0000 dlossA:58794.9258 dlossQ:4279533043712.0000 exploreP:0.0100
Episode:202 meanR:0.5483 R:0.5200 rate:0.0173 gloss:-19452385034240.000

Episode:257 meanR:0.5208 R:0.6400 rate:0.0213 gloss:-50515614367744.0000 dlossA:83201.0469 dlossQ:18180720295936.0000 exploreP:0.0100
Episode:258 meanR:0.5210 R:0.3900 rate:0.0130 gloss:-49397731688448.0000 dlossA:288401.0625 dlossQ:49274645643264.0000 exploreP:0.0100
Episode:259 meanR:0.5224 R:0.4100 rate:0.0137 gloss:-58902578200576.0000 dlossA:95531.8281 dlossQ:11047344799744.0000 exploreP:0.0100
Episode:260 meanR:0.5298 R:1.4000 rate:0.0467 gloss:-48952275632128.0000 dlossA:192083.1719 dlossQ:25232328884224.0000 exploreP:0.0100
Episode:261 meanR:0.5307 R:0.8000 rate:0.0267 gloss:-46678476324864.0000 dlossA:81586.4141 dlossQ:8281289916416.0000 exploreP:0.0100
Episode:262 meanR:0.5339 R:0.8300 rate:0.0277 gloss:-60796683943936.0000 dlossA:342643.0000 dlossQ:77426759565312.0000 exploreP:0.0100
Episode:263 meanR:0.5354 R:0.1500 rate:0.0050 gloss:-91731458523136.0000 dlossA:302021.6875 dlossQ:79787431297024.0000 exploreP:0.0100
Episode:264 meanR:0.5325 R:0.3900 rate:0.0130 gloss:-188615

Episode:318 meanR:0.5932 R:0.0000 rate:0.0000 gloss:-542997213085696.0000 dlossA:593132.6875 dlossQ:936462790950912.0000 exploreP:0.0100
Episode:319 meanR:0.5931 R:0.4200 rate:0.0140 gloss:-483874840772608.0000 dlossA:340434.9375 dlossQ:230357332918272.0000 exploreP:0.0100
Episode:320 meanR:0.5907 R:0.3600 rate:0.0120 gloss:-359412694777856.0000 dlossA:200212.7969 dlossQ:75330639364096.0000 exploreP:0.0100
Episode:321 meanR:0.5949 R:0.9100 rate:0.0303 gloss:-318998260482048.0000 dlossA:249196.0156 dlossQ:74699430166528.0000 exploreP:0.0100
Episode:322 meanR:0.5911 R:0.1900 rate:0.0063 gloss:-295483180318720.0000 dlossA:303275.2188 dlossQ:55868603236352.0000 exploreP:0.0100
Episode:323 meanR:0.5864 R:0.1500 rate:0.0050 gloss:-206936574263296.0000 dlossA:679346.0000 dlossQ:177837684293632.0000 exploreP:0.0100
Episode:324 meanR:0.5846 R:0.2100 rate:0.0070 gloss:-359113389244416.0000 dlossA:430795.0000 dlossQ:639439495757824.0000 exploreP:0.0100
Episode:325 meanR:0.5837 R:0.3600 rate:0.012

Episode:378 meanR:0.5979 R:1.0600 rate:0.0353 gloss:-3625135739240448.0000 dlossA:658463.0000 dlossQ:1771093786361856.0000 exploreP:0.0100
Episode:379 meanR:0.6025 R:0.7500 rate:0.0250 gloss:-4061963709579264.0000 dlossA:717469.0000 dlossQ:1641976734679040.0000 exploreP:0.0100
Episode:380 meanR:0.5992 R:0.4300 rate:0.0143 gloss:-3947600273211392.0000 dlossA:691382.5625 dlossQ:1357826395996160.0000 exploreP:0.0100
Episode:381 meanR:0.5907 R:0.0000 rate:0.0000 gloss:-3114972007628800.0000 dlossA:778641.3750 dlossQ:1405202166972416.0000 exploreP:0.0100
Episode:382 meanR:0.5846 R:0.1400 rate:0.0047 gloss:-2634761646374912.0000 dlossA:808158.4375 dlossQ:1188229948637184.0000 exploreP:0.0100
Episode:383 meanR:0.5762 R:0.0700 rate:0.0023 gloss:-2342052108959744.0000 dlossA:636927.0625 dlossQ:688421517393920.0000 exploreP:0.0100
Episode:384 meanR:0.5686 R:0.5300 rate:0.0177 gloss:-1959468099174400.0000 dlossA:867796.3750 dlossQ:758529140981760.0000 exploreP:0.0100
Episode:385 meanR:0.5567 R:0.

Episode:437 meanR:0.6193 R:1.0200 rate:0.0340 gloss:-8497138065473536.0000 dlossA:1104340.5000 dlossQ:2392748560744448.0000 exploreP:0.0100
Episode:438 meanR:0.6178 R:0.1200 rate:0.0040 gloss:-7694691608821760.0000 dlossA:1061900.2500 dlossQ:1932196130586624.0000 exploreP:0.0100
Episode:439 meanR:0.6146 R:1.0400 rate:0.0347 gloss:nan dlossA:nan dlossQ:nan exploreP:0.0100
Episode:440 meanR:0.6076 R:0.0000 rate:0.0000 gloss:nan dlossA:nan dlossQ:nan exploreP:0.0100
Episode:441 meanR:0.6076 R:0.0000 rate:0.0000 gloss:nan dlossA:nan dlossQ:nan exploreP:0.0100
Episode:442 meanR:0.5932 R:0.0000 rate:0.0000 gloss:nan dlossA:nan dlossQ:nan exploreP:0.0100
Episode:443 meanR:0.5896 R:0.0000 rate:0.0000 gloss:nan dlossA:nan dlossQ:nan exploreP:0.0100
Episode:444 meanR:0.5857 R:0.0000 rate:0.0000 gloss:nan dlossA:nan dlossQ:nan exploreP:0.0100
Episode:445 meanR:0.5827 R:0.0000 rate:0.0000 gloss:nan dlossA:nan dlossQ:nan exploreP:0.0100
Episode:446 meanR:0.5776 R:0.0000 rate:0.0000 gloss:nan dlossA
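
The training loop above builds its regression target as `rewards + gamma * nextQs`, with `nextQs` zeroed on terminal transitions through the `(1 - dones)` factor. A dependency-free sketch of that computation on plain lists follows; `td_targets` is a hypothetical helper name, and the numbers are made up for illustration.

```python
def td_targets(rewards, next_qs, dones, gamma=0.99):
    """One-step targets: r + gamma * Q(s', a'), with the bootstrap term dropped when done."""
    return [r + gamma * q * (1.0 - d) for r, q, d in zip(rewards, next_qs, dones)]

targets = td_targets(rewards=[0.1, 0.0], next_qs=[2.0, 5.0], dones=[0.0, 1.0])
print(targets)  # the second target ignores next_qs because that transition is terminal
```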