# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [3]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```
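
If the notebook is shared across machines, the path can also be chosen programmatically. The following is only a sketch: the helper name `banana_file_name` and the `base` placeholder are hypothetical, and the folder names are taken from the list above on the assumption that the archives were unzipped unchanged.

```python
import platform

def banana_file_name(base="path/to", system=None, machine=None, headless=False):
    """Return the Banana build path for a platform (hypothetical helper)."""
    system = system or platform.system()      # "Darwin", "Windows", or "Linux"
    machine = machine or platform.machine()   # e.g. "x86_64", "AMD64", "i686"
    is_64bit = machine.lower() in ("x86_64", "amd64")
    if system == "Darwin":
        return f"{base}/Banana.app"
    if system == "Windows":
        folder = "Banana_Windows_x86_64" if is_64bit else "Banana_Windows_x86"
        return f"{base}/{folder}/Banana.exe"
    if system == "Linux":
        # The headless ("NoVis") build lives in its own folder.
        folder = "Banana_Linux_NoVis" if headless else "Banana_Linux"
        suffix = "x86_64" if is_64bit else "x86"
        return f"{base}/{folder}/Banana.{suffix}"
    raise ValueError(f"Unsupported platform: {system}")

# The result would then be passed as UnityEnvironment(file_name=...).
file_name = banana_file_name()
```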

In [4]:
# env = UnityEnvironment(file_name="/home/arasdar/VisualBanana_Linux/Banana.x86")
env = UnityEnvironment(file_name="/home/arasdar/Banana_Linux/Banana.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [5]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.

Run the code cell below to print some information about the environment.

In [6]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
# print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)
# print(state.shape, len(env_info.vector_observations), env_info.vector_observations.shape)

Number of agents: 1
Number of actions: 4
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects an action uniformly at random at each time step.  A window should pop up that lets you observe the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [7]:
env_info = env.reset(train_mode=False)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        print(state.shape)
        break
    
print("Score: {}".format(score))

(37,)
Score: 0.0
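
One common way to move from purely random actions toward learned ones is epsilon-greedy selection. This is not part of the original notebook: the sketch assumes some array of per-action value estimates (`q_values`, with made-up numbers here) and a hypothetical helper name `epsilon_greedy`.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.rand() < epsilon:
        return int(rng.randint(len(q_values)))   # explore
    return int(np.argmax(q_values))              # exploit

# Hypothetical value estimates for the four actions (illustration only).
q_values = np.array([0.1, 0.5, -0.2, 0.0])
action = epsilon_greedy(q_values, epsilon=0.1)
```

In the random-action loop, a call like this would replace `np.random.randint(action_size)` once value estimates are available; decaying `epsilon` over episodes is a typical design choice.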


When finished, you can close the environment.

In [8]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [9]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    #print(state)
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))

Score: -1.0


In [10]:
# Check which TensorFlow version is installed and whether it can see a GPU.
# An empty device name below means TensorFlow will run on the CPU.
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [11]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
batch = []
while True: # infinite number of steps
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    #print(state, action, reward, done)
    batch.append([action, state, reward, done])
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
# print("Score: {}".format(score))

In [12]:
batch[0], batch[0][1].shape

([3, array([1.        , 0.        , 0.        , 0.        , 0.35186431,
         1.        , 0.        , 0.        , 0.        , 0.37953866,
         1.        , 0.        , 0.        , 0.        , 0.11957462,
         1.        , 0.        , 0.        , 0.        , 0.43679786,
         0.        , 1.        , 0.        , 0.        , 0.7516005 ,
         0.        , 0.        , 1.        , 0.        , 0.6708644 ,
         0.        , 0.        , 1.        , 0.        , 0.36187497,
         0.        , 0.        ]), 0.0, False], (37,))

In [13]:
batch[0][1].shape

(37,)

In [14]:
batch[0]

[3, array([1.        , 0.        , 0.        , 0.        , 0.35186431,
        1.        , 0.        , 0.        , 0.        , 0.37953866,
        1.        , 0.        , 0.        , 0.        , 0.11957462,
        1.        , 0.        , 0.        , 0.        , 0.43679786,
        0.        , 1.        , 0.        , 0.        , 0.7516005 ,
        0.        , 0.        , 1.        , 0.        , 0.6708644 ,
        0.        , 0.        , 1.        , 0.        , 0.36187497,
        0.        , 0.        ]), 0.0, False]

In [15]:
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
# infos = np.array([each[4] for each in batch])

In [16]:
# print(rewards[:])
print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print(np.max(np.array(actions)), np.min(np.array(actions)), 
      (np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print(np.max(np.array(rewards)), np.min(np.array(rewards)))
print(np.max(np.array(states)), np.min(np.array(states)))

(300,) (300, 37) (300,) (300,)
float64 float64 int64 bool
3 0 4
1.0 -1.0
10.711227416992188 -10.516661643981934
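
The positional indexing used above (`each[0]`, `each[1]`, ...) is easy to get wrong when the field order changes. A possible alternative, sketched here with a hypothetical `Transition` record and synthetic data rather than real environment output, is to name the fields:

```python
from collections import namedtuple
import numpy as np

# Same field order as the lists appended to `batch` above.
Transition = namedtuple("Transition", ["action", "state", "reward", "done"])

def unpack(batch):
    """Stack a list of Transition records into one array per field."""
    actions = np.array([t.action for t in batch], dtype=np.int64)
    states = np.array([t.state for t in batch], dtype=np.float64)
    rewards = np.array([t.reward for t in batch], dtype=np.float64)
    dones = np.array([t.done for t in batch], dtype=bool)
    return actions, states, rewards, dones

# Synthetic stand-in for a five-step episode (not real environment data).
demo = [Transition(action=i % 4, state=np.zeros(37), reward=0.0, done=(i == 4))
        for i in range(5)]
actions, states, rewards, dones = unpack(demo)
print(states.shape, actions.shape, rewards.shape, dones.shape)
# prints: (5, 37) (5,) (5,) (5,)
```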


In [17]:
# Data of the model
def model_input(state_size):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    reward = tf.placeholder(tf.float32, [], name='reward')
    return states, actions, targetQs, reward

In [18]:
# Generator/Controller: generating/predicting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits
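
The expression `tf.maximum(alpha * bn1, bn1)` in the generator acts as a leaky ReLU whenever `0 < alpha < 1`: positive inputs pass through unchanged and negative inputs are scaled by `alpha`. The NumPy sketch below reproduces the same arithmetic outside TensorFlow; the function name `leaky_relu` is chosen here for illustration.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Element-wise max(alpha * x, x); equals a leaky ReLU for 0 < alpha < 1."""
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 3.0])
y = leaky_relu(x)   # negative entries become -0.2 and -0.05; 0.0 and 3.0 are unchanged
```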

In [19]:
# Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [20]:
def model_loss(action_size, hidden_size, states, actions, targetQs, reward):
    # G
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                      labels=actions_labels)
    g_loss = tf.reduce_mean(neg_log_prob_actions[:-1] * targetQs[1:])
    
    # D
    Qs_logits = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    rewards = reward * tf.ones_like(Qs_logits)
    d_lossR = tf.reduce_mean(tf.square(tf.nn.tanh(Qs_logits) - rewards))
    # d_lossR = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs_logits,
    #                                                                  labels=tf.nn.sigmoid(rewards)))
    targetQs = tf.reshape(targetQs, shape=[-1, 1])
    d_lossQ = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs_logits[:-1],
                                                                     labels=tf.nn.sigmoid(targetQs[1:])))
    d_loss = d_lossR + d_lossQ

    return actions_logits, Qs_logits, g_loss, d_loss, d_lossR, d_lossQ
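
Two details of `model_loss` can be checked with plain NumPy. First, the slices `[:-1]` and `[1:]` pair the term at step `t` with the target at step `t + 1`. Second, the sigmoid cross-entropy evaluates to `ln 2 ≈ 0.6931` whenever the logits are zero, whatever the labels are, so that value is worth recognizing in a `dlossQ` readout. The sketch uses made-up numbers and re-implements the numerically stable formula documented for `tf.nn.sigmoid_cross_entropy_with_logits`; it is an illustration, not the notebook's graph.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_cross_entropy_with_logits(logits, labels):
    # Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

# 1) Shifted pairing: step t is weighted by the target from step t + 1.
neg_log_prob = np.array([1.0, 2.0, 3.0])      # made-up per-step values
target_qs = np.array([10.0, 20.0, 30.0])      # made-up per-step targets
g_loss = np.mean(neg_log_prob[:-1] * target_qs[1:])   # mean(1*20, 2*30) = 40.0

# 2) Zero logits give ln 2 for any soft label in [0, 1].
labels = sigmoid(np.array([-1.0, 0.0, 2.0]))
d_loss_q = sigmoid_cross_entropy_with_logits(np.zeros(3), labels)
```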

In [21]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    """
    Get optimization operations in order
    :param g_loss: Generator loss Tensor for action prediction
    :param d_loss: Discriminator loss Tensor for reward prediction for generated/prob/logits action
    :param learning_rate: Learning rate (a float) passed to both Adam optimizers
    :return: A tuple of (generator training op, discriminator training op)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [22]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, self.reward = model_input(state_size=state_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_loss, self.d_lossR, self.d_lossQ = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, # model input
            targetQs=self.targetQs, reward=self.reward) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [23]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(300, 37) actions:(300,)
action size:4


In [24]:
# Training parameters
# Network parameters
state_size = 37              # number of units for the input state/observation -- simulation
action_size = 4              # number of units for the output actions -- simulation
hidden_size = 37*16          # number of units in each Q-network hidden layer -- simulation
learning_rate = 0.001          # learning rate for adam

In [26]:
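# Reset/init the graph/session
# tf.reset_default_graph() returns None, so fetch the fresh default graph explicitly
tf.reset_default_graph()
graph = tf.get_default_graph()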

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

In [27]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment

while True: # infinite number of steps
#for _ in range(batch_size):
    state = env_info.vector_observations[0]        # get the current state
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    #memory.buffer.append([action, state, done])
    if done:                                       # exit loop if episode finished
        break
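
The training cell that follows turns each episode's total reward into a `rate` in `[-1, 1]` by dividing by the target score of `+13` and clipping, and it tracks a running mean over the last 100 episodes with a `deque`. The sketch below isolates those two pieces; the helper name `episode_rate` and the episode scores are made up for illustration.

```python
from collections import deque
import numpy as np

TARGET_SCORE = 13.0  # success threshold used by the training loop

def episode_rate(total_reward, target=TARGET_SCORE):
    """Scale an episode score by the target and clip it to [-1, 1]."""
    return float(np.clip(total_reward / target, -1.0, 1.0))

# A deque with maxlen keeps only the most recent scores.
window = deque(maxlen=100)
for total_reward in [0.0, 0.0, 0.0, 0.0, -1.0]:   # made-up episode scores
    window.append(total_reward)
running_mean = float(np.mean(window))             # -0.2 for these five scores

one_banana = episode_rate(1.0)                    # 1/13, roughly 0.0769
```

A single net banana therefore shows up as `rate:0.0769` in the printed log, and the loop stops once the running mean reaches `+13`.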

In [None]:
from collections import deque
episodes_total_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
saver = tf.train.Saver()
rewards_list, g_loss_list, d_loss_list = [], [], []
d_lossR_list, d_lossQ_list = [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    for ep in range(111111):
        batch = [] # every data batch
        total_reward = 0
        #state = env.reset() # env first state
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment

        # Training steps/batches
        while True:
            state = env_info.vector_observations[0]   # get the current state
            action_logits, Q_logits = sess.run(fetches=[model.actions_logits, model.Qs_logits], 
                                               feed_dict={model.states: np.reshape(state, [1, -1])})
            action = np.argmax(action_logits)
            batch.append([state, action, Q_logits])
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            total_reward += reward
            if done: # episode ended success/failure
                episodes_total_reward.append(total_reward) # stopping criteria
                #rate = total_reward/ 500 # success is 500 points, rate is between 0 and +1 ~ sigmoid
                rate = total_reward/ +13 # success is +13; rate is between -1 and +1 ~ tanh
                if rate > +1: rate = +1
                if rate < -1: rate = -1
                break

        # Training using batches
        #batch = memory.buffer
        states = np.array([each[0] for each in batch])
        actions = np.array([each[1] for each in batch])
        targetQs = np.array([each[2] for each in batch])
        g_loss, d_loss, d_lossR, d_lossQ, _, _ = sess.run([model.g_loss, model.d_loss,
                                                           model.d_lossR, model.d_lossQ, 
                                                           model.g_opt, model.d_opt],
                                                          feed_dict = {model.states: states, 
                                                                       model.actions: actions,
                                                                       model.reward: rate,
                                                                       model.targetQs: targetQs.reshape([-1])})
        # Average 100 episode total reward
        # Print out
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episodes_total_reward)),
              'rate:{:.4f}'.format(rate),
              'gloss:{:.4f}'.format(g_loss),
              'dloss:{:.4f}'.format(d_loss),
              'dlossR:{:.4f}'.format(d_lossR),
              'dlossQ:{:.4f}'.format(d_lossQ))
        # Record values for plotting
        rewards_list.append([ep, np.mean(episodes_total_reward)])
        g_loss_list.append([ep, g_loss])
        d_loss_list.append([ep, d_loss])
        d_lossR_list.append([ep, d_lossR])
        d_lossQ_list.append([ep, d_lossQ])
        # Break episode/epoch loop
        if np.mean(episodes_total_reward) >= +13:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints-nav/model.ckpt')

Episode:0 meanR:0.0000 rate:0.0000 gloss:0.0828 dloss:0.6980 dlossR:0.0055 dlossQ:0.6926
Episode:1 meanR:0.0000 rate:0.0000 gloss:-0.4040 dloss:0.8208 dlossR:0.1442 dlossQ:0.6766
Episode:2 meanR:0.0000 rate:0.0000 gloss:0.1308 dloss:0.7073 dlossR:0.0153 dlossQ:0.6920
Episode:3 meanR:0.0000 rate:0.0000 gloss:-0.1106 dloss:0.7051 dlossR:0.0127 dlossQ:0.6924
Episode:4 meanR:-0.2000 rate:-0.0769 gloss:0.0798 dloss:0.7200 dlossR:0.0271 dlossQ:0.6929
Episode:5 meanR:-0.1667 rate:0.0000 gloss:-0.1763 dloss:0.7547 dlossR:0.0525 dlossQ:0.7022
Episode:6 meanR:-0.1429 rate:0.0000 gloss:0.1701 dloss:0.7378 dlossR:0.0475 dlossQ:0.6903
Episode:7 meanR:-0.1250 rate:0.0000 gloss:0.0537 dloss:0.7204 dlossR:0.0269 dlossQ:0.6934
Episode:8 meanR:0.0000 rate:0.0769 gloss:0.0458 dloss:0.7260 dlossR:0.0252 dlossQ:0.7007
Episode:9 meanR:0.0000 rate:0.0000 gloss:0.0103 dloss:0.6991 dlossR:0.0048 dlossQ:0.6942
Episode:10 meanR:0.0909 rate:0.0769 gloss:-0.1147 dloss:0.8303 dlossR:0.1459 dlossQ:0.6844
Episode:11 

Episode:91 meanR:0.0652 rate:0.0769 gloss:-0.0000 dloss:0.6991 dlossR:0.0059 dlossQ:0.6931
Episode:92 meanR:0.0645 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6932
Episode:93 meanR:0.0638 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6932
Episode:94 meanR:0.0632 rate:0.0000 gloss:0.0001 dloss:0.6941 dlossR:0.0010 dlossQ:0.6931
Episode:95 meanR:0.0625 rate:0.0000 gloss:0.0001 dloss:0.6943 dlossR:0.0013 dlossQ:0.6930
Episode:96 meanR:0.0619 rate:0.0000 gloss:0.0001 dloss:0.6935 dlossR:0.0003 dlossQ:0.6932
Episode:97 meanR:0.0612 rate:0.0000 gloss:0.0001 dloss:0.6942 dlossR:0.0012 dlossQ:0.6930
Episode:98 meanR:0.0606 rate:0.0000 gloss:0.0000 dloss:0.6941 dlossR:0.0010 dlossQ:0.6931
Episode:99 meanR:0.0600 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6932
Episode:100 meanR:0.0600 rate:0.0000 gloss:-0.0000 dloss:0.6935 dlossR:0.0004 dlossQ:0.6931
Episode:101 meanR:0.0600 rate:0.0000 gloss:-0.0001 dloss:0.6935 dlossR:0.0003 dlossQ:0.6931
Epi

Episode:181 meanR:0.0100 rate:0.0000 gloss:-0.0002 dloss:0.6963 dlossR:0.0035 dlossQ:0.6928
Episode:182 meanR:0.0000 rate:-0.0769 gloss:-0.0002 dloss:0.6939 dlossR:0.0010 dlossQ:0.6929
Episode:183 meanR:0.0000 rate:0.0000 gloss:-0.0001 dloss:0.6953 dlossR:0.0024 dlossQ:0.6929
Episode:184 meanR:0.0000 rate:0.0000 gloss:-0.0001 dloss:0.6946 dlossR:0.0016 dlossQ:0.6930
Episode:185 meanR:0.0100 rate:0.0769 gloss:-0.0000 dloss:0.7012 dlossR:0.0081 dlossQ:0.6931
Episode:186 meanR:0.0100 rate:0.0000 gloss:0.0002 dloss:0.6934 dlossR:0.0003 dlossQ:0.6931
Episode:187 meanR:0.0100 rate:0.0000 gloss:0.0001 dloss:0.6949 dlossR:0.0020 dlossQ:0.6929
Episode:188 meanR:0.0000 rate:0.0000 gloss:0.0004 dloss:0.6978 dlossR:0.0052 dlossQ:0.6925
Episode:189 meanR:0.0000 rate:0.0000 gloss:0.0002 dloss:0.6979 dlossR:0.0054 dlossQ:0.6925
Episode:190 meanR:0.0100 rate:0.0769 gloss:0.0003 dloss:0.6932 dlossR:0.0005 dlossQ:0.6927
Episode:191 meanR:0.0000 rate:0.0000 gloss:0.0001 dloss:0.6951 dlossR:0.0022 dlossQ:

Episode:271 meanR:-0.0100 rate:-0.0769 gloss:-0.0001 dloss:0.6973 dlossR:0.0041 dlossQ:0.6931
Episode:272 meanR:-0.0100 rate:0.0000 gloss:-0.0001 dloss:0.6944 dlossR:0.0014 dlossQ:0.6930
Episode:273 meanR:-0.0200 rate:0.0000 gloss:-0.0002 dloss:0.6953 dlossR:0.0024 dlossQ:0.6929
Episode:274 meanR:-0.0200 rate:0.0000 gloss:-0.0002 dloss:0.6970 dlossR:0.0044 dlossQ:0.6927
Episode:275 meanR:-0.0200 rate:0.0000 gloss:-0.0002 dloss:0.6957 dlossR:0.0028 dlossQ:0.6928
Episode:276 meanR:-0.0100 rate:0.0000 gloss:-0.0002 dloss:0.6950 dlossR:0.0021 dlossQ:0.6929
Episode:277 meanR:-0.0100 rate:0.0000 gloss:-0.0001 dloss:0.6936 dlossR:0.0005 dlossQ:0.6931
Episode:278 meanR:-0.0100 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6932
Episode:279 meanR:-0.0100 rate:0.0000 gloss:0.0001 dloss:0.6938 dlossR:0.0007 dlossQ:0.6931
Episode:280 meanR:-0.0100 rate:0.0000 gloss:0.0002 dloss:0.6943 dlossR:0.0013 dlossQ:0.6930
Episode:281 meanR:-0.0100 rate:0.0000 gloss:0.0003 dloss:0.6948 dlossR:0

Episode:361 meanR:0.0000 rate:0.0000 gloss:-0.0001 dloss:0.6936 dlossR:0.0005 dlossQ:0.6931
Episode:362 meanR:0.0000 rate:0.0000 gloss:-0.0001 dloss:0.6934 dlossR:0.0003 dlossQ:0.6931
Episode:363 meanR:0.0000 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6932
Episode:364 meanR:0.0000 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:365 meanR:0.0000 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:366 meanR:0.0000 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:367 meanR:0.0000 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:368 meanR:-0.0100 rate:-0.0769 gloss:0.0001 dloss:0.7015 dlossR:0.0084 dlossQ:0.6931
Episode:369 meanR:-0.0100 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6932
Episode:370 meanR:0.0000 rate:0.0769 gloss:-0.0001 dloss:0.7008 dlossR:0.0077 dlossQ:0.6931
Episode:371 meanR:0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0002 dloss

Episode:450 meanR:-0.0100 rate:0.0000 gloss:0.0001 dloss:0.6933 dlossR:0.0001 dlossQ:0.6932
Episode:451 meanR:-0.0100 rate:0.0000 gloss:0.0000 dloss:0.6934 dlossR:0.0003 dlossQ:0.6931
Episode:452 meanR:-0.0100 rate:0.0000 gloss:0.0001 dloss:0.6934 dlossR:0.0003 dlossQ:0.6931
Episode:453 meanR:-0.0100 rate:0.0000 gloss:0.0000 dloss:0.6934 dlossR:0.0003 dlossQ:0.6932
Episode:454 meanR:0.0000 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6932
Episode:455 meanR:0.0000 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6932
Episode:456 meanR:0.0000 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:457 meanR:0.0000 rate:0.0000 gloss:-0.0001 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:458 meanR:0.0000 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:459 meanR:0.0000 rate:0.0000 gloss:-0.0000 dloss:0.6934 dlossR:0.0002 dlossQ:0.6931
Episode:460 meanR:0.0000 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dl

Episode:540 meanR:0.0600 rate:-0.0769 gloss:0.0001 dloss:0.7039 dlossR:0.0108 dlossQ:0.6931
Episode:541 meanR:0.0600 rate:0.0000 gloss:0.0001 dloss:0.6946 dlossR:0.0017 dlossQ:0.6929
Episode:542 meanR:0.0600 rate:0.0000 gloss:0.0001 dloss:0.6947 dlossR:0.0018 dlossQ:0.6929
Episode:543 meanR:0.0600 rate:-0.0769 gloss:0.0001 dloss:0.7059 dlossR:0.0129 dlossQ:0.6930
Episode:544 meanR:0.0600 rate:0.0000 gloss:0.0000 dloss:0.6934 dlossR:0.0003 dlossQ:0.6931
Episode:545 meanR:0.0500 rate:0.0000 gloss:-0.0000 dloss:0.6934 dlossR:0.0002 dlossQ:0.6931
Episode:546 meanR:0.0500 rate:0.0000 gloss:-0.0001 dloss:0.6934 dlossR:0.0003 dlossQ:0.6931
Episode:547 meanR:0.0400 rate:0.0000 gloss:-0.0002 dloss:0.6948 dlossR:0.0018 dlossQ:0.6930
Episode:548 meanR:0.0500 rate:0.0000 gloss:-0.0001 dloss:0.6960 dlossR:0.0032 dlossQ:0.6928
Episode:549 meanR:0.0500 rate:0.0000 gloss:-0.0001 dloss:0.6946 dlossR:0.0017 dlossQ:0.6930
Episode:550 meanR:0.0500 rate:0.0000 gloss:-0.0001 dloss:0.6940 dlossR:0.0009 dloss

Episode:630 meanR:0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:631 meanR:0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:632 meanR:0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:633 meanR:0.0300 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6932
Episode:634 meanR:0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:635 meanR:0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:636 meanR:0.0300 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:637 meanR:0.0300 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:638 meanR:0.0300 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:639 meanR:0.0100 rate:0.0000 gloss:0.0001 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:640 meanR:0.0200 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0

Episode:720 meanR:0.0300 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:721 meanR:0.0400 rate:0.0769 gloss:0.0000 dloss:0.6970 dlossR:0.0039 dlossQ:0.6931
Episode:722 meanR:0.0400 rate:0.0000 gloss:0.0001 dloss:0.6938 dlossR:0.0007 dlossQ:0.6931
Episode:723 meanR:0.0400 rate:0.0000 gloss:0.0001 dloss:0.6940 dlossR:0.0010 dlossQ:0.6930
Episode:724 meanR:0.0400 rate:0.0000 gloss:0.0001 dloss:0.6942 dlossR:0.0012 dlossQ:0.6930
Episode:725 meanR:0.0400 rate:0.0000 gloss:0.0001 dloss:0.6938 dlossR:0.0007 dlossQ:0.6931
Episode:726 meanR:0.0400 rate:0.0000 gloss:0.0000 dloss:0.6935 dlossR:0.0004 dlossQ:0.6931
Episode:727 meanR:0.0400 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6932
Episode:728 meanR:0.0400 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6932
Episode:729 meanR:0.0400 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6932
Episode:730 meanR:0.0400 rate:0.0000 gloss:-0.0000 dloss:0.6934 dlossR:0.0003 dlossQ:0.69

Episode:810 meanR:-0.0200 rate:0.0769 gloss:0.0000 dloss:0.6991 dlossR:0.0060 dlossQ:0.6931
Episode:811 meanR:-0.0200 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:812 meanR:-0.0300 rate:-0.0769 gloss:0.0000 dloss:0.7031 dlossR:0.0101 dlossQ:0.6931
Episode:813 meanR:-0.0300 rate:0.0000 gloss:0.0000 dloss:0.6934 dlossR:0.0002 dlossQ:0.6931
Episode:814 meanR:-0.0300 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:815 meanR:-0.0300 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:816 meanR:-0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:817 meanR:-0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:818 meanR:-0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:819 meanR:-0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:820 meanR:-0.0400 rate:-0.0769 gloss:-0.0000 dloss:0.6979 dlossR:0.

Episode:899 meanR:-0.0800 rate:0.0000 gloss:0.0002 dloss:0.6939 dlossR:0.0008 dlossQ:0.6931
Episode:900 meanR:-0.0800 rate:0.0000 gloss:0.0001 dloss:0.6936 dlossR:0.0005 dlossQ:0.6931
Episode:901 meanR:-0.0800 rate:0.0000 gloss:0.0000 dloss:0.6934 dlossR:0.0003 dlossQ:0.6932
Episode:902 meanR:-0.0800 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:903 meanR:-0.0800 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:904 meanR:-0.0800 rate:0.0000 gloss:-0.0001 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:905 meanR:-0.0800 rate:0.0000 gloss:-0.0002 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:906 meanR:-0.0800 rate:0.0000 gloss:-0.0001 dloss:0.6935 dlossR:0.0003 dlossQ:0.6931
Episode:907 meanR:-0.0900 rate:0.0000 gloss:-0.0001 dloss:0.6934 dlossR:0.0003 dlossQ:0.6931
Episode:908 meanR:-0.0900 rate:0.0000 gloss:-0.0002 dloss:0.6935 dlossR:0.0004 dlossQ:0.6931
Episode:909 meanR:-0.0900 rate:0.0000 gloss:-0.0001 dloss:0.6934 dlossR:0.

Episode:988 meanR:-0.0100 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6932
Episode:989 meanR:-0.0100 rate:0.0000 gloss:-0.0001 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:990 meanR:-0.0100 rate:0.0000 gloss:-0.0001 dloss:0.6934 dlossR:0.0003 dlossQ:0.6931
Episode:991 meanR:-0.0200 rate:0.0000 gloss:-0.0002 dloss:0.6935 dlossR:0.0004 dlossQ:0.6931
Episode:992 meanR:-0.0200 rate:0.0000 gloss:-0.0001 dloss:0.6936 dlossR:0.0005 dlossQ:0.6931
Episode:993 meanR:-0.0200 rate:0.0000 gloss:-0.0003 dloss:0.6939 dlossR:0.0009 dlossQ:0.6930
Episode:994 meanR:-0.0200 rate:0.0000 gloss:-0.0002 dloss:0.6937 dlossR:0.0006 dlossQ:0.6931
Episode:995 meanR:-0.0200 rate:0.0000 gloss:-0.0002 dloss:0.6937 dlossR:0.0006 dlossQ:0.6931
Episode:996 meanR:-0.0100 rate:0.0769 gloss:-0.0003 dloss:0.7026 dlossR:0.0095 dlossQ:0.6931
Episode:997 meanR:-0.0100 rate:0.0000 gloss:-0.0001 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:998 meanR:-0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR

Episode:1077 meanR:0.0600 rate:0.0000 gloss:0.0002 dloss:0.6948 dlossR:0.0019 dlossQ:0.6929
Episode:1078 meanR:0.0700 rate:0.0769 gloss:0.0001 dloss:0.6949 dlossR:0.0019 dlossQ:0.6930
Episode:1079 meanR:0.0700 rate:0.0000 gloss:0.0001 dloss:0.6939 dlossR:0.0009 dlossQ:0.6930
Episode:1080 meanR:0.0600 rate:0.0000 gloss:0.0001 dloss:0.6936 dlossR:0.0005 dlossQ:0.6931
Episode:1081 meanR:0.0600 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1082 meanR:0.0600 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1083 meanR:0.0600 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1084 meanR:0.0600 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6932
Episode:1085 meanR:0.0600 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:1086 meanR:0.0600 rate:0.0000 gloss:-0.0000 dloss:0.6934 dlossR:0.0002 dlossQ:0.6932
Episode:1087 meanR:0.0700 rate:0.0000 gloss:-0.0000 dloss:0.6935 dlossR:0.00

Episode:1166 meanR:0.0700 rate:0.0769 gloss:-0.0000 dloss:0.6991 dlossR:0.0060 dlossQ:0.6932
Episode:1167 meanR:0.0700 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1168 meanR:0.0700 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1169 meanR:0.0700 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1170 meanR:0.0700 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6932
Episode:1171 meanR:0.0500 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1172 meanR:0.0400 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1173 meanR:0.0500 rate:0.0769 gloss:0.0000 dloss:0.6987 dlossR:0.0056 dlossQ:0.6932
Episode:1174 meanR:0.0500 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:1175 meanR:0.0600 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1176 meanR:0.0600 rate:0.0000 gloss:0.0001 dloss:0.6934 dlossR:0.0003 d

Episode:1255 meanR:0.0300 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:1256 meanR:0.0200 rate:-0.0769 gloss:0.0000 dloss:0.7019 dlossR:0.0088 dlossQ:0.6931
Episode:1257 meanR:0.0100 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:1258 meanR:0.0100 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1259 meanR:0.0000 rate:-0.0769 gloss:0.0000 dloss:0.6992 dlossR:0.0061 dlossQ:0.6932
Episode:1260 meanR:0.0100 rate:0.0769 gloss:-0.0000 dloss:0.7013 dlossR:0.0081 dlossQ:0.6931
Episode:1261 meanR:0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6935 dlossR:0.0004 dlossQ:0.6931
Episode:1262 meanR:0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6936 dlossR:0.0005 dlossQ:0.6931
Episode:1263 meanR:0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6936 dlossR:0.0005 dlossQ:0.6931
Episode:1264 meanR:0.0000 rate:-0.0769 gloss:-0.0001 dloss:0.6962 dlossR:0.0031 dlossQ:0.6931
Episode:1265 meanR:0.0000 rate:0.0000 gloss:-0.0000 dloss:0.6936 dlossR:

Episode:1344 meanR:0.0200 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1345 meanR:0.0200 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:1346 meanR:0.0200 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1347 meanR:0.0200 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1348 meanR:0.0200 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1349 meanR:0.0200 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1350 meanR:0.0200 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1351 meanR:0.0200 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1352 meanR:0.0300 rate:0.0769 gloss:-0.0000 dloss:0.7002 dlossR:0.0071 dlossQ:0.6931
Episode:1353 meanR:0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1354 meanR:0.0300 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0

Episode:1433 meanR:0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6934 dlossR:0.0003 dlossQ:0.6931
Episode:1434 meanR:0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:1435 meanR:0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:1436 meanR:0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1437 meanR:0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1438 meanR:0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1439 meanR:0.0100 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1440 meanR:0.0000 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1441 meanR:0.0100 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1442 meanR:0.0000 rate:-0.0769 gloss:0.0000 dloss:0.7000 dlossR:0.0068 dlossQ:0.6931
Episode:1443 meanR:0.0000 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0

Episode:1522 meanR:0.0200 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1523 meanR:0.0300 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6932
Episode:1524 meanR:0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1525 meanR:0.0300 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1526 meanR:0.0400 rate:0.0769 gloss:-0.0000 dloss:0.6999 dlossR:0.0067 dlossQ:0.6931
Episode:1527 meanR:0.0400 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6932
Episode:1528 meanR:0.0500 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1529 meanR:0.0500 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1530 meanR:0.0500 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:1531 meanR:0.0500 rate:0.0000 gloss:0.0000 dloss:0.6934 dlossR:0.0003 dlossQ:0.6931
Episode:1532 meanR:0.0500 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0001 

Episode:1611 meanR:0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1612 meanR:0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1613 meanR:0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1614 meanR:0.0300 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1615 meanR:0.0400 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1616 meanR:0.0400 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1617 meanR:0.0400 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1618 meanR:0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1619 meanR:0.0300 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6932
Episode:1620 meanR:0.0300 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1621 meanR:0.0300 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0

Episode:1700 meanR:0.0900 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:1701 meanR:0.0900 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:1702 meanR:0.0900 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:1703 meanR:0.0900 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:1704 meanR:0.0900 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:1705 meanR:0.0900 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1706 meanR:0.0900 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1707 meanR:0.0900 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1708 meanR:0.0900 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6932
Episode:1709 meanR:0.0900 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1710 meanR:0.1000 rate:0.0769 gloss:-0.0000 dloss:0.6996 dlossR:0.0064

Episode:1789 meanR:0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:1790 meanR:0.0000 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1791 meanR:0.0000 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:1792 meanR:0.0000 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1793 meanR:-0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1794 meanR:-0.0100 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1795 meanR:-0.0100 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:1796 meanR:-0.0100 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1797 meanR:-0.0100 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0001 dlossQ:0.6931
Episode:1798 meanR:-0.0100 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1799 meanR:-0.0100 rate:0.0000 gloss:0.0000 dloss:0.6932 dlo

Episode:1877 meanR:-0.0600 rate:-0.0769 gloss:0.0000 dloss:0.7010 dlossR:0.0078 dlossQ:0.6931
Episode:1878 meanR:-0.0600 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1879 meanR:-0.0500 rate:0.0000 gloss:0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:1880 meanR:-0.0500 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1881 meanR:-0.0400 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1882 meanR:-0.0400 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1883 meanR:-0.0400 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1884 meanR:-0.0400 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1885 meanR:-0.0400 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1886 meanR:-0.0400 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1887 meanR:-0.0300 rate:0.0769 gloss:-0.0000 dloss:0.6991 dl

Episode:1965 meanR:-0.0500 rate:0.0000 gloss:0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1966 meanR:-0.0500 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1967 meanR:-0.0500 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1968 meanR:-0.0500 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0000 dlossQ:0.6931
Episode:1969 meanR:-0.0500 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1970 meanR:-0.0500 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1971 meanR:-0.0500 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:1972 meanR:-0.0500 rate:0.0000 gloss:-0.0000 dloss:0.6932 dlossR:0.0001 dlossQ:0.6931
Episode:1973 meanR:-0.0600 rate:-0.0769 gloss:-0.0000 dloss:0.6975 dlossR:0.0043 dlossQ:0.6931
Episode:1974 meanR:-0.0600 rate:0.0000 gloss:-0.0000 dloss:0.6933 dlossR:0.0002 dlossQ:0.6931
Episode:1975 meanR:-0.0600 rate:0.0000 gloss:-0.0000 dloss:0

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

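def running_mean(x, N):
    """Moving average of x over a window of N samples; returns len(x) - N + 1 values."""
    cumsum = np.cumsum(np.insert(x, 0, 0))  # prepend 0 so each window sum is a difference of two entries
    return (cumsum[N:] - cumsum[:-N]) / N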
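
As a quick sanity check of the smoothing helper, the sketch below runs `running_mean` on a small made-up series and compares it with `np.convolve` using a uniform kernel. The array `toy_rewards` and the window size of 2 are hypothetical values chosen only for illustration, and the function is repeated so the cell runs on its own.

```python
import numpy as np

def running_mean(x, N):
    # Same cumulative-sum moving average as defined in the notebook.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Hypothetical reward series, used only for illustration.
toy_rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
window = 2

smoothed = running_mean(toy_rewards, window)

# A uniform kernel in 'valid' mode computes the same moving average.
reference = np.convolve(toy_rewards, np.ones(window) / window, mode='valid')

print(smoothed)                        # moving average over each pair of neighbours
print(np.allclose(smoothed, reference))
print(len(toy_rewards) - len(smoothed))  # number of points lost to the window
```

The smoothed series is `N - 1` points shorter than the input, which is why the plotting cells index the episode axis with `eps[-len(smoothed_arr):]`.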

In [None]:
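# rewards_list holds (episode, total reward) pairs; split it into two arrays
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
# The running mean is shorter than arr, so align it with the last episodes
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
# Raw per-episode values, drawn faintly behind the smoothed curve
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')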

In [None]:
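# loss_list holds (episode, loss) pairs; split it into two arrays
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
# The running mean is shorter than arr, so align it with the last episodes
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
# Raw per-episode values, drawn faintly behind the smoothed curve
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')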

In [33]:
# # import gym
# # # env = gym.make('CartPole-v0')
# # env = gym.make('CartPole-v1')
# # # env = gym.make('Acrobot-v1')
# # # env = gym.make('MountainCar-v0')
# # # env = gym.make('Pendulum-v0')
# # # env = gym.make('Blackjack-v0')
# # # env = gym.make('FrozenLake-v0')
# # # env = gym.make('AirRaid-ram-v0')
# # # env = gym.make('AirRaid-v0')
# # # env = gym.make('BipedalWalker-v2')
# # # env = gym.make('Copy-v0')
# # # env = gym.make('CarRacing-v0')
# # # env = gym.make('Ant-v2') #mujoco
# # # env = gym.make('FetchPickAndPlace-v1') # mujoco required!

# with tf.Session() as sess:
#     #sess.run(tf.global_variables_initializer())
#     saver.restore(sess, 'checkpoints/model-nav.ckpt')    
#     #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
#     # Episodes/epochs
#     for _ in range(1):
#         state = env.reset()
#         total_reward = 0

#         # Steps/batches
#         #for _ in range(111111111111111111):
#         while True:
#             env.render()
#             action_logits = sess.run(model.actions_logits, feed_dict={model.states: np.reshape(state, [1, -1])})
#             action = np.argmax(action_logits)
#             state, reward, done, _ = env.step(action)
#             total_reward += reward
#             if done:
#                 break
                
#         # Closing the env
#         print('total_reward: {:.2f}'.format(total_reward))
#         env.close()