# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```python
env = UnityEnvironment(file_name="Banana.app")
```
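
If you move between machines, the list above can be captured in a small helper. This is only a sketch: `banana_file_name` and its parameters are hypothetical names (not part of `unityagents`), and `"path/to"` is the same placeholder directory used in the list.

```python
def banana_file_name(system, is_64bit, headless=False, base_dir="path/to"):
    """Return the build path from the list above for a given platform.

    `system` follows the values of platform.system(): "Darwin" (Mac),
    "Windows" or "Linux". Hypothetical helper, not part of unityagents.
    """
    if system == "Darwin":
        return f"{base_dir}/Banana.app"
    if system == "Windows":
        folder = "Banana_Windows_x86_64" if is_64bit else "Banana_Windows_x86"
        return f"{base_dir}/{folder}/Banana.exe"
    if system == "Linux":
        # Only the Linux builds in the list have a headless (NoVis) variant.
        folder = "Banana_Linux_NoVis" if headless else "Banana_Linux"
        binary = "Banana.x86_64" if is_64bit else "Banana.x86"
        return f"{base_dir}/{folder}/{binary}"
    raise ValueError(f"unsupported platform: {system!r}")


print(banana_file_name("Linux", True, headless=True))
# path/to/Banana_Linux_NoVis/Banana.x86_64
```

The returned string is what you would pass as `file_name` to `UnityEnvironment`, after replacing `base_dir` with the real download location.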

In [2]:
# env = UnityEnvironment(file_name="/home/arasdar/VisualBanana_Linux/Banana.x86")
env = UnityEnvironment(file_name="/home/arasdar/Banana_Linux/Banana.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.
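
To make the action indices and the reward scheme concrete, here is a minimal sketch in plain Python. The names `ACTION_NAMES`, `BANANA_REWARD` and `episode_score` are illustrative only; the environment itself reports rewards as numbers and does not expose these names.

```python
# Action indices and reward values as described above.
ACTION_NAMES = {0: "walk forward", 1: "walk backward", 2: "turn left", 3: "turn right"}
BANANA_REWARD = {"yellow": 1.0, "blue": -1.0}


def episode_score(collected):
    """Sum the rewards for a sequence of collected banana colours."""
    return sum(BANANA_REWARD[colour] for colour in collected)


print(ACTION_NAMES[2])                              # turn left
print(episode_score(["yellow", "yellow", "blue"]))  # 1.0
```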

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
# print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)
# print(state.shape, len(env_info.vector_observations), env_info.vector_observations.shape)

Number of agents: 1
Number of actions: 4
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance as it selects an action (uniformly) at random at each time step.  A window should pop up that allows you to observe the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!
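
One common way to move from purely random actions toward learned ones is epsilon-greedy selection. The sketch below is an assumption about how you might do this, not the method used later in this notebook; `epsilon_greedy` and the value estimates in `q_values` are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, -0.3, 0.8, 0.2]              # made-up value estimates for the 4 actions
print(epsilon_greedy(q_values, epsilon=0.0))  # always greedy: prints 2
```

With `epsilon=1.0` the function behaves like the random policy in the next cell; lowering `epsilon` over time shifts the agent toward exploiting what it has learned.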

In [5]:
env_info = env.reset(train_mode=False)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        print(state.shape)
        break
    
print("Score: {}".format(score))

(37,)
Score: 0.0


When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    #print(state)
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))

Score: 0.0


In [8]:
# Check the installed TensorFlow version and whether a GPU is available
# (an empty device name means TensorFlow will run on the CPU)
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [9]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
batch = []
while True: # infinite number of steps
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    #print(state, action, reward, done)
    batch.append([action, state, reward, done])
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
# print("Score: {}".format(score))

In [10]:
batch[0], batch[0][1].shape

([0, array([0.        , 0.        , 1.        , 0.        , 0.07196155,
         0.        , 0.        , 1.        , 0.        , 0.04892623,
         0.        , 1.        , 0.        , 0.        , 0.71655846,
         1.        , 0.        , 0.        , 0.        , 0.34735984,
         1.        , 0.        , 0.        , 0.        , 0.34166035,
         0.        , 0.        , 1.        , 0.        , 0.0420079 ,
         0.        , 0.        , 1.        , 0.        , 0.21867581,
         0.        , 0.        ]), 0.0, False], (37,))

In [11]:
batch[0][1].shape

(37,)

In [12]:
batch[0]

[0, array([0.        , 0.        , 1.        , 0.        , 0.07196155,
        0.        , 0.        , 1.        , 0.        , 0.04892623,
        0.        , 1.        , 0.        , 0.        , 0.71655846,
        1.        , 0.        , 0.        , 0.        , 0.34735984,
        1.        , 0.        , 0.        , 0.        , 0.34166035,
        0.        , 0.        , 1.        , 0.        , 0.0420079 ,
        0.        , 0.        , 1.        , 0.        , 0.21867581,
        0.        , 0.        ]), 0.0, False]

In [13]:
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
# infos = np.array([each[4] for each in batch])
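
The cell above pulls each field out of the batch with a separate list comprehension. An equivalent way to see what is happening is to transpose the batch with `zip(*...)`. The sketch below uses made-up two-element states and `toy_` names so that it does not touch the notebook's own variables.

```python
# Each transition is [action, state, reward, done], matching the batch built above.
# The values are placeholders; real states have 37 entries.
toy_batch = [
    [0, [0.0, 1.0], 0.0, False],
    [3, [0.5, 0.2], 1.0, False],
    [1, [0.1, 0.9], -1.0, True],
]

# zip(*toy_batch) groups the first fields together, then the second fields, and so on.
toy_actions, toy_states, toy_rewards, toy_dones = (list(col) for col in zip(*toy_batch))

print(toy_actions)  # [0, 3, 1]
print(toy_rewards)  # [0.0, 1.0, -1.0]
print(toy_dones)    # [False, False, True]
```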

In [14]:
# print(rewards[:])
print(rewards.shape, states.shape, actions.shape, dones.shape)
print(rewards.dtype, states.dtype, actions.dtype, dones.dtype)
print(np.max(actions), np.min(actions), (np.max(actions) - np.min(actions)) + 1)
print(np.max(rewards), np.min(rewards))
print(np.max(states), np.min(states))

(300,) (300, 37) (300,) (300,)
float64 float64 int64 bool
3 0 4
0.0 -1.0
10.589945793151855 -10.711225509643555


In [16]:
# Data of the model
def model_input(state_size):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    rewards = tf.placeholder(tf.float32, [None], name='rewards')
    rate = tf.placeholder(tf.float32, [], name='rate')
    return states, actions, targetQs, rewards, rate

In [17]:
# Generator/Controller: generating/predicting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits
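
The expression `tf.maximum(alpha * bn1, bn1)` in the generator is a leaky ReLU: positive inputs pass through unchanged and negative inputs are scaled by `alpha`. A scalar sketch in plain Python, where `leaky_relu` is an illustrative name:

```python
def leaky_relu(x, alpha=0.1):
    """Leaky ReLU written as max(alpha * x, x), valid for 0 <= alpha <= 1."""
    return max(alpha * x, x)


print([leaky_relu(x) for x in (-2.0, 0.0, 3.0)])  # [-0.2, 0.0, 3.0]
```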

In [18]:
# Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [19]:
def model_loss(action_size, hidden_size, states, actions, targetQs, rewards, rate):
    # G
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                      labels=actions_labels)
    Qs_labels = rewards[:-1] + (0.99 * targetQs[1:])
    #g_loss = tf.reduce_mean(neg_log_prob_actions[:-1] * targetQs[1:])
    g_loss = tf.reduce_mean(neg_log_prob_actions[:-1] * Qs_labels)
    
    # D
    Qs_logits = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    d_lossR = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs_logits,
                                                                     labels=rate * tf.ones_like(Qs_logits)))
    d_lossQ = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=tf.reshape(Qs_logits[:-1], shape=[-1]),
                                                                     labels=tf.nn.sigmoid(Qs_labels)))
    d_loss = d_lossR + d_lossQ

    return actions_logits, Qs_logits, g_loss, d_loss, d_lossR, d_lossQ
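
The line `Qs_labels = rewards[:-1] + (0.99 * targetQs[1:])` in `model_loss` pairs each step's reward with the discounted value estimate of the following step. The last step has no successor, so the result is one element shorter than the episode. A minimal sketch of that arithmetic with plain lists, where `one_step_targets` and the `toy_` values are hypothetical:

```python
def one_step_targets(rewards, q_values, gamma=0.99):
    """Return rewards[t] + gamma * q_values[t + 1] for every step except the last."""
    return [r + gamma * q_next for r, q_next in zip(rewards[:-1], q_values[1:])]


toy_rewards = [0.0, 1.0, 0.0, -1.0]  # made-up per-step rewards
toy_q = [0.5, 0.2, 1.0, 0.0]         # made-up value estimates for the same steps
targets = one_step_targets(toy_rewards, toy_q)
print(len(targets))                  # 3, one fewer than the episode length
```

This is also why the losses slice `neg_log_prob_actions[:-1]` and `Qs_logits[:-1]`: both have to line up with the shortened target vector.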

In [20]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    """
    Get optimization operations in order
    :param g_loss: Generator loss Tensor for action prediction
    :param d_loss: Discriminator loss Tensor for reward prediction for generated/prob/logits action
    :param learning_rate: Learning rate for the Adam optimizers
    :return: A tuple of (generator training op, discriminator training op)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [21]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, self.rewards, self.rate = model_input(state_size=state_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_loss, self.d_lossR, self.d_lossQ = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, # model input
            targetQs=self.targetQs, rewards=self.rewards, rate=self.rate) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [22]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(300, 37) actions:(300,)
action size:4


In [23]:
# Training parameters
# Network parameters
state_size = 37              # number of units for the input state/observation -- simulation
action_size = 4              # number of units for the output actions -- simulation
hidden_size = 37*16          # number of units in each Q-network hidden layer -- simulation
learning_rate = 0.001          # learning rate for adam

In [24]:
# Reset/init the graph/session
tf.reset_default_graph()        # returns None, so fetch the fresh graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

In [25]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment

while True: # infinite number of steps
#for _ in range(batch_size):
    state = env_info.vector_observations[0]        # get the current state
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    #memory.buffer.append([action, state, done])
    if done:                                       # exit loop if episode finished
        break
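
The training loop that follows turns each episode's total reward into a `rate`: it divides by the target score of +13, clips the result to [-1, 1], and also computes `rate_prob`, the same value rescaled to [0, 1]. A minimal sketch of that arithmetic in plain Python, where the function name `success_rate` is hypothetical:

```python
def success_rate(total_reward, target=13.0):
    """Scale an episode score to [-1, 1] by the target, then map it to [0, 1]."""
    rate = max(-1.0, min(1.0, total_reward / target))
    rate_prob = (rate + 1.0) / 2.0
    return rate, rate_prob


print(success_rate(13.0))   # (1.0, 1.0)
print(success_rate(0.0))    # (0.0, 0.5)
print(success_rate(-26.0))  # (-1.0, 0.0)
```

Sigmoid cross-entropy is normally used with labels in [0, 1]. The loop feeds `rate` rather than `rate_prob` to `model.rate`, which would be consistent with the negative `dlossR` values that appear in the training log.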

In [None]:
from collections import deque
episodes_total_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
saver = tf.train.Saver()
rewards_list, g_loss_list, d_loss_list = [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    for ep in range(111111):
        batch = [] # every data batch
        total_reward = 0
        #state = env.reset() # env first state
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment

        # Training steps/batches
        while True:
            state = env_info.vector_observations[0]        # get the current state
            action_logits, Q_logits = sess.run(fetches=[model.actions_logits, model.Qs_logits], 
                                               feed_dict={model.states: np.reshape(state, [1, -1])})
            action = np.argmax(action_logits)
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            reward = env_info.rewards[0]                   # get the reward
            batch.append([state, action, Q_logits, reward])
            total_reward += reward
            done = env_info.local_done[0]                  # see if episode has finished
            if done: # episode ended success/failure
                episodes_total_reward.append(total_reward) # stopping criteria
                #rate = total_reward/ 500 # success is 500 points, rate is between 0 and +1 ~ sigmoid
                rate = total_reward/ +13 # success is +13; rate is between -1 and +1 ~ tanh
                if rate >= +1: rate = +1
                if rate <= -1: rate = -1
                # min -13, max +13
                # min -1, max +1
                #reward = x-(-13)/ 26 # 0-1
                #prob_rate = (rate - (-1))/ (1-(-1))
                #prob_rate = (rate - rate_min)/ (rate_max-rate_min)
                rate_prob = (rate + 1)/2 # success rate 0-1
                break

        # Training using batches
        #batch = memory.buffer
        states = np.array([each[0] for each in batch])
        actions = np.array([each[1] for each in batch])
        targetQs = np.array([each[2] for each in batch])
        rewards = np.array([each[3] for each in batch])
        g_loss, d_loss, d_lossR, d_lossQ, _, _ = sess.run([model.g_loss, model.d_loss, 
                                                           model.d_lossR, model.d_lossQ, 
                                                           model.g_opt, model.d_opt],
                                                          feed_dict = {model.states: states, 
                                                                       model.actions: actions,
                                                                       model.targetQs: targetQs.reshape([-1]),
                                                                       model.rewards: rewards, 
                                                                       model.rate: rate})
        # Average 100 episode total reward
        # Print out
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episodes_total_reward)),
              'rate:{:.4f}'.format(rate),
              'gloss:{:.4f}'.format(g_loss),
              'dloss:{:.4f}'.format(d_loss),
              'dlossR:{:.4f}'.format(d_lossR),
              'dlossQ:{:.4f}'.format(d_lossQ))
        # Plotting out
        rewards_list.append([ep, np.mean(episodes_total_reward)])
        g_loss_list.append([ep, g_loss])
        d_loss_list.append([ep, d_loss])
        # Break episode/epoch loop
        if np.mean(episodes_total_reward) >= +13:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model-nav.ckpt')

Episode:0 meanR:-1.0000 rate:-0.0769 gloss:-0.0960 dloss:1.3444 dlossR:0.6518 dlossQ:0.6925
Episode:1 meanR:-1.0000 rate:-0.0769 gloss:-1.0128 dloss:0.8274 dlossR:0.2438 dlossQ:0.5836
Episode:2 meanR:-1.0000 rate:-0.0769 gloss:-1.9897 dloss:0.3188 dlossR:-0.0569 dlossQ:0.3757
Episode:3 meanR:-1.0000 rate:-0.0769 gloss:-2.2554 dloss:0.3928 dlossR:-0.0129 dlossQ:0.4057
Episode:4 meanR:-0.8000 rate:0.0000 gloss:-2.7214 dloss:0.4067 dlossR:0.0958 dlossQ:0.3109
Episode:5 meanR:-0.8333 rate:-0.0769 gloss:-3.4543 dloss:0.0417 dlossR:-0.1692 dlossQ:0.2110
Episode:6 meanR:-0.7143 rate:0.0000 gloss:-4.2276 dloss:0.1659 dlossR:0.0290 dlossQ:0.1369
Episode:7 meanR:-0.6250 rate:0.0000 gloss:-4.7804 dloss:0.0998 dlossR:0.0159 dlossQ:0.0839
Episode:8 meanR:-0.5556 rate:0.0000 gloss:-6.7373 dloss:0.0364 dlossR:0.0048 dlossQ:0.0316
Episode:9 meanR:-0.4000 rate:0.0769 gloss:-6.3728 dloss:0.6303 dlossR:0.6170 dlossQ:0.0133
Episode:10 meanR:-0.4545 rate:-0.0769 gloss:-6.2946 dloss:-0.6500 dlossR:-0.6578 d

Episode:88 meanR:-0.1011 rate:0.0000 gloss:-22.3054 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:89 meanR:-0.1222 rate:-0.1538 gloss:-23.0185 dloss:-2.7985 dlossR:-2.7986 dlossQ:0.0000
Episode:90 meanR:-0.1209 rate:0.0000 gloss:-28.3139 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:91 meanR:-0.1413 rate:-0.1538 gloss:-32.0400 dloss:-3.8123 dlossR:-3.8123 dlossQ:0.0000
Episode:92 meanR:-0.1505 rate:-0.0769 gloss:-28.9348 dloss:-1.7199 dlossR:-1.7200 dlossQ:0.0000
Episode:93 meanR:-0.1702 rate:-0.1538 gloss:-31.6225 dloss:-3.7509 dlossR:-3.7509 dlossQ:0.0000
Episode:94 meanR:-0.1789 rate:-0.0769 gloss:-39.7995 dloss:-2.3477 dlossR:-2.3477 dlossQ:0.0000
Episode:95 meanR:-0.1875 rate:-0.0769 gloss:-43.2032 dloss:-2.5325 dlossR:-2.5325 dlossQ:0.0000
Episode:96 meanR:-0.1753 rate:0.0769 gloss:-50.5707 dloss:3.0322 dlossR:3.0322 dlossQ:0.0000
Episode:97 meanR:-0.1735 rate:0.0000 gloss:-48.0225 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:98 meanR:-0.1818 rate:-0.0769 gloss:-40.7710

Episode:174 meanR:-0.2600 rate:0.2308 gloss:-447.9164 dloss:76.4020 dlossR:76.4020 dlossQ:0.0000
Episode:175 meanR:-0.2600 rate:-0.0769 gloss:-399.7676 dloss:-22.7611 dlossR:-22.7611 dlossQ:0.0000
Episode:176 meanR:-0.2600 rate:0.0000 gloss:-369.2938 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:177 meanR:-0.2500 rate:0.0769 gloss:-373.4454 dloss:21.3266 dlossR:21.3266 dlossQ:0.0000
Episode:178 meanR:-0.2500 rate:0.0000 gloss:-379.1991 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:179 meanR:-0.2500 rate:0.0000 gloss:-323.5201 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:180 meanR:-0.2800 rate:-0.2308 gloss:-360.5891 dloss:-61.3709 dlossR:-61.3709 dlossQ:0.0000
Episode:181 meanR:-0.2800 rate:0.0000 gloss:-360.7635 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:182 meanR:-0.2900 rate:-0.0769 gloss:-387.2152 dloss:-22.0906 dlossR:-22.0906 dlossQ:0.0000
Episode:183 meanR:-0.2900 rate:0.0000 gloss:-387.1479 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:184 meanR:-0.2900 rate:

Episode:259 meanR:-0.2300 rate:0.0769 gloss:-1175.0355 dloss:66.3027 dlossR:66.3027 dlossQ:0.0000
Episode:260 meanR:-0.2200 rate:0.0000 gloss:-1043.8257 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:261 meanR:-0.2400 rate:-0.1538 gloss:-1100.7161 dloss:-124.8912 dlossR:-124.8912 dlossQ:0.0000
Episode:262 meanR:-0.2200 rate:0.0000 gloss:-1097.1277 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:263 meanR:-0.2000 rate:0.0000 gloss:-1092.0068 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:264 meanR:-0.2300 rate:-0.3077 gloss:-1337.6400 dloss:-302.6616 dlossR:-302.6616 dlossQ:0.0000
Episode:265 meanR:-0.2300 rate:0.0000 gloss:-1237.0875 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:266 meanR:-0.2000 rate:0.0769 gloss:-1359.6835 dloss:77.4857 dlossR:77.4857 dlossQ:0.0000
Episode:267 meanR:-0.1800 rate:0.0000 gloss:-1387.4218 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:268 meanR:-0.1800 rate:0.0000 gloss:-1536.4841 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:269 meanR:-0.2

Episode:343 meanR:-0.1400 rate:-0.0769 gloss:-1078.4763 dloss:-60.7371 dlossR:-60.7371 dlossQ:0.0000
Episode:344 meanR:-0.1400 rate:0.0000 gloss:-1087.7310 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:345 meanR:-0.1300 rate:-0.0769 gloss:-1480.0924 dloss:-83.2410 dlossR:-83.2410 dlossQ:0.0000
Episode:346 meanR:-0.1300 rate:-0.0769 gloss:-1581.4884 dloss:-88.7949 dlossR:-88.7949 dlossQ:0.0000
Episode:347 meanR:-0.1400 rate:-0.0769 gloss:-1529.4735 dloss:-86.0644 dlossR:-86.0644 dlossQ:0.0000
Episode:348 meanR:-0.1300 rate:0.0769 gloss:-1532.7257 dloss:86.2775 dlossR:86.2775 dlossQ:0.0000
Episode:349 meanR:-0.1300 rate:0.0000 gloss:-1445.9327 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:350 meanR:-0.1100 rate:0.0000 gloss:-1475.1337 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:351 meanR:-0.1300 rate:-0.1538 gloss:-1375.8595 dloss:-155.1231 dlossR:-155.1231 dlossQ:0.0000
Episode:352 meanR:-0.1300 rate:-0.0769 gloss:-1477.3549 dloss:-83.1807 dlossR:-83.1807 dlossQ:0.0000

Episode:427 meanR:-0.0200 rate:0.0769 gloss:-1270.3295 dloss:71.4443 dlossR:71.4443 dlossQ:0.0000
Episode:428 meanR:-0.0500 rate:-0.0769 gloss:-1484.8170 dloss:-83.3979 dlossR:-83.3979 dlossQ:0.0000
Episode:429 meanR:-0.0400 rate:0.1538 gloss:-1324.5264 dloss:148.7534 dlossR:148.7534 dlossQ:0.0000
Episode:430 meanR:-0.0700 rate:-0.1538 gloss:-1188.9473 dloss:-133.6135 dlossR:-133.6135 dlossQ:0.0000
Episode:431 meanR:-0.0500 rate:0.1538 gloss:-1429.7743 dloss:160.5686 dlossR:160.5686 dlossQ:0.0000
Episode:432 meanR:-0.0500 rate:0.0000 gloss:-1738.6239 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:433 meanR:-0.0400 rate:0.0000 gloss:-1429.0695 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:434 meanR:-0.0200 rate:0.0769 gloss:-1705.1191 dloss:95.6519 dlossR:95.6519 dlossQ:0.0000
Episode:435 meanR:-0.0400 rate:0.0000 gloss:-1567.4060 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:436 meanR:-0.0800 rate:-0.3846 gloss:-1190.7365 dloss:-334.5394 dlossR:-334.5394 dlossQ:0.0000

Episode:511 meanR:-0.0100 rate:0.0000 gloss:-1350.2809 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:512 meanR:-0.0300 rate:-0.0769 gloss:-1285.3473 dloss:-72.1069 dlossR:-72.1069 dlossQ:0.0000
Episode:513 meanR:-0.0300 rate:0.0000 gloss:-1402.0217 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:514 meanR:-0.0300 rate:0.0000 gloss:-1237.4478 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:515 meanR:-0.0300 rate:0.0000 gloss:-1005.5048 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:516 meanR:-0.0300 rate:0.0000 gloss:-1204.1154 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:517 meanR:0.0200 rate:0.3077 gloss:-987.9174 dloss:221.5622 dlossR:221.5622 dlossQ:0.0000
Episode:518 meanR:0.0700 rate:0.0769 gloss:-949.5247 dloss:53.3119 dlossR:53.3119 dlossQ:0.0000
Episode:519 meanR:0.1100 rate:0.0000 gloss:-1368.3026 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:520 meanR:0.1200 rate:0.0769 gloss:-1058.7081 dloss:59.3644 dlossR:59.3644 dlossQ:0.0000
Episode:521 meanR:0.1000 rate:-0.

Episode:596 meanR:0.0200 rate:0.0769 gloss:-1109.3406 dloss:62.2924 dlossR:62.2924 dlossQ:0.0000
Episode:597 meanR:0.0200 rate:0.0000 gloss:-1125.5854 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:598 meanR:-0.0400 rate:0.0769 gloss:-1350.0594 dloss:75.7491 dlossR:75.7491 dlossQ:0.0000
Episode:599 meanR:-0.0300 rate:0.0000 gloss:-1478.8955 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:600 meanR:-0.0300 rate:0.0000 gloss:-1339.5002 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:601 meanR:-0.0300 rate:0.0000 gloss:-1148.3549 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:602 meanR:-0.0300 rate:0.0000 gloss:-1317.1490 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:603 meanR:-0.0400 rate:-0.0769 gloss:-1211.3918 dloss:-67.9417 dlossR:-67.9417 dlossQ:0.0000
Episode:604 meanR:-0.0500 rate:-0.1538 gloss:-1283.5492 dloss:-143.9723 dlossR:-143.9723 dlossQ:0.0000
Episode:605 meanR:-0.0600 rate:0.0000 gloss:-1100.0135 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:606 meanR:-0.0400 

Episode:681 meanR:0.0400 rate:0.0000 gloss:-1026.7194 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:682 meanR:0.0400 rate:0.0000 gloss:-975.4412 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:683 meanR:0.0600 rate:0.2308 gloss:-887.8188 dloss:149.4717 dlossR:149.4717 dlossQ:0.0000
Episode:684 meanR:0.0700 rate:0.1538 gloss:-848.8059 dloss:95.3968 dlossR:95.3968 dlossQ:0.0000
Episode:685 meanR:0.0700 rate:0.0769 gloss:-979.8604 dloss:54.9393 dlossR:54.9393 dlossQ:0.0000
Episode:686 meanR:0.0800 rate:-0.0769 gloss:-848.9452 dloss:-47.7313 dlossR:-47.7313 dlossQ:0.0000
Episode:687 meanR:0.0900 rate:0.0000 gloss:-991.2678 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:688 meanR:0.0800 rate:0.0000 gloss:-1289.1023 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:689 meanR:0.0900 rate:0.0000 gloss:-1319.1196 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:690 meanR:0.1200 rate:0.0769 gloss:-1140.7622 dloss:63.9553 dlossR:63.9553 dlossQ:0.0000
Episode:691 meanR:0.0900 rate:-0.0769 glo

Episode:767 meanR:0.3600 rate:0.0000 gloss:-583.0620 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:768 meanR:0.3500 rate:-0.0769 gloss:-374.3598 dloss:-21.0474 dlossR:-21.0474 dlossQ:0.0000
Episode:769 meanR:0.3200 rate:0.0769 gloss:-350.2140 dloss:19.6086 dlossR:19.6086 dlossQ:0.0000
Episode:770 meanR:0.3200 rate:0.0000 gloss:-361.2056 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:771 meanR:0.3300 rate:0.0769 gloss:-375.7027 dloss:21.1182 dlossR:21.1182 dlossQ:0.0000
Episode:772 meanR:0.3300 rate:0.0000 gloss:-320.7170 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:773 meanR:0.3300 rate:0.0769 gloss:-359.6692 dloss:20.1991 dlossR:20.1991 dlossQ:0.0000
Episode:774 meanR:0.3400 rate:0.0769 gloss:-367.1304 dloss:20.5829 dlossR:20.5829 dlossQ:0.0000
Episode:775 meanR:0.3300 rate:0.0000 gloss:-499.4108 dloss:0.0000 dlossR:0.0000 dlossQ:0.0000
Episode:776 meanR:0.3300 rate:0.1538 gloss:-517.7791 dloss:58.0466 dlossR:58.0466 dlossQ:0.0000
Episode:777 meanR:0.3200 rate:-0.1538 gloss:-

Episode:853 meanR:0.2300 rate:0.0769 gloss:-131.7823 dloss:8.4290 dlossR:7.7353 dlossQ:0.6937
Episode:854 meanR:0.2400 rate:0.1538 gloss:-91.4142 dloss:12.2171 dlossR:11.4020 dlossQ:0.8151
Episode:855 meanR:0.3000 rate:0.5385 gloss:-91.0146 dloss:37.8485 dlossR:37.0636 dlossQ:0.7850
Episode:856 meanR:0.2900 rate:-0.0769 gloss:-60.0790 dloss:1.7886 dlossR:-0.1389 dlossQ:1.9276
Episode:857 meanR:0.2900 rate:0.0000 gloss:-54.5755 dloss:5.5006 dlossR:3.4009 dlossQ:2.0997
Episode:858 meanR:0.2600 rate:-0.2308 gloss:-56.3348 dloss:-4.9660 dlossR:-7.0627 dlossQ:2.0967
Episode:859 meanR:0.3200 rate:0.5385 gloss:-57.3140 dloss:29.2410 dlossR:26.3358 dlossQ:2.9053
Episode:860 meanR:0.3700 rate:0.4615 gloss:-55.6066 dloss:26.2788 dlossR:22.5684 dlossQ:3.7104
Episode:861 meanR:0.4000 rate:0.3846 gloss:-78.4146 dloss:27.3583 dlossR:24.6094 dlossQ:2.7489
Episode:862 meanR:0.3800 rate:-0.1538 gloss:-99.8030 dloss:-7.4333 dlossR:-9.9403 dlossQ:2.5070
Episode:863 meanR:0.3900 rate:0.0769 gloss:-239.364

Episode:940 meanR:1.2900 rate:-0.0769 gloss:-21.9936 dloss:-1.2485 dlossR:-1.2498 dlossQ:0.0013
Episode:941 meanR:1.2600 rate:-0.1538 gloss:-139.0272 dloss:-15.5944 dlossR:-15.5951 dlossQ:0.0008
Episode:942 meanR:1.2400 rate:-0.0769 gloss:-50.7199 dloss:-2.8620 dlossR:-2.8623 dlossQ:0.0003
Episode:943 meanR:1.1900 rate:-0.0769 gloss:-23.8521 dloss:-1.3527 dlossR:-1.3535 dlossQ:0.0008
Episode:944 meanR:1.1900 rate:-0.0769 gloss:-19.9021 dloss:-1.1216 dlossR:-1.1222 dlossQ:0.0006
Episode:945 meanR:1.2100 rate:0.1538 gloss:-19.2694 dloss:2.2252 dlossR:2.2174 dlossQ:0.0078
Episode:946 meanR:1.2300 rate:0.3846 gloss:-21.5767 dloss:6.2034 dlossR:6.1955 dlossQ:0.0079
Episode:947 meanR:1.2800 rate:0.3846 gloss:-18.9010 dloss:5.3639 dlossR:5.3620 dlossQ:0.0019
Episode:948 meanR:1.3000 rate:0.1538 gloss:-22.2992 dloss:2.5141 dlossR:2.5121 dlossQ:0.0020
Episode:949 meanR:1.2800 rate:0.2308 gloss:-18.7416 dloss:3.2778 dlossR:3.2667 dlossQ:0.0111
Episode:950 meanR:1.3200 rate:0.1538 gloss:-17.4518 

Episode:1028 meanR:1.8000 rate:0.4615 gloss:-1.4379 dloss:1.4902 dlossR:0.8536 dlossQ:0.6366
Episode:1029 meanR:1.7900 rate:0.0769 gloss:-1.4114 dloss:1.0464 dlossR:0.4170 dlossQ:0.6294
Episode:1030 meanR:1.8000 rate:0.1538 gloss:-1.5933 dloss:1.1352 dlossR:0.5047 dlossQ:0.6305
Episode:1031 meanR:1.7900 rate:0.0000 gloss:-1.8225 dloss:0.8246 dlossR:0.2587 dlossQ:0.5659
Episode:1032 meanR:1.8100 rate:0.2308 gloss:-2.0922 dloss:1.1770 dlossR:0.6133 dlossQ:0.5636
Episode:1033 meanR:1.8000 rate:0.0000 gloss:-53.6066 dloss:0.0970 dlossR:0.0111 dlossQ:0.0859
Episode:1034 meanR:1.8000 rate:0.0769 gloss:-2.7039 dloss:0.7297 dlossR:0.3062 dlossQ:0.4235
Episode:1035 meanR:1.8100 rate:0.0769 gloss:-3.8069 dloss:0.7893 dlossR:0.3488 dlossQ:0.4405
Episode:1036 meanR:1.8000 rate:-0.0769 gloss:-3.4877 dloss:0.2155 dlossR:-0.1144 dlossQ:0.3299
Episode:1037 meanR:1.8300 rate:0.1538 gloss:-3.2841 dloss:0.8547 dlossR:0.4950 dlossQ:0.3597
Episode:1038 meanR:1.8600 rate:0.2308 gloss:-3.7408 dloss:1.0068 dl

Episode:1116 meanR:2.1700 rate:0.0000 gloss:-2.1832 dloss:1.3458 dlossR:0.4230 dlossQ:0.9228
Episode:1117 meanR:2.1900 rate:0.2308 gloss:-3.2332 dloss:1.9310 dlossR:0.9967 dlossQ:0.9343
Episode:1118 meanR:2.2200 rate:0.3077 gloss:-1.5427 dloss:1.6395 dlossR:0.8258 dlossQ:0.8137
Episode:1119 meanR:2.2000 rate:0.0000 gloss:-21.4873 dloss:0.4759 dlossR:0.1222 dlossQ:0.3537
Episode:1120 meanR:2.1800 rate:0.1538 gloss:-1.1184 dloss:1.3605 dlossR:0.5949 dlossQ:0.7655
Episode:1121 meanR:2.1300 rate:0.1538 gloss:-3.2395 dloss:1.5545 dlossR:0.7604 dlossQ:0.7940
Episode:1122 meanR:2.1300 rate:0.1538 gloss:-1.6668 dloss:1.2525 dlossR:0.5573 dlossQ:0.6951
Episode:1123 meanR:2.1100 rate:0.0769 gloss:-2.6816 dloss:1.1791 dlossR:0.4346 dlossQ:0.7444
Episode:1124 meanR:2.0900 rate:0.0000 gloss:-2.3534 dloss:0.9630 dlossR:0.2905 dlossQ:0.6725
Episode:1125 meanR:2.0700 rate:0.0769 gloss:-1.8973 dloss:1.0935 dlossR:0.4145 dlossQ:0.6789
Episode:1126 meanR:2.0700 rate:0.1538 gloss:-2.7432 dloss:1.1167 dlos

Episode:1204 meanR:2.2900 rate:0.3077 gloss:-2.1489 dloss:1.4352 dlossR:0.7978 dlossQ:0.6374
Episode:1205 meanR:2.3500 rate:0.2308 gloss:-1.7803 dloss:1.2215 dlossR:0.6086 dlossQ:0.6129
Episode:1206 meanR:2.3900 rate:0.3077 gloss:-1.7827 dloss:1.4544 dlossR:0.7804 dlossQ:0.6740
Episode:1207 meanR:2.3400 rate:0.2308 gloss:-1.3224 dloss:1.3474 dlossR:0.6575 dlossQ:0.6899
Episode:1208 meanR:2.3100 rate:0.3077 gloss:-2.5089 dloss:1.7232 dlossR:0.9666 dlossQ:0.7566
Episode:1209 meanR:2.3000 rate:0.3846 gloss:-0.9319 dloss:1.4133 dlossR:0.7216 dlossQ:0.6916
Episode:1210 meanR:2.3200 rate:0.4615 gloss:-1.1931 dloss:1.6855 dlossR:0.9427 dlossQ:0.7428
Episode:1211 meanR:2.3000 rate:0.1538 gloss:-1.5301 dloss:1.2743 dlossR:0.5834 dlossQ:0.6909
Episode:1212 meanR:2.2700 rate:0.3077 gloss:-2.1435 dloss:1.6816 dlossR:0.8931 dlossQ:0.7885
Episode:1213 meanR:2.2800 rate:0.0769 gloss:-2.8565 dloss:1.2466 dlossR:0.5279 dlossQ:0.7187
Episode:1214 meanR:2.3300 rate:0.4615 gloss:-1.3882 dloss:1.5109 dloss

Episode:1293 meanR:2.4900 rate:-0.0769 gloss:-1.7851 dloss:0.8202 dlossR:0.2410 dlossQ:0.5792
Episode:1294 meanR:2.4600 rate:0.1538 gloss:-1.9938 dloss:1.0069 dlossR:0.4658 dlossQ:0.5411
Episode:1295 meanR:2.5000 rate:0.3846 gloss:-2.3834 dloss:1.2770 dlossR:0.8405 dlossQ:0.4365
Episode:1296 meanR:2.4900 rate:0.1538 gloss:-2.2634 dloss:0.8899 dlossR:0.4358 dlossQ:0.4541
Episode:1297 meanR:2.5100 rate:0.3077 gloss:-2.9067 dloss:1.2365 dlossR:0.8152 dlossQ:0.4213
Episode:1298 meanR:2.5600 rate:0.4615 gloss:-3.7318 dloss:1.9246 dlossR:1.4381 dlossQ:0.4865
Episode:1299 meanR:2.5200 rate:-0.0769 gloss:-6.2570 dloss:0.3540 dlossR:-0.2001 dlossQ:0.5541
Episode:1300 meanR:2.4900 rate:0.0769 gloss:-7.6981 dloss:1.1384 dlossR:0.5772 dlossQ:0.5612
Episode:1301 meanR:2.5200 rate:0.2308 gloss:-1.7670 dloss:1.0933 dlossR:0.5545 dlossQ:0.5388
Episode:1302 meanR:2.5100 rate:0.1538 gloss:-5.1915 dloss:1.3690 dlossR:0.7833 dlossQ:0.5857
Episode:1303 meanR:2.4600 rate:0.0000 gloss:-2.6205 dloss:0.7817 dl

Episode:1381 meanR:2.2300 rate:0.4615 gloss:-1.3604 dloss:1.4156 dlossR:0.7996 dlossQ:0.6160
Episode:1382 meanR:2.2100 rate:0.0000 gloss:-2.0446 dloss:0.7635 dlossR:0.2398 dlossQ:0.5236
Episode:1383 meanR:2.1900 rate:-0.0769 gloss:-1.9033 dloss:0.6817 dlossR:0.1468 dlossQ:0.5349
Episode:1384 meanR:2.1800 rate:0.0769 gloss:-3.1205 dloss:0.8345 dlossR:0.3575 dlossQ:0.4770
Episode:1385 meanR:2.1600 rate:0.2308 gloss:-2.6287 dloss:1.0876 dlossR:0.6231 dlossQ:0.4645
Episode:1386 meanR:2.1300 rate:-0.0769 gloss:-6.4149 dloss:0.1558 dlossR:-0.2701 dlossQ:0.4259
Episode:1387 meanR:2.1400 rate:0.0769 gloss:-5.1775 dloss:0.8479 dlossR:0.4141 dlossQ:0.4338
Episode:1388 meanR:2.1200 rate:0.0769 gloss:-3.7507 dloss:0.7167 dlossR:0.3341 dlossQ:0.3826
Episode:1389 meanR:2.1300 rate:0.0769 gloss:-5.5935 dloss:0.7071 dlossR:0.3905 dlossQ:0.3166
Episode:1390 meanR:2.1200 rate:0.3846 gloss:-4.5905 dloss:1.7525 dlossR:1.4029 dlossQ:0.3496
Episode:1391 meanR:2.0900 rate:-0.0769 gloss:-4.6167 dloss:0.0941 d

Episode:1469 meanR:1.9200 rate:0.2308 gloss:-2.6106 dloss:1.0423 dlossR:0.5989 dlossQ:0.4434
Episode:1470 meanR:1.9400 rate:0.2308 gloss:-1.9138 dloss:1.1909 dlossR:0.6404 dlossQ:0.5505
Episode:1471 meanR:1.9500 rate:0.2308 gloss:-0.9820 dloss:1.6102 dlossR:0.8729 dlossQ:0.7373
Episode:1472 meanR:1.8900 rate:0.0769 gloss:-1.8618 dloss:0.9376 dlossR:0.3771 dlossQ:0.5605
Episode:1473 meanR:1.8400 rate:0.0000 gloss:-2.0201 dloss:0.7378 dlossR:0.2215 dlossQ:0.5163
Episode:1474 meanR:1.8800 rate:0.6154 gloss:-1.9206 dloss:1.6171 dlossR:1.1007 dlossQ:0.5163
Episode:1475 meanR:1.8900 rate:0.2308 gloss:-2.0313 dloss:1.0443 dlossR:0.5529 dlossQ:0.4914
Episode:1476 meanR:1.8700 rate:0.0000 gloss:-4.4953 dloss:0.6549 dlossR:0.1591 dlossQ:0.4958
Episode:1477 meanR:1.9200 rate:0.4615 gloss:-1.9946 dloss:1.3934 dlossR:0.8951 dlossQ:0.4984
Episode:1478 meanR:1.9400 rate:0.1538 gloss:-5.9773 dloss:1.3703 dlossR:0.8300 dlossQ:0.5403
Episode:1479 meanR:1.9600 rate:0.3077 gloss:-2.3399 dloss:1.3599 dloss

Episode:1557 meanR:1.6600 rate:0.0000 gloss:-3.0283 dloss:0.5261 dlossR:0.1362 dlossQ:0.3899
Episode:1558 meanR:1.6900 rate:0.2308 gloss:-2.4771 dloss:1.0070 dlossR:0.5810 dlossQ:0.4260
Episode:1559 meanR:1.7000 rate:0.2308 gloss:-2.9902 dloss:1.0327 dlossR:0.6383 dlossQ:0.3944
Episode:1560 meanR:1.7100 rate:0.2308 gloss:-2.6680 dloss:0.9986 dlossR:0.5956 dlossQ:0.4030
Episode:1561 meanR:1.6900 rate:0.0769 gloss:-4.3174 dloss:0.7089 dlossR:0.3425 dlossQ:0.3664
Episode:1562 meanR:1.6800 rate:0.1538 gloss:-4.1521 dloss:0.9613 dlossR:0.5776 dlossQ:0.3837
Episode:1563 meanR:1.7400 rate:0.2308 gloss:-3.5101 dloss:1.1370 dlossR:0.7214 dlossQ:0.4155
Episode:1564 meanR:1.7600 rate:0.1538 gloss:-3.1012 dloss:0.9224 dlossR:0.4976 dlossQ:0.4248
Episode:1565 meanR:1.7700 rate:0.2308 gloss:-2.7067 dloss:1.1080 dlossR:0.6390 dlossQ:0.4691
Episode:1566 meanR:1.7500 rate:0.1538 gloss:-3.5315 dloss:1.0181 dlossR:0.5534 dlossQ:0.4647
Episode:1567 meanR:1.7900 rate:0.3077 gloss:-5.9491 dloss:1.9341 dloss

Episode:1645 meanR:1.8300 rate:0.0000 gloss:-2.8405 dloss:0.5358 dlossR:0.1368 dlossQ:0.3991
Episode:1646 meanR:1.8300 rate:0.1538 gloss:-2.1745 dloss:0.9081 dlossR:0.4369 dlossQ:0.4712
Episode:1647 meanR:1.8100 rate:0.0000 gloss:-2.6098 dloss:0.5679 dlossR:0.1515 dlossQ:0.4164
Episode:1648 meanR:1.8100 rate:0.0769 gloss:-3.3187 dloss:0.6166 dlossR:0.2890 dlossQ:0.3276
Episode:1649 meanR:1.8300 rate:0.3077 gloss:-3.2752 dloss:1.1703 dlossR:0.8432 dlossQ:0.3271
Episode:1650 meanR:1.8100 rate:0.0769 gloss:-2.7985 dloss:0.6844 dlossR:0.2943 dlossQ:0.3901
Episode:1651 meanR:1.8200 rate:0.0769 gloss:-3.0206 dloss:0.7025 dlossR:0.3008 dlossQ:0.4017
Episode:1652 meanR:1.8300 rate:0.2308 gloss:-2.8936 dloss:1.0223 dlossR:0.6256 dlossQ:0.3966
Episode:1653 meanR:1.8400 rate:0.1538 gloss:-3.4108 dloss:0.7863 dlossR:0.4758 dlossQ:0.3104
Episode:1654 meanR:1.8100 rate:0.0769 gloss:-2.8129 dloss:0.6960 dlossR:0.2975 dlossQ:0.3985
Episode:1655 meanR:1.8100 rate:0.0000 gloss:-3.6201 dloss:0.3637 dloss

Episode:1733 meanR:0.8100 rate:0.0000 gloss:-5.4188 dloss:0.1266 dlossR:0.0212 dlossQ:0.1054
Episode:1734 meanR:0.7600 rate:-0.0769 gloss:-3.9435 dloss:0.0563 dlossR:-0.1638 dlossQ:0.2201
Episode:1735 meanR:0.7200 rate:0.0769 gloss:-5.1129 dloss:0.4407 dlossR:0.3138 dlossQ:0.1269
Episode:1736 meanR:0.7200 rate:0.0000 gloss:-5.8934 dloss:0.1095 dlossR:0.0180 dlossQ:0.0914
Episode:1737 meanR:0.6900 rate:-0.0769 gloss:-6.0432 dloss:-0.2344 dlossR:-0.3221 dlossQ:0.0877
Episode:1738 meanR:0.6800 rate:0.0769 gloss:-5.8859 dloss:0.4630 dlossR:0.3521 dlossQ:0.1109
Episode:1739 meanR:0.6500 rate:0.0000 gloss:-6.3093 dloss:0.1218 dlossR:0.0194 dlossQ:0.1024
Episode:1740 meanR:0.6400 rate:0.0769 gloss:-8.3636 dloss:0.5148 dlossR:0.4750 dlossQ:0.0398
Episode:1741 meanR:0.6200 rate:0.0769 gloss:-6.1829 dloss:0.4735 dlossR:0.3675 dlossQ:0.1060
Episode:1742 meanR:0.6200 rate:0.0769 gloss:-5.5526 dloss:0.4345 dlossR:0.3319 dlossQ:0.1026
Episode:1743 meanR:0.5900 rate:0.0000 gloss:-4.7664 dloss:0.2174 

Episode:1821 meanR:0.6800 rate:0.5385 gloss:-4.5162 dloss:2.1101 dlossR:1.8504 dlossQ:0.2597
Episode:1822 meanR:0.7100 rate:0.1538 gloss:-3.7020 dloss:0.8103 dlossR:0.5036 dlossQ:0.3067
Episode:1823 meanR:0.6900 rate:0.0000 gloss:-3.8657 dloss:0.4055 dlossR:0.0884 dlossQ:0.3172
Episode:1824 meanR:0.6800 rate:0.0769 gloss:-3.8759 dloss:0.6311 dlossR:0.3094 dlossQ:0.3217
Episode:1825 meanR:0.7000 rate:0.0000 gloss:-5.9764 dloss:0.3736 dlossR:0.0721 dlossQ:0.3016
Episode:1826 meanR:0.7100 rate:0.0769 gloss:-4.5663 dloss:0.6804 dlossR:0.3495 dlossQ:0.3309
Episode:1827 meanR:0.7500 rate:0.3077 gloss:-3.4641 dloss:1.2624 dlossR:0.8939 dlossQ:0.3685
Episode:1828 meanR:0.7500 rate:0.0769 gloss:-3.4741 dloss:0.6974 dlossR:0.3167 dlossQ:0.3808
Episode:1829 meanR:0.7700 rate:0.1538 gloss:-2.6862 dloss:0.8694 dlossR:0.4515 dlossQ:0.4179
Episode:1830 meanR:0.7600 rate:0.0000 gloss:-2.4531 dloss:0.5901 dlossR:0.1613 dlossQ:0.4288
Episode:1831 meanR:0.7400 rate:-0.0769 gloss:-2.1495 dloss:0.5948 dlos

Episode:1909 meanR:0.6200 rate:0.0000 gloss:-4.1827 dloss:0.4514 dlossR:0.1127 dlossQ:0.3388
Episode:1910 meanR:0.6400 rate:0.2308 gloss:-4.2227 dloss:1.0083 dlossR:0.7754 dlossQ:0.2329
Episode:1911 meanR:0.6500 rate:0.0000 gloss:-4.3661 dloss:0.2241 dlossR:0.0436 dlossQ:0.1806
Episode:1912 meanR:0.6300 rate:-0.0769 gloss:-4.7993 dloss:-0.0847 dlossR:-0.2350 dlossQ:0.1503
Episode:1913 meanR:0.6200 rate:0.0769 gloss:-4.5852 dloss:0.4859 dlossR:0.3007 dlossQ:0.1852
Episode:1914 meanR:0.6300 rate:0.0769 gloss:-6.0359 dloss:0.5421 dlossR:0.3729 dlossQ:0.1692
Episode:1915 meanR:0.6200 rate:0.0000 gloss:-6.2584 dloss:0.2065 dlossR:0.0339 dlossQ:0.1726
Episode:1916 meanR:0.6300 rate:0.1538 gloss:-4.9849 dloss:0.8040 dlossR:0.6054 dlossQ:0.1986
Episode:1917 meanR:0.6100 rate:-0.0769 gloss:-5.8245 dloss:-0.1011 dlossR:-0.2876 dlossQ:0.1865
Episode:1918 meanR:0.5800 rate:-0.0769 gloss:-4.5702 dloss:-0.0080 dlossR:-0.2070 dlossQ:0.1990
Episode:1919 meanR:0.5800 rate:0.0769 gloss:-5.8421 dloss:0.5

Episode:1997 meanR:0.3900 rate:0.0000 gloss:-8.1814 dloss:0.0897 dlossR:0.0120 dlossQ:0.0777
Episode:1998 meanR:0.3800 rate:0.0000 gloss:-7.1135 dloss:0.1011 dlossR:0.0150 dlossQ:0.0861
Episode:1999 meanR:0.3700 rate:0.0000 gloss:-7.0798 dloss:0.1158 dlossR:0.0176 dlossQ:0.0982
Episode:2000 meanR:0.3700 rate:0.0769 gloss:-6.7895 dloss:0.4973 dlossR:0.3983 dlossQ:0.0991
Episode:2001 meanR:0.3700 rate:0.0000 gloss:-7.3785 dloss:0.1005 dlossR:0.0133 dlossQ:0.0872
Episode:2002 meanR:0.3700 rate:0.0000 gloss:-5.9878 dloss:0.1056 dlossR:0.0167 dlossQ:0.0889
Episode:2003 meanR:0.3800 rate:0.1538 gloss:-6.1956 dloss:0.7887 dlossR:0.7099 dlossQ:0.0788
Episode:2004 meanR:0.3700 rate:0.0000 gloss:-4.4027 dloss:0.3252 dlossR:0.0658 dlossQ:0.2594
Episode:2005 meanR:0.3600 rate:0.0000 gloss:-3.6901 dloss:0.5055 dlossR:0.1139 dlossQ:0.3916
Episode:2006 meanR:0.3500 rate:0.0769 gloss:-4.9305 dloss:0.4433 dlossR:0.3063 dlossQ:0.1370
Episode:2007 meanR:0.3200 rate:-0.0769 gloss:-6.2330 dloss:-0.2453 dlo

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

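def running_mean(x, N):
    """Return the moving average of x over a sliding window of N samples."""
    # Prepend 0 so that differences of the cumulative sum give window sums.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N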
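
As a quick sanity check of `running_mean`, the sketch below compares it with a moving average computed by `np.convolve`. The `rewards` array and the `window` size are made-up values used only for illustration; they are not data from this notebook.

```python
import numpy as np

def running_mean(x, N):
    """Moving average of x over a window of N samples (cumulative-sum trick)."""
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Hypothetical per-episode rewards, for illustration only.
rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
window = 3

smoothed = running_mean(rewards, window)

# Reference: the same moving average via convolution with a uniform kernel.
reference = np.convolve(rewards, np.ones(window) / window, mode="valid")

print(smoothed)                       # [2. 3. 4.]
print(np.allclose(smoothed, reference))  # True
print(len(rewards) - len(smoothed))   # 2, i.e. window - 1 points are dropped
```

Because the smoothed series is `N - 1` points shorter than the input, the plotting cells align it with the last `len(smoothed_arr)` episodes via `eps[-len(smoothed_arr):]`.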

In [None]:
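# Unpack the (episode, reward) pairs into two arrays.
eps, arr = np.array(rewards_list).T
# 10-episode moving average drawn over the raw per-episode values (grey).
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')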

In [None]:
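# Unpack the (episode, loss) pairs into two arrays.
eps, arr = np.array(loss_list).T
# 10-episode moving average drawn over the raw per-episode values (grey).
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')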

In [33]:
# # import gym
# # # env = gym.make('CartPole-v0')
# # env = gym.make('CartPole-v1')
# # # env = gym.make('Acrobot-v1')
# # # env = gym.make('MountainCar-v0')
# # # env = gym.make('Pendulum-v0')
# # # env = gym.make('Blackjack-v0')
# # # env = gym.make('FrozenLake-v0')
# # # env = gym.make('AirRaid-ram-v0')
# # # env = gym.make('AirRaid-v0')
# # # env = gym.make('BipedalWalker-v2')
# # # env = gym.make('Copy-v0')
# # # env = gym.make('CarRacing-v0')
# # # env = gym.make('Ant-v2') #mujoco
# # # env = gym.make('FetchPickAndPlace-v1') # mujoco required!

# with tf.Session() as sess:
#     #sess.run(tf.global_variables_initializer())
#     saver.restore(sess, 'checkpoints/model-nav.ckpt')    
#     #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
#     # Episodes/epochs
#     for _ in range(1):
#         state = env.reset()
#         total_reward = 0

#         # Steps/batches
#         #for _ in range(111111111111111111):
#         while True:
#             env.render()
#             action_logits = sess.run(model.actions_logits, feed_dict={model.states: np.reshape(state, [1, -1])})
#             action = np.argmax(action_logits)
#             state, reward, done, _ = env.step(action)
#             total_reward += reward
#             if done:
#                 break
                
#         # Closing the env
#         print('total_reward: {:.2f}'.format(total_reward))
#         env.close()