# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```python
env = UnityEnvironment(file_name="Banana.app")
```
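The platform list above can also be expressed as a small lookup. The sketch below is only an illustration: `banana_env_path` is a hypothetical helper (not part of `unityagents`), and the `base` folder is a placeholder that must be replaced with the actual download location.

```python
def banana_env_path(system, machine, base="path/to", headless=False):
    """Return the Banana build path from the list above for a given platform.

    `system` and `machine` are expected in the form returned by the standard
    library calls platform.system() and platform.machine().
    """
    is_64bit = machine.lower() in ("x86_64", "amd64")
    if system == "Darwin":
        # Mac: a single application bundle
        return f"{base}/Banana.app"
    if system == "Windows":
        folder = "Banana_Windows_x86_64" if is_64bit else "Banana_Windows_x86"
        return f"{base}/{folder}/Banana.exe"
    if system == "Linux":
        # The headless build lives in the NoVis folder
        folder = "Banana_Linux_NoVis" if headless else "Banana_Linux"
        suffix = "x86_64" if is_64bit else "x86"
        return f"{base}/{folder}/Banana.{suffix}"
    raise ValueError(f"Unsupported platform: {system}")
```

With `import platform`, the result of `banana_env_path(platform.system(), platform.machine())` could then be passed as the `file_name` argument.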

In [2]:
# env = UnityEnvironment(file_name="/home/arasdar/VisualBanana_Linux/Banana.x86")
env = UnityEnvironment(file_name="/home/arasdar/Banana_Linux/Banana.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_**, which are responsible for deciding the actions of their associated agents. Here we take the first available brain and set it as the default brain that we will control from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.
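As a quick reference, the action indices and reward rule described above can be written down as plain Python data. This is a sketch for illustration only: the names `ACTION_NAMES`, `BANANA_REWARDS`, and `episode_score` are made up here and are not provided by the environment.

```python
# Action indices as listed above (hypothetical constant, for readability only)
ACTION_NAMES = {0: "walk forward", 1: "walk backward", 2: "turn left", 3: "turn right"}

# Reward for collecting each banana color, as described above
BANANA_REWARDS = {"yellow": 1.0, "blue": -1.0}

def episode_score(collected):
    """Sum the rewards for a sequence of collected banana colors."""
    return sum(BANANA_REWARDS[color] for color in collected)

# Two yellow bananas and one blue banana give a net score of 1.0
print(ACTION_NAMES[2], episode_score(["yellow", "yellow", "blue"]))
```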

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
# print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)
# print(state.shape, len(env_info.vector_observations), env_info.vector_observations.shape)

Number of agents: 1
Number of actions: 4
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects an action uniformly at random at each time step.  A window should pop up that lets you observe the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!
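One common way to move from purely random behavior toward learned behavior is epsilon-greedy selection: act randomly with probability `epsilon`, otherwise pick the action with the highest estimated value. The sketch below is a generic illustration and not necessarily the approach taken later in this notebook; `q_values` and `epsilon` are hypothetical inputs that a learning algorithm would supply.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability `epsilon` a uniformly random action is returned
    (exploration); otherwise the index of the largest value (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

# With epsilon=0.0 the choice is always greedy: index 1 has the largest value
action = epsilon_greedy([0.1, 0.9, 0.3, 0.2], epsilon=0.0)
```

Decaying `epsilon` over episodes lets the agent explore early and rely on its experience later.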

In [5]:
env_info = env.reset(train_mode=False)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        print(state.shape)
        break
    
print("Score: {}".format(score))

(37,)
Score: 0.0


When finished, you can close the environment. The call is left commented out here because later cells in this notebook reuse `env`.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    #print(state)
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))

Score: 1.0
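A single episode score is noisy, so the training loop later in this notebook tracks the mean score over the last 100 episodes with a `deque(maxlen=100)` and stops once that mean reaches `+13`. The sketch below isolates that bookkeeping. `RunningMean` is a hypothetical helper, and requiring a full window before declaring success is an added assumption, not something the training loop does.

```python
from collections import deque

class RunningMean:
    """Mean of the most recent `window` episode scores."""

    def __init__(self, window=100):
        self.scores = deque(maxlen=window)  # old scores drop out automatically

    def add(self, score):
        """Record one episode score and return the current windowed mean."""
        self.scores.append(score)
        return sum(self.scores) / len(self.scores)

    def solved(self, target=13.0):
        """True once the window is full and its mean reaches `target`.

        Assumption: waiting for a full window avoids stopping on one lucky episode.
        """
        full = len(self.scores) == self.scores.maxlen
        return full and sum(self.scores) / len(self.scores) >= target

tracker = RunningMean(window=3)   # small window just for the demonstration
for episode_score in (10.0, 14.0, 15.0):
    mean_score = tracker.add(episode_score)
```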


In [8]:
# Import TensorFlow and check whether a GPU is available
# (an empty device name means TensorFlow will run on the CPU)
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [9]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
batch = []
while True: # loop until the episode ends
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    #print(state, action, reward, done)
    batch.append([action, state, reward, done])
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
# print("Score: {}".format(score))

In [10]:
batch[0], batch[0][1].shape

([0, array([1.        , 0.        , 0.        , 0.        , 0.35186431,
         1.        , 0.        , 0.        , 0.        , 0.37953866,
         1.        , 0.        , 0.        , 0.        , 0.11957462,
         1.        , 0.        , 0.        , 0.        , 0.43679786,
         0.        , 1.        , 0.        , 0.        , 0.7516005 ,
         0.        , 0.        , 1.        , 0.        , 0.6708644 ,
         0.        , 0.        , 1.        , 0.        , 0.36187497,
         0.        , 0.        ]), 0.0, False], (37,))

In [11]:
batch[0][1].shape

(37,)

In [12]:
batch[0]

[0, array([1.        , 0.        , 0.        , 0.        , 0.35186431,
        1.        , 0.        , 0.        , 0.        , 0.37953866,
        1.        , 0.        , 0.        , 0.        , 0.11957462,
        1.        , 0.        , 0.        , 0.        , 0.43679786,
        0.        , 1.        , 0.        , 0.        , 0.7516005 ,
        0.        , 0.        , 1.        , 0.        , 0.6708644 ,
        0.        , 0.        , 1.        , 0.        , 0.36187497,
        0.        , 0.        ]), 0.0, False]

In [13]:
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
# infos = np.array([each[4] for each in batch])
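The four list comprehensions above split a list of `[action, state, reward, done]` records into one sequence per field. The same transposition can be sketched with `zip(*batch)`; the toy `batch` below is invented for illustration and uses short lists in place of the 37-dimensional state vectors.

```python
def unzip_transitions(batch):
    """Split a non-empty list of [action, state, reward, done] records into four lists."""
    actions, states, rewards, dones = zip(*batch)
    return list(actions), list(states), list(rewards), list(dones)

# Toy transitions in the same field order as the batch built above
batch = [
    [0, [1.0, 0.0], 0.0, False],
    [3, [0.0, 1.0], 1.0, False],
    [2, [0.5, 0.5], 0.0, True],
]
actions, states, rewards, dones = unzip_transitions(batch)
```

Each returned list can be wrapped in `np.array(...)` to obtain the same arrays as the cell above.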

In [14]:
# print(rewards[:])
print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print(np.max(np.array(actions)), np.min(np.array(actions)), 
      (np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print(np.max(np.array(rewards)), np.min(np.array(rewards)))
print(np.max(np.array(states)), np.min(np.array(states)))

(300,) (300, 37) (300,) (300,)
float64 float64 int64 bool
3 0 4
1.0 0.0
10.869853973388672 -10.982420921325684


In [15]:
# Data of the model: input placeholders
def model_input(state_size):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    reward = tf.placeholder(tf.float32, [], name='reward')
    return states, actions, targetQs, reward

In [16]:
# Generator/Controller: Generating/predicting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits
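The expression `tf.maximum(alpha * bn1, bn1)` used in the generator is a leaky ReLU: for `0 < alpha < 1` it passes positive inputs through unchanged and scales negative inputs by `alpha`. A scalar sketch in plain Python, using the same default `alpha=0.1`:

```python
def leaky_relu(x, alpha=0.1):
    """Leaky ReLU written the same way as in the network: max(alpha * x, x)."""
    return max(alpha * x, x)

# Positive inputs are unchanged; negative inputs are scaled by alpha
outputs = [leaky_relu(x) for x in (-2.0, 0.0, 3.0)]
```

Keeping a small slope for negative inputs means the gradient does not vanish entirely when a unit's pre-activation is negative.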

In [17]:
# Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fuse/merge states and actions by concatenating them along the feature axis
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [25]:
def model_loss(action_size, hidden_size, states, actions, targetQs, reward):
    # G
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                      labels=actions_labels)
    rewards = reward * tf.ones_like(targetQs)
    #Qs_labels = targetQs[1:]
    Qs_labels = rewards[:-1] + (0.99*targetQs[1:])
    Qs_labels = tf.concat(axis=0, values=[Qs_labels, tf.zeros([1])])
    g_loss = tf.reduce_mean(neg_log_prob_actions * Qs_labels)
    #g_loss = tf.reduce_mean(neg_log_prob_actions[:-1] * Qs_labels)
    
    # D
    Qs_logits = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
    d_lossR = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=tf.reshape(Qs_logits, [-1]),
                                                                     labels=rewards))
    d_lossQ = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=tf.reshape(Qs_logits, [-1]),
                                                                     labels=tf.nn.sigmoid(Qs_labels)))
    # d_lossQ = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=tf.reshape(Qs_logits[:-1], [-1]),
    #                                                                  labels=tf.nn.sigmoid(Qs_labels)))
    d_loss = d_lossR + d_lossQ

    return actions_logits, Qs_logits, g_loss, d_loss, d_lossR, d_lossQ
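The `Qs_labels` lines in `model_loss` build a one-step bootstrapped target: each step's target is its reward plus `0.99` times the value recorded for the following step, and the final step gets a target of zero. The plain-Python sketch below mirrors that slicing on ordinary lists. `one_step_targets` is a hypothetical name, and the numbers are chosen only to make the arithmetic easy to follow.

```python
def one_step_targets(rewards, q_values, gamma=0.99):
    """Mirror of `rewards[:-1] + gamma * targetQs[1:]` followed by a trailing zero.

    `rewards` and `q_values` must have the same length (one entry per time step).
    """
    targets = [r + gamma * q_next for r, q_next in zip(rewards[:-1], q_values[1:])]
    targets.append(0.0)  # the last step has no successor to bootstrap from
    return targets

# gamma=0.5 keeps the numbers exact: 1 + 0.5*2 = 2.0 and 1 + 0.5*4 = 3.0
targets = one_step_targets([1.0, 1.0, 1.0], [0.0, 2.0, 4.0], gamma=0.5)
```

Note that in `model_loss` the per-step rewards are all equal to the single scalar `reward` fed for the episode, since `rewards = reward * tf.ones_like(targetQs)`.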

In [26]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    """
    Get optimization operations in order
    :param g_loss: Generator loss Tensor for action prediction
    :param d_loss: Discriminator loss Tensor for reward prediction for generated/prob/logits action
    :param learning_rate: Learning rate for the Adam optimizers
    :return: A tuple of (generator training op, discriminator training op)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [27]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, self.reward = model_input(state_size=state_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_loss, self.d_lossR, self.d_lossQ = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, # model input
            targetQs=self.targetQs, reward=self.reward) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [28]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(300, 37) actions:(300,)
action size:4


In [29]:
# Training parameters
# Network parameters
state_size = 37              # number of units for the input state/observation -- simulation
action_size = 4              # number of units for the output actions -- simulation
hidden_size = 37*16          # number of units in each Q-network hidden layer -- simulation
learning_rate = 0.001          # learning rate for adam

In [30]:
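# Reset/init the graph/session
# tf.reset_default_graph() returns None, so fetch the fresh default graph explicitly
tf.reset_default_graph()
graph = tf.get_default_graph()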

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

In [31]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment

while True: # infinite number of steps
#for _ in range(batch_size):
    state = env_info.vector_observations[0]        # get the current state
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    #memory.buffer.append([action, state, done])
    if done:                                       # exit loop if episode finished
        break

In [None]:
from collections import deque
episodes_total_reward = deque(maxlen=100) # rolling window of the last 100 episode total rewards (running mean)
saver = tf.train.Saver()
rewards_list, g_loss_list, d_loss_list = [], [], []
d_lossR_list, d_lossQ_list = [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    for ep in range(111111):
        batch = [] # every data batch
        total_reward = 0
        #state = env.reset() # env first state
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment

        # Training steps/batches
        while True:
            state = env_info.vector_observations[0]   # get the current state
            action_logits, Q_logits = sess.run(fetches=[model.actions_logits, model.Qs_logits], 
                                               feed_dict={model.states: np.reshape(state, [1, -1])})
            action = np.argmax(action_logits)
            batch.append([state, action, Q_logits])
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            total_reward += reward
            if done: # episode ended success/failure
                episodes_total_reward.append(total_reward) # stopping criteria
                #rate = total_reward/ 500 # success is 500 points, rate is between 0 and +1 ~ sigmoid
                rate = total_reward/ +13 # success is +13; rate is between -1 and +1 ~ tanh
                if rate >= +1: rate = +1
                if rate <= -1: rate = -1
                break

        # Training using batches
        #batch = memory.buffer
        states = np.array([each[0] for each in batch])
        actions = np.array([each[1] for each in batch])
        targetQs = np.array([each[2] for each in batch])
        g_loss, d_loss, d_lossR, d_lossQ, _, _ = sess.run([model.g_loss, model.d_loss,
                                                           model.d_lossR, model.d_lossQ, 
                                                           model.g_opt, model.d_opt],
                                                          feed_dict = {model.states: states, 
                                                                       model.actions: actions,
                                                                       model.reward: rate,
                                                                       model.targetQs: targetQs.reshape([-1])})
        # Average 100 episode total reward
        # Print out
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episodes_total_reward)),
              'rate:{:.4f}'.format(rate),
              'gloss:{:.4f}'.format(g_loss),
              'dloss:{:.4f}'.format(d_loss),
              'dlossR:{:.4f}'.format(d_lossR),
              'dlossQ:{:.4f}'.format(d_lossQ))
        # Plotting out
        rewards_list.append([ep, np.mean(episodes_total_reward)])
        g_loss_list.append([ep, g_loss])
        d_loss_list.append([ep, d_loss])
        d_lossR_list.append([ep, d_lossR])
        d_lossQ_list.append([ep, d_lossQ])
        # Break episode/epoch loop
        if np.mean(episodes_total_reward) >= +13:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints-nav/model.ckpt')

Episode:0 meanR:-1.0000 rate:-0.0769 gloss:-0.1316 dloss:1.3706 dlossR:0.6773 dlossQ:0.6933
Episode:1 meanR:-1.0000 rate:-0.0769 gloss:-0.7123 dloss:1.0514 dlossR:0.4117 dlossQ:0.6397
Episode:2 meanR:-0.6667 rate:0.0000 gloss:-2.2430 dloss:0.4708 dlossR:0.1167 dlossQ:0.3541
Episode:3 meanR:-0.5000 rate:0.0000 gloss:-1.9378 dloss:0.5582 dlossR:0.1505 dlossQ:0.4077
Episode:4 meanR:-0.4000 rate:0.0000 gloss:-2.5756 dloss:0.3596 dlossR:0.0808 dlossQ:0.2788
Episode:5 meanR:-0.3333 rate:0.0000 gloss:-3.3259 dloss:0.2474 dlossR:0.0481 dlossQ:0.1993
Episode:6 meanR:-0.1429 rate:0.0769 gloss:-3.8222 dloss:0.4762 dlossR:0.2936 dlossQ:0.1826
Episode:7 meanR:-0.1250 rate:0.0000 gloss:-4.5931 dloss:0.1158 dlossR:0.0177 dlossQ:0.0981
Episode:8 meanR:-0.1111 rate:0.0000 gloss:-5.3130 dloss:0.0846 dlossR:0.0118 dlossQ:0.0728
Episode:9 meanR:-0.1000 rate:0.0000 gloss:-6.1249 dloss:0.0456 dlossR:0.0050 dlossQ:0.0406
Episode:10 meanR:-0.0909 rate:0.0000 gloss:-6.6945 dloss:0.0358 dlossR:0.0033 dlossQ:0.0

Episode:89 meanR:0.1222 rate:0.0769 gloss:-4.9906 dloss:0.5519 dlossR:0.3447 dlossQ:0.2072
Episode:90 meanR:0.1099 rate:-0.0769 gloss:-6.2658 dloss:-0.0555 dlossR:-0.3289 dlossQ:0.2734
Episode:91 meanR:0.1196 rate:0.0769 gloss:-6.3559 dloss:0.5692 dlossR:0.4113 dlossQ:0.1578
Episode:92 meanR:0.1290 rate:0.0769 gloss:-7.2748 dloss:0.6754 dlossR:0.4671 dlossQ:0.2083
Episode:93 meanR:0.1170 rate:-0.0769 gloss:-5.8568 dloss:-0.1527 dlossR:-0.3167 dlossQ:0.1640
Episode:94 meanR:0.1158 rate:0.0000 gloss:-6.4807 dloss:0.2158 dlossR:0.0274 dlossQ:0.1884
Episode:95 meanR:0.1354 rate:0.1538 gloss:-7.1775 dloss:1.0732 dlossR:0.9109 dlossQ:0.1622
Episode:96 meanR:0.1546 rate:0.1538 gloss:-5.5053 dloss:0.8447 dlossR:0.7018 dlossQ:0.1429
Episode:97 meanR:0.1633 rate:0.0769 gloss:-5.0454 dloss:0.4961 dlossR:0.3340 dlossQ:0.1621
Episode:98 meanR:0.1616 rate:0.0000 gloss:-5.5479 dloss:0.1849 dlossR:0.0287 dlossQ:0.1562
Episode:99 meanR:0.1700 rate:0.0769 gloss:-5.6715 dloss:0.5481 dlossR:0.3738 dlossQ:

Episode:178 meanR:0.0700 rate:-0.1538 gloss:-6.4764 dloss:-0.6435 dlossR:-0.7165 dlossQ:0.0730
Episode:179 meanR:0.0600 rate:-0.0769 gloss:-7.1288 dloss:-0.3462 dlossR:-0.4007 dlossQ:0.0544
Episode:180 meanR:0.0800 rate:0.0769 gloss:-7.1074 dloss:0.4845 dlossR:0.4239 dlossQ:0.0607
Episode:181 meanR:0.0900 rate:-0.0769 gloss:-6.7778 dloss:-0.3072 dlossR:-0.3726 dlossQ:0.0655
Episode:182 meanR:0.1500 rate:0.4615 gloss:-8.5889 dloss:3.2731 dlossR:3.2229 dlossQ:0.0501
Episode:183 meanR:0.1500 rate:-0.0769 gloss:-8.1131 dloss:-0.4213 dlossR:-0.4557 dlossQ:0.0344
Episode:184 meanR:0.1500 rate:0.0000 gloss:-7.7196 dloss:0.0400 dlossR:0.0038 dlossQ:0.0362
Episode:185 meanR:0.1600 rate:0.0769 gloss:-7.7394 dloss:0.4918 dlossR:0.4555 dlossQ:0.0363
Episode:186 meanR:0.1600 rate:0.0000 gloss:-7.3096 dloss:0.0436 dlossR:0.0044 dlossQ:0.0392
Episode:187 meanR:0.1500 rate:0.0000 gloss:-5.8111 dloss:0.0962 dlossR:0.0138 dlossQ:0.0824
Episode:188 meanR:0.1500 rate:0.0000 gloss:-5.0896 dloss:0.1504 dlos

Episode:266 meanR:-0.0100 rate:0.0000 gloss:-8.6233 dloss:0.0720 dlossR:0.0079 dlossQ:0.0642
Episode:267 meanR:-0.0300 rate:-0.1538 gloss:-8.4223 dloss:-0.9064 dlossR:-0.9478 dlossQ:0.0414
Episode:268 meanR:-0.0300 rate:-0.0769 gloss:-8.3815 dloss:-0.4108 dlossR:-0.4715 dlossQ:0.0607
Episode:269 meanR:-0.0500 rate:-0.0769 gloss:-8.9644 dloss:-0.4479 dlossR:-0.5065 dlossQ:0.0585
Episode:270 meanR:-0.0700 rate:-0.1538 gloss:-7.6879 dloss:-0.7971 dlossR:-0.8621 dlossQ:0.0651
Episode:271 meanR:-0.0900 rate:-0.1538 gloss:-7.5284 dloss:-0.8080 dlossR:-0.8496 dlossQ:0.0415
Episode:272 meanR:-0.0800 rate:0.0769 gloss:-7.6809 dloss:0.5081 dlossR:0.4564 dlossQ:0.0517
Episode:273 meanR:-0.0300 rate:0.3846 gloss:-8.8949 dloss:2.7482 dlossR:2.7219 dlossQ:0.0263
Episode:274 meanR:-0.0100 rate:0.0000 gloss:-10.6412 dloss:0.0405 dlossR:0.0027 dlossQ:0.0378
Episode:275 meanR:0.0200 rate:0.0000 gloss:-11.2298 dloss:0.0571 dlossR:0.0027 dlossQ:0.0544
Episode:276 meanR:0.0100 rate:-0.0769 gloss:-14.1845 d

Episode:354 meanR:0.9200 rate:-0.0769 gloss:-2.2822 dloss:0.5966 dlossR:0.0803 dlossQ:0.5163
Episode:355 meanR:0.9300 rate:0.0769 gloss:-1.9830 dloss:0.8418 dlossR:0.3271 dlossQ:0.5148
Episode:356 meanR:0.9500 rate:0.0769 gloss:-1.5729 dloss:0.9319 dlossR:0.3660 dlossQ:0.5659
Episode:357 meanR:0.9500 rate:0.0000 gloss:-2.0637 dloss:0.6932 dlossR:0.2028 dlossQ:0.4905
Episode:358 meanR:0.9500 rate:0.0000 gloss:-1.6846 dloss:0.8003 dlossR:0.2617 dlossQ:0.5387
Episode:359 meanR:0.9500 rate:0.0000 gloss:-2.2903 dloss:0.6122 dlossR:0.1715 dlossQ:0.4407
Episode:360 meanR:0.9700 rate:0.0769 gloss:-1.6090 dloss:0.8975 dlossR:0.3466 dlossQ:0.5509
Episode:361 meanR:0.9600 rate:0.0000 gloss:-2.3196 dloss:0.6071 dlossR:0.1685 dlossQ:0.4385
Episode:362 meanR:0.9500 rate:0.0000 gloss:-2.4289 dloss:0.5863 dlossR:0.1616 dlossQ:0.4247
Episode:363 meanR:0.9300 rate:-0.0769 gloss:-3.1793 dloss:0.2504 dlossR:-0.0693 dlossQ:0.3197
Episode:364 meanR:0.9300 rate:0.0000 gloss:-3.9018 dloss:0.2915 dlossR:0.0592

Episode:443 meanR:0.2800 rate:0.0000 gloss:-11.0727 dloss:0.0181 dlossR:0.0005 dlossQ:0.0176
Episode:444 meanR:0.2700 rate:0.0000 gloss:-10.1782 dloss:0.0203 dlossR:0.0006 dlossQ:0.0196
Episode:445 meanR:0.2300 rate:0.0000 gloss:-12.1951 dloss:0.0157 dlossR:0.0002 dlossQ:0.0155
Episode:446 meanR:0.2000 rate:0.0000 gloss:-10.8834 dloss:0.0205 dlossR:0.0004 dlossQ:0.0201
Episode:447 meanR:0.2000 rate:-0.0769 gloss:-8.5757 dloss:-0.4512 dlossR:-0.4768 dlossQ:0.0255
Episode:448 meanR:0.2100 rate:0.0769 gloss:-9.4893 dloss:0.5648 dlossR:0.5448 dlossQ:0.0200
Episode:449 meanR:0.2100 rate:0.0000 gloss:-8.6463 dloss:0.0278 dlossR:0.0019 dlossQ:0.0259
Episode:450 meanR:0.1900 rate:0.0000 gloss:-10.5569 dloss:0.0176 dlossR:0.0005 dlossQ:0.0171
Episode:451 meanR:0.2000 rate:0.1538 gloss:-11.6203 dloss:1.3519 dlossR:1.3357 dlossQ:0.0162
Episode:452 meanR:0.1800 rate:0.0000 gloss:-12.2977 dloss:0.0182 dlossR:0.0001 dlossQ:0.0180
Episode:453 meanR:0.1500 rate:0.0000 gloss:-11.4191 dloss:0.0192 dloss

Episode:532 meanR:0.2500 rate:0.0000 gloss:-4.8003 dloss:0.1695 dlossR:0.0295 dlossQ:0.1400
Episode:533 meanR:0.2400 rate:0.0000 gloss:-3.8728 dloss:0.3034 dlossR:0.0650 dlossQ:0.2383
Episode:534 meanR:0.2300 rate:-0.1538 gloss:-4.6997 dloss:-0.3057 dlossR:-0.4653 dlossQ:0.1597
Episode:535 meanR:0.2500 rate:0.0769 gloss:-4.4173 dloss:0.4798 dlossR:0.2960 dlossQ:0.1838
Episode:536 meanR:0.2600 rate:0.0769 gloss:-4.2197 dloss:0.4964 dlossR:0.2913 dlossQ:0.2051
Episode:537 meanR:0.2700 rate:0.0000 gloss:-4.9824 dloss:0.1730 dlossR:0.0302 dlossQ:0.1428
Episode:538 meanR:0.2700 rate:0.0000 gloss:-4.5999 dloss:0.2048 dlossR:0.0373 dlossQ:0.1675
Episode:539 meanR:0.2800 rate:0.0000 gloss:-4.6257 dloss:0.2015 dlossR:0.0359 dlossQ:0.1656
Episode:540 meanR:0.2700 rate:-0.0769 gloss:-5.0526 dloss:-0.1178 dlossR:-0.2506 dlossQ:0.1328
Episode:541 meanR:0.2700 rate:0.0000 gloss:-4.0955 dloss:0.2615 dlossR:0.0524 dlossQ:0.2091
Episode:542 meanR:0.2500 rate:-0.1538 gloss:-5.3096 dloss:-0.4351 dlossR:-

Episode:620 meanR:-0.0800 rate:0.0000 gloss:-10.4572 dloss:0.0281 dlossR:0.0019 dlossQ:0.0263
Episode:621 meanR:-0.0400 rate:0.2308 gloss:-11.2947 dloss:2.0053 dlossR:1.9790 dlossQ:0.0262
Episode:622 meanR:0.0200 rate:0.5385 gloss:-8.4951 dloss:3.7202 dlossR:3.6631 dlossQ:0.0571
Episode:623 meanR:0.0200 rate:0.2308 gloss:-9.1687 dloss:1.6933 dlossR:1.6222 dlossQ:0.0711
Episode:624 meanR:0.0300 rate:0.0769 gloss:-11.5210 dloss:0.6999 dlossR:0.6638 dlossQ:0.0362
Episode:625 meanR:0.0100 rate:0.0000 gloss:-11.6532 dloss:0.0633 dlossR:0.0055 dlossQ:0.0579
Episode:626 meanR:0.0000 rate:0.0769 gloss:-14.3224 dloss:0.8758 dlossR:0.8251 dlossQ:0.0507
Episode:627 meanR:0.0300 rate:0.0769 gloss:-10.0044 dloss:0.6441 dlossR:0.5789 dlossQ:0.0653
Episode:628 meanR:0.0300 rate:0.0000 gloss:-8.1046 dloss:0.0786 dlossR:0.0093 dlossQ:0.0694
Episode:629 meanR:0.0200 rate:-0.0769 gloss:-10.1971 dloss:-0.5374 dlossR:-0.5659 dlossQ:0.0285
Episode:630 meanR:0.0300 rate:0.0769 gloss:-4.8995 dloss:0.6547 dlos

Episode:709 meanR:0.3200 rate:0.0000 gloss:-5.0954 dloss:0.1721 dlossR:0.0283 dlossQ:0.1438
Episode:710 meanR:0.3000 rate:-0.0769 gloss:-7.0114 dloss:-0.3178 dlossR:-0.3800 dlossQ:0.0622
Episode:711 meanR:0.2800 rate:-0.2308 gloss:-9.3161 dloss:-1.4886 dlossR:-1.5194 dlossQ:0.0307
Episode:712 meanR:0.2600 rate:0.0000 gloss:-6.7449 dloss:0.1032 dlossR:0.0131 dlossQ:0.0901
Episode:713 meanR:0.2200 rate:0.0000 gloss:-7.7545 dloss:0.0531 dlossR:0.0052 dlossQ:0.0479
Episode:714 meanR:0.2700 rate:0.3846 gloss:-6.7360 dloss:2.1282 dlossR:2.0560 dlossQ:0.0722
Episode:715 meanR:0.2500 rate:0.0000 gloss:-6.9531 dloss:0.0934 dlossR:0.0114 dlossQ:0.0820
Episode:716 meanR:0.2500 rate:0.0000 gloss:-7.0990 dloss:0.0788 dlossR:0.0094 dlossQ:0.0694
Episode:717 meanR:0.2500 rate:0.0000 gloss:-6.7777 dloss:0.0986 dlossR:0.0126 dlossQ:0.0861
Episode:718 meanR:0.2700 rate:0.0000 gloss:-7.8960 dloss:0.0805 dlossR:0.0078 dlossQ:0.0727
Episode:719 meanR:0.2700 rate:-0.0769 gloss:-6.4088 dloss:-0.2626 dlossR:-

Episode:797 meanR:-0.0400 rate:0.0000 gloss:-11.4510 dloss:0.0167 dlossR:0.0002 dlossQ:0.0165
Episode:798 meanR:-0.0300 rate:0.0000 gloss:-10.0290 dloss:0.0203 dlossR:0.0011 dlossQ:0.0192
Episode:799 meanR:-0.0100 rate:0.0000 gloss:-9.3520 dloss:0.0217 dlossR:0.0011 dlossQ:0.0206
Episode:800 meanR:0.0000 rate:0.0000 gloss:-10.3010 dloss:0.0178 dlossR:0.0006 dlossQ:0.0172
Episode:801 meanR:0.0100 rate:0.0000 gloss:-8.9199 dloss:0.0242 dlossR:0.0015 dlossQ:0.0227
Episode:802 meanR:0.0100 rate:0.0000 gloss:-9.5269 dloss:0.0245 dlossR:0.0014 dlossQ:0.0231
Episode:803 meanR:0.0100 rate:0.0000 gloss:-11.9282 dloss:0.0174 dlossR:0.0003 dlossQ:0.0171
Episode:804 meanR:0.0000 rate:0.0769 gloss:-11.6874 dloss:0.6859 dlossR:0.6662 dlossQ:0.0197
Episode:805 meanR:0.0000 rate:0.0000 gloss:-12.8093 dloss:0.0157 dlossR:0.0002 dlossQ:0.0155
Episode:806 meanR:0.0100 rate:0.0769 gloss:-12.4593 dloss:0.7295 dlossR:0.7094 dlossQ:0.0201
Episode:807 meanR:0.0100 rate:0.0000 gloss:-10.7480 dloss:0.0193 dloss

Episode:886 meanR:0.1300 rate:0.0000 gloss:-5.0896 dloss:0.1856 dlossR:0.0322 dlossQ:0.1534
Episode:887 meanR:0.1300 rate:-0.0769 gloss:-5.8377 dloss:-0.2002 dlossR:-0.3036 dlossQ:0.1034
Episode:888 meanR:0.1200 rate:0.0000 gloss:-5.6870 dloss:0.2069 dlossR:0.0287 dlossQ:0.1782
Episode:889 meanR:0.1100 rate:-0.0769 gloss:-5.1964 dloss:-0.1429 dlossR:-0.2625 dlossQ:0.1196
Episode:890 meanR:0.1000 rate:0.0000 gloss:-5.5803 dloss:0.1150 dlossR:0.0172 dlossQ:0.0978
Episode:891 meanR:0.1000 rate:0.0000 gloss:-5.3746 dloss:0.1322 dlossR:0.0211 dlossQ:0.1112
Episode:892 meanR:0.1000 rate:0.0000 gloss:-5.2187 dloss:0.1431 dlossR:0.0232 dlossQ:0.1199
Episode:893 meanR:0.1200 rate:0.1538 gloss:-5.6666 dloss:0.7837 dlossR:0.6806 dlossQ:0.1031
Episode:894 meanR:0.1200 rate:0.0000 gloss:-7.0863 dloss:0.0870 dlossR:0.0107 dlossQ:0.0763
Episode:895 meanR:0.1500 rate:0.1538 gloss:-6.3115 dloss:0.8805 dlossR:0.7551 dlossQ:0.1254
Episode:896 meanR:0.1400 rate:0.0000 gloss:-6.6712 dloss:0.1088 dlossR:0.0

Episode:975 meanR:0.0000 rate:-0.0769 gloss:-8.6987 dloss:-0.4545 dlossR:-0.4836 dlossQ:0.0291
Episode:976 meanR:0.0100 rate:0.0769 gloss:-8.9039 dloss:0.5392 dlossR:0.5122 dlossQ:0.0270
Episode:977 meanR:0.0400 rate:0.0769 gloss:-8.5632 dloss:0.5238 dlossR:0.4929 dlossQ:0.0309
Episode:978 meanR:0.0200 rate:-0.1538 gloss:-7.7741 dloss:-0.8089 dlossR:-0.8492 dlossQ:0.0402
Episode:979 meanR:0.0500 rate:0.2308 gloss:-7.0667 dloss:1.3217 dlossR:1.2583 dlossQ:0.0634
Episode:980 meanR:0.0400 rate:-0.0769 gloss:-7.7604 dloss:-0.3891 dlossR:-0.4278 dlossQ:0.0387
Episode:981 meanR:0.0300 rate:-0.0769 gloss:-7.2625 dloss:-0.3500 dlossR:-0.3983 dlossQ:0.0482
Episode:982 meanR:0.0400 rate:-0.0769 gloss:-6.9501 dloss:-0.3260 dlossR:-0.3788 dlossQ:0.0528
Episode:983 meanR:0.0300 rate:-0.0769 gloss:-6.8458 dloss:-0.3104 dlossR:-0.3715 dlossQ:0.0611
Episode:984 meanR:0.0300 rate:0.0000 gloss:-6.3803 dloss:0.0842 dlossR:0.0112 dlossQ:0.0729
Episode:985 meanR:0.0100 rate:-0.0769 gloss:-7.0499 dloss:-0.3

Episode:1063 meanR:0.1400 rate:0.0769 gloss:-4.0047 dloss:0.5045 dlossR:0.2828 dlossQ:0.2217
Episode:1064 meanR:0.1400 rate:0.0000 gloss:-4.8071 dloss:0.1774 dlossR:0.0310 dlossQ:0.1464
Episode:1065 meanR:0.1300 rate:-0.0769 gloss:-5.4311 dloss:-0.1609 dlossR:-0.2763 dlossQ:0.1153
Episode:1066 meanR:0.1000 rate:-0.0769 gloss:-4.5521 dloss:-0.0440 dlossR:-0.2108 dlossQ:0.1668
Episode:1067 meanR:0.0900 rate:0.0000 gloss:-4.5764 dloss:0.2023 dlossR:0.0369 dlossQ:0.1654
Episode:1068 meanR:0.0900 rate:0.0769 gloss:-4.9379 dloss:0.4610 dlossR:0.3145 dlossQ:0.1465
Episode:1069 meanR:0.1200 rate:0.0000 gloss:-5.5973 dloss:0.1178 dlossR:0.0179 dlossQ:0.1000
Episode:1070 meanR:0.1000 rate:-0.0769 gloss:-6.0453 dloss:-0.2410 dlossR:-0.3213 dlossQ:0.0803
Episode:1071 meanR:0.0700 rate:0.0000 gloss:-5.4217 dloss:0.1254 dlossR:0.0194 dlossQ:0.1060
Episode:1072 meanR:0.0600 rate:-0.0769 gloss:-6.0537 dloss:-0.2358 dlossR:-0.3200 dlossQ:0.0841
Episode:1073 meanR:0.0400 rate:-0.0769 gloss:-6.4832 dloss

Episode:1150 meanR:-0.0900 rate:-0.1538 gloss:-10.2168 dloss:-1.1096 dlossR:-1.1292 dlossQ:0.0195
Episode:1151 meanR:-0.0700 rate:0.1538 gloss:-9.4872 dloss:1.1173 dlossR:1.0957 dlossQ:0.0216
Episode:1152 meanR:-0.0700 rate:0.0000 gloss:-10.0695 dloss:0.0192 dlossR:0.0008 dlossQ:0.0185
Episode:1153 meanR:-0.0600 rate:0.0769 gloss:-8.2950 dloss:0.5106 dlossR:0.4773 dlossQ:0.0333
Episode:1154 meanR:-0.0600 rate:0.0000 gloss:-8.7688 dloss:0.0296 dlossR:0.0023 dlossQ:0.0273
Episode:1155 meanR:-0.0800 rate:-0.0769 gloss:-8.2906 dloss:-0.4288 dlossR:-0.4590 dlossQ:0.0302
Episode:1156 meanR:-0.0800 rate:0.0000 gloss:-7.7278 dloss:0.0530 dlossR:0.0058 dlossQ:0.0472
Episode:1157 meanR:-0.0900 rate:0.0000 gloss:-8.9097 dloss:0.0246 dlossR:0.0016 dlossQ:0.0230
Episode:1158 meanR:-0.0800 rate:0.0000 gloss:-7.4440 dloss:0.0538 dlossR:0.0062 dlossQ:0.0476
Episode:1159 meanR:-0.0700 rate:0.0769 gloss:-5.9123 dloss:0.4433 dlossR:0.3546 dlossQ:0.0887
Episode:1160 meanR:-0.0400 rate:0.0769 gloss:-6.0782

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

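def running_mean(x, N):
    """Moving average of x over a window of N samples (len(x) - N + 1 values)."""
    # Prepend 0 so that cumsum[i] is the sum of the first i samples.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # Two cumulative sums N apart differ by exactly one window's sum.
    return (cumsum[N:] - cumsum[:-N]) / N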
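
As a quick sanity check of the cumulative-sum trick used in `running_mean`, the sketch below repeats the function and applies it to a small made-up array. The names `values` and `smoothed` are illustrative only and are not part of the notebook.

```python
import numpy as np

def running_mean(x, N):
    # Same cumulative-sum moving average as in the cell above.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Hypothetical toy data, chosen so the window means are easy to verify by hand.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = running_mean(values, 2)

print(smoothed)       # [1.5 2.5 3.5 4.5]
print(len(smoothed))  # 4, i.e. len(values) - N + 1
```

Because the result is `N - 1` samples shorter than the input, a plot of the smoothed curve has to be aligned against the last `len(smoothed)` episode numbers.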

In [None]:
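# Unpack the (episode, value) pairs into two arrays.
eps, arr = np.array(rewards_list).T
# Smooth with a 10-episode moving average; the result is 9 points shorter than arr.
smoothed_arr = running_mean(arr, 10)
# Align the smoothed curve with the last episodes and overlay the raw curve in grey.
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)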
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
# Same plot as above, for the (episode, value) pairs in loss_list.
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

In [33]:
# # import gym
# # # env = gym.make('CartPole-v0')
# # env = gym.make('CartPole-v1')
# # # env = gym.make('Acrobot-v1')
# # # env = gym.make('MountainCar-v0')
# # # env = gym.make('Pendulum-v0')
# # # env = gym.make('Blackjack-v0')
# # # env = gym.make('FrozenLake-v0')
# # # env = gym.make('AirRaid-ram-v0')
# # # env = gym.make('AirRaid-v0')
# # # env = gym.make('BipedalWalker-v2')
# # # env = gym.make('Copy-v0')
# # # env = gym.make('CarRacing-v0')
# # # env = gym.make('Ant-v2') #mujoco
# # # env = gym.make('FetchPickAndPlace-v1') # mujoco required!

# with tf.Session() as sess:
#     #sess.run(tf.global_variables_initializer())
#     saver.restore(sess, 'checkpoints/model-nav.ckpt')    
#     #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
#     # Episodes/epochs
#     for _ in range(1):
#         state = env.reset()
#         total_reward = 0

#         # Steps/batches
#         #for _ in range(111111111111111111):
#         while True:
#             env.render()
#             action_logits = sess.run(model.actions_logits, feed_dict={model.states: np.reshape(state, [1, -1])})
#             action = np.argmax(action_logits)
#             state, reward, done, _ = env.step(action)
#             total_reward += reward
#             if done:
#                 break
                
#         # Closing the env
#         print('total_reward: {:.2f}'.format(total_reward))
#         env.close()