# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```

In [4]:
# env = UnityEnvironment(file_name="/home/arasdar/VisualBanana_Linux/Banana.x86")
# env = UnityEnvironment(file_name="/home/aras/unity-envs/Banana_Linux/Banana.x86_64")
env = UnityEnvironment(file_name="/home/aras/unity-envs/Banana_Linux_NoVis/Banana.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [5]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana. 

Run the code cell below to print some information about the environment.

In [6]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
# print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)
# print(state.shape, len(env_info.vector_observations), env_info.vector_observations.shape)

Number of agents: 1
Number of actions: 4
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance, if it selects an action (uniformly) at random with each time step.  A window should pop up that allows you to observe the agent, as it moves through the environment.  

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [7]:
env_info = env.reset(train_mode=False)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
num_steps = 0
while True:
    num_steps += 1
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        print(state.shape)
        break
    
print("Score: {}".format(score))
num_steps

(37,)
Score: 2.0


300

When finished, you can close the environment.

In [8]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [9]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
num_steps = 0
while True:
    num_steps += 1
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    #print(state)
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))
num_steps

Score: 2.0


300

In [10]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

  from ._conv import register_converters as _register_converters


TensorFlow Version: 1.7.1
Default GPU Device: 


In [11]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
batch = []
num_steps = 0
while True: # infinite number of steps
    num_steps += 1
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    #print(state, action, reward, done)
    batch.append([state, action, next_state, reward, float(done)])
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))
num_steps

Score: 1.0


300

In [12]:
batch[0], batch[0][1]

([array([1.        , 0.        , 0.        , 0.        , 0.20790455,
         0.        , 0.        , 1.        , 0.        , 0.01327663,
         1.        , 0.        , 0.        , 0.        , 0.1713129 ,
         0.        , 0.        , 0.        , 1.        , 0.        ,
         0.        , 0.        , 1.        , 0.        , 0.01027161,
         0.        , 0.        , 1.        , 0.        , 0.02920904,
         0.        , 0.        , 1.        , 0.        , 0.00962341,
         0.        , 0.        ]),
  1,
  array([ 1.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
          2.17450604e-01,  0.00000000e+00,  0.00000000e+00,  1.00000000e+00,
          0.00000000e+00,  2.40637325e-02,  0.00000000e+00,  0.00000000e+00,
          0.00000000e+00,  1.00000000e+00,  0.00000000e+00,  0.00000000e+00,
          0.00000000e+00,  0.00000000e+00,  1.00000000e+00,  0.00000000e+00,
          1.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
          1.825

In [13]:
batch[0]

[array([1.        , 0.        , 0.        , 0.        , 0.20790455,
        0.        , 0.        , 1.        , 0.        , 0.01327663,
        1.        , 0.        , 0.        , 0.        , 0.1713129 ,
        0.        , 0.        , 0.        , 1.        , 0.        ,
        0.        , 0.        , 1.        , 0.        , 0.01027161,
        0.        , 0.        , 1.        , 0.        , 0.02920904,
        0.        , 0.        , 1.        , 0.        , 0.00962341,
        0.        , 0.        ]),
 1,
 array([ 1.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         2.17450604e-01,  0.00000000e+00,  0.00000000e+00,  1.00000000e+00,
         0.00000000e+00,  2.40637325e-02,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  1.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  1.00000000e+00,  0.00000000e+00,
         1.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         1.82566360e-01,  0.00

In [14]:
states = np.array([each[1] for each in batch])
actions = np.array([each[0] for each in batch])
next_states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
# infos = np.array([each[4] for each in batch])

In [15]:
# print(rewards[:])
print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print(np.max(np.array(actions)), np.min(np.array(actions)), 
      (np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print(np.max(np.array(rewards)), np.min(np.array(rewards)))
print(np.max(np.array(states)), np.min(np.array(states)))

(300, 37) (300,) (300, 37) (300,)
float64 int64 float64 float64
10.42125129699707 -11.176579475402832 22.597830772399902
10.42125129699707 -11.176579475402832
3 0


In [17]:
def model_input(state_size, hidden_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    # RNN
    cell = tf.nn.rnn_cell.GRUCell(hidden_size)
    #cell = tf.nn.rnn_cell.LSTMCell(hidden_size)
    cells = tf.nn.rnn_cell.MultiRNNCell([cell], state_is_tuple=True)
    initial_state = cells.zero_state(batch_size, tf.float32)
    return states, actions, targetQs, cells, initial_state

In [18]:
# RNN generator or sequence generator
def generator(states, action_size, initial_state, cells, hidden_size, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=hidden_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size and
        # static means can NOT adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, hidden_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cells, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state)
        outputs = tf.reshape(outputs_rnn, [-1, hidden_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=action_size)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [19]:
def model_loss(action_size, hidden_size, states, cells, initial_state, actions, targetQs):
    actions_logits, final_state = generator(states=states, cells=cells, initial_state=initial_state, 
                                            hidden_size=hidden_size, action_size=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, final_state, loss

In [20]:
def model_opt(loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize MLP/CNN
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    # #opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    # # Optimize RNN
    #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    grads = tf.gradients(loss, g_vars)
    opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))

    return opt

In [21]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cells, self.initial_state = model_input(
                state_size=state_size, hidden_size=hidden_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.final_state, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, cells=cells, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

In [22]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)

In [23]:
# Network parameters
action_size = 4
state_size = 37
hidden_size = 37*2             # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 128            # memory capacity - 1000 DQN
batch_size = 128             # experience mini-batch size - 20 DQN
gamma = 0.99                 # future reward discount

In [24]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

(?, 37) (?, 74)
(1, ?, 74) (<tf.Tensor 'MultiRNNCellZeroState/GRUCellZeroState/zeros:0' shape=(1, 74) dtype=float32>,)
(1, ?, 74) (<tf.Tensor 'generator/rnn/while/Exit_3:0' shape=(1, 74) dtype=float32>,)
(?, 74)
(?, 4)


In [25]:
model.initial_state[0]

<tf.Tensor 'MultiRNNCellZeroState/GRUCellZeroState/zeros:0' shape=(1, 74) dtype=float32>

In [26]:
# state = env.reset()
# for _ in range(batch_size):
#     action = env.action_space.sample()
#     next_state, reward, done, _ = env.step(action)
#     memory.buffer.append([state, action, next_state, reward, float(done)])
#     state = next_state
#     if done is True:
#         state = env.reset()

In [27]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]   # get the state
for _ in range(memory_size):
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    memory.buffer.append([state, action, next_state, reward, float(done)])
    memory.states.append(np.zeros([1, hidden_size])) # initial_states for rnn/mem
    state = next_state
    if done:                                       # exit loop if episode finished
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the state
        break

In [28]:
memory.states[0].shape, model.initial_state[0].shape # gru
# memory.states[0][1].shape, model.initial_state[0][1].shape #lstm

((1, 74), TensorShape([Dimension(1), Dimension(74)]))

In [None]:
saver = tf.train.Saver()
episode_rewards_list, rewards_list, loss_list = [], [], []

# TF session for training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        #state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the current state
        initial_state = sess.run(model.initial_state)

        # Training steps/batches
        while True:
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.initial_state: initial_state})
            action = np.argmax(action_logits)
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            next_state = env_info.vector_observations[0]   # get the next state
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append(initial_state)
            total_reward += reward
            state = next_state
            initial_state = final_state
            
            # Training
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            initial_states = memory.states
            next_actions_logits = sess.run(model.actions_logits, 
                                           feed_dict = {model.states: next_states,
                                                        model.initial_state: initial_states[1]})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                     model.actions: actions,
                                                                     model.targetQs: targetQs,
                                                                     model.initial_state: initial_states[0]})
            # End of training
            loss_batch.append(loss)
            if done is True:
                break
                
        # Outputing: priting out/Potting
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'loss:{:.4f}'.format(np.mean(loss_batch)))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        loss_list.append([ep, np.mean(loss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= +13:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:2.0000 R:2.0000 loss:0.0565
Episode:1 meanR:0.5000 R:-1.0000 loss:0.0721
Episode:2 meanR:0.6667 R:1.0000 loss:0.0371
Episode:3 meanR:0.5000 R:0.0000 loss:0.0599
Episode:4 meanR:0.6000 R:1.0000 loss:0.0772
Episode:5 meanR:0.6667 R:1.0000 loss:0.0497
Episode:6 meanR:0.7143 R:1.0000 loss:0.0194
Episode:7 meanR:0.6250 R:0.0000 loss:0.0324
Episode:8 meanR:0.4444 R:-1.0000 loss:0.0143
Episode:9 meanR:0.4000 R:0.0000 loss:0.0108
Episode:10 meanR:0.2727 R:-1.0000 loss:0.0164
Episode:11 meanR:0.1667 R:-1.0000 loss:0.0341
Episode:12 meanR:0.3077 R:2.0000 loss:0.0453
Episode:13 meanR:0.5714 R:4.0000 loss:0.0789
Episode:14 meanR:0.6000 R:1.0000 loss:0.0909
Episode:15 meanR:0.5625 R:0.0000 loss:0.0245
Episode:16 meanR:0.5882 R:1.0000 loss:0.0243
Episode:17 meanR:0.6667 R:2.0000 loss:0.0219
Episode:18 meanR:0.6316 R:0.0000 loss:0.0185
Episode:19 meanR:0.6500 R:1.0000 loss:0.0092
Episode:20 meanR:0.6190 R:0.0000 loss:0.0090
Episode:21 meanR:0.5909 R:0.0000 loss:0.0049
Episode:22 meanR

Episode:180 meanR:5.5600 R:8.0000 loss:0.0641
Episode:181 meanR:5.5900 R:11.0000 loss:0.0575
Episode:182 meanR:5.5200 R:2.0000 loss:0.0795
Episode:183 meanR:5.3900 R:1.0000 loss:0.0545
Episode:184 meanR:5.3900 R:9.0000 loss:0.0493
Episode:185 meanR:5.4500 R:11.0000 loss:0.0666
Episode:186 meanR:5.4800 R:8.0000 loss:0.0923
Episode:187 meanR:5.4500 R:0.0000 loss:0.0491
Episode:188 meanR:5.3400 R:2.0000 loss:0.0488
Episode:189 meanR:5.3900 R:7.0000 loss:0.0410
Episode:190 meanR:5.3700 R:8.0000 loss:0.1086
Episode:191 meanR:5.4000 R:5.0000 loss:0.0669
Episode:192 meanR:5.4700 R:8.0000 loss:0.0646
Episode:193 meanR:5.5500 R:13.0000 loss:0.0458
Episode:194 meanR:5.5200 R:6.0000 loss:0.0969
Episode:195 meanR:5.4300 R:2.0000 loss:0.0679
Episode:196 meanR:5.4700 R:5.0000 loss:0.0461
Episode:197 meanR:5.4600 R:2.0000 loss:0.0519
Episode:198 meanR:5.5800 R:12.0000 loss:0.0465
Episode:199 meanR:5.6700 R:8.0000 loss:0.0874
Episode:200 meanR:5.6600 R:4.0000 loss:0.0431
Episode:201 meanR:5.6700 R:7.0

Episode:358 meanR:4.1200 R:1.0000 loss:0.0355
Episode:359 meanR:4.2200 R:12.0000 loss:0.0519
Episode:360 meanR:4.3000 R:10.0000 loss:0.0811
Episode:361 meanR:4.2900 R:1.0000 loss:0.0537
Episode:362 meanR:4.3100 R:9.0000 loss:0.0314
Episode:363 meanR:4.3100 R:7.0000 loss:0.0496
Episode:364 meanR:4.4800 R:18.0000 loss:0.0644
Episode:365 meanR:4.5200 R:15.0000 loss:0.0710
Episode:366 meanR:4.5800 R:12.0000 loss:0.0813
Episode:367 meanR:4.5400 R:7.0000 loss:0.0671
Episode:368 meanR:4.6300 R:11.0000 loss:0.0522
Episode:369 meanR:4.7100 R:11.0000 loss:0.0348
Episode:370 meanR:4.7800 R:8.0000 loss:0.0644
Episode:371 meanR:4.8000 R:11.0000 loss:0.0812
Episode:372 meanR:4.9300 R:15.0000 loss:0.0486
Episode:373 meanR:4.8800 R:5.0000 loss:0.1062
Episode:374 meanR:4.8600 R:12.0000 loss:0.0488
Episode:375 meanR:4.8400 R:11.0000 loss:0.0657
Episode:376 meanR:4.9400 R:14.0000 loss:0.0623
Episode:377 meanR:5.0200 R:9.0000 loss:0.0506
Episode:378 meanR:5.0100 R:0.0000 loss:0.0601
Episode:379 meanR:5.09

Episode:535 meanR:6.7200 R:9.0000 loss:0.0655
Episode:536 meanR:6.6000 R:2.0000 loss:0.0475
Episode:537 meanR:6.5900 R:6.0000 loss:0.0274
Episode:538 meanR:6.6300 R:11.0000 loss:0.0501
Episode:539 meanR:6.6600 R:12.0000 loss:0.0768
Episode:540 meanR:6.7200 R:12.0000 loss:0.0675
Episode:541 meanR:6.6700 R:3.0000 loss:0.0465
Episode:542 meanR:6.6800 R:10.0000 loss:0.0649
Episode:543 meanR:6.6900 R:4.0000 loss:0.0566
Episode:544 meanR:6.6300 R:9.0000 loss:0.0528
Episode:545 meanR:6.5600 R:2.0000 loss:0.0384
Episode:546 meanR:6.5600 R:6.0000 loss:0.0297
Episode:547 meanR:6.4700 R:7.0000 loss:0.0408
Episode:548 meanR:6.5200 R:4.0000 loss:0.0386
Episode:549 meanR:6.5400 R:9.0000 loss:0.0439
Episode:550 meanR:6.5400 R:4.0000 loss:0.0318
Episode:551 meanR:6.4500 R:-3.0000 loss:0.0359
Episode:552 meanR:6.4500 R:2.0000 loss:0.0370
Episode:553 meanR:6.4400 R:6.0000 loss:0.0386
Episode:554 meanR:6.4000 R:4.0000 loss:0.0449
Episode:555 meanR:6.2800 R:3.0000 loss:0.0347
Episode:556 meanR:6.2600 R:7.

Episode:712 meanR:7.6300 R:9.0000 loss:0.0513
Episode:713 meanR:7.4700 R:3.0000 loss:0.0428
Episode:714 meanR:7.4300 R:11.0000 loss:0.0562
Episode:715 meanR:7.4100 R:4.0000 loss:0.0651
Episode:716 meanR:7.4500 R:11.0000 loss:0.0316
Episode:717 meanR:7.3800 R:8.0000 loss:0.0537
Episode:718 meanR:7.3400 R:6.0000 loss:0.0503
Episode:719 meanR:7.3600 R:4.0000 loss:0.0533
Episode:720 meanR:7.3300 R:9.0000 loss:0.0443
Episode:721 meanR:7.3200 R:9.0000 loss:0.0452
Episode:722 meanR:7.2500 R:1.0000 loss:0.0405
Episode:723 meanR:7.1500 R:0.0000 loss:0.0195
Episode:724 meanR:7.0300 R:0.0000 loss:0.0284
Episode:725 meanR:6.9600 R:1.0000 loss:0.0126
Episode:726 meanR:6.9200 R:2.0000 loss:0.0270
Episode:727 meanR:6.8500 R:4.0000 loss:0.0275
Episode:728 meanR:6.8000 R:3.0000 loss:0.0178
Episode:729 meanR:6.7300 R:2.0000 loss:0.0237
Episode:730 meanR:6.6100 R:-1.0000 loss:0.0083
Episode:731 meanR:6.5300 R:4.0000 loss:0.0305
Episode:732 meanR:6.4600 R:6.0000 loss:0.0532
Episode:733 meanR:6.4400 R:3.00

Episode:889 meanR:6.8900 R:6.0000 loss:0.0213
Episode:890 meanR:6.8700 R:8.0000 loss:0.0313
Episode:891 meanR:6.9600 R:9.0000 loss:0.0248
Episode:892 meanR:7.0700 R:12.0000 loss:0.0568
Episode:893 meanR:7.1000 R:9.0000 loss:0.0460
Episode:894 meanR:7.1000 R:10.0000 loss:0.0575
Episode:895 meanR:7.1100 R:9.0000 loss:0.0490
Episode:896 meanR:7.1800 R:12.0000 loss:0.0567
Episode:897 meanR:7.1700 R:8.0000 loss:0.0522
Episode:898 meanR:7.1900 R:6.0000 loss:0.0457
Episode:899 meanR:7.2200 R:14.0000 loss:0.0674
Episode:900 meanR:7.2500 R:16.0000 loss:0.0694
Episode:901 meanR:7.1900 R:7.0000 loss:0.0823
Episode:902 meanR:7.0700 R:2.0000 loss:0.0690
Episode:903 meanR:7.0000 R:5.0000 loss:0.0467
Episode:904 meanR:6.8900 R:0.0000 loss:0.0420
Episode:905 meanR:6.8400 R:9.0000 loss:0.0634
Episode:906 meanR:6.7800 R:4.0000 loss:0.0344
Episode:907 meanR:6.7800 R:4.0000 loss:0.0336
Episode:908 meanR:6.6900 R:1.0000 loss:0.0524
Episode:909 meanR:6.6800 R:6.0000 loss:0.0491
Episode:910 meanR:6.5300 R:1.

Episode:1065 meanR:4.5500 R:3.0000 loss:0.0493
Episode:1066 meanR:4.5800 R:5.0000 loss:0.0297
Episode:1067 meanR:4.6300 R:8.0000 loss:0.0580
Episode:1068 meanR:4.6600 R:3.0000 loss:0.0721
Episode:1069 meanR:4.6300 R:6.0000 loss:0.0435
Episode:1070 meanR:4.5900 R:1.0000 loss:0.0437
Episode:1071 meanR:4.5700 R:4.0000 loss:0.0475
Episode:1072 meanR:4.5200 R:5.0000 loss:0.0350
Episode:1073 meanR:4.5200 R:8.0000 loss:0.0243
Episode:1074 meanR:4.6000 R:9.0000 loss:0.0823
Episode:1075 meanR:4.6400 R:5.0000 loss:0.0598
Episode:1076 meanR:4.7200 R:9.0000 loss:0.0612
Episode:1077 meanR:4.7700 R:6.0000 loss:0.0653
Episode:1078 meanR:4.8700 R:7.0000 loss:0.0730
Episode:1079 meanR:4.9900 R:14.0000 loss:0.0527
Episode:1080 meanR:4.9000 R:3.0000 loss:0.0303
Episode:1081 meanR:4.8200 R:3.0000 loss:0.0373
Episode:1082 meanR:4.7700 R:0.0000 loss:0.0303
Episode:1083 meanR:4.8200 R:4.0000 loss:0.0320
Episode:1084 meanR:4.8600 R:6.0000 loss:0.0304
Episode:1085 meanR:4.9200 R:6.0000 loss:0.0376
Episode:1086

Episode:1239 meanR:4.4100 R:7.0000 loss:0.0716
Episode:1240 meanR:4.3800 R:6.0000 loss:0.0479
Episode:1241 meanR:4.3700 R:6.0000 loss:0.0448
Episode:1242 meanR:4.4900 R:14.0000 loss:0.0423
Episode:1243 meanR:4.3900 R:0.0000 loss:0.0697
Episode:1244 meanR:4.3200 R:0.0000 loss:0.0628
Episode:1245 meanR:4.1600 R:1.0000 loss:0.0345
Episode:1246 meanR:4.1700 R:1.0000 loss:0.0290
Episode:1247 meanR:4.1100 R:3.0000 loss:0.0464
Episode:1248 meanR:4.1400 R:4.0000 loss:0.0584
Episode:1249 meanR:4.1300 R:0.0000 loss:0.0494
Episode:1250 meanR:4.1300 R:2.0000 loss:0.0283
Episode:1251 meanR:4.2000 R:6.0000 loss:0.0457
Episode:1252 meanR:4.1700 R:6.0000 loss:0.0516
Episode:1253 meanR:4.1700 R:4.0000 loss:0.0488
Episode:1254 meanR:4.1700 R:3.0000 loss:0.0505
Episode:1255 meanR:4.1500 R:3.0000 loss:0.0404
Episode:1256 meanR:4.1000 R:0.0000 loss:0.0324
Episode:1257 meanR:4.0700 R:1.0000 loss:0.0406
Episode:1258 meanR:4.0400 R:0.0000 loss:0.0087
Episode:1259 meanR:3.9100 R:1.0000 loss:0.0119
Episode:1260

Episode:1413 meanR:5.7400 R:-1.0000 loss:0.0469
Episode:1414 meanR:5.6600 R:1.0000 loss:0.0480
Episode:1415 meanR:5.5600 R:5.0000 loss:0.0552
Episode:1416 meanR:5.4500 R:5.0000 loss:0.0520
Episode:1417 meanR:5.3600 R:1.0000 loss:0.0309
Episode:1418 meanR:5.3100 R:1.0000 loss:0.0389
Episode:1419 meanR:5.2100 R:-1.0000 loss:0.0299
Episode:1420 meanR:5.1500 R:3.0000 loss:0.0296
Episode:1421 meanR:5.0300 R:0.0000 loss:0.0166
Episode:1422 meanR:4.9300 R:1.0000 loss:0.0202
Episode:1423 meanR:4.8900 R:2.0000 loss:0.0156
Episode:1424 meanR:4.7700 R:2.0000 loss:0.0287
Episode:1425 meanR:4.6400 R:0.0000 loss:0.0245
Episode:1426 meanR:4.6100 R:2.0000 loss:0.0265
Episode:1427 meanR:4.6200 R:4.0000 loss:0.0281
Episode:1428 meanR:4.5700 R:-1.0000 loss:0.0269
Episode:1429 meanR:4.5700 R:1.0000 loss:0.0216
Episode:1430 meanR:4.5900 R:4.0000 loss:0.0229
Episode:1431 meanR:4.5600 R:1.0000 loss:0.0325
Episode:1432 meanR:4.5000 R:2.0000 loss:0.0278
Episode:1433 meanR:4.4300 R:0.0000 loss:0.0446
Episode:14

Episode:1587 meanR:4.9000 R:3.0000 loss:0.0430
Episode:1588 meanR:4.7900 R:0.0000 loss:0.0356
Episode:1589 meanR:4.7900 R:3.0000 loss:0.0211
Episode:1590 meanR:4.8100 R:7.0000 loss:0.0536
Episode:1591 meanR:4.8500 R:9.0000 loss:0.0486
Episode:1592 meanR:4.8300 R:5.0000 loss:0.0379
Episode:1593 meanR:4.8400 R:9.0000 loss:0.0482
Episode:1594 meanR:4.8900 R:9.0000 loss:0.0672
Episode:1595 meanR:4.7600 R:0.0000 loss:0.0629
Episode:1596 meanR:4.6800 R:-2.0000 loss:0.0235
Episode:1597 meanR:4.7000 R:7.0000 loss:0.0390
Episode:1598 meanR:4.7700 R:8.0000 loss:0.0598
Episode:1599 meanR:4.8300 R:11.0000 loss:0.0545
Episode:1600 meanR:4.8900 R:7.0000 loss:0.0711
Episode:1601 meanR:5.0000 R:11.0000 loss:0.0554
Episode:1602 meanR:5.0100 R:4.0000 loss:0.0555
Episode:1603 meanR:5.0700 R:11.0000 loss:0.0553
Episode:1604 meanR:5.0600 R:11.0000 loss:0.0660
Episode:1605 meanR:5.0000 R:6.0000 loss:0.0653
Episode:1606 meanR:5.0500 R:7.0000 loss:0.0573
Episode:1607 meanR:5.0500 R:6.0000 loss:0.0782
Episode:

Episode:1761 meanR:5.9400 R:2.0000 loss:0.0577
Episode:1762 meanR:6.0500 R:16.0000 loss:0.0734
Episode:1763 meanR:6.1700 R:15.0000 loss:0.0644
Episode:1764 meanR:6.2200 R:7.0000 loss:0.0704
Episode:1765 meanR:6.2800 R:14.0000 loss:0.0729
Episode:1766 meanR:6.2300 R:3.0000 loss:0.0732
Episode:1767 meanR:6.2900 R:12.0000 loss:0.0644
Episode:1768 meanR:6.3300 R:7.0000 loss:0.0775
Episode:1769 meanR:6.3300 R:4.0000 loss:0.0421
Episode:1770 meanR:6.3700 R:10.0000 loss:0.0701
Episode:1771 meanR:6.4700 R:13.0000 loss:0.0708
Episode:1772 meanR:6.5100 R:9.0000 loss:0.0613
Episode:1773 meanR:6.5700 R:7.0000 loss:0.0559
Episode:1774 meanR:6.6500 R:11.0000 loss:0.0697
Episode:1775 meanR:6.6200 R:2.0000 loss:0.0419
Episode:1776 meanR:6.6800 R:6.0000 loss:0.0604
Episode:1777 meanR:6.6900 R:5.0000 loss:0.0417
Episode:1778 meanR:6.7500 R:11.0000 loss:0.0426
Episode:1779 meanR:6.8100 R:12.0000 loss:0.0524
Episode:1780 meanR:6.8500 R:14.0000 loss:0.0660
Episode:1781 meanR:6.8400 R:14.0000 loss:0.0425
Ep

Episode:1935 meanR:6.7100 R:9.0000 loss:0.0648
Episode:1936 meanR:6.8100 R:10.0000 loss:0.0560
Episode:1937 meanR:6.8600 R:6.0000 loss:0.0412
Episode:1938 meanR:6.9300 R:9.0000 loss:0.0649
Episode:1939 meanR:6.9900 R:7.0000 loss:0.0819
Episode:1940 meanR:7.0800 R:10.0000 loss:0.0742
Episode:1941 meanR:7.1100 R:5.0000 loss:0.0774
Episode:1942 meanR:7.1300 R:8.0000 loss:0.0850
Episode:1943 meanR:7.1000 R:3.0000 loss:0.0732
Episode:1944 meanR:7.0900 R:3.0000 loss:0.0413
Episode:1945 meanR:7.0600 R:5.0000 loss:0.0518
Episode:1946 meanR:7.0200 R:3.0000 loss:0.0318
Episode:1947 meanR:6.9900 R:5.0000 loss:0.0414
Episode:1948 meanR:7.0200 R:8.0000 loss:0.0478
Episode:1949 meanR:7.0600 R:9.0000 loss:0.0421
Episode:1950 meanR:6.9900 R:2.0000 loss:0.0405
Episode:1951 meanR:6.9500 R:6.0000 loss:0.0429
Episode:1952 meanR:6.9200 R:8.0000 loss:0.0599
Episode:1953 meanR:6.9800 R:8.0000 loss:0.0562
Episode:1954 meanR:7.0500 R:9.0000 loss:0.0706
Episode:1955 meanR:7.1300 R:9.0000 loss:0.0613
Episode:195

Episode:2109 meanR:8.0200 R:6.0000 loss:0.0365
Episode:2110 meanR:8.0800 R:9.0000 loss:0.0470
Episode:2111 meanR:8.0300 R:6.0000 loss:0.0496
Episode:2112 meanR:7.9400 R:4.0000 loss:0.0409
Episode:2113 meanR:7.8300 R:4.0000 loss:0.0445
Episode:2114 meanR:7.8300 R:2.0000 loss:0.0391
Episode:2115 meanR:7.7600 R:1.0000 loss:0.0223
Episode:2116 meanR:7.7200 R:7.0000 loss:0.0415
Episode:2117 meanR:7.6200 R:3.0000 loss:0.0390
Episode:2118 meanR:7.5300 R:2.0000 loss:0.0566
Episode:2119 meanR:7.4400 R:3.0000 loss:0.0529
Episode:2120 meanR:7.3900 R:5.0000 loss:0.0504
Episode:2121 meanR:7.4600 R:10.0000 loss:0.0465
Episode:2122 meanR:7.5400 R:15.0000 loss:0.0722
Episode:2123 meanR:7.5700 R:8.0000 loss:0.0361
Episode:2124 meanR:7.5800 R:5.0000 loss:0.0799
Episode:2125 meanR:7.5100 R:0.0000 loss:0.0402
Episode:2126 meanR:7.5000 R:9.0000 loss:0.0498
Episode:2127 meanR:7.6100 R:15.0000 loss:0.0708
Episode:2128 meanR:7.6100 R:7.0000 loss:0.1018
Episode:2129 meanR:7.6200 R:0.0000 loss:0.0310
Episode:21

Episode:2283 meanR:5.6000 R:2.0000 loss:0.0381
Episode:2284 meanR:5.6400 R:6.0000 loss:0.0374
Episode:2285 meanR:5.6700 R:8.0000 loss:0.0361
Episode:2286 meanR:5.6700 R:9.0000 loss:0.0350
Episode:2287 meanR:5.7200 R:11.0000 loss:0.0509
Episode:2288 meanR:5.6400 R:4.0000 loss:0.0335
Episode:2289 meanR:5.6200 R:4.0000 loss:0.0453
Episode:2290 meanR:5.6200 R:7.0000 loss:0.0441
Episode:2291 meanR:5.6900 R:8.0000 loss:0.0529
Episode:2292 meanR:5.6000 R:5.0000 loss:0.0366
Episode:2293 meanR:5.6300 R:5.0000 loss:0.0399
Episode:2294 meanR:5.7100 R:8.0000 loss:0.0413
Episode:2295 meanR:5.7900 R:7.0000 loss:0.0605
Episode:2296 meanR:5.7400 R:2.0000 loss:0.0562
Episode:2297 meanR:5.7600 R:5.0000 loss:0.0301
Episode:2298 meanR:5.8000 R:5.0000 loss:0.0428
Episode:2299 meanR:5.7800 R:7.0000 loss:0.0632
Episode:2300 meanR:5.8100 R:8.0000 loss:0.0458
Episode:2301 meanR:5.8600 R:7.0000 loss:0.0720
Episode:2302 meanR:5.9100 R:5.0000 loss:0.0610
Episode:2303 meanR:5.9300 R:6.0000 loss:0.0391
Episode:2304

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Episode rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

In [30]:
# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Testing episodes/epochs
    for _ in range(1):
        total_reward = 0
        #state = env.reset()
        env_info = env.reset(train_mode=False)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the current state

        # Testing steps/batches
        while True:
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
            action = np.argmax(action_logits)
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            state = env_info.vector_observations[0]   # get the next state
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            total_reward += reward
            if done:
                break
                
        print('total_reward: {:.2f}'.format(total_reward))

INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt


total_reward: 2.00


In [None]:
# Be careful!!!!!!!!!!!!!!!!
# Closing the env
env.close()